A common problem is that you receive a file in a format that you
cannot easily read because you don’t have an appropriate application.
This is particularly irritating in the case of binary files that are
intended to be read only by a particular application but that you know
actually contain text and formatting instructions. The most common case
of this problem is that you want to retrieve the text from a Microsoft
Word file.
But equally, you may want to extract the text from a file that has
been sent to you in PostScript or PDF format; you can display the file
beautifully on the screen, but it’s not always obvious how to retrieve
the text.
antiword
The typical Windows user has no idea what a Microsoft Word
file contains. It is a binary file in which fragments of text are mixed
with formatting data and other binary content; try viewing a .doc file
with something like emacs or (better) a hex editor such as ghex2. Among
other things, it often contains material the author does not suspect is
there, such as text she thought she had deleted.
Quite a few people have been surprised by this feature, having
unsuspectingly distributed .doc files, and then been confronted with
contents that they didn’t know were there.
From the point of view of Linux users, what is more important is that
when people send you .doc files, you don’t necessarily want to go
through opening them with OpenOffice.org or a similar program. You may
just want to extract the text.
Fortunately, antiword does this very well. All you need to do is type:
user@bible:~ > antiword filename.doc
The text of the file is written to standard output.
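Because antiword produces plain text, it combines well with other command-line tools. The following is a minimal sketch, not part of antiword itself, that lists which Word files mention a given phrase. The function name docs_matching and the DOC2TXT variable are assumptions made for this illustration; DOC2TXT exists only so that a different converter can be substituted.

```shell
# docs_matching PATTERN FILE...
# Print the name of each Word file whose extracted text contains PATTERN.
# DOC2TXT is a hypothetical override for the converter; it defaults to antiword.
docs_matching() {
    pattern=$1
    shift
    for doc in "$@"; do
        # Extract the text and search it case-insensitively; -q suppresses
        # grep's own output so that only the file name is printed.
        if ${DOC2TXT:-antiword} "$doc" 2>/dev/null | grep -qi -e "$pattern"; then
            printf '%s\n' "$doc"
        fi
    done
}
```

For example, docs_matching budget *.doc would print the names of the files in the current directory whose text mentions the word budget.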
ps2ascii
The ps2ascii command tries to extract the full text from a PostScript (or PDF) file.
In general this works quite well, but there may be problems in the
output with missing spaces where newlines were, and (depending on how
the PostScript file was created) there may be some unrecognized
characters.
For example:
user@bible:~ > ps2ascii filename.ps
will write to standard output, while
user@bible:~ > ps2ascii filename.ps outfile.txt
will write the output to a file.
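If the unrecognized characters mentioned above get in the way, one option is to filter them out after extraction. This is only a sketch, under the assumption that the text of interest is plain ASCII; the function name clean_text is made up for this example, and the filter also discards legitimate accented or other non-ASCII characters.

```shell
# clean_text: read text on standard input and write it to standard output,
# keeping only tabs, newlines, and printable ASCII characters.
# Assumption: the document is plain English text, so dropping every other
# byte (including accented letters) is acceptable.
clean_text() {
    LC_ALL=C tr -cd '\11\12\40-\176'
}
```

It could be used in a pipeline such as ps2ascii filename.ps | clean_text > outfile.txt.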
ps2pdf
If you want to convert PostScript files to the PDF format so that
people who use Windows can easily view them, then ps2pdf file.ps is all
you need.
This command creates the PDF version with the name file.pdf.
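When several PostScript files need converting, the command can be wrapped in a loop. The sketch below relies on ps2pdf accepting an explicit output name as its second argument; the function name ps_to_pdf_dir and the PS2PDF override variable are assumptions introduced for this example.

```shell
# ps_to_pdf_dir SRC_DIR DEST_DIR
# Convert every .ps file in SRC_DIR to a PDF of the same base name in DEST_DIR.
# PS2PDF is a hypothetical override for the converter; it defaults to ps2pdf.
ps_to_pdf_dir() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    for ps in "$src"/*.ps; do
        # If there are no .ps files the pattern stays unexpanded; skip it.
        [ -e "$ps" ] || continue
        name=$(basename "$ps" .ps)
        ${PS2PDF:-ps2pdf} "$ps" "$dest/$name.pdf"
    done
}
```

Calling ps_to_pdf_dir . pdf would leave the PostScript files in the current directory untouched and place the converted copies in a pdf subdirectory.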
dvi2tty
DVI (device independent) files are produced by the TeX and LaTeX
typesetting systems (explained in the next section) and can then be
sent to an output device by a suitable driver. Most typically on
Linux, they are converted to PostScript with the dvips command and then
printed. DVI files can also be viewed directly using a program such
as kdvi.
You can extract the text from a DVI file with the command dvi2tty.
Similar caveats to those mentioned for ps2ascii apply: The text you get
out might not be exactly the text that was put in.
A command such as
user@bible:~ > dvi2tty filename.dvi
extracts the text to standard output. You can, of course, redirect it to a file.
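The redirection to a file might be wrapped up as in the following sketch, which names the output after the input unless told otherwise. The function name dvi_to_text and the DVI2TTY override variable are assumptions for illustration only.

```shell
# dvi_to_text FILE.dvi [OUTFILE]
# Write the extracted text to OUTFILE, or to FILE.txt if OUTFILE is omitted.
# DVI2TTY is a hypothetical override for the extractor; it defaults to dvi2tty.
dvi_to_text() {
    dvi=$1
    # Default output name: replace the .dvi suffix with .txt.
    out=${2:-${dvi%.dvi}.txt}
    ${DVI2TTY:-dvi2tty} "$dvi" > "$out"
}
```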
detex
TeX is a text formatting system developed by Donald Knuth. LaTeX is an
extension of TeX. These systems are widely used for typesetting
mathematical and scientific books and also in creating printable
versions of open source documentation.
A TeX or LaTeX source file is a plain text file with added markup.
The detex command tries to remove all markup from a TeX or LaTeX source file. It can also be called as delatex.
For example:
user@bible:~ > detex filename.tex
outputs the stripped text to standard output.
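One common use of the stripped text is to get a rough word count for a document, since counting words in the raw source would include the markup. The sketch below assumes that an approximate figure is good enough; the function name tex_word_count and the DETEX override variable are made up for this example.

```shell
# tex_word_count FILE.tex
# Print an approximate word count of the document text, ignoring markup.
# DETEX is a hypothetical override for the stripper; it defaults to detex.
tex_word_count() {
    # wc -w counts whitespace-separated words; tr removes any padding
    # spaces that some versions of wc put around the number.
    ${DETEX:-detex} "$1" | wc -w | tr -d ' '
}
```

The result is only as accurate as the markup removal, so treat it as an estimate.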
acroread and xpdf
acroread and xpdf are PDF viewers:
✦ acroread: Has a text selection tool on its toolbar that
enables you to select text with the cursor, copy it, and paste it
into another application.
✦ xpdf: Has similar functionality; you can select
rectangles of text with the mouse cursor and paste them elsewhere. This
can be a very convenient way of getting text out of a PDF file,
particularly if it is a complex one with a number of columns or
separate boxes of text.
html2text
If you have an HTML file and you just want the text without markup,
you can of course display the file in Konqueror and copy the text and
paste it into a new file.
However, if you want to do a similar thing for a large number of files, a command-line tool is more useful.
The html2text command reads an HTML file and outputs plain text, having stripped out the HTML tags.
You can even run it against a URL:
user@bible:~ > html2text http://webdotdev.com
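For the large-number-of-files case, a loop over the results of find is one way to proceed. The following is a sketch only: the function name html_tree_to_text and the HTML2TEXT override variable are assumptions for this example, and it assumes that file names contain no newline characters.

```shell
# html_tree_to_text [DIR]
# For every .html file under DIR (default: the current directory), write a
# plain-text version next to it with a .txt suffix.
# HTML2TEXT is a hypothetical override for the converter; it defaults to html2text.
html_tree_to_text() {
    root=${1:-.}
    find "$root" -type f -name '*.html' | while IFS= read -r page; do
        ${HTML2TEXT:-html2text} "$page" > "${page%.html}.txt"
    done
}
```

Running html_tree_to_text ~/public_html would then leave an index.txt beside each index.html, and so on for the other pages.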