Linux: Getting Text out of Other File Formats
Written by Phil Harrison   
Tuesday, 12 December 2006


A common problem is that you receive a file in a format that you cannot easily read because you don’t have an appropriate application. This is particularly irritating in the case of binary files that are intended to be read only by a particular application but that you know actually contain text and formatting instructions. The most common case of this problem is that you want to retrieve the text from a Microsoft Word file.


 
But equally, you may want to extract the text from a file that has been sent to you in PostScript or PDF format; you can display the file beautifully on the screen, but it’s not always obvious how to retrieve the text.

antiword

The typical Windows user has no idea what a Microsoft Word file contains. It is a binary file in which fragments of text are mixed with formatting data and other non-text content; try viewing a .doc file with something like emacs or (better) a hex editor such as ghex2. It often also contains material the author does not suspect is there, such as text she thought she had deleted.

Quite a few people have been surprised by this feature, having unsuspectingly distributed .doc files, and then been confronted with contents that they didn’t know were there.

From the point of view of Linux users, what is more important is that when people send you .doc files, you don’t necessarily want to go through opening them with OpenOffice.org or a similar program. You may just want to extract the text.

Fortunately antiword does this very well. All you need to do is type:

            antiword filename.doc

You will see the file in text format.
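
When several Word files arrive at once, the same command can be run in a loop and its output redirected to files. The following is only a minimal sketch, assuming antiword is on your PATH; doc_to_txt is a hypothetical helper name, not something antiword provides.

```shell
# doc_to_txt is a hypothetical helper, not part of antiword.
# It writes the text of each .doc file to a .txt file with the same base name.
doc_to_txt() {
    for doc in "$@"; do
        txt="${doc%.doc}.txt"
        if antiword "$doc" > "$txt"; then
            echo "wrote $txt"
        else
            # Remove the empty or partial output and report the failure.
            rm -f "$txt"
            echo "could not convert $doc" >&2
        fi
    done
}

# Example call, assuming the current directory holds some .doc files:
#   doc_to_txt *.doc
```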

ps2ascii

The ps2ascii command tries to extract the full text from a PostScript (or PDF) file.

In general this works quite well, but there may be problems in the output with missing spaces where newlines were, and (depending on how the PostScript file was created) there may be some unrecognized characters.

For example:

            user@bible:~ > ps2ascii filename.ps

writes to standard output, while

            user@bible:~ > ps2ascii filename.ps outfile.txt

writes the output to a file.
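
Because ps2ascii writes to standard output, it can feed other text tools. As a rough sketch, the function below lists which of several PostScript or PDF files mention a given word; ps_grep is a hypothetical name chosen for this example. Given the caveat about missing spaces, searching for a single word is likely to be more reliable than searching for a phrase.

```shell
# ps_grep is a hypothetical helper, not part of Ghostscript.
# Usage: ps_grep PATTERN FILE...
# Prints the name of each file whose extracted text matches PATTERN.
ps_grep() {
    pattern=$1
    shift
    for f in "$@"; do
        # -q: report only success or failure; -i: ignore case.
        if ps2ascii "$f" 2>/dev/null | grep -qi -- "$pattern"; then
            echo "$f"
        fi
    done
}

# Example call:
#   ps_grep linux *.ps *.pdf
```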


ps2pdf

If you want to convert PostScript files to the PDF format so that people who use Windows can easily view them, then ps2pdf file.ps is all you need.

This command creates the PDF version with the name file.pdf.
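
ps2pdf also accepts an explicit output name as a second argument, which is handy if the PDF files should end up in a separate directory. The sketch below assumes that form of the command; ps_to_pdf_dir and the directory name in the example call are illustrative only.

```shell
# ps_to_pdf_dir is a hypothetical helper, not part of Ghostscript.
# Usage: ps_to_pdf_dir OUTDIR FILE.ps...
ps_to_pdf_dir() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    for ps in "$@"; do
        # Strip any leading directory and the .ps suffix to build the PDF name.
        pdf="$outdir/$(basename "$ps" .ps).pdf"
        ps2pdf "$ps" "$pdf" && echo "created $pdf"
    done
}

# Example call:
#   ps_to_pdf_dir pdfs *.ps
```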


dvi2tty

DVI (device independent) files are produced by the TeX and LaTeX typesetting systems (explained in the next section) and can then be printed on an output device using a suitable driver. On Linux they are most typically converted to PostScript using the command dvips and then printed directly. DVI files can also be viewed directly using a program such as kdvi.

You can extract the text from a DVI file with the command dvi2tty. Similar caveats to those mentioned for ps2ascii apply: The text you get out might not be exactly the text that was put in.

A command such as

            user@bible:~ > dvi2tty filename.dvi

extracts the text to standard output. You can, of course, redirect it to a file.
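
As an illustration of the redirection, the small function below saves the text next to the DVI file. It is a sketch only; dvi_to_txt is a hypothetical name.

```shell
# dvi_to_txt is a hypothetical helper, not part of dvi2tty.
# It saves the text of one DVI file next to it, e.g. paper.dvi -> paper.txt.
dvi_to_txt() {
    dvi=$1
    txt="${dvi%.dvi}.txt"
    # Redirect standard output to the text file and print its name on success.
    dvi2tty "$dvi" > "$txt" && echo "$txt"
}

# Example call:
#   dvi_to_txt filename.dvi
```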

 

detex

TeX is a text formatting system developed by Donald Knuth. LaTeX is an extension of TeX. These systems are widely used for typesetting mathematical and scientific books and also in creating printable versions of open source documentation.

A TeX or LaTeX source file is a plain text file with added markup.

The detex command tries to remove all markup from a TeX or LaTeX source file. It can also be called as delatex.

For example:

            user@bible:~ > detex filename.tex

outputs the stripped text to standard output.
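
One use for the stripped text is a word count that ignores the markup. The sketch below pipes detex into wc; tex_word_count is a hypothetical name, and the result is only approximate, since detex tries to remove markup and may not catch all of it.

```shell
# tex_word_count is a hypothetical helper, not part of detex.
# It prints an approximate word count for each TeX or LaTeX source file.
tex_word_count() {
    for tex in "$@"; do
        # wc -w counts whitespace-separated words; tr removes the padding
        # that some wc implementations put around the number.
        count=$(detex "$tex" | wc -w | tr -d ' ')
        echo "$tex: $count"
    done
}

# Example call:
#   tex_word_count filename.tex
```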

 

acroread and xpdf

acroread and xpdf are PDF viewers:

acroread: Has a text selection tool on its toolbar that enables you to select text with the cursor, copy it, and paste it into another application.

xpdf: Has similar functionality; you can select rectangles of text with the mouse cursor and paste them elsewhere. This can be a very convenient way of getting text out of a PDF file, particularly if it is a complex one with a number of columns or separate boxes of text.


html2text

If you have an HTML file and you just want the text without markup, you can of course display the file in Konqueror and copy the text and paste it into a new file.

However, if you want to do a similar thing for a large number of files, a command-line tool is more useful.

The html2text command reads an HTML file and outputs plain text, having stripped out the HTML tags.

You can even run it against a URL:

            user@bible:~ > html2text http://webdotdev.com
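
The command-line tools described in this article can be combined in one small wrapper that picks an extractor from the file extension. This is a sketch under the assumption that all five tools are installed; to_text is a hypothetical name, and the mapping from extensions to tools is simply the one used in the sections above.

```shell
# to_text is a hypothetical helper that picks an extractor by file extension.
# The extracted text goes to standard output.
to_text() {
    case "$1" in
        *.doc)         antiword "$1" ;;
        *.ps|*.pdf)    ps2ascii "$1" ;;
        *.dvi)         dvi2tty "$1" ;;
        *.tex)         detex "$1" ;;
        *.html|*.htm)  html2text "$1" ;;
        *)
            echo "to_text: no extractor known for $1" >&2
            return 1
            ;;
    esac
}

# Example: convert every HTML file in the current directory.
#   for f in *.html; do to_text "$f" > "${f%.html}.txt"; done
```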



Last Updated ( Wednesday, 20 June 2007 )