quick and dirty

Posted on March 1, 2007
Filed Under Thursday Thaumaturgy

For a current project, I must convert a number of PDF documents to plain text. I could do this manually using a free conversion program, but that would not be nearly adventurous enough.

You might recall that I mentioned xpdf in a previous post. It is a good tool, but unfortunately its pdftotext utility converts only one file per invocation, and it has to be run from the command line. Here is my solution in 6 lines of Python:

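import os, glob
from subprocess import call

# raw string so the backslashes in the Windows path stay literal
batch = glob.glob(r'c:\convert\*.pdf')

for item in batch:
    textName = os.path.splitext(item)[0] + ".txt" #name the output after the PDF, e.g. report.pdf -> report.txt
    call([r'c:\xpdf\pdftotext.exe', '-layout', item, textName]) #argument list, so no shell is needed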

Mmmm, quick and dirty, but who cares? Now I can collect all of my PDF files in one directory (c:\convert), execute my little script, and they're all nicely converted to text! That is much better than typing the alternative:

C:\>c:\xpdf\pdftotext.exe -layout pdffilename1.pdf newtxtfilename1.txt

for every document I want to convert. Unacceptable!
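
One thing the quick and dirty loop does not do is say which files failed to convert. Here is a minimal sketch of one way to collect them, assuming that pdftotext exits with a nonzero status when a conversion fails. The names `convert_all`, `exe`, and `run` are mine, made up for illustration; `run` defaults to `subprocess.call` and can be swapped for a stand-in when pdftotext is not installed.

```python
import os
from subprocess import call


def convert_all(pdf_paths, exe=r'c:\xpdf\pdftotext.exe', run=call):
    """Convert each PDF to text and return the paths that failed.

    exe and run are illustrative parameters: exe is the converter path
    used in the post, and run is any callable that takes an argument
    list and returns an exit status (subprocess.call by default).
    """
    failed = []
    for pdf in pdf_paths:
        # Name the output after the PDF: report.pdf -> report.txt
        text_name = os.path.splitext(pdf)[0] + ".txt"
        status = run([exe, '-layout', pdf, text_name])
        if status != 0:  # assumption: nonzero exit status means failure
            failed.append(pdf)
    return failed
```

Calling `convert_all(glob.glob(r'c:\convert\*.pdf'))` after `import glob` would then return a list of the PDFs worth a second look.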

Two micro-projects that I plan to spin off from this one: (1) Add a script to combine text documents, for those annoying situations where a manual is broken into 100 sectional PDF files. This will give me a single plain-text manual when I don't care about fonts and headers but want to preserve layout. (2) Write a script to compare similar documents after conversion to text. This will come in handy when those manuals change over time! I might instead choose to convert to XML or some other format so that I can do more with the content in Python.
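
Neither micro-project exists yet, but both could start out very small. The following is only a rough sketch, assuming Python 3 and text files that decode with the given encoding (what pdftotext actually writes depends on how it is configured, so adjust as needed). The function names `combine_text_files` and `diff_text_files` are hypothetical; the comparison leans on `difflib.unified_diff` from the standard library.

```python
import difflib


def combine_text_files(paths, out_path, encoding='utf-8'):
    """Concatenate the text files in paths, in order, into out_path."""
    with open(out_path, 'w', encoding=encoding) as out:
        for path in paths:
            with open(path, encoding=encoding) as part:
                text = part.read()
            out.write(text)
            if text and not text.endswith('\n'):
                out.write('\n')  # keep sections from running together


def diff_text_files(old_path, new_path, encoding='utf-8'):
    """Return a unified diff of two text files as a list of lines."""
    with open(old_path, encoding=encoding) as f:
        old_lines = f.readlines()
    with open(new_path, encoding=encoding) as f:
        new_lines = f.readlines()
    return list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile=old_path, tofile=new_path))
```

The order of `paths` is the order of the combined manual, so a plain alphabetical sort will put a file named section10 ahead of section2 unless the names are zero-padded. An empty list from `diff_text_files` means the two versions are identical.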

Comments

6 Responses to “quick and dirty”

  1. derrick on March 2nd, 2007 10:35 pm

    thats pretty cool im looking for a network admin job if anybodys hiring i work in PA right now thanks

  2. Janissary on March 2nd, 2007 10:37 pm

    CAN I SUGGEST WE KILL DERRICK? I WATCH THIS BLOG AND EVERY TIME HE POSTS HE’S BEGGING FOR A JOB. WOULD YOU WANT TO HIRE A MORON?

  3. Rab on March 2nd, 2007 10:40 pm

    Thank you for the suggestion, Jan. I had noticed that he manages to append that particular bit of data to the end of every comment. Shouldn’t you have found something by now, derrick? Seriously, please do not use my site as a job board. There are way too many readers who don’t care whether derrick is employed or not.

  4. PiPpy on March 2nd, 2007 10:44 pm

    lol..

  5. musicfeind on March 2nd, 2007 10:46 pm

    Whoa everyone’s active… must be Friday night. Rab, I’m still waiting for that Baroque piece on the ukulele. Have you bought that Mac yet?

  6. Rab on March 2nd, 2007 10:47 pm

    Ha ha.. not yet. But I am taking donations ;)
