quick and dirty
Posted on March 1, 2007
Filed Under Thursday Thaumaturgy
For a current project, I must convert a number of PDF documents to plain text. I could do this manually using a free conversion program, but that would not be nearly adventurous enough.
You might recall that I mentioned xpdf in a previous post. It is a good tool, but unfortunately it converts only one file per invocation, and it has to be run from the command line. Here is my solution in 6 lines of Python:
import glob
from subprocess import call
batch = glob.glob(r'c:\convert\*.pdf')  # raw string keeps the backslashes literal
for item in batch:
    textName = item + ".txt"  # make a name for the txt file output
    call([r'c:\xpdf\pdftotext.exe', '-layout', item, textName])
Mmmm, quick and dirty, but who cares? Now I can collect all of my PDF files in one directory (c:\convert), execute my little script, and they're all nicely converted to text! That is much better than typing the alternative:
C:\>c:\xpdf\pdftotext.exe -layout pdffilename1.pdf newtxtfilename1.txt
for every document I want to convert. Unacceptable!
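For anyone who wants something a little less dirty, here is a minimal sketch of the same loop wrapped in functions. The names build_command and convert_folder are hypothetical, and the pdftotext location is simply the one assumed in the script above. It names the output report.txt instead of report.pdf.txt and collects the files that pdftotext reported as failed, relying on the tool returning a non-zero exit status on error.

```python
import glob
import os
import subprocess

# Assumed install location, taken from the script above; adjust as needed.
PDFTOTEXT = r'c:\xpdf\pdftotext.exe'


def build_command(pdf_path, exe=PDFTOTEXT):
    """Return the pdftotext argument list for one PDF (hypothetical helper)."""
    # report.pdf becomes report.txt rather than report.pdf.txt
    text_path = os.path.splitext(pdf_path)[0] + '.txt'
    return [exe, '-layout', pdf_path, text_path]


def convert_folder(folder, exe=PDFTOTEXT):
    """Convert every PDF in folder and return the paths that failed."""
    failed = []
    for pdf_path in sorted(glob.glob(os.path.join(folder, '*.pdf'))):
        # A non-zero exit status is treated as a failed conversion.
        if subprocess.call(build_command(pdf_path, exe)) != 0:
            failed.append(pdf_path)
    return failed
```

Calling convert_folder(r'c:\convert') would then convert the whole directory and hand back a list of any PDFs that need a second look.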
Two micro-projects that I plan to launch off of this one are: (1) Add a script to combine text documents, for those annoying situations where manuals are broken into 100 sectional PDF files. This will give me a nice plain-text manual when I don't care about fonts and headers but want to preserve layout. (2) Make a script to compare similar documents converted to text. This will come in handy when those manuals change over time! But I might instead choose to convert to XML or some other format so that I can do more with it in Python.
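Neither micro-project exists yet, so the following is only a rough sketch of how the two might look using the standard library. The function names combine_texts and compare_texts are made up for illustration, and the combining step assumes the section files sort into the right order by name (for example part01.txt, part02.txt).

```python
import difflib
import glob
import os


def combine_texts(folder, combined_path):
    """Concatenate every .txt file in folder, in sorted name order."""
    combined_abs = os.path.abspath(combined_path)
    with open(combined_path, 'w') as out:
        for text_path in sorted(glob.glob(os.path.join(folder, '*.txt'))):
            if os.path.abspath(text_path) == combined_abs:
                continue  # do not append the output file to itself
            with open(text_path) as part:
                out.write(part.read())
            out.write('\n')  # blank separator between sections


def compare_texts(old_path, new_path):
    """Return a unified diff of two text files as a list of lines."""
    with open(old_path) as old, open(new_path) as new:
        return list(difflib.unified_diff(
            old.readlines(), new.readlines(),
            fromfile=old_path, tofile=new_path))
```

An empty list from compare_texts means the two versions are identical. Anything else can be printed line by line to see what changed between revisions of a manual.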
That's pretty cool. I'm looking for a network admin job if anybody's hiring. I work in PA right now. Thanks.
CAN I SUGGEST WE KILL DERRICK? I WATCH THIS BLOG AND EVERY TIME HE POSTS HE’S BEGGING FOR A JOB. WOULD YOU WANT TO HIRE A MORON?
Thank you for the suggestion, Jan. I had noticed that he manages to append that particular bit of data to the end of every comment. Shouldn’t you have found something by now, derrick? Seriously, please do not use my site as a job board. There are way too many readers who don’t care whether derrick is employed or not.
lol..
Whoa everyone’s active… must be Friday night. Rab, I’m still waiting for that Baroque piece on the ukulele. Have you bought that Mac yet?
Ha ha.. not yet. But I am taking donations