I ran into a curious problem on a side project of mine: I had some information in PDF files, both text and images, and I wanted to display it on a mobile (Android) device. PDF isn't exactly a mobile-friendly format, so I got the idea to use HTML instead. The next trick then becomes how to get the content I want out of the PDFs and into HTML. Tux to the rescue!
As luck would have it, the utilities pdftotext and pdfimages will extract the text and the images from a PDF (respectively). With the -htmlmeta option, pdftotext was even nice enough to put the extracted text into an HTML document for me. (To get these on your Ubuntu box: sudo apt-get install xpdf-utils; on more recent releases the same tools ship in the poppler-utils package.) Once I had these apps installed, I used a bit of Ruby to automate the process. I had 65 PDFs to convert and wasn't crazy about the keystrokes involved for all 65 files. Net time to do all this was a comfortable couple of hours in front of my TV catching up on the backlog of shows on the PVR. How is that for multi-tasking?
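To make the two invocations concrete, here is a minimal sketch of the command lines for a single file. The helper name `extraction_commands` and the file name `manual.pdf` are made up for illustration, and nothing is actually executed; the flags are the same `-htmlmeta` and `-j` mentioned in this post. One assumption that differs from my script: the sketch passes `manual/manual` as the image root, so pdfimages should name its output along the lines of `manual/manual-000.jpg`.

```ruby
# Hypothetical helper: builds the two command lines for one PDF.
# Nothing is run here; pass each array to system(*cmd) to execute it.
def extraction_commands(pdf, out_dir)
  base = File.basename(pdf, ".*")
  [
    # -htmlmeta wraps the extracted text in a simple HTML page
    ["pdftotext", "-htmlmeta", pdf, File.join(out_dir, "#{base}.html")],
    # -j saves JPEG (DCT) images as .jpg; the last argument is the image root
    ["pdfimages", "-j", pdf, File.join(out_dir, base)]
  ]
end

# Print the commands instead of running them
extraction_commands("manual.pdf", "manual").each do |cmd|
  puts cmd.join(" ")
end
```

Printing the commands first is a cheap way to sanity-check paths before letting a loop loose on 65 files.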
Here is the Ruby script I wrote. I welcome suggestions / improvements / enhancements / comments / cash donations / bottles of scotch:
```ruby
#!/usr/bin/ruby

# Remove any .txt and .jpg files in the current directory
Dir.glob("*.txt") { |file| File.delete(file) }
Dir.glob("*.jpg") { |file| File.delete(file) }

Dir.glob("*.pdf") do |file|
  basename = File.basename(file, ".*")

  if File.directory?(basename)
    # Output directory already exists: empty it before converting again
    Dir.glob("./#{basename}/*.*") { |old| File.delete(old) }
  else
    Dir.mkdir(basename)
  end

  # Extract the text into a simple HTML page
  system("pdftotext", "-htmlmeta", file, "./#{basename}/#{basename}.html")
  # Extract the images, saving JPEG data as .jpg files
  system("pdfimages", "-j", file, "./#{basename}/")

  puts "Converted #{file} to text and extracted images to #{basename}."
end
```
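One possible improvement: the script above prints "Converted" even if one of the tools fails, because the return value of `system` is ignored. The sketch below is one way to check it, not the script I actually ran. The names `convert_pdf`, `out_root` and `runner` are hypothetical; `runner` only exists so the logic can be tried out without pdftotext or pdfimages installed.

```ruby
require "fileutils"

# Hypothetical helper: converts one PDF into <out_root>/<basename>/ and
# returns true only if both tools report success.
# `runner` defaults to Kernel#system and can be swapped out for a dry run.
def convert_pdf(pdf, out_root: ".", runner: ->(*cmd) { system(*cmd) })
  base = File.basename(pdf, ".*")
  out_dir = File.join(out_root, base)
  FileUtils.mkdir_p(out_dir) # no error if the directory already exists

  text_ok   = runner.call("pdftotext", "-htmlmeta", pdf, File.join(out_dir, "#{base}.html"))
  images_ok = runner.call("pdfimages", "-j", pdf, "#{out_dir}/")

  # system returns true for exit status 0, false for a non-zero status,
  # and nil when the command could not be run at all
  text_ok == true && images_ok == true
end
```

With that in place, something like `failed = Dir.glob("*.pdf").reject { |pdf| convert_pdf(pdf) }` would collect the files that need a second look.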
Mike says:
Thanks for this. I am about to embark on a similar requirement for some old files while still learning Linux and Ruby. This will help lots.
March 19, 2011, 00:00