# Picking Apart PDF with Ruby and Linux

I ran into a curious problem on a side project of mine: I had some information in PDF files, both text and images, and I wanted to display it on a mobile (Android) device. PDF isn't exactly a mobile-friendly format, so I got the idea to use HTML instead. The next trick then becomes how to get the content I want out of the PDFs and into HTML. Tux to the rescue!

As luck would have it, the utilities pdftotext and pdfimages will extract the text and images from a PDF (respectively). pdftotext was even nice enough to put the extracted text into an HTML document for me. (To get these on your Ubuntu box: sudo apt-get install xpdf-utils; on newer releases the same tools come from the poppler-utils package.) Once I had these apps installed, I used a bit of Ruby to automate the process - I had 65 PDFs to convert and wasn't crazy about the keystrokes involved for all 65 files. Net time to do all this was a comfortable couple of hours in front of my TV, catching up on the backlog of shows on the PVR. How is that for multi-tasking?
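
Before kicking off a batch, it can help to confirm that both tools are actually installed. The following is only a minimal sketch: the helper name `tool_available?` is my own invention, and it simply looks for an executable file with the given name in each directory on the PATH.

```ruby
# Hypothetical helper (not part of the original script): returns true if an
# executable file named `cmd` exists in one of the PATH directories.
def tool_available?(cmd)
  ENV.fetch("PATH", "").split(File::PATH_SEPARATOR).any? do |dir|
    path = File.join(dir, cmd)
    File.file?(path) && File.executable?(path)
  end
end

# Report any of the two utilities that cannot be found.
missing = %w[pdftotext pdfimages].reject { |tool| tool_available?(tool) }
if missing.empty?
  puts "Both tools found."
else
  puts "Missing: #{missing.join(', ')}"
end
```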

Here is the Ruby script I wrote.  I welcome suggestions / improvements / enhancements / comments / cash donations / bottles of scotch:

    #!/usr/bin/ruby
    # Clear out stray text and image files left in the working directory.
    Dir.glob("*.txt") { |file| File.delete(file) }
    Dir.glob("*.jpg") { |file| File.delete(file) }

    Dir.glob("*.pdf") do |file|
      # Each PDF gets its own output directory, named after the file.
      basename = File.basename(file, '.*')
      if File.directory?(basename)
        # Empty the output directory left by a previous run.
        Dir.glob("./#{basename}/*.*") { |old| File.delete(old) }
      else
        Dir.mkdir(basename)
      end
      # -htmlmeta wraps the extracted text in a simple HTML document.
      system("pdftotext", "-htmlmeta", file, "./#{basename}/#{basename}.html")
      # -j saves JPEG-encoded images as .jpg files.
      system("pdfimages", "-j", file, "./#{basename}/")
      puts "Converted #{file} to text and extracted images to #{basename}."
    end
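
One possible improvement: the script above reports success even if one of the tools fails. Here is a rough sketch of a variant that checks the result of each `system` call, which returns true only when the command ran and exited cleanly. The helper names `output_paths` and `convert`, the `runner` parameter, and the `img` image prefix are all my own assumptions rather than part of the original script; `runner` is there only so the logic can be tried out without the tools installed.

```ruby
require "fileutils"

# Hypothetical helper: work out where the output for one PDF should go.
def output_paths(pdf)
  basename = File.basename(pdf, ".*")
  {
    dir: basename,
    html: File.join(basename, "#{basename}.html"),
    image_root: File.join(basename, "img") # pdfimages appends -NNN.ext to this
  }
end

# Hypothetical helper: run both tools for one PDF and return true only if
# both succeeded. `runner` defaults to Kernel#system.
def convert(pdf, runner: ->(*args) { system(*args) })
  paths = output_paths(pdf)
  FileUtils.mkdir_p(paths[:dir])
  text_ok  = runner.call("pdftotext", "-htmlmeta", pdf, paths[:html])
  image_ok = runner.call("pdfimages", "-j", pdf, paths[:image_root])
  text_ok == true && image_ok == true
end

# Convert every PDF in the current directory and list the ones that failed.
failed = Dir.glob("*.pdf").reject { |pdf| convert(pdf) }
puts "Failed: #{failed.join(', ')}" unless failed.empty?
```

Passing the arguments to `system` separately, as the original script does, avoids shell quoting problems with file names that contain spaces.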