PDF manipulation

From: Yong Tang <yongtang10_at_nyob> Date: Mon, 6 Aug 2012 22:23:53 -0700 To: CODE4LIB_at_LISTSERV.ND.EDU

Hi,

I am a full-time information science student and a part-time LAMP server
administrator. I was recently thrown into a file dumpster containing
hundreds of old PDF files. My job is to clean the dumpster up by
putting the right files into the right folders. I am running into some
difficulties writing a Perl script to get the job done. I would
appreciate it if you could help.

First of all, what tool or tools do you use to manipulate PDF files
directly in a script? I tried some Perl modules such as CAM::PDF and
PDF::API2. The results were not pretty: the original text formatting
was lost.

I regret that I did not take an XML class last semester, because my
intuition is that the best way to do this job is to convert the PDFs
to XML and then work on the XML with a script. Instead, I have to
save the PDFs as plain text. I found that PDFedit and Adobe Acrobat X
Pro work well, because both of them keep the original text formatting
after the conversion. However, I have no idea how to use them to save
multiple PDFs as plain text at once. I googled for answers but had no
luck. Does anybody know how to do it?
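
For illustration, the kind of batch conversion being asked about could be
sketched roughly as below. This is only a sketch under assumptions: it
relies on the command-line tool pdftotext (from Xpdf/Poppler) being
installed, which is not one of the tools tried above, and its -layout
option, which asks it to keep the physical layout of the text. The
function names and the folder names are hypothetical, and Python is used
here instead of Perl only to keep the example short.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path):
    # Assumption: pdftotext is on the PATH. "-layout" asks it to
    # preserve the physical layout of the text as far as it can.
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_all(src_dir, out_dir, run=subprocess.run):
    """Convert every *.pdf directly inside src_dir to a .txt file in out_dir.

    `run` is injectable so the loop can be exercised without pdftotext.
    Returns the list of text file paths that were requested.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for pdf in sorted(src_dir.glob("*.pdf")):
        txt = out_dir / (pdf.stem + ".txt")
        # check=True raises an error if the converter exits with a failure.
        run(build_command(pdf, txt), check=True)
        converted.append(txt)
    return converted
```

A call such as convert_all("dumpster", "dumpster_txt") (hypothetical
folder names) would then leave one text file per PDF, which a script
could read to decide which folder each PDF belongs in.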

I am new to text processing. Maybe I am heading in the wrong direction
for this project? Any input is appreciated.

Yong Tang
A student