Re: PDF manipulation

From: Eric Lease Morgan <emorgan_at_nyob>
Date: Tue, 7 Aug 2012 09:54:06 -0400
To: CODE4LIB_at_LISTSERV.ND.EDU

On Aug 7, 2012, at 1:23 AM, Yong Tang <yongtang10_at_GMAIL.COM> wrote:

> First of all, what tool /tools do you use to manipulate PDF file 
> directly in a script? I tried some Perl modules such as CAM::PDF and 
> PDF::API2. The results were not pretty. The original text format was lost.

Yong, what type of manipulation do you want to do? What is your goal? Extract the plain text of a PDF document? Read the PDF document's metadata? Group the PDF documents into similar piles? While I haven't done any PDF document metadata reading, I'm sure there are Perl modules supporting these functions. Regarding the extraction of plain text, you have already gotten a number of suggestions. Personally, I use a binary called pdftotext (a part of the venerable Xpdf -- http://www.foolabs.com/xpdf/download.html) and use Perl's system function to execute it.
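
The approach of shelling out to pdftotext can be sketched roughly as follows. This is only an illustration, written in Python with subprocess standing in for Perl's system call, and it assumes pdftotext is installed and on the PATH. The function names and file names are hypothetical. The -layout option asks pdftotext to keep the physical layout of the page, which may help when the original text formatting matters.

```python
import shutil
import subprocess


def build_command(pdf_path, txt_path, keep_layout=True):
    """Return the argument list for one pdftotext invocation (hypothetical helper)."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout
        command.append("-layout")
    command += [pdf_path, txt_path]
    return command


def pdf_to_text(pdf_path, txt_path, keep_layout=True):
    """Run pdftotext on pdf_path, writing the plain text to txt_path."""
    # Assumption: the pdftotext binary from Xpdf is installed and on the PATH
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext not found on PATH")
    # check=True raises CalledProcessError if pdftotext exits with an error
    subprocess.run(build_command(pdf_path, txt_path, keep_layout), check=True)
    return txt_path
```

Calling pdf_to_text("article.pdf", "article.txt") would then write the extracted text next to the PDF. Passing the arguments as a list, rather than as one shell string, avoids quoting problems with file names that contain spaces.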

-- 
Eric Morgan
Received on Tue Aug 07 2012 - 09:55:29 EDT