Re: PDF->text extraction

From: Thomas Dowling <tdowling_at_nyob>
Date: Tue, 21 Jun 2011 10:37:42 -0400
To: CODE4LIB_at_LISTSERV.ND.EDU
On 06/21/2011 10:28 AM, Eric Lease Morgan wrote:

>> We've tried iText but had issues with quality
>> We moved to PDFBox but are having performance issues

> I have been satisfied with pdftotext which is a part of the Xpdf suite of tools -- http://bit.ly/kIHD1x

Same here.  For PDFs that are just wrapped TIFFs (page images) or for
flaky LaTeX->PDF conversions, you can also string together pdftoppm,
ImageMagick convert (ppm->TIFF), and Tesseract OCR.
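
A rough sketch of that chain in Python, for anyone who wants to script
it. It assumes pdftoppm, convert, and tesseract are all on the PATH;
the function names, the "page" file root, and the 300 dpi default are
made up for illustration and are not taken from any of the tools. On
ImageMagick 7 the convert command may need to be "magick" instead.

```python
import subprocess
from pathlib import Path


def page_commands(ppm_path):
    """Return the convert and tesseract argument lists for one rendered page."""
    ppm = Path(ppm_path)
    tif = ppm.with_suffix(".tif")
    out_base = ppm.with_suffix("")  # tesseract appends ".txt" to this base itself
    return [
        ["convert", str(ppm), str(tif)],          # ppm -> TIFF
        ["tesseract", str(tif), str(out_base)],   # TIFF -> <out_base>.txt
    ]


def ocr_pdf(pdf_path, work_dir, dpi=300):
    """Render every page with pdftoppm, then convert and OCR page by page."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)

    # pdftoppm writes <root>-<page number>.ppm; the zero padding of the
    # page number differs between builds, so the files are globbed below
    # rather than having their names predicted.
    subprocess.run(
        ["pdftoppm", "-r", str(dpi), str(pdf_path), str(work / "page")],
        check=True,
    )

    texts = []
    for ppm in sorted(work.glob("page-*.ppm")):
        for cmd in page_commands(ppm):
            subprocess.run(cmd, check=True)
        texts.append(ppm.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Calling ocr_pdf("scan.pdf", "ocr-work") would leave the intermediate
.ppm, .tif, and .txt files in ocr-work and return the concatenated
text. OCR quality depends heavily on the rendering resolution, so the
dpi value is worth experimenting with.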


-- 
Thomas Dowling
tdowling_at_ohiolink.edu
Received on Tue Jun 21 2011 - 10:40:56 EDT