Re: PDF->text extraction

From: Thomas Dowling <tdowling_at_nyob>
Date: Tue, 21 Jun 2011 10:37:42 -0400
To: CODE4LIB_at_LISTSERV.ND.EDU

On 06/21/2011 10:28 AM, Eric Lease Morgan wrote:

>> We've tried iText but had issues with quality
>> We moved to PDFBox but are having performance issues

> I have been satisfied with pdftotext which is a part of the Xpdf suite of tools -- http://bit.ly/kIHD1x

Same here.  For PDFs that are just wrapped TIFFs (scanned page images with
no text layer) or flaky LaTeX->PDF conversions, you can also string together
pdftoppm, ImageMagick convert (ppm->TIFF), and Tesseract OCR.
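
For what it's worth, here is a rough Python sketch of that chain. It is a sketch
under assumptions, not a tested recipe: it assumes pdftoppm, convert, and
tesseract are all on your PATH, the function names (ocr_commands, ocr_pdf) and
the 300 dpi default are just my placeholders, and sorting the page files by
name assumes pdftoppm zero-pads the page numbers in its output filenames.

```python
import glob
import os
import subprocess
import tempfile


def ocr_commands(ppm_path):
    """Return the convert and tesseract command lines for one page image."""
    base, _ = os.path.splitext(ppm_path)
    tiff_path = base + ".tif"
    return [
        ["convert", ppm_path, tiff_path],  # ImageMagick: ppm -> TIFF
        ["tesseract", tiff_path, base],    # Tesseract writes base + ".txt"
    ]


def ocr_pdf(pdf_path, dpi=300):
    """Rasterize each page with pdftoppm, then OCR the pages in order."""
    pages = []
    with tempfile.TemporaryDirectory() as workdir:
        prefix = os.path.join(workdir, "page")
        # pdftoppm writes one numbered .ppm file per page (prefix-N.ppm).
        subprocess.run(["pdftoppm", "-r", str(dpi), pdf_path, prefix], check=True)
        # Assumption: page numbers are zero-padded, so name order is page order.
        for ppm_path in sorted(glob.glob(prefix + "-*.ppm")):
            for command in ocr_commands(ppm_path):
                subprocess.run(command, check=True)
            text_path = os.path.splitext(ppm_path)[0] + ".txt"
            with open(text_path, encoding="utf-8") as handle:
                pages.append(handle.read())
    return "\n".join(pages)
```

Calling ocr_pdf("scan.pdf") (a hypothetical file name) should return the
recognized text for the whole document as one string.  Running pdftotext first
and falling back to this only when it returns nothing useful is probably the
cheaper order, since OCR is much slower.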

-- 
Thomas Dowling
tdowling_at_ohiolink.edu