Scanned PDF to text

From: Kyle Banerjee <kyle.banerjee_at_nyob>
Date: Tue, 9 Dec 2014 05:25:09 -0800
To: CODE4LIB_at_LISTSERV.ND.EDU
Howdy all,

I've just started a project that involves harvesting large numbers of
scanned PDFs and extracting information from the OCR text. The process
I've started with -- using ImageMagick to convert each PDF to TIFF and
Tesseract to run the OCR -- is more system intensive than I hoped it
would be.
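
For reference, a minimal sketch of the two-step pipeline described above,
written in Python rather than Perl purely for illustration. The post does
not give the actual commands, so everything here is an assumption: the
function names are hypothetical, and the 300 DPI, grayscale, 8-bit settings
are placeholder starting values, not recommended parameters. It also assumes
the ImageMagick `convert` and `tesseract` binaries are on the PATH.

```python
import subprocess
from pathlib import Path


def build_convert_cmd(pdf, tiff, dpi=300):
    """Build the ImageMagick command that rasterizes a PDF to TIFF.

    The dpi default is an assumed starting point, not a tuned value.
    """
    # -density must come before the input file so that it controls the
    # resolution at which the PDF pages are rasterized.
    return ["convert", "-density", str(dpi), str(pdf),
            "-colorspace", "Gray", "-depth", "8", str(tiff)]


def build_tesseract_cmd(tiff, out_base, lang="eng"):
    """Build the Tesseract command; Tesseract appends '.txt' to out_base."""
    return ["tesseract", str(tiff), str(out_base), "-l", lang]


def ocr_pdf(pdf, workdir, dpi=300):
    """Run both steps for one PDF and return the path of the text output."""
    pdf = Path(pdf)
    workdir = Path(workdir)
    tiff = workdir / (pdf.stem + ".tiff")
    out_base = workdir / pdf.stem
    # check=True raises CalledProcessError if either tool exits non-zero.
    subprocess.run(build_convert_cmd(pdf, tiff, dpi), check=True)
    subprocess.run(build_tesseract_cmd(tiff, out_base), check=True)
    return Path(str(out_base) + ".txt")
```

Keeping the command construction separate from the execution makes it easy
to try a lower `dpi` and compare run time against OCR quality on a few
sample files.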

Is there an easier/faster process that I'm missing? Perl-friendly solutions
are preferred because this fits in as part of a larger process. If I am
already using my best option, what image parameters are recommended if I
want to reach the point of diminishing returns rather than go for the
best possible results? Thanks,

kyle
Received on Tue Dec 09 2014 - 08:26:15 EST