Re: Scanned PDF to text

From: David J. Fiander <david_at_nyob>
Date: Thu, 11 Dec 2014 11:24:20 -0500
To: CODE4LIB_at_LISTSERV.ND.EDU

Art Rhyno talked about doing this with scans of old community newspapers
a few years ago (https://www.youtube.com/watch?v=gcjCiS9pJ3A).

Yes, it's very compute-intensive and slow. He set up Hadoop to farm jobs
out to the PCs in the library's public lab while the library was closed
at night.
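
For anyone who wants to try the "farm the pages out" idea on a single
machine first, here is a minimal sketch in Python. It is not Art's Hadoop
setup: a local thread pool stands in for the cluster. It assumes the PDF
pages have already been rasterized to image files and that the `tesseract`
binary is on the PATH. The function names (`tesseract_command`, `ocr_pages`)
and the `runner` parameter are made up for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def tesseract_command(image, out_dir, lang="eng"):
    """Build the CLI call `tesseract <image> <outputbase> -l <lang>`.

    Tesseract appends ".txt" to the output base itself.
    """
    image = Path(image)
    out_base = Path(out_dir) / image.stem
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_pages(images, out_dir, workers=4, runner=subprocess.run):
    """Run one tesseract process per page image, `workers` at a time.

    Threads are enough here because the heavy work happens in the
    external tesseract processes, not in Python. `runner` is a
    hypothetical hook so the function can be exercised without
    tesseract installed.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    commands = [tesseract_command(img, out_dir) for img in images]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any failure from a worker.
        list(pool.map(lambda cmd: runner(cmd, check=True), commands))
    return commands
```

Calling `ocr_pages(sorted(Path("scans").glob("*.tif")), "ocr-out")` would
then write one text file per page into `ocr-out`, where `scans` and
`ocr-out` are placeholder directory names.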

- David

On 2014/12/11 03:59, Chris Fitzpatrick wrote:
> Tesseract is going to be slow, and there might not be much you can do about
> that.
> 
> You can do a couple of things, like set up processes that run on AWS EC2
> spot instances, so you can put a standing bid order on AWS instances and
> only run your OCR when the price drops.
> 
> Or you can buy ABBYY, which is much faster.
> 
> b,chris.
> 
> 
> On Tue, Dec 9, 2014 at 5:45 PM, Kyle Banerjee <kyle.banerjee_at_gmail.com>
> wrote:
> 
>>> I’m not quite sure if I understand the question, but if all you want to
>>> do is pull the text out of an OCR’ed PDF file, then I have found both Tika
>>> and PDFtotext to be useful tools....
>>>
>>> On the other hand, if you need to do the OCR itself, then employing
>>> Tesseract is probably the way to go.
>>
>> For clarity, I have to do the OCR itself. I've been using CAM::PDF to
>> extract existing text.
>>
>> Kyle
>>