Re: Scanned PDF to text

From: Chris Fitzpatrick <chrisfitzpat_at_nyob>
Date: Thu, 11 Dec 2014 09:59:21 +0100
To: CODE4LIB_at_LISTSERV.ND.EDU

Tesseract is going to be slow, and there might not be much you can do about
that.

You can do a couple of things, like set up processes that run on AWS EC2
spot instances: put a standing bid order on AWS instances so that your OCR
only runs when the price drops.
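
A rough sketch of what that standing bid could look like from Python, in
case it helps. It assumes boto3 and its EC2 request_spot_instances call; the
AMI ID, instance type, price, and page directory are placeholders I made
up, not recommendations. The user-data script just loops Tesseract over
TIFF pages, writing one .txt file next to each image.

```python
import base64

# Hypothetical boot script: OCR every page image found in /data/pages.
# "tesseract IMAGE OUTPUTBASE" writes OUTPUTBASE.txt.
OCR_SCRIPT = """#!/bin/bash
for f in /data/pages/*.tif; do
  tesseract "$f" "${f%.tif}"
done
"""


def build_spot_request(max_price_usd, ami_id, instance_type, user_data, count=1):
    """Build keyword arguments for EC2's RequestSpotInstances call."""
    return {
        # The API takes the maximum price as a string, not a float.
        "SpotPrice": f"{max_price_usd:.4f}",
        "InstanceCount": count,
        # "persistent" keeps the request open, so instances are launched
        # again whenever the spot price falls back under the maximum.
        "Type": "persistent",
        "LaunchSpecification": {
            "ImageId": ami_id,
            "InstanceType": instance_type,
            # User data must be base64-encoded for this call.
            "UserData": base64.b64encode(user_data.encode()).decode(),
        },
    }


def submit(request_kwargs, region="us-east-1"):
    """Send the request to AWS. Needs boto3 and configured credentials."""
    import boto3  # imported here so the builder above works without it

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.request_spot_instances(**request_kwargs)
```

You would call something like
submit(build_spot_request(0.05, "ami-00000000", "c3.large", OCR_SCRIPT))
with your own values. Since spot instances can be interrupted when the price
rises, the OCR job should be able to pick up where it left off.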

Or you can buy ABBYY, which is much faster.

b,chris.


On Tue, Dec 9, 2014 at 5:45 PM, Kyle Banerjee <kyle.banerjee_at_gmail.com>
wrote:

> > I’m not quite sure if I understand the question, but if all you want to
> > do is pull the text out of an OCR’ed PDF file, then I have found both Tika
> > and PDFtotext to be useful tools....
> >
> > On the other hand, if you need to do the OCR itself, then employing
> > Tesseract is probably the way to go.
>
> For clarity, I have to do the OCR itself. I've been using CAM::PDF to
> extract existing text.
>
> Kyle
>
Received on Thu Dec 11 2014 - 04:00:16 EST