Re: Scanned PDF to text

From: Mads Villadsen <mv_at_nyob>
Date: Tue, 9 Dec 2014 14:41:49 +0100
To: CODE4LIB_at_LISTSERV.ND.EDU

On 2014-12-09 14:25, Kyle Banerjee wrote:
> Howdy all,
>
> I've just started a project that involves harvesting large numbers of
> scanned PDFs and extracting information from the OCR text.
> The process I've started with -- use imagemagick to convert to tiff and
> tesseract to pull out the OCR -- is more system intensive than I hoped it
> would be.
>

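I asked around the office, and the process seems sensible overall. One
suggestion was to use pdfimages instead of imagemagick for the
PDF-to-image step, as that should be faster: pdfimages copies out the
images already embedded in the PDF rather than rendering each page.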
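
A rough sketch of what that pipeline could look like, in case it helps.
It assumes poppler's pdfimages and the tesseract command-line tool are
both on the PATH, and that each scanned page is stored as a single
embedded image. The function names, the "page" prefix, and the working
directory layout are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdfimages_cmd(pdf, prefix):
    """Build the command that extracts embedded images as TIFF files."""
    # -tiff asks pdfimages to write TIFF output instead of its default format.
    return ["pdfimages", "-tiff", str(pdf), str(prefix)]


def tesseract_cmd(image, out_base):
    """Build the command that OCRs one image into <out_base>.txt."""
    return ["tesseract", str(image), str(out_base)]


def ocr_pdf(pdf, workdir):
    """Extract the images from a scanned PDF and OCR each one.

    Returns the paths of the text files tesseract is expected to write.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    prefix = workdir / "page"  # hypothetical output prefix
    subprocess.run(pdfimages_cmd(pdf, prefix), check=True)

    texts = []
    # Assumption: pdfimages names its output <prefix>-NNN.tif.
    for image in sorted(workdir.glob("page-*.tif*")):
        out_base = image.with_suffix("")
        subprocess.run(tesseract_cmd(image, out_base), check=True)
        texts.append(out_base.with_suffix(".txt"))
    return texts
```

Something like ocr_pdf("scan.pdf", "work") would then leave one .txt
file per extracted image in the work directory. Pages that are built
from several image fragments would need a rendering step instead.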

However, I would guess that most of the processing time is actually
spent in tesseract, so I don't know how much this suggestion will
improve the overall performance.
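
One way to check that guess would be to time each stage separately on a
sample file before changing anything. A minimal sketch, assuming both
tools are on the PATH; the helper names and the file names in the usage
part are hypothetical.

```python
import subprocess
import time


def timed(label, func, *args, **kwargs):
    """Call func and return (label, elapsed seconds, result)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return label, time.perf_counter() - start, result


def time_commands(stages):
    """Run each (label, argv) stage in order and return {label: seconds}."""
    timings = {}
    for label, argv in stages:
        _, seconds, _ = timed(label, subprocess.run, argv, check=True)
        timings[label] = seconds
    return timings


if __name__ == "__main__":
    # Hypothetical sample files; substitute a real scanned PDF.
    stages = [
        ("pdfimages", ["pdfimages", "-tiff", "sample.pdf", "page"]),
        ("tesseract", ["tesseract", "page-000.tif", "page-000"]),
    ]
    for label, seconds in time_commands(stages).items():
        print(f"{label}: {seconds:.2f}s")
```

If the tesseract figure dominates, swapping the extraction tool will
not change the total much.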

Regards.

-- 
Mads Villadsen <mv_at_statsbiblioteket.dk>
Statsbiblioteket
IT developer