Re: PDF->text extraction

From: Owen Stephens <owen_at_nyob>
Date: Wed, 22 Jun 2011 08:57:44 +0100
To: CODE4LIB_at_LISTSERV.ND.EDU

Thanks to all for the info and suggestions - we'll have a look at them.

Via another route I've had http://snowtide.com/PDFTextStream recommended (commercial, but it looks like they are generally open to offering academic licenses for free, at least for a limited period) - has anyone tried that?

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: owen_at_ostephens.com
Telephone: 0121 288 6936

On 22 Jun 2011, at 03:43, Bill Janssen wrote:

> Simon Spero <ses_at_UNC.EDU> wrote:
> 
>> Another option is to use the ABBYY FineReader
>> SDK <http://www.abbyy.com/ocr_sdk_linux/overview/>.
>> Annoyingly, the Linux version is one release behind the Windows SDK (which
>> has improved support for multi-core processing of a single document). Since
>> Owen's problem is embarrassingly parallel, multi-core tuning isn't as
>> useful as being able to run on a local cluster or regional grid. ABBYY
>> software tends to be a little pricey, but the results are usually very good.
> 
> If you're going to OCR, Nuance OmniPage is also very good, and I believe
> costs about the same as FineReader.  We also use tOCR, from Transym,
> which is Windows-only, but very accurate and cheap.  I have yet to see
> decent results on complicated pages (technical papers) from either
> OCRopus or Tesseract with the default models that they come with; I
> believe they're both still aimed at book page OCR.
> 
> Bill