Re: PDF->text extraction

From: Bill Janssen <janssen_at_nyob>
Date: Tue, 21 Jun 2011 19:43:17 PDT
To: CODE4LIB_at_LISTSERV.ND.EDU

Simon Spero <ses_at_UNC.EDU> wrote:

> Another option is to use the ABBYY FineReader
> SDK<http://www.abbyy.com/ocr_sdk_linux/overview/>.
> Annoyingly, the Linux version is one release behind the Windows SDK (which
> has improved support for multi-core processing of a single document).  Since
> Owen's problem is embarrassingly parallel, multi-core tuning isn't as
> useful as being able to run on a local cluster or regional grid.  ABBYY
> software tends to be a little pricey, but the results are usually very good.

If you're going to OCR, Nuance OmniPage is also very good, and I believe
costs about the same as FineReader.  We also use tOCR, from Transym,
which is Windows-only, but very accurate and cheap.  I have yet to see
decent results on complicated pages (technical papers) from either
OCRopus or Tesseract using their default models; I
believe both are still aimed at book-page OCR.

Bill