Re: OCR To ALTO without ABBYY

From: Bridger Dyson-Smith <bdysonsmith_at_nyob>
Date: Thu, 6 Sep 2012 09:09:47 -0400
To: CODE4LIB_at_LISTSERV.ND.EDU

You might take a look at Tesseract [1]. On a typical Linux box:

$ tesseract input.tif outputName hocr

renders hOCR, that is, HTML with bounding-box coordinates for the recognized
text. You might be able to convert that output to ALTO.
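
As a rough illustration of that conversion, here is a minimal Python sketch,
not a tested converter. It assumes the hOCR marks words with
class 'ocrx_word' (or 'ocr_word') and lines with class 'ocr_line', each
carrying a title of the form 'bbox x0 y0 x1 y1', which can vary by Tesseract
version. The names HocrWords and hocr_to_alto are made up for this example,
and the output is only a bare ALTO-style skeleton (no namespace, no
Description section, a single TextBlock), so it would not validate against
the ALTO schema as-is.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute, e.g. "bbox 10 20 60 40; x_wconf 90"
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, x0, y0, x1, y1) tuples for each word, grouped by line."""

    def __init__(self):
        super().__init__()
        self.lines = []    # one list of word tuples per ocr_line
        self._bbox = None  # bbox of the word span currently open, if any
        self._depth = 0    # span nesting depth inside the open word
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if self._bbox is not None:
            # Already inside a word: just track nested spans.
            if tag == "span":
                self._depth += 1
            return
        classes = (a.get("class") or "").split()
        if "ocr_line" in classes:
            self.lines.append([])
        elif "ocrx_word" in classes or "ocr_word" in classes:
            m = BBOX.search(a.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._depth = 1
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None or tag != "span":
            return
        self._depth -= 1
        if self._depth == 0:  # the outermost word span just closed
            text = "".join(self._text).strip()
            if text:
                if not self.lines:
                    self.lines.append([])
                self.lines[-1].append((text,) + self._bbox)
            self._bbox = None


def hocr_to_alto(hocr_text):
    """Return a simplified ALTO-style XML string for one page of hOCR."""
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()

    alto = ET.Element("alto")
    layout = ET.SubElement(alto, "Layout")
    page = ET.SubElement(layout, "Page")
    space = ET.SubElement(page, "PrintSpace")
    block = ET.SubElement(space, "TextBlock")  # assumption: one block per page
    for words in parser.lines:
        if not words:
            continue
        line = ET.SubElement(block, "TextLine")
        for text, x0, y0, x1, y1 in words:
            # ALTO uses position plus size rather than two corners.
            ET.SubElement(line, "String", CONTENT=text,
                          HPOS=str(x0), VPOS=str(y0),
                          WIDTH=str(x1 - x0), HEIGHT=str(y1 - y0))
    return ET.tostring(alto, encoding="unicode")
```

To try it, read the file Tesseract wrote (outputName.html or
outputName.hocr, depending on the version) and pass its contents to
hocr_to_alto(). Coordinates are copied through in pixels, so a real ALTO file
would still need to declare its measurement unit and the rest of the required
metadata.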

Cheers,
Bridger
[1] http://code.google.com/p/tesseract-ocr/

On Thu, Sep 6, 2012 at 8:29 AM, Michael Beccaria
<mbeccaria_at_paulsmiths.edu> wrote:

> I inadvertently purchased ABBYY Finereader 11 Corporate thinking that it
> would be capable of outputting to ALTO XML. I was wrong. ABBYY Finereader
> Engine does :/
>
> Ultimately, I want to OCR some newspaper images and export them to ALTO
> XML and, until the proof of concept is done, I want to try to do it on the
> cheap. My plan this morning was to write some scripts to OCR them using
> Microsoft Office Document Imaging (MODI) and then export the results to
> ALTO XML, which could be a big project. Has anyone done this before, or does
> anyone know of a quick and dirty way to get some OCR data?
> Thanks,
> Mike Beccaria
> Systems Librarian
> Paul Smith's College
> 518.327.6376
>