Re: Free/Open OCR solutions?

From: Ethan Gruber <ewg4xuva_at_nyob>
Date: Wed, 28 Jul 2010 11:50:01 -0400
To: CODE4LIB_at_LISTSERV.ND.EDU

Google's Tesseract is pretty good.  I believe that is what they use for
dirty OCR in the Google Books project.
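
If you want to batch it on your Ubuntu box, a rough sketch along these lines
might be a starting point. It only assumes the basic command-line form
"tesseract <image> <outputbase> -l <lang>". The directory names, the "*.tif"
pattern, and both function names are placeholders I made up for illustration,
so adjust them to whatever your scans and your installed version actually
need.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, out_dir, lang="eng"):
    """Build the argv list for one page (hypothetical helper, not part of Tesseract)."""
    image_path = Path(image_path)
    # Tesseract adds the ".txt" extension to the output base on its own.
    output_base = Path(out_dir) / image_path.stem
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_directory(scan_dir, out_dir, pattern="*.tif", lang="eng"):
    """Run Tesseract on every matching scan and return the expected text file paths.

    Assumes the tesseract binary is on PATH and can read the image format used.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    text_files = []
    for image_path in sorted(Path(scan_dir).glob(pattern)):
        cmd = build_tesseract_command(image_path, out_dir, lang)
        # check=True stops the batch on the first page Tesseract fails on.
        subprocess.run(cmd, check=True)
        text_files.append(out_dir / (image_path.stem + ".txt"))
    return text_files
```

Calling ocr_directory("scans", "ocr_text") would then walk the scans one page
at a time, which lets the server grind through them unattended instead of
tying up the desktop machines.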

Ethan

On Wed, Jul 28, 2010 at 11:46 AM, Andy Kelly <a.m.kelly_at_gmail.com> wrote:

> I'm working on scanning some documents in a collection and then performing
> OCR on the documents. Thus far, I've used Adobe Acrobat Pro's OCR function
> with some success, but the machines I'm working on are fairly old Pentium 4
> Dell boxes. This makes opening 600 DPI scans painful and performing OCR an
> entirely valid excuse for a long coffee break.
>
> As you might expect, I'm looking for a way to speed up this process at the
> OCR end of things, since the scanning can only move so quickly. I'm
> wondering if any of you have experience with any open OCR solutions such
> as:
> Tesseract-OCR <http://code.google.com/p/tesseract-ocr/> or
> OCRopus <http://code.google.com/p/ocropus/>.
> At a glance, Tesseract seems to be further along in development. Any other
> suggestions on how best to approach this sort of task would be appreciated
> if you've done similar work.
>
> I'm planning to evaluate one or both of these on my own Ubuntu server,
> as much for my own interest as for the project's or the organization's.
> Since I'm an unpaid part-time intern and the only one working on this
> project, I'm willing to learn to do things the hard way now so that
> they're easier in the long run.
>
> Thanks for any suggestions or advice you may be able to offer.
>
> --
> ~Andrew M. Kelly
> MLIS Degree Candidate, Simmons GSLIS 2011
> Archives & Librarianship Intern, Boston University: African Presidential
> Archive & Research Center
> Evening Library Assistant, Bay State College
> twitter: @a_m_kelly
>