Re: Comparing OCR output to dictionary

From: Sarah Swanz <spswanz_at_nyob> Date: Fri, 3 Sep 2021 10:20:52 -0500 To: CODE4LIB_at_LISTS.CLIR.ORG

This Jupyter notebook from the National Library of Scotland has a section on how to evaluate OCR accuracy under the Data Cleaning chapter.

You might also check out the 'fastwer' package described in this article. I have not used it myself, so I cannot vouch for it.
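
For the quick-and-dirty dictionary check described in the question below, here is a minimal sketch in plain Python (standard library only). It is an illustration under stated assumptions, not a tested tool: the function names, the one-word-per-line word list format, and the ".txt" file layout are all hypothetical choices, and the tokenizer simply splits on whitespace and strips surrounding punctuation.

```python
import string
from pathlib import Path


def load_wordlist(path):
    """Read a word list (assumed: one word per line) into a lowercase set."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip().lower() for line in fh if line.strip()}


def dictionary_ratio(text, words):
    """Return the fraction of tokens in `text` that appear in `words`.

    Tokens are whitespace-separated chunks with leading and trailing
    punctuation removed. Returns 0.0 when the text has no tokens.
    """
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in words)
    return hits / len(tokens)


def score_directory(folder, words):
    """Map each .txt file name in `folder` to its dictionary ratio."""
    scores = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" keeps badly encoded OCR output from aborting the run
        text = path.read_text(encoding="utf-8", errors="replace")
        scores[path.name] = dictionary_ratio(text, words)
    return scores
```

To compare engines, one could call score_directory once per engine's output folder with the same word list (for example, a system word list such as /usr/share/dict/words where one exists) and compare the per-file ratios. Note that this ratio only rewards tokens that happen to be real words; it does not catch a misread word that is still in the dictionary, and it penalizes proper nouns and hyphenated line breaks.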

Sarah Swanz
University of Michigan, School of Information (2018)
On Sep 2, 2021, 3:09 PM -0500, Kimberly Kennedy <kimberlymkennedy_at_gmail.com>, wrote:
> Hello!
>
> I was wondering if anyone has created a script or tool to compare the words
> in a text file to a dictionary? I'm looking for a way to quantify the
> quality of OCR output. I've heard that counting the number of words that
> are in the dictionary is a good quick and dirty way to do this, but I would
> like to be able to run this script on larger batches of text files so I can
> compare OCR engines (not count words manually).
>
> Let me know if you have any existing tools or thoughts about how to go
> about this!
>
> Thanks,
>
> Kim
>
>
>
> Kimberly Kennedy
> Digital Production Coordinator
> Northeastern University Library
> ki.kennedy_at_northeastern.edu
> kimberlymkennedy_at_gmail.com