Looking for Ideas on Line Breaks in OCR Text

From: Matt Sherman <matt.r.sherman_at_nyob>
Date: Mon, 3 Aug 2015 22:29:12 -0400
To: CODE4LIB_at_LISTSERV.ND.EDU

Hi Code4Lib folks,

I was wondering if anyone has experience cleaning up OCR text.  In
particular, I am trying to figure out how to deal with the stray line
breaks that come from OCR.  I am trying to parse a bibliography with
regex.  I think I've figured out which patterns I need to run to split
it up so I can turn it into a tab-delimited text file, but I noticed
that the text does the classic OCR thing of inserting a line break
wherever a line physically ends on the page.  This is a problem
because it splits each annotation across several lines rather than
leaving it as one block that I can load into a database.  So I am
wondering if anyone who has worked with OCR text before can suggest a
way to clean up those line breaks without going through 300+ pages by
hand?  Any thoughts would be welcome.
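
To make the goal concrete, here is a rough Python sketch of the kind of
cleanup I mean.  It rests on an assumption that may not hold for every
scan: that a blank line separates one bibliography entry from the
next.  The function name unwrap_ocr_text is only a placeholder.

```python
import re


def unwrap_ocr_text(text: str) -> str:
    """Join hard-wrapped lines inside each blank-line-separated block."""
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # ASSUMPTION: one or more blank (or whitespace-only) lines
    # separate entries. Adjust this split if the OCR output differs.
    blocks = re.split(r"\n\s*\n", text.strip())
    cleaned = []
    for block in blocks:
        # Rejoin words hyphenated across a line break ("bibli-\nography").
        # Caveat: this also merges genuine hyphenated compounds that
        # happen to wrap at the hyphen.
        block = re.sub(r"(\w)-\n\s*(\w)", r"\1\2", block)
        # Replace each remaining line break with a single space.
        block = re.sub(r"\s*\n\s*", " ", block)
        cleaned.append(block.strip())
    # One entry per output line, ready for further regex splitting.
    return "\n".join(cleaned)


if __name__ == "__main__":
    sample = (
        "Smith, J. A study of bibli-\n"
        "ography and its\n"
        "uses.\n"
        "\n"
        "Jones, K. Another\n"
        "entry.\n"
    )
    print(unwrap_ocr_text(sample))
```

If entries are not separated by blank lines, the split would need some
other cue instead (for example a line that starts with an author name
or a year), which is the part I am least sure how to detect reliably.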

Matt Sherman
Received on Mon Aug 03 2015 - 22:30:08 EDT