Re: Looking for Ideas on Line Breaks in OCR Text

From: Matt Sherman <matt.r.sherman_at_nyob>
Date: Tue, 4 Aug 2015 09:39:19 -0400
To: CODE4LIB_at_LISTSERV.ND.EDU

Hm, after doing a little looking on someone's suggestion, it turns out
I was wrong: they are not line breaks, they are paragraph marks.
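
A minimal sketch of one way to unwrap such text in Python, once it has been
saved as plain text. It rests on assumptions that are not from this thread:
that the paragraph marks come out as ordinary line endings (\r\n, \r, or \n),
that bibliography entries are separated by at least one blank line, and that a
trailing hyphen marks a word split across lines. The function name
unwrap_paragraphs is hypothetical, and the heuristics would need tuning
against the real file.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines into one line per blank-line-separated block."""
    # Normalize Windows (\r\n) and bare carriage-return line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Assumption: one or more blank lines separate entries.
    blocks = re.split(r"\n\s*\n", text.strip())

    unwrapped = []
    for block in blocks:
        joined = ""
        for line in block.split("\n"):
            line = line.strip()
            if not line:
                continue
            if joined.endswith("-"):
                # Assumption: a trailing hyphen is a word split across
                # lines, so drop the hyphen and glue the halves together.
                # This will also merge genuinely hyphenated words.
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        unwrapped.append(joined)

    # One entry per output line.
    return "\n".join(unwrapped)
```

To run it over a file, read the whole text with open(path, encoding="utf-8",
newline="") so the original line endings are preserved, pass it through
unwrap_paragraphs, and write the result to a new file. If the entries are not
separated by blank lines, a different boundary test (for example, a line that
starts with an author name pattern) would have to replace the re.split step.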

On Tue, Aug 4, 2015 at 9:21 AM, Scancella, John <jsca_at_loc.gov> wrote:
> Matt,
>
> A Word document does funny things to the text since it is actually html (try opening a .doc in a plain text editor and you will see it is html). I would try to get the plain ASCII text instead, and then install Cygwin, which contains sed and a bunch of other useful Unix/Linux commands.
> See http://stackoverflow.com/a/127567/2896744 for more info.
> ________________________________________
> From: Code for Libraries [CODE4LIB_at_LISTSERV.ND.EDU] On Behalf Of Matt Sherman [matt.r.sherman_at_GMAIL.COM]
> Sent: Tuesday, August 04, 2015 9:09 AM
> To: CODE4LIB_at_LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
>
> I am on Windows machines, so I don't have quite the easy access to
> that useful command.  Someone had earlier put the OCR in a doc file so
> I've been playing with that more than with the raw PDF OCR.
>
> On Tue, Aug 4, 2015 at 8:19 AM, Scancella, John <jsca_at_loc.gov> wrote:
>> Matt,
>>
>> There are probably a dozen ways to do this, but it would be really helpful to know what operating system you are on. For example, if you are using Linux, you can run it through sed using
>>   sed ':a;N;$!ba;s/\n/ /g' <OCR_FILE> >> <STRIPPED_OCR_FILE>
>> see http://stackoverflow.com/a/800644/2896744 for more info
>> ________________________________________
>> From: Code for Libraries [CODE4LIB_at_LISTSERV.ND.EDU] On Behalf Of Matt Sherman [matt.r.sherman_at_GMAIL.COM]
>> Sent: Monday, August 03, 2015 10:29 PM
>> To: CODE4LIB_at_LISTSERV.ND.EDU
>> Subject: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
>>
>> Hi Code4Lib folks,
>>
>> I was wondering if anyone had some experience cleaning up OCR text.
>> In particular, I am trying to figure out how to deal with the random
>> line breaks that come from OCR.  I am trying to parse out a
>> bibliography with regex.  I think I've figured out which queries I
>> need to run to break it up so I can make it into a tab-delimited text
>> file, but I noticed that the text does the classic thing of OCR
>> inserting line breaks where they physically are on the page.  This
>> will obviously be a bit of an issue, since it would break each
>> annotation into a bunch of lines rather than leaving it as one block I
>> can manipulate into a database.  So I am wondering if anyone who
>> has worked with OCR text before has a suggested way to clean up those
>> line breaks without doing 300+ pages by hand?  Any thoughts would be
>> welcome.
>>
>> Matt Sherman