Re: Looking for Ideas on Line Breaks in OCR Text

From: Kyle Banerjee <kyle.banerjee_at_nyob>
Date: Tue, 4 Aug 2015 09:17:27 -0700
To: CODE4LIB_at_LISTSERV.ND.EDU
On Tue, Aug 4, 2015 at 6:09 AM, Matt Sherman <matt.r.sherman_at_gmail.com>
wrote:

> I am on Windows machines, so I don't have quite the easy access to
> that useful command.  Someone had earlier put the OCR in a doc file so
> I've been playing with that more than with the raw PDF OCR.
>
>
Versions of the Unix utilities that run on Windows are available, but you
can do what you want in Microsoft Word with the Find and Replace function.
In Word, you can search for a paragraph mark by looking for "^p" (caret p).

Because you undoubtedly have real paragraphs in the document which you
don't want to remove, I'd recommend first replacing each double paragraph
mark ("^p^p") with something unique that appears nowhere else in the text
(e.g. "@ZZZ@"). Then replace all the remaining paragraph marks with a
space, and finally replace your unique marker with a paragraph mark ("^p").
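
If you'd rather script it on the plain OCR text than click through Word,
the same three steps can be sketched roughly as below. This is only an
illustration in Python, not something I've run against your file. The
function name unwrap_lines is made up, it assumes real paragraphs are
separated by at least one blank line, and it treats any run of two or more
line breaks as a single paragraph break, which is slightly more forgiving
than a literal "^p^p" replace in Word.

```python
import re

# Unique marker from the message above; it must not occur in the text.
PLACEHOLDER = "@ZZZ@"


def unwrap_lines(text, placeholder=PLACEHOLDER):
    """Join hard-wrapped lines but keep blank-line paragraph breaks.

    Hypothetical helper: assumes real paragraphs are separated by
    one or more blank lines in the OCR output.
    """
    if placeholder in text:
        raise ValueError("placeholder already occurs in the text")

    # Normalize Windows (\r\n) and old Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Step 1: protect real paragraph breaks (two or more line breaks).
    text = re.sub(r"\n{2,}", placeholder, text)

    # Step 2: every remaining line break is a wrap; make it a space.
    text = text.replace("\n", " ")

    # Step 3: turn the marker back into a paragraph break.
    return text.replace(placeholder, "\n\n")


if __name__ == "__main__":
    sample = "first line\nsecond line\n\nnew paragraph\ncontinues here"
    print(unwrap_lines(sample))
```

The check at the top is there because the whole trick depends on the
marker being unique; if "@ZZZ@" somehow shows up in the OCR, pick another.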

HTH,

kyle
Received on Tue Aug 04 2015 - 12:27:28 EDT