Re: pdf2txt

From: David Friggens <friggens_at_nyob>
Date: Mon, 14 Oct 2013 11:21:50 +1300
To: CODE4LIB_at_LISTSERV.ND.EDU

> For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8

Looks very good, and thanks for sharing it. (It's certainly not the
first piece of software called pdf2txt, but that probably doesn't
matter.)

> PDF2TXT extracts the text from an OCRed PDF document

The file I tried was digital native (probably from Word) so perhaps
outside your intended scope. The text output was fairly similar to
that from pdftotext (in Ubuntu poppler-utils package), perhaps better
in losing the arbitrary line breaks, but fell over on macrons. There
were a lot of Māori words and the vowels with macrons disappeared -
e.g. Pākehā => Pkeh.
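
To illustrate the symptom (a sketch only: I don't know how PDF2TXT
handles encodings internally, so the ASCII-only step here is purely an
assumption), the following Python reproduces the same loss and shows
two gentler alternatives. The variable names are made up for the
example.

```python
import unicodedata

# "Pākehā" written with escapes so the precomposed ā (U+0101) is unambiguous.
word = "P\u0101keh\u0101"

# Assumed failure mode: a step that silently discards every non-ASCII character.
lossy = word.encode("ascii", errors="ignore").decode("ascii")

# Alternative 1: decompose (NFD) and drop only the combining marks,
# which keeps the base vowels.
decomposed = unicodedata.normalize("NFD", word)
stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Alternative 2: keep the text as Unicode and encode it as UTF-8,
# which round-trips without any loss.
roundtrip = word.encode("utf-8").decode("utf-8")

print(lossy)      # Pkeh
print(stripped)   # Pakeha
print(roundtrip == word)
```

If something like the first step is happening anywhere in the
pipeline, reading and writing everything as UTF-8 would be the
cleaner fix; stripping the macrons is only a fallback, since it
changes the spelling of the words.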

I assume Unicode issues were also at the heart of %3Cunknown%3E being
one of the "most frequent verbs".  The link for this [1] gives a regex
error.

Cheers
David

[1] http://dh.crc.nd.edu/sandbox/pdf2txt/pdf2txt.cgi?cmd=verbs&id=1381700598&lemma=%3Cunknown%3E