Re: pdf2txt [foreign documents]

From: Eric Lease Morgan <emorgan_at_nyob>
Date: Sat, 12 Oct 2013 10:02:12 -0400
To: CODE4LIB_at_LISTSERV.ND.EDU

On Oct 11, 2013, at 6:39 PM, Mark Pernotto <mark.pernotto_at_GMAIL.COM> wrote:

> Putting my devil's advocate hat on, it doesn't parse foreign documents well
> (I got it to break!).  I also got inconsistent results feeding it PDF files
> with tables embedded (but haven't been able to figure out what it is about
> them it doesn't like).

Mark, foreign documents. Good point. There is a (Perl) module called… Well, I can't find it right now. With it, it is possible to guess the language of a text. It does this by counting how many stop words from each language's list appear in a document. Once the language is determined, the matching stop word list can be applied to the document, and the results ought to be better.
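
To make the idea concrete, here is a minimal sketch of stop word based language guessing. It is written in Python rather than Perl, it is not the module mentioned above, and the names (STOP_WORDS, tabulate_stop_words, guess_language) as well as the tiny word lists are assumptions made up for illustration. Real stop word lists are much longer.

```python
import re
from collections import Counter

# Tiny, illustrative stop word lists (assumption: real lists are far longer).
STOP_WORDS = {
    "english": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "french": {"le", "la", "les", "et", "de", "un", "une", "est"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "spanish": {"el", "la", "los", "y", "de", "que", "es", "en"},
}


def tabulate_stop_words(text, stop_words=STOP_WORDS):
    """Count, per language, how many tokens in text are stop words."""
    # Lowercase the text and keep runs of letters only (Unicode aware).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter()
    for token in tokens:
        for language, words in stop_words.items():
            if token in words:
                counts[language] += 1
    return counts


def guess_language(text, stop_words=STOP_WORDS):
    """Return the language with the most stop word hits, or None if no hits."""
    counts = tabulate_stop_words(text, stop_words)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    print(guess_language("Le chat est dans la maison et les enfants"))
```

Once guess_language returns a name, the matching entry in STOP_WORDS is the list to apply to the document. Short or mixed-language texts will produce few hits, so the guess is only as good as the amount of text and the size of the lists.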

Also, please remember, how well a document parses into sentences and words depends directly on the quality of the underlying OCR. That is a limitation I am not able to overcome.

--
Eric