Re: pdf2txt [tika]

From: Eric Lease Morgan <emorgan_at_nyob>
Date: Wed, 30 Oct 2013 09:00:07 -0400
To: CODE4LIB_at_LISTSERV.ND.EDU

On Oct 15, 2013, at 10:44 AM, Eric Lease Morgan <emorgan_at_nd.edu> wrote:

> For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8

On Oct 14, 2013, at 7:56 AM, Nicolas Franck <Nicolas.Franck_at_UGENT.BE> wrote:

> Could this also be done by Apache Tika? Or do I miss a crucial point?
> 
> http://tika.apache.org/1.4/gettingstarted.html

To a large degree I have replaced the text extraction routine in my PDF2TXT script with Tika, which allows the tool to read a much wider range of document types (PDF, Word, Mac Pages, maybe PowerPoint, etc.). Thank you, Nicolas. I have also created the barest of Git repositories to host the (Perl) code:

  * PDF2TXT - http://bit.ly/1bJRyh8
  * Git repository - https://github.com/ericleasemorgan/pdf2txt
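
For anyone curious what handing extraction off to Tika can look like, here is a minimal sketch in Python (not the Perl in the repository, and not necessarily how PDF2TXT calls Tika). It shells out to Tika's command-line app with its --text option. The jar file name and the function names are assumptions made for illustration, and it presumes java is on the PATH.

```python
import subprocess


def build_tika_command(jar_path, document_path):
    """Return the argument list that runs the Tika app in plain-text mode."""
    # jar_path is hypothetical; point it at wherever the tika-app jar lives.
    return ["java", "-jar", jar_path, "--text", document_path]


def extract_text(jar_path, document_path):
    """Run Tika on one document and return the extracted plain text."""
    result = subprocess.run(
        build_tika_command(jar_path, document_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if Tika exits with an error
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")


# Example call (needs Java and a local Tika jar, so it is left commented out):
# text = extract_text("tika-app-1.4.jar", "report.pdf")
```

Because Tika detects the file type itself, the same call should work for any format Tika supports, not only PDF.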

Just a reminder: PDF2TXT extracts plain text from a file and does some rudimentary text mining against the result.
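
As a rough idea of what "rudimentary text mining" can mean, the sketch below counts the most frequent words in a string of extracted text. This is a Python illustration only; the particular measure, the tiny stop word list, and the function names are assumptions rather than a description of what PDF2TXT actually computes.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop word list; a real one would be longer.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def top_words(text, n=10):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    words = [word for word in tokenize(text) if word not in STOPWORDS]
    return Counter(words).most_common(n)


sample = "The cat and the hat. The cat sat."
print(top_words(sample, 1))
```

Feeding the plain text from any extractor into top_words gives a quick sense of what a document is about.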

--
Eric Lease Morgan
University of Notre Dame