Re: PDF->text extraction

From: Bill Janssen <janssen_at_nyob>
Date: Tue, 21 Jun 2011 12:16:31 PDT
To: CODE4LIB_at_LISTSERV.ND.EDU

Boheemen, Peter van <Peter.vanBoheemen_at_WUR.NL> wrote:

> The most used open source software for this (and many other mime
> types) is tika: http://tika.apache.org/

While I'm sure it's widely used, it's also relatively immature.  For
PDF, it just punts to PDFBox (which is also relatively immature).

The most widely used commercial package for extracting text from PDF,
which does an excellent job, is probably TET, from pdflib.com.  TET has
lots of plug-ins for various contexts.

Bill

> ________________________________________
> From: Code for Libraries [CODE4LIB_at_LISTSERV.ND.EDU] on behalf of Bill Janssen [janssen_at_PARC.COM]
> Sent: Tuesday, 21 June 2011 19:19
> To: CODE4LIB_at_LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] PDF->text extraction
> 
> Owen Stephens <owen_at_OSTEPHENS.COM> wrote:
> 
> > The CORE project at The Open University in the UK is doing some work
> > on finding similarity between papers in institutional repositories
> > (see http://core-project.kmi.open.ac.uk/ for more info).  The first
> > step in the process is extracting text from the (mainly) PDF
> > documents harvested from repositories.
> >
> > We've tried iText but had issues with quality.
> > We moved to PDFBox but are having performance issues.
> >
> > Any other suggestions/experience?
> 
> UpLib uses xpdf's pdftotext, which works well.  There's also code in
> UpLib to find similarities between papers :-).
> 
> Bill
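
For anyone who wants to try the pdftotext route mentioned above, here is
a minimal sketch of calling it from Python.  It assumes the pdftotext
binary (from xpdf or a compatible build) is installed and on the PATH.
The function names build_pdftotext_command and extract_text are made up
for illustration and are not taken from UpLib, which may well invoke
pdftotext differently.

```python
import shutil
import subprocess
import sys


def build_pdftotext_command(pdf_path, layout=False):
    """Build the argv list for a pdftotext call that writes UTF-8 text to stdout.

    Hypothetical helper for illustration only.
    """
    command = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # -layout asks pdftotext to preserve the physical page layout.
        command.append("-layout")
    # A trailing "-" as the output file sends the text to stdout.
    command += [str(pdf_path), "-"]
    return command


def extract_text(pdf_path, layout=False, timeout=60):
    """Run pdftotext on one PDF and return the extracted text as a string.

    Assumes pdftotext is installed; raises RuntimeError if it is not found.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,       # raise CalledProcessError on a non-zero exit status
        timeout=timeout,  # guard against PDFs that take very long to process
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Usage: python extract.py paper.pdf
    print(extract_text(sys.argv[1]))
```

Running one process per document keeps a malformed PDF from taking down
a whole harvesting run, at the cost of some process start-up overhead.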