Re: PDF->text extraction

From: Bill Janssen <janssen_at_nyob>
Date: Tue, 21 Jun 2011 10:19:04 PDT
To: CODE4LIB_at_LISTSERV.ND.EDU

Owen Stephens <owen_at_OSTEPHENS.COM> wrote:

> The CORE project at The Open University in the UK is doing some work on finding similarity between papers in institutional repositories (see http://core-project.kmi.open.ac.uk/ for more info). The first step in the process is extracting text from the (mainly) PDF documents harvested from repositories.
> 
> We've tried iText but had issues with quality.
> We moved to PDFBox but are having performance issues.
> 
> Any other suggestions/experience?

UpLib uses xpdf's pdftotext, which works well.  There's also code in
UpLib to find similarities between papers :-).
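
For anyone who wants to try pdftotext from a script, here is a minimal
sketch in Python. It is not UpLib's actual code. It assumes a pdftotext
binary (from xpdf or poppler) is on the PATH; the function names, the
60-second timeout, and the choice to return None on failure are all my
own illustrative assumptions.

```python
import subprocess


def build_command(pdf_path, binary="pdftotext", layout=False):
    """Build the pdftotext argument list; a text-file of '-' means stdout."""
    cmd = [binary, "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [str(pdf_path), "-"]
    return cmd


def extract_text(pdf_path, timeout=60, run=subprocess.run):
    """Return the text of one PDF, or None if extraction fails.

    `run` is injectable (hypothetical convenience) so the function can be
    exercised without pdftotext installed.
    """
    try:
        result = run(
            build_command(pdf_path),
            capture_output=True,
            timeout=timeout,  # assumed limit; tune for large documents
            check=True,       # raise on a non-zero exit status
        )
    except (subprocess.CalledProcessError,
            subprocess.TimeoutExpired,
            FileNotFoundError):
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

Running one short-lived process per document, with a timeout, keeps a
single damaged PDF from stalling a whole harvest; failed files can be
logged and retried separately.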

Bill