Re: PDF->text extraction

From: Andreas Walker <andreas.walker_at_nyob>
Date: Tue, 21 Jun 2011 16:34:16 +0200
To: CODE4LIB_at_LISTSERV.ND.EDU

I'm using Docsplit (http://documentcloud.github.com/docsplit/) because of
its Ruby bindings. It falls back to OCR if it fails to extract the text,
but it also requires you to install several other (open source)
programs. Results seem fine to me so far.
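
The fallback idea (try direct text extraction first, and run OCR only
when that yields nothing usable) can be sketched roughly as below. This
is not Docsplit's code or API: the function name, the two callables, and
the min_chars threshold are all hypothetical, chosen only to illustrate
the pattern in plain Python.

```python
from typing import Callable, Tuple


def extract_with_ocr_fallback(
    pdf_path: str,
    extract_text: Callable[[str], str],  # hypothetical: direct text extraction
    run_ocr: Callable[[str], str],       # hypothetical: OCR over rendered pages
    min_chars: int = 1,                  # assumed threshold for "usable" text
) -> Tuple[str, str]:
    """Return (text, method), where method is "text" or "ocr"."""
    text = extract_text(pdf_path)
    # Scanned PDFs often have no text layer, so extraction returns
    # nothing but whitespace; only then pay the cost of OCR.
    if len(text.strip()) >= min_chars:
        return text, "text"
    return run_ocr(pdf_path), "ocr"


if __name__ == "__main__":
    # Stand-in extractors so the sketch runs without any PDF tooling.
    def fake_extract(path: str) -> str:
        return "" if path.startswith("scan") else "born-digital text"

    def fake_ocr(path: str) -> str:
        return "text recovered by OCR"

    print(extract_with_ocr_fallback("paper.pdf", fake_extract, fake_ocr))
    print(extract_with_ocr_fallback("scan.pdf", fake_extract, fake_ocr))
```

In practice the two callables would wrap whatever extraction and OCR
tools are installed; keeping them as parameters makes it easy to swap
tools and compare quality and speed.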

Best,
Andreas

On 21.06.2011 16:23, Owen Stephens wrote:
> The CORE project at The Open University in the UK is doing some work on finding similarity between papers in institutional repositories (see http://core-project.kmi.open.ac.uk/ for more info). The first step in the process is extracting text from the (mainly) PDF documents harvested from repositories.
>
> We've tried iText but had issues with quality.
> We moved to PDFBox but are having performance issues.
>
> Any other suggestions/experience?
>
> Thanks,
>
> Owen
>
> Owen Stephens
> Owen Stephens Consulting
> Web: http://www.ostephens.com
> Email: owen_at_ostephens.com
> Telephone: 0121 288 6936