Re: pdf2txt

From: Robert Haschart <rh9ec_at_nyob>
Date: Wed, 16 Oct 2013 10:56:52 -0400
To: CODE4LIB_at_LISTSERV.ND.EDU

On 10/15/2013 12:25 PM, Eric Lease Morgan wrote:
> On Oct 14, 2013, at 4:49 PM, Robert Haschart <rh9ec_at_VIRGINIA.EDU> wrote:
>
>>> For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8
>> Although, based on some subsequent messages where you mention tesseract,
>> maybe I misunderstood and your tool only handles PDFs that have already
>> been OCR'ed, which would explain why the second document (which only
>> contains page images) fails.
> Robert, that's correct. As of right now the document needs to have been previously OCRed. --Eric

The abstract extraction routine I have been working on does use
tesseract internally to run OCR when it encounters a document that
has no usable full text. I agree that tesseract is not that easy
to install, especially if (as in my case) you do not have root/sudo
access to the machine. Since I installed tesseract quite recently,
perhaps my experience can be helpful to you.
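
The "OCR only when there is no usable full text" fallback described above can be sketched roughly as follows. This is a minimal illustration, not the actual abstract extraction routine: the function names, the 200-character threshold, and the letter-ratio test are all assumptions made for the example, and it assumes the poppler tools pdftotext and pdftoppm plus tesseract are on the PATH.

```python
import subprocess
import tempfile
from pathlib import Path


def has_usable_text(text, min_chars=200, min_alpha_ratio=0.5):
    """Hypothetical heuristic: enough non-space characters, mostly letters."""
    stripped = "".join(text.split())
    if len(stripped) < min_chars:
        return False
    letters = sum(ch.isalpha() for ch in stripped)
    return letters / len(stripped) >= min_alpha_ratio


def embedded_text(pdf_path):
    """Read the PDF's existing text layer; '-' makes pdftotext write to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_text(pdf_path, dpi=300):
    """Render each page to PNG with pdftoppm, then OCR each page with tesseract."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)],
            check=True,
        )
        pages = []
        for image in sorted(Path(tmp).glob("page-*.png")):
            base = image.with_suffix("")
            # tesseract writes its result to <base>.txt
            subprocess.run(["tesseract", str(image), str(base)], check=True)
            pages.append(base.with_suffix(".txt").read_text(encoding="utf-8"))
        return "\n".join(pages)


def extract_text(pdf_path, extract=embedded_text, ocr=ocr_text):
    """Prefer the embedded text layer; fall back to OCR if it looks unusable."""
    text = extract(pdf_path)
    if has_usable_text(text):
        return text
    return ocr(pdf_path)
```

Calling extract_text("scan.pdf") (a hypothetical file name) would run OCR only if pdftotext returns little or no real text. The extract and ocr parameters exist so the decision logic can be tried without either tool installed.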

-Bob Haschart