Re: indexing pdf files

From: Mark A. Matienzo <mark_at_nyob>
Date: Tue, 15 Sep 2009 09:56:48 -0400
To: CODE4LIB_at_LISTSERV.ND.EDU

Eric,

>  5. Use pdftotext to extract the OCRed text
>    from the PDF and index it along with
>    the MyLibrary metadata using Solr. [3, 4]
>

Have you considered using Solr's ExtractingRequestHandler [1] for the
PDFs? We're using it at NYPL with pretty great success.
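
For illustration, here is a rough sketch of what posting a PDF to the
handler might look like from Python. It is not our production code. It
assumes the handler is mapped at /update/extract (the path used in the
wiki examples), that your schema's unique key is "id", and that the
other metadata fields you pass exist in your schema. The function names
and the localhost URL are placeholders of mine.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, doc_id, literals=None, commit=True):
    """Build the request URL for Solr's ExtractingRequestHandler.

    Assumes the handler is registered at /update/extract and that the
    unique key field is named "id". Each entry in `literals` is sent as
    a literal.<field> parameter, which stores metadata alongside the
    text extracted from the PDF.
    """
    params = {"literal.id": doc_id}
    for field, value in (literals or {}).items():
        params["literal." + field] = value
    if commit:
        params["commit"] = "true"
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def index_pdf(solr_base, pdf_path, doc_id, literals=None):
    """POST the raw PDF bytes to Solr and return the HTTP status code."""
    with open(pdf_path, "rb") as fh:
        body = fh.read()
    req = Request(
        build_extract_url(solr_base, doc_id, literals),
        data=body,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
    with urlopen(req) as resp:
        return resp.status
```

Something like index_pdf("http://localhost:8983/solr", "article.pdf",
"doc1", {"title": "Some title"}) would then send the file and its
metadata in one request, so the separate text-extraction step goes away.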

[1] http://wiki.apache.org/solr/ExtractingRequestHandler

Mark A. Matienzo
Applications Developer, Digital Experience Group
The New York Public Library