Re: Desiring Advice for Converting OCR Text into Metadata and/or a Database

From: Owen Stephens <owen_at_nyob> Date: Thu, 18 Jun 2015 20:34:29 +0100 To: CODE4LIB_at_LISTSERV.ND.EDU

It may depend on the format of the PDF, but I’ve used the ScraperWiki Python module’s ‘pdftoxml’ function to extract text data from PDFs in the past. There is a write-up (not by me) at http://schoolofdata.org/2013/08/16/scraping-pdfs-with-python-and-the-scraperwiki-module/, and an example of how I’ve used it at https://github.com/ostephens/british_library_directory_of_library_codes/blob/master/scraper.py
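
A minimal sketch of what the step after the PDF-to-XML conversion might look like, using only the Python standard library. It is not taken from the scraper linked above. It assumes the XML consists of <page> elements containing positioned <text> elements (the shape pdftohtml-style converters tend to produce), and that each bibliography entry sits on one line as "Surname, Forename. Year. Title." The sample data, the ENTRY pattern and the function names are all made up for illustration; a real bibliography will almost certainly need a different pattern.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like the XML a PDF-to-XML conversion might return.
SAMPLE_XML = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="400" height="12" font="0">Smith, John. 1998. A history of cataloguing.</text>
<text top="120" left="50" width="400" height="12" font="0">Jones, Mary. 2004. <i>Metadata in practice</i>.</text>
<text top="140" left="50" width="400" height="12" font="0">Page 12</text>
</page>
</pdf2xml>"""

# Assumed entry layout: "Surname, Forename. Year. Title."
# Authors given with initials ("Smith, J.") would NOT match this pattern.
ENTRY = re.compile(
    r"^(?P<author>[^.]+)\.\s+(?P<year>\d{4})\.\s+(?P<title>.+?)\.?$"
)


def extract_lines(xml_string):
    """Return the text of every non-empty <text> element, in document order."""
    root = ET.fromstring(xml_string)
    lines = []
    for page in root.iter("page"):
        for element in page.iter("text"):
            # itertext() also picks up text nested in <i> or <b> children.
            text = "".join(element.itertext()).strip()
            if text:
                lines.append(text)
    return lines


def parse_lines(lines):
    """Split lines into parsed records and lines that did not match."""
    records, unmatched = [], []
    for line in lines:
        match = ENTRY.match(line)
        if match:
            records.append(match.groupdict())
        else:
            unmatched.append(line)
    return records, unmatched


records, unmatched = parse_lines(extract_lines(SAMPLE_XML))
for record in records:
    print(record)
print("Needs a manual look:", unmatched)
```

Keeping the unmatched lines rather than dropping them means the hand-checking is limited to the entries the pattern could not handle, and the records that did match can then be written out to a database or an XML file.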

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: owen_at_ostephens.com
Telephone: 0121 288 6936

> On 18 Jun 2015, at 17:02, Matt Sherman <matt.r.sherman_at_GMAIL.COM> wrote:
> 
> Hi Code4Libbers,
> 
> I am working with a colleague on a side project which involves some scanned
> bibliographies and making them more web searchable/sortable/browse-able.
> While I am quite familiar with the metadata and organization aspects we
> need, I am at a bit of a loss on how to automate the process of putting
> the bibliography in a more structured format so that we can avoid going
> through hundreds of pages by hand.  I am pretty sure regular expressions
> are needed, but I have not had an instance where I need to automate
> extracting data from one file type (PDF OCR or text extracted to Word doc)
> and place it into another (either a database or an XML file) with some
> enrichment.  I would appreciate any suggestions for approaches or tools to
> look into.  Thanks for any help/thoughts people can give.
> 
> Matt Sherman