Re: scraping or extracting structured data from a pdf

From: Hammer, Erich F <erich_at_nyob>
Date: Thu, 12 May 2022 19:44:22 +0000
To: CODE4LIB_at_LISTS.CLIR.ORG

Danielle,

.DOCX files are just a zipped collection of XML and image files.  You can see this by changing the extension on a copy of the file to .zip and then exploring the contents.  The body text lives in word/document.xml, so it should be possible to parse the entries out of the XML file(s) and build a structure from them.
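
As a rough illustration of that idea, here is a minimal Python sketch using only the standard library. It opens the .docx as a zip archive, reads word/document.xml, and returns the text of each non-empty paragraph. The function name docx_paragraphs is made up for this example, and the sketch assumes every citation and abstract sits in its own paragraph, which may not match the actual document.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return the text of each non-empty paragraph in a .docx file.

    `source` may be a file path or a binary file-like object.
    (Hypothetical helper for illustration, not part of any library.)
    """
    # A .docx is a zip archive; the body text is in word/document.xml.
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # <w:p> is a paragraph; its visible text is split across <w:t> runs.
    for para in root.iter(W_NS + "p"):
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

Calling docx_paragraphs("bibliography.docx") (the file name is a placeholder) would give a flat list of paragraph strings.  Grouping those into citation/abstract pairs and writing them out with the csv module would be the next step, and how to do that depends on how consistently the entries are laid out.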

Erich

On Thursday, May 12, 2022 at 14:39, Danielle Reay eloquently inscribed:

> Hello,
> 
> We have a faculty member looking to create a dataset from an annotated
> bibliography she compiled. Right now it exists as a word file and as a
> pdf. The entries are relatively structured with a citation and an
> abstract, but the document is about 150 pages long with multiple entries
> per page. Rather than manually copy and paste everything to create the
> spreadsheet/csv, I wanted to ask for suggestions or approaches to doing
> this by either scraping or extracting structured data from the pdf.
> Thanks very much in advance!
> 
> Danielle Reay
> 
> Digital Scholarship Technology Manager
> Drew University