Re: scraping or extracting structured data from a pdf

From: Kevin Hawkins <kevin.s.hawkins_at_nyob>
Date: Thu, 12 May 2022 22:12:34 -0500
To: CODE4LIB_at_LISTS.CLIR.ORG

And for going beyond the bibliographic citations to include abstracts as 
well, https://grobid.readthedocs.io/en/latest/ might be useful.  --Kevin
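
If the entries are regular enough, a few lines of scripting may also get most of the way to a spreadsheet before reaching for a dedicated tool. The following is only a minimal sketch in Python, not a tested workflow for this particular bibliography. It assumes (and these are assumptions, not facts about the document) that the Word file has been saved as plain text, that entries are separated by blank lines, and that the first line of each entry is the citation while the remaining lines are the abstract. The function names split_entries and entries_to_csv are hypothetical names chosen for illustration.

```python
import csv
import io
import re


def split_entries(text):
    """Split plain text into (citation, abstract) pairs.

    Assumption: entries are separated by one or more blank lines, and the
    first line of each entry is the citation; the rest is the abstract.
    """
    text = text.replace("\r\n", "\n")
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    entries = []
    for block in blocks:
        lines = [ln.strip() for ln in block.split("\n") if ln.strip()]
        citation = lines[0]
        # Re-join wrapped abstract lines into a single paragraph.
        abstract = " ".join(lines[1:])
        entries.append((citation, abstract))
    return entries


def entries_to_csv(entries):
    """Return the entries as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["citation", "abstract"])
    writer.writerows(entries)
    return buf.getvalue()
```

To use it, read the exported text file into a string, pass it to split_entries, and write the result of entries_to_csv to a .csv file. Entries that do not follow the assumed layout (for example, a citation that wraps onto a second line) would need manual cleanup or a more specific splitting rule.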

On 5/12/22 1:49 PM, Julia Bauder wrote:
> Hi, Danielle,
>
> Have you taken a look at https://text2bib.economics.utoronto.ca/ ? If it
> works for you, that's likely to be one of the easiest methods to convert
> the list into structured data.
>
> Best,
> Julia
>
> _____________________________________________________
> Julia Bauder
> Social Studies and Data Services Librarian
> Director, Data Analysis and Social Inquiry Lab
> Grinnell College Libraries
> 1111 6th Ave.
> Grinnell, IA 50112
>
> On Thu, May 12, 2022 at 1:40 PM Danielle Reay <dreay_at_drew.edu> wrote:
>
>> Hello,
>>
>> We have a faculty member looking to create a dataset from an annotated
>> bibliography she compiled. Right now it exists as a Word file and as a PDF.
>> The entries are relatively structured with a citation and an abstract, but
>> the document is about 150 pages long with multiple entries per page. Rather
>> than manually copy and paste everything to create the spreadsheet/csv, I
>> wanted to ask for suggestions or approaches to doing this by either
>> scraping or extracting structured data from the PDF. Thanks very much in
>> advance!
>>
>> Danielle Reay
>>
>> Digital Scholarship Technology Manager
>> Drew University
>>