Re: scraping or extracting structured data from a pdf

From: MJ Ray <mjr_at_nyob>
Date: Fri, 13 May 2022 10:47:03 +0100
To: CODE4LIB_at_LISTS.CLIR.ORG
On 12 May 2022 20:44:22 GMT+01:00, "Hammer, Erich F" <erich_at_ALBANY.EDU> wrote:
>Danielle,
>
>.DOCX files are just zipped collections of XML and image files.  You can see this by changing the extension on a copy of the file to .zip and then exploring it.  It should be possible to parse the data out of the XML file(s) and build a structure from it.

Yes, the key one is word/document.xml, but it is very noisy, and the
markup only seems to be semantic if the author used styles rather than
direct formatting such as bold and italics.
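
For what it's worth, here is a rough sketch of pulling paragraphs and
their style ids out of that file, assuming Python 3 and only the
standard library. The function name docx_paragraphs is just something
made up for illustration, and it ignores tables, headers, footnotes
and so on, so treat it as a starting point rather than a finished tool.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used throughout word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def docx_paragraphs(source):
    """Yield (style_id, text) for each paragraph in word/document.xml.

    source is a path or a binary file-like object holding a .docx file.
    style_id is None when the paragraph carries no explicit style,
    which is typical when the author used bold/italics by hand.
    (Hypothetical helper, not part of any library.)
    """
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    for p in root.iter(f"{{{W}}}p"):
        # The paragraph style, if any, lives at w:pPr/w:pStyle/@w:val
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        # Text is split across many w:t runs; join them back together
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        yield style_id, text
```

You would call it as something like
list(docx_paragraphs("report.docx")), where report.docx is a
placeholder name. The style ids you get back are whatever the
document's own styles define, so headings are only recognisable if
the author actually applied heading styles.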

--
MJR
https://www.software.coop
Received on Fri May 13 2022 - 05:42:40 EDT