Re: web scraping to train LLM

From: Géraldine Anne Geoffroy <000001639a7d7850-dmarc-request_at_nyob>
Date: Sat, 27 Apr 2024 08:35:34 +0000
To: CODE4LIB_at_LISTS.CLIR.ORG

Hello,

Maybe you can take a look at OpenAlex<https://openalex.org/>, which offers a very broad, open, multidisciplinary knowledge base of bibliographic metadata on scholarly outputs, along with a robust, well-documented API. The entity-relationship model<https://help.openalex.org/how-it-works> behind the metadata catalog includes concept-type entities aligned with Wikidata concepts, which can help you build your corpus of metadata.
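
As a rough sketch of that workflow (not an official OpenAlex client), the Python below builds a query against the works endpoint filtered by one concept and follows cursor paging to collect DOIs. The concept ID is a placeholder you would look up in OpenAlex yourself; the function names and the `max_pages` cap are my own assumptions for illustration, and the response fields used (`results`, `doi`, `meta.next_cursor`) should be checked against the current API documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OPENALEX_WORKS = "https://api.openalex.org/works"


def build_works_url(concept_id, cursor="*", per_page=200, mailto=None):
    """Build a works query filtered by one concept, using cursor paging."""
    params = {
        "filter": f"concepts.id:{concept_id}",
        "per-page": per_page,  # 200 is the documented maximum page size
        "cursor": cursor,      # "*" requests the first page
    }
    if mailto:
        # Optional contact address, which OpenAlex asks polite clients to send.
        params["mailto"] = mailto
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


def parse_page(page):
    """Return (DOIs on this page, cursor for the next page or None)."""
    dois = [work["doi"] for work in page.get("results", []) if work.get("doi")]
    next_cursor = page.get("meta", {}).get("next_cursor")
    return dois, next_cursor


def fetch_json(url):
    """Fetch one page over HTTP and decode the JSON body."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_dois(concept_id, fetch=fetch_json, max_pages=5):
    """Yield DOIs page by page; max_pages is a hypothetical safety cap."""
    cursor = "*"
    for _ in range(max_pages):
        dois, cursor = parse_page(fetch(build_works_url(concept_id, cursor)))
        yield from dois
        if not cursor:
            break
```

Passing a different `fetch` callable makes it easy to try the paging logic on saved responses before hitting the live API.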

Depending on what you want to train an LLM for, and if you need the full text, you can then use each DOI to locate and scrape the full text online.
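
For the DOI step, a minimal sketch, assuming the DOIs arrive in mixed forms (bare, with a `doi:` prefix, or already as a URL): it normalizes them and builds the https://doi.org/ resolver link that redirects to the publisher's landing page. The helper names are hypothetical, and whether you may scrape the page behind each link still depends on that publisher's terms.

```python
from urllib.parse import quote

# Prefixes commonly seen in front of a DOI in metadata records.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Strip a known prefix and lowercase (DOIs are case-insensitive)."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in _DOI_PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.lower()


def doi_to_url(raw):
    """Build the resolver URL, percent-encoding unusual characters."""
    return "https://doi.org/" + quote(normalize_doi(raw), safe="/")


# Example with the article mentioned further down this thread.
print(doi_to_url("doi:10.1038/s41524-022-00784-w"))
```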

Géraldine Geoffroy

Géraldine Geoffroy
Bibliothèque de l'EPFL
Rolex Learning Center
Station 20
1015 Lausanne
go.epfl.ch/bibliotheque<https://www.epfl.ch/campus/library/fr/bibliotheque/>
+41 21 693 87 34
geraldine.geoffroy_at_epfl.ch<mailto:geraldine.geoffroy_at_epfl.ch>
Follow @EPFLlibrary

________________________________
From: Code for Libraries <CODE4LIB_at_LISTS.CLIR.ORG> on behalf of Deen, Julia <jdeen_at_MIDDLEBURY.EDU>
Sent: Friday, 26 April 2024 19:29
To: CODE4LIB_at_LISTS.CLIR.ORG
Subject: Re: [CODE4LIB] web scraping to train LLM

The Open Science Framework (OSF) has an API<https://developer.osf.io/#> - not sure what their rules are for data mining, but it may be worth looking into!

Julia Deen (they/them)
Data Services Librarian
Davis Family Library 209
jdeen_at_middlebury.edu
Schedule an appointment with me<https://middlebury.libcal.com/appointments/jdeen>
https://www.data-is-plural.com/

________________________________
From: Code for Libraries <CODE4LIB_at_LISTS.CLIR.ORG> on behalf of Joe Hourclé <oneiros_at_ANNOYING.ORG>
Sent: Friday, April 26, 2024 11:15 AM
To: CODE4LIB_at_LISTS.CLIR.ORG <CODE4LIB_at_LISTS.CLIR.ORG>
Subject: Re: [CODE4LIB] web scraping to train LLM

>
> On Apr 26, 2024, at 9:36 AM, Pino, Janine <0000013e2b94d7f7-dmarc-request_at_lists.clir.org> wrote:
>
> Hello,
>
> Does anyone have experience with web scraping publications to train LLM? One of our researchers is looking for a good source on condensed matter and materials science. They've tried arXiv but couldn't find enough publications specifically on materials science as a subcategory. They were hoping for about 400,000 publications.

You might not need to do any scraping.  Searching for “materials science corpus” led me to:

https://www.nature.com/articles/s41524-022-00784-w

I’m not sure exactly what Nature’s rules are for this sort of work, but for science articles published there, authors have to share their data freely.

(They might not be able to share the whole thing, depending on what they agreed to when getting access to their training corpus, but any open publications should be fair game)

-Joe

(Not affiliated)