Re: web scraping to train LLM

From: Deen, Julia <jdeen_at_nyob> Date: Fri, 26 Apr 2024 17:29:09 +0000 To: CODE4LIB_at_LISTS.CLIR.ORG

The Open Science Framework (OSF) has an API<https://developer.osf.io/#>. I'm not sure what their rules are for data mining, but it may be worth looking into!

Julia Deen (they/them)
Data Services Librarian
Davis Family Library 209
jdeen_at_middlebury.edu
Schedule an appointment with me<https://middlebury.libcal.com/appointments/jdeen>
https://www.data-is-plural.com/

________________________________
From: Code for Libraries <CODE4LIB_at_LISTS.CLIR.ORG> on behalf of Joe Hourclé <oneiros_at_ANNOYING.ORG>
Sent: Friday, April 26, 2024 11:15 AM
To: CODE4LIB_at_LISTS.CLIR.ORG <CODE4LIB_at_LISTS.CLIR.ORG>
Subject: Re: [CODE4LIB] web scraping to train LLM

>
> On Apr 26, 2024, at 9:36 AM, Pino, Janine <0000013e2b94d7f7-dmarc-request_at_lists.clir.org> wrote:
>
> Hello,
>
> Does anyone have experience with web scraping publications to train an LLM? One of our researchers is looking for a good source on condensed matter and materials science. They've tried arXiv but couldn't find enough publications specifically on materials science as a subcategory. They were hoping for about 400,000 publications.

You might not need to do any scraping.  Searching for “materials science corpus” led me to:

https://www.nature.com/articles/s41524-022-00784-w

I’m not sure exactly what Nature’s rules are for this sort of work, but authors of science articles published there have to freely share their data.

(They might not be able to share the whole thing, depending on what they agreed to when getting access to their training corpus, but any open publications should be fair game)

-Joe

(Not affiliated)