Re: web scraping to train LLM

From: Pikas, Christina K. <Christina.Pikas_at_nyob>
Date: Fri, 26 Apr 2024 14:02:52 +0000
To: CODE4LIB_at_LISTS.CLIR.ORG
There be dragons! In particular, don't mention "scraping" anywhere within earshot of A. C. S. Open collections are probably your best bet: maybe something from NIST for reference data, and then things like Semantic Scholar.

Many/most publishers have hastily constructed "NO AI" rules ... which forbid everything, even things which are clearly fair use. 
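
For sizing up an open collection before committing to it, here is a minimal sketch (not a tested harvester) that asks the public arXiv API how many records a category holds. It assumes the arXiv category ID for materials science is cond-mat.mtrl-sci and that the Atom response carries an opensearch:totalResults element; the function names are made up for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Public arXiv API endpoint; returns an Atom feed.
ARXIV_API = "http://export.arxiv.org/api/query"
# Namespace of the OpenSearch elements embedded in that feed.
OPENSEARCH_NS = "{http://a9.com/-/spec/opensearch/1.1/}"


def build_query_url(category, start=0, max_results=1):
    """Build a query URL for one arXiv category (hypothetical helper)."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def parse_total_results(atom_xml):
    """Read the total hit count out of an Atom response (str or bytes)."""
    root = ET.fromstring(atom_xml)
    node = root.find(OPENSEARCH_NS + "totalResults")
    if node is None or node.text is None:
        raise ValueError("no opensearch:totalResults element in response")
    return int(node.text)


def count_category(category):
    """Fetch one record for the category and return the reported total.

    Needs network access; nothing in this block calls it automatically.
    """
    url = build_query_url(category)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_total_results(resp.read())
```

Calling count_category("cond-mat.mtrl-sci") once while online would show whether that category alone gets anywhere near the 400,000 target. Anything beyond a one-off count should follow arXiv's terms of use and pause between requests.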

-----Original Message-----
From: Code for Libraries <CODE4LIB_at_LISTS.CLIR.ORG> On Behalf Of Pino, Janine
Sent: Friday, April 26, 2024 9:40 AM
To: CODE4LIB_at_LISTS.CLIR.ORG
Subject: [EXT] [CODE4LIB] web scraping to train LLM

Hello,

Does anyone have experience with web scraping publications to train an LLM? One of our researchers is looking for a good source on condensed matter and materials science. They've tried arXiv but couldn't find enough publications specifically on materials science as a subcategory. They were hoping for about 400,000 publications.

Thanks,

Janine Pino (she/her)
Data Librarian
Research Library & Information Services
Office of Institutional Planning
Oak Ridge National Laboratory
Email: pinojc_at_ornl.gov
Phone: 865.341.2465
Received on Fri Apr 26 2024 - 09:21:59 EDT