web scraping to train LLM

From: Pino, Janine <0000013e2b94d7f7-dmarc-request_at_nyob>
Date: Fri, 26 Apr 2024 13:39:44 +0000
To: CODE4LIB_at_LISTS.CLIR.ORG

Hello,

Does anyone have experience with web scraping publications to train an LLM? One of our researchers is looking for a good source of publications on condensed matter and materials science. They've tried arXiv but couldn't find enough publications specifically in the materials science subcategory. They were hoping for about 400,000 publications.

Thanks,

Janine Pino (she/her)
Data Librarian
Research Library & Information Services
Office of Institutional Planning
Oak Ridge National Laboratory
Email: pinojc_at_ornl.gov
Phone: 865.341.2465
Received on Fri Apr 26 2024 - 08:58:56 EDT