Re: web scraping to train LLM

From: Amy Kirchhoff <000001646ec00051-dmarc-request_at_nyob>
Date: Sun, 28 Apr 2024 23:30:29 +0000
To: CODE4LIB_at_LISTS.CLIR.ORG

Janine, Jen,

I'm with Constellate and I'm happy to discuss what may be possible with the content in our care. As you might expect, it's complicated, but there may be some options.

It may be best for us to take a Constellate discussion off-list. Feel free to drop me a note or even put yourself right on my calendar. https://calendly.com/amy-kirchhoff/chat-with-amy-about-constellate-and-your-campus

Amy
________________________________
From: Code for Libraries <CODE4LIB_at_LISTS.CLIR.ORG> on behalf of Yu, Jen-chien <jyu_at_ILLINOIS.EDU>
Sent: Friday, April 26, 2024 11:15:52 AM
To: CODE4LIB_at_LISTS.CLIR.ORG <CODE4LIB_at_LISTS.CLIR.ORG>
Subject: Re: [CODE4LIB] web scraping to train LLM

Janine - I know this is not what you are asking for, but I'm wondering if tools like Constellate (https://constellate.org/) could help your researchers? I think sometimes we assume there must be "a lot" of publications on a topic of interest, and therefore LLMs must be able to discover something new. But what is "a lot"? "400,000" seems like a number that was decided by algorithms, but is it appropriate? And what can you really discover?

I attended a Constellate demo, and it uses JSTOR as its core corpus for text analysis. I don't think it would have a lot of science content, but it might be a good tool for testing ideas and hypotheses.

Jen


JEN-CHIEN YU
DIRECTOR OF LIBRARY ASSESSMENT

University of Illinois Urbana-Champaign
Library Administration
436 Library | M/C 522
Urbana, IL 61801
217-300-0400 | jyu_at_illinois.edu
www.library.illinois.edu



Under the Illinois Freedom of Information Act any written communication to or from university employees regarding university business is a public record and may be subject to public disclosure.



-----Original Message-----
From: Code for Libraries <CODE4LIB_at_LISTS.CLIR.ORG> On Behalf Of Abner, Kayla
Sent: Friday, April 26, 2024 10:03 AM
To: CODE4LIB_at_LISTS.CLIR.ORG
Subject: Re: [CODE4LIB] web scraping to train LLM

Pre-AI mania, vendors might share that data upon request for research. So you could ask WOS or Scopus, or check their text and data mining policies to see what steps they require to get the data. However, as others have mentioned, vendors have been very finicky about data mining since AI became such a hot topic.


----

Kayla Abner

(she/her)

Digital Scholarship Librarian

Digital Initiatives and Preservation

Library, Museums and Press

University of Delaware

kabner_at_udel.edu

Book time to meet with me <https://calendly.com/kabner-gx9j/consultation>



**The University of Delaware, a land grant institution, is located on land that was and continues to be vital to the web of life of the Nanticoke and Lenni-Lenape people. We express gratitude and honor the people who have inhabited, cultivated, and nourished this land for thousands of years, even after their attempted forced removal during the colonial era and early federal period. The University of Delaware also financially benefitted from the expropriation of Indigenous territories in the region colonially known as Montana. View the full Living Land Acknowledgement <https://sites.udel.edu/antiracism-initiative/committees/american-indian-and-indigenous-relations/living-land-acknowledgement/#Living_Land_Acknowledgement>.**

________________________________
From: Code for Libraries <CODE4LIB_at_LISTS.CLIR.ORG> on behalf of Pino, Janine <0000013e2b94d7f7-dmarc-request_at_LISTS.CLIR.ORG>
Sent: Friday, April 26, 2024 10:57 AM
To: CODE4LIB_at_LISTS.CLIR.ORG <CODE4LIB_at_LISTS.CLIR.ORG>
Subject: Re: [CODE4LIB] web scraping to train LLM

Yeah, I'm a little nervous about providing advice in this situation. I do not want to recommend Scopus or Web of Science; we've had vendor complaints about people going over the data limit. I am going to emphasize open data sources and crediting the data to be safe. They are using Beautiful Soup and APIs to get the data.

-----Original Message-----
From: Code for Libraries <CODE4LIB_at_LISTS.CLIR.ORG> On Behalf Of Pikas, Christina K.
Sent: Friday, April 26, 2024 10:03 AM
To: CODE4LIB_at_LISTS.CLIR.ORG
Subject: [EXTERNAL] Re: [CODE4LIB] web scraping to train LLM

There be dragons! In particular, don't mention "scraping" anywhere within distance of A. C. S. Open collections are probably your best bet. Maybe something from NIST for reference data and then things like Semantic Scholar.

Many/most publishers have hastily constructed "NO AI" rules ... which forbid everything, even things which are clearly fair use.

-----Original Message-----
From: Code for Libraries <CODE4LIB_at_LISTS.CLIR.ORG> On Behalf Of Pino, Janine
Sent: Friday, April 26, 2024 9:40 AM
To: CODE4LIB_at_LISTS.CLIR.ORG
Subject: [EXT] [CODE4LIB] web scraping to train LLM

Hello,

Does anyone have experience with web scraping publications to train an LLM? One of our researchers is looking for a good source on condensed matter and materials science. They've tried arXiv but couldn't find enough publications specifically on materials science as a subcategory. They were hoping for about 400,000 publications.

Thanks,

Janine Pino (she/her)
Data Librarian
Research Library & Information Services
Office of Institutional Planning
Oak Ridge National Laboratory
Email: pinojc_at_ornl.gov
Phone: 865.341.2465
Received on Sun Apr 28 2024 - 18:49:43 EDT