Re: embeddings to query full text

From: Eric Lease Morgan <00000107b9c961ae-dmarc-request_at_nyob>
Date: Tue, 22 Jul 2025 08:39:19 -0400
To: CODE4LIB_at_LISTS.CLIR.ORG

On Jul 21, 2025, at 5:57 PM, Wolfe, Erin <edw_at_ku.edu> wrote:

> I did a little bit of work in this direction last year, where I tokenized a text into sentences and used a BERT model to create embeddings for each sentence. Then I took a predefined large dictionary of related terms (i.e., all related to the same general topic) and embedded each of these terms. I then used a cosine similarity check to try to identify sentences that were related to the topic based on embedding similarity.
> 
> The results were interesting and often correct, but not nearly accurate enough to use them in a meaningful way. Granted, this was using a zero-shot untrained match (“bert-base-uncased”). Likely fine tuning this on a training set of data would have yielded better results. However, I ended up going a different route for this project that gave me more precise results, so I didn’t explore the embeddings approach much further.
> 
> It’s an interesting topic for discussion, though, and I think there’s definitely some promise there!
> 
> --
> Erin


Interesting! 
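
For what it's worth, a minimal sketch of the pipeline Erin describes might look something like the following. It assumes the sentence-transformers and nltk libraries (rather than raw bert-base-uncased), and the model name, sample text, topic terms, and similarity threshold are all illustrative:

  # a minimal sketch: tokenize a text into sentences, embed the
  # sentences and a set of topic terms, and flag sentences whose
  # best-matching term exceeds a similarity threshold
  from sentence_transformers import SentenceTransformer, util
  import nltk

  nltk.download("punkt", quiet=True)
  model = SentenceTransformer("all-MiniLM-L6-v2")

  # hypothetical content and lexicon; substitute your own
  text = ("The library extended its hours during finals week. "
          "Patrons asked how to renew books online. "
          "The weather was unseasonably warm.")
  topic_terms = ["circulation", "borrowing", "library services"]

  sentences = nltk.sent_tokenize(text)
  sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
  term_embeddings = model.encode(topic_terms, convert_to_tensor=True)

  # cosine similarity between every sentence and every term
  similarities = util.cos_sim(sentence_embeddings, term_embeddings)

  # report sentences apparently related to the topic
  THRESHOLD = 0.4
  for sentence, scores in zip(sentences, similarities):
      if scores.max().item() >= THRESHOLD:
          print(f"{scores.max().item():.2f}\t{sentence}")

As Erin suggests, fine-tuning or a stronger model would presumably sharpen the results.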

With my recent interest in lexicons, I thought about applying zero-shot classification to content. Here's how (a short sketch in Python follows the outline):

 1. articulate a lexicon
 2. create a collection of content (documents, paragraphs, sentences, etc.)
 3. use zero-shot classification to classify the content using the
    lexicon as the classification system
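
For example, the whole process can be expressed in just a few lines of Python. This sketch assumes Hugging Face's transformers library, and the model, lexicon, and sample question are merely illustrative:

  # a minimal sketch of steps #1-#3: classify a bit of content
  # using a lexicon as the classification system
  from transformers import pipeline

  classifier = pipeline("zero-shot-classification",
                        model="facebook/bart-large-mnli")

  # step #1: articulate a lexicon (these labels are hypothetical)
  lexicon = ["known-item search", "topical research",
             "citation help", "technical support"]

  # step #2: a bit of content; here, a single reference question
  question = "How do I find the impact factor of a journal?"

  # step #3: classify the content using the lexicon as the labels
  result = classifier(question, candidate_labels=lexicon)
  for label, score in zip(result["labels"], result["scores"]):
      print(f"{score:.2f}\t{label}")

Each label in the lexicon gets a score, and the content can then be assigned to the highest-scoring label (or labels).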

I have done this a few times, and most recently applied it to a set of 4,500 reference questions; working with a colleague, we classified the questions with the purpose of understanding the types of questions being asked. To some degree, the same process could be applied to title/abstract combinations from journal articles.

Like all things, the process was not perfect. That said, it was very insightful and, IMHO, can be seen as a supplement to more traditional analysis processes.

--
Eric Morgan <emorgan_at_nd.edu>
Received on Tue Jul 22 2025 - 08:38:15 EDT