Re: embeddings to query full text

From: Tara Calishain <researchbuzz_at_nyob>
Date: Fri, 18 Jul 2025 12:41:22 -0400
To: CODE4LIB_at_LISTS.CLIR.ORG

I have zero experience but I love this idea. I have been thinking a lot
lately about "atomizing" topics into clouds of contextual items for the
purpose of exploring/assembling with different search methods, and your
breaking things down into sentences fits right in.

On Fri, Jul 18, 2025 at 12:35 PM Eric Lease Morgan <
00000107b9c961ae-dmarc-request_at_lists.clir.org> wrote:

> To what degree have people here explored the use of embeddings to query
> full text? I have done some work in this regard, and I have found the
> results to be very informative.
>
> Kinda sorta, I want to address questions of a corpus and get back answers.
> For example, given Jane Austen's Emma, I want to know, "Who is Emma?"
> Alternatively, I want to create a corpus of books on a given topic -- such
> as epistemology -- and ask the question, "What is knowledge?"
>
> To address such things, I have created a system that:
>
>   1. extracts all the sentences from each item in a given corpus
>   2. saves the sentences as records in a database
>   3. loops through each sentence, vectorizes ("indexes") it,
>      and saves the result back to the database
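
For anyone wanting to try the three indexing steps above, here is a minimal
Python sketch. It is not the implementation described in the message, which
is not shown. The table layout, the function names, the naive sentence
splitter, and especially embed(), a toy hashed bag-of-words stand-in for a
real embedding model, are all assumptions made for illustration.

```python
import json
import re
import sqlite3
import zlib

DIM = 64  # hypothetical vector size for the toy embedder


def split_sentences(text):
    """Naive splitter: break after ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def embed(sentence):
    """Toy stand-in for a real embedding model: hashed bag of words."""
    vector = [0.0] * DIM
    for token in re.findall(r"[a-z']+", sentence.lower()):
        vector[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    return vector


def index_corpus(connection, items):
    """Index a corpus; items maps an item name to its full text."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sentences ("
        "id INTEGER PRIMARY KEY, item TEXT, sentence TEXT, vector TEXT)"
    )
    # steps 1 and 2: extract the sentences and save them as records
    for item, text in items.items():
        for sentence in split_sentences(text):
            connection.execute(
                "INSERT INTO sentences (item, sentence) VALUES (?, ?)",
                (item, sentence),
            )
    # step 3: loop through each sentence, vectorize it, save the result
    rows = connection.execute(
        "SELECT id, sentence FROM sentences WHERE vector IS NULL"
    ).fetchall()
    for row_id, sentence in rows:
        connection.execute(
            "UPDATE sentences SET vector = ? WHERE id = ?",
            (json.dumps(embed(sentence)), row_id),
        )
    connection.commit()


connection = sqlite3.connect(":memory:")
sample = "Emma spoke for her. Emma could not forgive her. Poor little Emma!"
index_corpus(connection, {"emma": sample})
count = connection.execute("SELECT COUNT(*) FROM sentences").fetchone()[0]
```

Swapping embed() for a call to an actual embedding model would be the one
change that matters; the rest is plumbing.
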
>
> I can then:
>
>   1. garner a query
>   2. vectorize the query
>   3. search the database
>   4. return the N closest matching sentences
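
The four query steps above might look something like the following Python
sketch. Again, this is an illustration under assumptions, not the code behind
search.sh: embed() is the same toy hashed bag-of-words stand-in for a real
embedding model, cosine similarity is assumed as the closeness measure, and
the sentences are vectorized on the fly here, whereas the system described
would read stored vectors from its database.

```python
import math
import re
import zlib

DIM = 64  # hypothetical vector size for the toy embedder


def embed(text):
    """Toy stand-in for a real embedding model: hashed bag of words."""
    vector = [0.0] * DIM
    for token in re.findall(r"[a-z']+", text.lower()):
        vector[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    return vector


def cosine(a, b):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, sentences, n):
    """Return the n sentences closest to the query, best match first."""
    query_vector = embed(query)  # step 2: vectorize the query
    ranked = sorted(             # step 3: compare against every sentence
        sentences,
        key=lambda sentence: cosine(query_vector, embed(sentence)),
        reverse=True,
    )
    return ranked[:n]            # step 4: keep the N closest matches


sentences = [
    "Emma spoke for her.",
    "The weather was fine.",
    "Poor little Emma!",
]
# step 1: garner a query, then join the matches into one paragraph
paragraph = " ".join(search("emma", sentences, 2))
```

Comparing the query against every sentence is fine for a single book; a
large corpus would likely want a vector index instead of a full scan.
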
>
> The result is a paragraph N sentences long, and now I use any combination
> of the following to make sense of -- read and understand -- the results:
>
>   1. consume the paragraph using the traditional reading process
>   2. reformat the paragraph into smaller paragraphs, which is akin
>      to data science clustering
>   3. apply a large-language model to summarize the paragraph
>   4. apply retrieval-augmented generation (RAG) to the results and
>      ask a specific question
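
Option 2 in the list above, reformatting the paragraph into smaller
paragraphs, could be sketched in Python as a greedy grouping. The message
does not say which clustering method is used, so this is only one possible
reading: each sentence joins the first group whose opening sentence it
resembles closely enough, otherwise it starts a new group. The word-overlap
(Jaccard) similarity, the 0.4 threshold, and the function names are all
assumptions; a real system would presumably compare the stored embeddings.

```python
import re


def tokens(sentence):
    """Lowercased set of the words in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def similarity(a, b):
    """Jaccard word overlap, a simple stand-in for embedding similarity."""
    tokens_a, tokens_b = tokens(a), tokens(b)
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def group_sentences(sentences, threshold=0.4):
    """Greedily assign each sentence to the first sufficiently similar group."""
    groups = []
    for sentence in sentences:
        for group in groups:
            # compare against the sentence that opened the group
            if similarity(group[0], sentence) >= threshold:
                group.append(sentence)
                break
        else:
            groups.append([sentence])
    return groups


sentences = [
    "Emma could not forgive her.",
    "Emma could not doubt.",
    "Poor little Emma!",
]
groups = group_sentences(sentences)
paragraphs = [" ".join(group) for group in groups]
```

With these three sentences the first two share enough words to land in one
paragraph, and the third stands alone.
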
>
> The whole thing has been a whole lot of fun. For example, here is an
> abbreviated interaction I had with my system regarding Emma:
>
>   # search Emma, and return 16 sentences closest to the query "emma"
>   $ ./bin/search.sh emma emma 16
>
>   With all dear Emma's little faults, she is an excellent creature.
>   I have a very sincere interest in Emma.  Emma will be happy to
>   entertain you.  Emma spoke for her.  Emma could not forgive her.
>   repeated Emma.  No more is Emma.  Poor little Emma!  " Emma had
>   done.  So Emma thought, at least.  (turning to Emma.) Emma was in
>   no danger of forgetting.  " Emma seriously hoped she would.  "
>   Emma was most sincerely interested.  " Emma could say no more.  "
>   Emma could not doubt.
>
>
>   # use an LLM to summarize the result
>   $ ./bin/summarize.sh
>
>   Overall, this passage highlights the complexities of human
>   relationships and the importance of sincerity and genuine
>   interest in building meaningful connections with others.
>
>
>   # use the result as the content for a RAG query
>   $ ./bin/elaborate.sh 'who is emma'
>
>   Based on the quotes provided, it seems that Emma is a unique and
>   fascinating individual. Here are some possible characteristics of
>   Emma:
>
>     1. caring and empathetic
>     2. interested in others
>     3. willing to entertain
>     4. passionate
>     5. forgiving
>     6. memorable
>     7. reflective
>     8. optimistic
>     9. polite
>
>   Of course, these are just some possible interpretations based on
>   the given quotes. The true nature of Emma may be much more
>   complex and multifaceted!
>
>
> While I do not assert the results are correct, I do assert the results are
> more than plausible. They are excellent pieces of food for thought. They
> are hints and pointers for further investigation.
>
> I have used this system to read all sorts of things on topics like
> philosophy, science, religion, government, and medicine. I have used this
> system to read, understand, and introduce myself to Jung, Marx, Plato,
> Twain, and Locke. Through the process I have learned of different
> definitions of knowledge, the many forms of justice, and how the definition
> of art has changed over time.
>
> Now, imagine this. Imagine all the books in your library have been
> digitized. Imagine each book is associated with a database, and the
> database is a list of each sentence in the book. Now imagine querying the
> book and getting back all the sentences -- not page numbers -- matching the
> query. In my mind, such a thing is very much like a back-of-the-book index
> but taken to the next level.
>
> Finally, I do not advocate this sort of thing as a replacement for
> traditional reading. Just like any tool, it can be used improperly. On the
> other hand, it could address the problem of information overload. I can
> just hear students saying, "I have done the most correct bibliographic
> database search, and I have identified two hundred relevant articles on my
> topic. How do I read them!?"
>
> What experiences do y'all have with this sort of technology, and to what
> degree do you believe it is something feasible for libraries to implement?
>
> --
> Eric Morgan <emorgan_at_nd.edu>
>