Re: text mining [cataloging]

From: Eric Lease Morgan <emorgan_at_nyob>
Date: Fri, 13 May 2011 08:58:33 -0400
To: NGC4LIB_at_LISTSERV.ND.EDU
On May 11, 2011, Eric Lease Morgan wrote:

> Text mining, loosely defined, is a computerized process for extracting information from documents. Think of it as if it were a concordance on steroids. Text mining goes beyond search and enables people to do things such as count the number of words in a document and thus determine its relative length, calculate a document's various readability scores, list the most frequent n-grams, and graph where in a document n-grams occur. Text mining is only possible if one has the full text of a document.
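
To make the quoted definition concrete, here is a minimal sketch (in Python, for illustration only) of counting words and tabulating the most frequent n-grams. The file name is made up; any plain text will do:

  # count words and the most frequent bigrams in a plain-text file
  # tokenization here is deliberately naive; this is an illustration only
  import re
  from collections import Counter

  def ngrams(tokens, n=2):
      """Return the list of n-grams (as tuples) from a list of tokens."""
      return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

  with open('walden.txt') as handle:    # hypothetical file name
      words = re.findall(r"[a-z']+", handle.read().lower())

  print('number of words:', len(words))
  for bigram, count in Counter(ngrams(words, 2)).most_common(10):
      print(' '.join(bigram), count)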

As described previously -- http://bit.ly/l4Lsos -- text mining interfaces can be integrated into library "discovery systems". But text mining can also be exploited to enhance traditional cataloging techniques. Here are just a few examples:

  1. size of book - Current cataloging practice dictates that the number of pages in a book be recorded. Unfortunately, page count is a poor and ambiguous measure of length; typography, trim size, and illustrations all skew it. What if I want a long (thorough) book? What if I want a short one? The number of words in a book is still not a perfectly accurate predictor of size, but it is a whole lot more accurate than the number of pages. I suggest the number of words in an item be recorded in our bibliographic records. Once done, we could calculate an average length for the collection and determine whether a given book is relatively long or short. (A small sketch of this idea follows the list.)

  2. readability - A book's readability is determined by many factors. Number of words. Number of unique words. Length of words. Number of sentences. Length of sentences. Etc. In general, shorter books with fewer unique words (Dr. Seuss) are easier to read than longer books employing a specialized language (dissertations). Using text mining techniques, readability can be calculated, expressed as a score, and mapped to a relative scale. These scores could then be recorded in our catalogs, enabling the patron to limit results to easy or advanced materials. A Perl module can do this work -- http://bit.ly/mzfmFS (A rough illustration also follows the list.)

  3. keywords - Counting and tabulating the words in a text is a nearly trivial computer operation, and the process of determining the statistical significance of each word is well-known and well-established. Creating a list of such statistically significant words and inserting them into our cataloging records would only enhance the current use of controlled vocabulary terms. Much of this is done using TF-IDF -- http://bit.ly/aanvc6 (Also sketched below.)

  4. summaries - Once a list of statistically significant keywords is generated, it is only a tiny step to create a list of sentences containing those words. Such sentences and their immediate neighbors, especially if they contain many of the keywords, probably carry the text's core message. Such sentences could be included as computed summaries in our cataloging records, allowing the patron to get a better idea of what a book is about. For example, see the Open Text Summarizer -- http://bit.ly/j74tEt (A final sketch follows the list.)
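
To make item #1 concrete, here is a small sketch (Python; the file names are invented) of how word counts across a collection could be turned into a relative long/short judgment:

  # compare each book's word count against the collection average
  import re

  def word_count(path):
      """Count word tokens in a plain-text file with a naive tokenizer."""
      with open(path) as handle:
          return len(re.findall(r"[a-z']+", handle.read().lower()))

  # hypothetical file names standing in for a real collection
  corpus = ['walden.txt', 'origin-of-species.txt', 'cat-in-the-hat.txt']
  counts = {path: word_count(path) for path in corpus}
  average = sum(counts.values()) / len(counts)

  for path, count in counts.items():
      label = 'long' if count > average else 'short'
      print(path, count, 'words --', label, 'relative to the collection')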
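
The Perl module mentioned in item #2 does this sort of arithmetic for you. As a rough illustration only, here is the classic Flesch reading-ease formula in Python, with a crude vowel-group heuristic standing in for a real syllable counter:

  # estimate Flesch reading ease; higher scores mean easier reading
  import re

  def syllables(word):
      """Approximate syllables by counting groups of consecutive vowels."""
      return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

  def flesch_reading_ease(text):
      sentences = max(1, len(re.findall(r'[.!?]+', text)))
      words = re.findall(r"[a-z']+", text.lower())
      syllable_count = sum(syllables(word) for word in words)
      return 206.835 - 1.015 * (len(words) / sentences) \
                     - 84.6 * (syllable_count / len(words))

  print(flesch_reading_ease('The cat sat on the mat. The cat was fat.'))
  print(flesch_reading_ease('Epistemological considerations notwithstanding, '
                            'the dissertation problematizes intertextuality.'))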
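
The TF-IDF weighting referenced in item #3 is likewise simple arithmetic: a term's frequency within a document multiplied by the log of how rare the term is across the collection. A toy sketch with invented documents:

  # score the words of one document by TF-IDF against a tiny collection
  import math
  import re
  from collections import Counter

  documents = ['the whale swam and the whale sounded',
               'the cat sat on the mat',
               'the librarian cataloged the whale book']

  def tokenize(text):
      return re.findall(r"[a-z']+", text.lower())

  def tf_idf(term, document, collection):
      tokens = tokenize(document)
      tf = Counter(tokens)[term] / len(tokens)
      df = sum(1 for doc in collection if term in tokenize(doc))
      idf = math.log(len(collection) / df)
      return tf * idf

  for term in sorted(set(tokenize(documents[0]))):
      print(term, round(tf_idf(term, documents[0], documents), 3))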
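
And for item #4, once keywords are in hand an extractive summary is little more than ranking sentences by how many keywords each one contains, which is roughly the approach a tool like the Open Text Summarizer takes. A sketch with an invented bit of text:

  # pick the sentences containing the most keywords: a naive extractive summary
  import re
  from collections import Counter

  text = ('Whales are large marine mammals. They sing complex songs. '
          'Some whales migrate thousands of miles. Plankton is tiny. '
          'Whale songs can travel great distances under water.')

  stopwords = {'the', 'a', 'is', 'are', 'and', 'they', 'some', 'can'}
  words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stopwords]
  keywords = {word for word, count in Counter(words).most_common(5)}

  sentences = re.split(r'(?<=[.!?])\s+', text)
  ranked = sorted(sentences,
                  key=lambda s: len(keywords & set(re.findall(r"[a-z']+", s.lower()))),
                  reverse=True)

  print(' '.join(ranked[:2]))    # a two-sentence "summary"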

Given the full text of books in electronic form, there is so much a library can do to enhance its collections and provide useful information services, and we have only begun to scratch the surface. Much of our time is spent automating existing workflows. I think more of our time ought to be spent learning how to exploit the environment instead of being driven by it.

-- 
Eric Lease Morgan, Digital Projects Librarian
University of Notre Dame
(574) 631-8604

Great Books Survey -- http://bit.ly/auPD9Q