text mining

From: Eric Lease Morgan <emorgan_at_nyob>
Date: Wed, 11 May 2011 07:06:51 -0400
To: NGC4LIB_at_LISTSERV.ND.EDU
Here at the University of Notre Dame we are beginning to add text mining links to our catalog and "discovery system".

Text mining, loosely defined, is a computerized process for extracting information from documents. Think of it as a concordance on steroids. Text mining goes beyond search and enables people to do things such as count the number of words in a document and thus determine its relative length, calculate a document's various readability scores, list the most frequent n-grams, and graph where in a document n-grams occur. Text mining is only possible if one has the full text of a document.
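To make the operations above concrete, here is a minimal sketch in Python of three of them: counting words, listing the most frequent n-grams, and locating where a given n-gram occurs. It is an illustration only, not the code behind our interface. The function names, the tokenization rule, and the sample sentence are all assumptions made for the example, and readability scores are left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (an assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def most_frequent_ngrams(tokens, n, top=10):
    """Return the `top` most common n-grams with their counts."""
    return Counter(ngrams(tokens, n)).most_common(top)


def ngram_positions(tokens, phrase):
    """Return relative positions (0.0 = start, 1.0 = end) where phrase begins."""
    target = tuple(tokenize(phrase))
    if not target:
        return []
    n = len(target)
    last_start = max(len(tokens) - n, 1)  # avoid dividing by zero on tiny texts
    return [i / last_start
            for i, gram in enumerate(ngrams(tokens, n))
            if gram == target]


# Hypothetical sample text, standing in for the OCRed full text of a pamphlet.
sample = "The Pope is the bishop of Rome and the Pope teaches and the Pope leads"
tokens = tokenize(sample)

print("words:", len(tokens))
print("top bigrams:", most_frequent_ngrams(tokens, 2, top=3))
print("positions of 'the pope':", ngram_positions(tokens, "the pope"))
```

The relative positions are the sort of values one could plot along a horizontal axis to graph where in a document a phrase occurs.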

Here in the Hesburgh Libraries at the University of Notre Dame we have begun to digitize our collection of Catholic pamphlets, short documents written in layman's terms and used to describe all things Catholic. After digitization the pamphlets will be OCRed and placed on a Web server. We will then update our local bibliographic MARC records with the URLs and make the pamphlets available for downloading from our catalog as well as our discovery system. Easy. But in addition, a text mining interface will be linked from our catalog and discovery system. In the end we hope to have close to 5,000 pamphlets online.

We have begun to implement this as a part of the Catholic Research Resources Alliance (CRRA), affectionately known as the "Catholic Portal". Bibliographic data is harvested from CRRA member libraries and locally indexed in VuFind. The Portal then provides an interface to the index. Some of the individual records are Catholic pamphlets, and they illustrate the text mining interface:

  * Archbishop Purcell outdone!
    bib' record: http://www.catholicresearch.net/Record/undmarc_000537132
    concordance: http://www.catholicresearch.net/concordances/?id=undmarc_000537132
  
  * Is the Pope always right?
    bib' record: http://www.catholicresearch.net/Record/undmarc_000743445
    concordance: http://www.catholicresearch.net/concordances/?id=undmarc_000743445
  
  * Pastoral instruction for the application...
    bib' record: http://www.catholicresearch.net/Record/undmarc_000841024
    concordance: http://www.catholicresearch.net/concordances/?id=undmarc_000841024
  
  * The Catholic factor in urban welfare
    bib' record: http://www.catholicresearch.net/Record/undmarc_000885039
    concordance: http://www.catholicresearch.net/concordances/?id=undmarc_000885039
  
  * Abortion : will there be civil war?
    bib' record: http://www.catholicresearch.net/Record/undmarc_000922359
    concordance: http://www.catholicresearch.net/concordances/?id=undmarc_000922359
  
  * Participation of Catholics in mixed groups
    bib' record: http://www.catholicresearch.net/Record/undmarc_000941495
    concordance: http://www.catholicresearch.net/concordances/?id=undmarc_000941495

In the Portal we have done some similar work by harvesting content and metadata from the Internet Archive and integrating it into the Portal. These records represent content from the University of Toronto -- http://bit.ly/iaFdGj

The text mining interface as it is presently implemented is rudimentary at best, but it does enable the reader to peruse individual documents quickly. It provides information services similar to tables-of-contents and back-of-the-book indexes -- things providing overviews of the book. Text mining does not really provide answers but rather provides guidance. It is a form of "services against texts" -- http://bit.ly/lZfp7j
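For readers unfamiliar with concordances, the following is a rough sketch in Python of a keyword-in-context listing, the classic form such a service takes. It does not describe how the Portal's concordance is actually implemented; the function name, the `width` parameter, and the sample text are hypothetical and chosen only for illustration.

```python
import re


def concordance(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of keyword.

    Each line shows up to `width` characters on either side of the match.
    """
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-justify the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Hypothetical sample text, standing in for the full text of a pamphlet.
sample = "The Pope is in Rome. Is the pope always right?"
for line in concordance(sample, "pope", width=15):
    print(line)
```

Lining the keyword up in a column lets a reader scan its uses quickly, which is the kind of guidance, rather than answers, described above.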

While the problem of find will never be completely solved, it is not a problem most people feel they have when it comes to data and information. The more acute problem is understanding. How does one put found information into context? By digitizing content and providing services like text mining against it, the library profession can evolve, demonstrate leadership, and fill a niche.

-- 
Eric Lease Morgan
Hesburgh Libraries, University of Notre Dame

Great Books Survey -- http://bit.ly/auPD9Q
Received on Wed May 11 2011 - 07:07:12 EDT