Re: text mining

From: Jonathan Rochkind <rochkind_at_nyob> Date: Wed, 11 May 2011 11:00:22 -0400 To: NGC4LIB_at_LISTSERV.ND.EDU

This is an interesting idea.

It occurs to me that HathiTrust already has the data (scanned full text) 
for many items necessary to provide concordances.

It would be an interesting and useful service if HT were to pre-compute 
these concordances, and then provide an API service where you could look 
em up by ISBN/ISSN/OCLCnum/LCCN. They already have an API that lets you 
look up HT records by those identifiers, the new thing would just be 
computing the concordances and advertising them in the API.  Then 
individual libraries could use them to supplement their displays with 
this added meta-data, as Eric suggests -- without the individual 
libraries (programmer time and CPU time) having to do all the work Eric 
did over again,  having the work done once in a central location.  (HT 
could of course also provide a bulk download in addition to a per-item 
API).
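
To make the idea concrete, here is a minimal sketch of what pre-computing a concordance and looking it up by identifier might involve. It assumes the plain OCR text of an item is already on hand. The function names, the in-memory `precomputed` store, and the identifier string format are all hypothetical; this is not HT's actual API, only an illustration of "compute once, look up many times".

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_concordance(text, width=3):
    """Map each word to a list of (position, left context, right context)."""
    tokens = tokenize(text)
    concordance = defaultdict(list)
    for i, word in enumerate(tokens):
        left = " ".join(tokens[max(0, i - width):i])
        right = " ".join(tokens[i + 1:i + 1 + width])
        concordance[word].append((i, left, right))
    return dict(concordance)


# Hypothetical store keyed by identifier (ISBN, OCLC number, etc.),
# standing in for whatever a central service would actually use.
precomputed = {}


def register(identifier, text):
    """Compute the concordance once and keep it under the identifier."""
    precomputed[identifier] = build_concordance(text)


def lookup(identifier, word):
    """Return the stored contexts for a word, or an empty list if unknown."""
    return precomputed.get(identifier, {}).get(word.lower(), [])
```

A library display could then call something like `lookup("oclc:0000", "pope")` and show the surrounding words for each hit, without ever handling the full text itself.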

I think this type of concordance would reveal no more information than 
HT's current "search where results are only page numbers" service, which 
they provide for in-copyright works too, so presumably the same legal 
analysis that justified HT's current service could justify concordances 
for in-copyright works.

Jonathan

On 5/11/2011 7:06 AM, Eric Lease Morgan wrote:
> Here at the University of Notre Dame we are beginning to add text mining links to our catalog and "discovery system".
>
> Text mining, loosely defined, is a computerized process for extracting information from documents. Think of it as if it were a concordance on steroids. Text mining goes beyond search and enables people to do things such as count the number of words in a document and thus determine its relative length, calculate a document's various readability scores, list the most frequent n-grams, and graph where in a document n-grams occur. Text mining is only possible if one has the full text of a document.
>
> Here in the Hesburgh Libraries at the University of Notre Dame we have begun to digitize our collection of Catholic pamphlets, short documents written in layman's terms, used to describe all things Catholic. After digitization the pamphlets will be OCRed and placed on a Web server. We will then update our local bibliographic MARC records with the URLs and make the pamphlets available for downloading from our catalog as well as our discovery system. Easy. But in addition, a text mining interface will be linked from our catalog and discovery system. In the end we hope to have close to 5,000 pamphlets online.
>
> We have begun to implement this as a part of the Catholic Research Resources Alliance (CRRA), affectionately known as the "Catholic Portal". Bibliographic data is harvested from CRRA member libraries and locally indexed in VuFind. The Portal then provides an interface to the index. Some of the individual records are Catholic pamphlets, and they illustrate the text mining interface:
>
>    * Archbishop Purcell outdone!
>      bib' record: http://www.catholicresearch.net/Record/undmarc_000537132
>      concordance: http://www.catholicresearch.net/concordances/?id=undmarc_000537132
>
>    * Is the Pope always right?
>      bib' record: http://www.catholicresearch.net/Record/undmarc_000743445
>      concordance: http://www.catholicresearch.net/concordances/?id=undmarc_000743445
>
>    * Pastoral instruction for the application...
>      bib' record: http://www.catholicresearch.net/Record/undmarc_000841024
>      concordance: http://www.catholicresearch.net/concordances/?id=undmarc_000841024
>
>    * The Catholic factor in urban welfare
>      bib' record: http://www.catholicresearch.net/Record/undmarc_000885039
>      concordance: http://www.catholicresearch.net/concordances/?id=undmarc_000885039
>
>    * Abortion : will there be civil war?
>      bib' record: http://www.catholicresearch.net/Record/undmarc_000922359
>      concordance: http://www.catholicresearch.net/concordances/?id=undmarc_000922359
>
>    * Participation of Catholics in mixed groups
>      bib' record: http://www.catholicresearch.net/Record/undmarc_000941495
>      concordance: http://www.catholicresearch.net/concordances/?id=undmarc_000941495
>
> In the Portal we have done some similar work by harvesting content and metadata from the Internet Archive and integrating it into the Portal. These records represent content from the University of Toronto -- http://bit.ly/iaFdGj
>
> The text mining interface as it is presently implemented is rudimentary at best, but it does enable the reader to peruse individual documents quickly. It provides information services similar to tables-of-contents and back-of-the-book indexes -- things providing overviews of the book. Text mining does not really provide answers but rather provides guidance. It is a form of "services against texts" -- http://bit.ly/lZfp7j
>
> While the problem of find will never be completely solved, it is not a problem most people feel they have when it comes to data and information. The more acute problem is understanding. How does one put found information into context? By digitizing content and providing services like text mining against it, the library profession can evolve, demonstrate leadership, and fill a niche.
>
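
For readers curious about the measures Eric lists (word counts, most frequent n-grams, and where in a document an n-gram occurs), the following is a rough sketch of how they might be computed from plain text. It is not the code behind the Portal's interface; the function names and the simple tokenizing rule are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def most_frequent_ngrams(text, n=2, top=5):
    """List the `top` most common n-grams with their counts."""
    return Counter(ngrams(tokenize(text), n)).most_common(top)


def relative_positions(text, phrase):
    """Return where a phrase occurs, scaled from 0.0 (start) to 1.0 (end)."""
    tokens = tokenize(text)
    target = tokenize(phrase)
    n = len(target)
    if n == 0 or len(tokens) < n:
        return []
    last = len(tokens) - n  # index of the last possible starting position
    return [
        (i / last if last else 0.0)
        for i in range(last + 1)
        if tokens[i:i + n] == target
    ]
```

The list returned by `relative_positions` could be handed to any plotting tool to graph where in a document a phrase clusters, and `len(tokenize(text))` gives the word count used to judge relative length.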