>
>> http://serials.infomotions.com/ngc4lib/archive/2009/200911/1815.html
>
> Your post contains some nice fresh air and imagination. Is this post
> in reply to something? Related to some project? Or, just a nice,
> fresh, Tuesday morning musing?!
>
> I have to second the comment, some excellent examples of how we can
> make our data work harder are given here. They are practical and
> doable, assuming enough access to the full text.
Thank you for your interest.
A post in reply to something? No. I'm just trying to do my job as a
moderator by facilitating discussion and raising awareness.
Related to some project? Well, yes, sort of. Thank you for asking. As
you may or may not know, I have been maintaining a thing called the
Alex Catalogue of Electronic Texts since 1994 or so. It currently
includes approximately 14,000 etexts (a small number, if not tiny),
mostly from places such as a defunct Virginia Tech etext project,
Project Gutenberg, and the Internet Archive. I have full-text indexed
the content, complete with faceted browse, snippet display, and
relevancy-ranked search results. All of these things are features of
the indexer, Solr. [1] (The underlying content is managed with
MyLibrary.)
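To give a flavor of what Solr buys me, a single request with the stock
highlighting and faceting parameters returns relevancy-ranked hits,
snippets, and facets all at once. (The field names below, text,
subject, and author, are only illustrative, not necessarily the names
in my schema.)

  http://localhost:8983/solr/select?q=love
    &fl=*,score
    &hl=true&hl.fl=text
    &facet=true&facet.field=subject&facet.field=author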
More importantly, I have begun to: 1) analyze my data in new and
different ways, and 2) provide services beyond simple display.
Regarding data analysis, I have begun to count/compute things. For
example, for each document I have counted/computed:
1. number of words in the text
2. Fog score (akin to grade level)
3. Kincaid score (again, akin to grade level)
4. Flesch score (a readability value between 0 and 100)
5. Great Ideas Coefficient (a thing of my own design)
6. Big Names Coefficient (another thing of my own design)
These scores (computed along the lines sketched below) provide me with
a dataset looking something like this:
key                  words  fog  kincaid  flesch  great  names
abbott-flatland-361  33653   16       13      48    142      1
alcott-little-261   186021   11        9      69     72      6
alcott-flower-619    35759   13       12      64     56      0
alger-cast-544       55618    7        5      77     36     35
alger-ragged-545     47831    8        6      74      7      5
...
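For the curious, none of this arithmetic is rocket science. Below is a
quick-and-dirty Python sketch of the sorts of computations involved.
It is not the code I actually run: the syllable counter is a crude
heuristic, the Fog, Kincaid, and Flesch formulas are the standard
published ones, and the two coefficients are modeled here as nothing
more than occurrence counts against hand-made term lists.

  import re

  def count_syllables(word):
      # crude heuristic: count runs of vowels; good enough for a sketch
      return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

  def analyze(text, great_ideas=(), big_names=()):
      # basic counts
      sentences = max(1, len(re.findall(r'[.!?]+', text)))
      words = re.findall(r"[A-Za-z']+", text)
      n = len(words)
      syllables = sum(count_syllables(w) for w in words)
      complex_words = sum(1 for w in words if count_syllables(w) >= 3)

      # standard readability formulas, built on words-per-sentence
      # and syllables-per-word
      wps = n / sentences
      spw = syllables / n
      fog     = 0.4 * (wps + 100 * complex_words / n)
      kincaid = 0.39 * wps + 11.8 * spw - 15.59
      flesch  = 206.835 - 1.015 * wps - 84.6 * spw

      # "coefficients" sketched as occurrence counts against term lists
      lower = text.lower()
      great = sum(lower.count(term.lower()) for term in great_ideas)
      names = sum(lower.count(term.lower()) for term in big_names)

      return {'words': n, 'fog': round(fog), 'kincaid': round(kincaid),
              'flesch': round(flesch), 'great': great, 'names': names}

Run against the plain text of an etext, analyze() returns a dictionary
mapping directly onto one row of the table above.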
Given this data I can make judgements about individual works as well
as my collection as a whole, judgements such as:
* this book is shorter than, about as long as, or longer than others
* these books are intended for these types of readers
* this book contains a relatively large number of
"great ideas" or "big names"
My next step is two-fold. First, I will update the index to include
these numbers, thus allowing the user to narrow their search to longer
or shorter books, to "heavy" material versus "light" reading, or to
texts requiring an advanced education rather than only grade-school
reading skills. Second, I will update the metadata for each book to include
an image illustrating these characteristics. This image will probably
be a "spider plot" or "radar chart" similar to the graphs used in
LibQUAL reports. [2] Thus, at a glance, a person will be able to
determine some of the book's characteristics.
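Generating such an image should not be much work. Here is a
back-of-the-envelope sketch using Python and matplotlib; the library
choice, the 0-100 scaling, and the sample values are only for
illustration, not necessarily what the Catalogue will end up doing.

  import numpy as np
  import matplotlib.pyplot as plt

  def radar_chart(scores, filename):
      # scores is a dictionary of characteristic -> value, scaled 0-100
      labels = list(scores.keys())
      values = list(scores.values())
      angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
      # close the polygon by repeating the first point
      values += values[:1]
      angles += angles[:1]
      fig, ax = plt.subplots(subplot_kw=dict(polar=True))
      ax.plot(angles, values, linewidth=1)
      ax.fill(angles, values, alpha=0.25)
      ax.set_xticks(angles[:-1])
      ax.set_xticklabels(labels)
      ax.set_ylim(0, 100)
      fig.savefig(filename)
      plt.close(fig)

  # illustrative, hand-scaled values; not computed from the real data
  radar_chart({'length': 35, 'fog': 80, 'kincaid': 68, 'flesch': 48,
               'great ideas': 90, 'big names': 5},
              'abbott-flatland-361.png')

The index side of the first step is even simpler; once the numbers are
in Solr, narrowing a search to "light" reading is just a range filter,
something like fq=flesch:[70 TO 100], whatever the field ends up being
called in the schema.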
Given the full text of materials, there are services one can provide
against texts. Concordances are my favorite. [3] Enter a word. Get
back a list of textual snippets containing that word. Select a letter
and get back a list of all the words and their counts that begin with
that letter. Enter an integer (n) and get back the most common n words
from the text. Enter an integer (p) and get back the most common p two-
word phrases. Enter phrases, such as "love is" or "men are", and get a
few quick-and-dirty definitions. These services are absolutely
wonderful for quickly "reading" a text. They are great for rudimentary
compare & contrast operations. None of this is really novel on my
part; all of it is representative of the work done by "digital
humanities" projects such as TAPoR and MONK. [4, 5]
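None of these services require heavy machinery either. A
keyword-in-context display, a frequency list, and a phrase list are
mostly regular expressions and counting. The sketch below is plain
Python, an illustration of the idea rather than the code behind the
sample concordances:

  import re
  from collections import Counter

  def concordance(text, word, width=40):
      # return keyword-in-context snippets for every occurrence of word
      snippets = []
      for m in re.finditer(r'\b' + re.escape(word) + r'\b', text, re.IGNORECASE):
          start = max(0, m.start() - width)
          end = min(len(text), m.end() + width)
          snippets.append(' '.join(text[start:end].split()))
      return snippets

  def most_common_words(text, n=25):
      # return the n most frequently occurring words
      words = re.findall(r"[a-z']+", text.lower())
      return Counter(words).most_common(n)

  def most_common_phrases(text, p=25):
      # return the p most common two-word phrases
      words = re.findall(r"[a-z']+", text.lower())
      return Counter(zip(words, words[1:])).most_common(p)

Feeding a phrase such as "love is" through the same concordance
function is what produces the quick-and-dirty definitions.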
Yes, these things are only possible given a critical mass of full
text, but I'd be willing to bet we have this. Create a list of the
items in your existing catalog. Search for those items in the Internet
Archive and/or Project Gutenberg. Download those items. Do analysis
against them. Update your metadata records. Re-index. Provide access
to your index. Do the same thing with open access journals. Identify
subject areas of interest. Use the DOAJ, OpenDOAR, ROAR, and the NDLTD
to identify content fitting the subject area. Harvest the content. Index
it. Evaluate it. Re-index it. Provide access to the index. Provide
services against the content. Repeat.
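In outline, and with the caveat that the first two functions below are
only stand-ins for whatever search, harvesting, and analysis tools you
already have (pysolr is likewise just one convenient way to talk to
Solr), the loop looks something like this:

  import pysolr

  def find_fulltext(title, author):
      # stand-in: search the Internet Archive, Project Gutenberg, DOAJ,
      # OpenDOAR, ROAR, the NDLTD, etc., and return plain text or None
      return None

  def analyze(text):
      # stand-in for the counting/scoring described above
      return {'words': len(text.split())}

  def enrich_and_reindex(catalog, solr_url):
      solr = pysolr.Solr(solr_url)
      for record in catalog:
          text = find_fulltext(record.get('title'), record.get('author'))
          if text is None:
              continue                     # no open copy found; skip it
          record.update(analyze(text))     # add words, fog, kincaid, etc.
          solr.add([record])               # update the metadata and re-index
      solr.commit()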
IMHO, these are the things of a "next generation" library catalog.
[1] Solr - http://lucene.apache.org/solr/
[2] radar chart - http://en.wikipedia.org/wiki/Radar_chart
[3] sample concordances - http://infomotions.com/sandbox/concordance/
[4] TAPoR - http://portal.tapor.ca/portal/portal
[5] MONK - http://www.monkproject.org/
--
Eric Lease Morgan
University of Notre Dame