>
>> http://serials.infomotions.com/ngc4lib/archive/2009/200911/1815.html
>
> Your post contains some nice fresh air and imagination. Is this post
> in reply to something? Related to some project? Or, just a nice,
> fresh, Tuesday morning musing?!
>
> I have to second the comment, some excellent examples of how we can
> make our data work harder are given here. They are practical and
> doable, assuming enough access to the full text.
Thank you for your interest.
A post in reply to something? No. I'm just trying to do my job as a
moderator by facilitating discussion and raising awareness.
Related to some project? Well, yes, sort of. Thank you for asking. As
you may or may not know, I have been maintaining a thing called the
Alex Catalogue of Electronic Texts since 1994 or so. It currently
includes approximately 14,000 etexts (a small number, if not tiny),
mostly from places such as a defunct Virginia Tech etext project,
Project Gutenberg, and the Internet Archive. I have full-text indexed
the content, complete with faceted browse, snippet display, and
relevancy-ranked search results. All of these things are features of
the indexer, Solr. [1] (The underlying content is managed with
MyLibrary.)
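To give a flavor of what Solr buys me, a single request with the stock
highlighting and faceting parameters returns relevancy-ranked hits,
snippets, and facets all at once. (The field names below, text,
subject, and author, are only illustrative, not necessarily the names
in my schema.)

  http://localhost:8983/solr/select?q=love
    &fl=*,score
    &hl=true&hl.fl=text
    &facet=true&facet.field=subject&facet.field=author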
More importantly, I have begun to: 1) analyze my data in new and
different ways, and 2) provide services beyond simple display.
Regarding data analysis, I have begun to count/compute things. For
example, for each document I have counted/computed:
1. number of words in the text
2. Fog score (akin to grade level)
3. Kincaid score (again, akin to grade level)
4. Flesch score (a readability value between 0 and 100)
5. Great Ideas Coefficient (a thing of my own design)
6. Big Names Coefficient (another thing of my own design)
These scores (computed along the lines sketched below) provide me with
a dataset looking something like this:
key                  words  fog  kincaid  flesch  great  names
abbott-flatland-361  33653   16       13      48    142      1
alcott-little-261   186021   11        9      69     72      6
alcott-flower-619    35759   13       12      64     56      0
alger-cast-544       55618    7        5      77     36     35
alger-ragged-545     47831    8        6      74      7      5
...
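For the curious, none of this arithmetic is rocket science. Below is a
quick-and-dirty Python sketch of the sorts of computations involved.
It is not the code I actually run: the syllable counter is a crude
heuristic, the Fog, Kincaid, and Flesch formulas are the standard
published ones, and the two coefficients are modeled here as nothing
more than occurrence counts against hand-made term lists.

  import re

  def count_syllables(word):
      # crude heuristic: count runs of vowels; good enough for a sketch
      return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

  def analyze(text, great_ideas=(), big_names=()):
      # basic counts
      sentences = max(1, len(re.findall(r'[.!?]+', text)))
      words = re.findall(r"[A-Za-z']+", text)
      n = len(words)
      syllables = sum(count_syllables(w) for w in words)
      complex_words = sum(1 for w in words if count_syllables(w) >= 3)

      # standard readability formulas, built on words-per-sentence
      # and syllables-per-word
      wps = n / sentences
      spw = syllables / n
      fog     = 0.4 * (wps + 100 * complex_words / n)
      kincaid = 0.39 * wps + 11.8 * spw - 15.59
      flesch  = 206.835 - 1.015 * wps - 84.6 * spw

      # "coefficients" sketched as occurrence counts against term lists
      lower = text.lower()
      great = sum(lower.count(term.lower()) for term in great_ideas)
      names = sum(lower.count(term.lower()) for term in big_names)

      return {'words': n, 'fog': round(fog), 'kincaid': round(kincaid),
              'flesch': round(flesch), 'great': great, 'names': names}

Run against the plain text of an etext, analyze() returns a dictionary
mapping directly onto one row of the table above.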
Given this data I can make judgements about individual works as well
as my collection as a whole, judgements such as:
* this book is shorter than, about as long as, or longer than others
* these books are intended for these types of readers
* this book contains a relatively large number of
"great ideas" or "big names"
My next step is two-fold. First, I will update the index to include
these numbers, thus allowing the user to narrow their search to longer
or shorter books, to "heavy" material versus "light" reading, or to
texts requiring an advanced education rather than only grade-school
reading skills. Second, I will update the metadata for each book to include
an image illustrating these characteristics. This image will probably
be a "spider plot" or "radar chart" similar to the graphs used in
LibQUAL reports. [2] Thus, at a glance, a person will be able to
determine some of the book's characteristics.
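Generating such an image should not be much work. Here is a
back-of-the-envelope sketch using Python and matplotlib; the library
choice, the 0-100 scaling, and the sample values are only for
illustration, not necessarily what the Catalogue will end up doing.

  import numpy as np
  import matplotlib.pyplot as plt

  def radar_chart(scores, filename):
      # scores is a dictionary of characteristic -> value, scaled 0-100
      labels = list(scores.keys())
      values = list(scores.values())
      angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
      # close the polygon by repeating the first point
      values += values[:1]
      angles += angles[:1]
      fig, ax = plt.subplots(subplot_kw=dict(polar=True))
      ax.plot(angles, values, linewidth=1)
      ax.fill(angles, values, alpha=0.25)
      ax.set_xticks(angles[:-1])
      ax.set_xticklabels(labels)
      ax.set_ylim(0, 100)
      fig.savefig(filename)
      plt.close(fig)

  # illustrative, hand-scaled values; not computed from the real data
  radar_chart({'length': 35, 'fog': 80, 'kincaid': 68, 'flesch': 48,
               'great ideas': 90, 'big names': 5},
              'abbott-flatland-361.png')

The index side of the first step is even simpler; once the numbers are
in Solr, narrowing a search to "light" reading is just a range filter,
something like fq=flesch:[70 TO 100], whatever the field ends up being
called in the schema.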
Given the full text of materials, there are services one can provide
against texts. Concordances are my favorite. [3] Enter a word. Get
back a list of textual snippets containing that word. Select a letter
and get back a list of all the words and their counts that begin with
that letter. Enter an integer (n) and get back the most common n words
from the text. Enter an integer (p) and get back the most common p two-
word phrases. Enter phrases, such as "love is" or "men are", and get a
few quick-and-dirty definitions. These services are absolutely
wonderful for quickly "reading" a text. They are great for rudimentary
compare & contrast operations. None of this is really novel on my
part; all of it is representative of the work done by "digital
humanities" projects such as TAPoR and MONK. [4, 5]
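None of these services require heavy machinery either. A
keyword-in-context display, a frequency list, and a phrase list are
mostly regular expressions and counting. The sketch below is plain
Python, an illustration of the idea rather than the code behind the
sample concordances:

  import re
  from collections import Counter

  def concordance(text, word, width=40):
      # return keyword-in-context snippets for every occurrence of word
      snippets = []
      for m in re.finditer(r'\b' + re.escape(word) + r'\b', text, re.IGNORECASE):
          start = max(0, m.start() - width)
          end = min(len(text), m.end() + width)
          snippets.append(' '.join(text[start:end].split()))
      return snippets

  def most_common_words(text, n=25):
      # return the n most frequently occurring words
      words = re.findall(r"[a-z']+", text.lower())
      return Counter(words).most_common(n)

  def most_common_phrases(text, p=25):
      # return the p most common two-word phrases
      words = re.findall(r"[a-z']+", text.lower())
      return Counter(zip(words, words[1:])).most_common(p)

Feeding a phrase such as "love is" through the same concordance
function is what produces the quick-and-dirty definitions.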
Yes, these things are only possible given a critical mass of full
text, but I'd be willing to bet we have this. Create a list of the
items in your existing catalog. Search for those items in the Internet
Archive and/or Project Gutenberg. Download those items. Do analysis
against them. Update your metadata records. Re-index. Provide access
to your index. Do the same thing with open access journals. Identify
subject areas of interest. Use the DOAJ, OpenDOAR, ROAR, and the NDLTD
to identify content fitting the subject area. Harvest the content. Index
it. Evaluate it. Re-index it. Provide access to the index. Provide
services against the content. Repeat.
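In outline, and with the caveat that the first two functions below are
only stand-ins for whatever search, harvesting, and analysis tools you
already have (pysolr is likewise just one convenient way to talk to
Solr), the loop looks something like this:

  import pysolr

  def find_fulltext(title, author):
      # stand-in: search the Internet Archive, Project Gutenberg, DOAJ,
      # OpenDOAR, ROAR, the NDLTD, etc., and return plain text or None
      return None

  def analyze(text):
      # stand-in for the counting/scoring described above
      return {'words': len(text.split())}

  def enrich_and_reindex(catalog, solr_url):
      solr = pysolr.Solr(solr_url)
      for record in catalog:
          text = find_fulltext(record.get('title'), record.get('author'))
          if text is None:
              continue                     # no open copy found; skip it
          record.update(analyze(text))     # add words, fog, kincaid, etc.
          solr.add([record])               # update the metadata and re-index
      solr.commit()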
IMHO, these are the things of a "next generation" library catalog.
[1] Solr - http://lucene.apache.org/solr/
[2] radar chart - http://en.wikipedia.org/wiki/Radar_chart
[3] sample concordances - http://infomotions.com/sandbox/concordance/
[4] TAPoR - http://portal.tapor.ca/portal/portal
[5] MONK - http://www.monkproject.org/
--
Eric Lease Morgan
University of Notre Dame