fun with internet archive content

From: Eric Lease Morgan <emorgan_at_nyob>
Date: Wed, 10 Dec 2008 08:12:51 -0500
To: NGC4LIB_at_LISTSERV.ND.EDU
I've been having some fun with Internet Archive content.

More specifically, I have created a tiny system for copying scanned
materials locally, enhancing them with word clouds, indexing them, and
providing access to the whole thing. Here is how it works:

   1. Identify materials of interest from the Archive and copy their
      URLs to a text file.

   2. Feed the text file to wget (wget.sh) which copies the plain
      text, PDF, XML metadata, and GIF cover art locally.

   3. Create a rudimentary word cloud (cloud.pl) against each full
      text version of a document in an effort to supplement the MARC
      metadata.

   4. Index each item using the MARC metadata and full text
      (index.pl). Each index entry also includes the links to the word
      cloud, GIF image, PDF file, and MARC data.

   5. Provide a simple one-box, one-button interface to the index
      (search.pl & search.cgi). Search results appear much like the
      Internet Archive's but also include the word cloud.

   6. Go to Step #1; rinse, shampoo, and repeat.
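
For illustration only, here is a minimal Python sketch of the kind of counting Step #3 implies. It is not the actual cloud.pl (which is a Perl script); the function name, the stop-word list, and the sample text are all assumptions made up for this example. It simply counts the most frequent non-trivial words in a plain text file's content:

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def word_cloud_terms(text, top_n=10):
    """Return the top_n most frequent words (and counts) in text.

    Stop words and words of two letters or fewer are ignored.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(top_n)

# Made-up sample text standing in for a document's full text.
sample = "The whale and the sea. The whale is in the sea; the ship follows the whale."
print(word_cloud_terms(sample, top_n=2))
```

The resulting (word, count) pairs could then be rendered as a cloud, for example by scaling font size to the count, and stored alongside each item.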

You can try the demonstration at the following URL:

   http://dewey.library.nd.edu/hacks/ia/search.cgi

But remember, there are only about two dozen items presently in the  
index.

There are many ways the system can be improved, and they can be
divided into two types: 1) services against the index, and 2) services
against the items. Services against the index include things like
paging search results, making the interface "smarter", adding things
like faceted browse, implementing an advanced search, etc.

Services against the items interest me more. Given the full text it
might be possible to do things like: compare & contrast documents,
cite documents, convert documents into many formats, trace ideas
forward & backward, do morphology against words, add or subtract from
"my" collection, search "my" collection, share, annotate, rank &
review, summarize, create relationships between documents, etc. These
sorts of features I believe to be a future direction for the library
profession. It is about more than just getting the documents; it is
also about doing things with them once they are acquired. The creation
of the word clouds is a step in that direction. It assists in the
compare & contrast of documents.
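
As a rough illustration of how word clouds might assist in compare & contrast, the Python sketch below measures the overlap between the most frequent terms of two documents using Jaccard similarity. This is only one possible approach and not something the system described here does; the function names, the length cutoff, and the sample texts are assumptions invented for the example.

```python
import re
from collections import Counter

def top_terms(text, top_n=25):
    """Return the set of the top_n most frequent words longer than 3 letters."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 3)
    return {term for term, _ in counts.most_common(top_n)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0  # two empty clouds are treated as having no overlap
    return len(a & b) / len(a | b)

# Made-up stand-ins for the full text of two documents.
doc_a = "whales ships oceans whales harpoons"
doc_b = "ships oceans storms sailors ships"

overlap = jaccard(top_terms(doc_a), top_terms(doc_b))
print(f"{overlap:.2f}")
```

A score near 1 would suggest two documents share most of their prominent vocabulary; a score near 0 would suggest they have little in common.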

The Internet Archive makes many of these things possible because they  
freely distribute their content -- including the full text.

InternetArchive++

-- 
Eric Lease Morgan
University of Notre Dame
Received on Wed Dec 10 2008 - 07:30:00 EST