I've been having some fun with Internet Archive content.
More specifically, I have created a tiny system for copying scanned
materials locally, enhancing them with word clouds, indexing them, and
providing access to the whole thing. Here is how it works:
1. Identify materials of interest from the Archive and copy their
URLs to a text file.
2. Feed the text file to wget (wget.sh) which copies the plain
text, PDF, XML metadata, and GIF cover art locally.
3. Create a rudimentary word cloud (cloud.pl) against each full
text version of a document in an effort to supplement the MARC
metadata.
4. Index each item using the MARC metadata and full text
(index.pl). Each index entry also includes links to the word
cloud, GIF image, PDF file, and MARC data.
5. Provide a simple one-box, one-button interface to the index
(search.pl & search.cgi). Search results appear much like the
Internet Archive's but also include the word cloud.
6. Go to Step #1; rinse, shampoo, and repeat.
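To make Step #3 a bit more concrete, here is a minimal sketch of how a
rudimentary word cloud might be computed from a plain text file. It is
not cloud.pl (which is written in Perl and not shown here); it is a
Python illustration, and the function names, the tiny stop-word list,
and the font-size range are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "is", "it", "that", "for"}


def top_words(text, limit=25, min_length=3):
    """Return the `limit` most frequent words as (word, count) pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        w for w in words if len(w) >= min_length and w not in STOP_WORDS
    )
    return counts.most_common(limit)


def cloud_html(pairs, min_size=10, max_size=36):
    """Render (word, count) pairs as HTML spans sized by frequency."""
    if not pairs:
        return ""
    highest = max(count for _, count in pairs)
    spans = []
    # Alphabetical order, so that size alone signals frequency.
    for word, count in sorted(pairs):
        size = min_size + (max_size - min_size) * count // highest
        spans.append(f'<span style="font-size:{size}pt">{word}</span>')
    return " ".join(spans)
```

Something like cloud_html(top_words(full_text)) could then be saved
next to each item and linked from its index entry.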
You can try the demonstration at the following URL:
http://dewey.library.nd.edu/hacks/ia/search.cgi
But remember, there are only about two dozen items presently in the
index.
There are many ways the system can be improved, and they can be
divided into two types: 1) services against the index, and 2) services
against the items. Services against the index include things like
paging search results, making the interface "smarter", adding things
like faceted browse, implementing an advanced search, etc.
Services against the items interest me more. Given the full text it
might be possible to do things like: compare & contrast documents,
cite documents, convert documents into many formats, trace ideas
forward & backward, do morphology against words, add to or subtract
from "my" collection, search "my" collection, share, annotate, rank &
review, summarize, create relationships between documents, etc. These
sorts of features I believe to be a future direction for the library
profession. It is about more than just getting documents; it is also
about doing things with them once they are acquired. The creation of
the word clouds is a step in that direction. It assists in comparing
& contrasting documents.
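As one possible illustration of comparing & contrasting, the most
frequent words of two documents can be treated as sets and measured
for overlap. The sketch below is an assumption about how such a
service might work, not something the current system does; the
function names and the cut-offs are hypothetical.

```python
import re
from collections import Counter


def frequent_words(text, limit=50, min_length=4):
    """Return the set of the `limit` most frequent longer words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) >= min_length)
    return {word for word, _ in counts.most_common(limit)}


def cloud_overlap(text_a, text_b, limit=50):
    """Return (Jaccard similarity, shared words) for two full texts."""
    a = frequent_words(text_a, limit)
    b = frequent_words(text_b, limit)
    union = a | b
    if not union:
        return 0.0, set()
    shared = a & b
    return len(shared) / len(union), shared
```

A score near 1 suggests two items use much the same vocabulary, a
score near 0 suggests they do not, and the shared words hint at what
the items have in common.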
The Internet Archive makes many of these things possible because it
freely distributes its content -- including the full text.
InternetArchive++
--
Eric Lease Morgan
University of Notre Dame
Received on Wed Dec 10 2008 - 07:30:00 EST