internet archive content

From: Eric Lease Morgan <emorgan_at_nyob>
Date: Tue, 2 Jun 2009 10:33:15 -0400
Here is a recipe for adding Internet Archive (Open Content Alliance)  
content to library "catalogs":

   1. Get keys - The first step is to get a set of keys describing the  
content you desire. This can be acquired through the Internet  
Archive's advanced search interface. [1]

   2. Convert keys - The next step is to convert the keys into sets of  
URLs pointing to the content you want to download. Fortunately, all  
the URLs have a similar shape: 
,, etc.

   3. Download - Feed the resulting URLs to your favorite spidering/ 
mirroring application. I use wget.

   4. Update - Enhance the downloaded MARC records with 856$u values  
denoting the location of your local PDF copy as well as the original  
(canonical) version.

   5. Index - Add the resulting MARC records to your "discovery" system.
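
Step 4 might look something like the following minimal Python sketch.
It assumes the downloaded records are in MARCXML form. The add_856
helper, the sample record, and the example.org URLs are all
hypothetical, made up for illustration only.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"  # MARCXML namespace
ET.register_namespace("", MARC_NS)

def add_856(record, url, note):
    """Append an 856 (electronic location) field with $u and $z.

    Hypothetical helper: ind1="4" marks HTTP access, ind2="0" says
    the link points at the resource itself.
    """
    field = ET.SubElement(record, f"{{{MARC_NS}}}datafield",
                          {"tag": "856", "ind1": "4", "ind2": "0"})
    u = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": "u"})
    u.text = url
    z = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": "z"})
    z.text = note
    return field

# A tiny made-up record standing in for a downloaded one.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title</subfield>
  </datafield>
</record>"""

record = ET.fromstring(SAMPLE)
# Placeholder URLs: one for the local PDF, one for the original.
add_856(record, "http://catalog.example.org/pdfs/example-key.pdf",
        "Local PDF copy")
add_856(record, "http://archive.example.org/example-key",
        "Canonical version")
print(ET.tostring(record, encoding="unicode"))
```

The fields are simply appended to the end of the record, so a record
that already carries 9xx local fields may need re-sorting afterwards.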

I have hacked together some shell and Perl scripts that do this for me  
in a VuFind context. [2] The next steps would be to use text mining  
techniques to extract summaries, named entities, and statistically  
relevant terms/phrases from the OCRed text(s) to enhance the 5xx, 7xx,  
and 6xx fields of the downloaded MARC records, respectively.
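
One simple way to find statistically relevant terms is TF-IDF. The
following is a minimal sketch, assuming the OCRed text is already
loaded as plain strings keyed by identifier. The tokenize and
top_terms functions and the sample texts are hypothetical, and the
output would only be candidates for 6xx fields, not finished subject
headings.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def top_terms(docs, doc_id, n=3):
    """Rank the terms of one document by TF-IDF against the collection."""
    tokens = {key: tokenize(text) for key, text in docs.items()}
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for words in tokens.values():
        df.update(set(words))
    tf = Counter(tokens[doc_id])
    total = len(tokens[doc_id])
    scores = {term: (count / total) * math.log(len(docs) / df[term])
              for term, count in tf.items()}
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:n]

# Made-up stand-ins for OCRed full text.
docs = {
    "a": "whales whales sea ship the the",
    "b": "the garden the flowers garden garden",
    "c": "the ship the harbor",
}
print(top_terms(docs, "a", 2))
```

Words that appear in every document, such as "the" here, score zero
and drop to the bottom of the ranking without a separate stop list.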

Thank heavens for open content, whether it be full text books or  

[1] Advanced search -
[2] hacks -

Eric Lease Morgan
University of Notre Dame
Received on Tue Jun 02 2009 - 10:33:42 EDT