Re: Book-scanning projects - a question

From: B.G. Sloan <bgsloan2_at_nyob> Date: Thu, 1 Jul 2010 10:37:51 -0700 To: NGC4LIB_at_LISTSERV.ND.EDU

But my main point was: don't most freely-accessible digitized collections consist of books published prior to 1923? 

Of course Google is the big exception. But until the Google Books settlement is finalized, we won't know how many post-1923 books Google will let us get the full text of.

Bernie Sloan

--- On Thu, 7/1/10, Eric Lease Morgan <emorgan_at_ND.EDU> wrote:

From: Eric Lease Morgan <emorgan_at_ND.EDU>
Subject: Re: [NGC4LIB] Book-scanning projects - a question
To: NGC4LIB_at_LISTSERV.ND.EDU
Date: Thursday, July 1, 2010, 1:25 PM

On Jul 1, 2010, at 11:40 AM, B.G. Sloan wrote:

> Most of the book-scanning projects are focusing on digitizing works in the public domain, right? And the public domain is basically books published before 1923, right? So, aren't most of these projects the equivalent of building a physical library collection of pre-1923 books?

Along the lines of what is outlined above, I have done a bit of an experiment to see how difficult it would be to supplement our physical holdings with the digital holdings of the Internet Archive. After all, the content there is free. Here's how:

  1. Dump - Export all of your bibliographic MARC
     records to a file.

  2. Parse - Extract the authors, titles, and
     other identifying information from a MARC
     record.

  3. Search - Use the result of Step #2 to create
     REST-like searches of the Internet Archive
     making sure results are returned as XML (or
     some other machine-readable format).

  4. Verify - Validate the search results making
     sure they correctly match the MARC. There may
     be false hits, or there may be multiple hits.

  5. Download - For each Internet Archive record
     that adequately matches the MARC record,
     mirror the remote Internet Archive version of
     the data locally, both the PDF and the plain
     text.

  6. Update - For each downloaded record, update the
     MARC record with two additional URLs. One
     pointing to the Internet Archive, and another
     pointing to your local mirror.

  7. Go to Step #2 - Continue the process for each
     record in your set of MARC records.

  8. Reindex - Make your MARC records searchable, as
     well as the full text that has been mirrored.

  9. Provide services - Enable search against the index.
     Search results can point to your local physical
     copy, your local mirrored copy, as well as the
     remote (canonical) Internet Archive copy. Provide
     services against the results enabling users to do
     things like: print-on-demand, bind, do concordance
     against, generate word cloud, put on reserve, add
     to a syllabus, annotate, rank, review, graphically
     illustrate the use of frequently used n-grams, etc.

As alluded to above, some of this work has been done. More specifically, using the MARC records from a thing called the "Catholic Portal", a graduate student and I did Steps #1 through #7. The hardest part is Step #4. The coolest part is Step #9.
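
Since Step #4 is the hard part, here is a rough sketch of one way to triage search results into "no hit", "one plausible match", and "multiple hits needing a human". It compares normalized titles using Python's difflib; the 0.9 threshold, the function names, and the shape of the candidate records are all assumptions made for illustration, and a real implementation would likely compare authors and dates as well.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())


def title_similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def classify(marc_title, candidates, threshold=0.9):
    """Sort search results into 'none', 'match', or 'ambiguous'.

    `candidates` is assumed to be a list of dicts with a 'title' key,
    such as the records returned by a search in Step #3.
    """
    hits = [c for c in candidates
            if title_similarity(marc_title, c["title"]) >= threshold]
    if not hits:
        return "none", hits
    if len(hits) == 1:
        return "match", hits
    # More than one plausible hit: leave it for manual review.
    return "ambiguous", hits


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    results = [
        {"identifier": "example-a", "title": "The imitation of Christ"},
        {"identifier": "example-b", "title": "Walden"},
    ]
    print(classify("The Imitation of Christ.", results))
```

Only the records classified as "match" would go on to Step #5 automatically; the "ambiguous" ones are where the false and multiple hits mentioned in Step #4 end up.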

If we, as a profession, were to get to Step #9, then we would be seen as providing truly cutting edge and valuable services to our constituents. Step #9 represents the growth opportunity. 

-- 
Eric Lease Morgan
University of Notre Dame