But my main point was: don't most freely-accessible digitized collections consist of books published prior to 1923?
Of course Google is the big exception. But until the Google Books settlement is finalized, we won't know how many post-1923 books Google will be able to offer in full text.
Bernie Sloan
--- On Thu, 7/1/10, Eric Lease Morgan <emorgan_at_ND.EDU> wrote:
From: Eric Lease Morgan <emorgan_at_ND.EDU>
Subject: Re: [NGC4LIB] Book-scanning projects - a question
To: NGC4LIB_at_LISTSERV.ND.EDU
Date: Thursday, July 1, 2010, 1:25 PM
On Jul 1, 2010, at 11:40 AM, B.G. Sloan wrote:
> Most of the book-scanning projects are focusing on digitizing works in the public domain, right? And the public domain is basically books published before 1923, right? So, aren't most of these projects the equivalent of building a physical library collection of pre-1923 books?
Along the lines of what is outlined above, I have done a bit of an experiment to see how difficult it would be to supplement our physical holdings with the digital holdings of the Internet Archive. After all, the content there is free. Here's how:
1. Dump - Export all of your bibliographic MARC
records to a file.
2. Parse - Extract the authors, titles, and
other identifying information from a MARC
record.
3. Search - Use the result of Step #2 to create
REST-like searches of the Internet Archive
making sure results are returned as XML (or
some other machine-readable format).
4. Verify - Validate the search results making
sure they correctly match the MARC. There may
be false hits, or there may be multiple hits.
5. Download - For each Internet Archive record
that adequately matches the MARC record,
mirror the remote Internet Archive version of
the data locally, both the PDF and the plain
text.
6. Update - For each downloaded record, update the
MARC record with two additional URLs. One
pointing to the Internet Archive, and another
pointing to your local mirror.
7. Go to Step #2 - Continue the process for each
record in your set of MARC records.
8. Reindex - Make your MARC records searchable,
along with the full text that has been mirrored.
9. Provide services - Enable search against the index.
Search results can point to your local physical
copy, your local mirrored copy, as well as the
remote (canonical) Internet Archive copy. Provide
services against the results enabling users to do
things like: print on demand, bind, build a
concordance, generate a word cloud, put on reserve,
add to a syllabus, annotate, rank, review, chart
frequently used n-grams, etc.
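As a rough illustration of Step #3, here is a minimal Python sketch that turns a title and author (already extracted from a MARC record in Step #2) into a search URL. The endpoint, the query field names (title, creator, mediatype), and the parameters (q, fl[], rows, output) are assumptions based on the Internet Archive's advanced search interface and should be checked against its current documentation. The function name build_search_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the Internet Archive's documentation.
SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_search_url(title, author, rows=10, output="json"):
    """Build a search URL from a title and author taken from a MARC record."""
    # Replace double quotes so they cannot break the phrase queries.
    clean_title = title.replace('"', " ").strip()
    clean_author = author.replace('"', " ").strip()

    # Assumed field names: title, creator, mediatype.
    query = (
        f'title:"{clean_title}" AND creator:"{clean_author}" '
        "AND mediatype:texts"
    )
    params = [
        ("q", query),
        ("fl[]", "identifier"),  # fields to return for each hit
        ("fl[]", "title"),
        ("fl[]", "creator"),
        ("rows", str(rows)),     # maximum number of hits
        ("output", output),      # machine-readable result format
    ]
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("The Imitation of Christ", "Thomas, a Kempis"))
```

The sketch only builds the URL; fetching and parsing the response is left out so that nothing here depends on the network.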
As alluded to above, some of this work has been done. More specifically, using the MARC records from a thing called the "Catholic Portal", a graduate student and I did Steps #1 through #7. The hardest part is Step #4. The coolest part is Step #9.
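One possible way to approach Step #4 is fuzzy string comparison. The sketch below is not the method used in the experiment; it simply compares normalized titles and authors with Python's difflib and sorts each record into "match", "ambiguous" (multiple hits), or "none". The 0.85 threshold, the function names, and the shape of each hit (a dict with "title" and "creator" keys) are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def similarity(a, b):
    """Return a ratio between 0.0 and 1.0 for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def classify_hits(marc_title, marc_author, hits, threshold=0.85):
    """Classify search hits for one MARC record.

    Returns (status, matches), where status is "match" for exactly one
    acceptable hit, "ambiguous" for several, and "none" for no hits.
    The threshold is an arbitrary starting point, not a tested value.
    """
    matches = [
        hit for hit in hits
        if similarity(marc_title, hit["title"]) >= threshold
        and similarity(marc_author, hit["creator"]) >= threshold
    ]
    if len(matches) == 1:
        return "match", matches
    if len(matches) > 1:
        return "ambiguous", matches
    return "none", matches


if __name__ == "__main__":
    hits = [
        {"title": "The imitation of Christ.", "creator": "Thomas, a Kempis"},
        {"title": "Moby Dick", "creator": "Melville, Herman"},
    ]
    print(classify_hits("The Imitation of Christ /", "Thomas, a Kempis", hits))
```

Records that come back "ambiguous" or "none" would presumably need human review, which is part of why this step is hard.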
If we, as a profession, were to get to Step #9, then we would be seen as providing truly cutting-edge and valuable services to our constituents. Step #9 represents the growth opportunity.
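To give a sense of how small some Step #9 services can be once the plain text is mirrored locally, here is a minimal Python sketch that counts the most frequent n-grams in a text. The function name top_ngrams and the simple tokenizing rule (runs of letters and apostrophes) are assumptions for illustration; a real service would need better tokenization and stop-word handling.

```python
import re
from collections import Counter


def top_ngrams(text, n=2, k=5):
    """Return the k most frequent n-grams in text as (ngram, count) pairs."""
    # Naive tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n words across the token list.
    windows = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows).most_common(k)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the cat slept"
    print(top_ngrams(sample, n=2, k=3))
```

The counts returned here could feed a word cloud or a frequency chart.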
--
Eric Lease Morgan
University of Notre Dame
Received on Thu Jul 01 2010 - 13:39:02 EDT