Re: Integrating Google Book Search and OPACs

From: Kent Fitch <kent.fitch_at_nyob>
Date: Sun, 16 Mar 2008 15:37:02 +1100
To: NGC4LIB_at_LISTSERV.ND.EDU
Technically, it is very easy to get this data, retrieved by the
client's browser, *back* to your server - the javascript invoked to
process the JSON data returned by the Google Books viewability API can
"summarise" this data and invoke a service request on your server to
"remember" bits of it.

I'm not sure what summarisation would be permitted by Google, but one
I think many libraries would find extremely useful is simply this:
given an ISBN or LCCN, the library would like to know whether Google
has full text, partial text, or just metadata available.

When ranking search results at least partially on the online
availability of full or partial text, it isn't necessary to store
anything else (such as a GBS url), as these can be retrieved at
display time using the GBS API; but a simple flag indicating
availability of full or partial text is incredibly useful in relevance
ranking and clustering, and is something which is impractical to do at
display time with a web-client based API ("Hey Google: I got these
20,000 hits - which ones have you got full text for, so that I can
order them/cluster them here in the browser?")
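
To illustrate the sort of use I mean (a crude sketch - the weights are
invented purely for illustration):

  // Boost a result's base relevance score using the stored GBS flag.
  function boostedScore(baseScore, gbsFlag) {
    if (gbsFlag === "full")    return baseScore * 1.5;
    if (gbsFlag === "partial") return baseScore * 1.2;
    return baseScore;          // "noview" or unknown: no boost
  }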

The Google "static link to GBS content" approach (
http://code.google.com/apis/books/static-links.html ) implies that the
library /website constructing the page and link must have
accumulated/remembered/stored the data behind these links somehow
anyway.  Are Google saying it s OK for humans to manually gather these
links, but not OK for them to be gathered as a side-effect normal
browsing which imposes no extra overhead on Google and causes no
disadvantage to Google's business model?  It seems likely that Google
would want to encourage libraries to drive traffic to their sites, and
hence Google would encourage libraries to rank results with links to
Google Books higher.

Libraries and others freely contribute lots of bibliographic data to
Google, either directly or by letting Google crawl their websites, to
say nothing of the millions of books purchased by libraries being
scanned by Google.  It would be hypocritical of Google not to let the
most basic of information flow the other way, and ultimately
counterproductive and futile, as LibraryThing demonstrated with its
distributed harvesting of Google Book data last year.

Alternatively, if each library whose books have been scanned by Google
simply maintained a web-based list of those books' ISBNs or LCCNs,
anyone could construct a list of books in Google's repository,
although knowing which were full text, partial text or metadata-only
would be harder (guessing from likely copyright restrictions) or
impossible (because many books in copyright have extensive text
visibility in Google Books based on agreements with rights holders).
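
The aggregation side would be trivial - a sketch, assuming each
library publishes one identifier per line and the list bodies have
already been retrieved:

  // Merge per-library identifier lists into one set of identifiers
  // known to be in Google's repository.
  function mergeLists(texts) {
    var ids = {};
    for (var i = 0; i < texts.length; i++) {
      var lines = texts[i].split("\n");
      for (var j = 0; j < lines.length; j++) {
        var id = lines[j].replace(/^\s+|\s+$/g, "");  // trim
        if (id) ids[id] = true;
      }
    }
    return ids;  // object keys used as a set of ISBNs/LCCNs
  }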

It is worth remembering that Google Books isn't the only game in town
- Microsoft Live Books has lots of partial and full text, excellent
scanned image quality and a pretty nifty user interface; we
(libraries) should be making the best use of all these resources.

Kent Fitch

On Sat, Mar 15, 2008 at 3:56 PM, Tim Spalding <tim_at_librarything.com> wrote:
> LibraryThing was in the first batch, and I brought in a number of the
>  libraries they picked to be with us. I think NGC4LIB might be a good
>  place to discuss the Google Book Search API and its limitations. It's
>  a very knotty thing.
>
>  The knot arises from the fact that the Google API isn't XML, like
>  Amazon's AWS. It's JavaScript/JSON. This is at the root of both GBS's
>  power and its limitations.
>
>  Power: Because it's JavaScript, everything happens client-side. If you
>  can extract an ISBN, OCLC or LCCN from the page, the rest is up to
>  Google and JavaScript. This means it can be added to almost any OPAC,
>  without any back-end systems integration. For libraries, this is
>  critical.
>
>  Limitation: Because it's happening in JavaScript, your library doesn't
>  "get" the data. It happens in the patron's browser and your server
>  never sees it. For this reason and because Google says so, you can't
>  *store* any of the data. You can't integrate more deeply. You can't
>  search against it, etc.
>
>  Lastly, the API doesn't expose ANY bibliographic data. (Google says
>  this is for license reasons - which I think means OCLC won't let
>  them.) You send it an identifier and you get that back, together with
>  URLs to its place on Google and whether they have a full version, a
>  partial one, or just an info page. You don't even get a title back.
>  This made it very hard for LibraryThing, which always works at the
>  "work" level. For many of our books, we're sending 100-200
>  identifiers and getting back a mess of URLs. Without titles, and with
>  cover-image coverage spotty, we have no way to identify all these
>  hits.
>
>  Best,
>  Tim
>
Received on Sun Mar 16 2008 - 00:20:44 EDT