Re: "So, Can Google Use OCLC Records? Yes, But"

From: B.G. Sloan <bgsloan2_at_nyob>
Date: Fri, 11 Sep 2009 20:25:07 -0700
To: NGC4LIB_at_LISTSERV.ND.EDU

Jon Gorman said:

"Except, of course, it seems like plenty of people have known.  I mean, I'd have to do some digging but I know I've read about OCLC data in Google before.  It's not a huge secret."

Maybe I didn't phrase my questions correctly. But I think if Karen Coyle has questions about Google's use of OCLC metadata, then maybe it's not as obvious as Jon Gorman thinks it is.

Bernie Sloan

--- On Fri, 9/11/09, Jon Gorman <jonathan.gorman_at_GMAIL.COM> wrote:

From: Jon Gorman <jonathan.gorman_at_GMAIL.COM>
Subject: Re: [NGC4LIB] "So, Can Google Use OCLC Records? Yes, But"
To: NGC4LIB_at_LISTSERV.ND.EDU
Date: Friday, September 11, 2009, 8:32 PM

On Fri, Sep 11, 2009 at 6:04 PM, B.G. Sloan <bgsloan2_at_yahoo.com> wrote:
>
> There's an interesting quote in this LJ article:
>
> "LJ queried whether Google is using WorldCat data. Google metadata point man Jon Orwant responded, '...We get a nearly-full OCLC feed and it substantially improves the quality of our metadata. We've been using it for years and are happy with it. We also get individual library catalogs and commercial data feeds. We have over 100 metadata sources...'"
>
> This begs a couple of questions:
>
> 1. If Google has been getting "a nearly-full OCLC feed" of metadata from OCLC "for years" why is the library world only just now finding out about it? Why did it take a persistent Karen Coyle to pry this information out of them?
>

Except, of course, it seems like plenty of people have known.  I mean,
I'd have to do some digging but I know I've read about OCLC data in
Google before.  It's not a huge secret.

I think phrasing your questions this way is a bit misleading, and
it's feeding into this weird clumping of assumptions and assertions
that seem to be snowballing here.  Geoff Nunberg was the one who
claimed that they weren't using LC or OCLC data.  It doesn't seem like
he did a huge amount of investigative work.  I haven't seen either
Google or OCLC deny or "hide" their relation and I know I've seen and
heard others mention it.

I didn't do an exhaustive search, but I know I've read references to
them in some books and came up with these with a quick search:

* "OCLC and Google to exchange data, link digitized books to WorldCat"
http://www.oclc.org/news/releases/200811.htm.

* Another article, from 2003: "OCLC Project Opens WorldCat Records to
Google"  http://newsbreaks.infotoday.com/nbreader.asp?ArticleID=16592

Phrasing the questions the way you did starts to seem a bit like that
old rhetorical trick of making it so either a yes or no is incriminating.
("So Mr. Sloan, have you stopped pirating books on Bibapster?")  Not
saying you did it on purpose, but I think things are snowballing a bit
and people might be responding before digging into actual evidence.
(I'm probably just as guilty of that as anyone, oh well.)

Here are a couple of questions that come to mind about this whole thing:

What is the accuracy rate of different groups' metadata?  Is there
someone with lower rates?

How much of the bibliographic universe does WorldCat cover?

Is there enough information in the average WorldCat record to populate
all the metadata we see in a Google Books page?  How could we obtain
any metadata it couldn't provide?  Can we do it in reasonable time
and at reasonable expense?

What are the error rates in the various metadata sources?  How does
OCLC compare?  What is the extent of OCLC records?

> 2. If Google has been getting "a nearly-full OCLC feed" of metadata from OCLC "for years" and "it substantially improves the quality of [their] metadata" why do we see the problems that folks like Geoff Nunberg have pointed out?
>

Good question, but there's a danger that it can be interpreted as
implying that if Google were somehow synthesising WorldCat's data
perfectly, the quality of Google Books overall would be drastically
better.  There have been plenty of comments that have started to shine
some light on the very complex issues of handling that scale of
information and metadata from that many sources.

> Also Google's Jon Orwant is quoted as saying: "We have over 100 metadata sources."  To me that says that non-librarians don't hold library metadata in the same high regard that librarians do. Our metadata is only one of many sources used in Google Book Search.
>

Honestly, I don't know how high a regard I have for our data.  I deal
with it every day.  It frequently proves frustrating.  I'm not
saying that other sources are always better, but they frequently are.
Where's the reading level of books?  I can get that from Amazon.  How
frequently do I find that figuring out a monographic series is a mess
and end up turning to LibraryThing?  Frequently enough that when I'm
standing in my public library I get annoyed that the terminals only
let me connect to the catalog.  When I could not remember the title of
the book "Our Gods Wear Spandex", it was Amazon that found it for me.
(I tried a couple of things; for some reason I was convinced it was
"Our Heroes Wear Capes".)  I was searching several sources in parallel.
Amazon won on the search "history of superheros".  It matched a phrase
in a review and I knew I was searching for a new release.  Why didn't
the library catalog record match?  It had a subject heading that
was something like "Comic books, strips, etc. -- History".  It didn't
actually have the word superhero in it, I believe.  I could go into a
whole rant about how the concept of superheroes occurs in other media,
but that's just because I'm a geeky fanboy ;).

The easiest way to reassure people of the quality of our data would
be to demonstrate it by showing rates of errors.  Or even better, high
traffic because people are choosing to use it.  My friends seem to
find all sorts of weird places to find information, but many have
become disenchanted with library records compared to other media.
I increasingly have a hard time believing it's simply a matter of
ignorance or marketing spin.

I think Google Books has significant problems and challenges facing
it.  However, I also think we shouldn't get too comfortable and assume
we've already figured them out ourselves.

Jon Gorman