Harvesting of data by Google

From: Warwick Cathro <wcathro_at_nyob>
Date: Wed, 18 Mar 2009 12:08:56 +1100
To: NGC4LIB_at_LISTSERV.ND.EDU
Actually, we have found at the National Library of Australia that it isn't that easy to have our data harvested and indexed by Google.

We have been using properly constructed Google Site Maps for our various discovery services (such as Picture Australia), but Google currently harvests only a proportion of our records, and that proportion has fluctuated wildly during the past 12 months.
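
For readers unfamiliar with the mechanism, here is a minimal sketch of generating a sitemap file in the standard sitemaps.org format. It is only an illustration: the `build_sitemap` helper and the example.org record URLs are hypothetical, and nothing here reflects how the National Library of Australia actually builds its sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(records):
    """Return sitemap XML for an iterable of (url, lastmod) pairs.

    `lastmod` is an ISO date string such as "2009-03-01".
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in records:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical record URLs, for illustration only.
example = build_sitemap([
    ("http://www.example.org/record/1", "2009-03-01"),
    ("http://www.example.org/record/2", "2009-03-10"),
])
print(example)
```

The sitemap protocol caps each file at 50,000 URLs, so a large collection would need several such files plus a sitemap index that lists them.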

Currently Google is harvesting 49% of our Picture Australia records, but a few months ago it was only 5%.

We have contacted Google many times about this - and even had a face-to-face meeting with them - but the problem doesn't get fixed.

We have contacted colleagues in the National Library of New Zealand and they are having the same problems.

There is a further issue, in that we would like to preference Australian collections (libraries, museums, archives, university repositories, etc) in the relevance ranking for our discovery services.  Even if Google were reliably harvesting 100% of our data, we might not like their relevance ranking.  This is a possible argument for maintaining specialised portals, even if we want Google to drive traffic to those portals.
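
One simple way to express that kind of preference is to multiply the score of results from preferred sources by a boost factor before sorting. The sketch below assumes each result already carries a numeric relevance score; the `rerank` function, the source labels and the boost value of 1.5 are all hypothetical choices for illustration, not a description of any existing service.

```python
def rerank(results, preferred_sources, boost=1.5):
    """Sort results by score, boosting those from preferred sources.

    `results` is a list of dicts with 'title', 'source' and 'score' keys.
    """
    def adjusted(result):
        factor = boost if result["source"] in preferred_sources else 1.0
        return result["score"] * factor

    return sorted(results, key=adjusted, reverse=True)


# Hypothetical search hits, for illustration only.
hits = [
    {"title": "A", "source": "overseas-archive", "score": 0.9},
    {"title": "B", "source": "nla", "score": 0.7},
    {"title": "C", "source": "state-library", "score": 0.5},
]
ranked = rerank(hits, preferred_sources={"nla", "state-library"})
print([hit["title"] for hit in ranked])
```

A multiplicative boost keeps a highly relevant overseas record above a weakly relevant local one, which a hard "local first" sort would not.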

Two examples of our discovery services are:
http://www.pictureaustralia.org/index.html
http://ndpbeta.nla.gov.au/ndp/del/home

We have 8 of these services, but we are currently integrating them into One Big Service.

Warwick

Warwick Cathro
Assistant Director-General, Resource Sharing and Innovation
National Library of Australia
Ph: 02 6262 1403
Fax: 02 6273 1133
Mob: 0411 868 411


-----Original Message-----
From: Next generation catalogs for libraries [mailto:NGC4LIB_at_LISTSERV.ND.EDU] On Behalf Of Alexander Johannesen
Sent: Wednesday, 18 March 2009 11:46 AM
To: NGC4LIB_at_LISTSERV.ND.EDU
Subject: Re: [NGC4LIB] What do users understand?

On Wed, Mar 18, 2009 at 02:20, Weinheimer Jim <j.weinheimer_at_aur.edu> wrote:
> It would be nice if it were that simple, but Google's algorithm (the
> entire strength of Google) is based on trillions of links to all
> different sites (the page with most links to it by the most linked = #1).
> There's nothing like that option in the library, and even Google's
> algorithm isn't so hot in Google Books.

Actually, this was true a few years ago. They've moved on, and other things are at play now. Besides, all it takes for this to work in libraries, if links are (indeed still) the main stew booster, is for libraries to properly share their stuff! Not hard at all. C'mon, make it easier for Google to help you out.

> Google's ranking by "relevance" (a semi-propagandistic term since it
> means something quite different from the normal sense of "relevance")

No, it doesn't; it means whatever it means in the context of where you are, just like in real life. Within Google it is relevant to the words you typed in. Don't like the relevance? Switch your words, just like in real life.

> would need to be recreated in the catalog, but how? By items most
> checked out (most popular?) By getting into publisher databases and
> trying to arrange by printing statistics? Or by retail statistics and best-sellers?

Ah, well *now* we're cookin'! :) I've got heaps of stuff about this, mostly prototypes and hacks from before I quit the library world: things like "Heat Engine", which uses inverse cumulative histograms to track the real popularity of books (without the dreaded short-term effects of 'peak', and dealing with normative decline as opposed to pure statistics); "Memory Peak" (dealing with books' borrowing history, tracking subject headings over time and matching them against the keywords people search with in the OPAC); and another system for mapping website searches against catalog searches and finding correlations. Or, if you try http://ll01.nla.gov.au, Kent Fitch (bless his heart!) played with the ABC news feed, pulling it down and trying to find resources that somewhat match the news items in question (right-hand side box).
Funky and fun, and sometimes really helpful and relevant.
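
The message does not say how "Heat Engine" works internally, so the following is not that algorithm. It is a sketch of a different, simpler technique with a similar goal: an exponentially time-decayed loan count, in which recent loans weigh more than old ones, so a short burst of borrowing fades rather than dominating forever. The function name and the 180-day half-life are assumptions made for illustration.

```python
from datetime import date, timedelta


def decayed_popularity(loan_dates, today, half_life_days=180.0):
    """Sum of loans, each weighted by 0.5 ** (age / half-life).

    A loan made today counts 1.0; one made a half-life ago counts 0.5.
    Loans dated in the future are ignored.
    """
    score = 0.0
    for loan_date in loan_dates:
        age_days = (today - loan_date).days
        if age_days >= 0:
            score += 0.5 ** (age_days / half_life_days)
    return score


# Hypothetical loan history, for illustration only.
today = date(2009, 3, 18)
loans = [today, today - timedelta(days=180), today - timedelta(days=360)]
print(decayed_popularity(loans, today))
```

A shorter half-life makes the score follow current demand more closely; a longer one favours books with steady long-term use.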

In fact, library developers (and not just programmers) should be spending a lot of their time trying this stuff out and thinking about new ways to deal with what you've got, because, well, it's what you've got, and you won't get much else by the sound of it. :(

> Or by "rate this book!" Let's say that Nietzsche's "Thus spoke Zarathustra"
> got 200 votes while Kant's "Critique of Pure Reason" only got 50.
> What would somebody conclude?!

Well, there are other and better ways. For example, make your OPACs and catalogs more in the vein of social websites, and introduce roles in them so that librarians can overlay an expert layer over the data. That way, reference librarians can surf and search around, tagging their books, making lists of recommendations and so forth. Build more roles into your systems, the *same* system, and this will open up opportunities you just don't have right now.

> While some of these tools are interesting, I'm not sure which ones really belong in a library....

Again, a friendly reminder that your users are ... *everyone*. So yes, they probably belong in the library.


Regards,

Alex
--
---------------------------------------------------------------------------
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
------------------------------------------ http://shelter.nu/blog/ --------
Received on Tue Mar 17 2009 - 21:10:25 EDT