Re: Harvesting of data by Google

From: Stephens, Owen <o.stephens_at_nyob> Date: Wed, 18 Mar 2009 09:38:25 +0000 To: NGC4LIB_at_LISTSERV.ND.EDU

I've experienced the vagaries of Google even on a straightforward website - I was responsible for a UK Uni web presence and one day our entire site was dropped from the Google index for reasons we never discovered. After a lot of investigation, and many attempts to raise the issue with Google (which never got us anywhere), the site reappeared as suddenly as it had disappeared.

I don't think we should necessarily rely solely on Google. However, publishing on the web in a crawlable way is still fundamental to being found by any search engine, and by users. I don't think this rules out vertical search approaches at all: Amazon provides its own search as well as regularly appearing in Google's search results. But we need to ensure the two approaches go together rather than one without the other. Currently most library catalogues provide vertical search but have no chance of appearing in broader web searches, whether from Google or anyone else.
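
As a rough illustration of what "crawlable" can mean in practice, one common step is to publish a sitemap that lists a stable URL for every catalogue record. What follows is only a minimal sketch in Python, assuming the sitemaps.org 0.9 protocol; the record URLs and the build_sitemap helper are hypothetical and do not describe any particular catalogue's setup.

```python
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(record_urls):
    """Return a sitemap XML string with one <url><loc> entry per record URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in record_urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical record URLs, for illustration only.
records = [
    "http://catalogue.example.org/record/1",
    "http://catalogue.example.org/record/2",
]
sitemap_xml = build_sitemap(records)
print(sitemap_xml)
```

A sitemap only invites crawling; it does not guarantee that a search engine will index every record it lists.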

Owen

Owen Stephens
Assistant Director: eStrategy and Information Resources
Central Library
Imperial College London
South Kensington Campus
London
SW7 2AZ

t: +44 (0)20 7594 8829
e: o.stephens_at_imperial.ac.uk
> -----Original Message-----
> From: Next generation catalogs for libraries
> [mailto:NGC4LIB_at_LISTSERV.ND.EDU] On Behalf Of Warwick Cathro
> Sent: 18 March 2009 01:09
> To: NGC4LIB_at_LISTSERV.ND.EDU
> Subject: [NGC4LIB] Harvesting of data by Google
> 
> Actually, we have found at the National Library of Australia that it
> isn't that easy to have our data harvested and indexed by Google.
> 
> We have been using properly constructed Google Site Maps for our
> various discovery services (such as Picture Australia), but currently
> Google is harvesting only a proportion of our records, and the
> proportion that is harvested has fluctuated wildly during the past 12
> months.
> 
> Currently Google is harvesting 49% of our Picture Australia records,
> but a few months ago it was only 5%.
> 
> We have contacted Google many times about this - and even had a face-
> to-face meeting with them - but the problem doesn't get fixed.
> 
> We have contacted colleagues in the National Library of New Zealand and
> they are having the same problems.
> 
> There is a further issue, in that we would like to preference
> Australian collections (libraries, museums, archives, university
> repositories, etc) in the relevance ranking for our discovery services.
> Even if Google was reliably harvesting 100% of our data, we may not
> like their relevance ranking.  This is a possible argument for
> maintaining specialised portals, even if we want Google to drive
> traffic to those portals.
> 
> Two examples of our discovery services are:
> http://www.pictureaustralia.org/index.html
> http://ndpbeta.nla.gov.au/ndp/del/home
> 
> We have 8 of these services, but we are currently integrating them into
> One Big Service.
> 
> Warwick
> 
> Warwick Cathro
> Assistant Director-General, Resource Sharing and Innovation
> National Library of Australia
> Ph: 02 6262 1403
> Fax: 02 6273 1133
> Mob: 0411 868 411
> 
> 
> -----Original Message-----
> From: Next generation catalogs for libraries
> [mailto:NGC4LIB_at_LISTSERV.ND.EDU] On Behalf Of Alexander Johannesen
> Sent: Wednesday, 18 March 2009 11:46 AM
> To: NGC4LIB_at_LISTSERV.ND.EDU
> Subject: Re: [NGC4LIB] What do users understand?
> 
> On Wed, Mar 18, 2009 at 02:20, Weinheimer Jim <j.weinheimer_at_aur.edu>
> wrote:
> > It would be nice if it were that simple, but Google's algorithm (the
> > entire strength of Google) is based on trillions of links to all
> > different sites (the page with most links to it by the most linked =
> > #1). There's nothing like that option in the library, and even
> > Google's algorithm isn't so hot in Google Books.
> 
> Actually, this was true a few years ago. They've moved on, and other
> things are at play now. Besides, all it takes for this to work in
> libraries if links are (indeed still) the main stew booster, is for
> libraries to properly share their stuff! Not hard at all. C'mon, make
> it easier for Google to help you out.
> 
> > Google's ranking by "relevance" (a semi-propagandistic term since it
> > means something quite different from the normal sense of "relevance")
> 
> No it doesn't; it means whatever it means in the context of where you
> are, just like in real life. Within Google it is relevant to the words
> you typed in. Don't like the relevance? Switch your words, just like in
> real life.
> 
> > would need to be recreated in the catalog, but how? By items most
> > checked out (most popular?) By getting into publisher databases and
> > trying to arrange by printing statistics? Or by retail statistics and
> > best-sellers?
> 
> Ah, well *now* we're cookin'! :) I've got heaps of stuff about this,
> mostly prototypes and hacks before I quit the library world, things
> like "Heat Engine" which uses inverse cumulative histograms to track
> real popularity of books (without the dreaded short-term effects of
> 'peak', and deals with normative decline as opposed to pure
> statistics), or the "Memory Peak" (dealing with books' borrowing history,
> tracking subject headings over time and matching them against keywords
> people search with in the OPAC), another system for mapping website
> searches against catalog searches and finding correlations, or if you
> try http://ll01.nla.gov.au Kent Fitch (bless his heart!) played with
> the ABC news feed, pulling it down and trying to find resources that
> somewhat match the news items in question (right-hand side box).
> Funky and fun, and sometimes really helpful and relevant.
> 
> In fact, library developers (and not just programmers) should be
> spending a lot of their time trying this stuff out and thinking about
> new ways to deal with what you've got, because, well, it's what you've
> got, and you won't get much else by the sound of it. :(
> 
> > Or by "rate this book!" Let's say that Nietzsche's "Thus spoke
> > Zarathustra" got 200 votes while Kant's "Critique of the Pure Reason"
> > only got 50. What would somebody conclude?!
> 
> Well, there are other and better ways. For example, make your OPACs and
> catalogs more in the vein of social websites, and introduce roles on it
> where librarians can overlay an expert layer over the data. By that,
> reference librarians can surf and search around, tagging their books,
> making lists of recommendations and so forth. Make your systems with more
> roles in them, the *same* system, and this will open up opportunities
> you just don't have right now.
> 
> > While some of these tools are interesting, I'm not sure which ones
> > really belong in a library....
> 
> Again, a friendly reminder that your users are ... *everyone*. So yes,
> they probably belong in the library.
> 
> 
> Regards,
> 
> Alex
> --
> -----------------------------------------------------------------------
>  Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
> ------------------------------------------ http://shelter.nu/blog/ ----