Re: Relevance ranking: was Aqua Brow

From: Stephens, Owen <o.stephens_at_nyob>
Date: Fri, 4 Jan 2008 13:03:55 -0000
To: NGC4LIB_at_listserv.nd.edu
I'm slightly reluctant to get involved with this, as I think this is well-trodden ground on the list, but it is Friday, and it's almost lunchtime, so...

I think the point that Rob was making is that Google does more than just use a link as a 'vote' - it also uses the link to infer information about the thing being linked to. So, if I link to the Wikipedia page on Dostoyevsky using the text "here is some useful information on Dostoievski", then this would mean the Wikipedia page would start to appear in Google results under searches for Dostoievski as well as Dostoyevsky.

By exploiting the 'network effect' Google starts to build up 'concepts' as opposed to just text, as each web page is effectively 'tagged' by the pages linking to it and the text used to link to it. This is clearly an informal mechanism for building concepts as opposed to the more formal authority files used in libraries - but if every library in the world with a web-enabled catalogue containing references to Dostoyevsky (any spelling) hyperlinked this to the Wikipedia page, Google would eventually exploit this and get 'better' at searching across all alternatives. Also, as noted, this is open to exploitation using 'Google Bombs'.
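
To make the 'tagging' idea concrete, here is a minimal sketch of indexing a page under the words other pages use to link to it. This is only an illustration of the principle, not a description of how Google actually implements it; the page identifier, the sample links and the function names are all made up for the example.

```python
from collections import defaultdict

# Hypothetical data: (target page, link text) pairs harvested from other pages.
# The page identifier is a placeholder, not a real URL.
LINKS = [
    ("wikipedia/Dostoyevsky", "here is some useful information on Dostoievski"),
    ("wikipedia/Dostoyevsky", "Dostoyevsky"),
    ("wikipedia/Dostoyevsky", "click here"),
]

def build_anchor_index(links):
    """Index each target page under every word used in links pointing at it."""
    index = defaultdict(set)
    for target, anchor_text in links:
        for word in anchor_text.lower().split():
            index[word].add(target)
    return index

def search(index, term):
    """Return the pages that other pages have labelled with this term."""
    return sorted(index.get(term.lower(), set()))

index = build_anchor_index(LINKS)
print(search(index, "Dostoievski"))  # found via link text, not page text
print(search(index, "Dostoyevsky"))
```

Both searches return the same page, even though the page itself need never contain the spelling "Dostoievski". The 'click here' link also gets indexed, but only under words nobody would search for, so it contributes nothing useful.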

James states that this has 'nothing to do with concept searching' - perhaps this is where the disagreement lies. When a librarian adds a subject heading to a book they are saying 'it is about this concept'. I believe that when a web author links to a page with a text label, they are often also saying 'it is about this concept'. When people use meaningless text to link to pages, this is lost (hence 'click here' is really bad text to use for a link, and not good practice), but over a large enough population, with enough people using reasonably good practice, it seems to work.

I'm not saying this works perfectly - it doesn't - just trying to clarify that Google can search more than simply the text in the page it is finding.

As for the WWI example, it is clearly not true that searching for WWI won't find primary documentation - although again, I'm not suggesting that searching for WWI is a particularly good search, or that the results you get from Google are particularly good. The first hit for WWI is the Wikipedia article (surprise), and it contains two films which I would certainly describe as primary material. Note also that a search for 'great war' turns up the same Wikipedia article. Going further and searching for "WWI Primary" takes you to http://wwi.lib.byu.edu/index.php/Main_Page which contains more 'primary' sources. The point is that this works because of the way the web works - it links stuff together.

I should leave comments on the semantic web to Rob, as I'm sure he knows far more than me :), but in theory the "Semantic Web" (note capitalisation) would allow us to start linking together disparate terminologies and formally say 'this is the same as that', whereas at the moment we can only infer it from looking at the network of links and saying 'links that link to here are likely to encapsulate the same concept, which is represented by this page'.
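
As a rough sketch of what formally saying 'this is the same as that' buys you, the toy below records explicit equivalence statements and then answers whether two terms name the same concept. It is a plain union-find structure with invented names (ConceptSet, same_as, equivalent), not an actual Semantic Web API, and the terms are just the ones from this thread.

```python
class ConceptSet:
    """Toy store of explicit 'A is the same as B' statements (union-find)."""

    def __init__(self):
        self.parent = {}

    def find(self, term):
        """Return the representative term for the concept `term` belongs to."""
        self.parent.setdefault(term, term)
        while self.parent[term] != term:
            # Path halving: point the term at its grandparent as we walk up.
            self.parent[term] = self.parent[self.parent[term]]
            term = self.parent[term]
        return term

    def same_as(self, a, b):
        """Formally declare that terms a and b denote the same concept."""
        self.parent[self.find(a)] = self.find(b)

    def equivalent(self, a, b):
        """True if a and b have been linked, directly or through other terms."""
        return self.find(a) == self.find(b)

concepts = ConceptSet()
concepts.same_as("WWI", "world war one")
concepts.same_as("world war one", "great war")

print(concepts.equivalent("WWI", "great war"))       # linked via 'world war one'
print(concepts.equivalent("WWI", "world war two"))   # never declared the same
```

The difference from the link-based approach is that nothing here is inferred: two terms are treated as one concept only because someone has stated it, which is closer to what an authority file does.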

I'm tempted to launch into a discussion about the 'professional' status of librarians vs DRs, censorship and China, and a whole load of other points raised in this thread, but now it is lunchtime, so perhaps not today,

Owen

Owen Stephens
Assistant Director: e-Strategy and Information Resources
Imperial College London Library
Imperial College London
South Kensington
London SW7 2AZ


Tel: 020 7594 8829
Email: o.stephens_at_imperial.ac.uk


> -----Original Message-----
> From: Next generation catalogs for libraries
> [mailto:NGC4LIB_at_listserv.nd.edu] On Behalf Of Weinheimer Jim
> Sent: 04 January 2008 11:49
> To: NGC4LIB_at_listserv.nd.edu
> Subject: Re: [NGC4LIB] Relevance ranking: was Aqua Brow
>
> > I'm sorry Jim, but you are quite wrong in this assertion that Google
> > searches only text and that library catalogs represent concepts. The
> > broad PageRank technology is discussed in detail on wikipedia
> > (http://en.wikipedia.org/wiki/PageRank). You yourself cite Google Bombs,
> > which by their very nature show how google is searching exactly the
> > concepts you suggest. (http://en.wikipedia.org/wiki/Google_bomb for
> > those wanting more on google bombs)
>
> Sorry right back, but page rank has nothing to do with
> concepts. Here is the Page Rank explained:
> "PageRank relies on the uniquely democratic nature of the web
> by using its vast link structure as an indicator of an
> individual page's value. In essence, Google interprets a link
> from page A to page B as a vote, by page A, for page B. But,
> Google looks at more than the sheer volume of votes, or links
> a page receives; it also analyzes the page that casts the
> vote. Votes cast by pages that are themselves "important"
> weigh more heavily and help to make other pages "important""
>
> This says nothing about searching 50 different versions of a
> name, or the term for a concept, although it seems to use
> fuzzy algorithms, which are also based on text. (See fuzzy
> string searching: http://en.wikipedia.org/wiki/Fuzzy_string_searching)
>
> It is not searching different forms of "Dostoyevsky" in the
> way it is done in the library catalog. When you search the
> authorized form of "Dostoyevsky, Fyodor, 1821-1881", you are
> also retrieving all of the following forms. (This is taken
> from Bernhard's copy of the LC Authority file)
> Dostoievski, Fédor Mikhailovitch
> Dostoievski, Fiodor
> Dostojevski, F. M.
> Dostojewskij, Fjodor M.
> Tʻo-ssu-tʻo-yeh-fu-ssu-chi
> Dostoevsky, Fyodor
> Zuboskal
> Dostoevskiĭ, Fedor Mikhaĭlovich
> Dostoevskiĭ, F. M.
> Dostojewski, Fjedor Michailowitsch
> Dustūyafskī, Fīdūr
> Dostoievsky, F.
> Dosztojevszkij, Fjodor Mihajlovics
> Tu-ssu-tʻo-yeh-fu-ssu-chi
> Dostojewski
> Dostojewski, Fiodor
> Dostoevskij, Fedor
> Dostojewskij, F. M.
> Dostojevskij, F. M.
> Dostoevskiĭ, Fedor
> Dostojevskij, Fjodor
> D̲ostogiephski, Ph. M.
> Dostoïevsky, Th. M.
> D̲ostogiephsky, Phiontor Michaēlovits
> Dostoïevski, Fiodor
> Dostoiewskij
> Dostojewski, Fjodor
> Dostoevsky, Fedor
> Dostoïevsky, Fédor
> Dostoevsky, F. M.
> Dostojevskis, F.
> Dostoevski, F.
> Dostojewsky
> Dosṭoyevsḳi, Fyodor Mikhailovits'
> Dostogephskē, Th
> Dostojewski, Teodor
> Dāstavaskī
> D̲ostogephski
> Dostoyewski, Fedor
> Dosztojevszkij, F. M.
> Dosṭoyeṿsḳi, F. M.
> Dostojevskij, Fedor Michajlovič
>
> Some records are more complex than this.
>
> Google cannot do this. Another example that I use (that
> people probably get tired of) is: WWI. When you search wwi in
> Google, you get 600,000 hits, with the first one to Wikipedia.
>
> So, the information expert immediately asks: What are we
> looking at? Is this a good search? Someone who doesn't
> understand the problems will be happy with the search--that
> is, until you realize that this search *cannot find primary
> documents about WWI*. Why? Because nobody called it world war
> one until world war two began 20 years later.
>
> So, unless someone has gone in and manually added wwi to the
> primary documents, the text search for "WWI" cannot retrieve
> primary documents. This can be repeated with literally
> hundreds of thousands of examples.
>
> Google bombs work by citations (text) to specific pages. The
> famed "miserable failure" example (killed in Google--and a
> discussion could take place whether this is censorship--but
> it still works in Yahoo) is based on people adding links of
> "miserable failure" to the White House page of George Bush,
> but it has nothing to do with concept searching.
>
> > I can't help thinking that perhaps you are asking the wrong question
> > here. You ask google about 'dostoyevsky'. Without any additional
> > information they infer that you are asking about the russian author
> > and present you with a page full of results about him - primarily
> > summaries about him, his writing and the period in history as well as
> > a lot of detail on where to find more information.
> >
> > What question was it that you were trying to answer about Dostoyevsky
> > when starting the search? When he was born? What he wrote? What
> > question does it fail to answer in the first page of results? Knowing
> > that would really help in knowing how to build a better search tool.
>
> For this discussion, I just want to know what is available in
> Google by and about Dostoyevsky. A library catalog is
> designed to do this, while Google cannot do it. I don't think
> most people understand this. If we want to decide that the
> traditional goals of a catalog no longer apply: i.e. to show
> what a collection has by its authors, titles, and subjects,
> that would be one thing, but it must be debated first.
>
> Again, there is nothing wrong with Google, but it has major
> weaknesses. I also want to build a better search tool, but it
> is vital that we all understand the strengths and weaknesses
> of all the tools we presently have.
>
> James Weinheimer
>
Received on Fri Jan 04 2008 - 08:06:43 EST