Re: Relevance ranking: was Aqua Brow

From: Weinheimer Jim <j.weinheimer_at_nyob>
Date: Fri, 4 Jan 2008 12:48:52 +0100
To: NGC4LIB_at_listserv.nd.edu
> I'm sorry Jim, but you are quite wrong in this assertion that Google
> searches only text and that library catalogs represent concepts. The
> broad PageRank technology is discussed in detail on wikipedia
> (http://en.wikipedia.org/wiki/PageRank). You yourself cite Google Bombs,
> which by their very nature show how google is searching exactly the
> concepts you suggest. (http://en.wikipedia.org/wiki/Google_bomb for
> those wanting more on google bombs)

Sorry right back, but PageRank has nothing to do with concepts. Here is PageRank explained:
"PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important""
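
To make the "votes" idea concrete, here is a minimal sketch of a link-voting calculation in Python. It is a textbook-style power iteration, not Google's actual implementation; the function name, the damping value of 0.85, and the toy link graph are all assumptions for illustration. Note that the only input is who links to whom: nothing in it looks at names, terms, or concepts.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages from link structure alone.

    links maps each page to the list of pages it links to.
    (Hypothetical helper; damping and iteration count are assumed values.)
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank among the pages it "votes" for.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Toy graph (invented): A, C and D all link to B; B links to C.
toy_links = {"A": ["B"], "B": ["C"], "C": ["B"], "D": ["B"]}
ranks = pagerank(toy_links)
top_page = max(ranks, key=ranks.get)  # the page with the most weighted votes
```

In this toy graph, B comes out on top because it receives the most votes, and C ranks well because its single vote comes from the "important" page B.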

That description says nothing about searching 50 different versions of a name, or the term for a concept, although Google does seem to use fuzzy matching algorithms, which are also based on text. (See fuzzy string searching: http://en.wikipedia.org/wiki/Fuzzy_string_searching)
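
As a rough illustration of why fuzzy matching is still text matching, here is a sketch of edit distance (Levenshtein distance), one common fuzzy string technique. I am not claiming this is what Google uses; the function name and the sample strings are my own choices. It counts the character insertions, deletions, and substitutions needed to turn one string into another, so it can bridge close spellings but not forms that simply look different, such as "Zuboskal", which the LC authority record for Dostoyevsky lists as a variant.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (illustrative helper)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]


close = edit_distance("dostoyevsky", "dostoevsky")  # one letter apart
far = edit_distance("dostoyevsky", "zuboskal")      # most letters differ
```

A fuzzy search with a small distance threshold would accept the first pair and reject the second, even though a cataloger knows both forms refer to the same person.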

It is not searching different forms of "Dostoyevsky" in the way it is done in the library catalog. When you search the authorized form "Dostoyevsky, Fyodor, 1821-1881", you are also retrieving all of the following forms. (This is taken from Bernhard's copy of the LC Authority file.)
Dostoievski, Fédor Mikhailovitch
Dostoievski, Fiodor
Dostojevski, F. M.
Dostojewskij, Fjodor M.
Tʻo-ssu-tʻo-yeh-fu-ssu-chi
Dostoevsky, Fyodor
Zuboskal
Dostoevskiĭ, Fedor Mikhaĭlovich
Dostoevskiĭ, F. M.
Dostojewski, Fjedor Michailowitsch
Dustūyafskī, Fīdūr
Dostoievsky, F.
Dosztojevszkij, Fjodor Mihajlovics
Tu-ssu-tʻo-yeh-fu-ssu-chi
Dostojewski
Dostojewski, Fiodor
Dostoevskij, Fedor
Dostojewskij, F. M.
Dostojevskij, F. M.
Dostoevskiĭ, Fedor
Dostojevskij, Fjodor
D̲ostogiephski, Ph. M.
Dostoïevsky, Th. M.
D̲ostogiephsky, Phiontor Michaēlovits
Dostoïevski, Fiodor
Dostoiewskij
Dostojewski, Fjodor
Dostoevsky, Fedor
Dostoïevsky, Fédor
Dostoevsky, F. M.
Dostojevskis, F.
Dostoevski, F.
Dostojewsky
Dosṭoyevsḳi, Fyodor Mikhailovits'
Dostogephskē, Th
Dostojewski, Teodor
Dāstavaskī
D̲ostogephski
Dostoyewski, Fedor
Dosztojevszkij, F. M.
Dosṭoyeṿsḳi, F. M.
Dostojevskij, Fedor Michajlovič

Some records are more complex than this.
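
The mechanism can be sketched in a few lines of Python. This is only a simplified model, assuming an authority index that maps every variant form to the one authorized heading, with bibliographic records filed under that heading. The variable names, the short selection of variants, and the three sample titles are illustrative; a real catalog is considerably more involved.

```python
AUTHORIZED = "Dostoyevsky, Fyodor, 1821-1881"

# A handful of the variant forms from the authority record above.
VARIANTS = [
    "Dostoievski, Fiodor",
    "Dostojewskij, Fjodor M.",
    "Zuboskal",
    "Dostoevsky, Fyodor",
    "Dustūyafskī, Fīdūr",
]

# Every variant, and the authorized form itself, points to the same heading.
authority_index = {form.casefold(): AUTHORIZED for form in VARIANTS}
authority_index[AUTHORIZED.casefold()] = AUTHORIZED

# Illustrative catalog: records are filed once, under the authorized heading.
catalog = {
    AUTHORIZED: [
        "Crime and punishment",
        "Prestuplenie i nakazanie",
        "Schuld und Sühne",
    ],
}


def search_author(query):
    """Resolve any known form of the name to its heading, then fetch records."""
    heading = authority_index.get(query.casefold())
    return catalog.get(heading, []) if heading else []
```

With this model, search_author("Zuboskal") and search_author("Dostoevsky, Fyodor") return the same three records, because the link between the forms was made by a cataloger, not inferred from spelling.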

Google cannot do this. Another example that I use (that people probably get tired of) is: WWI. When you search wwi in Google, you get 600,000 hits, with the first one pointing to Wikipedia.

So, the information expert immediately asks: What are we looking at? Is this a good search? Someone who doesn't understand the problems will be happy with the search--that is, until they realize that this search *cannot find primary documents about WWI*. Why? Because nobody called it World War One until World War Two began some 20 years later.

So, unless someone has gone in and manually added wwi to the primary documents, the text search for "WWI" cannot retrieve primary documents. This can be repeated with literally hundreds of thousands of examples.
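
Here is a small sketch of the difference, using three invented sample documents. The subject heading "World War, 1914-1918" is the form used in library catalogs; the document texts, years, and function names are made up for the example.

```python
# Invented sample records: two primary documents and one later history.
documents = [
    {"id": 1, "year": 1916,
     "text": "Letters from the front during the Great War",
     "subjects": ["World War, 1914-1918"]},
    {"id": 2, "year": 1995,
     "text": "A short history of WWI",
     "subjects": ["World War, 1914-1918"]},
    {"id": 3, "year": 1917,
     "text": "Dispatches on the European War",
     "subjects": ["World War, 1914-1918"]},
]


def text_search(term):
    """Match the literal term against each document's own words."""
    return [d["id"] for d in documents
            if term.casefold() in d["text"].casefold()]


def subject_search(heading):
    """Match the heading a cataloger assigned, whatever the text says."""
    return [d["id"] for d in documents if heading in d["subjects"]]


text_hits = text_search("WWI")                        # [2]
subject_hits = subject_search("World War, 1914-1918")  # [1, 2, 3]
```

The text search finds only the 1995 history, because the 1916 and 1917 documents never use the later name for the war. The subject search retrieves all three.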

Google bombs work by citations (text) to specific pages. The famed "miserable failure" example (killed in Google--and a discussion could take place whether this is censorship--but it still works in Yahoo) is based on people adding links with the text "miserable failure" pointing to George Bush's White House page, but it has nothing to do with concept searching.

> I can't help thinking that perhaps you are asking the wrong question
> here. You ask google about 'dostoyevsky'. Without any additional
> information they infer that you are asking about the russian author
> and present you with a page full of results about him - primarily
> summaries about him, his writing and the period in history as well as
> a lot of detail on where to find more information.
>
> What question was it that you were trying to answer about Dostoyevsky
> when starting the search? When he was born? What he wrote? What
> question does it fail to answer in the first page of results? Knowing
> that would really help in knowing how to build a better search tool.

For this discussion, I just want to know what is available in Google by and about Dostoyevsky. A library catalog is designed to do this, while Google cannot do it. I don't think most people understand this. If we want to decide that the traditional goals of a catalog (i.e. to show what a collection has by its authors, titles, and subjects) no longer apply, that would be one thing, but it must be debated first.

Again, there is nothing wrong with Google, but it has major weaknesses. I also want to build a better search tool, but it is vital that we all understand the strengths and weaknesses of all the tools we presently have.

James Weinheimer
Received on Fri Jan 04 2008 - 06:50:30 EST