Re: Relevance ranking: was Aqua Brow

From: Weinheimer Jim <j.weinheimer_at_nyob>
Date: Fri, 4 Jan 2008 09:27:11 +0100
To: NGC4LIB_at_listserv.nd.edu

> Could you expand on the parallel between librarianship and say, being
> a medical doctor? For example, the similarities in the expectations
> of users that you're seeing or in the sanctions that are applied to
> librarians when they fail to find the right resource for a user's
> needs? I'm not sure library users see librarians in quite the same
> way as they see their lawyer or doctor.
>
> The google code of conduct is at:
> http://investor.google.com/conduct.html and makes interesting reading...
>
> It contains a section on serving their users well, taking a stand on
> issues that affect their users, respecting each other at work and not
> letting personal issues become a conflict of interest; all in all it
> seems to me to be very similar.

The Google code of conduct is aimed at the conduct of their employees. There is a lot of concern about Google's actual policy. For a rather extreme stance, see the Mother Jones article at: http://www.motherjones.com/news/feature/2006/11/google.html
Here is a report about Google being the worst at protecting privacy:
http://www.privacyinternational.org/article.shtml?cmd%5B347%5D=x-347-553961

I know that Google agreed to censor itself in China. This is not to say that Google is "bad": it is acting like a normal corporation out to make money, but it is no better and no worse than any other corporation. We all need to be aware of this.

Librarians have a code of ethics that says we do not censor users' information. We do not promote certain information over other information for our own benefit. Of course, such promotion is how Google works: if one site can get a whole lot of others to link to it, by paying them or in any other way, it will wind up higher in the rankings. This can be seen most clearly with Google bombs, but it is obviously happening in other ways as well.

> Not knowing much about Dostoyevsky I figured I'd try
> http://www.google.co.uk/search?q=Dostoyevsky and see what I got. The very
> first hit is the wikipedia article which contains the common
> transliterations of his name as well as the russian. More
> interestingly many of the first page results were for sites using the
> spelling 'Dostoevsky' as their prominent form. But as I'm not an
> expert on any russian authors I can't say what I'm missing out on.

This is something I have discussed in several other posts. When people search for, e.g., "Dostoyevsky" and get one million hits, they feel happy. The question is: why?

When I ask my students this, the question literally shocks them. When I finally get an answer, it is that they have retrieved the items about Dostoyevsky. Then I show them that they haven't. I show them the authority record for Dostoyevsky (you should look at it in Bernhard's wonderful copy of the LC authority file at: http://www.biblio.tu-bs.de/db/lcsh/page.php. It's quite a record). Then I say that to do a good search for him, they would have to look under all of those forms. Yes, some of those forms may be in languages they may not be able to read, but how many forms are there in English? Besides, what does that have to do with it? Users should be able to limit their search afterwards to "English" or any other language.

What I'm getting at is a primary difference in the way Google and library catalogs search: in Google (et al.) you necessarily search "text," perhaps with fuzzy searching and other ingenious methods, but in the library catalog, you can search "concepts." This is how the catalog was designed. You can search the entire concept of Dostoyevsky, or Tolstoy, or WWI, or anything according to Cutter's rules from so long ago: to find what the library has by their authors, titles, and subjects. This *cannot be done* in Google because you are searching only text.
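
A minimal sketch in Python may make the distinction concrete. Everything in it is invented for illustration: the small AUTHORITY table only stands in for an authority record (the actual LC record lists many more forms), and RECORDS, text_search, and concept_search are hypothetical names, not the interface of any real catalog or of Google.

```python
# Hypothetical mini authority file: one heading with a few variant forms.
# This is an assumption for illustration; the real LC record is far longer.
AUTHORITY = {
    "Dostoyevsky, Fyodor, 1821-1881": [
        "Dostoyevsky", "Dostoevsky", "Dostojewski", "Dostoïevski", "Достоевский",
    ],
}

# Hypothetical bibliographic records: (title statement, language code).
RECORDS = [
    ("Crime and Punishment / Fyodor Dostoyevsky", "eng"),
    ("The Idiot / by F. Dostoevsky", "eng"),
    ("Schuld und Sühne / F. M. Dostojewski", "ger"),
    ("Преступление и наказание / Ф. М. Достоевский", "rus"),
    ("War and Peace / Leo Tolstoy", "eng"),
]


def text_search(term, records):
    """Match only the literal string typed, as a plain text search does."""
    return [r for r in records if term.lower() in r[0].lower()]


def concept_search(term, records, authority, language=None):
    """Expand the term to every variant form under its heading, then match."""
    forms = [term]
    for variants in authority.values():
        if any(term.lower() == v.lower() for v in variants):
            forms = variants
            break
    hits = [r for r in records if any(f.lower() in r[0].lower() for f in forms)]
    if language is not None:
        # Limit by language afterwards, once the full set has been found.
        hits = [r for r in hits if r[1] == language]
    return hits


text_hits = text_search("Dostoyevsky", RECORDS)
concept_hits = concept_search("Dostoyevsky", RECORDS, AUTHORITY)
english_hits = concept_search("Dostoyevsky", RECORDS, AUTHORITY, language="eng")
print(len(text_hits), len(concept_hits), len(english_hits))  # prints: 1 4 2
```

With this toy data the text search finds one record, the search through the authority forms finds four, and limiting those four to English leaves two.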

So, after I explain all of this, I ask them again: why are they happy with their Google search? The answer is: they thought they had done a concept search for Dostoyevsky, when they had only searched the text. They also thought that they had received the most "relevant" items, when in actuality they were looking at the items that have the most links to them, or the most cited items. This does not mean that the items they are looking at are the most "relevant" items, at least not in the normal meaning of the term. At the end of the exercise, they are much more skeptical of Google results.
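
The point about links can be shown with a deliberately crude sketch in Python. It is not Google's actual algorithm, which weights links by the rank of the linking page and uses many other signals; it simply orders pages by a raw count of inbound links. The page names, the LINKS data, and rank_by_inbound_links are all hypothetical.

```python
from collections import Counter

# Hypothetical link graph: (linking page, linked-to page).
LINKS = [
    ("blog-a", "popular-summary"),
    ("blog-b", "popular-summary"),
    ("blog-c", "popular-summary"),
    ("blog-a", "scholarly-bibliography"),
]


def rank_by_inbound_links(candidates, links):
    """Order pages by how many links point at them; content is never examined."""
    counts = Counter(target for _, target in links)
    return sorted(candidates, key=lambda page: counts[page], reverse=True)


candidates = ["scholarly-bibliography", "popular-summary"]
ranking = rank_by_inbound_links(candidates, LINKS)
print(ranking)  # ['popular-summary', 'scholarly-bibliography']

# Three extra links, obtained by whatever means, reverse the order
# even though neither page's content has changed.
boosted = LINKS + [(f"paid-{i}", "scholarly-bibliography") for i in range(3)]
boosted_ranking = rank_by_inbound_links(candidates, boosted)
print(boosted_ranking)  # ['scholarly-bibliography', 'popular-summary']
```

The ordering depends only on who links to what, so it measures popularity, and it can be changed without touching the pages themselves.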

Finally, I want to emphasize again that Google searches are not bad--Google is simply one more tool in our toolbox to help users find things they need. Google searching has many, many weaknesses, just as library catalogs do. What is great is that the weaknesses and strengths complement each other: where one is weak, the other is strong and vice versa. One tool is not so strong as to negate the need of the other.

Library catalogs need to change in a thousand ways or more, but I think it is still important to give access to items in the traditional ways: reliable, complete results by author, title, and subject.

> The term "relevance" simply means "it meets my needs". Something that
> the world at large believe Google does very well. Certainly the
> search discussed above looks to me like it would meet the needs of
> all but the most ardent researcher - and I would expect they would be
> using far more distinct terms than 'Dostoyevsky'.
> That is to say, relevance is about matching the result to the context
> of the user. What Google excels at it is guessing that context from
> what you've typed, how much you've typed, how specific it is and
> much, much more.

Webster's dictionary gives the following definitions for "relevant":
a. having significant and demonstrable bearing on the matter at hand
b. affording evidence tending to prove or disprove the matter at issue or under discussion
Only definition "a" is relevant here :-))

Therefore, a Google result could be interpreted as: number 1 is the item that has the most significant and demonstrable bearing, within the Google database, on the topic that I searched. The rest are arranged in descending order. This is how I believe most people would understand a Google result. I think I already demonstrated that Google searches "text" and not "topics" (concepts). Now, would we really want to say that the most cited item (or in other words, the most popular item) is necessarily the most significant? Or, to put it in your terms: that it best meets my needs of all the items in the Google database?

I don't believe we can draw such a conclusion, and it certainly doesn't follow logically. But Google searches do make their customers happy, that is, as long as they don't examine the search results too closely.

> Surely a next-generation catalogue has to do a better job of that?

This I agree with completely! There is a lot of work to be done.

James Weinheimer