Re: Relevance ranking: was Aqua Brow

From: Rinne, Nathan (ESC) <RinneN_at_nyob> Date: Fri, 4 Jan 2008 16:32:15 -0600 To: NGC4LIB_at_listserv.nd.edu

Alex,

I look forward to more jousting between you and James (who would fully
agree with Karen's last post by the way, I'm sure).

OK, real quick.  You said:

"May I remind our good readers of this list of *why* people search for
things by author, title and subject? It's because we told them to do so.
It's because that's how we've done our systems, so for them to find
stuff in our collection, that's the way they have to do it."

Actually, I don't think this is true.  According to Wikipedia, in
Charles Cutter's "index catalog", he had an author index and a "classed
catalog", or rudimentary form of a subject index (he also did genre).

According to Francis Miksa (see
http://www.catalogingfutures.com/catalogingfutures/2007/11/essential-liste.html),
Cutter's cataloging experiments developed out of his
experiences with the library patrons of his day.  According to Miksa,
"Cutter's Rules", for example, have a running commentary about why he is
doing what he is doing - giving a rationale based in concrete, real-world
experiences for everything.  At 56 minutes into his talk, Miksa
talks about Cutter's "Objects" like this:  "[Cutter talked about] how we
find out what a catalog is supposed to do... and he listed all these
questions that people ask of the librarian.  And you know what?  They're
his "Objects of the Catalog" in the form of user questions.  He was
listening to his users" (Cutter also, way back in 1875, did extensive
survey research).

In other words, how much of this was Cutter dictating to people "how to
do something", and how much of it was simply him creating an explicit,
classified structure that enabled people to accomplish the goals,
implicit or explicit, that they came to him with?

Regards,

Nathan Rinne
Media Cataloging Technician
ISD 279 - Educational Service Center (ESC)
11200 93rd Ave. North
Maple Grove, MN. 55369
Work phone: 763-391-7183

-----Original Message-----
From: Next generation catalogs for libraries
[mailto:NGC4LIB_at_listserv.nd.edu] On Behalf Of Alexander Johannesen
Sent: Friday, January 04, 2008 3:31 PM
To: NGC4LIB_at_listserv.nd.edu
Subject: Re: [NGC4LIB] Relevance ranking: was Aqua Brow

Hiya,

I shouldn't get involved in this, but a few contrarian comments should
be in order ;

On Jan 4, 2008 10:48 PM, Weinheimer Jim <j.weinheimer_at_aur.edu> wrote:
[big list from authority file]

> Google cannot do this.

Oh, I'm sure Google could do it if it felt like it (the mechanisms are
there [the context argument]), but most of the time they don't because
they don't have fielded search, so taking the "Did you mean?" concept
that far would be overkill / crazy for what their service is meant to
do. I think this is comparing apples and oranges, to be honest.

> Another example that I use (that people probably get tired of) is: WWI.
> When you search wwi in Google, you get 600,000 hits, with the first one
> to Wikipedia. So, the information expert immediately asks: What are we
> looking at? Is this a good search? Someone who doesn't understand the
> problems will be happy with the search--that is, until you realize that
> this search *cannot find primary documents about WWI*. Why? Because
> nobody called it world war one until world war two began 20 years later.

This is a straw-man argument as you're introducing "primary documents"
as the qualifier for the search for the more generic "WWI". If people
want information on that war then "WWI" is the perfect search. The
search criterion is also misleading, because if people are looking for
primary documents then they wouldn't even use a simplistic term such
as "WWI". In fact, there are so many ways of searching for that topic
that arguing semantics over this one little term is rather pointless.
(I'm pretty darn sure there's stuff in our collection with metadata
with publishing dates in the 1914-1918 region which is not marked with
the infamous "WWI" mark.)

I get the feeling that you live in a world where catalogers catalog
everything, and perfectly, which is a place very far from mine. As
an aside, just yesterday (my last day) a colleague showed me a program
that scrutinizes LCSH structure and validity (it's a program for
suggesting subject headings), a rather "throwing your arms in the air"
kinda experience, with over 70% unvalidated terms, but where those terms
were the ones making the most sense. ("Cooking (fish)" vs. "Cooking
-- Fish" anyone?) And beyond validation there's the "controlled" part
of "controlled vocabulary" which I won't go into here but certainly is
laughable ...

> So, unless someone has gone in and manually added wwi to the primary
> documents, the text search for "WWI" cannot retrieve primary documents.
> This can be repeated with literally hundreds of thousands of examples.

Um, Google is a little bit more clever than that, and certainly uses
context for searching, albeit not as extensively as some catalogers trawl
their collections, no. But this is a problem as we speak right now,
and hardly a future problem as Google slowly extends and improves
its search. (Unlike the library world, Google has got several teams
of AI and SemWeb experts to dig into that problem, and with their
resources I'm extremely confident they'll do this better than us.)

> For this discussion, I just want to know what is available in Google
> by and about Dostoyevsky.

Interestingly, *nothing* is available in Google; it merely points to
other places where it might be available. You're wanting Google to be
something it ain't: a library, where indeed we've got stuff.

> A library catalog is designed to do this, while Google cannot do it.

No, a library catalog is foremost a catalog of stuff we've got in our
collection (an inventory list, if you like), which puts a whole heap
of different parameters around the concepts of finding information.
Google trawls *billions* of large pages about *anything* for textual
context, while our catalog searches a few million small and *fielded*
records of mostly books. We've got a specific domain with a specific
set of criteria for search success, and Google is a free-for-all and
completely open as to the quality of searching (albeit PageRank does a
pretty good job of finding relevant stuff). The interesting thing is
when you treat a library catalog as unstructured free-form text you're
more likely to find search direction (http://ll01.nla.gov.au/), so
there really is no one solution which is better than the other; both
(or all) approaches address a different aspect of searching, and
some methods are better for some people. Google, your library
search and some alternative may all do things differently and find
different results, but they can *all* be *exactly* what the user
wants. (Unless the user is a librarian, it appears ...)

> I don't think most people understand this. If we want to decide that
> the traditional goals of a catalog no longer apply: i.e. to show what a
> collection has by its authors, titles, and subjects, that would be one
> thing, but it must be debated first.

May I remind our good readers of this list of *why* people search for
things by author, title and subject? It's because we told them to do
so. It's because that's how we've done our systems, so for them to
find stuff in our collection, that's the way they have to do it.

And may I remind you also that these kinds of searches are usually not
the main goal of the search? Often they don't search for "Sagan, Carl"
and subject "Extraterrestrial Wagon" (if our catalogers ever put that
in) to read his book, but to get that book so they can read the
section in which this argument had its origin. That's research. If
catalogers could put in *that* level of semantics then we would be the
envy of the world, but we don't. We can tell you the books "Sagan,
Carl" wrote, but we can't point out what's in them that makes sense to
searchers. Or what about this one: in what book was "The Teapot"
argument by Bertrand Russell first made? Our catalog *CANNOT* get you
this answer, but search Google for it, and look what the number one
answer is. *That's* what we envy, and that's what we cannot ever, ever
hope to do.

> Again, there is nothing wrong with Google, but it has major weaknesses.
> I also want to build a better search tool, but it is vital that we all
> understand the strengths and weaknesses of all the tools we presently
> have.

Again, there's nothing wrong with library catalogs, but they have
major weaknesses. I also want to build a better search tool, but it is
vital that we all understand the strengths and weaknesses of all the
tools we presently have.

Alex
--
---------------------------------------------------------------------------
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
------------------------------------------ http://shelter.nu/blog/ --------