Re: Relevance ranking: was Aqua Brow

From: Weinheimer Jim <j.weinheimer_at_nyob> Date: Sat, 5 Jan 2008 13:47:02 +0100 To: NGC4LIB_at_listserv.nd.edu

> This conversation has left me more depressed about the future of libraries
> than I've ever been in the 7 years I've worked in this business.  So, a few
> parting shots, and I'm outta here.
>
> 1) Library catalogs most certainly do not work on the basis of concepts.
> They work on the basis of text, and algorithms for matching against that
> text.  For Google, the text in question is statistically derived data based
> upon billions of source documents.  For libraries, it's the MARC record.
> It's all text, though.
>
> It's not concepts, whatever one might mean by such a loaded and ambiguous
> word.  The text in the MARC record is more useful than the text in a bunch
> of LibraryThing tags or Google's indexes in some respects and less useful
> than others.  But they're all text.

It's unfortunate that you're depressed, but there are a few things that must be understood and accepted. Actually, I am depressed too because of the obvious loss of knowledge that has taken place within only the last 25 or 30 years. Keyword searching, which we take for granted today, was practically unknown not that long ago: everybody searched concepts. Yes, that's true: they searched concepts. How did anybody do anything in a universe without keyword searching? They had some very interesting methods they had created over centuries (or millennia) of trial and error.

The way they allowed people to search concepts was through this method of authority control. It was a highly structured experience and although the basic idea of concept searching was very simple, creating such a system turned out to be very, very complex. These methods survive today in library authority control of names, titles, and subjects.

The secret to understanding how it has worked is *not* to focus on an individual library record, be it in MARC format, XML, a catalog card, or whatever. What is important to understand is that a bibliographic record makes sense only when it is related to other records in the catalog. One bibliographic record is related to other bibliographic records through an incredibly complex set of still other records, and these are the authority files. Without the authority records, the individual bibliographic records, and the entire catalog, make much less sense, if any at all. They are absolutely necessary for a catalog to function as it should.

Let's see how concept searching works. I want to search a library catalog to find *everything* by Mark Twain. The way the catalog was originally designed to work, I would search in the catalog for "twain, mark" and find something similar to the following record:

Twain, Mark, 1835-1910
For works of this author written under other names, search also under: Clemens, Samuel Langhorne, 1835-1910; Snodgrass, Quintus Curtius, 1835-1910; Louis de Conte, 1835-1910

and I would follow the instructions and search all the names. Each name would have the correct bibliographic records collated together for me. In this way, I can search for the *concept* of Mark Twain, something that Google *cannot do.* There was also a cross-reference for different ways people may spell the name, e.g. if I were a Russian, I might think of him as Tvein, Mark, and when I looked for him under this form, I would see:

Tvein, Mark, 1835-1910
See: Twain, Mark, 1835-1910

With fuzzy searching, for the sake of argument, I'll go ahead and grant that perhaps Google might find "Mark Twain" from a search for "Mark Tvein," but I absolutely refuse to believe it will find "Quintus Curtius Snodgrass."
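For readers who think more easily in code, here is a minimal sketch of the name example above. It is only an illustration, not how any real catalog system is implemented: the dictionaries `SEE`, `SEE_ALSO`, and `FILED_UNDER`, the function names, and the `bib-N` record ids are all hypothetical placeholders. It shows a "See" reference resolving a variant form to the authorized heading, the "search also under" references widening the search, and, for contrast, a naive keyword match on the heading text.

```python
# Hypothetical in-memory authority file mirroring the Twain example above.
# "See" references: variant form -> authorized heading.
SEE = {
    "Tvein, Mark, 1835-1910": "Twain, Mark, 1835-1910",
}

# "Search also under" references: authorized heading -> related headings.
SEE_ALSO = {
    "Twain, Mark, 1835-1910": [
        "Clemens, Samuel Langhorne, 1835-1910",
        "Snodgrass, Quintus Curtius, 1835-1910",
        "Louis de Conte, 1835-1910",
    ],
}

# Placeholder record ids filed under each heading (not real catalog data).
FILED_UNDER = {
    "Twain, Mark, 1835-1910": ["bib-1", "bib-2"],
    "Clemens, Samuel Langhorne, 1835-1910": ["bib-3"],
    "Snodgrass, Quintus Curtius, 1835-1910": ["bib-4"],
    "Louis de Conte, 1835-1910": ["bib-5"],
}


def headings_for(name):
    """Return the authorized heading for `name`, then its see-also headings."""
    authorized = SEE.get(name, name)
    return [authorized] + SEE_ALSO.get(authorized, [])


def search_concept(name):
    """Collect every record filed under any heading linked to `name`."""
    records = []
    for heading in headings_for(name):
        records.extend(FILED_UNDER.get(heading, []))
    return records


def keyword_search(word):
    """Naive text match against the heading strings, for contrast."""
    records = []
    for heading, ids in FILED_UNDER.items():
        if word.lower() in heading.lower():
            records.extend(ids)
    return records


print(search_concept("Tvein, Mark, 1835-1910"))
print(keyword_search("twain"))
```

In this toy data, the search that follows the references reaches the records filed under all four headings, while the keyword match reaches only those whose heading text contains the word typed.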

To continue, let's examine a subject example. I am interested in dogs. I search "dogs" in the catalog and see:

Dogs
See also:
Broader Term: Domestic animals
Narrower Term:  Balto (Dog)
Narrower Term:  Bummer (Dog)
Narrower Term:  C. Fred (Dog)
Narrower Term:  Coydogs
Narrower Term:  Feral dogs
Narrower Term:  Fighting dogs
Narrower Term:  Flush (Dog)
Narrower Term:  Greff (Dog)
Narrower Term:  Lazarus (Dog)
Narrower Term:  Mabrouk (Dog)
Narrower Term:  Photography of dogs
Narrower Term:  Puppies [... truncated]
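The broader/narrower structure in that display can be sketched as a small tree. This is only an illustration under stated assumptions: the `NARROWER` table is a hypothetical, hand-typed subset of the terms shown above, and the function names are made up for the example.

```python
# Hypothetical subject authority data: term -> its narrower terms.
# Only a handful of the terms from the display above are included.
NARROWER = {
    "Domestic animals": ["Dogs"],
    "Dogs": ["Coydogs", "Feral dogs", "Fighting dogs", "Puppies"],
}


def broader_of(term):
    """Return the terms that list `term` as one of their narrower terms."""
    return [b for b, narrower in NARROWER.items() if term in narrower]


def expand(term):
    """Return `term` followed by all of its narrower terms, depth-first."""
    result = [term]
    for narrower in NARROWER.get(term, []):
        result.extend(expand(narrower))
    return result


print(broader_of("Dogs"))
print(expand("Dogs"))
```

A searcher who starts at "Dogs" can move up to the broader term or down to any narrower term without having to guess the words in advance.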

Now, I could continue to look at the records for "dogs" at this point, or actually think, "Oh! I really want fighting dogs," and go there immediately. If I look at the records for dogs, there will be a lot of them, and I will see that someone has thoughtfully subdivided these for me. Do me a favor and browse the items under "dogs" at Princeton University. It will only take a few minutes:
http://catalog.princeton.edu/cgi-bin/Pwebrecon.cgi?Search_Arg=dogs&Search_Code=SUBJ_&CNT=50&HIST=1

As you look at this, you will find headings that most probably would never occur to you, such as "Dogs as laboratory animals," and other concepts such as "Dogs--Japan," "Dogs--War use of." Something might even interest you enough to look at one of them more closely, but what is more important is that in just a few minutes, you have looked at *everything* about dogs (i.e. a concept search) in the collection of one of the great libraries in the world. It was painless, it was quick, and it was complete. (For those catalogers out there: I know I have overstated the case about *everything*, since there are certain known parameters in these issues, such as the rule of three authors, the 20% rule for subjects, problems with analysing collections, serials, etc., but I am setting these issues aside for now. These are issues related primarily to a lack of staffing, since the catalog could be far more complete with additional staff.)

Try searching Google for dogs and see what you get: rude references to women and probably politicians, porno and who knows what else, and you'll spend hours going through them all, but it will all be arranged by this magic of "relevance."

This is, very basically, how a library catalog works and is structured. There is a lot more to it than this. Librarians learn it very early in library school (or at least they should). Although I have written all this in the past tense, the same processes are still going on today. The library catalog is *not based on text at all.* To think otherwise is a serious mistake, and this must be understood very clearly. Library catalogs are based on *arrangement*: that is, on putting similar items together. The arrangement may be physical, as it was when the catalogs themselves were physical and the cards came together, or virtual, as it is today, when similar MARC records are brought together. Books are still arranged physically on the shelves by putting similar subjects together. This is also a type of concept searching without text.
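Bringing similar records together can also be sketched in a few lines. Again, this is only an illustration: the `ASSIGNED` list, the `bib-N` ids, and the `browse` function are hypothetical, the headings are the ones mentioned in the Princeton example, and the ordering is a plain case-insensitive string sort, which is an assumption and not the filing rules a real catalog uses.

```python
from collections import defaultdict

# Hypothetical (subject heading, record id) pairs, deliberately out of order.
ASSIGNED = [
    ("Dogs--War use of", "bib-4"),
    ("Dogs", "bib-1"),
    ("Dogs as laboratory animals", "bib-3"),
    ("Dogs--Japan", "bib-2"),
    ("Dogs", "bib-5"),
    ("Cats", "bib-6"),
]


def browse(start):
    """Group record ids under each heading that begins with `start`.

    Headings come back in simple case-insensitive string order, so records
    sharing a heading end up side by side, like cards in a drawer.
    """
    drawer = defaultdict(list)
    for heading, record_id in ASSIGNED:
        if heading.lower().startswith(start.lower()):
            drawer[heading].append(record_id)
    return sorted(drawer.items(), key=lambda item: item[0].lower())


for heading, record_ids in browse("dogs"):
    print(heading, record_ids)
```

The point of the sketch is that the searcher sees the headings themselves, each with its records collected beneath it, and so can notice a subdivision that would never have occurred to them as a keyword.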

The individual bibliographic records in the catalog need the authority records (essentially the series of cross-references) to provide additional access, and to supplement the bibliographic records. They work together. If you want to learn more about this in quite literally frightening detail, I suggest you discuss it with the catalogers in your library.

Is this a perfect system? Absolutely not, but it has powers that have been demonstrated over hundreds of years, and contemporary search engines *cannot do the same*. Some may find this depressing and some may not, but it is an absolute fact that can be demonstrated thousands of times with far more complex examples than those above. Imagine the depression of a cataloger who has been making records for his or her entire career and then discovers that very few people understand them at all, and that those people are "happy" with something that goes by the loaded term (yes!) "relevance ranking". Why are those people happy? Don't delve too deeply. This is why I have said that librarians are--or should be--more like attorneys or doctors than used car salesmen: our job is to help people find the information they need whether it makes them happy, angry, or bores them to death. Users who are happy may only be happy because they don't know what they aren't getting. This may be fine for a used car salesman, but it will send a doctor to jail.

Is the work of catalogers useless? Should it be discarded? Or should it be reharnessed and reformatted to bring out the power that definitely is there, and can be easily demonstrated?

I believe it can be useful in the new environment, but I also admit there is a debate. Before entering into the debate, however, we must all understand what the others are doing: the traditional librarians and catalogers must understand and accept the powers and capabilities of the new systems, but the "metadata creators" and other IT people must also understand and accept the powers of the traditional librarians. The two have always had trouble communicating, and I find this list to be a very useful corrective.

James Weinheimer