Re: Relevancy-ranking LCSH?

From: Roberts, Anchalee Joy <aproberts_at_nyob> Date: Mon, 5 Feb 2007 19:50:51 -0600 To: NGC4LIB_at_listserv.nd.edu

I don't know if this will answer your question or not, Tim.
But I found the use of LCC in this e-print demo as 'subjects' an innovative way to use the legacy
cataloging concept: http://demoprints.eprints.org/8927/
Perhaps close to what you are looking for.
Cheers,
Joy--
Anchalee (Joy) Panigabutra-Roberts
Catalog Librarian
220C James W. Miller Learning Resource Center
720 Fourth Avenue South
St.Cloud, MN 56301-4498
Tel. (320)308-4771
E-mail: aproberts_at_stcloudstate.edu

________________________________

From: Next generation catalogs for libraries on behalf of Tim Spalding
Sent: Mon 2/5/2007 6:50 PM
To: NGC4LIB_at_listserv.nd.edu
Subject: Re: [NGC4LIB] Relevancy-ranking LCSH?

Two detailed replies.

Karen,

>Tim, in part I think at one point you confuse LCSH and LC
>Classification. LC Classification shelves things in a single place; LCSH
>allows multiple subject headings to be added to a record.

While I appreciate your comments, I am quite sure I was not confusing
them. The blog post explicitly contrasts shelf-order systems like LCC
and Dewey (as used 99% of the time) with subject systems like LCSH
which allow multiple headings per book. Indeed, the whole point of
the algorithm I was playing with was to rank books within one subject
by looking at the *other* subjects applied to the same books. I also
mention the practice of making the first LCSH the "primary one."

Perhaps you were addressing the letter alone, which speaks of the
physicality of the system.  A shelf-order system is the most
limited--every book its place. But LCSH is equally rooted in
physical, not digital, limitations. When card catalogs were physical,
a book could have only so many subjects, first if it's to retain a
single card, but even if it spills over. Just imagine adding every
relevant heading to the "Encyclopedia Britannica" card. Similarly, a
subject's section can take up only so many cards. It would not do, for
example, to try to file under "Love," "Man-woman relationships,"
"Christian life" or "Civilization" every book that pertains to these
subjects. The catalog would be useless. It would be the map of China
that was as large as China.

Jonathan,

I am speaking of ranking books within a subject, not ranking subjects
in response to a query. Although I see that ranking subjects in
response to a request might be an interesting problem, the idea of
returning LCSHs rather than books in response to a user query turns me
off. Perhaps as facets.

I disagree with you about this:

> If we are talking about ranking books _within_ a certain LCSH subject,
> though, I'm not sure what our goal would be. Do we want a book to show
> up higher if it's somehow "more" about that subject than other books?
> What does that even mean? More of the book concerns this topic?

You're right that it's not entirely clear what it means. But "what
does this even mean" implies a more general disbelief that relevancy
ranking "means" anything. So let me step back.

In the real world, things are not "about" something in a binary way.
They are like this in LCSH because of the physical constraints of the
catalog card, which are now baked into the system. About 95% of
western literature is "about" "Man-woman relationships" to some
degree, but we all know that some books are more about it than others.
There is no non-arbitrary place to draw the line, and there will
always be shades and modes within the list you get.

Relevancy ranking is the attempt to force a large binary list into an
order that is useful to people. To take some examples from LCSH, most
of us would rank "Pride and Prejudice" as "more about" "Man-woman
relationships" than "Great Expectations," "The Lord of the Rings" a
better example of "Fantasy" than "Charlotte's Web," and "The Time
Machine" as more solidly "Time travel -- Fiction" than "Life, the
Universe and Everything." But LCSH makes no such distinctions.

As David Weinberger and others have noted, ranking and relevance are
central to how we think--in prototypes and good and bad examples, not
in binary trees. We know that penguins are birds, and tomatoes are
fruit, but ask someone for a good example of either category and
they're more likely to pick something closer to their prototype of the
term. What's true for birds is also true for books--a binary system
impoverishes our complex, nuanced understanding of aboutness.

Unfortunately, for library data, this kind of relevancy ranking is
central to computers today. It's not only how people think, it's how
computers increasingly *work*. Google doesn't return *all*
pages with keywords, and throw its hands up philosophically about what
ranking them would "mean." It ranks them. Certainly, there's no
getting around the fact that Google order is debatable--imperfect,
context-dependent and subjective. But that also describes the truth of
the matter, that "aboutness" is not binary. Anyway, we can debate
aboutness all we like, but give a patron a list of 1,000 books "about"
a topic, and refuse to even try ranking it, and they will turn to
Google for their bibliographic research. They will be right to do so.

> This
> topic is more central to the book?  Hmm. In your 'folksonomy' example we
> know exactly what it means---a whole bunch of people thought that
> "dytopia <http://www.librarything.com/tag/dystopia>" was an appropriate
> tag for the book 1984. This is very useful information in a folksonomy
> environment, because we don't know how 'trustworthy' the tags are, this
> is one way of deciding it's a trustworthy tag.

Yes, statistics can screen for "trustworthiness," in case someone
tried to spam a folksonomic system. But that's missing the point. The
statistics of a folksonomy aren't there so you can flip between binary
conditions--trust/don't trust--but to approach the "degrees of
belonging" a system like LCSH lacks.

Take a look at the LibraryThing tags for, say, Chick lit and Cyberpunk:

http://www.librarything.com/tag/chick+lit
http://www.librarything.com/tag/cyberpunk

While thousands of books have been so tagged, the resulting tag page
has something close to the paradigmatic "reading list" for these
terms. It's not because 277 people tagging "Bridget Jones's Diary" as
"Chick lit" has proved "trustworthy." It's because the statistics
indicate "Bridget Jones's Diary" is a *particularly good example* of a
fuzzy, subjective and contestable category. On the other end, with a
certain creative acceptance, you can understand how someone decided to
tag "Jane Eyre" as chick lit. But it was only one person. By taking
account of the statistics, tagging can put such marginal examples
where they belong, at the end of a list.
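
As a rough illustration of ordering by tag statistics, the sketch below sorts
books by how many users applied a given tag. The two counts for "Bridget
Jones's Diary" and "Jane Eyre" come from the paragraph above; the third entry
and the function name are placeholders, and this is not a description of how
LibraryThing actually builds its tag pages.

```python
def rank_by_tag_count(tag_counts, tag):
    """Return (title, count) pairs for `tag`, most-tagged first.

    `tag_counts` maps a title to a dict of {tag: number of users who applied it}.
    """
    # Drop books that nobody has given this tag.
    tagged = {
        title: tags[tag]
        for title, tags in tag_counts.items()
        if tags.get(tag, 0) > 0
    }
    # Sort by descending count; ties broken alphabetically by title.
    return sorted(tagged.items(), key=lambda item: (-item[1], item[0]))


counts = {
    "Bridget Jones's Diary": {"chick lit": 277},
    "Jane Eyre": {"chick lit": 1},
    # Placeholder entry, invented for the example.
    "Some Other Novel": {"chick lit": 40, "romance": 12},
}
for title, n in rank_by_tag_count(counts, "chick lit"):
    print(n, title)
```

The marginal example is not thrown out; it simply sinks to the end of the
list, which is the "degrees of belonging" behavior a flat heading list lacks.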

Contrast that with the LibraryThing LCSH page for "Fantasy."
http://www.librarything.com/subject.php?subject=Fantasy

I've imposed a popularity order, but imagine it was completely
arbitrary and take a look at the full list, the data for which is all
library based (most from the LC). LibraryThing stops at 10,000
examples, and the examples it finds are hardly equally pertinent. This
isn't a philosophical question. If someone came into your library
asking for some fantasy books, it would be less helpful to start them
off with "Charlotte's Web" and "Le Petit Prince" than Tolkien, Lewis
or Moorcock.

I am the last person to chuck out LCSHs. Apart from the fact of their
existence and all the labor that went into that, they have virtues
(like hierarchy, disambiguation of homonyms, etc.) that give them
great power. They would have more if there were some way to relevancy
rank results within a subject set. Whether an algorithm could go some
of the way to "adding relevance back" was the point of my post.