Re: Relevancy-ranking LCSH?

From: Rob Styles <Rob.Styles_at_nyob> Date: Wed, 7 Feb 2007 11:31:44 -0000 To: NGC4LIB_at_listserv.nd.edu

Somehow today's xkcd cartoon seemed appropriate...

http://xkcd.com/c220.html

rob

Rob Styles
Programme Manager, Data Services, Talis
tel: +44 (0)870 400 5000
fax: +44 (0)870 400 5001
direct: +44 (0)870 400 5004
mobile: +44 (0)7971 475 257
msn: mmmmmrob_at_yahoo.com
irc: irc.freenode.net/mmmmmrob,isnick

> -----Original Message-----
> From: Next generation catalogs for libraries
> [mailto:NGC4LIB_at_listserv.nd.edu] On Behalf Of Tim Spalding
> Sent: 06 February 2007 00:50
> To: NGC4LIB_at_listserv.nd.edu
> Subject: Re: [NGC4LIB] Relevancy-ranking LCSH?
>
> Two detailed replies.
>
> Karen,
>
> >Tim, in part I think at one point you confuse LCSH and LC
> >Classification. LC Classification shelves things in a single place; LCSH
> >allows multiple subject headings to be added to a record.
>
> While I appreciate your comments, I am quite sure I was not confusing
> them. The blog post explicitly contrasts shelf-order systems like LCC
> and Dewey (as used 99% of the time) with subject systems like LCSH
> which allow multiple headings per book. Indeed, the whole point of
> the algorithm I was playing with was to rank books within one subject
> by looking at the *other* subjects applied to the same books. I also
> mention the practice of making the first LCSH the "primary one."
>
> Perhaps you were addressing the letter alone, which speaks of the
> physicality of the system. A shelf-order system is the most
> limited--every book its place. But LCSH is equally rooted in
> physical, not digital, limitations. When card catalogs were physical,
> a book could have only so many subjects, first if it's to retain a
> single card, but even if it spills over. Just imagine adding every
> relevant heading to the "Encyclopedia Britannica" card. Similarly, a
> subject's section can take up only so many cards. It would not do, for
> example, to try to file under "Love," "Man-woman relationships,"
> "Christian life" or "Civilization" every book that pertains to these
> subjects. The catalog would be useless. It would be the map of China
> that was as large as China.
>
> Jonathan,
>
> I am speaking of ranking books within a subject, not ranking subjects
> in response to a query. Although I see that ranking subjects in
> response to a request might be an interesting problem, the idea of
> returning LCSHs rather than books in response to a user query turns me
> off. Perhaps as facets.
>
> I disagree with you about this:
>
> > If we are talking about ranking books _within_ a certain LCSH subject,
> > though, I'm not sure what our goal would be. Do we want a book to show
> > up higher if it's somehow "more" about that subject than other books?
> > What does that even mean? More of the book concerns this topic?
>
> You're right that it's not entirely clear what it means. But "what
> does this even mean" implies a more general disbelief that relevancy
> ranking "means" anything. So let me step back.
>
> In the real world, things are not "about" something in a binary way.
> They are like this in LCSH because of the physical constraints of the
> catalog card, which are now baked into the system. About 95% of
> western literature is "about" "Man-woman relationships" to some
> degree, but we all know that some books are more about it than others.
> There is no non-arbitrary place to draw the line, and there will
> always be shades and modes within the list you get.
>
> Relevancy ranking is the attempt to force a large binary list into an
> order that is useful to people. To take some examples from LCSH, most
> of us would rank "Pride and Prejudice" as "more about" "Man-woman
> relationships" than "Great Expectations," "The Lord of the Rings" a
> better example of "Fantasy" than "Charlotte's Web," and "The Time
> Machine" as more solidly "Time travel -- Fiction" than "Life, the
> Universe and Everything." But LCSH makes no such distinctions.
>
> As David Weinberger and others have noted, ranking and relevance are
> central to how we think--in prototypes and good and bad examples, not
> in binary trees. We know that penguins are birds, and tomatoes are
> fruit, but ask someone for a good example of either category and
> they're more likely to pick something closer to their prototype of the
> term. What's true for birds is also true for books--a binary system
> impoverishes our complex, nuanced understanding of aboutness.
>
> Unfortunately, for library data, this kind of relevancy ranking is
> central to computers today. It's not only how people think, it's how
> computers increasingly *work*. Google doesn't return *all*
> pages with keywords, and throw its hands up philosophically about what
> ranking them would "mean." It ranks them. Certainly, there's no
> getting around the fact that Google order is debatable--imperfect,
> context dependent and subjective. But that also describes the truth of
> the matter, that "aboutness" is not binary. Anyway, we can debate
> aboutness all we like, but give a patron a list of 1,000 books "about"
> a topic, and refuse to even try ranking it, and they will turn to
> Google for their bibliographic research. They will be right to do so.
>
> > This
> > topic is more central to the book?  Hmm. In your 'folksonomy' example we
> > know exactly what it means---a whole bunch of people thought that
> > "dystopia <http://www.librarything.com/tag/dystopia>" was an appropriate
> > tag for the book 1984. This is very useful information in a folksonomy
> > environment, because we don't know how 'trustworthy' the tags are, this
> > is one way of deciding it's a trustworthy tag.
>
> Yes, statistics can screen for "trustworthiness," in case someone
> tried to spam a folksonomic system. But that's missing the point. The
> statistics of a folksonomy aren't there so you can flip between binary
> conditions--trust/don't trust--but to approach the "degrees of
> belonging" a system like LCSH lacks.
>
> Take a look at the LibraryThing tag pages for, say, Chick lit and Cyberpunk:
>
> http://www.librarything.com/tag/chick+lit
> http://www.librarything.com/tag/cyberpunk
>
> While thousands of books have been so tagged, the resulting tag page
> has something close to the paradigmatic "reading list" for these
> terms. It's not because 277 people tagging "Bridget Jones's Diary" as
> "Chick lit" has proved "trustworthy." It's because the statistics
> indicate "Bridget Jones's Diary" is a *particularly good example* of a
> fuzzy, subjective and contestable category. On the other end, with a
> certain creative acceptance, you can understand how someone decided to
> tag "Jane Eyre" as chick lit. But it was only one person. By taking
> account of the statistics, tagging can put such marginal examples
> where they belong, at the end of a list.
>
> Contrast that with the LibraryThing LCSH page for "Fantasy."
> http://www.librarything.com/subject.php?subject=Fantasy
>
> I've imposed a popularity order, but imagine it was completely
> arbitrary and take a look at the full list, the data for which is all
> library based (most from the LC). LibraryThing stops at 10,000
> examples, and the examples it finds are hardly equally pertinent. This
> isn't a philosophical question. If someone came into your library
> asking for some fantasy books, it would be less helpful to start them
> off with "Charlotte's Web" and "Le Petit Prince" than Tolkien, Lewis
> or Moorcock.
>
> I am the last person to chuck out LCSHs. Apart from the fact of their
> existence and all the labor that went into that, they have virtues
> (like hierarchy, disambiguation of homonyms, etc.) that give them
> great power. They would have more if there were some way to relevancy
> rank results within a subject set. Whether an algorithm could go some
> of the way to "adding relevance back" was the point of my post.
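
The tag-statistics ordering described in the quoted message (many taggers put a book near the top, a single tagger puts it at the end) can be sketched roughly as below. This is only an illustration under stated assumptions: the function name rank_by_tag_count and the (user, title, tag) data layout are hypothetical and are not LibraryThing's actual implementation. The counts of 277 and 1 are the ones quoted above; the cyberpunk row is made-up filler.

```python
from collections import Counter


def rank_by_tag_count(taggings, tag):
    """Return (title, count) pairs for `tag`, most-tagged first.

    `taggings` is an iterable of (user, title, tag) triples (a hypothetical
    layout). Each distinct user counts at most once per book.
    """
    seen = set()
    counts = Counter()
    for user, title, t in taggings:
        if t != tag or (user, title) in seen:
            continue
        seen.add((user, title))
        counts[title] += 1
    # Highest count first; ties broken alphabetically by title.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# 277 users tag "Bridget Jones's Diary" as chick lit; one tags "Jane Eyre".
taggings = [(f"user{i}", "Bridget Jones's Diary", "chick lit") for i in range(277)]
taggings.append(("user0", "Jane Eyre", "chick lit"))
taggings.append(("user0", "Neuromancer", "cyberpunk"))  # hypothetical filler row

ranking = rank_by_tag_count(taggings, "chick lit")
for title, count in ranking:
    print(f"{count:4d}  {title}")
```

The marginal example is kept rather than filtered out; it simply sorts to the end of the list, which is the "degrees of belonging" behaviour the message argues LCSH lacks.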
