Re: Relevancy-ranking LCSH?

From: Hahn, Harvey <hhahn_at_nyob>
Date: Mon, 5 Feb 2007 15:06:47 -0600
To: NGC4LIB_at_listserv.nd.edu

Tim Spalding wrote:
|I wonder if anyone has made, seen or can think of any good methods to
|do it. So far I've only seen non-ranked and popularity-ranked results.
|In the blog post I talk about playing around with how LCSHs "reinforce"
|each other statistically, but I couldn't get the algorithm to produce
|good results more than sporadically.
|I'm not sure if this is a cataloging or a coding [problem]. Maybe that's
|the point.

Your blog entry said:
"It's easy to ignore a third, and very critical difference. Subject
classifications, like the Library of Congress Subject Headings (LCSH),
are essentially binary. It's non-overlapping buckets. Something either
does or does not belong in a subject. There are no gradations of
belonging."

This is true in the ideal--but not in reality.  LCSH is *NOT* a
thesaurus, although, at one point in time, there seemed to be some
movement in that direction.  A thesaurus, by design, attempts to be as
"binary" as possible.  LCSH has a century of "tradition" embedded, and
that tradition is not binary--there is a lot of non-mutually-exclusive
overlap in the headings.  Part of the reason is that the English
language contains a lot of ambiguity in its vocabulary--that is, a lot
of its richness comes from the fact that it is deliberately nonbinary
(how else could one negotiate compromises?).  (In some respects,
attempting to make it more precise for thesaurus-type searching would
result in making it more difficult for searchers to use "common sense"
natural language.)  Through a machine manipulation back in (I think) the
mid to late 1980s, the LCSH headings were made to *appear* to be more
thesaurus-like.  But that was done by taking the existing relationships
between headings and creating a relationship "mapping" to UFs, BTs, NTs,
and RTs.  It was *NOT*(!!!) done through a thorough re-analysis of LCSH
and cleaning it up to thesaurus standards.  If there are any LCers on
this list, I'll probably get it in the neck for some of these
statements--but these are my understandings from all the explanations
that were current at the time it happened.

Regarding the existing (often "nonthesaurical") interrelationships
between headings, you should note this paragraph from the introduction
to LCSH: "Since the inception of the list, headings have been created as
needed when works were cataloged for the collection of the Library of
Congress. Because the list has expanded over time, it reflects the
varied philosophies of the hundreds of catalogers who have contributed
headings. Inconsistencies in formulation [my addition: and assignment]
of headings can usually be explained by the policies in force at the
varying dates of their creation."

Your blog entry further noted: "I hit upon the idea that subjects
"reinforce" each other, and that this must leave a statistical
signature. For example, it seems that "Love stories" and "Psychological
fiction" are commonly applied to books about "Man-Woman Relationships,"
but that "Androgynous robot alone on an island -- Stories" is not.
(Okay, that's not real, but the point stands.) Can these "related
subjects" relevancy rank the subject itself? ... Anyway, my plane has
landed--allowing me to do real work again--so I end in aporia. Ideas?"

This description reminded me of something I came across about 30 years
ago related to a musical application of computer programming: Markov
chains.  (It was a way of relating successive notes of a melody
statistically according to their probabilities of being in a given
sequence.)  By statistically analyzing sequences of melodic notes and
creating a table of probabilities, it was possible to generate melodic
sequences similar to the original melodies analyzed.  It seems to me
that, by analyzing pairs of LCSHs assigned to works, you *might* be
able to come up with a similar table of probabilities of statistical
relationships.  You've piqued my interest in looking at our own LCSH
assignments within the 365,000
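
As a rough illustration of the pair-probability table suggested above, here
is a minimal Python sketch. It is not a true Markov chain, since the headings
on a record have no meaningful order; it simply estimates, for each pair of
headings, how often the second appears on works that carry the first. The
function names, the scoring rule (the average of those probabilities over a
work's other headings), and the sample records are all assumptions made up
for the example, not anything taken from the blog post or from a real
catalog.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_probabilities(works):
    """Estimate P(b | a): the share of works carrying heading a that also carry b."""
    heading_counts = Counter()
    pair_counts = Counter()
    for headings in works.values():
        unique = set(headings)
        heading_counts.update(unique)
        # Ordered pairs (a, b) with a != b, so P(b | a) and P(a | b) stay separate.
        pair_counts.update(permutations(unique, 2))
    return {(a, b): n / heading_counts[a] for (a, b), n in pair_counts.items()}


def rank_works(works, subject, probs):
    """Rank works under `subject` by the mean P(other | subject) of their other headings."""
    scores = {}
    for title, headings in works.items():
        if subject not in headings:
            continue
        others = set(headings) - {subject}
        if others:
            total = sum(probs.get((subject, h), 0.0) for h in others)
            scores[title] = total / len(others)
        else:
            # Arbitrary choice: a work with no other headings gets no reinforcement.
            scores[title] = 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented sample records (hypothetical titles and heading assignments).
works = {
    "Work A": ["Man-woman relationships", "Love stories", "Psychological fiction"],
    "Work B": ["Man-woman relationships", "Love stories"],
    "Work C": ["Man-woman relationships", "Robots"],
    "Work D": ["Robots", "Science fiction"],
}

probs = cooccurrence_probabilities(works)
ranking = rank_works(works, "Man-woman relationships", probs)
for title, score in ranking:
    print(f"{title}: {score:.3f}")
```

With so few records the numbers mean little; the point is only that headings
which frequently accompany the searched subject push a work up the list,
while an unusual companion heading pulls it down. Averaging is one of several
plausible scoring rules and may well need tuning against real data.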

By the way, in your blog entry you noted: "I was reminded of the
question when checking out OCLC's new project, FictionFinder. I'll blog
about the whole later, but for now know that you can search for a LCSH
subject and get back a list of books belonging to it."  In reality,
OCLC's FictionFinder project is *not* directly LCSH-related per se but
is, rather, based on FRBR elements and attributes, some of which are
subject-related.  The algorithm is freely downloadable from:
http://www.oclc.org/research/projects/frbr/algorithm.htm

Everyone might find the OCLC research projects site to be of interest in
general:  http://www.oclc.org/research/projects/default.htm
Received on Mon Feb 05 2007 - 15:04:20 EST