Re: Relevancy-ranking LCSH?

From: Hahn, Harvey <hhahn_at_nyob>
Date: Mon, 5 Feb 2007 15:42:07 -0600
To: NGC4LIB_at_listserv.nd.edu

Tim Spalding wrote:
|I wonder if anyone has made, seen or can think of any good methods to
|do it. So far I've only seen non-ranked and popularity-ranked results.
|In the blog post I talk about playing around with how LCSHs "reinforce"
|each other statistically, but I couldn't get the algorithm to produce
|good results more than sporadically.
|I'm not sure if this is a cataloging or a coding. Maybe that's
|the point.

Your blog entry said:
"It's easy to ignore a third, and very critical difference. Subject
classifications, like the Library of Congress Subject Headings (LCSH),
are essentially binary. It's non-overlapping buckets. Something either
does or does not belong in a subject. There are no gradations of
belonging."

This is true in the ideal--but not in reality.  LCSH is *NOT* a
thesaurus, although, at one point in time, there seemed to be some
movement in that direction.  A thesaurus, by design, attempts to be as
"binary" as possible.  LCSH has more than a century of "tradition"
embedded, and that tradition is *not* binary--there is a lot of
non-mutually-exclusive overlap in the headings.  Part of the reason is
that the English language contains a lot of ambiguity in its
vocabulary--that is, a lot of its richness comes from the fact that it
is deliberately nonbinary (how else could one negotiate compromises?).
(In some respects, attempting to make it more precise for thesaurus-type
searching would result in making it more difficult for searchers to use
"common sense" natural language.)  Through a machine conversion back
in (I think) the mid to late 1980s, LCSH was made to *appear* more
thesaurus-like.  But that was done by taking the existing relationships
between headings and "mapping" them to UFs (Used For), BTs (Broader
Terms), NTs (Narrower Terms), and RTs (Related Terms).  It was
*NOT*(!!!) done through a thorough re-analysis of LCSH and cleaning it
up to thesaurus standards.
If there are any LCers on this list, I'll probably get it in the neck
for some of these statements--but these are my understandings from all
the explanations that were current at the time it happened.

Regarding the existing (often "nonthesaurical") interrelationships
between headings, you should note this paragraph from the introduction
to LCSH: "Since the inception of the list, headings have been created as
needed when works were cataloged for the collection of the Library of
Congress. Because the list has expanded over time, it reflects the
varied philosophies of the hundreds of catalogers who have contributed
headings. Inconsistencies in formulation [my addition: and assignment]
of headings can usually be explained by the policies in force at the
varying dates of their creation."

Your blog entry further noted: "I hit upon the idea that subjects
"reinforce" each other, and that this must leave a statistical
signature. For example, it seems that "Love stories" and "Psychological
fiction" are commonly applied to books about "Man-Woman Relationships,"
but that "Androgynous robot alone on an island -- Stories" is not.
(Okay, that's not real, but the point stands.) Can these "related
subjects" relevancy rank the subject itself? ... Anyway, my plane has
landed--allowing me to do real work again--so I end in aporia. Ideas?"

This description reminded me of something I came across about 30 years
ago related to a musical application (besides an MLS, I also have a
master's degree in music) of computer programming: Markov chains.  (The
technique relates successive notes of a melody statistically, according
to the probability that one note follows another.)  By statistically
analyzing sequences of melodic notes and creating a table of transition
probabilities, it was possible to generate melodic sequences resembling
the original melodies analyzed.  It seems to me that, by analyzing pairs
of LCSHs assigned to the same work, you *might* be able to come up with
a similar table of probabilities describing how strongly headings are
related.  You've piqued my interest in looking at the
relationships of our own LCSH assignments within the 334,000 bib records
in our OPAC, although, for experimental purposes, I may have to reduce
the number of records to use.  (On the other hand, the J language is
outstanding for manipulating arrays, including "sparse" arrays, but I'm
only a beginner at learning it.)  It'd be quite easy to export the group
of subject headings for each bib record.  The trickier part would be the
subject pair analysis, but, since I've done programming as an avocation
for over 30 years (even ran my own library software company for a
while), I hope that it wouldn't present too much of a challenge.  The
only "fly in the ointment" is that this is "performance appraisal
season" at my library, and, as a supervisor, my priority just now is
writing and giving appraisals.  (But I might be able to dump my bib LCSH
file to a USB flash drive or CD-ROM and take it home to work on.  That
way, I might be able to accomplish both things at the same time.)
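
A minimal sketch of the subject-pair analysis described above, in Python
rather than J. It is only one possible reading of the idea, not Tim's
algorithm and not a finished method. It assumes each bib record has been
exported as a plain list of heading strings; the function names, the
sample records, and the "reinforcement" score (the average probability
that a record's other headings co-occur with the target subject) are all
hypothetical illustrations.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_table(records):
    """Count headings and unordered heading pairs across all records."""
    heading_counts = Counter()
    pair_counts = Counter()
    for headings in records:
        unique = sorted(set(headings))  # ignore repeats within one record
        heading_counts.update(unique)
        # Every unordered pair of headings assigned to the same record.
        pair_counts.update(combinations(unique, 2))
    return heading_counts, pair_counts


def conditional_probability(a, b, heading_counts, pair_counts):
    """Estimate P(b | a): share of records with heading a that also have b."""
    if heading_counts[a] == 0:
        return 0.0
    pair = tuple(sorted((a, b)))  # pairs are stored in sorted order
    return pair_counts[pair] / heading_counts[a]


def reinforcement_score(subject, headings, heading_counts, pair_counts):
    """Hypothetical score: how strongly a record's other headings
    'reinforce' the given subject, averaged over those headings."""
    others = [h for h in set(headings) if h != subject]
    if not others:
        return 0.0
    total = sum(
        conditional_probability(h, subject, heading_counts, pair_counts)
        for h in others
    )
    return total / len(others)


# Made-up sample data standing in for an exported bib file.
records = [
    ["Man-woman relationships", "Love stories"],
    ["Man-woman relationships", "Love stories", "Psychological fiction"],
    ["Man-woman relationships", "Psychological fiction"],
    ["Man-woman relationships", "Robots"],
    ["Robots", "Islands"],
]

heading_counts, pair_counts = cooccurrence_table(records)

# Rank the records carrying the subject by their reinforcement score.
subject = "Man-woman relationships"
ranked = sorted(
    (r for r in records if subject in r),
    key=lambda r: reinforcement_score(subject, r, heading_counts, pair_counts),
    reverse=True,
)
for r in ranked:
    print(r)
```

With real data the pair table is large and mostly empty, so a sparse
representation such as the Counter used here (or J's sparse arrays) keeps
memory use tied to the pairs that actually occur. Unlike a Markov chain
over a melody, headings within a record have no meaningful order, so the
sketch counts unordered pairs rather than sequential transitions.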

By the way, in your blog entry you noted: "I was reminded of the
question when checking out OCLC's new project, FictionFinder. I'll blog
about the whole later, but for now know that you can search for a LCSH
subject and get back a list of books belonging to it."  In reality,
OCLC's FictionFinder project is *not* LCSH-based per se but is, rather,
based on *FRBR* (Functional Requirements for Bibliographic Records)
elements and attributes, some of which are subject-related.  OCLC's FRBR
algorithm is freely downloadable from:
http://www.oclc.org/research/projects/frbr/algorithm.htm

In fact, everyone might find the OCLC research projects site to be of
interest in general:  http://www.oclc.org/research/projects/default.htm

Interesting post and blog entry!

Harvey

--
===========================================
Harvey E. Hahn, Manager, Technical Services Department
Arlington Heights (Illinois) Memorial Library
Desk: 847/506-2644 -- FAX: 847/506-2650 -- Email: hhahn_at_ahml.info
Personal web pages: http://users.anet.com/~packrat