Automatic Classification

From: Weinheimer Jim <j.weinheimer_at_nyob>
Date: Wed, 5 Sep 2007 09:20:44 +0200
To: NGC4LIB_at_listserv.nd.edu
On 9/5/07, Alexander Johannesen wrote:
> > Also, in the flurry of postings, I apologize that I must have missed the
> > "paper [you] pointed to which did auto-classification across Project
> > Gutenberg [and] chose LCSH specifically so that librarians could verify
> > the result."  Do you happen to have a ref or link to that paper?
>
> Sure ;
>  http://www.ltg.ed.ac.uk/np/publications/ltg/papers/Betts2007Utility.pdf

I just looked at this paper, and I have a lot of questions about it. LC class numbers are rather complex. There are different parts to them, e.g.
ND623.C38D46 1995
The number above breaks down into painting (ND623), individual artist (C38), shelf number (D46), and date of publication (1995).
Numbers get far more complex than this, by the way.
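To make the breakdown above concrete, here is a minimal Python sketch that splits a simple call number of this shape into its parts. The pattern, the function names, and the field names are my own assumptions for illustration, and the pattern handles only the simple form shown here, not the more complex numbers that exist. The rule in classification_part (keep the first Cutter when there are two, drop it when there is only one) is a rough guess at "the part up to the shelf number", not a cataloging rule.

```python
import re

# Hypothetical pattern for simple call numbers such as "ND623.C38D46 1995".
# Real LC call numbers have many more variants than this handles.
CALL_NUMBER = re.compile(
    r"""^\s*
    (?P<letters>[A-Z]{1,3})      # class letters, e.g. ND
    (?P<number>\d+(?:\.\d+)?)    # class number, e.g. 623
    \s*\.?\s*
    (?P<cutter1>[A-Z]\d+)        # first Cutter, e.g. C38 (the artist here)
    \s*
    (?P<cutter2>[A-Z]\d+)?       # optional second Cutter, e.g. D46 (shelf number)
    \s*
    (?P<year>\d{4})?             # optional date of publication
    \s*$""",
    re.VERBOSE,
)


def parse_call_number(text):
    """Return a dict of the parts, or None if the text does not match."""
    m = CALL_NUMBER.match(text)
    return m.groupdict() if m else None


def classification_part(text):
    """Return the part up to the shelf number, under the assumption above."""
    parts = parse_call_number(text)
    if parts is None:
        return None
    base = parts["letters"] + parts["number"]
    # Two Cutters: keep the first as part of the class, drop the second.
    # One Cutter: treat it as the shelf number and drop it.
    if parts["cutter2"]:
        return base + "." + parts["cutter1"]
    return base
```

For the example above, parse_call_number("ND623.C38D46 1995") gives the letters ND, the number 623, the Cutters C38 and D46, and the year 1995, and classification_part returns "ND623.C38".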

In the online world, I personally have my doubts about the utility of class numbers, but in any case, I would think that the shelf numbers would be of much less importance. Therefore, the interest would be in the part up to the shelf number (in MARC21, this would be the information in 050 $a).

The authors don't go into detail about the LC class numbers and they give no examples at all. If they got their numbers from the Project Gutenberg metadata, which I suspect, the LC numbers are only the letters, e.g. PS, PT, GV, etc. I don't know whether they assigned the entire number, e.g. ND623.C38D46 1995, but I doubt it very seriously.

This is an example of the problem I have with evaluating this sort of research. Although I applaud the attempt, somebody must point out any problems in it. I suspect their project was assigning only the first two letters, that is, it was counted as a success if it assigned PS for a work of American literature, and not NK for decorative arts.

I would agree this would be a success, but only within the terms of this experiment. In the real world, assigning only the two letters would be useless.
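The difference between the two standards of success can be sketched in a few lines of Python. The pairs below are invented purely for illustration and are not data from the paper; the function names are also my own. The point is only that the same assignments can look good when compared on the class letters and poor when compared on the fuller number.

```python
import re


def class_letters(call_number):
    """Return only the leading class letters, e.g. 'ND' from 'ND623.C38'."""
    m = re.match(r"[A-Z]+", call_number.strip())
    return m.group(0) if m else ""


def agreement(pairs, key):
    """Fraction of (assigned, expected) pairs that agree after applying key."""
    if not pairs:
        return 0.0
    hits = sum(1 for assigned, expected in pairs if key(assigned) == key(expected))
    return hits / len(pairs)


# Invented (assigned, expected) pairs for illustration only.
pairs = [
    ("ND237.S3", "ND623.C38"),
    ("PS3545.I5", "PS3511.A86"),
    ("NK1510", "PS1305"),
]

letters_only = agreement(pairs, class_letters)         # compare "ND" with "ND", etc.
full_number = agreement(pairs, lambda s: s.strip())    # compare the whole string
```

With these made-up pairs, two of the three agree on the letters while none agree on the fuller number, which is the gap I am worried about.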

Jim Weinheimer
Received on Wed Sep 05 2007 - 01:24:13 EDT