On Wed, 2007-09-05 at 09:20 +0200, Weinheimer Jim wrote:
> On 9/5/07, Alexander Johannesen wrote:
> > > Also, in the flurry of postings, I apologize that I must have
> > > missed the "paper [you] pointed to which did auto-classification
> > > across Project Gutenberg [and] chose LCSH specifically so that
> > > librarians could verify the result." Do you happen to have a ref
> > > or link to that paper?
> >
> > Sure ;
> >
> > http://www.ltg.ed.ac.uk/np/publications/ltg/papers/Betts2007Utility.pdf
>
> I just looked at this paper, and I have a lot of questions about it.
> LC class numbers are rather complex. There are different parts to
> them, e.g.
> ND623.C38D46 1995
> The number above is about painting (ND623), individual artist (C38),
> shelf number (D46), and date of publication (1995).
> Numbers get far more complex than this, by the way.
>
> In the online world, I personally have my doubts about the utility of
> class numbers, but in any case, I would think that the shelf numbers
> would be of much less importance. Therefore, the interest would be in
> the part up to the shelf number (in MARC21, this would be the
> information in 050 $a).
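(As an aside on the structure you describe: here is a rough, purely
illustrative Python sketch of splitting a simple call number of that shape
into its parts. The pattern only handles the straightforward case above,
not the more complex numbers you mention.)

import re

# Very rough split of a simple LC call number into the parts listed above:
# class (ND623), artist cutter (C38), shelf/item cutter (D46), year (1995).
# Real LCC call numbers are far more varied than this pattern allows.
SIMPLE_LCC = re.compile(
    r"^(?P<klass>[A-Z]{1,3}\d+(?:\.\d+)?)"   # classification, e.g. ND623
    r"\.(?P<cutter1>[A-Z]\d+)"               # first cutter, e.g. C38
    r"(?P<cutter2>[A-Z]\d+)?"                # optional second cutter, e.g. D46
    r"\s*(?P<year>\d{4})?$"                  # optional date, e.g. 1995
)

m = SIMPLE_LCC.match("ND623.C38D46 1995")
if m:
    print(m.groupdict())
    # {'klass': 'ND623', 'cutter1': 'C38', 'cutter2': 'D46', 'year': '1995'}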
The main point of the paper was to test out "named entity recognition"
in the context of automatic classification. They wanted to see if, by
identifying and extracting the names of people, places, and dates
mentioned in the text, they could use this "structured" extra data to
improve the performance of their existing classification algorithms. The
result was that, yes, the "hybrid" classification system which made use
of entity names was significantly better.
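To make that concrete for anyone who hasn't read the paper, here is a
minimal sketch of what such a "hybrid" set-up can look like. The crude
entity extractor and the scikit-learn pipeline below are my own
illustration of the general idea, not the authors' actual system.

import re
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def crude_entities(text):
    # Stand-in for a real NER system: capitalised word sequences as "names",
    # four-digit runs starting with 1 as "dates".
    names = re.findall(r"\b(?:[A-Z][a-z]+ ){1,3}[A-Z][a-z]+\b", text)
    dates = re.findall(r"\b1[0-9]{3}\b", text)
    return " ".join(names + dates)

hybrid = Pipeline([
    ("features", FeatureUnion([
        ("words", TfidfVectorizer()),                                # plain bag-of-words features
        ("entities", TfidfVectorizer(preprocessor=crude_entities)),  # entity-based features
    ])),
    ("classifier", LinearSVC()),   # predicts the classification label
])

# texts:  list of full-text strings from Project Gutenberg
# labels: the classification codes from the Gutenberg metadata, e.g. "PS", "GV"
# hybrid.fit(texts, labels)
# predicted = hybrid.predict(other_texts)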
> The authors don't go into details over the LC class numbers and they
> give no examples at all. If they got their numbers from the Project
> Gutenberg metadata, which I suspect, the LC numbers are only the
> letters, e.g. PS, PT, GV, etc. I don't know if they assigned the
> entire number, e.g. ND623.C38D46 1995. I doubt this very seriously.
I read the paper and it was absolutely clear to me that they used only
the 2-letter prefix of the LoCC, which had already been assigned to
these texts by Project Gutenberg; i.e. they were using the
Gutenberg-supplied metadata (presumably assigned by human cataloguers)
as the 'gold standard' against which to measure the success of their
algorithms. In other words, they could use librarians' work to verify
their results.
The reason these 2-letter codes were chosen as the output was simply
that librarians had already created these classifications, so there was
something against which the machine output could be compared. The point
of the experiment was to test various classification algorithms and
compare their precision and recall, not to produce an OPAC. So this is
exactly analogous to using OCLC's authority list of David Johnsons.
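To spell out what "precision and recall" means in this setting, here is
a toy scoring of machine output against librarian-assigned codes; the
labels below are invented for illustration, not taken from the paper.

from sklearn.metrics import precision_recall_fscore_support

gold      = ["PS", "PS", "PR", "GV", "ND", "PR"]   # what the cataloguers assigned
predicted = ["PS", "PR", "PR", "GV", "PS", "PR"]   # what the algorithm produced

precision, recall, f1, _ = precision_recall_fscore_support(
    gold, predicted, average="macro", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")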
NB it's worth considering the possibility (which is not, AFAICS,
mentioned in the paper) that some of the discrepancies between the
classifications produced by the various algorithms and the "gold
standard" classification might deserve to be resolved in the machines'
favour; that is, the authors might be putting too much faith in the
"gold standard".
> This is an example of the problem I have with evaluating this sort of
> research.
> Although I applaud the attempt, somebody must point out any problems
> in it. I suspect their project was assigning only the first two
> letters, that is, it was counted as a success if it assigned PS for a
> work of American literature, and not NK for decorative arts.
>
> I would agree this would be a success, but only within the terms of
> this experiment. In the real world, only two letters would be useless.
Your problem with it is just that it's a scientific experiment rather
than a commercial library application?
NB the paper does make some mention of the problem of "over-fitting",
which arises when trying to learn to make fine discriminations from a
small training corpus (810 texts). With a larger training corpus (which
would certainly be available in a real-world application), finer
discrimination would be possible. But the point of this experiment was
quite specifically to evaluate Named Entity Recognition as a useful
contribution to the toolbox; it was not to generate a perfect catalogue
for Project Gutenberg.
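If you want to see that effect for yourself on a labelled corpus, one
rough way is to hold the test set fixed and grow the training set; the
gap between training and test accuracy narrows as the corpus grows.
A sketch (my own, not from the paper):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def accuracy_by_corpus_size(train_texts, train_labels, test_texts, test_labels):
    # With few training texts the model over-fits (high training accuracy,
    # poor test accuracy); more texts narrow the gap.
    for n in (100, 200, 400, 800, 1600):
        if n > len(train_texts):
            break
        model = make_pipeline(TfidfVectorizer(), LinearSVC())
        model.fit(train_texts[:n], train_labels[:n])
        train_acc = model.score(train_texts[:n], train_labels[:n])
        test_acc = model.score(test_texts, test_labels)
        print(f"{n:5d} training texts: train={train_acc:.2f} test={test_acc:.2f}")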
--
Conal Tuohy
New Zealand Electronic Text Centre
www.nzetc.org
Received on Wed Sep 05 2007 - 16:29:13 EDT