Re: Automatic Classification

From: Weinheimer Jim <j.weinheimer_at_nyob> Date: Thu, 6 Sep 2007 14:58:44 +0200 To: NGC4LIB_at_listserv.nd.edu

Conal Tuohy wrote:

> Your problem with it is just that it's a scientific experiment rather
> than a commercial library application?
>
> NB the paper does make some mention of the problem of "over-fitting"
> which arises when trying to learn to make fine discriminations using a
> small training corpus (810 texts). By training the machine using a
> larger corpus (which would certainly be the case in a real-world
> application), finer discrimination would be possible. But the point of
> this experiment was actually quite specifically to evaluate Named Entity
> Recognition as a useful contribution to the toolbox; it was not to
> generate a perfect catalogue for Project Gutenberg.

My problem is pointing to this sort of thing and calling it a "success." It is fine as an experiment--I'm all for it--but we can't draw any real-world conclusions from it. From this work, we cannot conclude that computers can assign classification numbers at all--in fact, quite the opposite. And maybe the machine would get better with a larger corpus, and maybe it would get worse.

> NB it's worth considering the possibility (which is not AFAICS mentioned
> in the paper) that the discrepancies which exist between the
> classification made by their various classification algorithms and the
> "gold standard" classification might be resolved in the machines'
> favour. i.e. that they might be putting too much faith in the "gold
> standard".

This I seriously doubt. The errors in classification I have seen are primarily people adding things incorrectly, applying incorrect tables, or writing the number in the wrong book, or else they are dealing with very difficult classifications such as the Ns or Ks.

If everything is based on the two-letter combinations, which is an absurdly simple "classification," the only way a human could mess it up is if they were dealing with a text they could not read. For example, I don't know Arabic, and if I had to "classify" an Arabic text, I would be incompetent. I wouldn't know if it was a literary text or had something to do with electrical engineering.

Machines make such mistakes, however. We just don't call them "incompetent." ;-)

Jim Weinheimer