Those d___ed subject headings : inferring subdivision codes using a maximum entropy POS Tagger

From: Simon Spero <ses_at_nyob> Date: Tue, 26 May 2009 19:17:46 -0400 To: NGC4LIB_at_LISTSERV.ND.EDU

Apropos the recent discussions about the lack of subdivision codes in the
current LCSH SKOS rendering:

I trained the Stanford maximum entropy POS Tagger using 90% of the Dec
2006 Subjects file as training data. Using the remaining 10% as a test set,
the tagger was able to guess the correct POS tag with 99.78% accuracy.
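
For anyone who wants to try something similar, here is a rough Python
sketch of a 90/10 hold-out split and a per-token accuracy measure. The
function names are made up for illustration, and representing a heading as
a list of (text, code) pairs is an assumption; none of this is the Stanford
tagger's own API or the exact procedure used for the numbers above.

```python
import random


def split_headings(headings, train_fraction=0.9, seed=0):
    """Shuffle the headings reproducibly and split them into train/test lists."""
    shuffled = list(headings)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def tag_accuracy(gold, predicted):
    """Fraction of tokens whose predicted code equals the gold code.

    gold and predicted are parallel lists of code sequences,
    one sequence per heading, e.g. ["$a", "$x", "$z"].
    """
    pairs = [
        (g, p)
        for gold_seq, pred_seq in zip(gold, predicted)
        for g, p in zip(gold_seq, pred_seq)
    ]
    if not pairs:
        raise ValueError("no tokens to score")
    return sum(g == p for g, p in pairs) / len(pairs)


if __name__ == "__main__":
    train, test = split_headings(range(1000))
    print(len(train), len(test))
    print(tag_accuracy([["$a", "$x", "$v"]], [["$a", "$x", "$x"]]))
```

Shuffling with a fixed seed keeps the split repeatable, so a later run can
be compared against the same held-out headings.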

There were 91 tagging errors on 41,802 words. 56 of these were incorrect
predictions of $x versus $v (topic v. form/genre), and 28 were confusions
between $x and $z (topic v. place).
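
As a quick check on how those figures fit together, the arithmetic can be
redone in a few lines of Python. Only the four counts come from the run
above; the variable names are just for illustration.

```python
total_tokens = 41_802   # words in the test set
errors = 91             # tagging errors overall
xv_errors = 56          # $x predicted where $v was right, or vice versa
xz_errors = 28          # $x predicted where $z was right, or vice versa

accuracy = 1 - errors / total_tokens
other_errors = errors - xv_errors - xz_errors

print(f"accuracy: {accuracy:.2%}")                          # accuracy: 99.78%
print(f"$x/$v share of errors: {xv_errors / errors:.1%}")   # 61.5%
print(f"$x/$z share of errors: {xz_errors / errors:.1%}")   # 30.8%
print(f"other errors: {other_errors}")                      # other errors: 7
```

So the two confusions account for 84 of the 91 errors, leaving 7 of other
kinds.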

The $x/$v issue is to be expected, since the data really are ambiguous.
A gazetteer of place names, such as the USGS GNIS files, should resolve
many of the $x/$z errors.
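
A minimal sketch of what such a gazetteer pass might look like, in Python.
Everything here is hypothetical: the function name is invented, the small
in-memory set stands in for names that would really be loaded from the GNIS
files, and the rule (relabel $x as $z on an exact, case-insensitive name
match) is only one possible policy. It has not been tested against the
Subjects file.

```python
def correct_with_gazetteer(tagged, place_names):
    """Relabel $x subdivisions as $z when their text is a known place name.

    tagged      -- list of (text, code) pairs for one heading
    place_names -- iterable of place names (a stand-in for a real gazetteer)
    """
    places = {name.casefold() for name in place_names}
    corrected = []
    for text, code in tagged:
        # Only the $x -> $z direction is handled; other codes pass through.
        if code == "$x" and text.casefold() in places:
            code = "$z"
        corrected.append((text, code))
    return corrected


if __name__ == "__main__":
    gazetteer = {"Ohio", "Lake Erie"}  # toy stand-in, not real GNIS data
    heading = [("Birds", "$a"), ("Migration", "$x"), ("Ohio", "$x")]
    print(correct_with_gazetteer(heading, gazetteer))
    # [('Birds', '$a'), ('Migration', '$x'), ('Ohio', '$z')]
```

Exact matching will miss qualified or inverted forms of place names, and
a topical term that happens to share a name with a place would be wrongly
relabelled, so a real pass would need more care than this.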

Simon