Apropos the recent discussions about the lack of subdivision codes in the
current LCSH SKOS rendering:
I trained the Stanford maximum entropy POS Tagger using 90% of the Dec
2006 Subjects file as training data. Using the remaining 10% as a test set,
the tagger guessed the correct tag with 99.78% accuracy.
There were 91 tagging errors on 41,802 words. Of these, 56 were incorrect
predictions of $x versus $v (topic vs. form/genre), and 28 were confusions
between $x and $z (topic vs. place).
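As a quick check on those figures, here is the arithmetic as a small Python
sketch. The counts are the ones reported above; the variable names are only
for illustration.

```python
# Counts reported above for the held-out 10% test set.
total_words = 41802
total_errors = 91
xv_errors = 56  # $x vs. $v (topic vs. form/genre)
xz_errors = 28  # $x vs. $z (topic vs. place)

accuracy = 1 - total_errors / total_words
other_errors = total_errors - xv_errors - xz_errors

print(f"accuracy: {accuracy:.2%}")
print(f"$x/$v share of errors: {xv_errors / total_errors:.1%}")
print(f"$x/$z share of errors: {xz_errors / total_errors:.1%}")
print(f"errors of other kinds: {other_errors}")
```

The two confusions together account for 84 of the 91 errors, leaving 7 of
other kinds.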
The $x/$v issue is to be expected, since the data really are ambiguous.
A gazetteer of place names, such as the USGS GNIS files, can deal with
many of the $x/$z errors.
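A minimal sketch of how such a gazetteer pass might look, in Python. This is
not what I ran: the function name relabel_with_gazetteer, the (text, code)
pair representation, and the tiny hard-coded place set are all assumptions
made for illustration, and loading the actual GNIS files is not shown. It
also only handles one direction ($x predicted where $z was meant).

```python
def relabel_with_gazetteer(tagged, place_names):
    """Retag $x subdivisions as $z when their text is a known place name.

    `tagged` is a list of (subdivision_text, predicted_code) pairs;
    this representation is an assumption, not the tagger's own output.
    """
    known = {name.casefold() for name in place_names}
    fixed = []
    for text, code in tagged:
        if code == "x" and text.casefold() in known:
            code = "z"
        fixed.append((text, code))
    return fixed


# Hypothetical stand-in for names loaded from a gazetteer such as GNIS.
places = {"Ohio", "Lake Erie"}
predicted = [("History", "x"), ("Ohio", "x"), ("Maps", "v")]
print(relabel_with_gazetteer(predicted, places))
```

Exact string matching like this would miss qualified or inverted forms of
place names, so it is a starting point rather than a complete fix.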
Simon
Received on Tue May 26 2009 - 19:20:09 EDT