Re: parts-of-speech

From: Laval Hunsucker <amoinsde_at_nyob>
Date: Wed, 9 Feb 2011 09:40:03 -0800
To: NGC4LIB_at_LISTSERV.ND.EDU
Eric,

Interesting.

You're talking only about the full text of English-language documents 
here, right ?

But even then, there's lots of room for fundamental ambiguity / 
uncertainty in your source data, it seems to me. To deal with 
that meaningfully, the software'd have to be enormously 
sophisticated, no ? Can we even make software that 
sophisticated ?
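
( To make the ambiguity point concrete : the classic sort of case I 
have in mind can be shown with a couple of lines of Python and NLTK 
-- purely an illustration on my part, nothing from Eric's actual 
toolkit :

  import nltk

  # Same surface word, two parts of speech: "flies" is a verb in the first
  # sentence and a noun in the second. A tagger has to commit to one reading
  # per token, right or wrong. Assumes NLTK plus its tokenizer and tagger
  # models are installed.
  for sentence in ["Time flies like an arrow.", "Fruit flies like a banana."]:
      print(nltk.pos_tag(nltk.word_tokenize(sentence)))

Multiply that by every sentence in a long full text and the tagger's 
guesses start to matter. )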

And while I don't disagree with your two closing sentences below 
-- far from it, and you have made these important points here before 
-- I still don't really ( yet ) get the ultimate point of this kind of 
POS usage analysis as such. ( As opposed to the clearly important 
"discovery" possibilities you'd broached in previous posts. )

Where are you headed and why ( apart from just being able to do 
it, of course, which may be kinda nice in itself ) ?  What value-
added functionality or product is in this case awaiting all those 
appreciative library users down the road ( i.e., those who are left 
:-] ) ?  You may have a clear -- or rough -- idea, but I don't as yet. 
( Maybe I'm stupidly overlooking something. )

[ And of course I'm wondering what of significance one could hope 
to say -- if anything at all -- in the cases of those two authors on 
your list who were presumably being represented not by what they 
wrote but by what their translators made of it in a language which 
in numerous ways works quite differently to the one they 
themselves employed. ]


 - Laval Hunsucker
   Breukelen, Nederland




----- Original Message ----
From: Eric Lease Morgan <emorgan_at_ND.EDU>
To: NGC4LIB_at_LISTSERV.ND.EDU
Sent: Mon, February 7, 2011 2:10:28 PM
Subject: [NGC4LIB] parts-of-speech

For the past year or so I have been dabbling with text mining, and my latest 
foray involved the analysis of parts-of-speech (POS) in full text.

With the advent of so much full text, it seems logical to me to figure out ways 
to describe individual items -- as well as our collections as a whole -- by 
analyzing more than the most basic bibliographic information. Based on my 
initial and rudimentary investigations, differentiating texts by POS does not 
look promising. From my blog posting:

  I now have the tools necessary to answer one of my initial
  questions, "Do some works contain a greater number of nouns,
  verbs, and adjectives than others?"... The result was very
  surprising to me. Despite the wide range of document sizes, and
  despite the wide range of genres, the relative percentages of POS
  are very similar across all of the documents... Based on this
  foray and rudimentary analysis the answers are, "No, there are
  not significant differences, and no, works do not contain
  different numbers of nouns, verbs, adjectives, etc."
  
  http://bit.ly/hsxD2i
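
A rough sketch of the sort of tally I mean -- illustrative only, and not the 
code behind the blog post -- might look like this in Python with NLTK:

  # Illustrative sketch: relative POS percentages for a set of plain-text
  # files. Assumes NLTK plus its tokenizer and tagger models are installed.
  import sys
  from collections import Counter

  import nltk

  def pos_percentages(path):
      """Return each coarse POS tag's share of all tokens in one file."""
      with open(path, encoding='utf-8') as fh:
          tokens = nltk.word_tokenize(fh.read())
      counts = Counter(tag[:2] for _, tag in nltk.pos_tag(tokens))  # NN, VB, JJ, ...
      total = sum(counts.values())
      return {tag: 100.0 * n / total for tag, n in counts.items()}

  if __name__ == '__main__':
      for filename in sys.argv[1:]:
          shares = sorted(pos_percentages(filename).items(), key=lambda kv: -kv[1])
          print(filename, ', '.join('%s %.1f%%' % (t, p) for t, p in shares[:5]))

Comparing that output across documents of very different sizes and genres is 
the kind of check behind the result quoted above.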

By exploiting the existence of full text, library "discovery systems" can be so 
much more functional and useful. We need to be taking advantage of our 
environment to a much greater degree.

-- 
Eric Lease Morgan
University of Notre Dame

Great Books Survey -- http://bit.ly/auPD9Q



 