I played with a project that counted author + title "mentions" in the
Google Books corpus. Unfortunately, since there's no
proximity searching in Google Books, there's no way AFAIK to weed out the
false hits. Maybe a similar thing could be done with Hathi Trust data? Do
you know of any indexing software with proximity searching (same sentence?)
that could be used for such a project?
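
To make the "same sentence" idea concrete, here is a minimal sketch in Python. It is not tied to any particular indexing package: the function names, the sample text, and the naive sentence splitter are all assumptions for illustration, and a real corpus would need a better splitter (abbreviations such as "Mr." would break this one).

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after ., !, or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def same_sentence_mentions(text, author, title):
    """Return the sentences that contain both the author and the title."""
    author_l, title_l = author.lower(), title.lower()
    hits = []
    for sentence in split_sentences(text):
        low = sentence.lower()
        if author_l in low and title_l in low:
            hits.append(sentence)
    return hits

# Hypothetical sample text, not taken from any real corpus.
sample = (
    "Mark Twain wrote Huckleberry Finn in the 1880s. "
    "Twain also lectured widely. "
    "Huckleberry Finn is often assigned in schools."
)
hits = same_sentence_mentions(sample, "Twain", "Huckleberry Finn")
print(hits)
```

Only the first sample sentence counts as a mention here, since the other two name the author or the title but not both.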
Cindy Harper, Systems Librarian
Colgate University Libraries
charper_at_colgate.edu
315-228-7363
On Tue, Dec 21, 2010 at 2:55 PM, Eric Lease Morgan <emorgan_at_nd.edu> wrote:
> >> In the context of my previous message, there are two types of data:
> >> 1) quantitative, and 2) qualitative. The former is applicable to
> >> mathematical processes. The latter is not.
> >
> > But you can quantify what you call qualitative data, that is, data
> > that is not numeric. You can count anything, as the applications that
> > are making use of full text are doing. You can make "more related to"
> > calculations even using words ("this word is more related to another
> > word than that word" or "A has a greater relationship to B than C has
> > a relation to B"). I'm not sure why you would limit yourself to
> > numerical data, rather than countable data. Once you count, you turn
> > your data into quantity. Based on the nature of our data, I think
> > that's where we'll get bang for our computational buck.
>
>
> Only things that are represented as numbers are countable. I can't count
> The Adventures of Huckleberry Finn. Nor can I count Origami--Juvenile works.
> Yes, I can count the number of books by Mark Twain a library owns, and I can
> count the number of works related to paper craft, but these tabulations tell
> me about the collection. I want to produce quantitative information on
> works, not the catalog. For example, some measurable characteristics of
> works may include:
>
> * Big Name index (percentage of quotes from leading authorities)
> * color index (normalized percentage of color words used)
> * date written
> * grade level
> * Great Ideas index (percentage of philosophy ideas in text)
> * length in words
> * librarian rating
> * number of citations
> * number of editions
> * number of graphics
> * number of pictures
> * number of prizes won
> * number of times circulated
> * percentage of languages used in a text
> * percentage of mathematical formulas in a text
> * percentage of unique words in a text
> * price
> * publisher rating
> * readability score
> * reader rating
>
> Given imagination, I'm sure many more quantifiable characteristics could be
> enumerated.
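
A few of the characteristics listed above can be computed directly from full text. The following is a rough Python sketch, assuming a simple lowercase word tokenizer and a small hand-picked set of color words; both are illustrative choices, and the "color index" here is a plain percentage with no further normalization.

```python
import re

# Assumed, illustrative list of color words; a real index would need a fuller list.
COLOR_WORDS = {"red", "orange", "yellow", "green", "blue",
               "purple", "black", "white", "brown", "gray"}

def tokenize(text):
    # Lowercase the text and keep runs of letters and apostrophes as words.
    return re.findall(r"[a-z']+", text.lower())

def text_metrics(text):
    """Compute length in words, percent unique words, and a simple color index."""
    words = tokenize(text)
    total = len(words)
    if total == 0:
        return {"length_in_words": 0,
                "percent_unique_words": 0.0,
                "color_index": 0.0}
    unique = len(set(words))
    colors = sum(1 for w in words if w in COLOR_WORDS)
    return {
        "length_in_words": total,
        "percent_unique_words": 100.0 * unique / total,
        "color_index": 100.0 * colors / total,
    }

# Hypothetical sample sentence.
sample = "The red boat and the blue boat drifted down the wide river."
metrics = text_metrics(sample)
print(metrics)
```

Measures such as grade level or readability score would need syllable counts and sentence lengths on top of this, so they are left out of the sketch.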
>
> Once done, these characteristics can be compared to one another, and they
> can be used from two sides of the same problem. On one hand such
> characteristics can be integrated into "discovery systems" (catalogs) to
> assist the reader in identifying items for use. "I want a book that is
> popular, contains a minimum of mathematical formulas, has many citations and
> illustrations, but is not too difficult to read." On the other hand, a
> person could identify an item not in a collection, feed the item to a system
> for analysis, and return a list of characteristics about the item. "This
> item is longer than most, has many citations, is expensive, has a low reader
> rating, and is not very 'colorful'." Finally, some sort of graph or chart
> could be drawn, literally illustrating the characteristics of a given work.
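
The "discovery system" side of this could be sketched as a filter over per-work profiles. In the Python below, the `WorkProfile` fields, the thresholds, and the two catalog entries are all hypothetical, chosen only to mirror the example request (popular, few formulas, many citations and illustrations, not too hard to read).

```python
from dataclasses import dataclass

@dataclass
class WorkProfile:
    # Hypothetical per-work measurements; field names are assumptions.
    title: str
    times_circulated: int
    percent_formulas: float
    citations: int
    pictures: int
    grade_level: float

def matches_request(work, min_circulation=50, max_formulas=1.0,
                    min_citations=20, min_pictures=10, max_grade=10.0):
    """True if the work is popular, light on formulas, well cited,
    well illustrated, and not too difficult to read."""
    return (work.times_circulated >= min_circulation
            and work.percent_formulas <= max_formulas
            and work.citations >= min_citations
            and work.pictures >= min_pictures
            and work.grade_level <= max_grade)

# Made-up catalog entries for illustration only.
catalog = [
    WorkProfile("Work A", 120, 0.2, 45, 30, 8.5),
    WorkProfile("Work B", 200, 12.0, 60, 5, 14.0),
]
picks = [w.title for w in catalog if matches_request(w)]
print(picks)
```

With these made-up numbers only "Work A" passes; "Work B" is popular and well cited but has too many formulas, too few pictures, and too high a grade level.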
>
> Granted, none of this was feasible a decade ago since there was little full
> text. Things are changing. Things are different now. Full text is becoming
> the norm, and this opens up all sorts of possibilities. Somebody is going to
> do this sort of work, if it isn't being investigated already. Libraries are
> not about books. They are about what is inside the books. We need to be
> providing tools enabling our constituents to use these insides lest the
> profession become marginalized. Find is not as much of a problem to solve
> as it used to be. People can find more than they need, and the amount of
> effort needed to find more is past the point of diminishing returns.
> Instead, use and understanding are the name of the game. Measurement is a
> standard means to understanding. Quantification is a necessary element of
> measurement.
>
> --
> Eric Lease Morgan
> "Take the Great Books Survey -- http://bit.ly/auPD9Q"
>
Received on Wed Dec 22 2010 - 09:54:40 EST