Re: our profession's bibliographic information

From: Jonathan Rochkind <rochkind_at_nyob> Date: Wed, 22 Dec 2010 10:59:54 -0500 To: NGC4LIB_at_LISTSERV.ND.EDU

Not with Google Books, but I believe you can do what you're describing 
with the new tool Google released for analysis of language usage in 
their corpus.

http://ngrams.googlelabs.com/

Hmm, although I'm not sure whether that lets you specify proximity 
requirements. I thought I had heard it did, but the particular 
interface I'm linking to doesn't seem to. Maybe there's an option or 
alternate interface I'm not seeing; I haven't spent much (any) time with 
this.
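
For what it's worth, the "same sentence" proximity test Cindy asks about 
below can be approximated in a few lines of Python once you have plain 
text in hand. This is only a rough sketch, not a recommendation of any 
particular indexing software: the function names are made up for 
illustration, the sentence splitter is deliberately naive, and matching 
is plain case-insensitive substring matching.

```python
import re

def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def count_mentions(text, author, title):
    """Count sentences that contain both the author and the title."""
    author_l, title_l = author.lower(), title.lower()
    return sum(
        1
        for s in split_sentences(text)
        if author_l in s.lower() and title_l in s.lower()
    )

sample = (
    "Mark Twain wrote Huckleberry Finn in 1884. "
    "Twain also lectured widely. "
    "Huckleberry Finn remains popular."
)
# Only the first sentence names both the author and the title.
print(count_mentions(sample, "Twain", "Huckleberry Finn"))  # prints 1
```

Abbreviations such as "Mr." will split sentences in the wrong place, so 
real text would probably need a proper sentence tokenizer.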

On 12/22/2010 9:53 AM, Cindy Harper wrote:
> I played with a project looking at the number of "mentions" (author + title
> mentions) in the Google Books corpus.  Unfortunately, since there's no
> proximity searching in Google Books, there's no way AFAIK to weed out the
> false hits. Maybe a similar thing could be done with Hathi Trust data? Do
> you know of any indexing software with proximity searching (same sentence?)
> that could be used for such a project?
>
> Cindy Harper, Systems Librarian
> Colgate University Libraries
> charper_at_colgate.edu
> 315-228-7363
>
>
>
> On Tue, Dec 21, 2010 at 2:55 PM, Eric Lease Morgan <emorgan_at_nd.edu> wrote:
>
>>>> In the context of my previous message, there are two types of data:
>>>> 1) quantitative, and 2) qualitative. The former is applicable to
>>>> mathematical processes. The latter is not.
>>> But you can quantify what you call qualitative data, that is, data
>>> that is not numeric. You can count anything, as the applications that
>>> are making use of full text are doing. You can make "more related to"
>>> calculations even using words ("this word is more related to another
>>> word than that word" or "A has a greater relationship to B than C has
>>> a relation to B"). I'm not sure why you would limit yourself to
>>> numerical data, rather than countable data. Once you count, you turn
>>> your data into quantity. Based on the nature of our data, I think
>>> that's where we'll get bang for our computational buck.
>>
>> Only things that are represented as numbers are countable. I can't count
>> The Adventures of Huckleberry Finn. Nor can I count Origami--Juvenile works.
>> Yes, I can count the number of books by Mark Twain a library owns, and I can
>> count the number of works related to paper craft, but these tabulations tell
>> me about the collection. I want to produce quantitative information on
>> works, not the catalog. For example, some measurable characteristics of
>> works may include:
>>
>>   * Big Name index (percentage of quotes from leading authorities)
>>   * color index (normalized percentage of color words used)
>>   * date written
>>   * grade level
>>   * Great Ideas index (percentage of philosophy ideas in text)
>>   * length in words
>>   * librarian rating
>>   * number of citations
>>   * number of editions
>>   * number of graphics
>>   * number of pictures
>>   * number of prizes won
>>   * number of times circulated
>>   * percentage of languages used in a text
>>   * percentage of mathematical formulas in a text
>>   * percentage of unique words in a text
>>   * price
>>   * publisher rating
>>   * readability score
>>   * reader rating
>>
>> Given imagination, I'm sure many more quantifiable characteristics could be
>> enumerated.
>>
>> Once done, these characteristics can be compared to one another, and they
>> can be used from two sides of the same problem. On one hand such
>> characteristics can be integrated into "discovery systems" (catalogs) to
>> assist the reader in identifying items for use. "I want a book that is
>> popular, contains a minimum of mathematical formulas, has many citations and
>> illustrations, but is not too difficult to read." On the other hand, a
>> person could identify an item not in a collection, feed the item to a system
>> for analysis, and return a list of characteristics about the item. "This
>> item is longer than most, has many citations, is expensive, has a low reader
>> rating, and is not very 'colorful'." Finally, some sort of graph or chart could
>> be drawn, literally illustrating the characteristics of a given work.
>>
>> Granted, none of this was feasible a decade ago since there was little full
>> text. Things are changing. Things are different now. Full text is becoming
>> the norm, and this opens up all sorts of possibilities. Somebody is going to
>> do this sort of work, if it isn't being investigated already. Libraries are
>> not about books. They are about what is inside the books. We need to be
>> providing tools enabling our constituents to use these insides lest the
>> profession become marginalized. Find is not as much of a problem to solve
>> as it used to be. People can find more than they need, and the amount of
>> effort needed to find more is past the point of diminishing returns.
>> Instead, use and understanding are the name of the game. Measurement is a
>> standard means to understanding. Quantification is a necessary element of
>> measurement.
>>
>> --
>> Eric Lease Morgan
>> "Take the Great Books Survey -- http://bit.ly/auPD9Q"
>>
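
A few of the simpler characteristics in Eric's list (length in words, 
percentage of unique words, color index) can be computed directly from 
full text. The following is a minimal sketch under stated assumptions: 
the function names and the short color-word list are hypothetical, 
tokenization is a crude regular expression, and the "color index" here 
is a plain percentage rather than whatever normalization Eric has in 
mind.

```python
import re

# Hypothetical, deliberately short list of color words (an assumption).
COLOR_WORDS = {"red", "orange", "yellow", "green", "blue", "purple", "black", "white"}

def tokenize(text):
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def text_metrics(text):
    """Return a few simple quantitative characteristics of a text."""
    words = tokenize(text)
    total = len(words)
    if total == 0:
        return {"length_in_words": 0, "percent_unique_words": 0.0, "color_index": 0.0}
    return {
        "length_in_words": total,
        "percent_unique_words": 100.0 * len(set(words)) / total,
        "color_index": 100.0 * sum(w in COLOR_WORDS for w in words) / total,
    }

metrics = text_metrics("The red barn and the white fence faced the blue river.")
print(metrics["length_in_words"])  # prints 11
```

Values like these could be stored alongside a catalog record and 
compared across works, which is the kind of use the message describes.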