Re: our profession's bibliographic information

From: Karen Coyle <lists_at_nyob>
Date: Tue, 21 Dec 2010 06:52:48 -0800
To: NGC4LIB_at_LISTSERV.ND.EDU

The Open Library does a timeline for all subjects, also using Solr  
(and the code is open source):

http://openlibrary.org/subjects/love
http://openlibrary.org/subjects/communism

However, publishing increased at such a great rate over the last 50  
years that nearly every topic has that peak around modern times.

What Eric called "qualitative" data I call "text", I think. Even the  
page numbers are buried in text. The only "data" we have is in the  
fixed fields, and as we know there are many people who feel that those  
are not an important part of the cataloging process. (I bet they are a  
real pain to code, as well.) If we could provide more data, less text,  
what would that look like?

I can see:

1. identifiers for names and subjects, so that changes in display  
(cookery to cooking) would not change the identity
2. data for pagination, sizes (although I can only think of a few  
minor uses for this beyond record matching)
3. have all identifying numbers treated as data (no more ISBN followed  
by "(paperback)" in the same string)
4. identifiers for place of publication and publisher (we've had this  
argument on the RDA-L list)
5. I'd also like to see clear coding of transcribed v. non-transcribed  
text strings (so it would be clearer which fields could be used in  
matching)
6. coding of relationships between bibliographic items (translation  
of, adaptation of)
7. coding of relationships between persons and corporate bodies and  
the bibliographic item being described

That's what jumps to my mind, but I'm sure you all can fill in others.
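
To make items 1 and 3 a bit more concrete, here is a rough sketch in Python of what "data, not text" might look like. It is only an illustration: the class names, the "subj:0001" identifier, and the parse_isbn helper are all made up for this example and do not come from MARC, RDA, or any existing system.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Subject:
    identifier: str  # hypothetical stable identifier, never shown to users
    label: str       # display form, free to change over time


@dataclass(frozen=True)
class Isbn:
    value: str                # the number itself, hyphens removed
    qualifier: Optional[str]  # e.g. "paperback", held as separate data


# A run of digits, hyphens, or X, optionally followed by "(some qualifier)".
ISBN_PATTERN = re.compile(r"^\s*([0-9Xx-]+)\s*(?:\(([^)]*)\))?\s*$")


def parse_isbn(text: str) -> Isbn:
    """Split a legacy string such as '0-306-40615-2 (paperback)' into data.

    This only separates the number from its qualifier; it does not
    validate the check digit.
    """
    match = ISBN_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not an ISBN string: {text!r}")
    value = match.group(1).replace("-", "").upper()
    qualifier = match.group(2)
    return Isbn(value=value, qualifier=qualifier.strip() if qualifier else None)


def same_subject(a: Subject, b: Subject) -> bool:
    """Identity rests on the identifier, so a label change does not matter."""
    return a.identifier == b.identifier


if __name__ == "__main__":
    old = Subject(identifier="subj:0001", label="Cookery")
    new = Subject(identifier="subj:0001", label="Cooking")
    print(same_subject(old, new))
    print(parse_isbn("0-306-40615-2 (paperback)"))
```

The point of the sketch is only that matching would compare identifier and value, while label and qualifier remain available for display.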

kc

Quoting Weinheimer Jim <j.weinheimer_at_AUR.EDU>:

> There are a few projects dealing with this. First, there is simply  
> Google, which has the option in the left-hand menu of plotting any  
> search to a timeline, e.g. search for "wisdom":
> http://www.google.com/search?hl=en&hs=7ZI&tbo=1&tbs=tl%3A1&q=wisdom&aq=f&aqi=g10&aql=&oq=&gs_rfai=
> How this is generated, I have absolutely no idea, but just glancing  
> at it, it looks as if the word "wisdom" was widely used around  
> 200BC, in 0AD it stopped being used until about 50AD; it went  
> through sporadic use until around 900AD when it became popular  
> again, and then with the rise of printing, its use went up more or  
> less steadily.
>
> Does anybody really believe that?!
>
> There is also the Corpus of Historical American English (COHA) at  
> http://corpus.byu.edu/coha/, which has many more controls. They have  
> an interesting comparison with Google's Ngram tool at:  
> http://corpus.byu.edu/coha/compare-culturomics.asp.
>
> And of course, there are the notable OCR problems, discovered and  
> blogged simultaneously by many people (including myself!) who  
> apparently think alike.  
> http://searchengineland.com/when-ocr-goes-bad-googles-ngram-viewer-the-f-word-59181 is one  
> example.
>
> I mentioned my own amazement to find this "specific word" in the  
> book "The Act of Tonnage and Poundage, and Rates of Merchandize"  
> from 1702, where I found the exact usage:  
> http://books.google.com/books?id=Zjk7AAAAcAAJ&pg=PA201&dq=%22fuck%22&hl=en&ei=ilALTbPpIo72sgb8h63jDA&sa=X&oi=book_result&ct=result&resnum=3&ved=0CC8Q6AEwAjgK#v=onepage&q=%22fuck%22&f=false in the  
> sentence:
> "Every Merchant making an Entry of Goods, either Inwards or Outwards  
> shall be dispatched in such Order as he cometh;..." and it misread  
> the old spelling of "such". So, not only did it mistake the medial s  
> for an f, it also misread the h as a k.
>
> The poor author must be spinning in his grave! It appears that  
> Google's OCR tool is more similar to many human beings than I had  
> suspected: both have filthy minds! :-)
>
> Of course, this is far from the only OCR problem. To be fair, this  
> sort of "data mining" is in its very earliest stages, so it is easy  
> to point out problems. It will take time, plus trial and error, to  
> discover if these techniques lead to anything of value.
>
> We are in a time of experimentation.
>
> James Weinheimer  j.weinheimer_at_aur.edu
> Director of Library and Information Services
> The American University of Rome
> via Pietro Roselli, 4
> 00153 Rome, Italy
> voice- 011 39 06 58330919 ext. 258
> fax-011 39 06 58330992
> First Thus: http://catalogingmatters.blogspot.com/
> Cooperative Cataloging Rules:  
> http://sites.google.com/site/opencatalogingrules/
>

-- 
Karen Coyle
kcoyle@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet