Re: Google Magicians?

From: McGrath, Kelley C. <kmcgrath_at_nyob>
Date: Mon, 21 Sep 2009 16:14:05 -0400
To: NGC4LIB_at_LISTSERV.ND.EDU
Jason Thomale already said, more eloquently than I would have, a lot of the things that I've thought about this thread, and much more. I agree that it does seem like not everyone is talking about the same thing and I'm not sure anyone is denying that there is a lot of valuable and high quality metadata locked up in MARC records. I very much agree that the problem of dealing with a relatively small set of reasonably homogeneous records is not the same as dealing with an enormous number of records created by different catalogers, following different local practices, and using different rules from different times and places. Not to mention the sheer number of mistakes in a large record set.

It seems to me that MARC often gets painted as the villain, but I think the problem for programmers is mostly not MARC per se, although there are cases that could be blamed on MARC, like the lack of granularity for identifying given and family names in 100/600/700 fields. MARC has its shortcomings: it is not as expansive as one might like (what do you do when you run out of field numbers or subfield letters?), and it doesn't do some things well or at all, like relating different pieces of data. But I don't think the MARC format is what the Google magicians/programmers can't overcome. Certainly records could be exported to MARCXML, and presumably someone could come up with a standard way to convert that to a format with more human-friendly field names. Programmers have to know something about the idiosyncrasies of the data, but they'd have to know that however the data was labeled--as was pointed out in the ONIX post.
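
To make the granularity point concrete, here is a minimal sketch of the kind of guessing a programmer has to do to split personal names in 100/600/700 fields into family and given names. It assumes the pymarc library and a hypothetical file of binary MARC records (records.mrc); the split is a heuristic based on the first indicator and a comma, because MARC itself records only an undifferentiated $a.

    # Heuristic sketch only: MARC does not tag family vs. given names,
    # so all a program can do is guess from indicator 1 and punctuation.
    from pymarc import MARCReader

    def split_personal_name(field):
        """Guess (family, given) from a 100/600/700 field."""
        subs = field.get_subfields('a')
        name = subs[0].rstrip(',. ') if subs else ''
        if field.indicator1 == '1' and ',' in name:
            # Indicator 1 = surname entry: "Smith, John" -> ("Smith", "John")
            family, _, given = name.partition(',')
            return family.strip(), given.strip()
        # Indicator 0 (forename) or 3 (family name): no reliable split
        return name, ''

    with open('records.mrc', 'rb') as fh:   # hypothetical sample file
        for record in MARCReader(fh):
            for field in record.get_fields('100', '600', '700'):
                print(split_personal_name(field))

Even that only works for headings that actually follow the surname-comma-forename pattern; anything else needs human judgment.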

What if Google did put all these authorized names or LCSH in their Google Books records (and it does seem nutty to me that they truncated that information)? That might be okay for display, but is it really going to work for collocation? Who is going to do authority control on data at this scale, coming from all these different libraries? Perhaps the Virtual International Authority File (VIAF) could help with names, but even so, it's not a trivial undertaking, and it would have to be an ongoing one.
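
For what it's worth, here is a very rough sketch of what checking a single heading against VIAF might look like. The AutoSuggest endpoint and the shape of its JSON response are my assumptions about VIAF's public interface, not something from this thread; the real difficulty is running this kind of reconciliation, with disambiguation, over millions of headings and keeping it current.

    # Rough sketch; the VIAF AutoSuggest URL and JSON keys below are
    # assumptions about VIAF's public interface.
    import json
    import urllib.parse
    import urllib.request

    def viaf_suggestions(name):
        url = ('https://viaf.org/viaf/AutoSuggest?query='
               + urllib.parse.quote(name))
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        # Assumed response shape: {"result": [{"term": ..., "viafid": ...}, ...]}
        return data.get('result') or []

    for hit in viaf_suggestions('Austen, Jane'):
        print(hit.get('viafid'), hit.get('term'))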

I also think that the point that "... those who have cataloging/bibliographic knowledge lack computing knowledge/server space. Those who have computing knowledge/server space probably lack cataloging/bibliographic knowledge," while it describes a legitimate problem, is not the root issue (although I would very much like to see more cooperation between people with these two types of knowledge and resources, and I do think we need more people who have backgrounds in both).

Lynne Bisko and I wrote an article for the Code4Lib Journal on our experiences as part of an OLAC (Online Audiovisual Catalogers) task force trying to pull out just five pieces of information about moving images. See journal.code4lib.org/articles/775. We're both experienced catalogers. Lynne has a computer science background, and although my limited programming skills are neither terribly efficient nor elegant, I don't think it was lack of computer background or equipment that limited us. We had to use a number of convoluted, parallel approaches, and we were unable to extract through an automated process all the data that we wanted, even when that data would have been obvious to a trained eye actually looking at the record. This, I think, is the fundamental problem--the gap between what a person who knows what they're looking at gets out of a record and what a machine can be trained to identify (and even some of what a machine *can* be trained to do is still pretty time-consuming and requires a lot of iterations and maybe manual review of outliers). The problem of making our data more reusable by machines is not resolved just by putting it into XML, but requires rethinking *how* we store our data.
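
As a toy illustration (not our task force's actual code) of why those approaches get convoluted: even something as simple as a date can live in several different MARC fields in different forms, so a program ends up collecting parallel guesses and kicking the disagreements out for manual review. This assumes pymarc and a hypothetical file of binary MARC records.

    # Toy illustration of parallel extraction: gather year guesses from
    # several places a date might appear and flag records where the
    # guesses don't agree (or are missing) for human review.
    import re
    from pymarc import MARCReader

    YEAR = re.compile(r'\b(1[89]\d\d|20\d\d)\b')

    def candidate_years(record):
        guesses = set()
        for f008 in record.get_fields('008'):
            if len(f008.data) >= 11:
                guesses.add(f008.data[7:11])           # Date 1 fixed field
        for field in record.get_fields('260', '264'):
            for sub in field.get_subfields('c'):       # publication date
                guesses.update(YEAR.findall(sub))
        for field in record.get_fields('500', '518'):  # free-text notes
            for sub in field.get_subfields('a'):
                guesses.update(YEAR.findall(sub))
        return {g for g in guesses if YEAR.fullmatch(g)}

    with open('records.mrc', 'rb') as fh:              # hypothetical file
        for record in MARCReader(fh):
            years = candidate_years(record)
            if len(years) != 1:
                print('needs human review:', sorted(years))

And even when the guesses do agree, they may all be the wrong kind of date for the purpose at hand (say, the date of the DVD rather than of the original release), which is exactly the sort of thing a trained eye catches immediately.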

Kelley McGrath
kmcgrath_at_bsu.edu
Received on Mon Sep 21 2009 - 16:15:51 EDT