Re: Google Magicians?

From: Thomale, J <j.thomale_at_nyob>
Date: Mon, 21 Sep 2009 13:58:20 -0500
To: NGC4LIB_at_LISTSERV.ND.EDU
> >> All I can
> >> say is the only people I know that think "it should be easy to get
> >> whatever data you want out of library MARC" are people who aren't
> >> programmers who have tried.
> 
> I'm not much of a programmer, but using the open-source Perl module,
> developed by REAL programmers (really GOOD programmers, I would add),
> I've managed to pull out pretty much everything I needed. On the rare
> occasions when we needed and were able to hire a real programmer, the
> results were excellent.

A large part of my job is doing just this--extracting metadata from MARC to use in our digital collections.

First, I have to agree with Jonathan. Second, Jane--I think you and Jonathan are actually talking about two very different things. 

In my job, my experiences with MARC have been on a small scale. For any given digital collection, there's a smallish, relatively homogeneous set of MARC records that I have to work with. So I'm working with just one data source--say, a set of records that a handful of people have cataloged locally. When I write a program to extract the data, I can look at the records, see basically what subset of the MARC/AACR2 combo was used for those records, and tailor my routines to match what's in the records. I can hardcode certain things into a script for a given collection because I know that those things aren't going to be variable *for this particular small subset of MARC records.*
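To make that concrete, here's roughly the kind of script I mean--a bare-bones sketch in Python using the pymarc library (more or less the Python counterpart of the Perl module Jane mentions). The file name, the field choices, and the assumption that 650 only ever carries a $a are stand-ins for whatever a particular local collection actually looks like, not a copy of anything I actually run:

    from pymarc import MARCReader

    def extract(path):
        """Pull a few hardcoded fields out of a small, known set of records."""
        rows = []
        with open(path, 'rb') as fh:
            for record in MARCReader(fh):
                row = {}
                # 245 $a/$b: title, trimming the trailing ISBD punctuation
                for field in record.get_fields('245'):
                    row['title'] = ' '.join(field.get_subfields('a', 'b')).strip(' /:;')
                # 100 $a: main entry personal name, taken as-is for now
                for field in record.get_fields('100'):
                    names = field.get_subfields('a')
                    row['creator'] = names[0] if names else ''
                # 650 $a: topical subjects -- safe only because I know this
                # particular set never puts anything I need in other subfields
                row['subjects'] = [sf for f in record.get_fields('650')
                                   for sf in f.get_subfields('a')]
                rows.append(row)
        return rows

    if __name__ == '__main__':
        for row in extract('local_collection.mrc'):
            print(row)

Every one of those hardcoded choices is only safe because I've already eyeballed the records it will run against.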

It certainly isn't rocket science, and eventually I end up being able to get the data that I need. I'm just guessing, but I think this is probably more along the lines of your experiences, Jane.

With that said--even when I'm working with a homogeneous set of records, I still run into problems like the following. Even as granularly defined as the MARC standard is, it still relies on AACR2 to delineate additional semantically meaningful data elements (personal names are one obvious example). When processing MARC data, you can easily tell a script to extract the 100$a. But then you still have to deal with the AACR2 embedded within the contents of that subfield to further extract meaningful data. The problem? To do this extraction automatically, you have to trust that the AACR2 is entered correctly and consistently. You're automating something based on hand-entered data. In my experience, even when working with records done by the same cataloger, 100% consistency is too much to ask. It doesn't happen. So you end up compensating by adding pattern-matching routines to your script to weed out the particular errors that you see *in the data you have at hand.* But when you work on a new batch of records for a different project from a different data source, those patterns that so neatly matched records from the previous source don't apply at all and you're back to square one.
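As an illustration of what those pattern-matching routines look like--this isn't lifted from any real script of mine, and the regex and the clean-ups are just stand-ins--parsing the AACR2 conventions inside a 100 $a might go something like this:

    import re

    # Matches inverted AACR2-style headings such as "Smith, John" or
    # "Smith, John, 1945-". (Dates properly belong in 100 $d, but in this
    # hypothetical batch they sometimes get jammed into the $a -- exactly
    # the kind of thing you only find out by looking at the records.)
    HEADING = re.compile(
        r'^(?P<surname>[^,]+),\s*(?P<forename>[^,\d]+?)'
        r'(?:,\s*(?P<dates>\d{3,4}\??-(?:\d{3,4}\??)?))?\.?$'
    )

    def parse_100a(value):
        value = value.strip()
        # Ad hoc clean-ups for errors seen in *this* batch; records from a
        # different source will need a different set.
        value = value.rstrip(' ,')              # stray trailing commas/spaces
        value = re.sub(r'\s{2,}', ' ', value)   # doubled internal spaces
        m = HEADING.match(value)
        if m is None:
            return None                         # flag for human review
        return {k: (v or '') for k, v in m.groupdict().items()}

    print(parse_100a('Smith, John, 1945-'))  # -> surname/forename/dates
    print(parse_100a('John Smith'))          # direct-order name -> None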

Even after that, I *still* find that I have to have a human check over the metadata that gets produced.

Now--I think what Jonathan probably has in mind when he talks about what Google wouldn't even try to tackle is this: writing a (more-or-less) universal MARC metadata extraction program that pulls complete, high-quality metadata from *any* MARC record out in the wild, no matter the data source. That's an enormous--I would even say impossible--task. The normal ways of dealing with metadata errors go out the window, because there's no way to predict the errors you're going to see--well, not in the traditional sense, anyway. You can't rely on anything as given--everything is variable. Theoretically, to write such a program, you'd have to turn the entirety of MARC cataloging rules into machine-actionable rules. On top of that, you'd have to turn the entirety of AACR2 rules into machine-actionable rules. On top of that, you'd have to figure out all historical MARC and AACR2 rules changes and turn those into machine-actionable rules. On top of that, you'd have to take into account regional differences and local cataloging practices and turn *those* into machine-actionable rules. And that *still* assumes that we're working with data that has been cataloged 100% perfectly according to all of the rules! Add in typos, inconsistencies, anything that doesn't follow the rules, and...well, you see the point.
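Just to make "machine-actionable" concrete with one tiny, made-up example: even a rule as small as "the year in 260 $c should agree with Date 1 in 008/07-10" takes code like the following--and it's one rule out of thousands, before you ever get to regional practice or typos.

    import re

    def dates_agree(field_008, subfield_260c):
        """One machine-actionable rule: 008/07-10 (Date 1) should match 260 $c."""
        date1 = field_008[7:11]                       # 008/07-10 is Date 1
        years = re.findall(r'\d{4}', subfield_260c)   # "c1998." -> ["1998"]
        if not date1.isdigit() or not years:
            return None    # can't decide: "uuuu", "[19--?]", and friends
        return date1 in years

    # Sample values are illustrative, not taken from real records.
    print(dates_agree('980612s1998    nyua          000 1 eng d', 'c1998.'))   # True
    print(dates_agree('980612s1998    nyua          000 1 eng d', '[1989?]'))  # False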

When you think about it, processing MARC like this is not much different than processing natural language. You've got extremely complex, nuanced rules that change regionally/temporally and are broken very often in practice. For a small, consistent data set, you can write a program that acts on the data reliably. But to do this for anything you might find in the wild is (at least given current technology) impossible.

So, MARC/AACR2 is like a language--a language in which catalogers happen to be fluent. Experienced catalogers are great at examining a MARC record, mentally processing all of its weird nuances, and telling you what it means. But the ability to do this for *any* given MARC record is very much an issue of interpretation. Computers are notoriously bad at interpretation. I think that's part of the fundamental weirdness between the catalogers who hammer out MARC records day after day and the programmers that try to deal with those records. [Some] catalogers praise the high quality of library metadata. [Some] programmers simultaneously bemoan the horrible quality of library metadata. The two groups, as a whole, look at "data quality" differently.

To take this thread back to its roots--that is, the Google Books metadata discussion:

Given the similarities of "all MARC data" to natural language, it would make sense for Google to use machine-learning/statistical-computation/NLP as a basis for handling their 100+ data sources. In fact, that's how I read Jon Orwant's comments on the Language Log blog. Especially:

"We have collected over a trillion individual metadata fields; when we use our computing grid to shake them around and decide which books exist in the world, we make billions of decisions and commit millions of mistakes."

So...at least for part of their process, they're deciding "which books exist in the world" based on "shaking around" all their metadata that they've put into one big pot? To me that just screams statistical clustering. It would also explain Orwant's point about "uncommonly strange but genuine metadata" (w/rt outlying publication dates). And it would also help explain why they're having Google employees do manual metadata entry (it would provide another data point to use in their analysis).
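Pure speculation on my part about what that "shaking around" might look like under the hood--Google hasn't said, and the similarity measure, the threshold, and the sample records below are all invented--but a toy version of clustering records from different sources into "books" might go something like this:

    import re
    from difflib import SequenceMatcher

    def key(rec):
        # Crude normalization: lowercase, strip punctuation, sort the tokens.
        tokens = re.findall(r'\w+', (rec['title'] + ' ' + rec['author']).lower())
        return ' '.join(sorted(tokens))

    def similar(a, b, threshold=0.85):
        return SequenceMatcher(None, key(a), key(b)).ratio() >= threshold

    def cluster(records):
        clusters = []    # each cluster = records judged to describe one book
        for rec in records:
            for group in clusters:
                if similar(rec, group[0]):
                    group.append(rec)
                    break
            else:
                clusters.append([rec])
        return clusters

    records = [
        {'source': 'A', 'title': 'Moby Dick, or, The whale', 'author': 'Melville, Herman'},
        {'source': 'B', 'title': 'Moby Dick or The Whale',   'author': 'Herman Melville'},
        {'source': 'C', 'title': 'Omoo',                     'author': 'Melville, Herman'},
    ]
    for group in cluster(records):
        print([r['source'] for r in group])    # -> ['A', 'B'] then ['C']

Every threshold and every normalization step in something like that is a decision, which is where "billions of decisions and millions of mistakes" starts to sound very literal.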

I may be pointing out the obvious--I don't know. Or I might be completely wrong. The reason I mention it is because I haven't seen/heard anyone else talk about this particular aspect of Orwant's comments. I don't know if nobody's talking about it because it's so obvious to them or because they don't see it. In a way, it does seem counterintuitive to what we believe about our metadata. Machine-learning is what you use on blocks of free text or other unstructured (or semi-structured) data. Right? We gave Google great, high-quality MARC data; why would they need to treat it that way? (See my previous comments about MARC's similarity to natural language for my answer to that.)

But, Orwant's comments as a whole make much more sense when you read them with the understanding that Google might be doing some combination of machine-learning and more traditional data-mapping with all of its source metadata. It also suggests that what Google is doing is a bit more complex than we realize, and it isn't a simple matter of, "Why isn't Google *using* the awesome MARC data that we gave them?" It isn't that they aren't using it--it's that they have a lot of conflicting data for every single book, and teaching a computer to reconcile the conflicting data to pull out the awesome chunks is a difficult and error-prone (clearly) process.

Jason Thomale
Metadata Librarian
Texas Tech University Libraries