Re: coyle/hillman article from dlib [mods]

From: Thomale, J <j.thomale_at_nyob>
Date: Thu, 18 Jan 2007 10:00:00 -0600
To: NGC4LIB_at_listserv.nd.edu
> > The point is that an opaque label like 246$4 is entirely explicit and
> > does not encourage guesswork.  There are pros and cons here, and when
> > I said "I'm not sure whether I buy this" I really didn't mean "I'm not
> > sure", i.e. it wasn't just a polite way of saying "I don't buy it" :-)
>
> To set the record straight, let's not assume I think MODS is the
> answer.  I prefer MODS to MARC, yes, but that's only because it's
> convenient and already there.
>
> However, wouldn't your argument run into the same problem as any other
> token?  The interpretation of field definition sounds, to me, like a
> wholly different problem.  Not one that is solved by neutering
> semantics in the field name.

I think what Mike's getting at is that a code like "245$a" does not
carry the semantic baggage that "Title," for example, does. So, somebody
who is unfamiliar with MARC would be forced to look up the definition of
the 245$a without having any preconceived idea about what data should go
in that field. On the other hand, the notion of "Title" might vary from
person to person. Somebody could look at the "Title" element of a
metadata schema and automatically conjure up his/her own definition of
that element without even bothering to look up the actual definition.
Similarly, it would be tempting to think that a "Title" in one data
schema automatically equals the "Title" of another data schema, but the
definitions might be slightly different.
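
To make that concrete, here's a quick, entirely made-up Python sketch
(the field names and values are hypothetical, not drawn from any real
mapping): two schemas can both expose a "title" label, define it
differently, and quietly break any comparison that trusts the shared
label.

    # Toy illustration: both schemas use the label "title", but one means
    # only the title proper (roughly MARC 245$a) and the other means the
    # whole title/responsibility statement.
    record_a = {"title": "Moby Dick"}
    record_b = {"title": "Moby Dick, or, The whale / Herman Melville"}

    def titles_match(a, b):
        # Naive comparison that trusts the shared label -- exactly the
        # kind of assumption a semantic label invites.
        return a["title"] == b["title"]

    print(titles_match(record_a, record_b))  # False, though both "have a title"

An opaque code like 245$a at least forces you to go look up what the
field actually contains before you write that comparison.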

Semantic data field labels can encourage those kinds of assumptions,
which could lead to imprecise data.

Now, you could say that this entire discussion isn't important--they're
just "tokens," as you say, and why does it matter what they are? But,
really, I think the reason that some of us are getting caught up in this
is that it touches on some issues that *are* important.

What it all comes down to is purpose. Any type of data abstraction,
whether it's a data model, a data schema, or a data format, must be
constructed with a clear purpose in mind. MARC has become multi-purpose,
and the arguments in this thread illustrate different understandings of
those purposes. On the one hand, MARC
supports detailed inventory control--it's an internal, backend type of
data schema, if you will. For this, data precision and detail are
important because you're relying on your data to essentially fuel your
system and keep track of what you have for budgetary and other purposes.
From this perspective, a coded data schema also makes sense because that
data doesn't necessarily need to be shared beyond the walls of the
library, and you can easily train your staff to understand the codes.

On the other hand, MARC also supports information discovery and
retrieval. Here's where things become a little bit stickier, but as long
as we're still looking at MARC as an internal, backend data schema for a
single institution, then codes are still okay. The codes can be
translated into human-readable labels. The problem becomes the very
precision and high level of detail that those codes support. The act of
information discovery is inherently imprecise. If you're looking for
"Title" in a MARC record, that "Title" information is sprinkled
throughout the record, and the actual title that one is seeking might be
in an unexpected place in the record (i.e., not the 245$a). When you
start thinking about information discovery and retrieval, level of
precision becomes (or *should* become) a factor of user needs, which
further screws with our MARC data schema. MARC's level of precision
might be necessary for the researcher who actually *is* looking for that
specific type of title, but completely inappropriate for Joe User who is
trying to find the latest Grisham novel. MARC, as a data model, is not
optimal for this and has no systematic way to record data at varying
levels of precision for this kind of purpose.
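
For what it's worth, here's a rough sketch of what a discovery index
ends up doing with that sprinkled title data (the record structure is a
simplified stand-in, not real MARC, and the list of tags is illustrative
rather than exhaustive):

    # Several distinct MARC fields are all "title-ish" for discovery
    # purposes, so an index typically collapses that precision.
    TITLE_ISH_TAGS = ["130", "240", "245", "246", "740"]  # illustrative only

    record = {
        "245": [{"a": "The pelican brief :", "b": "a novel /"}],
        "246": [{"a": "Pelican brief"}],
    }

    def title_index_terms(rec):
        """Flatten every title-like subfield into one searchable list."""
        terms = []
        for tag in TITLE_ISH_TAGS:
            for field in rec.get(tag, []):
                terms.extend(field.values())
        return terms

    print(title_index_terms(record))
    # ['The pelican brief :', 'a novel /', 'Pelican brief']

Joe User finds his Grisham novel either way; the researcher who needed
to know which of those strings was the uniform title is out of luck once
everything lands in one "title" bucket.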

Finally, it's when you start looking at MARC as a format for data
exchange and interoperability, especially outside of the library
context, that the "coded vs. semantic labels" debate gains a little bit
more importance. From this perspective, MARC's purely opaque nature
makes it prohibitively difficult for anybody outside of a library to bother with,
unless there's substantial motivation to do so. The beauty of data
schemas with simple semantic labels (such as Dublin Core) is that
anybody can look at a simple Dublin Core record and get the gist of it
almost immediately. Yes, "getting the gist of" is imprecise. Yes, Dublin
Core is semantically ambiguous, but it assumes that most people have an
idea of what "Title," "Creator," etc. mean. Sure, if you're going to
work with Dublin Core, you really need to look at the documentation and
find out the precise definitions of those data elements. And, of course,
many have documented the problems with an approach that is perhaps as
imprecise as MARC is precise. BUT--it's still an open question as to
what level of precision is "good enough" for information
discovery/retrieval purposes, which is what Dublin Core is supposed to
facilitate across institutions and collections.
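
Just to make the contrast visible, here's roughly what a simple Dublin
Core-ish record looks like (the values are invented): readable cold,
with every one of those precision questions left open.

    # A minimal simple-Dublin-Core-style record (invented values).
    # The element names are self-describing enough that a non-librarian
    # can read it cold -- at the cost of the precision a MARC record carries.
    dc_record = {
        "title":   "The pelican brief",
        "creator": "Grisham, John",
        "date":    "1992",
        "type":    "Text",
        "subject": "Legal stories",
    }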

So--without beating this horse to death--it really all comes down to a
question of purpose and priority. I don't think a single, monolithic
data format (a la MARC) is ideal for supporting all of these purposes,
especially when each individual purpose has a different level of
priority at each library. How do we even approach that issue?

Jason Thomale
Metadata Librarian
Texas Tech University Libraries
Received on Thu Jan 18 2007 - 10:13:35 EST