Re: MARC structure (Was: Re: Ceci n'est pas un catalogue)

From: Alexander Johannesen <alexander.johannesen_at_nyob>
Date: Tue, 28 Aug 2007 10:26:02 +1000
To: NGC4LIB_at_listserv.nd.edu
Hi,

Ok, several people seem to think it's fair to compare MARC21 with
XML as a delivery format for metadata, so I think it's time to
explain just exactly what we're missing out on by clinging to the
MARC dinosaur. But first; MARC is a great piece of engineering and
thinking. It's a format that has survived for almost 30 years, and
can still handle a great deal of stuff. But it also fails miserably
where we now need it the most, such as modeling and data validation.

First, let's tackle that whole English element name thing. It's
rubbish. The most important point here is that these tokens are *only*
a human problem, not a technical one. Simply put, tokens are just
tokens to a computer, and it makes no difference whether the token is
<a> or <title> or <245>, as long as the standard explains what the
semantic meaning is (which is, still, human interpretation in computer
programs). This is indeed how MARC21 works; you take the number, and
look it up in the (still English) manual. Some seem to think that
using English as the lingua franca is somehow doing us a disservice,
but I think you're forgetting that a) most IT fundamentals are in
English, even programming languages (so why mix up idioms of
programming, artificially separating semantics, logic and workflow by
language?), and b) experts coming together *need* to speak the same
language (so at least we need to choose *one*).

I can point to a couple of programming languages such as Java, PHP,
Perl and Ruby where a statement like "foreach ( $object as $iterator )
{ ... }" is roughly understood across them all (apart from the $
indicating variables in PHP and Perl), and certainly understood by
programmers across those languages. This is the lingua franca of
programming languages, and there are good reasons for it. And no, it's
not that their inventors were English (PHP was created by a
Danish/Greenlandic programmer, Ruby by a Japanese programmer, for
example) but because a shared human language creates a shared
understanding of a problem space, of logic and of solutions.
Mathematicians use their elaborate notation, programmers use Proglish.
We currently use field and sub-field numbers.

This extends quite strongly to tokens indicating what a piece of
metadata is ; cross-pollination of knowledge across cultures rarely
happens through tokens that need to be learned by rote. Hieroglyphs
are a good example of this; depictions of a narrative without a spoken
language, now dead. Root your tokens in a common understanding (in
this case English) if you want better adoption of your tokens. I've
been a library developer for close to 4 years now, and I still can't
tell you the field number for <author>, but I've just given you a good
guess at its element name. And before we go on with "English folks are
taking over the world": English is my third language ; I'm Norwegian.

Okay, let's move on to our starting point, MARC21 XML ;

> <record>
>     <datafield tag="245" ind1="1" ind2="0">
>       <subfield code="a">[Interview with Keith McCance]</subfield>
>       <subfield code="h">[sound recording] /</subfield>
>       <subfield code="c">[Interviewer : Bronwyn Benn].</subfield>
>     </datafield>
> </record>

This is a really poor use of XML. One can understand why the
initiators did this, though, because doing things right takes a *lot*
of effort that the library world wasn't (isn't?) prepared to take
on, but it's a good example of XML done badly. It even invites poorly
informed comparisons to XML alternatives, with people saying it
contains so much meta-meta, so much formatting overhead. Of course it
does; it's bad XML.
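That said, even this flat MARC XML is machine-tractable the moment it
is XML at all. A minimal sketch in Python, using only the standard
library's xml.etree (and leaving out the MARC XML namespace a real
record would carry), pulls out the 245$a:

```python
import xml.etree.ElementTree as ET

# The MARC XML record quoted above, inlined as a string for the sketch.
record = """<record>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">[Interview with Keith McCance]</subfield>
      <subfield code="h">[sound recording] /</subfield>
      <subfield code="c">[Interviewer : Bronwyn Benn].</subfield>
    </datafield>
</record>"""

root = ET.fromstring(record)
# ElementTree's XPath subset: the 'a' subfield of the 245 datafield.
title = root.find("datafield[@tag='245']/subfield[@code='a']").text
print(title)  # -> [Interview with Keith McCance]
```

Note that even here the "245" and "a" still have to come from the
(English) MARC manual; the XML only gives us the plumbing.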

Let's look to MODS for a bit. MODS is an improvement on MARC XML; it
tries to come up with better element names for some of the things
MARC21 has got, albeit not everything. MODS (at least in the latest
versions) has got something right ; it has a schema (although XML
Schema is, in my eyes, a poor but understandable choice). The schema
is used rather structurally, which is one step forward without too
much gain. Let's dig in ;

   <mama>
      <child />
   </mama>

We can create a schema that controls what elements can contain what
elements and attributes, so we can say that <mama> can only contain
<child> elements, so if we did ;

   <mama>
      <mama>
         <child />
      </mama>
   </mama>

we would get a validation error. This validation error could exist
anywhere in the chain of our systems, from the originating systems
export facilities to whomever is trying to import it. You can trap it,
and see if you can fix it, use it, or if you have to discard it. It's
an indicator of structure.

But wait, there's more. With a bit more schema work you can make it
data-aware as well, stating that <mama> elements can only contain
<child> elements if certain structural or content conditions apply.
The best way to illustrate this is with controlled lists ;

   <languageTerm authority="iso639-2b" type="code">eng</languageTerm>

In your schema you can make sure the attribute @authority contains
data that is approved, or even that the element content itself belongs
to the language selected (for example, if you put in 'bollocks' as the
language code, the validator will bark at you), and this is helpful at
the cataloging end of things. You can check for valid or illegal
characters, apply if-then-else clauses to elements and data, and so
forth. By extending into schema work, as an international cooperation,
all of this that's now handled by various human (as in, non-modeled)
standards, we can putter with each part of it and let schema-aware
software pull things together. Through this we don't need to create
software that in itself understands whatever the Culture of MARC comes
up with; we only need to create software that understands XML. And I
assert that finding good people who understand XML is easier (and
smarter) than finding good people who understand the full stack of the
Culture of MARC.
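To make that concrete, here is a sketch (again illustrative XSD, not
the actual MODS schema) that restricts @authority to an approved list
and forces the element content into the shape of an ISO 639-2 code:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="languageTerm">
    <xs:complexType>
      <xs:simpleContent>
        <!-- the element's own text must match languageCodeType below -->
        <xs:extension base="languageCodeType">
          <xs:attribute name="type" type="xs:string"/>
          <xs:attribute name="authority">
            <xs:simpleType>
              <!-- only approved authority lists pass validation -->
              <xs:restriction base="xs:string">
                <xs:enumeration value="iso639-2b"/>
                <xs:enumeration value="rfc3066"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>

  <xs:simpleType name="languageCodeType">
    <xs:restriction base="xs:string">
      <!-- three lowercase letters: 'eng' passes, 'bollocks' barks -->
      <xs:pattern value="[a-z]{3}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

Cross-checking that the code really appears in the *selected*
authority's list is beyond what plain XML Schema 1.0 expresses; that
co-occurrence checking is where Schematron earns its keep.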

Of course, the above MODS example isn't the best use of XML
technology, and it's rooted in the culture of MARC. It's trying to say
that the resource has a particular language, but XML offers this sort
of thing out of the box. For example ;

<titleInfo xml:lang="en">
   <title>Sound and fury :</title>
   <subTitle>the making of the punditocracy /</subTitle>
</titleInfo>

or even ;

   <originInfo xml:lang="no-bm">

uses the xml:lang attribute to signify language according to [IETF RFC
3066]. This is not an add-on; this is part of the XML standard. Add to
that out-of-the-box support for any kind of character encoding
(including UTF-8 and UTF-16).

Let's talk about modeling, since it's dear to my heart. In MARC there
is no such concept, which causes great pain these days as we try to
merge and reuse huge amounts of MARC data. FRBR is one such model that
is gaining popularity (if only a philosophical one; implementations
are still far off), but it's a model that does not fit into MARC. Some
might say it isn't supposed to go there, but I think that's a bit
narrow-minded.

Let's try a bit more clever XML ;

   <work id="some_book" xml:lang="no">
      <title>Fiskepudding</title>
   </work>

   <work id="some_other_book" xml:lang="en">
      <title>Fish cakes</title>
   </work>

   <expression>
      <manifestation idref="some_book" />
      <manifestation idref="some_other_book" />
   </expression>

For any software that understands the <expression> elements, the @id
and @idref attributes are references to each other, and through this
you can create any elaborate model you like (this thing refers to this
other thing, and so forth). This principle sits behind extended XML
formats for semantic modeling such as RDF and Topic Maps, and it comes
out of the box with XML. Chuck schema work on top (especially RELAX NG
and/or Schematron for some seriously powerful typing and data-content
handling), and this whole discussion of what's wrong with MARC would
be over.
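Resolving those references takes only a handful of lines. A sketch in
Python's standard library, reusing the invented <work> / <expression>
markup from above (the <records> wrapper is mine, added so the
fragments parse as one document):

```python
import xml.etree.ElementTree as ET

doc = """<records>
   <work id="some_book" xml:lang="no">
      <title>Fiskepudding</title>
   </work>
   <work id="some_other_book" xml:lang="en">
      <title>Fish cakes</title>
   </work>
   <expression>
      <manifestation idref="some_book" />
      <manifestation idref="some_other_book" />
   </expression>
</records>"""

root = ET.fromstring(doc)

# Index every element carrying an @id, then chase each @idref.
by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
for expr in root.iter("expression"):
    titles = [by_id[m.get("idref")].findtext("title")
              for m in expr.iter("manifestation")]
    print(titles)  # -> ['Fiskepudding', 'Fish cakes']
```

The same id/idref chasing generalizes to any graph of records, which
is exactly what RDF and Topic Maps build on.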

Again, this is all about XML, and even though some have said that most
systems today are denormalized data models in RDBMS systems, I suspect
there are still lots of MARC21-specific limitations to how data can be
handled by these systems. I'd be thrilled to be shown otherwise, so
feel free.

Anyways, another $0.02 added.


Regards,

Alexander
--
 ---------------------------------------------------------------------------
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
------------------------------------------ http://shelter.nu/blog/ --------
Received on Mon Aug 27 2007 - 20:26:02 EDT