Re: data vs "data structure"

From: Eric Lease Morgan <emorgan_at_nyob>
Date: Thu, 20 Sep 2007 22:39:15 -0400
To: NGC4LIB_at_listserv.nd.edu

On Sep 20, 2007, at 5:25 PM, Rinne, Nathan (ESC) wrote:

>> But we've had this discussion so many times before, we just
>> keep going round.
>
> Is there any way to make all of this more concrete?  If it is not
> the content of a typical MARC record that is outdated or has
> outlived its usefulness, but rather the "data structure"
> (container), what exactly does this mean?  How can we map it to
> make it easier to understand?

I will take a stab at answering this question.

Here in the United States, the current data structure of our catalogs
is MARC, and considering today's computing environment, MARC is a
very poor container. I will list at least three reasons why:

1. MARC has a number of arbitrary limitations - Metadata regarding
information resources can include things like titles, authors,
physical descriptions, all types of notes, added entries,
authorities, controlled vocabularies, pointers to locations, etc. The
total sum of this information, counted in bytes, should not have any
limitations. If I want to use a megabyte of data to describe a book,
then I should be allowed to do so. Unfortunately, by definition, a
MARC record can be no larger than 99,999 characters. This is true
because the first five characters of every MARC record are a
zero-padded integer defining the length of the record.  If the
first five characters of a MARC record are 00100, then the record is
100 bytes long. If the first five characters are 02100, then the
record is 2,100 bytes long. A mere 2K.
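
To make the length prefix concrete, here is a minimal Python sketch.
The function names and the MAX_RECORD_LENGTH constant are my own
inventions for illustration, not part of any MARC library. It reads
the declared length from the first five characters and shows why
nothing above 99,999 can be written there.

```python
MAX_RECORD_LENGTH = 99_999  # largest value five decimal digits can hold


def record_length(record: str) -> int:
    """Return the length declared in the first five characters."""
    prefix = record[:5]
    if len(prefix) != 5 or not (prefix.isascii() and prefix.isdigit()):
        raise ValueError(f"not a five-digit length prefix: {prefix!r}")
    return int(prefix)


def format_record_length(length: int) -> str:
    """Render a length as the five-character, zero-padded prefix."""
    if not 0 <= length <= MAX_RECORD_LENGTH:
        raise ValueError(f"{length} does not fit in five digits")
    return f"{length:05d}"
```

Calling record_length("02100...") gives 2100, and
format_record_length(100000) raises an error because there is simply
no room for a sixth digit.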

Moreover, each field is listed in the directory section of a MARC
record by a 12-character entry made up of a three-character tag, a
four-digit field length, and a five-digit starting position,
something like this: 245004500100. Like the leading characters of a
MARC record, the four digits following the tag represent the length
of the field, in this case the 245 field. Since the length is only
four digits long, the maximum length for any field is 9,999
characters. What do I do if I want to place an abstract in my MARC
record and it is 10,000 characters long? I can't do that because the
data structure won't accommodate it.
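
Here is a small Python sketch of a directory entry, assuming the
MARC 21 layout of a three-character tag, a four-digit field length,
and a five-digit starting position. The class and function names are
hypothetical, chosen only for this example.

```python
from typing import NamedTuple

MAX_FIELD_LENGTH = 9_999  # largest value four decimal digits can hold


class DirectoryEntry(NamedTuple):
    tag: str     # e.g. "245"
    length: int  # length of the field in characters
    start: int   # offset of the field within the record's data


def parse_directory_entry(entry: str) -> DirectoryEntry:
    """Split a 12-character directory entry into its three parts."""
    if len(entry) != 12:
        raise ValueError("a directory entry is exactly 12 characters")
    return DirectoryEntry(entry[0:3], int(entry[3:7]), int(entry[7:12]))


def build_directory_entry(tag: str, length: int, start: int) -> str:
    """Render a directory entry, refusing lengths that cannot fit."""
    if not 0 <= length <= MAX_FIELD_LENGTH:
        raise ValueError(f"field length {length} does not fit in four digits")
    return f"{tag}{length:04d}{start:05d}"
```

Trying build_directory_entry("520", 10000, 0) for a 10,000-character
abstract fails, which is exactly the limitation described above.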

2. Presentation information is mixed with MARC content - An easy-to-
use data structure should not include presentation elements because
it is unknown in what context the data will be given. Because
(bibliographic) MARC data includes an abundance of "syntactical sugar",
it is very difficult to parse. For example, field 020 is intended to
contain ISBN numbers but often times you might see it contain a value
such as "0804837635 (pbk.)". The text "(pbk.)" is not the ISBN number
but a value describing the item's format. Similarly, the title of a
book might be encoded as "The adventures of Huckleberry Finn /", but
the trailing slash is not really a part of the title. The slash is
used for humans to make the item easier to read -- presentation. The
same thing goes for author names such as "Kilgour, Fred (1914 -
2006)". The punctuation is full of "sugar" telling you that the first
string of characters is the last name, the second string of
characters is the first name, and the last string of characters
gives the birth/death years. While things like the (pbk.), /, commas,
etc. are important denotations of valuable metadata, a computer needs
to know too many rules in order to make sense of them. "If there are
dashes in between the integers of subfield a of a 020 field, then
ignore them, and if there are parentheses at the end of a 020 field,
then ignore that too. Everything else is the ISBN number."
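
The kind of special-case code this forces on every consumer looks
something like the following Python sketch. The function name is made
up, and the regular expression only covers the two rules quoted above
(dashes and a trailing parenthetical). Real 020 fields surely need
more rules than this.

```python
import re
from typing import Optional, Tuple

# ISBN characters (digits, X, dashes), then an optional "(...)" qualifier.
_ISBN_FIELD = re.compile(r"^\s*([0-9Xx-]+)\s*(?:\(([^)]*)\))?\s*$")


def split_isbn_field(value: str) -> Tuple[str, Optional[str]]:
    """Separate the ISBN from the parenthetical in an 020 $a value."""
    match = _ISBN_FIELD.match(value)
    if match is None:
        raise ValueError(f"unrecognized 020 value: {value!r}")
    isbn = match.group(1).replace("-", "")  # rule 1: ignore dashes
    qualifier = match.group(2)              # rule 2: set aside "(pbk.)"
    return isbn, qualifier
```

Given "0804837635 (pbk.)" this returns the ISBN and "pbk." as
separate values, but only because somebody wrote down the rules by
hand.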

To remove this problem, things like format (pbk.), first name, last
name, birth year, death year should be explicitly stated, not denoted
by punctuation that varies from data element to data element. By
doing so it is easy to present the information in any way desired
without having to do any special processing.
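
As a rough illustration of what "explicitly stated" buys you, here is
a Python sketch with each piece of a name held in its own field. The
class and method names are hypothetical, not taken from any existing
standard. Once the pieces are separate, each presentation is a
one-liner and no punctuation needs to be parsed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonName:
    family: str
    given: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None

    def inverted(self) -> str:
        """Catalog-style display: last name first."""
        return f"{self.family}, {self.given}"

    def direct(self) -> str:
        """Natural-order display: first name first."""
        return f"{self.given} {self.family}"

    def with_dates(self) -> str:
        """Inverted name followed by life dates, when any are known."""
        if self.birth_year is None and self.death_year is None:
            return self.inverted()
        birth = "" if self.birth_year is None else str(self.birth_year)
        death = "" if self.death_year is None else str(self.death_year)
        return f"{self.inverted()} ({birth}-{death})"
```

PersonName("Kilgour", "Fred", 1914, 2006) can then be shown as
"Fred Kilgour", "Kilgour, Fred", or with dates, all from the same
data.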

3. MARC is not the data structure of the rest of the world - IMHO,
this is the most important reason why MARC is a poor data structure. It
has nothing to do with logic, efficient use of disk space, nor
elegance. It has to do with communication. If you want to communicate
with other people, then you need to use a common language. In today's
computing environment, that language is XML. It is the language of
the Web. It is the language of publishers. It is the language of
blogs. It is the language of "mash-ups". It is the language of modern-
day search technologies (OpenSearch and SRU). It is the language of
business transactions. It is used to mark-up electronic texts (TEI
and DocBook). It is used to capture the content of archives (EAD). If
Microsoft had its way, it would be the language of word processors.

Libraries are about collecting, organizing, preserving, and
*disseminating* data, information, and knowledge. If you want to
disseminate content, then you need to disseminate it in a way that is
easily understandable. Increasingly computers are the transmitters
and receivers -- the middlemen -- of content. XML is easy to read and
write. All you need is a text editor. (I challenge anybody in this
forum to use only a text editor to read, write, and modify sets of
valid MARC records. Anybody!) Moreover, there are a multitude of
industries focusing on just creating and supporting tools for reading
and writing XML. At the same time, there is a shrinking number of
companies, let alone industries, that specialize in the
reading/writing of MARC.
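
To show how little tooling XML demands, here is a Python sketch that
reads a MARCXML-style record using nothing but the standard library.
The SAMPLE record is made up for illustration from the example values
in this message. It is not a real catalog record, and the helper name
is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARCXML records.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Made-up sample record, for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="020" ind1=" " ind2=" ">
    <subfield code="a">0804837635 (pbk.)</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="4">
    <subfield code="a">The adventures of Huckleberry Finn /</subfield>
  </datafield>
</record>"""


def subfields(record_xml: str, tag: str, code: str) -> list:
    """Return the text of every matching subfield in one record."""
    root = ET.fromstring(record_xml)
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [element.text or "" for element in root.findall(path, NS)]
```

Asking for subfields(SAMPLE, "245", "a") returns the title string. No
byte offsets, no directory, and the record itself can be edited in
any text editor.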

To see what our bibliographic data can look like in XML, try looking
at some MARCXML and MODS data. Here are some bibliographic examples
from my very simple library catalog made with MyLibrary (and, yes, it
is a catalog despite what some people have said):

   tagged - http://tinyurl.com/2edncu
   MARCXML - http://tinyurl.com/2xuofn
   MODS - http://tinyurl.com/ynkkux

The MODS example is the best implementation of the three, but it is
still not perfect. Last name first. First name last. Publisher listed
as "Tuttle Pub." when it really should be something like "Tuttle
Publishers". A call number such as "736/.982" when the Cutter number
should probably have its own field. Etc.

Finally, it is not the "what" that needs to change but the "how". The
content of MARC records is authoritative. Through controlled
vocabularies it brings together like items and provides a means for
discovering ideas -- a novel idea in a world of keyword searching.
The content of MARC records adds value to the items in our
collections. Unfortunately, we are writing this data in a dead or
dying language. Our message is great, but the way we are trying to
communicate it stinks. We might as well be trying to do algebra with
the use of Roman numerals.

Whew!

--
Eric Lease Morgan
University Libraries of Notre Dame