Re: data vs "data structure"

From: David Dorman <dorman_at_nyob>
Date: Fri, 21 Sep 2007 05:19:54 -0400
To: NGC4LIB_at_listserv.nd.edu

At 10:39 PM 09/20/2007, Eric Lease Morgan wrote:
>On Sep 20, 2007, at 5:25 PM, Rinne, Nathan (ESC) wrote:
>
>>>But we've had this discussion so many times before, we just
>>>keep going round.
>>
>>Is there any way to make all of this more concrete?  If it is not
>>the content of a typical MARC record that is outdated or has
>>outlived its usefulness, but rather the "data structure"
>>(container), what exactly does this mean?  How can we map it to
>>make it easier to understand?
>
>
>I will take a stab at answering this question.
>
>Here in the United States, the current data structure of our catalogs
>is MARC, and considering today's computing environment, MARC is a
>very poor container. I will list at least three reasons why:
>
>1. MARC has a number of arbitrary limitations - Metadata regarding
>information resources can include things like titles, authors,
>physical descriptions, all types of notes, added entries,
>authorities, controlled vocabularies, pointers to locations, etc. The
>total sum of this information, counted in bytes, should not have any
>limitations. If I want to use a megabyte of data to describe a book,
>then I should be allowed to do so. Unfortunately, by definition, a
>MARC record can be no larger than 99,999 characters. This is true
>because the first five characters of every MARC record are an
>integer, zero-padded on the left, defining the length of the record.  If the
>first five characters of a MARC record are 00100, then the record is
>100 bytes long. If the first five characters are 02100, then the
>record is 2,100 bytes long. A mere 2K.
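
To make the length prefix concrete, here is a minimal Python sketch of reading it. The function name `record_length` and the constant are hypothetical, chosen for illustration; this is not the API of any MARC library, and it only looks at the first five characters.

```python
def record_length(record: str) -> int:
    """Return the record length stored in the first five characters."""
    prefix = record[:5]
    # The prefix must be exactly five ASCII digits, e.g. "02100".
    if len(prefix) != 5 or not (prefix.isascii() and prefix.isdigit()):
        raise ValueError(f"bad length prefix: {prefix!r}")
    return int(prefix)


# Five decimal digits cannot express anything larger than this.
MAX_RECORD_LENGTH = 10**5 - 1

print(record_length("00100"))  # 100
print(record_length("02100"))  # 2100
print(MAX_RECORD_LENGTH)       # 99999
```

The ceiling comes from the width of the prefix, not from anything about the metadata itself.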
>
>Moreover, each entry in the directory section of a MARC record is a
>12-character string made up of a three-character tag, a four-digit
>field length, and a five-digit starting position, something like
>this: 245007400123. Like the leading characters of a MARC record,
>the four digits after the tag represent the length of the field, in
>this case the 245 field. Since that length is only four digits long,
>the maximum length for any field is 9,999 characters. What do I do
>if I want to place an abstract in my MARC record and it is 10,000
>characters long? I can't do that because the data structure won't
>accommodate it.
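
The field-length ceiling can be sketched the same way. This assumes the MARC 21 directory layout of a three-character tag, a four-digit field length, and a five-digit starting position. The entry "245007400123" and the names `DirectoryEntry` and `parse_directory_entry` are made up for illustration.

```python
from typing import NamedTuple


class DirectoryEntry(NamedTuple):
    tag: str     # three-character field tag, e.g. "245"
    length: int  # four-digit field length
    start: int   # five-digit starting position of the field data


def parse_directory_entry(entry: str) -> DirectoryEntry:
    """Split one 12-character directory entry into its three parts."""
    if len(entry) != 12 or not (entry.isascii() and entry[3:].isdigit()):
        raise ValueError(f"bad directory entry: {entry!r}")
    return DirectoryEntry(entry[:3], int(entry[3:7]), int(entry[7:12]))


# Four decimal digits cannot express a field longer than this.
MAX_FIELD_LENGTH = 10**4 - 1

print(parse_directory_entry("245007400123"))
# DirectoryEntry(tag='245', length=74, start=123)
print(MAX_FIELD_LENGTH)  # 9999
```

A 10,000-character abstract would need a fifth digit in the length slot, and the fixed-width entry has no room for one.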
>
>
>2. Presentation information is mixed with MARC content - An easy-to-
>use data structure should not include presentation elements because
>it is unknown in what context the data will be given. Because
>(bibliographic) MARC data includes an abundance of "syntactical
>sugar", it is very difficult to parse. For example, field 020 is
>intended to contain ISBN numbers, but often you might see a value
>such as "0804837635 (pbk.)". The text "(pbk.)" is not the ISBN number
>but a value describing the item's format. Similarly, the title of a
>book might be encoded as "The adventures of Huckleberry Finn /", but
>the trailing slash is not really a part of the title. The slash is
>used for humans to make the item easier to read -- presentation. The
>same thing goes for author names "Kilgour, Fred (1914 - 2006)". The
>punctuation is full of "sugar" telling you that the first string of
>characters is the last name, the second string of characters is the
>first name, and the last string of characters gives the birth/death
>years. While things like the (pbk.), /, commas, etc. are important
>denotations of valuable metadata, a computer needs to know too many
>rules in order to make sense of them. For example: "If there are
>dashes in between the integers of subfield a of a 020 field, then
>ignore them, and if there are parentheses at the end of a 020 field,
>then ignore that too. Everything else is the ISBN number."
>
>To remove this problem, things like format (pbk.), first name, last
>name, birth year, death year should be explicitly stated, not denoted
>by punctuation that varies from data element to data element. By
>doing so it is easy to present the information in any way desired
>without having to do any special processing.
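
The 020 rules quoted above can be sketched in a few lines of Python, which also shows how much has to be known in advance just to recover two values. The function name `split_isbn` and the regular expression are assumptions for illustration. Real 020 fields vary more than this handles.

```python
import re
from typing import Optional, Tuple

# Digits, an optional "X" check character, and dashes, optionally
# followed by a parenthesized qualifier such as "(pbk.)".
_ISBN_FIELD = re.compile(r"\s*([0-9Xx-]+)\s*(?:\((.*)\))?\s*$")


def split_isbn(subfield_a: str) -> Tuple[str, Optional[str]]:
    """Separate the ISBN from a trailing parenthesized qualifier."""
    match = _ISBN_FIELD.match(subfield_a)
    if match is None:
        raise ValueError(f"unrecognized 020 value: {subfield_a!r}")
    isbn = match.group(1).replace("-", "").upper()  # ignore the dashes
    qualifier = match.group(2)                      # None if absent
    return isbn, qualifier


print(split_isbn("0804837635 (pbk.)"))  # ('0804837635', 'pbk.')
print(split_isbn("0-8048-3763-5"))      # ('0804837635', None)
```

If the format were stored as its own explicit element, none of this pattern matching would be needed.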
>
>
>3. MARC is not the data structure of the rest of the world - IMHO,
>this is the most important reason why MARC is a poor data structure. It
>has nothing to do with logic, efficient use of disk space, nor
>elegance. It has to do with communication. If you want to communicate
>with other people, then you need to use a common language. In today's
>computing environment, that language is XML. It is the language of
>the Web. It is the language of publishers. It is the language of
>blogs. It is the language of "mash-ups". It is the language of modern-
>day search technologies (OpenSearch and SRU). It is the language of
>business transactions. It is used to mark-up electronic texts (TEI
>and DocBook). It is used to capture the content of archives (EAD). If
>Microsoft had its way, it would be the language of word processors.
>
>Libraries are about collecting, organizing, preserving, and
>*disseminating* data, information, and knowledge. If you want to
>disseminate content, then you need to disseminate it in a way that is
>easily understandable. Increasingly computers are the transmitters
>and receivers -- the middlemen -- of content. XML is easy to read and
>write. All you need is a text editor. (I challenge anybody in this
>forum to use only a text editor to read, write, and modify sets of
>valid MARC records. Anybody!) Moreover, there are a multitude of
>industries focusing on just creating and supporting tools for reading
>and writing XML. At the same time, there is a shrinking number of
>companies, let alone industries, that specialize in reading and
>writing MARC.
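
As a small illustration of the tooling point, a MARCXML-style record can be read with nothing but a standard-library XML parser. The record below is not a real catalog record. It is stitched together from strings quoted earlier in this message, and the helper name `subfields` is hypothetical. The namespace is the one MARCXML uses.

```python
import xml.etree.ElementTree as ET

# Illustrative record only: values are borrowed from the examples above.
MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="020" ind1=" " ind2=" ">
    <subfield code="a">0804837635 (pbk.)</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="4">
    <subfield code="a">The adventures of Huckleberry Finn /</subfield>
  </datafield>
</record>"""

NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def subfields(record: ET.Element, tag: str, code: str) -> list:
    """Return the text of every matching subfield in the record."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text for el in record.findall(path, NS)]


record = ET.fromstring(MARCXML)
print(subfields(record, "245", "a"))
# ['The adventures of Huckleberry Finn /']
```

Note that the trailing slash and "(pbk.)" survive the move to XML: changing the container makes the data easier to reach, but it does not remove the punctuation problem described in point 2.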
>
>To see what our bibliographic data can look like in XML, try looking
>at some MARCXML and MODS data. Here are some bibliographic examples
>from my very simple library catalog made with MyLibrary (and, yes, it
>is a catalog despite what some people have said):
>
>   tagged - http://tinyurl.com/2edncu
>   MARCXML - http://tinyurl.com/2xuofn
>   MODS - http://tinyurl.com/ynkkux
>
>The MODS example is the best implementation of the three, but it is
>still not perfect. Last name first. First name last. Publisher listed
>as "Tuttle Pub." when it really should be something like "Tuttle
>Publishers". A call number such as "736/.982" when the Cutter number
>should probably have its own field. Etc.
>
>
>Finally, it is not the "what" that needs to change but the "how". The
>content of MARC records is authoritative. Through controlled
>vocabularies it brings together like items and provides a means for
>discovering ideas -- a novel idea in a world of keyword searching.
>The content of MARC records adds value to the items in our
>collections. Unfortunately, we are writing this data in a dead or
>dying language. Our message is great, but the way we are trying to
>communicate it stinks. We might as well be trying to do algebra with
>the use of Roman numerals.
>
>Whew!

But well worth it.  Nicely said.

The only important issue not raised regarding MARC vs XML that I have
seen discussed before is the idea that MARC does a better job of
expressing relationships among records than XML is capable of
doing.  I don't know the validity of this assertion, and I would be
interested in reading what others have to say.

David

>--
>Eric Lease Morgan
>University Libraries of Notre Dame

David Dorman
US Marketing Manager, Index Data
52 Whitman Ave.
West Hartford, Connecticut  06107
dorman_at_indexdata.com
860-389-1568 or toll free 866-489-1568
fax: 860-561-5613

INDEX DATA Means Business
for Open Source and Open Standards
- - - - - - - - - - - - - - -
www.indexdata.com