Re: data vs "data structure"

From: Karen Coyle <kcoyle_at_nyob> Date: Fri, 21 Sep 2007 07:22:34 -0700 To: NGC4LIB_at_listserv.nd.edu

ISO 2709 does include a way to create fields that are longer than the
field limit: multiple directory entries can be created. All but the
last one are given a zero length, which should be translated to mean: the
maximum. The last one contains the remainder. So if your length limit is
9999, and you want to create a field with 10500 bytes, you have two
directory entries, one is 0 and the other is 501.
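
To make the arithmetic concrete, here is a minimal Python sketch of that splitting rule. The function names and the handling of a field that is an exact multiple of the limit are my own assumptions for illustration, not something taken from the text of the standard.

```python
MAX_FIELD_LEN = 9999  # largest value a four-digit length can hold


def directory_lengths(field_len, limit=MAX_FIELD_LEN):
    """Return the length values for the directory entries of one field.

    Every entry except the last is written as 0, read as "the maximum";
    the last entry carries the remainder.
    """
    if field_len < 1:
        raise ValueError("field length must be at least 1")
    full, remainder = divmod(field_len, limit)
    if remainder == 0:
        # Assumption: for an exact multiple of the limit, the last entry
        # states the full chunk explicitly instead of a second zero.
        full -= 1
        remainder = limit
    return [0] * full + [remainder]


def total_length(lengths, limit=MAX_FIELD_LEN):
    """Recover the real field length from a list of directory lengths."""
    return sum(limit if n == 0 else n for n in lengths[:-1]) + lengths[-1]


print(directory_lengths(10500))  # [0, 501]
print(total_length([0, 501]))    # 10500
```

A reader that understands the convention adds 9999 for each zero entry and then the final remainder, which gives back 10500.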

Unfortunately, there is no such capability in MARC, which was based on
Z39.2 and not "upgraded" to ISO 2709. But the problem is social and
economic, not technical.

kc

Sharon Foster wrote:
> "What do I do if I want to place an abstract in my MARC record and it is
> 10,000 long? I can't do that because the data structure won't accommodate
> it."
>
> I don't know enough about the other two problems to suggest a solution, but
> this one seems relatively easy. Modify the definition of the length field so
> that if the first character is an 'E' (or choose your favorite character
> that's not 0 through 9) then there is, by definition, an extended length
> field where the next nine (not four) characters comprise the length field.
> Repeat as needed.
>
> Why isn't that possible?
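
A rough Python sketch of what this proposal might look like in code. The 'E' marker and the nine-digit width come from the message above; the function names are hypothetical, and only one level of extension is handled (the "repeat as needed" step is left out).

```python
SHORT_WIDTH = 4      # the existing four-digit length field
EXTENDED_WIDTH = 9   # proposed width after the marker character
MARKER = "E"         # proposed marker; any non-digit would do


def read_length(data, pos=0):
    """Return (length, next_pos) for a length field starting at data[pos]."""
    if data[pos:pos + 1] == MARKER:
        start, width = pos + 1, EXTENDED_WIDTH
    else:
        start, width = pos, SHORT_WIDTH
    digits = data[start:start + width]
    if len(digits) != width or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"bad length field at offset {pos}: {digits!r}")
    return int(digits), start + width


def write_length(n):
    """Encode n, using the extended form only when four digits are too few."""
    if n < 0 or n > 10**EXTENDED_WIDTH - 1:
        raise ValueError("length out of range")
    if n < 10**SHORT_WIDTH:
        return f"{n:0{SHORT_WIDTH}d}"
    return f"{MARKER}{n:0{EXTENDED_WIDTH}d}"


print(write_length(100))          # 0100
print(write_length(10500))        # E000010500
print(read_length("E000010500"))  # (10500, 10)
```

The parsing is the easy part. Every existing reader that expects four digits would reject the 'E', so all deployed software would have to change at once.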
>
> On 9/20/07, Eric Lease Morgan <emorgan_at_nd.edu> wrote:
>> On Sep 20, 2007, at 5:25 PM, Rinne, Nathan (ESC) wrote:
>>
>>>> But we've had this discussion so many times before, we just
>>>> keep going round.
>>> Is there any way to make all of this more concrete?  If it is not
>>> the content of a typical MARC record that is outdated or has
>>> outlived its usefulness, but rather the "data structure"
>>> (container), what exactly does this mean?  How can we map it to
>>> make it easier to understand?
>>
>> I will take a stab at answering this question.
>>
>> Here in the United States, the current data structure of our catalogs
>> is MARC, and considering today's computing environment, MARC is a
>> very poor container. I will list at least three reasons why:
>>
>> 1. MARC has a number of arbitrary limitations - Metadata regarding
>> information resources can include things like titles, authors,
>> physical descriptions, all types of notes, added entries,
>> authorities, controlled vocabularies, pointers to locations, etc. The
>> total sum of this information, counted in bytes, should not have any
>> limitations. If I want to use a megabyte of data to describe a book,
>> then I should be allowed to do so. Unfortunately, by definition, a
>> MARC record can be no larger than 99,999 characters. This is true
>> because the first five characters of every MARC record are an integer,
>> zero-padded on the left, defining the length of the record.  If the
>> first five characters of a MARC record are 00100, then the record is
>> 100 bytes long. If the first five characters are 02100, then the
>> record is 2,100 bytes long. A mere 2K.
>>
>> Moreover, each entry in the directory section of a MARC record is
>> 12 characters long: a three-character tag, a four-digit field length,
>> and a five-digit starting position, something like this:
>> 245010000050. Like the leading characters of a MARC record, the four
>> digits after the tag represent the length of the field, in this case
>> the 245 field. Since that length is only four characters long, the
>> maximum length for any field is 9,999 characters. What do I do if I
>> want to place an abstract in my MARC record and it is 10,000
>> characters long? I can't do that because the data structure won't
>> accommodate it.
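
For anyone who wants to see these limits in code, here is a small Python sketch. It assumes the MARC 21 layout as I understand it: a five-digit record length at the start of the leader, and 12-character directory entries made up of a three-character tag, a four-digit field length, and a five-digit starting position. The function names and the sample values are made up for illustration.

```python
MAX_RECORD_LENGTH = 10**5 - 1  # 99,999: five digits at the start of the leader
MAX_FIELD_LENGTH = 10**4 - 1   # 9,999: four digits in a directory entry


def record_length(record: bytes) -> int:
    """Read the zero-padded record length from the first five bytes."""
    return int(record[:5])


def parse_directory_entry(entry: bytes):
    """Split one 12-byte directory entry into (tag, length, start)."""
    if len(entry) != 12 or not entry[3:].isdigit():
        raise ValueError(f"not a directory entry: {entry!r}")
    tag = entry[:3].decode("ascii")
    return tag, int(entry[3:7]), int(entry[7:12])


def fits_in_one_field(n: int) -> bool:
    """True if a field of n bytes can be described by a single entry."""
    return 0 <= n <= MAX_FIELD_LENGTH


print(record_length(b"02100"))                 # 2100
print(parse_directory_entry(b"245010000050"))  # ('245', 100, 50)
print(fits_in_one_field(10000))                # False
```

The 10,000-character abstract fails the last check by exactly one character.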
>>
>>
>> 2. Presentation information is mixed with MARC content - An easy-to-
>> use data structure should not include presentation elements because
>> it is unknown in what context the data will be given. Because
>> (bibliographic) MARC data includes an abundance of "syntactical sugar",
>> it is very difficult to parse. For example, field 020 is intended to
>> contain ISBN numbers but often times you might see it contain a value
>> such as "0804837635 (pbk.)". The text "(pbk.)" is not the ISBN number
>> but a value describing the item's format. Similarly, the title of a
>> book might be encoded as "The adventures of Huckleberry Finn /", but
>> the trailing slash is not really a part of the title. The slash is
>> used for humans to make the item easier to read -- presentation. The
>> same thing goes for author names "Kilgour, Fred (1914 - 2006)". The
>> punctuation is full of "sugar" telling you that the first string of
>> characters is the last name, the second string of characters is the
>> first name, and the last string of characters are birth/death years.
>> While things like the (pbk.), /, commas, etc. are important
>> denotations of valuable metadata, there are too many rules needed to
>> be known in order for a computer to make sense of it. "If there are
>> dashes in between the integers of subfield a of a 020 field, then
>> ignore them, and if there are parentheses at the end of a 020 field,
>> then ignore that too. Everything else is the ISBN number."
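
The quoted rule can be written down in a few lines of Python, which shows how much format-specific knowledge even this one field needs. This is only a sketch of the rule as stated above; the function name is hypothetical, the hyphenated sample is made up, and no check-digit validation is attempted.

```python
import re

# A parenthetical qualifier such as "(pbk.)" at the end of the value.
TRAILING_QUALIFIER = re.compile(r"\s*\([^)]*\)\s*$")


def extract_isbn(subfield_a: str) -> str:
    """Apply the rule above to the text of an 020 subfield a."""
    value = TRAILING_QUALIFIER.sub("", subfield_a)  # ignore "(pbk.)"
    value = value.replace("-", "")                  # ignore dashes
    return value.strip()                            # the rest is the ISBN


print(extract_isbn("0804837635 (pbk.)"))     # 0804837635
print(extract_isbn("0-8048-3763-5 (pbk.)"))  # 0804837635
```

Every other field with its own punctuation conventions would need a separate function like this one.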
>>
>> To remove this problem, things like format (pbk.), first name, last
>> name, birth year, death year should be explicitly stated, not denoted
>> by punctuation that varies from data element to data element. By
>> doing so it is easy to present the information in any way desired
>> without having to do any special processing.
>>
>>
>> 3. MARC is not the data structure of the rest of the world - IMHO,
>> this is the most important reason why MARC is a poor data structure. It
>> has nothing to do with logic, efficient use of disk space, nor
>> elegance. It has to do with communication. If you want to communicate
>> with other people, then you need to use a common language. In today's
>> computing environment, that language is XML. It is the language of
>> the Web. It is the language of publishers. It is the language of
>> blogs. It is the language of "mash-ups". It is the language of modern-
>> day search technologies (OpenSearch and SRU). It is the language of
>> business transactions. It is used to mark-up electronic texts (TEI
>> and DocBook). It is used to capture the content of archives (EAD). If
>> Microsoft had its way, it would be the language of word processors.
>>
>> Libraries are about collecting, organizing, preserving, and
>> *disseminating* data, information, and knowledge. If you want to
>> disseminate content, then you need to disseminate it in a way that is
>> easily understandable. Increasingly computers are the transmitters
>> and receivers -- the middlemen -- of content. XML is easy to read and
>> write. All you need is a text editor. (I challenge anybody in this
>> forum to use only a text editor to read, write, and modify sets of
>> valid MARC records. Anybody!) Moreover, there are a multitude of
>> industries focusing on just creating and supporting tools for reading
>> and writing XML. At the same time, there is a shrinking number of
>> companies, let alone industries, that specialize in the reading and
>> writing of MARC.
>>
>> To see what our bibliographic data can look like in XML, try looking
>> at some MARCXML and MODS data. Here are some bibliographic examples
>> from my very simple library catalog made with MyLibrary (and, yes, it
>> is a catalog despite what some people have said):
>>
>>    tagged - http://tinyurl.com/2edncu
>>    MARCXML - http://tinyurl.com/2xuofn
>>    MODS - http://tinyurl.com/ynkkux
>>
>> The MODS example is the best implementation of the three, but it is
>> still not perfect. Last name first. First name last. Publisher listed
>> as "Tuttle Pub." when it really should be something like "Tuttle
>> Publishers". A call number such as "736/.982" when the Cutter number
>> should probably have its own field. Etc.
>>
>>
>> Finally, it is not the "what" that needs to change but the "how". The
>> content of MARC records is authoritative. Through controlled
>> vocabularies it brings together like items and provides a means for
>> discovering ideas -- a novel idea in a world of keyword searching.
>> The content of MARC records adds value to the items in our
>> collections. Unfortunately, we are writing this data in a dead or
>> dying language. Our message is great, but the way we are trying to
>> communicate it stinks. We might as well be trying to do algebra with
>> the use of Roman numerals.
>>
>> Whew!
>>
>> --
>> Eric Lease Morgan
>> University Libraries of Notre Dame
>>
>
>
>
> --
> Sharon M. Foster
> VSA Software
> Open Source Software for Libraries
> http://www.vsa-software.com/ils655
>
>

--
-----------------------------------
Karen Coyle / Digital Library Consultant
kcoyle@kcoyle.net http://www.kcoyle.net
ph.: 510-540-7596   skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234
------------------------------------