Re: data vs "data structure"

From: Sharon Foster <vsa.software_at_nyob>
Date: Thu, 20 Sep 2007 23:07:11 -0400
To: NGC4LIB_at_listserv.nd.edu

"What do I do if I want to place an abstract in my MARC record and it is
10,000 characters long? I can't do that because the data structure won't
accommodate it."

I don't know enough about the other two problems to suggest a solution, but
this one seems relatively easy. Modify the definition of the length field so
that if the first character is an 'E' (or choose your favorite character
that's not 0 through 9) then there is, by definition, an extended length
field where the next nine (not four) characters comprise the length field.
Repeat as needed.

Why isn't that possible?
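
A rough sketch of how a reader could decode such a length field, in
Python. The function name read_length, the default widths, and the choice
to handle only one level of extension (not the "repeat as needed" part)
are my own assumptions for illustration; nothing here is defined by MARC.

```python
def read_length(data: str, pos: int = 0, width: int = 4, ext_width: int = 9):
    """Decode a length field at data[pos]; return (length, next_pos).

    Hypothetical rule: a leading 'E' means the next ext_width characters
    hold the length; otherwise the usual width characters do.
    """
    if data[pos:pos + 1] == "E":
        start, size = pos + 1, ext_width
    else:
        start, size = pos, width
    digits = data[start:start + size]
    if len(digits) != size or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"bad length field at offset {pos}: {digits!r}")
    return int(digits), start + size
```

Under this rule "0050" still means 50, while "E000010000" means 10,000.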

On 9/20/07, Eric Lease Morgan <emorgan_at_nd.edu> wrote:
>
> On Sep 20, 2007, at 5:25 PM, Rinne, Nathan (ESC) wrote:
>
> >> But we've had this discussion so many times before, we just
> >> keep going round.
> >
> > Is there any way to make all of this more concrete?  If it is not
> > the content of a typical MARC record that is outdated or has
> > outlived its usefulness, but rather the "data structure"
> > (container), what exactly does this mean?  How can we map it to
> > make it easier to understand?
>
>
> I will take a stab at answering this question.
>
> Here in the United States, the current data structure of our catalogs
> is MARC, and considering today's computing environment, MARC is a
> very poor container. I will list at least three reasons why:
>
> 1. MARC has a number of arbitrary limitations - Metadata regarding
> information resources can include things like titles, authors,
> physical descriptions, all types of notes, added entries,
> authorities, controlled vocabularies, pointers to locations, etc. The
> total sum of this information, counted in bytes, should not have any
> limitations. If I want to use a megabyte of data to describe a book,
> then I should be allowed to do so. Unfortunately, by definition, a
> MARC record can be no larger than 99,999 characters. This is true
> because the first five characters of every MARC record are an
> integer, zero-padded on the left, giving the record's length. If the
> first five characters of a MARC record are 00100, then the record is
> 100 bytes long. If the first five characters are 02100, then the
> record is 2,100 bytes long. A mere 2K.
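
To make the five-digit limit concrete, here is a minimal Python sketch of
writing and reading that record-length prefix. The names
encode_record_length and decode_record_length are made up for this
example and do not come from any MARC library.

```python
MAX_RECORD_LENGTH = 99_999  # largest value that fits in five decimal digits


def encode_record_length(n: int) -> str:
    """Return n as the five-character, zero-padded length prefix."""
    if not 0 <= n <= MAX_RECORD_LENGTH:
        raise ValueError(f"{n} does not fit in a five-digit length")
    return f"{n:05d}"


def decode_record_length(record: str) -> int:
    """Read the record length from the first five characters."""
    return int(record[:5])
```

encode_record_length(2100) gives "02100", and asking for 100,000 raises
an error because there is no sixth digit to put it in.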
>
> Moreover, each entry in the directory section of a MARC record is a
> 12-character string: a three-character tag, a four-digit field
> length, and a five-digit starting position, something like this:
> 245005000100. Like the leading characters of a MARC record, the four
> digits after the tag represent the length of a field, in this case
> the 245 field. Since the length is only four characters long, the
> maximum length for any field is 9,999 characters. What do I do if I
> want to place an abstract in my MARC record and it is 10,000
> characters long? I can't do that because the data structure won't
> accommodate it.
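
A small Python sketch of the directory-entry limit, assuming the MARC 21
entry layout of a three-character tag, a four-digit field length, and a
five-digit starting position. The DirectoryEntry type and both function
names are hypothetical, chosen only for this illustration.

```python
from typing import NamedTuple


class DirectoryEntry(NamedTuple):
    tag: str     # three-character field tag, e.g. "245"
    length: int  # field length, at most 9,999
    start: int   # starting character position, at most 99,999


def parse_directory_entry(entry: str) -> DirectoryEntry:
    """Split one 12-character directory entry into its three parts."""
    if len(entry) != 12:
        raise ValueError("a directory entry is exactly 12 characters")
    return DirectoryEntry(entry[0:3], int(entry[3:7]), int(entry[7:12]))


def format_directory_entry(tag: str, length: int, start: int) -> str:
    """Build a directory entry, refusing values the layout cannot hold."""
    if len(tag) != 3:
        raise ValueError("tag must be three characters")
    if not 0 <= length <= 9_999:
        raise ValueError("field length does not fit in four digits")
    if not 0 <= start <= 99_999:
        raise ValueError("starting position does not fit in five digits")
    return f"{tag}{length:04d}{start:05d}"
```

Formatting a 50-character 245 field at offset 100 yields "245005000100";
a 10,000-character abstract in a 520 field is rejected.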
>
>
> 2. Presentation information is mixed with MARC content - An easy-to-
> use data structure should not include presentation elements because
> it is unknown in what context the data will be given. Because
> (bibliographic) MARC data includes an abundance of "syntactical
> sugar", it is very difficult to parse. For example, field 020 is
> intended to contain ISBN numbers, but oftentimes it contains a value
> such as "0804837635 (pbk.)". The text "(pbk.)" is not the ISBN number
> but a value describing the item's format. Similarly, the title of a
> book might be encoded as "The adventures of Huckleberry Finn /", but
> the trailing slash is not really a part of the title. The slash is
> used for humans to make the item easier to read -- presentation. The
> same thing goes for author names "Kilgour, Fred (1914 - 2006)". The
> punctuation is full of "sugar" telling you that the first string of
> characters is the last name, the second string of characters is the
> first name, and the last string of characters is the birth/death
> years. While things like the (pbk.), /, commas, etc. are important
> denotations of valuable metadata, a computer needs to know too many
> rules in order to make sense of them. "If there are
> dashes in between the integers of subfield a of a 020 field, then
> ignore them, and if there are parentheses at the end of a 020 field,
> then ignore that too. Everything else is the ISBN number."
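
The quoted rule can be written down in a few lines of Python, which also
shows how much the parser has to guess. This is only a sketch: the
function name split_020a and the regular expression are my assumptions,
and real 020 fields vary more than this handles.

```python
import re

# Digits, X, and dashes first; then an optional trailing "(...)" qualifier.
_ISBN_WITH_QUALIFIER = re.compile(r"^\s*([0-9Xx-]+)\s*(?:\((.*)\))?\s*$")


def split_020a(value: str):
    """Split an 020 $a value into (isbn, qualifier or None)."""
    match = _ISBN_WITH_QUALIFIER.match(value)
    if match is None:
        raise ValueError(f"unrecognized 020 value: {value!r}")
    isbn = match.group(1).replace("-", "").upper()  # ignore the dashes
    return isbn, match.group(2)
```

For "0804837635 (pbk.)" this returns the ISBN "0804837635" and the
qualifier "pbk." as separate values.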
>
> To remove this problem, things like format (pbk.), first name, last
> name, birth year, death year should be explicitly stated, not denoted
> by punctuation that varies from data element to data element. By
> doing so it is easy to present the information in any way desired
> without having to do any special processing.
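
As an illustration of "explicitly stated" elements, here is a hedged
Python sketch in which each part of a name is its own field and the
punctuation is added only at display time. The PersonalName class and its
method names are invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonalName:
    family: str
    given: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None

    def inverted(self) -> str:
        """Catalog-style display: family name first."""
        return f"{self.family}, {self.given}"

    def direct(self) -> str:
        """Natural-order display: given name first."""
        return f"{self.given} {self.family}"
```

The same stored data, PersonalName("Kilgour", "Fred", 1914, 2006), can
then be shown as "Kilgour, Fred" or "Fred Kilgour" without any parsing.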
>
>
> 3. MARC is not the data structure of the rest of the world - IMHO,
> this is the most important reason why MARC is a poor data structure. It
> has nothing to do with logic, efficient use of disk space, nor
> elegance. It has to do with communication. If you want to communicate
> with other people, then you need to use a common language. In today's
> computing environment, that language is XML. It is the language of
> the Web. It is the language of publishers. It is the language of
> blogs. It is the language of "mash-ups". It is the language of modern-
> day search technologies (OpenSearch and SRU). It is the language of
> business transactions. It is used to mark-up electronic texts (TEI
> and DocBook). It is used to capture the content of archives (EAD). If
> Microsoft had its way, it would be the language of word processors.
>
> Libraries are about collecting, organizing, preserving, and
> *disseminating* data, information, and knowledge. If you want to
> disseminate content, then you need to disseminate it in a way that is
> easily understandable. Increasingly computers are the transmitters
> and receivers -- the middlemen -- of content. XML is easy to read and
> write. All you need is a text editor. (I challenge anybody in this
> forum to use only a text editor to read, write, and modify sets of
> valid MARC records. Anybody!) Moreover, there are a multitude of
> industries focusing on just creating and supporting tools for reading
> and writing XML. At the same time, there is a shrinking number of
> companies, let alone industries, that specialize in the reading and
> writing of MARC.
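
To give a sense of how little tooling is needed on the XML side, here is
a short Python sketch that reads a MARCXML fragment with only the
standard library. The record fragment is made up for this example from
values quoted in the message, and the helper name subfields is
hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample fragment, not a real catalog record.
MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="020" ind1=" " ind2=" ">
    <subfield code="a">0804837635 (pbk.)</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="4">
    <subfield code="a">The adventures of Huckleberry Finn /</subfield>
  </datafield>
</record>"""

NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def subfields(xml_text: str, tag: str, code: str) -> list:
    """Return the text of every subfield `code` in datafields `tag`."""
    record = ET.fromstring(xml_text)
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [element.text for element in record.findall(path, NS)]
```

Note that the values still carry the "(pbk.)" and trailing slash: MARCXML
changes the container, not the punctuation inside it.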
>
> To see what our bibliographic data can look like in XML, try looking
> at some MARCXML and MODS data. Here are some bibliographic examples
> from my very simple library catalog made with MyLibrary (and, yes, it
> is a catalog despite what some people have said):
>
>    tagged - http://tinyurl.com/2edncu
>    MARCXML - http://tinyurl.com/2xuofn
>    MODS - http://tinyurl.com/ynkkux
>
> The MODS example is the best implementation of the three, but it is
> still not perfect. Last name first. First name last. Publisher listed
> as "Tuttle Pub." when it really should be something like "Tuttle
> Publishers". A call number such as "736/.982" when the Cutter number
> should probably have its own field. Etc.
>
>
> Finally, it is not the "what" that needs to change but the "how". The
> content of MARC records is authoritative. Through controlled
> vocabularies it brings together like items and provides a means for
> discovering ideas -- a novel idea in a world of keyword searching.
> The content of MARC records adds value to the items in our
> collections. Unfortunately, we are writing this data in a dead or
> dying language. Our message is great, but the way we are trying to
> communicate it stinks. We might as well be trying to do algebra with
> the use of Roman numerals.
>
> Whew!
>
> --
> Eric Lease Morgan
> University Libraries of Notre Dame
>

--
Sharon M. Foster
VSA Software
Open Source Software for Libraries
http://www.vsa-software.com/ils655