Re: data vs "data structure"

From: James Weinheimer <j.weinheimer_at_nyob>
Date: Fri, 21 Sep 2007 11:47:55 +0200
To: NGC4LIB_at_listserv.nd.edu

If I may enter into this, there are a few aspects to data and data structure
that I think need to be kept separate. In library terms, there is the
ISO2709 format that contains the MARC structure; the MARC structure contains
MARC21 (from now on, I am discussing Anglo-American cataloging); the MARC21
structure contains the cataloging/metadata information, which is composed of
ISBD, AACR2 (which is ISBD plus rules for headings), and Subject access in
various ways, but we think of LCSH; finally, when multiples of these
ISO2709/MARC/MARC21/ISBD/AACR2/LCSH records come together, it is called a
library catalog.

There are different, competing systems for each of these kinds of data and
data structures. I believe that our number one task is to have all of these
competing systems work together in the most simple and harmonious way for
the good of our users and ourselves.

Out of this, the biggest problem is that the ISO2709 format needs to be
junked ASAP in favor of XML. For all intents and purposes, this has already
been done in most catalogs (except for ISIS databases and some others) and
ISO2709 is used almost solely for record transfer. This needs to change in
favor of different methods of XML harvesting, so that I am not stuck with
importing only my own formats. This should be a relatively simple task and
has already been done in several catalogs.

The basic MARC structure is a different matter (that is, the numbered
fields-subfields structure). There are all kinds of different MARC formats,
and does there have to be a single one if they are all in XML? Is there any
inherent advantage of <mainentrypersonalname> over <100>? Do we have to impose
uniformity in this structure?
I think not, since XML allows people to convert from any format into any
other format. Additionally, since very few places use MARC-type formats, is
it reasonable to expect all archives, news providers, libraries, publishers,
statisticians, geographers, etc. to all use the same format, no matter what
that format happens to be? I don't think so, and for the sake of our users
and ourselves, our efforts should focus on making all of these formats
interoperate.
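
As a rough illustration of converting between element-naming conventions,
here is a minimal Python sketch that rewrites a record using descriptive
element names into MARC-style numbered fields. The element names, the mapping
table, and the function name are hypothetical, and only subfield a is
handled; a real conversion would more likely be written as an XSLT
stylesheet.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from descriptive element names to MARC tags.
NAME_TO_TAG = {
    "mainentrypersonalname": "100",
    "titlestatement": "245",
}

def named_to_numbered(xml_text):
    """Rewrite <record><name>text</name>...</record> as MARC-style datafields."""
    source = ET.fromstring(xml_text)
    target = ET.Element("record")
    for child in source:
        tag = NAME_TO_TAG.get(child.tag)
        if tag is None:
            continue  # unmapped elements are simply skipped in this sketch
        datafield = ET.SubElement(target, "datafield", {"tag": tag})
        subfield = ET.SubElement(datafield, "subfield", {"code": "a"})
        subfield.text = child.text
    return ET.tostring(target, encoding="unicode")

named = (
    "<record>"
    "<mainentrypersonalname>Twain, Mark</mainentrypersonalname>"
    "<titlestatement>The adventures of Huckleberry Finn</titlestatement>"
    "</record>"
)
numbered = named_to_numbered(named)
```

Going the other way only requires inverting the mapping table, which is why
the choice between names and numbers matters less once both sides are XML.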

MARC21 is heavily dependent on ISBD/AACR2 but MARCXML still allows a certain
amount of flexibility in defining additional fields. I personally believe
that as the new catalogs catch on, especially the XML non-databases such as
Lucene and Zebra, the data structures of individual catalogs will change and
become more local, but the XML/XSL structure will always allow XML records
to be shared, although some local editing probably will be needed.

ISBD/AACR2/LCSH are an altogether different task, since this goes beyond
computer systems. It also involves including authority files. Some people I
have spoken with see the task as too complex and believe that the choice is
either to forego all standardization in this "data" and mix all types of
this information together, or to try to force everyone to use the same
authorized forms in some unspecified way. As a result, these people believe
the whole idea of "authority control" is no longer applicable.

I think authority control is necessary, and there are options through
concept servers, interoperable authority files, and so on, so that if
someone finds an interesting concept in one database, they will be able to
find related concepts in other databases, and in this way, we can help build
the Semantic Web. All of the points mentioned above are important parts of
the Semantic Web.
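
To make the idea of interoperable authority files a little more concrete,
here is a minimal Python sketch. The authority table, the two sample
databases, and the function names are all hypothetical; a real concept server
would be a shared network service rather than a dictionary in memory.

```python
# Hypothetical authority file: each variant form points to one authorized heading.
AUTHORITY = {
    "clemens, samuel langhorne": "Twain, Mark, 1835-1910",
    "twain, mark": "Twain, Mark, 1835-1910",
    "mark twain": "Twain, Mark, 1835-1910",
}

def authorized_form(name):
    """Return the authorized heading for a name, or the name unchanged if unknown."""
    key = " ".join(name.lower().split())  # normalize case and spacing
    return AUTHORITY.get(key, name)

def search(database, name):
    """Find records whose creator resolves to the same authorized heading."""
    heading = authorized_form(name)
    return [rec for rec in database if authorized_form(rec["creator"]) == heading]

# Two hypothetical databases that record the same person differently.
library = [{"creator": "Twain, Mark", "title": "Adventures of Huckleberry Finn"}]
archive = [{"creator": "Clemens, Samuel Langhorne", "title": "Letters"}]

hits = search(library, "Mark Twain") + search(archive, "Mark Twain")
```

Neither database has to change its own form of the name; both only need to
resolve to the same authorized heading.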

There is a growing demand for the Semantic Web, and I think that when all
materials are placed on the web (IMHO, this is inevitable), the Semantic Web
(i.e. authority control) will be absolutely necessary; otherwise, nobody
will be able to find anything at all. I haven't heard, and cannot think, of
any other way for the Semantic Web to work besides imposing some sort of
authority control somewhere along the way. Perhaps this task can be
automated, but I'm not holding my breath.

Consequently, there is a lot of very hard work, and very interesting work,
to be done.

James Weinheimer  j.weinheimer_at_aur.edu
Director of Library and Information Services
The American University of Rome
via Pietro Roselli, 4
00153 Rome, Italy
voice- 011 39 06 58330919 ext. 327
fax-011 39 06 58330992

> -----Original Message-----
> From: Next generation catalogs for libraries
> [mailto:NGC4LIB_at_listserv.nd.edu] On Behalf Of Sharon Foster
> Sent: Friday, September 21, 2007 5:07 AM
> To: NGC4LIB_at_listserv.nd.edu
> Subject: Re: [NGC4LIB] data vs "data structure"
>
> "What do I do if I want to place an abstract in my MARC record and it is
> 10,000 long? I can't do that because the data structure won't accommodate
> it."
>
> I don't know enough about the other two problems to suggest a solution, but
> this one seems relatively easy. Modify the definition of the length field so
> that if the first character is an 'E' (or choose your favorite character
> that's not 0 through 9) then there is, by definition, an extended length
> field where the next nine (not four) characters comprise the length field.
> Repeat as needed.
>
> Why isn't that possible?
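
A minimal Python sketch of the extended length field proposed above, assuming
a four-digit length in the normal case and an 'E' followed by nine digits
otherwise. The function names are hypothetical, and the "repeat as needed"
step is left out.

```python
def encode_length(n):
    """Encode a field length: four digits normally, 'E' plus nine digits if larger."""
    if n < 0:
        raise ValueError("length cannot be negative")
    if n <= 9999:
        return f"{n:04d}"
    if n <= 999_999_999:
        return f"E{n:09d}"
    raise ValueError("length too large for this sketch")

def decode_length(text):
    """Read a length from the start of text; return (length, characters consumed)."""
    if text[0] == "E":
        return int(text[1:10]), 10
    return int(text[:4]), 4
```

The decoder has to report how many characters it consumed, because the length
field is no longer a fixed width. Software that assumes fixed-width directory
entries would not be able to read such records without modification.
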
>
> On 9/20/07, Eric Lease Morgan <emorgan_at_nd.edu> wrote:
> >
> > On Sep 20, 2007, at 5:25 PM, Rinne, Nathan (ESC) wrote:
> >
> > >> But we've had this discussion so many times before, we just
> > >> keep going round.
> > >
> > > Is there any way to make all of this more concrete?  If it is not
> > > the content of a typical MARC record that is outdated or has
> > > outlived its usefulness, but rather the "data structure"
> > > (container), what exactly does this mean?  How can we map it to
> > > make it easier to understand?
> >
> >
> > I will take a stab at answering this question.
> >
> > Here in the United States, the current data structure of our catalogs
> > is MARC, and considering today's computing environment, MARC is a
> > very poor container. I will list at least three reasons why:
> >
> > 1. MARC has a number of arbitrary limitations - Metadata regarding
> > information resources can include things like titles, authors,
> > physical descriptions, all types of notes, added entries,
> > authorities, controlled vocabularies, pointers to locations, etc. The
> > total sum of this information, counted in bytes, should not have any
> > limitations. If I want to use a megabyte of data to describe a book,
> > then I should be allowed to do so. Unfortunately, by definition, a
> > MARC record can be no larger than 99,999 characters. This is true
> > because the first five characters of every MARC record are a
> > left-zero-padded integer defining the length of the record.  If the
> > first five characters of a MARC record are 00100, then the record is
> > 100 bytes long. If the first five characters are 02100, then the
> > record is 2,100 bytes long. A mere 2K.
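
The record-length rule described above can be sketched in a few lines of
Python. The function name is hypothetical, and the record is treated as a
plain string for simplicity, whereas real records are byte strings.

```python
def declared_length(record):
    """Return the record length stated in the first five characters."""
    prefix = record[:5]
    if len(prefix) != 5 or not prefix.isdigit():
        raise ValueError("expected five leading digits")
    return int(prefix)

# Five decimal digits cannot express anything above 99,999.
MAX_RECORD_LENGTH = int("9" * 5)
```
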
> >
> > Moreover, each field in the directory section of a MARC record is
> > described by a 12-character entry made up of a three-character tag, a
> > four-digit field length, and a five-digit starting position, something
> > like this: 245010000050. Like the leading characters of a MARC record,
> > the four digits after the tag represent the length of the field, in
> > this case the 245 field. Since that set of characters is four
> > characters long, the maximum length for any field is 9,999 characters.
> > What do I do if I want to place an abstract in my MARC record and it is
> > 10,000 characters long? I can't do that because the data structure
> > won't accommodate it.
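
A minimal Python sketch of reading the directory, assuming the ISO 2709
layout of 12-character entries: a three-character tag, a four-digit field
length, and a five-digit starting position. The class name, function name,
and sample directory string are hypothetical, and the field terminator that
ends a real directory is assumed to have been stripped already.

```python
from typing import NamedTuple

class DirectoryEntry(NamedTuple):
    tag: str     # three-character field tag, e.g. "245"
    length: int  # four digits, so at most 9,999
    start: int   # five digits, offset of the field within the data area

def parse_directory(directory):
    """Split a directory string into its 12-character entries."""
    entries = []
    for i in range(0, len(directory), 12):
        chunk = directory[i:i + 12]
        entries.append(DirectoryEntry(chunk[:3], int(chunk[3:7]), int(chunk[7:12])))
    return entries

# Two made-up entries: field 001 (13 characters at offset 0)
# and field 245 (45 characters at offset 13).
entries = parse_directory("001001300000245004500013")
```

Because the length occupies exactly four digits, there is no way to write
10,000 in an entry without changing the width of every entry.
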
> >
> >
> > 2. Presentation information is mixed with MARC content - An easy-to-
> > use data structure should not include presentation elements because
> > it is unknown in what context the data will be given. Because
> > (bibliographic) MARC data includes an abundance of "syntactical sugar",
> > it is very difficult to parse. For example, field 020 is intended to
> > contain ISBN numbers but often you might see it contain a value
> > such as "0804837635 (pbk.)". The text "(pbk.)" is not the ISBN number
> > but a value describing the item's format. Similarly, the title of a
> > book might be encoded as "The adventures of Huckleberry Finn /", but
> > the trailing slash is not really a part of the title. The slash is
> > used for humans to make the item easier to read -- presentation. The
> > same thing goes for author names "Kilgour, Fred (1914 - 2006)". The
> > punctuation is full of "sugar" telling you that the first string of
> > characters is the last name, the second string of characters is the
> > first name, and the last string of characters are birth/death years.
> > While things like the (pbk.), /, commas, etc. are important
> > denotations of valuable metadata, there are too many rules needed to
> > be known in order for a computer to make sense of it. "If there are
> > dashes in between the integers of subfield a of a 020 field, then
> > ignore them, and if there are parentheses at the end of a 020 field,
> > then ignore that too. Everything else is the ISBN number."
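
The rule quoted above can be sketched in Python as follows. The function name
is hypothetical, and only the two cases mentioned (dashes and a trailing
parenthetical) are handled; real 020 fields contain more variations than
this.

```python
import re

def parse_020(value):
    """Split an 020 subfield a value into (isbn, qualifier)."""
    value = value.strip()
    qualifier = None
    # A parenthetical at the very end, such as "(pbk.)", is not part of the ISBN.
    match = re.search(r"\(([^)]*)\)\s*$", value)
    if match:
        qualifier = match.group(1)
        value = value[:match.start()]
    # Dashes between the digits are presentation only.
    isbn = value.replace("-", "").strip()
    return isbn, qualifier
```

Every field with its own punctuation conventions needs a separate routine
like this one, which is the burden being described here.
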
> >
> > To remove this problem, things like format (pbk.), first name, last
> > name, birth year, death year should be explicitly stated, not denoted
> > by punctuation that varies from data element to data element. By
> > doing so it is easy to present the information in any way desired
> > without having to do any special processing.
> >
> >
> > 3. MARC is not the data structure of the rest of the world - IMHO,
> > this is the most important reason why MARC is a poor data structure. It
> > has nothing to do with logic, efficient use of disk space, nor
> > elegance. It has to do with communication. If you want to communicate
> > with other people, then you need to use a common language. In today's
> > computing environment, that language is XML. It is the language of
> > the Web. It is the language of publishers. It is the language of
> > blogs. It is the language of "mash-ups". It is the language of modern-
> > day search technologies (OpenSearch and SRU). It is the language of
> > business transactions. It is used to mark-up electronic texts (TEI
> > and DocBook). It is used to capture the content of archives (EAD). If
> > Microsoft had its way, it would be the language of word processors.
> >
> > Libraries are about collecting, organizing, preserving, and
> > *disseminating* data, information, and knowledge. If you want to
> > disseminate content, then you need to disseminate it in a way that is
> > easily understandable. Increasingly computers are the transmitters
> > and receivers -- the middlemen -- of content. XML is easy to read and
> > write. All you need is a text editor. (I challenge anybody in this
> > forum to use only a text editor to read, write, and modify sets of
> > valid MARC records. Anybody!) Moreover, there are a multitude of
> > industries focusing on just creating and supporting tools for reading
> > and writing XML. At the same time, there are a shrinking number of
> > companies, let alone industries, that specialize in the
> > reading/writing of MARC.
> >
> > To see what our bibliographic data can look like in XML, try looking
> > at some MARCXML and MODS data. Here are some bibliographic examples
> > from my very simple library catalog made with MyLibrary (and, yes, it
> > is a catalog despite what some people have said):
> >
> >    tagged - http://tinyurl.com/2edncu
> >    MARCXML - http://tinyurl.com/2xuofn
> >    MODS - http://tinyurl.com/ynkkux
> >
> > The MODS example is the best implementation of the three, but it is
> > still not perfect. Last name first. First name last. Publisher listed
> > as "Tuttle Pub." when it really should be something like "Tuttle
> > Publishers". A call number such as "736/.982" when the Cutter number
> > should probably have its own field. Etc.
> >
> >
> > Finally, it is not the "what" that needs to change but the "how". The
> > content of MARC records is authoritative. Through controlled
> > vocabularies it brings together like items and provides a means for
> > discovering ideas -- a novel idea in a world of keyword searching.
> > The content of MARC records adds value to the items in our
> > collections. Unfortunately, we are writing this data in a dead or
> > dying language. Our message is great, but the way we are trying to
> > communicate it stinks. We might as well be trying to do algebra with
> > the use of Roman numerals.
> >
> > Whew!
> >
> > --
> > Eric Lease Morgan
> > University Libraries of Notre Dame
> >
>
>
>
> --
> Sharon M. Foster
> VSA Software
> Open Source Software for Libraries
> http://www.vsa-software.com/ils655