Re: Ceci n'est pas un catalogue

From: Hahn, Harvey <hhahn_at_nyob>
Date: Fri, 24 Aug 2007 18:36:33 -0500
To: NGC4LIB_at_listserv.nd.edu

Katherine McConnell wrote:
|Quoting Bernhard Eversberg <ev_at_BIBLIO.TU-BS.DE>:
|> Hahn, Harvey wrote:
|>> I've argued the non-need of ISBD in MARC records
|>> repeatedly in cataloging forums to no avail.  Oh, well...
|
|    I am loving this line of discussion where we pull apart the
|components of what we work with.  A MARC database is a thing of
|beauty.  Separating it from the constraints of AACR2R and/or ISBD and
|even LCSH leaves it open for use in areas other than traditional
|libraries.  The design of the database and the level of indexing
|available and done by most ILSs leaves me wanting to use it for other
|data stores.  And MARCXML starts to look a lot more attractive.  Time
|for MARC to break out of the library system?

What's really interesting is that many (most??) people are unaware that
MARC21 is only one of thousands (maybe millions) of structures for
records that are possible using the "MARC" format.  You have to "think
between the lines" of the MARC21 structural definitions to see the far
more general possibilities:

    <http://www.loc.gov/marc/specifications/specrecstruc.html>

There are four hardcoded values in the general MARC structure: (1) the
leader is a 24-character ASCII alphanumeric string, (2) the record
length is a 5-character ASCII numeric string, (3) the base address of
data (part of the leader) is a 5-character ASCII numeric string and (4)
a tag is a 3-character ASCII alphanumeric string; everything else
defines a specific type of MARC record structure.  (There is another
"hardcoding" aspect in that some of this data is limited to a *single*
digit, that is, a maximum value of 9.)  At this point in time, there is
one and only one defined and implemented MARC record structure: MARC21.
The MARC21 implementation further defines other hardcoded values in the
leader that contribute to what that particular record structure looks
like.

There are some other hardcoded groupings in MARC as well--there are
three parts to a MARC record: (1) leader area, (2) directory area, and
(3) data area; and there are three delimiting characters: (1) subfield
delimiter (SFD), (2) field terminator (FT), and (3) record terminator
(RT).
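
To make the three areas and three delimiters concrete, here is a minimal
Python sketch (my own illustration, not taken from any MARC library).
The names `split_record` and `build_sample` and the one-field sample
record are made up for the example; the delimiter byte values 0x1F,
0x1E, and 0x1D are the ones MARC21 uses.

```python
SFD = b"\x1f"  # subfield delimiter
FT = b"\x1e"   # field terminator
RT = b"\x1d"   # record terminator

LEADER_LEN = 24


def split_record(raw: bytes):
    """Split one raw record into (leader, directory, data)."""
    leader = raw[:LEADER_LEN]
    record_length = int(leader[0:5])   # leader positions 0-4
    base_address = int(leader[12:17])  # leader positions 12-16
    if record_length != len(raw) or not raw.endswith(RT):
        raise ValueError("record length or record terminator mismatch")
    # The directory sits between the leader and its own field terminator.
    directory = raw[LEADER_LEN:base_address - 1]
    # The data area runs from the base address up to the record terminator.
    data = raw[base_address:-1]
    return leader, directory, data


def build_sample() -> bytes:
    """Build a tiny, made-up record holding a single 245 field."""
    field = b"10" + SFD + b"aA title" + FT
    directory = b"245" + b"%04d" % len(field) + b"00000" + FT
    base = LEADER_LEN + len(directory)
    total = base + len(field) + len(RT)
    leader = b"%05d" % total + b"nam a22" + b"%05d" % base + b" a 4500"
    return leader + directory + field + RT
```

Calling `split_record(build_sample())` returns the 24-byte leader, the
single 12-byte directory entry, and the data area holding the one field.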

I might note that, although first released in 1986, the Tag(ged) Image File
Format (TIFF) is conceptually similar to the general MARC format from 20
years earlier (a tribute to the genius of Henriette Avram and her
team!), particularly since they both share the idea of tagged data.

As I mentioned earlier, what makes MARC into MARC21 are certain values
in the leader.  But what if those values are changed?  Here they are
(zero-based from the start of the leader), with all values in the range
0 to 9:

    10: number of indicators
    11: length of identifier (subfield code)
    20: max number of digits in length-of-field value
    21: max number of digits in starting-character-position

The respective values 2, 2, 4, and 5 (and tags limited to numeric values
000 to 999) define the structure of what we know as MARC21 records.  But
there's nothing to say that you couldn't change these values to come up
with a *different* (i.e., non-MARC21) kind of MARC record.
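
One way to see how those four leader positions drive the record
structure is a parser that reads them instead of assuming MARC21's
values.  The sketch below is hypothetical (the names `directory_layout`
and `parse_directory` are mine), and it assumes a 3-character tag and no
implementation-defined part in each directory entry (leader position 22
is "0" in MARC21).

```python
def directory_layout(leader: str) -> dict:
    """Read the structure-defining digits from a 24-character leader."""
    if len(leader) != 24:
        raise ValueError("a leader is always 24 characters")
    return {
        "indicator_count": int(leader[10]),
        "identifier_length": int(leader[11]),
        "length_of_field_digits": int(leader[20]),
        "start_position_digits": int(leader[21]),
    }


def parse_directory(directory: str, leader: str, tag_length: int = 3) -> list:
    """Return a list of (tag, field_length, start_position) tuples."""
    layout = directory_layout(leader)
    n_len = layout["length_of_field_digits"]
    n_pos = layout["start_position_digits"]
    entry_length = tag_length + n_len + n_pos
    if len(directory) % entry_length:
        raise ValueError("directory is not a whole number of entries")
    entries = []
    for i in range(0, len(directory), entry_length):
        entry = directory[i:i + entry_length]
        tag = entry[:tag_length]
        field_length = int(entry[tag_length:tag_length + n_len])
        start = int(entry[tag_length + n_len:])
        entries.append((tag, field_length, start))
    return entries


# MARC21 values (2, 2, 4, 5) give 12-character directory entries.
marc21_leader = "00050nam a2200037 a 4500"
# A made-up non-MARC21 variant with values 3, 3, 3, 5 gives 11-character entries.
variant_leader = "00000nam a3300000 a 3500"
```

With the MARC21 leader, `parse_directory("245001200000", marc21_leader)`
yields one entry for tag 245; with the variant leader the same routine
reads 11-character entries, including an alphanumeric tag such as A12.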

If you think about varying the values for a little bit, you'll probably
note that the values in positions 20 and 21 above are actually
constrained by the 5-character record length; in other words, the
"practical" ranges of values would be 2-4 for length-of-field and 3-5
for starting-character-position, with the greatest practicality and
flexibility at the high ends.  I think there would be greater value to
increasing the number of indicators by 1 or 2 and increasing the size of
the subfield code by 1 to permit things such as $aa, $b3, and $12.  Both
of these changes would increase granularity and flexibility in coding
data.  I first came across these kinds of thoughts 20-some years ago in
Walt Crawford's book "MARC for Library Use".  In the second edition,
page 33, he says:

"The standard allows a very wide range of implementations.  A format
need not have any indicators or subfields to be a Z39.2 format (i.e.,
positions 10 and 11 of the leader could both be '0').  A format could
also have eight indicators per field and subfield codes which were six
characters long--with positions 10 and 11 being '86'--and still be a
Z39.2 format.

"An implementation could even *theoretically* have different directory
structures for different records, since the leader in each record
defines that record's directory.  In practice such an implementation
would be quite difficult to use, as the associated data dictionaries and
parsing rules would be extremely complex."
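
Crawford's point about indicator counts and subfield code lengths can be
sketched as a field parser that takes both values as parameters.  This
is an illustration only: `parse_field` is a hypothetical name, and I am
assuming that the identifier length in leader position 11 counts the
delimiter character itself, as it does in MARC21 (where "2" means the
delimiter plus one code character).

```python
SFD = "\x1f"  # subfield delimiter
FT = "\x1e"   # field terminator


def parse_field(field: str, indicator_count: int, identifier_length: int):
    """Split one variable field into (indicators, [(code, value), ...])."""
    if field.endswith(FT):
        field = field[:-1]
    indicators = field[:indicator_count]
    rest = field[indicator_count:]
    if identifier_length == 0:
        # No subfields at all: the whole remainder is one value.
        return indicators, [("", rest)]
    code_length = identifier_length - 1  # assumption: the delimiter is counted
    subfields = []
    # Everything before the first delimiter is empty, so skip it.
    for chunk in rest.split(SFD)[1:]:
        subfields.append((chunk[:code_length], chunk[code_length:]))
    return indicators, subfields


# MARC21 style: two indicators, one-character subfield codes.
marc21_field = "10" + SFD + "aA title" + SFD + "bsubtitle" + FT
# Made-up variant: three indicators, two-character subfield codes.
variant_field = "10 " + SFD + "aaFirst" + SFD + "b3Second" + FT
```

The same routine handles the MARC21 field with arguments (2, 2) and the
variant field with arguments (3, 3), which is the kind of $aa and $b3
coding mentioned above.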

But there are two *other* things that could be changed, too--one legal,
one currently illegal.

The legal change has to do with the content of the 3-character tags.
MARC21 limits their values to numeric ASCII values, but the MARC
definition of a tag indicates that it contains *alphanumeric* values,
that is, each of the three values of a tag can be numeric and/or
alphabetic.  This gives a possibility of up to 46,655 tags (36 x 36 x 36,
not counting 000) rather than "merely" 999.  (Of course, this is "legal"
MARC, but *illegal* MARC21.)

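The tag arithmetic can be checked in a few lines of Python.  This is
only a sketch: `tag_count` is a name I made up, and I am assuming a
single letter case (uppercase here), so the alphabet has 36 characters.
Counting the all-zero tag gives 1,000 and 46,656; leaving it out gives
the 999 and 46,655 quoted above.

```python
import string

NUMERIC = string.digits                                 # "0123456789"
ALPHANUMERIC = string.digits + string.ascii_uppercase   # 36 characters

TAG_LENGTH = 3


def tag_count(alphabet: str, length: int = TAG_LENGTH,
              skip_all_zero: bool = True) -> int:
    """Count the distinct tags of the given length over the alphabet."""
    total = len(alphabet) ** length
    # Optionally leave out the all-zero tag, as the figures in the text do.
    return total - 1 if skip_all_zero else total
```
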
The illegal change has to do with the challenge in today's electronic
world that some people wish that the MARC record could carry digital
data content within the MARC record itself.  This is currently
impossible because of the hardcoded record length of 5 digits, which
limits a record to 99,999 8-bit bytes of data.  That is 99,999 ASCII
characters, but with multibyte Unicode characters the number of
characters that fit is cut in half (or more!) in the blink of an eye.
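
The difference between characters and bytes is easy to check.  Here is a
small sketch, assuming UTF-8 encoding (`utf8_length` and
`max_characters` are names I made up, and the bound ignores the space
taken by the leader and directory):

```python
MAX_RECORD_BYTES = 99_999  # largest value a 5-digit decimal length can hold


def utf8_length(text: str) -> int:
    """Return the number of 8-bit bytes the text occupies in UTF-8."""
    return len(text.encode("utf-8"))


def max_characters(bytes_per_character: int) -> int:
    """Upper bound on characters per record if each needs this many bytes."""
    return MAX_RECORD_BYTES // bytes_per_character


ascii_word = "MARC"           # 4 characters, 1 byte each
accented = "\u00e9"           # e with acute accent, 2 bytes
cjk = "\u65e5\u672c\u8a9e"    # three CJK characters, 3 bytes each
```

Here `utf8_length(cjk)` is 9 for only three characters, so a record made
up entirely of such text would hold at most `max_characters(3)`
characters, about a third of the ASCII figure.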

I can think of two ways around the limitation--but, of course, it means
changing the world! ;-)  One method, not easily human-readable at all
and requiring all currently existing MARC records to be rewritten, would
be to redefine the five numeric digits from base-10 to either base-16
(hexadecimal), base-32, or base-36.  All three of these could
meaningfully use both numeric and alphabetic characters in each of the
digit positions.  For example, most of you know that the hexadecimal
system uses the numbers 0 to 9 and the letters A to F for the 16 needed
digits; the base-32 system (16 doubled) would use the numbers 0 to 9 and
the letters A to V; a base-36 system (all 10 numbers and 26 letters)
would use 0 to 9 and A to Z.  Although base-16 (max record size =
1,048,575 characters), base-32 (33,554,431 characters), and base-36
(60,466,175 characters) increase the maximum MARC record size, many
current digital files (and most future digital files) still would not
fit within the increased size constraints.
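
The maximum sizes quoted above follow from "base to the fifth power,
minus one".  A short sketch of that arithmetic and of what a 5-position
length would look like in each base (the function names are mine, and
the digit alphabet 0-9 then A-Z is the one described above):

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def max_record_length(base: int, digits: int = 5) -> int:
    """Largest value that fits in the given number of digit positions."""
    return base ** digits - 1


def encode_length(length: int, base: int, digits: int = 5) -> str:
    """Write a record length as a fixed-width string in the given base."""
    if not 0 <= length <= max_record_length(base, digits):
        raise ValueError("length does not fit in the available positions")
    out = []
    for _ in range(digits):
        length, remainder = divmod(length, base)
        out.append(ALPHABET[remainder])
    return "".join(reversed(out))


def decode_length(text: str, base: int) -> int:
    """Read a record length back; int() accepts bases 2 through 36."""
    return int(text, base)
```

For example, a 100,000-byte record (one byte too long for today's
decimal field) would carry `encode_length(100000, 16)` in positions 0-4
under the hexadecimal variant.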

A second (and, I think, much more flexible for the future) method would
retain the initial 5 characters for record length *info* and the
24-character standard for the leader (as the first method above also
does) but, instead of using the first 5 characters as the actual length,
it would use them as a *pointer* to where in the MARC record the true
length can be found.  Since no known MARC records come anywhere near the
maximum length of 99999, I suggest that an initial
digit "9" would indicate that the number is a pointer rather than a
value.  (That way, all current MARC records can exist as is, without any
changes needed.)  The remaining four digits would then indicate a
position (or, perhaps, an offset to a position).  The position would be
just after the directory and before the data content.  This location
(containing the actual length of the record) could be variable in length
(just like data content) and terminated with a field terminator
character, just like the directory and variable fields.

There's a *second* set of 5 digits within the leader that needs to be
redefined when the first 5 characters are a pointer rather than a value:
the "base address of data = length of leader + length of directory + 1"
needs to be changed to "base address of data = length of leader + length
of directory + 1 + length of record size + 1".  (The two 1's represent
the length of each of the two field terminators involved.)
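
Here is how the pointer idea and the revised base address might look in
code.  This is a sketch of the proposal only, not of anything that
exists: the function names are hypothetical, the four digits after the
"9" are treated as an absolute position counted from the start of the
record (I left position versus offset open above), and the record is
handled as a plain string for simplicity.

```python
FT = "\x1e"       # field terminator
LEADER_LEN = 24


def true_record_length(record: str) -> int:
    """Return the record length under the proposed pointer scheme."""
    info = record[0:5]
    if not info.startswith("9"):
        return int(info)             # classic record: the digits are the length
    pointer = int(info[1:5])         # assumed: absolute position of the size field
    end = record.index(FT, pointer)  # the size field ends at a field terminator
    return int(record[pointer:end])


def extended_base_address(directory_length: int, size_field_length: int) -> int:
    """Leader + directory + FT + record-size field + FT."""
    return LEADER_LEN + directory_length + 1 + size_field_length + 1


# A made-up record: leader filler, one directory entry, then the
# variable-length size field that the pointer refers to.
directory = "245001200000"
size_field = "2500000"
pointer = LEADER_LEN + len(directory) + 1
leader = "9%04d" % pointer + " " * 19
sample = leader + directory + FT + size_field + FT + "(data would follow)"
```

A record whose first five characters are, say, "00050" goes through the
first branch unchanged, which is the point of the scheme: existing
records need no rewriting.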

With this approach, all current MARC records can be handled without any
changes to parsing and reading/writing routines.  The difference is that
MARC software would need to *add* new parsing and reading/writing
routines to handle the new situation where the record size begins with
the digit "9".  New reading/writing routines might have to be added as
well to handle the new digital content that might exist within records.
This content might be identified either with one or more "standardized"
9XX tags or, perhaps preferably, with *alphanumeric* tags (permitted in
MARC but not currently in MARC21), where a leading alphabetic character
might perhaps indicate a particular type of digital content.  It would
probably work best if these new tags would be exempt from field length
limitations.

Obviously, what I just said would work only when there is a single
digital content element in the record (because the starting position of
the digital content would still be a relatively small number, capable of
being handled by the current MARC directory structure).  If records
needed to carry multiple digital content elements, then the directory
structure would have to be revamped (either to permit lengthier fields
or to use pointers, like the record size in the leader that I've
proposed), and parsing and reading/writing routines would have to be
newly written for these records containing digital contents.  The
problem with "merely" lengthening the directory fields is that the
single-digit values in the leader for this data (the "4" and "5" near
the end of the leader) would limit positions to only 9 digits, that is,
a positional value of 999,999,999--a pretty big number but still limited
in terms of possible future needs.  To use (and where and how to store)
multiple pointers instead gets complicated, and I haven't thought about
the ease or challenge of that.

The *big* challenge for any solution is how to avoid rewriting or
restructuring all existing MARC records.  My suggestion of changing the
first 5 characters of the leader from "record size" to "record size
information" (permitting a record size or a pointer to the record size,
depending upon a key value to determine which) accomplishes that.
Existing MARC records (and future ones requiring no more or no different
cataloging information than now) can be handled as they are now.
However, "special" MARC records containing digital content (coded with a
first digit of "9" in the "record size information" area of the leader)
can also be handled with new parsing and reading/writing software
routines.

As I recall, I described a lot of this a year or two (or more) ago on
either this list or some other list.  In any case, I don't
pretend to have all the answers, but maybe my thoughts above might
stimulate some further explorations of enhanced MARC solutions to some
of the issues discussed here.

Harvey

--
===========================================
Harvey E. Hahn, Manager, Technical Services Department
Arlington Heights (Illinois) Memorial Library
847/506-2644 - FX: 847/506-2650 - Email: hhahn(at)ahml(dot)info
OML & Scripts web pages: http://www.ahml.info/oml/
Personal web pages: http://users.anet.com/~packrat
Received on Fri Aug 24 2007 - 17:15:42 EDT