Re: more on MARC char encoding: Now we're about ISO_2709 and MARC21

From: Jonathan Rochkind <rochkind_at_nyob>
Date: Wed, 18 Apr 2012 10:04:12 -0400
To: CODE4LIB_at_LISTSERV.ND.EDU
On 4/18/2012 6:04 AM, Tod Olson wrote:
> It has to mean UTF-8. ISO 2709 is very byte-oriented, from the directory structure to the byte-offsets in the fixed fields. The values in these places all assume 8-bit character data, it's completely baked in to the file format.

I'm not sure that follows. One could certainly have UTF-16 in a MARC 
record, and still count bytes to get a directory structure and byte 
offsets. (In some ways it'd be easier, since every char in the Basic 
Multilingual Plane would be exactly two bytes.)
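
To make the byte-counting point concrete, here is a rough Python sketch. 
The helper names are made up for illustration and are not from any MARC 
library. It assumes the usual 12-character directory entry (3-char tag, 
4-digit field length, 5-digit starting position) and ignores indicators, 
subfield delimiters and the field terminator.

```python
# Hypothetical helpers for illustration only, not from any MARC library.

def field_byte_length(text: str, encoding: str) -> int:
    """Length of the field data in bytes under the given encoding."""
    return len(text.encode(encoding))


def directory_entry(tag: str, length: int, start: int) -> str:
    """Build a 12-character directory entry: tag, length, start offset.

    Simplified: a real entry's length also counts indicators, subfield
    delimiters and the field terminator.
    """
    return f"{tag}{length:04d}{start:05d}"


sample = "G\u00f6del"  # 5 characters, one of them non-ASCII (o with diaeresis)

# UTF-8: ASCII chars take 1 byte each, the non-ASCII char takes 2.
utf8_len = field_byte_length(sample, "utf-8")

# UTF-16 (big-endian, no BOM): each of these chars takes 2 bytes.
utf16_len = field_byte_length(sample, "utf-16-be")

utf8_entry = directory_entry("100", utf8_len, 0)
utf16_entry = directory_entry("100", utf16_len, 0)

print(utf8_len, utf8_entry)
print(utf16_len, utf16_entry)
```

Either way the directory is still just byte counts; only the numbers 
differ between the two encodings.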

In fact, I worry that the standard may pre-date UTF-8, with its 
reference to "UCS". If I understand things right, at one point there 
was only one Unicode encoding, called "UCS" (more precisely UCS-2), a 
fixed-width two-byte encoding that what became UTF-16 is 
backwards-compatible with.

So I worry the standard really "means" UCS/UTF-16.

But if in fact records in the wild with the 'u' value are far more 
likely to be UTF-8... well it's certainly not the first time the MARC21 
standard was useless/ignored as a standard in answering such questions.
Received on Wed Apr 18 2012 - 10:05:57 EDT