Re: Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records

From: Godmar Back <godmar_at_nyob>
Date: Thu, 8 Mar 2012 15:32:51 -0500
To: CODE4LIB_at_LISTSERV.ND.EDU

On Thu, Mar 8, 2012 at 3:18 PM, Ed Summers <ehs_at_pobox.com> wrote:

> Hi Terry,
>
> On Thu, Mar 8, 2012 at 2:36 PM, Reese, Terry
> <terry.reese_at_oregonstate.edu> wrote:
> > This is one of the reasons you really can't trust the information found
> > in position 9.  This is one of the reasons why when I wrote MarcEdit, I
> > utilize a mixed process when working with data and determining characterset
> > -- a process that reads this byte and takes the information under
> > advisement, but in the end treats it more as a suggestion and one part of a
> > larger heuristic analysis of the record data to determine whether the
> > information is in UTF8 or not.  Fortunately, determining if a set of data
> > is in UTF8 or something else, is a fairly easy process.  Determining the
> > something else is much more difficult, but generally not necessary.
>
> Can you describe in a bit more detail how MARCEdit sniffs the record
> to determine the encoding? This has come up enough times w/ pymarc to
> make it worth implementing.
>
>
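For reference, the "UTF8 or something else" half of such a check can be
sketched in a few lines of plain Python. This is only a guess at the general
idea, not MarcEdit's actual heuristic and not an existing pymarc function;
the name sniff_encoding and the returned dict are made up for illustration.
It treats leader[9] as a hint and lets a strict UTF-8 decode have the final
word.

```python
def sniff_encoding(record: bytes) -> dict:
    """Guess whether a raw MARC record is UTF-8, using leader[9] only as a hint."""
    # Leader position 9: 'a' claims UCS/Unicode, anything else is read as MARC-8.
    claimed = "utf-8" if record[9:10] == b"a" else "marc-8"

    if record.isascii():
        # All bytes < 0x80: decodes as UTF-8, but may still carry MARC-8
        # escape sequences, so this case tells us nothing either way.
        detected = "ascii"
    else:
        try:
            record.decode("utf-8")  # strict decoding rejects malformed sequences
            detected = "utf-8"
        except UnicodeDecodeError:
            detected = "not-utf-8"  # MARC-8 or some other legacy encoding

    # The leader is consistent if it claims UTF-8 exactly when the bytes are UTF-8.
    consistent = detected == "ascii" or (detected == "utf-8") == (claimed == "utf-8")
    return {"claimed": claimed, "detected": detected, "consistent": consistent}
```

A record whose leader says 'a' but whose data holds a MARC-8 combining
acute (0xE2 followed by the base letter) fails the decode, so it comes back
with consistent set to False.
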
One side comment: while smart handling or automatic detection of encodings
would be a nice feature to have, it would also help if pymarc could operate
in an 'agnostic' or 'raw' mode, in which it writes each record back out in
whatever encoding it had when it was read.

[ Right now, pymarc does not have such a mode: if leader[9] == 'a', the
data is unconditionally encoded as UTF-8 on output, as per mbklein's patch. ]
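
To illustrate what I mean by 'raw', here is a rough sketch that works on the
bytes alone and never decodes them. It is not pymarc code; iter_raw_records
and copy_raw are hypothetical helper names. The only thing it relies on is
that the first five bytes of each record give the record length in ASCII
digits.

```python
import io


def iter_raw_records(stream):
    """Yield each MARC record from a binary stream as untouched bytes."""
    while True:
        head = stream.read(5)
        if len(head) < 5:
            return  # end of input (or trailing garbage too short to be a record)
        # Leader positions 0-4 hold the total record length as ASCII digits.
        length = int(head.decode("ascii"))
        body = stream.read(length - 5)
        if len(body) < length - 5:
            raise ValueError("truncated record")
        yield head + body


def copy_raw(src, dst):
    """Copy records from src to dst byte for byte; return how many were copied."""
    count = 0
    for raw in iter_raw_records(src):
        dst.write(raw)  # no decoding, no re-encoding, leader[9] is ignored
        count += 1
    return count


if __name__ == "__main__":
    # Toy record: leader[9] is 'a', but the data contains a MARC-8 acute (0xE2).
    body = b"nam a2200000 a 4500" + b"caf\xe2e" + b"\x1e\x1d"
    rec = ("%05d" % (len(body) + 5)).encode("ascii") + body
    out = io.BytesIO()
    copy_raw(io.BytesIO(rec), out)
    print(out.getvalue() == rec)
```

Because nothing is decoded, a record that is mislabeled in leader[9] comes
out exactly as it went in, which is the behavior I am after.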

 - Godmar