Re: ONIX data

From: David Williamson <dawi_at_nyob>
Date: Thu, 30 Dec 2010 10:45:19 -0500
To: NGC4LIB_at_LISTSERV.ND.EDU

Hi everyone,

I haven't really been reading this list since I joined a year ago, but thanks to Allen Mullen for letting AUTOCAT know about the ONIX converter.  
We've been converting ONIX to MARC for 18 months now for pre-publication records in a pilot project.  To date we've done just under 4300 
records (we do 50K CIPs per year so this is a small sample).  We're happy enough with it that we will be scaling it up to production once we 
come up with an implementation plan.  The National Library of Medicine is also partnering with us in this pilot and is eager to expand as much 
as possible.

I ran a test comparing an electronic CIP that we cataloged this morning using ONIX with the result from Charles Ledvina's converter.  My comments 
on the records are below, but the records came out very similar:

Charles' converter:
020:  : $a 0521765196 $c $105.00
020:  : $a 9780521765190
082:04: $a 420
100:1 : $a Manzini, M. Rita.
245:10: $a Grammatical categories : $b variation in romance languages / $c M. Rita Manzini.
250:  : $a 1st ed.
260:  : $a [S.l.] : $b Cambridge University Press, $c 2011.
300:  : $a 368 p.
490:1 : $a Cambridge studies in linguistics.
500:  : $a Hardcover.
520:  : $a Grammatical categories (e.g. complementizer, negation, auxiliary, case) are some of the most important building blocks of syntax and 
morphology. Categorization therefore poses fundamental questions about grammatical structures and about the lexicon from which they are 
built. Adopting a 'lexicalist' stance, the authors argue that lexical items are not epiphenomena, but really represent the mapping of sound to 
meaning (and vice versa) that classical conceptions imply. Their rule-governed combination creates words, phrases and sentences - structured 
by the 'categories' that are the object of the present inquiry. They argue that the distinction between functional and non-functional categories, 
between content words and inflections, is not as deeply rooted in grammar as is often thought. In their argumentation they lay the emphasis on 
empirical evidence, drawn mainly from dialectal variation in the Romance languages, as well as from Albanian.
700:1 : $a Savoia, Leonardo Maria.
830: 0: $a Cambridge studies in linguistics.
856:40: $3 Amazon.com $u http://www.amazon.com/exec/obidos/ASIN/0521765196/chopaconline-20
856:40: $3 Amazon customer reviews $u http://www.chopac.org/cgi-bin/tools/azrev.pl?q=0521765196

LC converter:
010:  : $a   2010052183
020:  : $a 9780521765190 (hardback)
040:  : $a DLC $c DLC
042:  : $a pcc
084:  : $a LAN000000 $2 bisacsh
100:1 : $a Manzini, Maria Rita
245:10: $a Grammatical categories : $b variation in romance languages / $c M. Rita Manzini, Leonardo Maria Savoia.
260:  : $a Cambridge ; $a New York : $b Cambridge University Press, $c 2011.
263:  : $a 1103
300:  : $a p. cm.
490:0 : $a Cambridge studies in linguistics ; $v 128
520:  : $a "Grammatical categories (e.g. complementizer, negation, auxiliary, case) are some of the most important building blocks of syntax 
and morphology. Categorization therefore poses fundamental questions about grammatical structures and about the lexicon from which they are 
built. Adopting a 'lexicalist' stance, the authors argue that lexical items are not epiphenomena, but really represent the mapping of sound to 
meaning (and vice versa) that classical conceptions imply. Their rule-governed combination creates words, phrases and sentences - structured 
by the 'categories' that are the object of the present inquiry. They argue that the distinction between functional and non-functional categories, 
between content words and inflections, is not as deeply rooted in grammar as is often thought. In their argumentation they lay the emphasis on 
empirical evidence, drawn mainly from dialectal variation in the Romance languages, as well as from Albanian"-- $c Provided by publisher.
505:8 : $a Machine generated contents note: Introduction: the biolinguistic perspective; 1. The structure and interpretation of (Romance) 
complementizers; 2. Variation in Romance k-complementizer systems; 3. Sentential negation: adverbs; 4. Sentential negation: clitics; 5. The 
middle-passive voice: evidence from Albanian; 6. The auxiliary: have/be alternations in the perfect; 7. The noun (phrase): agreement, case and 
definiteness in an Albanian variety; 8. (Definite) denotation and case in Romance: history and variation.
650: 7: $a LANGUAGE ARTS & DISCIPLINES / General $2 bisacsh.
700:1 : $a Savoia, Leonardo Maria, $d 1948-
856:42: $3 Cover image $u http://assets.cambridge.org/97805217/65190/cover/9780521765190.jpg

Comments on Charles' converter record:
020: ISBN-10 not given in the ONIX data
082: based on BISAC code?
245: 2nd author was given in a full contributor composite in the ONIX data from Cambridge U. Press.
250: not given in ONIX data
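
On the 020 comment: an ISBN-10 can be computed from any 978-prefixed ISBN-13, which may be how an ISBN-10 ends up in a record when the ONIX 
data doesn't supply one.  I don't know that Charles' converter does it this way; the following is only a minimal Python sketch, and the function 
name is my own invention:

```python
from typing import Optional

def isbn13_to_isbn10(isbn13: str) -> Optional[str]:
    """Derive an ISBN-10 from a 978-prefixed ISBN-13; return None otherwise."""
    digits = isbn13.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit() or not digits.startswith("978"):
        return None  # e.g. 979-prefixed ISBNs have no ISBN-10 equivalent
    core = digits[3:12]  # drop the 978 prefix and the ISBN-13 check digit
    # ISBN-10 check digit: weights 10 down to 2, modulo 11, with 10 written as X
    total = sum(int(d) * w for d, w in zip(core, range(10, 1, -1)))
    check = (11 - total % 11) % 11
    return core + ("X" if check == 10 else str(check))

print(isbn13_to_isbn10("9780521765190"))  # 0521765196
```

The sketch does not validate the incoming ISBN-13 check digit; a real converter would probably want to do that first.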

Comments on LC record:
010: retrieved from the electronic CIP application
020: qualifier given here rather than in 500.
084: a sub-pilot project for ONIX-derived records.  Taken from the ONIX data.
100: cataloger presented with the heading, she verified the form.
260: cataloger supplied place to ONIX converter application, also for 008.
300: LC decided not to add pagination at this time.
490: cataloger found series number in electronic manuscript and added it.
505: LC decided to add non-standard TOC data.
650: same sub-pilot as 084 but with a look-up table to convert the BISAC code to the textual equivalent.
700: cataloger given Savoia, Leonardo Maria and added the date from the authority
From here the LC record will go for subject cataloging and Dewey assignment.
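
The 084/650 look-up mentioned above can be pictured roughly like this.  This is a hedged Python sketch, not the LC application's code: the 
table holds only the one code/heading pair that appears in the record above, and the names BISAC_HEADINGS and bisac_fields are hypothetical.

```python
from typing import List

# Hypothetical excerpt of a BISAC look-up table. Only this one pair comes from
# the record above; a real table would be loaded from the full BISAC list.
BISAC_HEADINGS = {
    "LAN000000": "LANGUAGE ARTS & DISCIPLINES / General",
}

def bisac_fields(code: str) -> List[str]:
    """Return an 084 line and, when the code is known, a 650 line."""
    fields = [f"084:  : $a {code} $2 bisacsh"]
    heading = BISAC_HEADINGS.get(code)
    if heading is not None:
        fields.append(f"650: 7: $a {heading} $2 bisacsh.")
    return fields

for line in bisac_fields("LAN000000"):
    print(line)
```

An unknown code still yields the 084 but no 650, so a gap in the table leaves the textual heading for the cataloger to supply.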

It's great that the records came out so similar.  While we're using ONIX for pre-publication records, Charles' converter would be good for 
creating the initial records for other items that lack copy.

We're doing ONIX conversions on the fly as we catalog a particular CIP.  The ONIX records are in an Access database on the shared network 
space.  If no ONIX is found, the converter passes basic information off to the other converter we use to convert the ASCII text manuscript to a 
MARC record (essentially copy-and-paste with benefits).  Doing the ONIX conversions 1-by-1 makes it easier to update just the pre-publication 
ONIX records rather than trying to convert thousands of records at a time to MARC in a temporary database, updating existing, creating new, 
and deleting old once the pub. date has passed.  We're also only using a subset of our ONIX files and have 12 catalogers now in the pilot.
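
In outline, the 1-by-1 flow with its fallback looks something like the Python sketch below.  Everything in it is a stand-in: a plain dictionary 
plays the part of the Access database, and the two converter functions are placeholders rather than our actual tools.

```python
from typing import Callable, Dict, Tuple

def describe_cip(
    isbn: str,
    onix_records: Dict[str, dict],
    onix_to_marc: Callable[[dict], str],
    manuscript_to_marc: Callable[[str], str],
) -> Tuple[str, str]:
    """Convert one CIP at cataloging time; return (source, marc_text)."""
    onix = onix_records.get(isbn)
    if onix is not None:
        return "onix", onix_to_marc(onix)
    # No ONIX found: hand the basic information to the manuscript converter.
    return "manuscript", manuscript_to_marc(isbn)

# Stand-in data and converters, for illustration only.
records = {"9780521765190": {"title": "Grammatical categories"}}
to_marc = lambda rec: f"245:10: $a {rec['title']}"
from_text = lambda isbn: f"020:  : $a {isbn}"

print(describe_cip("9780521765190", records, to_marc, from_text))
print(describe_cip("9780000000002", records, to_marc, from_text))
```

Because each look-up happens only when a cataloger opens a CIP, there is no temporary MARC database to keep in sync.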

Our biggest problem with converting ONIX to MARC is strange characters in the summaries and TOCs (black diamonds, inverted question 
marks, etc.) and the HTML coding that publishers are increasingly adding to control how TOCs display in their data.  I'm filtering a lot of 
this, but things keep popping up.
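
To give a feel for the kind of filtering involved, here is a small Python sketch.  It is not the filter I actually run, and the name 
clean_publisher_text is made up.  It assumes the "black diamond" is the Unicode replacement character (U+FFFD), and it only flags inverted 
question marks for review rather than deleting them, since they are legitimate in Spanish text.

```python
import html
import re
from typing import Tuple

TAG_RE = re.compile(r"<[^>]+>")  # crude match for HTML tags such as <p> or <br/>

def clean_publisher_text(text: str) -> Tuple[str, bool]:
    """Return (cleaned_text, needs_review) for a summary or TOC string."""
    text = TAG_RE.sub(" ", text)              # drop HTML tags
    text = html.unescape(text)                # &amp; -> &, &nbsp; -> U+00A0
    text = text.replace("\ufffd", "")         # remove replacement characters
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    needs_review = "\u00bf" in text           # inverted question mark present
    return text, needs_review

print(clean_publisher_text("<p>1. Intro&nbsp;&amp; scope</p><br/>2. Syntax\ufffd"))
```

Tags are stripped before entities are decoded so that text such as "a &lt; b" survives; publishers who double-escape their HTML would need 
the opposite order.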

Our catalogers have done time studies: when the ONIX data is clean and all elements are present, describing an item by ONIX conversion takes 
about half the time it takes with the other converter (which is itself faster than manually keying a new record).  The catalogers report that an ONIX 
conversion averages about a minute and is more like proofreading than creating a catalog record.  The rest of the process (subject analysis, 
classification, Dewey) takes the usual amount of time.  The enhancements to the record (summaries, TOCs, BISAC codes/terms) are provided 
with very little overhead, and if something is really messed up, the cataloger deletes the summary or TOC and moves on.  The biggest 
headache is when the publisher doesn't provide all of the contributors listed on the title page and the ONIX converter does something that the 
cataloger has to then undo because of the rules.

In one of the messages about ONIX, someone was asking for data.  I recommend Cambridge U. Press DataShop for very good, clean, free 
ONIX data.  While there is no central ONIX distribution, Firebrand's Eloquence service provides ONIX for many publishers as does NetRead.  
The other files LC gets are directly from the publishers that self-distribute.  If anyone has a lead on ONIX from non-US sources for free, I'd like to 
hear about it.  Springer and Dilve (Spain) provide free data, but I haven't found any others in Europe yet.

David Williamson
Cataloging Automation Specialist
Acquisitions and Bibliographic Access Directorate
Library of Congress
Washington, D.C. 20540-4200
202.707.5179 (voice)
dawi_at_loc.gov