Re: Online Catalogs: What Users and Librarians Want

From: Weinheimer Jim <j.weinheimer_at_nyob>
Date: Thu, 23 Apr 2009 16:38:59 +0200
To: NGC4LIB_at_LISTSERV.ND.EDU

Interesting.

I believe the original question was about dealing with duplicate records, and I mentioned that XSLT can merge entire metadata/bibliographic records according to all sorts of criteria, therefore the issue of duplicates can be handled at the automation level. It doesn't mean that the original XML records are changed at all, so there are no changes to note down, but automated means can make these masses of individual records more comprehensible and useful to people. 
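
As a rough illustration of that kind of merge, here is a minimal sketch in Python rather than XSLT, chosen only to keep the example short. It assumes each record has already been reduced to a flat dictionary; the field names ("245abc", "250", "260", "300a"), the "id" key, and the helper functions are hypothetical, not part of any real MARC library. The point it tries to show is that grouping happens in a derived view while the source records stay untouched.

```python
from collections import defaultdict

# Hypothetical match criteria: the fields that must agree for records to merge.
MATCH_FIELDS = ("245abc", "250", "260", "300a")


def normalize(value):
    """Lowercase, drop punctuation, and collapse whitespace.

    This is a crude stand-in for "fuzzy" matching: trivial differences
    such as trailing punctuation no longer prevent a match.
    """
    cleaned = "".join(ch for ch in value.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def match_key(record):
    """Build the comparison key from the chosen fields (missing fields count as empty)."""
    return tuple(normalize(record.get(field, "")) for field in MATCH_FIELDS)


def merged_view(records):
    """Group records sharing a match key into one display entry.

    The input records are only read, never modified; each entry keeps
    the ids of every record that was folded into it.
    """
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return [
        {"display": group[0], "ids": [r["id"] for r in group]}
        for group in groups.values()
    ]


# Made-up sample data for illustration only.
records = [
    {"id": "r1", "245abc": "Moby Dick /", "260": "New York : Harper, 1851.", "300a": "635 p."},
    {"id": "r2", "245abc": "Moby Dick", "260": "New York: Harper, 1851", "300a": "635 p"},
    {"id": "r3", "245abc": "Typee", "260": "London : Murray, 1846.", "300a": "285 p."},
]

for entry in merged_view(records):
    print(entry["ids"], entry["display"]["245abc"])
```

Changing MATCH_FIELDS, or swapping in a different normalize function, gives a different merged view of the same unchanged records, which is the flexibility being argued for here.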

Is this the most efficient way of handling things, i.e. merging lots of unit cards into a single view? Possibly not, but it allows for a lot of flexibility and would be a huge improvement over what we have now, where everything is being done by hand. From my reading of RDA, it foresees this same manual procedure into the future.

In addition, I think there is little likelihood that RDA will be accepted outside the library community. Even acceptance inside is looking rather iffy right now, but outside, the outlook is very dim indeed. If we are to have a chance of working with these other communities and increasing our own productivity (and thereby justifying our further existence), it won't be by insisting that everyone follow our rules and methods. We must work with others, too, and this can be done by finding new ways of using and transforming the records others share with us, while presumably they will be doing the same things to our records if they wish.

After a long time of this interaction with everyone's records, there *may* emerge a general consensus on display, search, and so on, or there may not. But I still see no need for the individual records to change, at least for the time being. There is plenty we can do using the powerful automated tools available today.

Jim Weinheimer

> I'd also like to point out that XSLT, though fine for one time or
> on-the-fly transformations, is not a good basis for improving data over
> time, where you want to be able to store changes and track the
> provenance of the change (was it a machine or a person who did this?
> what was the process? when was it done?)  In other words, in order to
> learn from the work we do in matching, improving, etc., and in order to
> share those improvements, we need to look beyond just changing the
> display, to figuring out how to manage the data for use outside a local
> context.
> 
> Diane Hillmann
> 
> Karen Coyle wrote:
> > You don't need XML for that. It might be handy, but you can do it with
> > just about any data format. As a matter of fact, I just got a list of
> > 'matching name variants' from some work being done on the Open Library
> > to match names from MARC records to wikipedia entries. Wikipedia has a
> > lot of information, including full dates of birth and death, place of
> > birth, titles and dates of works. Here are some of the name variants
> > that were found in the MARC records, all of which match up to a single
> > name in wikipedia using an algorithm. (Note, some of the differences
> > are in Unicode encoding, and probably won't show up in an email message.)
> >
> >    * *$a*A. C. Bhaktivedanta Swami Prabhupada*$d*1896-1977.
> >    * *$a*A. C. Bhaktivedanta Swami Prabhupada*$d*1896-1977
> >    * *$a*A.C. Bhaktivedanta Swami Prabhupada*$d*1896-1977
> >    * *$a*A. C. Bhaktivedanta Swami Prabhupa-da*$d*1896-1977.
> >    * *$a*A. C. Bhaktivedanta Swami Prabhupa-da*$d*1896-1977
> >    * *$a*A.C. Bhaktivedanta Swami Prabhupa-da*$d*1896-1977.
> >    * *$a*Bhaktivedanta, A. C.*$d*1896-1977.
> >    * *$a*Bhaktivedanta Swami, A. c.*$d*1896-
> >    * *$a*Bhaktivedanta Swami, A. C.*$d*1896-
> >    * *$a*Bhaktivedanta Swami, A.C.*$d*1896-
> >    * *$a*Bhaktivedanta Swami, A. C.*$d*1896-1977.
> >    * *$a*Bhaktivedanta Swami*$d*1896
> >    * *$a*Bhaktivedanta Swami Prabhupa-da*$d*1896-1977.
> >
> >
> >    * *$a*Athanasius*$c*Saint*$c*Patriarch of Alexandria*$d*d. 373.
> >    * *$a*Athanasius*$c*Saint*$c*Patriarch of Alexandria*$d*d. 373
> >    * *$a*Athanasius*$c*Saint*$d*295-373.
> >    * *$a*Athanasius*$c*Saint*$d*295-373 A.D.
> >    * *$a*Athanasius*$c*Saint*$d*ca. 298-373.
> >    * *$a*Athanasius*$c*Saint*$d*ca.298-373.
> >    * *$a*Athanasius*$c*Saint, Patriarch of Alexander*$d*d. 373.
> >    * *$a*Athanasius*$c*Saint, patriarch of Alexandria*$d*d. 373.
> >    * *$a*Athanasius*$c*Saint, Patriarch of Alexandria*$d*d. 373.
> >    * *$a*Athanasius*$c*Saint, Patriarch of Alexandria*$d*d. 373
> >    * *$a*Athanasius, Saint*$d*295-373 A.D.
> >
> > You can see more here:
> > http://edwardbetts.com/ol/marc_author_variants.html
> >
> > I'd like to see a link between the LCCN in LC names and wikipedia
> > pages...
> >
> > kc
> >
> > Weinheimer Jim wrote:
> >> Deborah Fritz wrote:
> >>
> >>>  Weinheimer Jim wrote:
> >>>
> >>>  > I don't think FRBR is necessary. XML processing can eliminate
> >>>  > duplicates in all kinds of ways, so I still believe that the
> >>>  > main thing is to dump the ISO2709 format ASAP, change to some
> >>>  > kind of XML format, be it MARCXML or MODS, switch to URIs the
> >>>  > moment LC (finally) puts everything online, then share our
> >>>  > records widely (!!) in all different kinds of formats.
> >>>
> >>>  Jim, can you clarify how "XML processing can eliminate duplicates"?
> >>>
> >>
> >> Actually, it's XSLT processing that can eliminate duplicates. XML can
> >> do very little on its own, you need the style sheets that will
> >> transform the XML file into something more useful, such as an HTML
> >> page or pdf document. There are other XML tools as well such as
> >> XQuery, which I understand less.
> >>
> >> There are all kinds of things you can do with XSLT such as sorting,
> >> transforming, etc. in all sorts of ways that I think will take some
> >> time for people to fully appreciate. But one thing it can do is
> >> detect duplicate values and display them as you want. It can also
> >> perform fuzzy value detection. I understand the principle quite well,
> >> but haven't implemented it in a long time. For a short,
> >> semi-technical discussion, see:
> >> http://www.xml.com/pub/a/2002/10/02/tr.html
> >>
> >> Therefore, you can make an XSLT to say that if you have the same
> >> 245abc, 250, 260, 300a, 4xx/8xx (don't know how this would work today
> >> with the new series treatments!), it could merge all the records with
> >> the same information into one record. You could also make it "fuzzy"
> >> with e.g. the 260.
> >>
> >> Or we could merge based on completely different criteria and find
> >> out... who knows? This is where you can play and perhaps discover
> >> something new.
> >>
> >> This is yet another reason why I hesitate to enact RDA and FRBR. If
> >> we want FRBR-type records, I think a *LOT* could be done with XSLTs
> >> to generate those new types of records automatically so that we can
> >> discover if they really are useful to our patrons or not.
> >> There is less and less reason to de-duplicate manually today.
> >>
> >> Jim Weinheimer
> >>
> >>
> >>
> >
> >