I'd also like to point out that XSLT, though fine for one-time or
on-the-fly transformations, is not a good basis for improving data over
time, where you want to be able to store changes and track the
provenance of each change. (Was it a machine or a person who did this?
What was the process? When was it done?) In other words, in order to
learn from the work we do in matching, improving, etc., and in order to
share those improvements, we need to look beyond just changing the
display and figure out how to manage the data for use outside a local
context.
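
To make the provenance point concrete, here is a minimal sketch of a stored change that carries the three questions above: who or what made it, by what process, and when. All names here (Change, ChangeLog, the field names, "name-matcher") are hypothetical, invented for illustration, and the log lives only in memory; a real system would persist it somewhere.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Change:
    """One stored improvement to a record, with its provenance."""
    record_id: str
    field_tag: str       # e.g. "100" for a personal name heading
    old_value: str
    new_value: str
    agent: str           # who did it: a person's name or a program's name
    agent_type: str      # "human" or "machine"
    process: str         # how it was done
    timestamp: datetime  # when it was done (UTC)


class ChangeLog:
    """Append-only, in-memory list of changes (hypothetical design)."""

    def __init__(self) -> None:
        self._changes: List[Change] = []

    def record(self, record_id: str, field_tag: str, old_value: str,
               new_value: str, *, agent: str, agent_type: str,
               process: str) -> Change:
        if agent_type not in ("human", "machine"):
            raise ValueError("agent_type must be 'human' or 'machine'")
        change = Change(record_id, field_tag, old_value, new_value,
                        agent, agent_type, process,
                        datetime.now(timezone.utc))
        self._changes.append(change)
        return change

    def history(self, record_id: str) -> List[Change]:
        """All changes made to one record, oldest first."""
        return [c for c in self._changes if c.record_id == record_id]


log = ChangeLog()
log.record("rec-1", "100",
           "Bhaktivedanta Swami, A. c.", "Bhaktivedanta Swami, A. C.",
           agent="name-matcher", agent_type="machine",
           process="case normalization")
for change in log.history("rec-1"):
    print(change.agent_type, change.process, change.timestamp.isoformat())
```

Because each change keeps the old value next to the new one, the log can be shared or replayed elsewhere, which a display-time transformation does not allow.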
Diane Hillmann
Karen Coyle wrote:
> You don't need XML for that. It might be handy, but you can do it with
> just about any data format. As a matter of fact, I just got a list of
> 'matching name variants' from some work being done on the Open Library
> to match names from MARC records to Wikipedia entries. Wikipedia has a
> lot of information, including full dates of birth and death, place of
> birth, titles and dates of works. Here are some of the name variants
> that were found in the MARC records, all of which match up to a single
> name in Wikipedia using an algorithm. (Note, some of the differences
> are in Unicode encoding, and probably won't show up in an email message.)
>
> * *$a*A. C. Bhaktivedanta Swami Prabhupada*$d*1896-1977.
> * *$a*A. C. Bhaktivedanta Swami Prabhupada*$d*1896-1977
> * *$a*A.C. Bhaktivedanta Swami Prabhupada*$d*1896-1977
> * *$a*A. C. Bhaktivedanta Swami Prabhupa-da*$d*1896-1977.
> * *$a*A. C. Bhaktivedanta Swami Prabhupa-da*$d*1896-1977
> * *$a*A.C. Bhaktivedanta Swami Prabhupa-da*$d*1896-1977.
> * *$a*Bhaktivedanta, A. C.*$d*1896-1977.
> * *$a*Bhaktivedanta Swami, A. c.*$d*1896-
> * *$a*Bhaktivedanta Swami, A. C.*$d*1896-
> * *$a*Bhaktivedanta Swami, A.C.*$d*1896-
> * *$a*Bhaktivedanta Swami, A. C.*$d*1896-1977.
> * *$a*Bhaktivedanta Swami*$d*1896
> * *$a*Bhaktivedanta Swami Prabhupa-da*$d*1896-1977.
>
>
> * *$a*Athanasius*$c*Saint*$c*Patriarch of Alexandria*$d*d. 373.
> * *$a*Athanasius*$c*Saint*$c*Patriarch of Alexandria*$d*d. 373
> * *$a*Athanasius*$c*Saint*$d*295-373.
> * *$a*Athanasius*$c*Saint*$d*295-373 A.D.
> * *$a*Athanasius*$c*Saint*$d*ca. 298-373.
> * *$a*Athanasius*$c*Saint*$d*ca.298-373.
> * *$a*Athanasius*$c*Saint, Patriarch of Alexander*$d*d. 373.
> * *$a*Athanasius*$c*Saint, patriarch of Alexandria*$d*d. 373.
> * *$a*Athanasius*$c*Saint, Patriarch of Alexandria*$d*d. 373.
> * *$a*Athanasius*$c*Saint, Patriarch of Alexandria*$d*d. 373
> * *$a*Athanasius, Saint*$d*295-373 A.D.
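
As an illustration of how some of the variants listed above could be collapsed by machine, here is a minimal sketch that reduces a heading to a comparison key. This is not the algorithm used in the Open Library work, which is not described here. It only removes differences in punctuation, spacing, letter case and Unicode diacritics, so it would still keep "Prabhupa-da" apart from "Prabhupada" and "295-373" apart from "d. 373". The function names are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def parse_subfields(heading):
    """Split '$aName$d1896-1977.' into [('a', 'Name'), ('d', '1896-1977.')]."""
    heading = heading.replace("*", "")  # drop the emphasis markers seen in the list
    return re.findall(r"\$(\w)([^$]*)", heading)


def normalize(text):
    """Lowercase, strip diacritics, turn punctuation into single spaces."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def match_key(heading):
    """A hashable key: normalized value of every subfield, in order."""
    return tuple((code, normalize(value))
                 for code, value in parse_subfields(heading))


def group_variants(headings):
    """Group headings that share the same match key."""
    groups = defaultdict(list)
    for heading in headings:
        groups[match_key(heading)].append(heading)
    return dict(groups)


variants = [
    "*$a*A. C. Bhaktivedanta Swami Prabhupada*$d*1896-1977.",
    "*$a*A. C. Bhaktivedanta Swami Prabhupada*$d*1896-1977",
    "*$a*A.C. Bhaktivedanta Swami Prabhupada*$d*1896-1977",
    "*$a*Bhaktivedanta Swami, A. c.*$d*1896-",
    "*$a*Bhaktivedanta Swami, A. C.*$d*1896-",
]
for key, members in group_variants(variants).items():
    print(len(members), key)
```

Matching inverted forms, missing death dates or differing titles would need further rules or an external source of dates and places, which is where Wikipedia comes in.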
>
> You can see more here:
> http://edwardbetts.com/ol/marc_author_variants.html
>
> I'd like to see a link between the LCCN in LC names and Wikipedia
> pages...
>
> kc
>
> Weinheimer Jim wrote:
>> Deborah Fritz wrote:
>>
>>> Weinheimer Jim wrote:
>>>
>>> > I don't think FRBR is necessary. XML processing can eliminate
>>> > duplicates in all kinds of ways, so I still believe that the
>>> > main thing is to dump the ISO2709 format ASAP, change to some
>>> > kind of XML format, be it MARCXML or MODS, switch to URIs the
>>> > moment LC (finally) puts everything online, then share our
>>> > records widely (!!) in all different kinds of formats.
>>>
>>> Jim, can you clarify how "XML processing can eliminate duplicates"?
>>>
>>
>> Actually, it's XSLT processing that can eliminate duplicates. XML can
>> do very little on its own; you need the style sheets that will
>> transform the XML file into something more useful, such as an HTML
>> page or PDF document. There are other XML tools as well, such as
>> XQuery, which I understand less.
>>
>> There are all kinds of things you can do with XSLT such as sorting,
>> transforming, etc. in all sorts of ways that I think will take some
>> time for people to fully appreciate. But one thing it can do is
>> detect duplicate values and display them as you want. It can also
>> perform fuzzy value detection. I understand the principle quite well,
>> but haven't implemented it in a long time. For a short,
>> semi-technical discussion, see:
>> http://www.xml.com/pub/a/2002/10/02/tr.html
>>
>> Therefore, you can write an XSLT that says: if several records have
>> the same 245abc, 250, 260, 300a, 4xx/8xx (don't know how this would
>> work today with the new series treatments!), merge all the records
>> with the same information into one record. You could also make it
>> "fuzzy" with e.g. the 260.
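
The merge rule described above can be sketched outside XSLT as well. The following is a minimal Python illustration, not a stylesheet: records are treated as duplicates when the 245abc, 250 and 300a values are identical and the 260 values are merely similar. The dictionary layout, the sample records, the 0.85 threshold and the use of difflib for the fuzzy comparison are all assumptions made for the example; series fields (4xx/8xx) are left out.

```python
from difflib import SequenceMatcher

EXACT_FIELDS = ("245abc", "250", "300a")  # must be identical
FUZZY_FIELD = "260"                       # only needs to be similar


def similar(a, b, threshold=0.85):
    """True when two strings are close enough to count as the same."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def cluster_duplicates(records, threshold=0.85):
    """Group records that match the first record of an existing cluster."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            head = cluster[0]
            same_exact = all(rec.get(f) == head.get(f) for f in EXACT_FIELDS)
            if same_exact and similar(rec.get(FUZZY_FIELD, ""),
                                      head.get(FUZZY_FIELD, ""), threshold):
                cluster.append(rec)
                break
        else:
            # No existing cluster matched, so this record starts a new one.
            clusters.append([rec])
    return clusters


# Hypothetical records, invented for the example.
records = [
    {"245abc": "Sample title / A. Author.", "250": "2nd ed.",
     "260": "New York : Wiley, 2001.", "300a": "xii, 300 p."},
    {"245abc": "Sample title / A. Author.", "250": "2nd ed.",
     "260": "New York: Wiley, 2001", "300a": "xii, 300 p."},
    {"245abc": "Sample title / A. Author.", "250": "2nd ed.",
     "260": "London : Penguin, 1999.", "300a": "xii, 300 p."},
    {"245abc": "Another title / B. Writer.", "250": "",
     "260": "New York : Wiley, 2001.", "300a": "150 p."},
]
for cluster in cluster_duplicates(records):
    print(len(cluster), cluster[0]["245abc"], "|", cluster[0]["260"])
```

Changing EXACT_FIELDS or the threshold is the "play" part: different criteria give different clusters, and comparing only against the first record of each cluster keeps the sketch short at the cost of being order-dependent.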
>>
>> Or we could merge based on completely different criteria and find
>> out... who knows? This is where you can play and perhaps discover
>> something new.
>>
>> This is yet another reason why I hesitate to enact RDA and FRBR. If
>> we want FRBR-type records, I think a *LOT* could be done with XSLTs
>> to generate those new types of records automatically so that we can
>> discover if they really are useful to our patrons or not.
>> There is less and less reason to de-duplicate manually today.
>>
>> Jim Weinheimer
>>
Received on Thu Apr 23 2009 - 10:13:59 EDT