Re: FRBR WEMI and identifiers

From: McGrath, Kelley C. <kmcgrath_at_nyob> Date: Mon, 23 Nov 2009 10:11:35 -0500 To: NGC4LIB_at_LISTSERV.ND.EDU

Well, I see that in the time I have been on vacation and catching up, this thread has diverged into a myriad of interesting topics. However, to go back to my original concern...

> McGrath, Kelley C. wrote:
>
>> It seems to me that Work records will only really be useful if they
>> are a public good and available to everyone. Although Works don't
>> really have inherent identifiers, theoretically, Work records could
>> more easily have identifiers in the way that authority records do
>> now. It may be that different groups will create different Work
>> records which may have to be linked in some way, as the VIAF tries to
>> do now for names.
>

>Jakob Voss wrote:
> Right. The only usable work identifiers that I know of are LibraryThing Work
> identifiers and some Wikipedia articles about works - and these identifiers
> work very well (if they exist for a given work)! For music there is the
> International Standard Musical Work Code (ISWC) and "uniform titles" could
> be used to some degree for mapping manifestations. Instead of inventing your
> own work-identifier-system you should just reuse existing efforts. Do you
> store LibraryThing work identifiers in your catalog?

Kelley: The project I am working with is focused on moving images. An obvious source of reliable identifiers that is fairly comprehensive for the territory it intends to cover is IMDB. I wonder, though: since we will have to create identifiers of some sort for our project anyway, might it not make more sense for us to maintain a complete set of our own identifiers and just map them to external identifiers?

Neither IMDB nor any other external source is likely to have a comprehensive list of identifiers for library holdings. There are a great many things in IMDB that are not held by libraries because they are not extant in a form that can be collected by libraries or are just unlikely to be collected by libraries. There is an even larger group of things that are held by libraries that are not and never will be in IMDB (and no, we can't just add our holdings to IMDB. IMDB is similar to a library in many ways and one of them is that they have what is essentially a collection development policy that excludes most things that have not been "publicly released" by their definition).
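
To make the idea of local identifiers mapped to external ones concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the WorkRecord structure, the "mi-work-" identifier pattern, the scheme name "imdb", and the placeholder identifier values are hypothetical and not taken from any existing system.

```python
from dataclasses import dataclass, field


@dataclass
class WorkRecord:
    """A locally maintained Work record (hypothetical structure)."""
    local_id: str   # identifier we mint and control ourselves
    title: str
    # scheme name -> external identifier; empty when no external source covers the work
    external_ids: dict = field(default_factory=dict)


def find_by_external(records, scheme, value):
    """Return every local record mapped to the given external identifier.

    A list is returned because an external source may treat as one work
    what we treat as several.
    """
    return [r for r in records if r.external_ids.get(scheme) == value]


# Placeholder data: the identifier values below are made up.
records = [
    WorkRecord("mi-work-0001", "Apocalypse Now", {"imdb": "imdb-placeholder-1"}),
    WorkRecord("mi-work-0002", "Locally produced lecture recording"),
]
```

The second record has a local identifier but no external mapping, which is the case an external source alone could not cover.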

There is also a certain loss of control that comes with using external identifiers. Not everyone dices up the world in quite the same way. In most cases, the lines between one work and the next are obvious once agreement is reached on a few basic assumptions (e.g., a film adaptation is a new work and not an expression of a literary work).

However, there remain a non-trivial number of edge cases. 

Take for example the TV and film versions of Bergman's Fanny och Alexander. IMDB treats these as one work, whereas the library world has established two separate uniform titles and apparently considers them two works. However, the authority record notes that Wikipedia states that it was "originally conceived as a four part TV movie which spanned 312 minutes. A version lasting only 188 minutes was created later for cinematic release." So perhaps these really should be related expressions. The current rules don't really cover these kinds of situations.

Apocalypse Now and Apocalypse Now Redux present a similar situation, but in this case both the library uniform title and IMDB consider them a single work. The authority record quotes IMDB as saying "Apocalypse now; 1979 film directed by Francis Ford Coppola; also known as: Apocalypse now redux; 2001 release, longer version."

Other situations that the OLAC task force struggled with when looking at work boundaries include films where the visual aspect has been retained but the dialogue has been dubbed with something completely different, usually a parody, e.g., What's Up, Tiger Lily?, in which Woody Allen uses dubbed dialogue to spoof a Japanese action film. Other tricky cases include European and American versions of a silent film, made with the same personnel but from different takes of the same scenes and usually, but not always, edited the same way. Early talkies that were filmed in multiple language versions, sometimes with different casts, raise similar questions, e.g., the Spanish and English versions of Dracula from 1931. In the end, we came up with heuristics, rather than hard-and-fast rules, on where to draw the lines.

It's true that we have a sort of library identifier in uniform titles and, so far as I can tell, moving image uniform titles are at the work level (e.g., you are not supposed to qualify by the language(s) of the expression, as is done for textual works). However, the proportion of moving image works that have established uniform titles is minuscule.

WorldCat work identifiers are interesting, but it often seems to me that WorldCat clusters at a level somewhere below the work, and its clustering for moving images seems somewhat unreliable in that it often fails to bring together things that I think should be collocated. I also don't know how stable the identifiers are, since they seem to be based on clustering rather than on hard-coded, explicitly maintained links.

>> So I am not sure how we could control Manifestations on a massive
>> scale, especially in a distributed model rather than a centralized
>> one. The best I could come up with is finding a way to say in a
>> Manifestation record that this is a Manifestation record and then
>> putting the Work identifier in the Manifestation record to say that
>> this represents a Manifestation of or includes this Work. This does
>> not completely resolve the problem as it would seem to be hard to
>> manipulate the Manifestations if they don't have their own
>> identifiers.
>>
>> Or perhaps we could create a centralized repository of Manifestations
>> and records could be linked to that. Which is essentially what OCLC
>> is trying to do now, but definitely not in a way that those numbers
>> will become a public good.
>>
>> Or perhaps I'm just looking at this wrong?

>Jakob Voss wrote:
> There already is one distributed identifier system that can be used for
> manifestations: it's the URI. Just use RDF to express that some given URI
> identifies a manifestation and RDF properties to state which works and
> expressions it is connected to:
> http://vocab.org/frbr/core.html#Expression
>
> In addition you can harvest the Semantic Web for expressions that other
> people have created. The rest only depends on nice interfaces that people
> can use to manage FRBR statements.
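
For concreteness, statements of the kind Jakob describes might look roughly like the output of the sketch below, which writes N-Triples lines using only the Python standard library. The namespace URI and the property names embodimentOf and realizationOf are my understanding of the FRBR Core vocabulary he links to and should be checked against the published vocabulary. The example.org URIs and the function names are hypothetical.

```python
# Assumed namespace for the FRBR Core vocabulary (verify before relying on it).
FRBR = "http://purl.org/vocab/frbr/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def triple(subject, predicate, obj):
    """Format one N-Triples statement whose three terms are all URIs."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def describe_manifestation(manifestation_uri, expression_uri, work_uri):
    """State that a URI identifies a manifestation and link it up to a work."""
    return [
        triple(manifestation_uri, RDF_TYPE, FRBR + "Manifestation"),
        triple(manifestation_uri, FRBR + "embodimentOf", expression_uri),
        triple(expression_uri, RDF_TYPE, FRBR + "Expression"),
        triple(expression_uri, FRBR + "realizationOf", work_uri),
        triple(work_uri, RDF_TYPE, FRBR + "Work"),
    ]


# Hypothetical URIs, for illustration only.
statements = describe_manifestation(
    "http://example.org/manifestation/1",
    "http://example.org/expression/1",
    "http://example.org/work/1",
)
print("\n".join(statements))
```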

Kelley: Well, I could certainly unilaterally create a URI for a given manifestation. But how useful is that if no one else uses it? It would seem to me that these identifiers would be most useful if we could all agree to use the same identifier for the same manifestation.

This is a hard problem. The most comprehensive attempt is obviously WorldCat with its OCLC numbers, which has the advantage of being a centralized, closed system. OCLC also has clever algorithms and the resources for skilled human intervention. Nevertheless, WorldCat is plagued with duplicate manifestation records. This is not because OCLC is somehow incompetent, but because the nature of the problem makes it almost impossible to solve. Between missing data, incorrect data, and different interpretations of how data should be entered, it is hard to imagine a realistic, practical way to resolve this, especially with new data continually being added. However, I do think that if we made our assertions about the data that identifies a manifestation explicitly machine-interpretable, we could develop algorithms that are much better at identifying potential duplicates.
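
As a rough illustration of that last point, and not a description of how OCLC or anyone else actually does it, here is a minimal Python sketch that flags potential duplicates once the identifying data is explicit. The field names (id, title, year, format), the normalization rule, and the sample records are all assumptions made up for the example.

```python
import re
from collections import defaultdict


def match_key(record):
    """Build a comparison key from explicitly identified fields (assumed names)."""
    # Lowercase the title and collapse punctuation and whitespace to single spaces.
    title = re.sub(r"[^a-z0-9]+", " ", record.get("title", "").lower()).strip()
    return (title, record.get("year"), record.get("format"))


def potential_duplicates(records):
    """Group record ids that share a key; records missing any key field are skipped."""
    groups = defaultdict(list)
    for rec in records:
        key = match_key(rec)
        if any(part in ("", None) for part in key):
            continue  # missing data: cannot safely be compared this way
        groups[key].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Made-up sample records.
sample = [
    {"id": "a1", "title": "Apocalypse Now", "year": 1979, "format": "DVD"},
    {"id": "b7", "title": "Apocalypse now.", "year": 1979, "format": "DVD"},
    {"id": "c3", "title": "Apocalypse Now", "year": 1979, "format": "VHS"},
    {"id": "d9", "title": "Apocalypse Now", "year": None, "format": "DVD"},
]
print(potential_duplicates(sample))  # [['a1', 'b7']]
```

The record with a missing year is left out rather than guessed at, which is exactly where missing and inconsistently entered data would still call for human review.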

Kelley McGrath
kmcgrath_at_bsu.edu