Re: Tim Berners-Lee on the Semantic Web

From: Alexander Johannesen <alexander.johannesen_at_nyob> Date: Mon, 16 Nov 2009 21:34:32 +1100 To: NGC4LIB_at_LISTSERV.ND.EDU

2009/11/7 Bernhard Eversberg <ev_at_biblio.tu-bs.de>:
> What better way, regarding the size of the calamity, than to start with
> what we have?

Then the question becomes "what have you got?" which can be a bit
tricky to answer.

> There's the LCSH authority file, and there's VIAF.
> And the LC names and strings can, as they do now, serve as a first
> approximation to identifiers. Better of course, add the Id numbers to as
> many catalog data as possible.

Well, you've certainly got meta data, but it needs a pretty grueling
data quality process applied to it. Unless. Well, unless, of course,
you wrap it in some framework that doesn't break down in rigidity.
(More on this later)

> In LCSH, there are about 220.000 identifiers for work titles, and over
> 5 million for persons. And VIAF has links between persons and works.

The thing, though, is whether these "identifiers" can be turned into
conceptual identifiers without the denormalized mess we usually find
in the MARC universe (no, I'm not bringing MARC up to confuse us,
only as an example of the source we normally use to poke into the meta
data itself).

> The link from there into WorldCat seems to work more often than not.

... which isn't exactly a good resume for its reliability. :)

> These title authorities, however, are based on expressions, not works
> really, but the main title given in the record supposedly is always
> the original title. There is no usable linking from related titles
> to the originals! The future policy and requirements for dealing with
> these need to be discussed anyway.

One job that needs to be done (and I'm sure many of you have already
done it) is the normalization of expressions into works. I've tried it
myself (large scale), and I know of many such undertakings, but I have
yet to come across one that gets it right, or even in the close
vicinity of right. It's more like something that looks like a spunky
new Lamborghini from a distance, but when you get up close you realize
it's only a Toyota. From 10 years ago. Rusty.

However, on this point I'd be *delighted* to hear of new and better
attempts! I'm not too hopeful, of course (being the egg-faced
contrarian that I am), mostly because I don't think computing power
alone is enough; this problem requires library culture to change,
which is a much harder problem.
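
To make the expression-to-work problem concrete, here is a minimal
sketch in Python of the naive approach: build a normalized
(author, original title) key per record and group on it. The record
fields (id, author, original_title), the sample records and the
article list are all made up for illustration; none of it comes from
a real catalogue schema or from any of the projects mentioned above.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical list of leading articles to drop. A real list would have to
# be per-language (note that "die " would also mangle an English title).
LEADING_ARTICLES = ("the ", "a ", "an ", "der ", "die ", "das ", "le ", "la ")


def normalize(text):
    """Fold case, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", " ", no_marks.lower())
    return " ".join(cleaned.split())


def work_key(record):
    """Build a crude work-level key from an expression-level record."""
    title = normalize(record["original_title"])
    for article in LEADING_ARTICLES:
        if title.startswith(article):
            title = title[len(article):]
            break
    return (normalize(record["author"]), title)


def group_into_works(records):
    """Map each work key to the ids of the records that share it."""
    works = defaultdict(list)
    for record in records:
        works[work_key(record)].append(record["id"])
    return dict(works)


# Made-up sample records: three expressions of what a human sees as one work.
RECORDS = [
    {"id": "e1", "author": "Tolstoy, Leo", "original_title": "Voĭna i mir"},
    {"id": "e2", "author": "TOLSTOY, LEO", "original_title": "Voina i mir."},
    {"id": "e3", "author": "Tolstoy, Leo, 1828-1910", "original_title": "Voĭna i mir"},
]

if __name__ == "__main__":
    for key, ids in group_into_works(RECORDS).items():
        print(key, ids)
```

Records e1 and e2 end up under the same key, but e3 lands in a group
of its own because the author heading carries dates. That is the kind
of gap that string crunching alone does not close, and every extra
rule added to cover it depends on how consistently the headings were
entered in the first place.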

> This, however, can only be a first step. Users will need more, they will
> need a robust and simple data format with which to communicate in easy
> and straightforward manners.

Watch out; the term "user" is ambiguous in our discussion here, I
think, and needs a bit of clarification. There are users who need
user-interfaces to use our various systems, and then there are users
(agents, more commonly, in the IT world) who need no user-interface at
all (linked data and its ilk). The latter need different things:
rigidity, ontological backing, normalized data [well, the discussion
is still out on that one], stronger typing, etc.

There's some discussion to be had about the order in which these
things should be prioritized. Normal systems and user-interfaces have
had our attention for a very long time, and only *now* are we talking
about linked data, but the more we study our meta data in this
context, the less usable we find it. If the order had been flipped
around, if our meta data quality had been better at an earlier stage,
making the user-interfaces on top would probably have been easier as
well (because it would be easier to rely on that meta data rather than
making systems and interfaces that merely assume its provenance). It's
a shame, in my view, that the typed quality of the collected meta data
wasn't scrutinized properly until the last 5-6 years or so.

> The format itself is not the whole story

It's worse than that; the format is probably not even important at
all. The Semantic Web is breaking formats apart, putting data and meta
data into two different ontological realms, which is really good. But
that's also bad for libraries, which have their ontological
foundations locked away in a very human culture of AACR2 and MARC
peculiarities, and their data in mostly untyped, denormalized form
(and most often wrapped in the aforementioned MARC). All those years
of getting AACR2/MARC right (and I say "right" in a patronizing,
snarky kinda way, just because it's that time of the month) are more
or less wasted in the context of the Semantic Web / Linked Data; sure,
you can use remnants from it, but there's so much more waste around.

> It doesn't
> seem like vendors would develop any of this by themselves!

Well, some of them are pretty clued in (Talis, for one, who are
represented here on the list and actively participating, are damn
clued in), but a lot of the, ahem, traditional systems a) struggle
with the conceptual technology (it came out of SciComp and AI and is
still fundamentally different from normal software development
processes and existing software) and b) don't need to change unless
put under appropriate pressure (so apply some already!).

> Or have you seen any specs for
> bibliographic data, developed by non-librarians, that are not in some ways
> inadequate or horrifying?

Usually when I want to point out this stuff, I get a bit angry that DC
didn't evolve. It could today be the de facto standard we're talking
about. But no, it started well but disappeared into obscurity as the
set didn't extend (and certainly not in Internet-time related ways).
If we go back to http://richard.cyganiak.de/2007/10/lod/ you'll notice
that one of the largest bubbles (bubble size is related to the number
of triples in the set, or, in other words, the size of the meta data)
is http://www4.wiwiss.fu-berlin.de/bizer/bookmashup/, the RDF Book
Mashup project, which is an amalgam of Amazon and Google (using their
APIs). *That* is what you're up against, and *that* is the path that
is being cut forward, because when it comes to books, they have the
most books available for people to use right now. Of course, you may
not like it (I'm pretty sure you don't, and neither would most people
here, me included), but it is happening right now, they've got a
head start of years, and they are *in*that*map*of*linked*data*, and
you are not. And when you're not, you will become less and less
relevant to those who look to that map for guidance. You could always
try to tell them about id.loc.gov, but at this point I'm not even sure
they would understand what that service is all about. :)

Anyways, off on some adventure!

Regards,

Alex
-- 
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
--- http://shelter.nu/blog/ ----------------------------------------------
------------------ http://www.google.com/profiles/alexander.johannesen ---