Re: Tim Berners-Lee on the Semantic Web--Missing His Main Point

From: Ross Singer <rossfsinger_at_nyob> Date: Fri, 23 Oct 2009 14:43:02 -0400 To: NGC4LIB_at_LISTSERV.ND.EDU

On Fri, Oct 23, 2009 at 10:44 AM, James Weinheimer <j.weinheimer_at_aur.edu> wrote:
> Karen mentioned that the entire file of LC is in the Internet Archive. I was
> unaware of that, but I can't find it. The files I can find are MARC21
> ISO2709 files which is the equivalent of what TBL said about pdf files.
> While MARC may be "well-documented" it is not "well-understood" by anybody
> except catalogers. Nobody will dig the information out of that.

For the people that haven't found the MARC records on archive.org:
http://www.archive.org/details/marcrecords

I disagree that MARC is much of an impediment to data sharing.
Certainly it isn't conducive to it (outside of the library domain, of
course), but it is, in and of itself, no harder to work with than your
notion of text delimited files.  There are, after all, parsers in
pretty much any programming language you could possibly want and tools
(yaz-marcdump + xslt, for instance) for turning it into some other
serialization that may or may not be preferable.  I cannot see how CSV
would be an ideal way to share the data that we have.
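
To show how little the carrier itself asks of you, here is a rough
sketch in Python that reads one ISO 2709 record using nothing but the
standard library.  It assumes a well-formed MARC21 record in UTF-8 and
skips error handling; parse_record, subfields and build_record are
names made up for this illustration, and for real work you would use
one of the existing parsers.

```python
FIELD_END = b"\x1e"    # field terminator
RECORD_END = b"\x1d"   # record terminator
SUBFIELD = b"\x1f"     # subfield delimiter


def parse_record(raw: bytes) -> list[tuple[str, bytes]]:
    """Return (tag, field_bytes) pairs from one ISO 2709 record."""
    leader = raw[:24].decode("ascii")
    base = int(leader[12:17])                 # base address of the data
    # The directory runs from byte 24 up to its own field terminator.
    directory = raw[24:base - 1].decode("ascii")
    fields = []
    for i in range(0, len(directory), 12):    # 12 bytes per entry
        entry = directory[i:i + 12]
        tag = entry[0:3]
        length = int(entry[3:7])
        start = int(entry[7:12])
        # Drop the trailing field terminator from each field.
        data = raw[base + start:base + start + length - 1]
        fields.append((tag, data))
    return fields


def subfields(data: bytes) -> list[tuple[str, str]]:
    """Split a data field into (code, value) pairs, skipping the indicators."""
    parts = data.split(SUBFIELD)[1:]
    return [(p[:1].decode("ascii"), p[1:].decode("utf-8")) for p in parts]


def build_record(fields: list[tuple[str, bytes]]) -> bytes:
    """Assemble a toy record so the sketch can be run without a file."""
    directory = b""
    body = b""
    for tag, data in fields:
        chunk = data + FIELD_END
        directory += f"{tag}{len(chunk):04d}{len(body):05d}".encode("ascii")
        body += chunk
    base = 24 + len(directory) + 1
    total = base + len(body) + 1
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + FIELD_END + body + RECORD_END


record = build_record([
    ("001", b"12345"),
    ("245", b"10" + SUBFIELD + b"aMoby Dick /" + SUBFIELD + b"cHerman Melville."),
])
for tag, data in parse_record(record):
    print(tag, subfields(data) if SUBFIELD in data else data.decode("utf-8"))
```

Note what comes out of the 245: the title still carries its trailing
" /" and the statement of responsibility is a sentence with a full
stop.  Getting the bytes out was the easy part.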

The problem is not the data carrier, it's the data.  As Matthew Beacom
mentioned, the corpus of records we have is prose, not a data set, and
as such, it is extraordinarily difficult to glean the hard facts from.
Further complicating matters, only a very select set (librarians, and,
more realistically, catalogers) understands the nuances of the prose
(especially the punctuation).  This is what is frustrating about
turning the larger collections of records into linked data: we have
our best minds, /with access to the people who understand the embedded
semantics/, and we can't figure out how to model it efficiently.  God
help the poor soul with a CSV file and no library background.

That being said, Jim, I understand your restlessness.  Part of what
makes linked data so good, so powerful and so necessary is that we
don't have to solve all of our problems at once.  Get the low-hanging
fruit (titles, subjects, control numbers, standard numbers, etc.),
mint URIs for them and release the data.  Then, as we free more and
more data from our own cleverness, it can be asserted as it comes
along, because the identifiers (URIs) are already out there and we
know we're talking about the same thing.  So, yes, you're right.
Let's just release something.
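
As a rough sketch of what "release something" could look like, the
Python below mints a URI from a control number and writes a couple of
N-Triples statements, then adds a subject later against the same URI.
The example.org namespace and the function names are made up, Dublin
Core terms is just one vocabulary you might pick, and the literal
escaping only handles backslashes and double quotes.

```python
from urllib.parse import quote

BASE = "http://example.org/bib/"     # hypothetical namespace, not a real service
DCT = "http://purl.org/dc/terms/"    # Dublin Core terms, used only as an example


def mint_uri(control_number: str) -> str:
    """Derive a stable URI from a record's control number."""
    return BASE + quote(control_number.strip(), safe="")


def literal(value: str) -> str:
    """Quote a string as a simple N-Triples literal (no newline handling)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def triple(subject: str, prop: str, obj: str) -> str:
    """Format one statement using a Dublin Core terms property."""
    return f"<{subject}> <{DCT}{prop}> {obj} ."


record_uri = mint_uri("12345")

# First release: only the easy facts.
first_release = [
    triple(record_uri, "identifier", literal("12345")),
    triple(record_uri, "title", literal("Moby Dick")),
]

# Later release: a subject is asserted against the same URI, so nothing
# already published has to change.
later_release = [
    triple(record_uri, "subject", literal("Whaling")),
]

print("\n".join(first_release + later_release))
```

The point is only that the second batch does not require touching the
first; the identifier is what ties them together.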

The key is not to get swept under by criticisms from our own ranks
that it's not perfect (see: lcsh.info/id.loc.gov/authorities).

-Ross.