Re: data vs "data structure"

From: Stephens, Owen <o.stephens_at_nyob>
Date: Fri, 21 Sep 2007 10:51:28 +0100
To: NGC4LIB_at_listserv.nd.edu

> At the end of all of the posts she writes the following:
>
> "If you're reading this far, note that in the editing of this piece,
> 'data structure' changed to 'data.' We have great data. We just don't
> have good data structure. Sigh." (end)
>
> So, evidently, here we have a case where the editors of the piece
> thought changing "data structure" to "data" would facilitate
> understanding - in any case, should I assume that they thought
> exchanging one word for the other didn't really make much of a
> difference?
>
> But it does make a difference, right? I'll admit I could use some more
> guidance on this if anyone has some super helpful articles.

[This started as a short reply, but seems to have grown a bit!]

Yes, it does make a difference! I don't have any helpful articles, but I
hope the following is useful (and not oversimplified):

To take a really simple example, the President of the USA is:

George Walker Bush
Bush, George Walker

These two lines represent the same data, and it's good data (i.e.
accurate).

However, the data structure is different. In the first data structure it
isn't clear which part of the name is which. The advantage of the second
data structure I've used here is that it separates out the 'family name'
from the other parts of the name. It adds explicit meaning to the data -
if I had written 'Walker Bush, George' you would have understood
something different.

So:

'Walker Bush, George' is an example of bad data (because we understand
the rules of the data structure, it is clear this data is wrong)

'George Walker Bush' is an example of bad data structure (because there
are no rules to interpret the data, we can't tell what this data means)

'Bush, George Walker' is an example of good data in a good data
structure
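
To make the distinction concrete, here is a minimal sketch in Python of
what "understanding the rules of the data structure" looks like to a
program. The function name and the returned field names are my own
inventions for illustration, and the only rule assumed is "family name,
then a comma, then given name and any middle names":

```python
def parse_inverted_name(text):
    """Split 'Family, Given Middle...' into labelled parts.

    Returns None when the text has no comma, because without the comma
    there is no rule telling us which word is the family name.
    """
    family, comma, rest = text.partition(",")
    parts = rest.split()
    if not comma or not family.strip() or not parts:
        return None
    given, *middle = parts
    return {
        "family": family.strip(),
        "given": given,
        "middle": " ".join(middle),
    }


print(parse_inverted_name("Bush, George Walker"))
print(parse_inverted_name("Walker Bush, George"))
print(parse_inverted_name("George Walker Bush"))
```

The first call yields the right parts. The second parses just as
happily but gives a family name of 'Walker Bush': the structure is fine,
the data is wrong. The third returns None: the data is fine, but there
is no structure to interpret it with.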

Now, before the brickbats fly, I should say that 'Bush, George Walker'
isn't a brilliant data structure, it's just slightly more explicit than
'George Walker Bush'. Once you get into the complexity of names and
wanting them to be expressed in explicitly meaningful ways, you start to
realise how limited 'Bush, George Walker' is as a data structure -
perhaps especially when you want something that can be automatically
processed by a computer.

Something like the vCard-XML format is quite a bit more explicit, and
lets us expand on the name by specifying a 'formatted (or display) name'
(<FN>) and a nickname (<NICKNAME>):
...
<FN>George W. Bush</FN>
<N>
<FAMILY>Bush</FAMILY>
<GIVEN>George</GIVEN>
<MIDDLE>Walker</MIDDLE>
</N>
<NICKNAME>Dubya</NICKNAME>
...
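
As a rough illustration of why this is easier for a computer to process,
the following Python sketch pulls the individual name parts out of the
fragment above using the standard library's ElementTree. The <vCard>
wrapper element and the name_parts function are assumptions of mine,
added only so that the fragment parses as a complete document; they are
not taken from the vCard-XML format itself.

```python
import xml.etree.ElementTree as ET

# The fragment from this message, wrapped in a hypothetical <vCard> root
# element so that it forms a complete, parseable XML document.
VCARD_XML = """
<vCard>
  <FN>George W. Bush</FN>
  <N>
    <FAMILY>Bush</FAMILY>
    <GIVEN>George</GIVEN>
    <MIDDLE>Walker</MIDDLE>
  </N>
  <NICKNAME>Dubya</NICKNAME>
</vCard>
"""


def name_parts(xml_text):
    """Return each labelled part of the name as a dictionary entry."""
    root = ET.fromstring(xml_text.strip())
    return {
        "display": root.findtext("FN"),
        "family": root.findtext("N/FAMILY"),
        "given": root.findtext("N/GIVEN"),
        "middle": root.findtext("N/MIDDLE"),
        "nickname": root.findtext("NICKNAME"),
    }


parts = name_parts(VCARD_XML)
# Rebuild the 'Family, Given Middle' form from the labelled parts.
print(f"{parts['family']}, {parts['given']} {parts['middle']}")
```

Because every part is labelled, the program can produce 'Bush, George
Walker', 'George W. Bush' or 'Dubya' on demand without guessing at word
order.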

Just to add a spin on this discussion, what is interesting is that
libraries have put quite a lot of effort into both data and data
structure, but we aren't reaping the benefit of this in the modern web
world. (Although I agree that MARC is not ideal, it is also quite a lot
better than nothing - so why aren't we making more of this?)

Happily, this is starting to change. Take a look at the Vanderbilt
'Acorn' catalog (http://acorn.library.vanderbilt.edu) vs the Vanderbilt
'Alphasearch' NGC (Primo from Ex Libris). If you search for 'A semantic
web primer' on Acorn, it finds a record for an online resource. If you
look at the HTML source, it is presented as:

...
<td class="itemlisting" colspan="3">
<strong>A semantic Web primer [electronic resource]</strong>
<br/>
  Antoniou, G. (Grigoris)
</td>
...

If you do the same search on Alphasearch you get:
...
<td>
<div class="title">
<a
href="display.do?ct=display&doc=1599225&indx=19&frbg=&dum=true&vl(1UI0)=contains&vid=VANDERBILT&srt=rank&indx=11&vl(2044637UI0)=any&vl(29953649UI1)=all_items&tab=default_tab&ct=&scp.scps=scope:(vanunicorn)&vl(freeText0)=semantic&fn=search&mode=Basic">
<span class="fat">A <span class="searchword">semantic</span> Web
primer</span>
(View details)
</a>
</div>
<div class="author"> Antoniou, G. (Grigoris) Van Harmelen, Frank.
NetLibrary, Inc. </div>
...

This is exactly the same record, from the same source, but the former
(Acorn) has lost all data structure, so it is impossible for any
automated system to tell which bit is the title and which bit is the
author. The latter (Alphasearch) has some explicit semantic structure,
so it would be easy to write a program that showed just the title, or
whatever. There are some problems (ironically) with the data in the
second example, but that's not the point.
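
For what it's worth, here is a minimal sketch of the kind of program I
mean, using only Python's standard html.parser module. The class and
function names are hypothetical, and the two HTML samples are trimmed
copies of the snippets above (the long Alphasearch link target is
shortened). It collects the text inside the first element carrying a
given class attribute.

```python
from html.parser import HTMLParser

ACORN = """
<td class="itemlisting" colspan="3">
  <strong>A semantic Web primer [electronic resource]</strong>
  <br/>
  Antoniou, G. (Grigoris)
</td>
"""

# Link target shortened for this sketch.
ALPHASEARCH = """
<div class="title">
  <a href="display.do?doc=1599225">
    <span class="fat">A <span class="searchword">semantic</span> Web primer</span>
    (View details)
  </a>
</div>
<div class="author"> Antoniou, G. (Grigoris) Van Harmelen, Frank. NetLibrary, Inc. </div>
"""


class ClassTextExtractor(HTMLParser):
    """Collect the text inside the first element with a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # greater than 0 while inside the wanted element
        self.done = False   # True once the wanted element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.depth += 1
        elif self.wanted_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_class(html, wanted_class):
    """Return the whitespace-normalised text of the matching element."""
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


print(text_by_class(ALPHASEARCH, "title"))
print(text_by_class(ALPHASEARCH, "author"))
print(text_by_class(ACORN, "itemlisting"))
```

On the Alphasearch markup, asking for "title" or "author" gives each
field separately (including the stray '(View details)' and the extra
names in the author field, which is the data problem mentioned above).
On the Acorn markup the only hook is "itemlisting", which returns the
title and author run together, and asking for "title" returns nothing.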

The point is that someone went to a lot of trouble to catalog the record
in MARC format, with structured data, and the Acorn display mechanism
throws it all away. This means that if I want to go mashing up (or
whatever the kids call it these days) the Acorn catalog with something
else, I can't. Despite Acorn being based on structured data, it won't
play nice.

I'm not going to hold up Primo as a great example of the Semantic Web,
as it probably doesn't go far enough - but it is a move in the right
direction. (I've skipped the fact that Primo also offers the results as
RSS, and this is a better implementation of structured data than the
HTML, IMO.)

In the UK at the moment the mantra for green living is 'Reduce, Reuse,
Recycle' - unfortunately people often forget the first one, and
concentrate too much on the last one!

To paraphrase, the library world needs to

Expose - its structured data
Enhance - its data structures
Exploit - its data, data structures and expertise

Perhaps we need to not forget the first one?

Owen