Re: Resignation

From: Sperr, Edwin <sperr_at_nyob>
Date: Thu, 6 Sep 2007 11:10:44 -0500
To: NGC4LIB_at_listserv.nd.edu

Getting back to a discussion a little ways up-stream, Alexander and I
recently had the exchange...

(ed)
>> You have repeatedly assured us that AI can *already* do everything
>> that catalogers can.

(alexander)
> Rubbish. You *argue* as if I have, though, but I suspect the heat of
> the moment and the poignancy of the topic blows small claims up like
> hot air balloons. I've said AI can do much, but certainly not
> everything a cataloger can do.

I'm sorry if you feel I mischaracterized your statements, but I don't
see how there's much ambiguity in the following:

(alexander, earlier)
** I've argued before that seriously smart systems can easily subtract
** most normal metadata from free-text versions of any book or paper,
** including TOCs, subjects, contextual domains and quotations...I know
** several people here have said they don't think AI or smart
** software is up to the task, and that they represent no real risk to
** serious cataloging. Well, I can only say, please trust me in this! I
** worked in professional AI for over 7 years ; this stuff not only can
** be done, but has been done for some time.

No, it hasn't "been done for some time" -- not yet, and it probably
won't be that soon, either.  I don't mean to pick on you in particular;
it's just that there seems to be a notion in some quarters (including
places where folks are making funding decisions) that human-derived
metadata is *already* obsolete.  Look, I really don't care about the
"poignancy of the topic" -- I don't want us to discard traditional
cataloging too quickly because I want things to continue to work at
*least* as well as they do now.

There's certainly disagreement on the pace of advance in AI, what
constitutes "real" AI and the relative merits of Librarians and Computer
Scientists.  If I can suggest one thing, it would be that instead of
worrying about all this or about what might happen in the future, we
concentrate on the here and now.  I think we all can agree that
statistical analysis and other computational techniques can be really
useful adjuncts to traditional cataloging, and where the full-text
exists to be parsed, they should be experimented with.  Indeed, that's
already happening at the World Bank and NLM.  Any stories from other
places?

> There's some problems here. What *is* interoperability?
> Between who? For what purpose? If the purpose is for libraries
> to share metadata, why squeeze good metadata into a format that
> force us to reduce its quality? (MARC can't hold structural models,
> for example) If we create smarter systems that can create clusters,
> models, trained objects and so on, should we discard the possibilities
> those bring just to reduce it down to 5-7 subject headings?

Sorry for not being more clear. To be sure, let's enhance the heck out of
*new* records: structure, relationships, vector values, the whole nine
yards.  The interoperability I'm talking about is not at the transport
layer -- indeed, I think the most feasible model is a separate discovery
system (like VUFind) where both old MARC records and shiny new NGC-grade
records all get folded together and indexed.  In this case, it doesn't
really matter how the data is represented so long as it ingests
correctly.

However, keep in mind that there are tens of millions of existing
records that might never get juiced with whatever new technologies and
techniques we come up with (fat chance we're getting the full-text to
play with for any works that are still in copyright).  If you are going
to build a catalog that encompasses both old and new, then *all* records
need to have some points of commonality if you want to search the pile
successfully.  Think of it like a pidgin language used for trade on a
far frontier.
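
To make the "points of commonality" idea a bit more concrete, here is a
rough sketch in Python. Everything in it is made up for illustration:
the dict-based record shapes, the field keys ("245a", "650a", "vector")
and the function names are assumptions, and it is not how VUFind or any
real MARC parser actually works. It only shows old and new records
being mapped onto a small shared set of fields before indexing, with
the richer data carried along rather than thrown away.

```python
from collections import defaultdict


def from_marc(rec):
    """Map a simplified, dict-based MARC-like record onto the shared fields.

    The keys "001", "245a" and "650a" are a hypothetical flattening of
    MARC tags, used here only for illustration.
    """
    return {
        "id": rec["001"],
        "title": rec["245a"],
        "subjects": list(rec.get("650a", [])),
        "extras": {},
    }


def from_enriched(rec):
    """Map a hypothetical enriched record onto the same shared fields.

    Anything beyond id/title/subjects is kept under "extras" so the
    richer data is not lost, it is just not required for searching.
    """
    shared = ("id", "title", "subjects")
    return {
        "id": rec["id"],
        "title": rec["title"],
        "subjects": list(rec["subjects"]),
        "extras": {k: v for k, v in rec.items() if k not in shared},
    }


def build_index(docs):
    """Build a toy inverted index over title words and subject headings."""
    index = defaultdict(set)
    for doc in docs:
        terms = doc["title"].lower().split()
        terms += [s.lower() for s in doc["subjects"]]
        for term in terms:
            index[term].add(doc["id"])
    return index


def search(index, term):
    """Return the sorted ids of all records matching a single term."""
    return sorted(index.get(term.lower(), set()))
```

With one old-style record and one enriched record run through the two
mapping functions, a search on a shared subject heading would find
both, which is all the "pidgin" needs to accomplish.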

> I also asked because why subject headings? Why are subject headings
> the goal?  Surely there's better things to model if you've got the
> tools to do it.
> Why aim low? Is it the law of conservatism? That pesky reality?

Because controlled-vocabulary ontologies are still the best system
available for modeling the "aboutness" of an item.  Note that
"aboutness" is different from just saying that document X has a
similarity score of 678 to document Y.  Lots more can be said (and to
some degree has been said on-list already) about this topic, but I'll
post a link to something more finished later on.

Ed Sperr
sperr_at_nelinet.net