Thanks for the specific examples (and to Julia for the Terragram
pointer).
Just what is the relevant literature for this space? I guarantee you
plenty of librarians who *do* care about the relevant developments in CS
have a hard time stumbling upon them. Of course, that's not just a
one-way problem -- I remember one recent program where two vendors new
to the library market talked about how big an advancement their stuff
was over the card catalog.
One concern I have is that both the World Bank docs and the news stories seem to
be limited to well-defined scopes: in one case, tightly focused technical
reports (likely to telegraph their main points ad nauseam), and in the
other, punchy 10-15 paragraph news stories.
I wonder how well such approaches would work in an environment where the
length of the texts is variable and the texts themselves often meander
from point to point. Is there another test corpus that models library
requirements better? Anybody banging at the Project Gutenberg docs yet?
Ed Sperr
sperr_at_nelinet.net
-----Original Message-----
From: Next generation catalogs for libraries
[mailto:NGC4LIB_at_listserv.nd.edu] On Behalf Of Will Kurt
Sent: Thursday, August 30, 2007 4:40 PM
To: NGC4LIB_at_listserv.nd.edu
Subject: Re: [NGC4LIB] Resignation
This response just sort of proves a point that has been made here:
librarians simply don't know what the current state of the art is.
The standard categorization test is not simply "a lab, with pre-selected
documents from a single topic domain, or test runs against 5
paragraphs."
For quite a while now, many standard categorization tests have been run
against the Reuters corpus, which has an incredibly large variety of
information (hundreds of thousands of news stories):
http://trec.nist.gov/data/reuters/reuters.html
Support Vector Machines are incredibly successful at auto-classifying
information via supervised learning, and I have yet to see any mention
of them in the library community.
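For anyone curious what this looks like in practice, here is a minimal sketch of supervised SVM text classification. The scikit-learn library and the toy labeled data are my own assumptions for illustration; the post names no particular toolkit, and a real experiment would train on something like the full Reuters corpus rather than four sentences.

```python
# Minimal sketch of supervised text classification with a linear SVM.
# scikit-learn and the toy data are assumptions, not from the thread.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for a labeled corpus such as Reuters (hypothetical data).
train_texts = [
    "central bank raises interest rates to curb inflation",
    "quarterly earnings beat analyst estimates on strong sales",
    "wheat and corn futures fall on record harvest forecasts",
    "crude oil prices climb as supply tightens",
]
train_labels = ["economy", "earnings", "grain", "oil"]

# TF-IDF turns each document into a sparse term-weight vector;
# LinearSVC learns one-vs-rest hyperplanes separating the classes.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

# Classify an unseen document by its learned term weights.
print(model.predict(["corn harvest expected to set a record"]))
```

The pipeline is the whole trick: once documents are vectors, assigning a category is just finding which side of a learned hyperplane the vector falls on, which is why the approach scales to hundreds of thousands of news stories.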
I know you wanted real world examples but go to the ACM or the IEEE and
search SVM, 'Reuters corpus', or classifiers. You'll see that the
current state of the art is very, very close to real world applications.
And those people in Virginia you mentioned are interested in these
technologies; I know because I do the library research for the people
who are building them:
http://www.msnbc.msn.com/id/15547807/
I've watched Arabic newscasts be transcribed into English in real time
with an 80% accuracy rate. Believe me, the technology is there. While
the article above does not specifically address text classification,
speech translation involves all of those technologies and more.
The point that I think Alexander is bringing up is that libraries should
be following and experimenting with these technologies rather than being
almost completely oblivious to them. I've noticed a trend of LIS people
ignoring or disbelieving CS people, which is a shame, because CS has
done far more research in many of the areas LIS is interested in than
LIS itself has. I've made the analogy before, but for LIS people to
ignore the solutions that CS has to offer would be like CS people
ignoring the hardware solutions of EE people.
I personally don't feel that I know anywhere near enough regarding
what's currently being researched, but I have seen enough to have faith
that when Alexander (and other people with experience in these
areas) say that there's something important libraries should see, we
should take a good look.
--Will
At 10:46 AM 8/30/2007, you wrote:
>Alexander Johannesen wrote...
>
> > I know several people here have said they don't think AI or smart
> > software is up to the task, and that they represent no real risk to
> > serious cataloging. Well, I can only say, please trust me in this! I
> > worked in professional AI for over 7 years; this stuff not only can
> > be done, but has been done for some time. Most AI problems have
> > historically been with access to various corpora to make them work,
> > but those things are now changing dramatically. Now that all books
> > are written on computers and available for analysis, even more so.
> >
> > The reason this isn't in widespread use quite yet is because there
> > hasn't been much money in it, so it's been mostly an academic venture
> > with lots of interesting but underfunded projects that sit in a
> > portfolio but never get off campus.
>
>I'm sorry, but I just don't believe you.
>
>Show us. Point us to these applications that can *currently* slurp in
>250 pages of full text and return 5 to 7 reasonably good, controlled
>vocabulary subject headings (or topics or topic maps or well-formed RDF
>triples or what have you). Point to one *real world example* of this
>happening -- not a lab, with pre-selected documents from a single topic
>domain, or test runs against 5 paragraphs. This is *not* a trivial
>task. To say that it is misapprehends the entire scope of what we're
>talking about.
>
>I also would respectfully disagree as to the potential market for such
>an automatic-subject-assigning beast. There are a *lot* of corporate
>records managers (and large records management companies) that would
>love to offer their users a systematic list of what's in the archives.
>For that matter, I don't doubt that there are a few folks working from
>undisclosed locations in Virginia that would like a systematic list of
>the subjects of web pages, emails, voice calls, etc.
>
>To be sure, at *some* point AI will catch up. But it is hard to
>imagine that the capabilities of such AI will be that far south of
>reading and comprehending text itself. When we get to that point, I'll
>cheerfully fire up my work avatar and let it write these posts for me...
>
>
>Ed Sperr
>Digital Services Consultant
>NELINET, Inc.
>153 Cordaville Rd. Suite 200 Southborough, MA
>(508) 597-1931 | (800) 635-4638 x1931
Received on Thu Aug 30 2007 - 17:16:57 EDT