Re: Resignation

From: Rinne, Nathan (ESC) <RinneN_at_nyob> Date: Fri, 31 Aug 2007 11:14:03 -0500 To: NGC4LIB_at_listserv.nd.edu

Con,

Thanks for your reply.  I see that I overestimated the capabilities of
scanning technology (Jim's example helped here).

See everyone?  When it comes to the capabilities of technology I
overestimate! :)

No, this could be a huge study - *it could even be a contest of sorts
(to see how well competing software programs do the job)* - it might
very well be worth the investment (of taking the time to get all that
text from publishers / type it in yourself, correcting what can be
scanned - Mark, if Google, a corporate entity, can do this, [it looks
like they can] people doing research should be fine - with my knowledge
of copyright law, this should fall under "fair use", seeing as how you
certainly would not be harming anyone financially from such research) if
you can catch the software producer's interest in this way.  People will
want to show how good their product is.  I think it could capture
people's imagination plus the fact that it is so concrete (I can't
imagine how anyone could dispute the significance of the facts that
would come from such a study).

Seriously though, unless one of the other catalogers wishes to correct
me, I think just *this kind of experiment is what is desperately needed*
- I think it would be of great help to many highly capable, employable
cataloger-folk in dispelling their concerns *not so much for their jobs*
as much as the veracity and reliability of the claims of CS/AI folks,
which to them, seem more based on faith and marketing pitches than
verifiable reality.  Again, it seems to me that this would be a
terrific, empirical test case.  In fact, in my simple little mind, *only
after something solid this would it make sense* to attempt to
intelligently tackle far more subjective issues like the capabilities of
software to assign appropriate controlled subject headings on things
like Project Wittenberg texts, etc (with known, respected expert
catalogers/indexers and non-catalogers who specialize in various fields
thrown in the mix)

You said:

BTW I don't see any irony in your proposed experiment relying on OCLC's
authority work. Since the experiment was precisely to test the
performance of machines in identifying authors, and your test dataset
was precisely a set of authors defined by OCLC, I don't see how you can
avoid making use of that human authority work in the experiment. Or was
there some other irony I missed? :-)

No, that was precisely my point.  You would have to rely on proven
authority work - the work that is now increasingly seen as less
important by more and more people in the Googlized atmosphere we inhabit
- in order to do this very concrete experiment (where else will you get
the more-or-less complete biographies of the various authors -
Wikipedia?)  Maybe, "ironic" was the wrong choice of words.  I just feel
the need to point out that its not just "natural selection" that makes
this possible, you know.  :)

Regards,
Nathan Rinne
Media Cataloging Technician
ISD 279 - Educational Service Center (ESC)
11200 93rd Ave. North
Maple Grove, MN. 55369
Work phone: 763-391-7183

-----Original Message-----
From: Next generation catalogs for libraries
[mailto:NGC4LIB_at_listserv.nd.edu] On Behalf Of Conal Tuohy
Sent: Friday, August 31, 2007 10:31 AM
To: NGC4LIB_at_listserv.nd.edu
Subject: Re: [NGC4LIB] Resignation

Nathan, I think you are underestimating the difficulty of the experiment
you are proposing. The difficulty springs from the requirement that the
machine be able to read the works of the various David Johnsons.
However, if someone would scan and OCR these works (or acquire full text
from the publishers) then I think you are right that the rest would
indeed be super easy.

So better to forget the specific "David Johnson" example, and
demonstrate the ability of automated methods using some existing
full-text corpus. A number of researchers in the field of machine
learning have already done this and written up impressive results in
published papers. A few examples have already come up (such as Web of
Knowledge). While I'm at it, another one I remembered reading is "The
author-topic model for authors and documents" from
http://portal.acm.org/citation.cfm?id=1036902&jmp=cit

I think, rather than that the technology is not yet strictly feasible,
that the more important reasons why this technology is not already in
more common use in libraries are:

1) a lack of full text (though full text IS available in some areas, it
is often tied up in subscription databases such as those owned by Web of
Knowledge)
2) a lack of library funding, CS expertise, interest, and even
willingness to believe in the possibility (these things all go together)

By contrast some of these techniques are being actively developed by
internet search providers (who have the advertising dollar to pay for
it), and by IT vendors (who have an obvious interest), as well as by
researchers in other spaces such as, interestingly, genomics, which also
has to deal with large bodies of data which have been produced (by
natural selection) without adequate metadata :-)

BTW I don't see any irony in your proposed experiment relying on OCLC's
authority work. Since the experiment was precisely to test the
performance of machines in identifying authors, and your test dataset
was precisely a set of authors defined by OCLC, I don't see how you can
avoid making use of that human authority work in the experiment. Or was
there some other irony I missed? :-)

Cheers

Con

-----Original Message-----
From: Next generation catalogs for libraries on behalf of Rinne, Nathan
(ESC)
Sent: Sat 01/09/07 2:18
To: NGC4LIB_at_listserv.nd.edu
Subject: Re: [NGC4LIB] Resignation

Obviously, Jim is not one of the faithful.

Let me repeat this:

In order to help along the "doubting Thomases" among the catalogers, let
me make a plea.  I think it should be super, super easy to do.  Why
doesn't someone start with Conal Tuohy's claim about our current
capabilities (using Bayesian statistics) re: all of the David Johnsons
James Weinheimer informed us of?  I know something about science and
research, so this ought to be easy enough to empirically test.  First,
get all the works (only text, I assume?) of all the David Johnsons.  Of
course, *ironically* [note: this is an addition to this quote] *in order
to even get started here* I don't see how you would be able to avoid
needing to use something like OCLC's Worldcat (made possible with its
wonderful authority control, thank you!) in order to find most, if not
all, of these works.  Then all you would need to scan them and do the
test, and find out if it worked or not.  I think this would be very,
very helpful - and I want help.  Does anyone have the means of doing
this? (end)

Please note, this is not a demand, this is request.  I think this would
be very, very helpful.  And I think this would be very, very easy to do
as well (maybe not over a lunch break, but you know what I mean).

Regards,
Nathan Rinne
Media Cataloging Technician
ISD 279 - Educational Service Center (ESC)
11200 93rd Ave. North
Maple Grove, MN. 55369
Work phone: 763-391-7183

-----Original Message-----
From: Next generation catalogs for libraries
[mailto:NGC4LIB_at_listserv.nd.edu] On Behalf Of James Weinheimer
Sent: Friday, August 31, 2007 9:12 AM
To: NGC4LIB_at_listserv.nd.edu
Subject: Re: [NGC4LIB] Resignation

> -----Original Message-----
> From: Next generation catalogs for libraries
> [mailto:NGC4LIB_at_listserv.nd.edu] On Behalf Of Conal Tuohy
> Sent: Friday, August 31, 2007 3:02 AM
> To: NGC4LIB_at_listserv.nd.edu
> Subject: Re: [NGC4LIB] Resignation
> I'm assuming you're asking how a machine can decide that a given work
> was authored by one (or none) of the above?
>
> If the full text of the books is available, this is actually quite a
> feasible task which can be done by unsupervised machine-learning
> algorithms. Every author has an authorial "fingerprint" which can be
> recognised by attentive readers, and Bayesian statistical techniques
are
> even better at picking up such things. The key data for these
algorithms
> are the frequency of use and co-occurrences of particular words,
> sentence-lengths, etc. It in no way requires AI capable of
> "understanding" the subject of the text, in the sense that a human
> reader can. The statistical patterns which these algorithms recognise
> are ones which are generally below the conscious perception of human
> readers (who instead tend to focus on what a text actually means).
>
> This is an area where we should expect computers to out-perform
humans,
> frankly.

Then show us. I have read so many things of "should" in my life and
maybe
some things seem to make sense, but I haven't seen them work in
practice,
yet. (Alchemy made a lot of sense, too!) People talk about the great
"automatic translation" but what I've seen is still a disaster. This was
some time back, but a former professor I had had worked his entire life
on
automatic translation, only to declare it impossible at the end. The
best
you could do was to create a text for a human to edit. Like he said, if
you
need the human to edit it anyway, why go through it in the first place?
He
was speaking quite some time back. Automatic translation has come some
way,
but it's not there yet. And neither is automatic subject analysis.

We can experiment to our heart's desire, but we cannot draw conclusions
based on ifs maybes and shoulds. We have seen that "Should and maybe"
may
come around in 50 or more years--if ever, or it may be next week.

Regards,
Jim

James Weinheimer  j.weinheimer_at_aur.edu
Director of Library and Information Services
The American University of Rome
via Pietro Roselli, 4
00153 Rome, Italy
voice- 011 39 06 58330919 ext. 327
fax-011 39 06 58330992