Re: Resignation

From: Jonathan Rochkind <rochkind_at_nyob>
Date: Fri, 31 Aug 2007 14:11:23 -0400
To: NGC4LIB_at_listserv.nd.edu

I kind of resent the implication, made repeatedly by several people,
that those who think computational aid may be of great help to authority
control and automatic classification don't in fact understand the point
of authority control, or that it is a non-trivial thing to do.

I think authority control is important, and I don't need to discuss the
basics of what authority control is and why it's important. I don't need
anyone to lecture me on polysemy and synonymy and why authority control
is needed to provide good data to support services users want. I also
realize it is not at all a trivial thing, and there's a reason it
occupies so much of our resources. If it were a trivial thing to
automate, then I or someone else could whip it up in an afternoon for
you, and we'd have it already. Of course it's not trivial. It would take
very sophisticated engineering.

The thing is, lots of industries and sectors (not us) have been putting
lots of resources into research and development to create these
sophisticated techniques. The state of the art, as various people expert
in the field have told us, is indeed very sophisticated.  This is why
the time is right to start taking a serious look and spending serious
resources on investigating computational aid to the various aspects of
cataloging (from 'descriptive' cataloging, to classification and subject
assignment, to authority work).  [And I used the phrase 'computational
aid' very intentionally: there are certainly ways for computational aid
to take over significant currently-human effort from the cataloger
without being a perfect and complete replacement. Aid, not
substitution.]
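
To make "aid, not substitution" concrete, here is a minimal sketch in
Python of the kind of tool I mean: one that ranks existing authority
headings as candidates for a human cataloger to confirm or reject. The
normalization, the use of difflib, and the sample headings are all my
own illustrative choices, not a description of any existing system.

# Sketch of "aid, not substitution": rank authority headings as
# candidates for a cataloger to confirm or reject. Illustrative only.
from difflib import SequenceMatcher

def normalize(name):
    # Lowercase, strip punctuation, and sort tokens so that
    # "Weinheimer, James" and "James Weinheimer" compare as equal.
    cleaned = ''.join(c if c.isalnum() or c.isspace() else ' '
                      for c in name.lower())
    return ' '.join(sorted(cleaned.split()))

def suggest_headings(name, authority_headings, limit=5):
    """Return the closest authority headings, best match first."""
    target = normalize(name)
    scored = sorted(((SequenceMatcher(None, target, normalize(h)).ratio(), h)
                     for h in authority_headings), reverse=True)
    return [(heading, round(score, 2)) for score, heading in scored[:limit]]

# The machine proposes; the cataloger disposes.
print(suggest_headings("Weinheimer Jim",
                       ["Weinheimer, James", "Johnson, David, 1927-",
                        "Johnson, David, 1946-"]))

Even something this crude would sort out the "Weinheimer Jim" versus
"James Weinheimer" confusion that comes up below; a sophisticated
version would presumably add dates, titles, and co-authorship evidence
to the ranking.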

Jonathan

Nancy Cochran wrote:
> Please, what is the difference between Weinheimer Jim and James
> Weinheimer?  I see that both are responding to "Resignation" and other
> threads.
>
>> [Original Message]
>> From: Conal Tuohy <Conal.Tuohy_at_VUW.AC.NZ>
>> To: <NGC4LIB_at_listserv.nd.edu>
>> Date: 9/1/2007 10:31:44 AM
>> Subject: Re: [NGC4LIB] Resignation
>>
>> Nathan, I think you are underestimating the difficulty of the experiment
>> you are proposing. The difficulty springs from the requirement that the
>> machine be able to read the works of the various David Johnsons. However,
>> if someone would scan and OCR these works (or acquire full text from the
>> publishers) then I think you are right that the rest would indeed be super
>> easy.
>>
>> So better to forget the specific "David Johnson" example, and demonstrate
>> the ability of automated methods using some existing full-text corpus. A
>> number of researchers in the field of machine learning have already done
>> this and written up impressive results in published papers. A few examples
>> have already come up (such as Web of Knowledge). While I'm at it, another
>> one I remembered reading is "The author-topic model for authors and
>> documents" from http://portal.acm.org/citation.cfm?id=1036902&jmp=cit
>>
>> I think, rather than that the technology is not yet strictly feasible,
>> that the more important reasons why this technology is not already in more
>> common use in libraries are:
>>
>> 1) a lack of full text (though full text IS available in some areas, it
>> is often tied up in subscription databases such as those owned by Web of
>> Knowledge)
>>
>> 2) a lack of library funding, CS expertise, interest, and even
>> willingness to believe in the possibility (these things all go together)
>>
>> By contrast some of these techniques are being actively developed by
>> internet search providers (who have the advertising dollar to pay for it),
>> and by IT vendors (who have an obvious interest), as well as by researchers
>> in other spaces such as, interestingly, genomics, which also has to deal
>> with large bodies of data which have been produced (by natural selection)
>> without adequate metadata :-)
>>
>> BTW I don't see any irony in your proposed experiment relying on OCLC's
>> authority work. Since the experiment was precisely to test the performance
>> of machines in identifying authors, and your test dataset was precisely a
>> set of authors defined by OCLC, I don't see how you can avoid making use of
>> that human authority work in the experiment. Or was there some other irony
>> I missed? :-)
>>
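
Conal is right that the experiment itself is easy to state once the
hard part (full text plus OCLC-derived author labels) is in hand. A
rough sketch of its shape in Python; "classify" is a placeholder for
whatever attribution method is under test, and all the names here are
my own assumptions, not an existing test harness.

# Shape of the proposed experiment: hold out some labeled texts,
# attribute them by machine, and score against OCLC's authority work.
import random

def evaluate(texts, labels, classify, holdout=0.2, seed=0):
    """texts: full-text strings; labels: OCLC-derived author identities.
    classify(train_pairs) must return a function mapping text -> label."""
    pairs = list(zip(texts, labels))
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * (1 - holdout))
    train, test = pairs[:cut], pairs[cut:]
    model = classify(train)
    correct = sum(model(text) == label for text, label in test)
    return correct / len(test)  # attribution accuracy on unseen works

Note the role the human work plays here: it is not an embarrassment to
the experiment, it is the ground truth the machine is scored against.
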
>> Cheers
>>
>> Con
>>
>> -----Original Message-----
>> From: Next generation catalogs for libraries on behalf of Rinne, Nathan
>> (ESC)
>> Sent: Sat 01/09/07 2:18
>> To: NGC4LIB_at_listserv.nd.edu
>> Subject: Re: [NGC4LIB] Resignation
>>
>> Obviously, Jim is not one of the faithful.
>>
>> Let me repeat this:
>>
>> In order to help along the "doubting Thomases" among the catalogers, let
>> me make a plea.  I think it should be super, super easy to do.  Why
>> doesn't someone start with Conal Tuohy's claim about our current
>> capabilities (using Bayesian statistics) re: all of the David Johnsons
>> James Weinheimer informed us of?  I know something about science and
>> research, so this ought to be easy enough to empirically test.  First,
>> get all the works (only text, I assume?) of all the David Johnsons.  Of
>> course, *ironically* [note: this is an addition to this quote] *in order
>> to even get started here* I don't see how you would be able to avoid
>> needing to use something like OCLC's Worldcat (made possible with its
>> wonderful authority control, thank you!) in order to find most, if not
>> all, of these works.  Then all you would need to do is scan them, run
>> the test, and find out whether it worked.  I think this would be very,
>> very helpful - and I want help.  Does anyone have the means of doing
>> this? (end)
>>
>> Please note, this is not a demand, this is a request.  I think this would
>> be very, very helpful.  And I think this would be very, very easy to do
>> as well (maybe not over a lunch break, but you know what I mean).
>>
>> Regards,
>> Nathan Rinne
>> Media Cataloging Technician
>> ISD 279 - Educational Service Center (ESC)
>> 11200 93rd Ave. North
>> Maple Grove, MN. 55369
>> Work phone: 763-391-7183
>>
>>
>> -----Original Message-----
>> From: Next generation catalogs for libraries
>> [mailto:NGC4LIB_at_listserv.nd.edu] On Behalf Of James Weinheimer
>> Sent: Friday, August 31, 2007 9:12 AM
>> To: NGC4LIB_at_listserv.nd.edu
>> Subject: Re: [NGC4LIB] Resignation
>>
>>
>>> -----Original Message-----
>>> From: Next generation catalogs for libraries
>>> [mailto:NGC4LIB_at_listserv.nd.edu] On Behalf Of Conal Tuohy
>>> Sent: Friday, August 31, 2007 3:02 AM
>>> To: NGC4LIB_at_listserv.nd.edu
>>> Subject: Re: [NGC4LIB] Resignation
>>>
>>> I'm assuming you're asking how a machine can decide that a given work
>>> was authored by one (or none) of the above?
>>>
>>> If the full text of the books is available, this is actually quite a
>>> feasible task which can be done by unsupervised machine-learning
>>> algorithms. Every author has an authorial "fingerprint" which can be
>>> recognised by attentive readers, and Bayesian statistical techniques
>>> are even better at picking up such things. The key data for these
>>> algorithms are the frequency of use and co-occurrences of particular words,
>>> sentence-lengths, etc. It in no way requires AI capable of
>>> "understanding" the subject of the text, in the sense that a human
>>> reader can. The statistical patterns which these algorithms recognise
>>> are ones which are generally below the conscious perception of human
>>> readers (who instead tend to focus on what a text actually means).
>>>
>>> This is an area where we should expect computers to out-perform
>>> humans, frankly.
>>>
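
Conal's description above is concrete enough to sketch. Here is a
minimal "authorial fingerprint" classifier in Python: naive Bayes over
function-word frequencies. The word list and the add-one smoothing are
my own simplifications; the published systems add co-occurrences,
sentence lengths, and serious model selection.

# Minimal stylometric attribution in the spirit Conal describes:
# naive Bayes over function-word frequencies. Illustrative only.
import math
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is",
                  "was", "it", "for", "on", "with", "but", "which"]

def features(text):
    words = text.lower().split()
    return Counter(w for w in words if w in FUNCTION_WORDS)

def train(samples):
    """samples: list of (text, author) pairs with known attribution."""
    totals = {}
    for text, author in samples:
        totals.setdefault(author, Counter()).update(features(text))
    models = {}
    for author, counts in totals.items():
        denom = sum(counts.values()) + len(FUNCTION_WORDS)  # add-one smoothing
        models[author] = {w: math.log((counts[w] + 1) / denom)
                          for w in FUNCTION_WORDS}
    return models

def attribute(text, models):
    """Return the known author whose fingerprint best explains the text."""
    counts = features(text)
    return max(models, key=lambda a: sum(n * models[a][w]
                                         for w, n in counts.items()))

Nothing in this requires the machine to "understand" the text, which is
exactly Conal's point: it only counts the little words that human
readers pass over.
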
>> Then show us. I have read so many "shoulds" in my life, and maybe some
>> of them seem to make sense, but I haven't seen them work in practice
>> yet. (Alchemy made a lot of sense, too!) People talk about the great
>> "automatic translation," but what I've seen is still a disaster. Some
>> time back, a former professor of mine, who had worked his entire life
>> on automatic translation, declared it impossible in the end: the best
>> you could do was to create a text for a human to edit. As he said, if
>> you need the human to edit it anyway, why go through it in the first
>> place? Automatic translation has come some way since then, but it's
>> not there yet. And neither is automatic subject analysis.
>>
>> We can experiment to our heart's desire, but we cannot draw conclusions
>> based on ifs, maybes, and shoulds. We have seen that "should and maybe"
>> may come around in fifty or more years--if ever--or it may be next week.
>>
>> Regards,
>> Jim
>>
>> James Weinheimer  j.weinheimer_at_aur.edu
>> Director of Library and Information Services
>> The American University of Rome
>> via Pietro Roselli, 4
>> 00153 Rome, Italy
>> voice- 011 39 06 58330919 ext. 327
>> fax-011 39 06 58330992
>>
>
>

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu