I think what Nathan is proposing is perfectly logical and scientifically necessary. I proposed something similar at an institution once, and it was completely ignored. If we want to call any of this stuff "scientific," we must follow the scientific method somewhere along the way.
Creating a control group is a reasonable request and, I would say, should be a necessary first step before going on to anything else. If people are worried about copyright, let's just use Google Books and Live Search Academic. Those works should all have catalog records somewhere, so the results for authority control can be compared.
I personally don't want to forget the David Johnson example, because this is what people deal with every day. It would be great if everybody had distinctive names such as Jiamagurdni Smith, but there are many more Jay, Joseph, and Jennifer Smiths. If the system can only differentiate the Jiamagurdni Smiths, it doesn't save anything at all, since that can be done semi-automatically today (no conflicts). It seems to me that if the proposed system can't handle anything tough, it is just like my example of automatic translation needing to be revised by a human, and I don't see any use in it at all, except as an experimental attempt.
As I wrote before, I am not yet impressed by the Web of Science example.
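(For what it's worth, the stylometric idea quoted below - Conal's claim that authors leave a statistical "fingerprint" in their word frequencies - is easy to sketch. The texts, "authors," and similarity method here are invented purely for illustration, not drawn from any real corpus or from the proposed experiment:)

```python
# Toy illustration of stylometric author comparison: represent each text by
# the relative frequencies of common function words (which authors use
# largely unconsciously), then compare texts by cosine similarity.
# All texts and "authors" below are invented.
import math
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "for"]

def fingerprint(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity between two frequency vectors (0.0 to 1.0 here)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Invented samples: "author A" leans on "the"/"in"; "author B" on "of"/"to".
author_a = "the cat sat in the garden and the dog lay in the sun and the birds sang in the trees"
author_b = "a history of trade is a record of journeys to distant ports and of goods carried to market"
unknown  = "a study of language is a record of changes to common words and of habits carried to print"

sim_a = cosine(fingerprint(unknown), fingerprint(author_a))
sim_b = cosine(fingerprint(unknown), fingerprint(author_b))
print(f"similarity to A: {sim_a:.3f}, similarity to B: {sim_b:.3f}")
```

In this toy run the unknown text's function-word profile matches author B far more closely than author A. Real systems use many more features (co-occurrences, sentence lengths) and far longer texts; the hard part, as the discussion below notes, is getting the full text in the first place.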
Regards,
Jim Weinheimer
> Nathan, I think you are underestimating the difficulty of the experiment you
> are proposing. The difficulty springs from the requirement that the machine be
> able to read the works of the various David Johnsons. However, if someone would
> scan and OCR these works (or acquire full text from the publishers) then I
> think you are right that the rest would indeed be super easy.
>
> So better to forget the specific "David Johnson" example, and
> demonstrate the ability of automated methods using some existing full-text
> corpus. A number of researchers in the field of machine learning have already
> done this and written up impressive results in published papers. A few examples
> have already come up (such as Web of Knowledge). While I'm at it, another one I
> remembered reading is "The author-topic model for authors and
> documents" from http://portal.acm.org/citation.cfm?id=1036902&jmp=cit
>
> I think that, rather than the technology not yet being strictly
> feasible, the more important reasons why it is not already in more
> common use in libraries are:
>
> 1) a lack of full text (though full text IS available in some areas, it is
> often tied up in subscription databases such as those owned by Web of Knowledge)
> 2) a lack of library funding, CS expertise, interest, and even willingness to
> believe in the possibility (these things all go together)
>
> By contrast some of these techniques are being actively developed by internet
> search providers (who have the advertising dollar to pay for it), and by IT
> vendors (who have an obvious interest), as well as by researchers in other
> spaces such as, interestingly, genomics, which also has to deal with large
> bodies of data which have been produced (by natural selection) without adequate
> metadata :-)
>
> BTW I don't see any irony in your proposed experiment relying on OCLC's
> authority work. Since the experiment was precisely to test the performance of
> machines in identifying authors, and your test dataset was precisely a set of
> authors defined by OCLC, I don't see how you can avoid making use of that human
> authority work in the experiment. Or was there some other irony I missed? :-)
>
> Cheers
>
> Con
>
> -----Original Message-----
> From: Next generation catalogs for libraries on behalf of Rinne, Nathan (ESC)
> Sent: Sat 01/09/07 2:18
> To: NGC4LIB_at_listserv.nd.edu
> Subject: Re: [NGC4LIB] Resignation
>
> Obviously, Jim is not one of the faithful.
>
> Let me repeat this:
>
> In order to help along the "doubting Thomases" among the catalogers, let
> me make a plea. I think it should be super, super easy to do. Why
> doesn't someone start with Conal Tuohy's claim about our current
> capabilities (using Bayesian statistics) re: all of the David Johnsons
> James Weinheimer informed us of? I know something about science and
> research, so this ought to be easy enough to test empirically. First,
> get all the works (only text, I assume?) of all the David Johnsons. Of
> course, *ironically* [note: this is an addition to this quote] *in order
> to even get started here* I don't see how you could avoid using
> something like OCLC's WorldCat (made possible by its wonderful authority
> control, thank you!) to find most, if not all, of these works. Then all
> you would need to do is scan them, run the test, and find out whether it
> worked. I think this would be very, very helpful - and I want help. Does
> anyone have the means of doing this? (end)
>
> Please note, this is not a demand; it is a request. I think this would
> be very, very helpful. And I think it would be very, very easy to do as
> well (maybe not over a lunch break, but you know what I mean).
>
> Regards,
> Nathan Rinne
> Media Cataloging Technician
> ISD 279 - Educational Service Center (ESC)
> 11200 93rd Ave. North
> Maple Grove, MN. 55369
> Work phone: 763-391-7183
>
>
> -----Original Message-----
> From: Next generation catalogs for libraries
> [mailto:NGC4LIB_at_listserv.nd.edu] On Behalf Of James Weinheimer
> Sent: Friday, August 31, 2007 9:12 AM
> To: NGC4LIB_at_listserv.nd.edu
> Subject: Re: [NGC4LIB] Resignation
>
> > -----Original Message-----
> > From: Next generation catalogs for libraries
> > [mailto:NGC4LIB_at_listserv.nd.edu] On Behalf Of Conal Tuohy
> > Sent: Friday, August 31, 2007 3:02 AM
> > To: NGC4LIB_at_listserv.nd.edu
> > Subject: Re: [NGC4LIB] Resignation
> > I'm assuming you're asking how a machine can decide that a given work
> > was authored by one (or none) of the above?
> >
> > If the full text of the books is available, this is actually quite a
> > feasible task which can be done by unsupervised machine-learning
> > algorithms. Every author has an authorial "fingerprint" which can be
> > recognised by attentive readers, and Bayesian statistical techniques
> > are even better at picking up such things. The key data for these
> > algorithms are the frequency of use and co-occurrences of particular
> > words, sentence lengths, etc. It in no way requires AI capable of
> > "understanding" the subject of the text, in the sense that a human
> > reader can. The statistical patterns which these algorithms recognise
> > are ones which are generally below the conscious perception of human
> > readers (who instead tend to focus on what a text actually means).
> >
> > This is an area where we should expect computers to out-perform
> > humans, frankly.
>
> Then show us. I have read so many "shoulds" in my life, and maybe some
> of them seemed to make sense, but I haven't seen them work in practice
> yet. (Alchemy made a lot of sense, too!) People talk about the great
> "automatic translation," but what I've seen is still a disaster. This
> was some time back, but a former professor of mine worked his entire
> life on automatic translation, only to declare it impossible at the end.
> The best you could do was to create a text for a human to edit. As he
> said, if you need the human to edit it anyway, why go through it in the
> first place? He was speaking quite some time back. Automatic translation
> has come some way, but it's not there yet. And neither is automatic
> subject analysis.
>
> We can experiment to our heart's desire, but we cannot draw conclusions
> based on ifs, maybes, and shoulds. We have seen that "should and maybe"
> may come around in 50 or more years, if ever, or it may be next week.
>
> Regards,
> Jim
>
> James Weinheimer j.weinheimer_at_aur.edu
> Director of Library and Information Services
> The American University of Rome
> via Pietro Roselli, 4
> 00153 Rome, Italy
> voice- 011 39 06 58330919 ext. 327
> fax-011 39 06 58330992
Received on Fri Aug 31 2007 - 12:56:51 EDT