Re: Resignation

From: Conal Tuohy <conal.tuohy_at_nyob>
Date: Tue, 4 Sep 2007 15:06:17 +1200
To: NGC4LIB_at_listserv.nd.edu

On Sat, 2007-09-01 at 04:56 +1200, Weinheimer Jim wrote:
> I think that what Nathan is proposing is perfectly logical and
> scientifically necessary. I proposed something similar at an
> institution once and it was completely ignored. If we want to say that
> any of this stuff is "scientific," we must follow scientific method
> somewhere along the way.

What's wrong with the scientific work which has already been published
in this area? I know I've read scientific literature detailing similar
experiments, and I don't see that repeating the experiment with the
particular "David Johnsons" dataset which Nathan gave as an example
adds anything of scientific value. An experiment in recognising the
authorship of works chosen from a list of "David Johnsons" could well
be a pedagogically useful example (for this list in particular), but
otherwise it's just a specific application of existing computing work
- isn't it? Or what do you still think is lacking (from a scientific
perspective)?

> I personally don't want to forget the David Johnson example, because
> this is what people deal with every day. It would be great if
> everybody had distinctive names such as Jiamagurdni Smith, but there
> are many more Jay, Joseph, and Jennifer Smiths. If the system can only
> differentiate the Jiamagurdni Smiths, it doesn't save anything at all,
> since that can be done semi-automatically today (no conflicts). It seems
> to me that if the proposed system can't do anything tough, it is just
> like my example of automatic translation needing to be revised by a
> human, and I don't see any use in it at all, except as an experimental
> attempt.

I'm not sure I understand the point you're making here, Jim.

Any system which has to guess the authorship of works will be liable to
error (whether the system is human or artificial). If an author's name
is common, the likelihood of error will be higher (again, this is true
for humans as well as for computers). So differentiating the works of
multiple David Johnsons will of course be harder than differentiating
the works of multiple Jiamagurdni Smiths. I don't think anyone could
dispute that, and it has nothing to do with the reason why I suggested
forgetting David Johnson.

The reason I politely spurned the David Johnsons experiment was not
that there are a lot of David Johnsons and the task would be
computationally hard (what do I care how hard it is? it's a computer's
job, not mine), but that there are a lot of David Johnson books, and I
don't want to spend weeks scanning books purely to show off
computational work which is already documented in the scientific
literature.
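
To make concrete the kind of published work I mean, here is a toy
sketch (in Python) of one family of attribution techniques: represent
each text by the relative frequencies of common function words, and
assign a disputed text to the candidate author whose known writing has
the nearest profile. Everything here - the word list, the sample
texts, the author labels - is invented for illustration; real systems
use much larger feature sets and corpora.

import math
from collections import Counter

# Common English function words; the list is illustrative, not canonical.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it",
                  "is", "was", "for", "with", "as", "but", "on", "not"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity between two frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def attribute(unknown_text, samples):
    """Pick the candidate author whose known writing has the
    function-word profile most similar to the unknown text."""
    target = profile(unknown_text)
    return max(samples, key=lambda a: cosine(target, profile(samples[a])))

# Invented samples: two authors who share the name "David Johnson".
samples = {
    "David Johnson (military historian)":
        "the war was fought in the north and the army held the border",
    "David Johnson (botanist)":
        "a leaf is attached to the stem and the frond of a fern unrolls",
}
print(attribute("the battle in the north was not won by the army", samples))

Run as-is, this prints the historian, because the disputed sentence
shares his profile of "the", "in" and "was" usage - which is all the
technique amounts to, just at a much larger scale.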

These techniques are applicable to journals etc., but not really yet
to books (in general), because the full text is just not available.
Until and unless the available corpus of digitised books reaches a
sufficiently large scale, it's not going to be feasible or
cost-effective for libraries to use these statistical techniques to
classify their holdings. In the "book space", I would expect to see
e.g. Amazon and Google deploy these techniques first, for
disambiguating authors, identifying concepts, etc., simply because
they have access to the full text of books, and we don't. :-(
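
Purely to illustrate what "disambiguating authors" could look like
once the full text is available: the sketch below clusters works
published under one name string, so that each cluster stands for a
presumed distinct author. The vocabulary-overlap measure and the 0.2
threshold are my own arbitrary choices for the example, not anything
Amazon or Google is known to use.

def vocabulary(text):
    """The set of distinct words in a work's full text."""
    return set(text.lower().split())

def jaccard(a, b):
    """Vocabulary overlap between two works (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_works(works, threshold=0.2):
    """Greedily assign each work to the first cluster containing a
    sufficiently similar work, else start a new cluster. One cluster
    stands for one presumed distinct author."""
    clusters = []  # each cluster is a list of (title, vocab) pairs
    for title, text in works:
        vocab = vocabulary(text)
        for c in clusters:
            if any(jaccard(vocab, v) >= threshold for _, v in c):
                c.append((title, vocab))
                break
        else:
            clusters.append([(title, vocab)])
    return [[title for title, _ in c] for c in clusters]

# Invented works, all catalogued under the name "David Johnson".
works = [
    ("A History of the Border Wars",
     "the war the army the border campaign"),
    ("Campaigns of the North",
     "the campaign the army and the war in the north"),
    ("Ferns of New Zealand",
     "the fern the frond the leaf stem and native species"),
]
print(cluster_works(works))

This groups the two military works together and leaves the botany
book on its own - but note it only works at all because the full text
is there to compare.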

Cheers

Con

--
Conal Tuohy
New Zealand Electronic Text Centre
www.nzetc.org