Nathan wrote:
> Thanks for your reply. I see that I overestimated the capabilities of
> scanning technology (Jim's example helped here).
I don't personally think the capabilities of OCR software are the limiting factor. Google's low OCR quality is notorious among "etext" practitioners; it is possible to do much better. In any case, the reason Google is satisfied with such low-quality output is that, for their purposes, they can "make do" with it. Text of that quality can still suffice as input for the Bayesian techniques under consideration. Remember, these algorithms are not attempting to READ the text in a human fashion; they are just looking at relative word frequencies. If they have to discard a bunch of unlikely-looking words, or if a lot of misspelled words creep into their input, that is not necessarily going to disrupt them much.
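To make that concrete, here is a minimal sketch (my own illustration, not any particular production system) of a naive Bayes author classifier working from relative word frequencies. The toy corpora and the noisy "OCR" test string are invented for the example; the point is that junk tokens and misspellings simply drop out of the computation rather than breaking it.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpora standing in for scanned books by two authors.
# In practice these would be OCR output, misspellings and all.
AUTHOR_A = "the whale the sea the ship captain sailed the whale"
AUTHOR_B = "the garden the tea the parlour she walked the garden"

def tokenize(text, min_len=2):
    # Keep only alphabetic runs of a minimum length, discarding the
    # short "unlikely-looking" fragments that noisy OCR produces.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= min_len]

def train(texts_by_label):
    # Build a per-author word-frequency table, plus the joint vocabulary
    # needed for add-one smoothing.
    counts = {label: Counter(tokenize(t)) for label, t in texts_by_label.items()}
    vocab = set().union(*counts.values())
    return counts, vocab

def classify(text, counts, vocab):
    # Score each author by summed log relative word frequencies
    # (add-one smoothing). Words never seen in training are skipped,
    # so OCR garbage is silently ignored.
    scores = {}
    for label, freq in counts.items():
        total = sum(freq.values())
        score = 0.0
        for w in tokenize(text):
            if w not in vocab:
                continue  # discard unknown/garbled words
            score += math.log((freq[w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

counts, vocab = train({"A": AUTHOR_A, "B": AUTHOR_B})
# A noisy "OCR" sample: "teh", "xx3" and "shlp" are ignored, yet the
# surviving words still point clearly to author A.
print(classify("teh whale xx3 sailed the shlp sea", counts, vocab))  # → A
```

Even with roughly half the tokens discarded as unrecognisable, the relative frequencies of the words that do survive are enough to separate the two authors, which is the sense in which low-quality OCR can still "suffice as input".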
So personally, the reason I'm not leaping to perform the suggested experiment is that there's a lot of work in tracking down all the books of the selected authors and scanning them page by page. It would take weeks. At the end, I suspect much of the electronic text would be unusable for anything else anyway (for copyright reasons). Although I can see that some "doubting Thomases" might be impressed by the results of such an experiment, I don't personally have the resources to do it. In any case, I think there's enough proof of the capabilities of Bayesian methods in the scientific literature that I don't need to perform such an experiment just to satisfy myself it can be done. Which is not to say I don't intend to use these techniques in my own work!
I work in a digitisation centre in a university library, and we have done some experiments to harvest subject classifications from the full text of digitised magazine articles. What has kept us from developing the idea further is mostly a lack of time. There have also been some technical problems with the software: the MatLab-based implementation we tried is designed to work with small pieces of text (abstracts or newspaper articles, rather than novel-length books). I believe these problems are surmountable, though, and I think that by using our university's grid computing facility we will be able to scale up to deal with large corpora. I'm still hopeful that we'll have something "in production" within the next several months; certainly I'll post something to the list if and when that happens.
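One obvious workaround for the small-text limitation (my own sketch, not a description of our actual pipeline) is to split a novel-length book into abstract-sized chunks and feed each chunk to the classifier separately, then aggregate the per-chunk results. A minimal version of the splitting step:

```python
# Hedged sketch: split a long text into fixed-size word chunks so that
# software designed for abstract-length inputs can process each piece.
# The chunk size of 300 words is an arbitrary illustrative choice.

def chunk_words(text, chunk_size=300):
    # Break the text into consecutive runs of at most chunk_size words.
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

book = "word " * 1000  # stand-in for a digitised book of 1000 words
chunks = chunk_words(book, chunk_size=300)
print(len(chunks))  # 1000 words at 300 per chunk → 4 chunks
```

Each chunk can then be classified independently, and the book-level subject can be taken as, say, the majority label over its chunks; the per-chunk work is also trivially parallel, which is where a grid computing facility would help.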
Regards
Con
Received on Fri Aug 31 2007 - 21:23:36 EDT