Re: Resignation

From: Weinheimer Jim <j.weinheimer_at_nyob>
Date: Thu, 30 Aug 2007 13:02:05 +0200
To: NGC4LIB_at_listserv.nd.edu

> I know several people here have said they don't think AI or smart
> software is up to the task, and that they represent no real risk to
> serious cataloging. Well, I can only say, please trust me in this! I
> worked in professional AI for over 7 years; this stuff not only can
> be done, but has been done for some time. Most AI problems have
> historically been with access to various corpora to make them work, but
> those things are now changing dramatically. Now that all books are
> written on computers and available for analysis, even more so.

Sorry, I can't rely on faith for something of this importance, hoping that it will be worked out, and worked out soon. I have some experience in these things myself and I have never seen any AI system do well enough. This doesn't mean that it never will--and who knows?--there might be something in 5 or 10 years, but what I have seen is not nearly good enough yet. I think people will always want to query a catalog/database/index and be able to find items authored by specific individuals. How can AI differentiate between the following list of David Johnsons, and determine that it is not any of these and there needs to be a new one? (taken from the LCNAF. Just the first page!)

Johnson, David
Johnson, David, 1782-1855
Johnson, David, 1906-
Johnson, David, 1922-1987
Johnson, David, 1927-
Johnson, David, 1933-
Johnson, David, 1935-
Johnson, David, 1936-
Johnson, David, 1938-
Johnson, David, 1940-
Johnson, David, 1941-
Johnson, David, 1942 Aug. 2-
Johnson, David, 1942 June 23-
Johnson, David, 1942 Oct. 27-
Johnson, David, 1943-
Johnson, David, 1946-
Johnson, David, 1946 Aug. 23-
Johnson, David, 1946 May 5-

If we can show that people don't want to retrieve this sort of information anymore, that is one thing, but I think people still expect it and don't really understand that they are not getting it. I have never seen an AI system that can automatically distinguish among multiple authors of the same name. The same goes for titles and subjects. I can't imagine how it could.

But just because I can't imagine it right now doesn't mean it cannot be done. It might be someday. But what I think is more important is that I (and many others) *can* imagine that an AI type system could be built today--right now--that could seriously help the human expert in determining which one to use.
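
As a rough illustration of that kind of assistive tool, here is a minimal Python sketch. It assumes the only evidence available is the publication year of the item in hand and the dates embedded in the headings; the names `parse_dates`, `shortlist`, and the `min_age` cutoff are hypothetical, not part of any real cataloging system. It does not pick a heading. It only removes candidates the dates rule out and leaves the decision, including "none of these, create a new one," to the cataloger.

```python
import re
from typing import List, Optional, Tuple

# Trailing dates in headings such as "Johnson, David, 1782-1855",
# "Johnson, David, 1906-" or "Johnson, David, 1942 Aug. 2-".
DATES = re.compile(r",\s*(\d{4})(?:\s+[A-Za-z]+\.?\s+\d{1,2})?-(\d{4})?$")


def parse_dates(heading: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (birth_year, death_year); either may be None if absent."""
    m = DATES.search(heading)
    if m is None:
        return None, None
    birth = int(m.group(1))
    death = int(m.group(2)) if m.group(2) else None
    return birth, death


def shortlist(headings: List[str], pub_year: int,
              min_age: int = 15) -> List[Tuple[str, str]]:
    """Keep headings not ruled out by dates, each with a note for the human.

    min_age is an assumed cutoff: someone younger than this at
    publication is treated as an implausible author.
    """
    kept = []
    for h in headings:
        birth, death = parse_dates(h)
        if birth is None:
            kept.append((h, "no dates; check by hand"))
        elif birth + min_age > pub_year:
            continue  # too young (or not yet born) at publication
        elif death is not None and death < pub_year:
            kept.append((h, "possible only if posthumous"))
        else:
            kept.append((h, "dates fit"))
    return kept


if __name__ == "__main__":
    sample = [
        "Johnson, David",
        "Johnson, David, 1782-1855",
        "Johnson, David, 1906-",
        "Johnson, David, 1942 Aug. 2-",
        "Johnson, David, 1946 May 5-",
    ]
    for heading, note in shortlist(sample, pub_year=1950):
        print(f"{heading:35} {note}")
```

Dates alone obviously cannot settle the question. A real helper would need further evidence (subject, publisher, affiliation), and the expert would still make the final call.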

The big difference between myself and systems people is that I want to create systems that will help the humans create high quality records much, much more efficiently than they do today. Many systems people want to take the human out of the equation altogether. I personally don't know if I agree with the goal (I just finished watching Terminator 3 again!) and although we may be able to do it someday, it is still much too soon.

> AI systems will give a contextual subject heading and quote breakdown
> per chapter or paragraph, if you like, with links to domain models,
> other items of similarity | publisher | topic | author | field, etc,
> measured by time, acceptance, quotations and more. And we, we still
> argue whether to put the TOC in the MARC field or not. Trust me, the
> meta data of the future will *not* be in MARC, because it simply can't
> fit it in nor is structured for it.

There are two parts to cataloging: description and access points. The access points are the conceptual, semantic points of reference that I mentioned above. The description, especially with ISBD, comes directly from the item. Many parts of the description could probably be done automatically or semi-automatically today, and much more could be done if publishers and librarians would cooperate and change.

Concerning MARC, it can be retained for interoperation purposes, but there is a lot more we can do if we open it up, I agree.

> So, the quality of meta data really is what it comes down to, and
> right now, because we're tiresome librarians, we got supposedly good
> meta data. But anyone who sits down with a fully indexed set of
> subject headings and play seriously with it find flaws in it on a
> search by search basis; its a very rigid and sometimes random piece of
> work. Human cataloging is, well, human, and make a lot of mistakes
> which will be especially troublesome when computers try to use them as
> is. If we're to reap the benefit of the current fuzzy meta data, we
> need to find more fuzzy means of using them, changing them to more
> computer-friendly formats, because the current rigid indexing systems
> fails the litmus test.

Certainly there are mistakes in human cataloging, but they are different from the computer mistakes. I can think of a book that I cataloged when I first began which was about the legal status of new mothers and pregnant women in the [former] Soviet Union. I found a record for it cataloged by another library (a law library!) and I thought I would learn some good legal subjects, but the subject they had put in was:
Women--Soviet Union.
In one sense this is not "wrong," but it is wrong because the cataloger was lazy. The quality of human indexing is determined by what is called specificity and exhaustivity. This was wrong on both counts.

With computer systems, there is precision and recall, which are different. So, it is possible that an AI system could determine that the above book was about something totally off-base, such as information retrieval, while a human would never do that. What I am primarily concerned about, though, is that the above heading, Women--Soviet Union, would be considered "good enough" for an AI system when it absolutely is not. This would only be a lowering of standards, and I think our users--and society--need something much better than this.
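
For readers less familiar with the two retrieval measures, a minimal Python sketch follows. The item identifiers are made up for illustration and do not come from any real catalog. It shows why a very broad heading can look acceptable on recall while doing badly on precision.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision = hits / retrieved; recall = hits / relevant."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical: two items are really about the legal status of
# pregnant women and new mothers in the Soviet Union.
relevant = {"b1", "b2"}

# A search on a broad heading returns them plus six unrelated items.
broad = {"b1", "b2", "b3", "b4", "b5", "b6", "b7", "b8"}

# A search on a specific heading returns only the two relevant items.
specific = {"b1", "b2"}

print(precision_recall(broad, relevant))     # (0.25, 1.0)
print(precision_recall(specific, relevant))  # (1.0, 1.0)
```

In this made-up case the broad heading still finds both books, so recall alone would not flag any problem; the cost shows up only in the six irrelevant items the user has to wade through.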

But who knows? Maybe in a couple of years, everything will be solved--but I doubt it.

Regards,
Jim Weinheimer