Re: Spell checking (was "Elitism - and Aristotle again!")

From: Jonathan Rochkind <rochkind_at_nyob>
Date: Tue, 7 Aug 2007 09:35:27 -0400
To: NGC4LIB_at_listserv.nd.edu

Bernhard Eversberg wrote:
> Spell-checking can hardly be made quite as useful for
> catalogs as it is being experienced with search engines.
> It may even be counterproductive to employ spell-checking, with
> no way for the user to figure out what's going on.
I don't accept it as a given that spell-checking can't be made very
useful for catalogs. Why do you believe this?

I think most spell checking we have seen is NOT useful. I think there
are very interesting ideas that could be pursued to attempt useful
spell-suggestion in our catalogs, even if our vendors aren't doing it.

Spelt by Martin Haye is a package that originated in the library world
and actually uses contemporary spell-checking technology (instead of
ancient and inappropriate tech). Now, libraries don't put a lot of
(any) resources into R&D in this type of area, so at the moment it's
still more of a research project, but I _think_ Martin may have
implemented it at his own library; it would be worth asking him if
there's a live demo site. At any rate, this is the kind of thing that I
think really is worth pursuing. I wouldn't assume that spell-suggestion
can never be useful in a catalog; I don't see any reason why this would
be so. But we (the library community) have got to work collectively to
create cutting-edge spell-suggestion that IS useful. What our vendors
are giving us (if anything) generally is not.

You think Google's spell-check, optimized for their environment, just
appeared out of the mind of Zeus? No, they put resources into developing
it.

Martin Haye's Spelt discussion list:
http://groups.google.com/group/spelt?lnk=sg

Jonathan
>
> Lots better: provide index browsing not just for controlled vocabulary
> but for title keywords as well. Then, users can immediately _see_
> what spellings there are and also what variants and mistakes, and also
> what's not there at all. It is even possible to make this kind of index
> truncatable! Here's a sample:
> User types "pharmaceutic" and gets
>        pharmaceutic (3)
>        pharmaceutica (115)
>        pharmaceuticae (30)
>        pharmaceutical (1926)
>        pharmaceutical-biotechnology
>        pharmaceutical/biomedical
>        pharmaceutically (4)
>        pharmaceuticals (275)
>        pharmaceuticam (12)
>        pharmaceuticarum (20)
>        pharmaceuticas (4)
>        pharmaceutice (24)
>        pharmaceuticen
>        pharmaceutices (14)
>        pharmaceutici (9)
>        pharmaceuticial
>        pharmaceuticis (45)
>        pharmaceutick
>         ...
> Now, user truncates that typing pharmaceutic? and gets
>
>        pharmaceutic... (2699)
>        pharmaceutik... (9)
>        pharmaceutin...
>        pharmaceutiq... (319)
>        pharmaceutis... (362)
>        pharmaceuto-... (2)
>        pharmaceutri... (3)
>        pharmaceutuc...
>        pharmacevtic... (11)
>        pharmacevtik...
>        pharmacevtiq...
>        pharmacevtis... (4)
>        pharmaceytis... (2)
>        pharmachem...
>        pharmachopoi...
>        pharmacia... (169)
>        pharmacia-ca...
>        pharmaciae... (36)
>
> Examples are from a database of 15 million titles in many languages,
> the keyword index alone holding over 100 million entries.
> Try it out here to get a feeling:
> http://www.biblio.tu-bs.de/db/vk/page.php?urG=TIT&urS=pharmaceut
>
> B.Eversberg
>
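
The truncatable index browsing described in the quoted message can be
sketched roughly as follows. This is a minimal Python sketch under stated
assumptions, not the implementation behind the linked catalog: the
function names (build_index, browse, browse_truncated) are hypothetical,
keywords are taken to be whitespace-separated lowercase words, and a
truncated query is assumed to group entries by a stem as long as the typed
string (the real index evidently picks its stems differently in places).

```python
from bisect import bisect_left
from collections import Counter
from itertools import groupby


def build_index(titles):
    """Count lowercase title keywords; return a sorted list of (term, count)."""
    counts = Counter(word for title in titles for word in title.lower().split())
    return sorted(counts.items())


def browse(index, typed, limit=20):
    """Return up to `limit` entries starting at the alphabetical position of `typed`."""
    # (typed,) sorts just before any (typed, count) pair, so this finds
    # the first entry whose term is >= the typed string.
    start = bisect_left(index, (typed,))
    return index[start:start + limit]


def browse_truncated(index, typed, limit=20):
    """Collapse entries into stems of len(typed) characters, summing counts.

    Assumption: the stem length equals the length of the typed string.
    """
    stem_len = len(typed)
    start = bisect_left(index, (typed,))
    groups = []
    # The index is sorted, so terms sharing a stem are adjacent.
    for stem, entries in groupby(index[start:], key=lambda e: e[0][:stem_len]):
        groups.append((stem, sum(count for _, count in entries)))
        if len(groups) == limit:
            break
    return groups


# Hypothetical sample titles, for illustration only.
sample_index = build_index([
    "pharmaceutical chemistry",
    "pharmaceutical sciences",
    "pharmaceutics today",
    "pharmaceutik heute",
    "acta pharmaceutica",
])
```

With this sample data, browse(sample_index, "pharmaceutic", limit=3) lists
the individual spellings with their counts, while
browse_truncated(sample_index, "pharmaceutic", limit=2) collapses them
into one line per stem. Either way the user sees what is actually in the
index, misspellings and variants included, rather than a guessed
correction.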

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu