Re: tool for finding close matches in vocabulary list

From: Pikas, Christina K. <Christina.Pikas_at_nyob>
Date: Fri, 21 Mar 2014 14:37:49 -0400
To: CODE4LIB_at_LISTSERV.ND.EDU

I use VantagePoint for that, but it's $$$$, even for academic users (https://www.thevantagepoint.com/). It does fuzzy matching over names and then lets you review and correct the groupings. You can also save the groupings as a thesaurus to apply them to another set if needed. 
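
To illustrate the "save the groupings as a thesaurus and apply them to another set" idea in general terms, here is a minimal sketch. It is not VantagePoint's file format or API; the function names (`build_thesaurus`, `apply_thesaurus`), the JSON storage, and the sample groupings are all assumptions made up for illustration.

```python
import json

def build_thesaurus(groupings):
    """Flatten reviewed groupings {preferred: [variants]} into {variant: preferred}."""
    return {variant: preferred
            for preferred, variants in groupings.items()
            for variant in variants}

def apply_thesaurus(terms, thesaurus):
    """Replace known variants with their preferred term; leave other terms alone."""
    return [thesaurus.get(term, term) for term in terms]

# Hypothetical groupings, as they might look after a human has reviewed them.
reviewed = {
    "Irwin, Kenneth": ["Irwin, Ken", "Irwin, Mr. Kenneth", "Irwin, Kenneth R."],
}

# Serialize so the same mapping can be reused on a different term list later.
saved = json.dumps(build_thesaurus(reviewed))

other_set = ["Irwin, Ken", "Basketball - Women"]
print(apply_thesaurus(other_set, json.loads(saved)))
```

Storing the variant-to-preferred mapping (rather than the clusters themselves) keeps the later lookup a plain dictionary access.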

Christina

------
Christina K. Pikas
Librarian
The Johns Hopkins University Applied Physics Laboratory
Baltimore: 443.778.4812
D.C.: 240.228.4812
Christina.Pikas_at_jhuapl.edu

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB_at_listserv.nd.edu] On Behalf Of Ken Irwin
Sent: Friday, March 21, 2014 2:25 PM
To: CODE4LIB_at_listserv.nd.edu
Subject: [CODE4LIB] tool for finding close matches in vocabulary list

Hi folks,

I'm looking for a tool that can go through the list of all the subject terms in a poorly controlled index and flag possible candidates for term consolidation. Our student newspaper index has about 16,000 subject terms, and they include a lot of meaningless typographical and nomenclatural differences, e.g.:

Irwin, Ken
Irwin, Kenneth
Irwin, Mr. Kenneth
Irwin, Kenneth R.

Basketball - Women
Basketball - Women's
Basketball-Women
Basketball-Women's

I would love to have some sort of pattern-matching tool that's smart about this sort of thing that could go through the list of terms (as a text list, database, XML file, or whatever structure it wants to ingest) and spit out some clusters of possible matches.
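
As a rough illustration of what such a tool might do, here is a minimal sketch using only Python's standard library (`difflib.SequenceMatcher`). The normalization rules, the greedy clustering strategy, the 0.7 threshold, and the names `normalize` and `cluster_terms` are all assumptions chosen for this example, not a recommendation of a specific product or a tuned algorithm.

```python
import re
from difflib import SequenceMatcher

def normalize(term):
    """Lowercase, drop possessive 's, and turn punctuation into single spaces."""
    term = term.lower()
    term = re.sub(r"'s\b", "", term)           # "women's" -> "women"
    term = re.sub(r"[^a-z0-9]+", " ", term)    # commas, hyphens, periods -> space
    return " ".join(term.split())

def cluster_terms(terms, threshold=0.7):
    """Greedy clustering: a term joins the first cluster whose representative
    (the normalized form of its first member) is similar enough, otherwise
    it starts a new cluster. The threshold is an arbitrary starting point."""
    clusters = []  # list of (normalized representative, [original terms])
    for term in terms:
        key = normalize(term)
        for rep, members in clusters:
            if SequenceMatcher(None, rep, key).ratio() >= threshold:
                members.append(term)
                break
        else:
            clusters.append((key, [term]))
    return [members for _, members in clusters]

terms = [
    "Irwin, Ken", "Irwin, Kenneth", "Irwin, Mr. Kenneth", "Irwin, Kenneth R.",
    "Basketball - Women", "Basketball - Women's",
    "Basketball-Women", "Basketball-Women's",
]
for group in cluster_terms(terms):
    print(group)
```

On the eight sample terms above, this puts the four Irwin variants in one group and the four Basketball variants in another. Comparing every term against every cluster gets slow on a list of about 16,000 terms, so a real run would likely need some form of blocking (for example, only comparing terms that share a first word), and the suggested clusters would still need human review.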

Does anyone know of a tool that's good for that sort of thing?

The index is just a bunch of MySQL tables - there is no real controlled-vocabulary system, though I've recently built some systems that suggest known subject headings (SHs) to reduce this sort of redundancy.

Any ideas?

Thanks!
Ken