Re: tool for finding close matches in vocabulary list

From: Owen Stephens <owen_at_nyob> Date: Fri, 21 Mar 2014 18:43:14 +0000 To: CODE4LIB_at_LISTSERV.ND.EDU

As Roy suggests, OpenRefine is designed for this type of work and could easily handle the volume you are talking about here. It can cluster similar terms using a variety of algorithms, and it makes it easy to apply a set of standard transformations across them.
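
To give a rough idea of what key-based clustering does, here is a minimal Python sketch of a "fingerprint" style key. It is only an approximation written for illustration, not Refine's own code: the function names fingerprint and cluster are made up here, and treating punctuation as a space (so that "Basketball-Women" and "Basketball - Women" end up with the same key) is an assumption of this sketch.

```python
import re
from collections import defaultdict

def fingerprint(term):
    # Hypothetical helper: lowercase the term, turn punctuation into spaces,
    # then join the unique tokens in sorted order to form a comparison key.
    cleaned = re.sub(r"[^\w\s]", " ", term.strip().lower())
    return " ".join(sorted(set(cleaned.split())))

def cluster(terms):
    # Group terms by key; a key shared by more than one distinct term
    # is a candidate cluster for consolidation.
    groups = defaultdict(list)
    for term in terms:
        groups[fingerprint(term)].append(term)
    return [group for group in groups.values() if len(set(group)) > 1]

terms = [
    "Irwin, Ken",
    "Irwin, Kenneth",
    "Irwin, Mr. Kenneth",
    "Irwin, Kenneth R.",
    "Basketball - Women",
    "Basketball - Women's",
    "Basketball-Women",
    "Basketball-Women's",
]

for group in cluster(terms):
    print(group)
```

On Ken's examples this would group the hyphen and spacing variants of the basketball terms, but not "Women" with "Women's" or the different forms of the name. Variants like those need a similarity-based (nearest neighbour) method rather than an exact key match.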

The screencasts and info at http://freeyourmetadata.org/cleanup/ might be a good starting point if you want to see what Refine can do.

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: owen_at_ostephens.com
Telephone: 0121 288 6936

On 21 Mar 2014, at 18:24, Ken Irwin <kirwin_at_WITTENBERG.EDU> wrote:

> Hi folks,
> 
> I'm looking for a tool that can look at a list of all the subject terms in a poorly-controlled index and identify possible candidates for term consolidation. Our student newspaper index has about 16,000 subject terms, and they include a lot of meaningless typographical and nomenclatural differences, e.g.:
> 
> Irwin, Ken
> Irwin, Kenneth
> Irwin, Mr. Kenneth
> Irwin, Kenneth R.
> 
> Basketball - Women
> Basketball - Women's
> Basketball-Women
> Basketball-Women's
> 
> I would love to have some sort of pattern-matching tool that's smart about this sort of thing that could go through the list of terms (as a text list, database, xml file, or whatever structure it wants to ingest) and spit out some clusters of possible matches.
> 
> Does anyone know of a tool that's good for that sort of thing?
> 
> The index is just a bunch of MySQL tables - there is no real controlled-vocab system, though I've recently built some systems to suggest known subject headings to reduce this sort of redundancy.
> 
> Any ideas?
> 
> Thanks!
> Ken