It is true that the _user experience_ of TF-IDF-style ranking
algorithms is often that you get a few highly relevant results, and
then the results trail off into a tail that is all about equally
non-relevant. I would be wary of assuming that this is reflected in
the _math_, though.
Jim, when you say "my own experience too is that this is correct," do
you mean you have actually looked at the distribution of the calculated
relevance scores in the result set, or just that your own judgements of
the relevance of the hits would distribute like that, trailing off into
non-relevance?
I would really not assume that the math reflects this. Relevance
algorithms are pretty good at ranking a set of documents so the "best"
come first for a given query. They are NOT that good at assigning an
absolute value of "goodness" to a document for a given query. If they
WERE good at that latter thing, it would be easy to simply trim off
all those trailing irrelevant hits -- everything whose score falls
below some absolute value -- and provide a much better result set. But
the algorithms just don't do that. The score is mostly useful in a
relative comparison to other documents; it is not an objective measure.
A given distance on the number line at one part of the distribution may
not correspond to the same perceived difference in usefulness as that
same distance at another part of the distribution, and changing the
corpus can change the scores quite a bit, possibly including the
distribution of scores.
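
To make this concrete, here is a minimal sketch in Python (the toy
corpus, query, and numbers are invented purely for illustration, not
taken from any real system) of why a fixed score cutoff is unreliable:
plain TF-IDF scores shift whenever the corpus changes, even when the
relative ordering does not.

    import math
    from collections import Counter

    def tfidf_score(query_terms, doc, corpus):
        # Plain TF-IDF: term frequency in the document times the log
        # inverse document frequency across the corpus.
        n_docs = len(corpus)
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            df = sum(1 for d in corpus if term in d)  # document frequency
            if df:
                score += tf[term] * math.log(n_docs / df)
        return score

    corpus = [
        ["library", "catalog", "search"],
        ["library", "ranking", "relevance"],
        ["cooking", "recipes"],
    ]
    query = ["library", "relevance"]
    print([round(tfidf_score(query, d, corpus), 3) for d in corpus])
    # [0.405, 1.504, 0.0]

    # Grow the corpus with documents that also mention "library":
    corpus += [["library", "hours"], ["library", "fines"]]
    print([round(tfidf_score(query, d, corpus), 3) for d in corpus[:3]])
    # [0.223, 1.833, 0.0]  -- same order, different scores.

Nothing in the second run invalidates the ranking of the original three
documents, but a threshold tuned against the first run would now trim
differently.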
Now, I'm not saying it would be impossible. It would be an interesting
area of research, and perhaps some such research has already been done;
if anyone knows of any citations from the 'information retrieval'
literature, please share. But as a general rule, algorithms that are
good at relative rankings of result sets aren't necessarily good or
useful as objective measures -- just because an algorithm succeeds at
putting the most useful docs first in the judgement of a majority of
users does NOT mean it is assigning an objective number to each doc
that could be compared to any other doc in the result set in a way
that would match the judgement of a majority of users.
Even though your _evaluation_ of relevance might look like:

  100, 98, 87, 54, 35, 12, 4, 1, 1, 1, 1, 1, 1, 1, 1

the actual numbers might look like:

  100, 70, 69, 68, 67, 66, 65, 64, 30, 29, 28, 27, 26, 10, 9, 8, 7
It still succeeds in giving you a ranking that matches your judgement,
but it does NOT give you scores that allow you to compare any document
to any other document in a way that would match your human judgement.
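
A quick way to see the difference, taking the first eight numbers from
each list above (a Python sketch, purely for illustration):

    # Two score lists that agree perfectly on *order* while
    # disagreeing badly on the *gaps* between documents.
    human  = [100, 98, 87, 54, 35, 12, 4, 1]    # your evaluation
    system = [100, 70, 69, 68, 67, 66, 65, 64]  # the actual scores

    # Same ranking: both lists are strictly decreasing, so sorting
    # by either one puts the documents in the identical order.
    assert sorted(range(8), key=lambda i: -human[i]) == \
           sorted(range(8), key=lambda i: -system[i])

    # Very different gaps: the human sees a cliff after the third
    # hit, while the system's scores suggest a near-tie among hits
    # 2 through 8.
    print([a - b for a, b in zip(human, human[1:])])    # [2, 11, 33, 19, 23, 8, 3]
    print([a - b for a, b in zip(system, system[1:])])  # [30, 1, 1, 1, 1, 1, 1]

The rank agreement is perfect; the gap structure is not, and the gap
structure is exactly what an absolute cutoff would have to rely on.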
On 3/25/2011 2:22 PM, Weinheimer Jim wrote:
> Karen Coyle wrote:
> <snip>
> I have always assumed (and I would love for someone to post some real
> data related to this) that after a very small number of high ranked
> results the remainder of the set is "flat" -- that is, many items with
> the same value. What makes this flat section difficult is that there
> is no plausible order -- if your set is ranked:
>
> 100, 98, 87, 54, 35, 12, 4, 1, 1, 1, 1, 1, 1, 1, 1, ....
>
> and you go to pick up results for page 2, they will all have the same
> rank and they will be in no useful order. (probably FIFO).
> </snip>
>
> My own experience too is that this is correct. Something that may or may not be relevant to this discussion: I have worked a bit with a Firefox plugin called Cloudlet (http://www.getcloudlet.com/), which takes a search in Google, Yahoo, and some other databases and returns a word cloud. In the Wired article at http://www.wired.com/epicenter/2008/12/firefox-add-ons/, they mention that to get better results, you should change your account to get 100 results per page, but otherwise I haven't discovered any more details about how it works. I've concentrated on trying to find out if it is genuinely useful.
>
> I still haven't decided whether it is or not, but something within me says that it *has* to be useful. My concern is that when I click on a word in the cloud, I don't really know what I'm looking at.
>
> In any case, this is a different take on the same idea as what we are discussing here.
>
> Anyway, a suggestion for Karen is to relate the search to Google Scholar, which arranges results by number of citations (mostly). For more specific searches, i.e. not only single words but multiple terms, the citations die out after a couple of pages or so.
>
> James L. Weinheimer j.weinheimer_at_aur.edu
> Director of Library and Information Services
> The American University of Rome
> Rome, Italy
> First Thus: http://catalogingmatters.blogspot.com/
>