> Each of those records will have a score -- which is just a floating-point decimal number. The scores are going to be different between the two systems,
> naturally. But, since they are both based on Lucene, they are not wildly different.
This is not a safe assumption. Even within the very same Solr/Lucene instance and index, different searches can have wildly different relevancy scores (which, contrary to a common misconception, are not normalized floating point numbers in the 0-1 range, but arbitrary, unbounded positive floats). The relevancy score is not meant to be any kind of absolute value; it is only meaningful relative to the other documents in the result set for an identical query on an identical index.
"Scores for results for a given query are only useful in comparison to other results for that exact same query. Trying to compare scores across queries or trying to understand what the actual score means (i.e. 2.34345 for a specific document) may not be an effective exercise." http://lucidworks.lucidimagination.com/display/LWEUG/Understanding+and+Improving+Relevance
Still, I can think of some potential ways around that, and the basic approach you outline might work; I'm not dismissing it out of hand. But I'm not sure whether it would, and there are a number of other potential gotchas and challenges I can think of. Really, there's only one way to find out: actual implementation, where you'll run into problems (and find solutions) that you won't anticipate in a pure thought experiment.
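For concreteness, here's a rough sketch (in Python, untested, with made-up function names and record shapes) of one such possible workaround: min-max normalize each system's scores within its own result set before interleaving. Note this only sidesteps the absolute-score problem within a single query's result sets; it does nothing about the deeper issue that the two systems may rank by different criteria to begin with.

    # Hypothetical sketch, not from any real Solr client library.
    # Each "hit" is assumed to be a dict like {"id": ..., "score": float}.

    def normalize(results):
        """Rescale raw scores to 0-1 within a single result set.

        Raw Lucene scores are only meaningful relative to other hits
        for the same query on the same index, so we normalize each
        system's scores before comparing across systems.
        """
        if not results:
            return []
        scores = [r["score"] for r in results]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid divide-by-zero when all scores are equal
        return [dict(r, score=(r["score"] - lo) / span) for r in results]

    def merge_results(local_hits, remote_hits):
        """Interleave two result sets by normalized score, highest first."""
        merged = normalize(local_hits) + normalize(remote_hits)
        return sorted(merged, key=lambda r: r["score"], reverse=True)

    # Raw scores on wildly different scales still interleave sanely:
    local = [{"id": "a", "score": 12.7}, {"id": "b", "score": 3.1}]
    remote = [{"id": "x", "score": 0.92}, {"id": "y", "score": 0.15}]
    print(merge_results(local, remote))

Again, this is just a thought experiment on paper -- whether something like it produces an acceptable merged ranking in practice is exactly the kind of thing you only find out by implementing it.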
Which is why I'm most interested in seeing a write-up from the folks who have actually implemented a multiple-system result-set merging algorithm using a local Solr and a remote 'discovery layer' successfully -- please write it up! The Code4Lib Journal would certainly be interested.
Jonathan