Re: The next generation of discovery tools (new LJ article)

From: Joshua Greben <jgreben_at_nyob>
Date: Fri, 25 Mar 2011 15:16:34 -0400
To: NGC4LIB_at_LISTSERV.ND.EDU
In our assessment, based on what we thought would be most useful to the end user, it was more important to get a good mix of relevant 'Books' and 'Articles' in each displayed result set than to provide an absolutely accurate ordering based on a normalized score.
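To illustrate what I mean by a 'good mix': the general idea is to interleave the two independently ranked lists rather than compare raw scores across systems. Here is a rough Python sketch of that kind of blending (the field names, page size, and fixed 1:1 ratio are illustrative assumptions, not our actual Mango configuration):

# Blend 'Books' and 'Articles' results by interleaving the two lists,
# rather than comparing raw relevancy scores across systems.
def blend_results(book_hits, article_hits, page_size=20):
    """Interleave two independently ranked result lists into one page."""
    blended = []
    books, articles = iter(book_hits), iter(article_hits)
    while len(blended) < page_size:
        took_one = False
        for source in (books, articles):
            hit = next(source, None)
            if hit is not None:
                blended.append(hit)
                took_one = True
            if len(blended) >= page_size:
                break
        if not took_one:  # both lists exhausted
            break
    return blended

# Each hit keeps its own source-relative ranking; no cross-system
# score normalization is attempted.
books = [{"id": "b1", "type": "Book"}, {"id": "b2", "type": "Book"}]
articles = [{"id": "a1", "type": "Article"}, {"id": "a2", "type": "Article"}]
print(blend_results(books, articles, page_size=4))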

One thing I keep hearing again and again from reference librarians here in Florida is that despite the existence of these unified or blended indexes, they still ultimately direct their students to the appropriate databases for their research topics. The mega-indexes are good tools to have, especially for researchers who do not know how to choose an appropriate database and do not seek the help of librarians, but the librarians do not recommend them for comprehensive literature reviews and due diligence.

Part of our project is going to (hopefully) address the question of how to intelligently direct the researcher to a few specific, appropriate databases based on their query and its relevant results, thus augmenting or simulating a reference interview.
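To make that concrete, one way such a recommendation step could work (purely a sketch; the 'source_db' field, the thresholds, and the tallying approach are hypothetical illustrations, not something we have built) is to count which databases contribute the most relevant hits for a query and suggest the top few:

from collections import Counter

# Hypothetical recommendation step: count which source databases contribute
# the most relevant hits for a query and suggest the top few to the researcher.
def recommend_databases(hits, max_suggestions=3, min_share=0.10):
    """Suggest databases that account for a meaningful share of the top hits."""
    counts = Counter(hit["source_db"] for hit in hits if hit.get("source_db"))
    total = sum(counts.values())
    if total == 0:
        return []
    return [db for db, n in counts.most_common(max_suggestions)
            if n / total >= min_share]

# Example with made-up hits:
hits = [{"source_db": "PsycINFO"}, {"source_db": "PsycINFO"},
        {"source_db": "ERIC"}, {"source_db": "MEDLINE"}]
print(recommend_databases(hits))  # ['PsycINFO', 'ERIC', 'MEDLINE']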

We would love to do a write-up of what we did in Mango for the Code4Lib Journal, especially if it means we would get even a small hiatus from new development projects, but that is probably just wishful thinking on my part. :)

Hopefully this summer will be a good time to focus on a write-up of this project. We are actively working on providing the same solution using the EDS and Summon APIs as well, so it would be good to be able to describe and compare the three. More to come!

Josh


On Mar 25, 2011, at 1:12 PM, Jonathan Rochkind wrote:

>> Each of those records will have a score -- which is just a floating-point decimal number.  The scores are going to be different between the two systems, 
>> naturally.  But, since they are both based on Lucene, they are not wildly different.
> 
> This is not a safe assumption. Even within the very same Solr/Lucene instance/index, different searches can have wildly different relevancy scores (which are not in fact floating-point numbers normalized to 0-1, but unbounded values). The relevancy score is not meant to be any kind of absolute value, but is only relative to other documents within the result set for an identical query on an identical index.
> 
> "Scores for results for a given query are only useful in comparison to other results for that exact same query. Trying to compare scores across queries or trying to understand what the actual score means (i.e. 2.34345 for a specific document) may not be an effective exercise." http://lucidworks.lucidimagination.com/display/LWEUG/Understanding+and+Improving+Relevance
> 
> Still, I can think of some potential ways around that, and the basic approach you outline might work; I'm not dismissing it out of hand. But I'm not sure whether it would or not; there are a number of other potential gotchas and challenges I can think of. Really, there's only one way to find out: actual implementation, where you'll run into problems and subsequent solutions you won't anticipate in just a thought experiment.
> 
> Which is why I'm most interested in seeing a write-up from the folks who have actually implemented a multiple-system result-set-merging algorithm using a local Solr and a remote 'discovery layer' successfully -- please write it up! The Code4Lib Journal would certainly be interested.
> 
> Jonathan 
> 
Received on Fri Mar 25 2011 - 15:17:02 EDT