David,
Your description of how we got this to work is right on. Jean Moises from FCLA did the majority of the coding in this area. In the case of skipping ahead pages, we simply do not provide that option for a blended result set. We are currently working on the same solution using the EDS and Summon APIs as well.
Hopefully in the near future I will be able to share some links where you will be able to see the result of this work.
In May I will be at ELUNA in Milwaukee where I will be able to do some ad hoc demos to any Ex Libris customers who are attending.
Josh
On Mar 25, 2011, at 12:11 PM, Walker, David wrote:
> This was an interesting thread, so I'm hoping to keep it alive with some speculative ramblings. Maybe Josh will yet respond. :-)
>
>> So do you grab ALL the results from both Solr
>> and Primo, so you can merge them? I'm surprised
>> that doesn't create a performance problem.
>
> I'm sure grabbing all the results would, in fact, create a *huge* performance problem. But I'm also thinking it's unnecessary to do that.
>
> Here's my simple idea of how this might work:
>
> In order to show the first page of results -- let's assume 10 records per page -- all you need to grab are the first 10 records from Solr and the first 10 from Primo Central. (This assumes both result sets are sorted by relevance, of course.)
>
> Each of those records will have a score -- which is just a floating-point number. The scores are going to be different between the two systems, naturally. But, since both systems are based on Lucene, they are not wildly different.
>
> In fact, I think it's mostly just a matter of scale. If you take the top possible score in your Solr instance (mine appears to be 14.0 based on how I've done field boosting and such) and the top possible score in Primo Central (which appears to be 1.0), you can work out a simple formula to "normalize" them to the same scale.
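A minimal sketch of that scaling idea, in Python rather than the Java the FCLA folks actually use. The max scores (14.0 for a local Solr instance, 1.0 for Primo Central) come from the paragraph above; `LOCAL_BOOST` is a hypothetical knob for the "boost your local collection" idea mentioned next, not anything either system provides:

```python
# Normalize raw engine scores onto a common 0..1 scale by dividing by each
# engine's top possible score, with an optional boost for one source.

SOLR_MAX = 14.0    # top possible score observed in the local Solr instance
PRIMO_MAX = 1.0    # top possible score reported by Primo Central

def normalize(score, max_score, boost=1.0):
    """Scale a raw score to 0..1 relative to max_score, then apply boost."""
    return (score / max_score) * boost
```

So a Solr hit scoring 7.0 normalizes to 0.5, directly comparable to a Primo Central hit scoring 0.5.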
>
> You might, in fact, want to boost your local collection a bit, or otherwise make this more complex (this may be more art than science). But that seems (to my simple way of thinking, anyway) the basic issue.
>
> You then order these 20 results (10 from Solr, 10 from Primo Central) by this "normalized" score. Take the top 10, and display them.
>
> In some cases, all 10 results from Solr might score higher than the 10 from Primo Central -- or vice versa -- which is why you need to grab as many results from each system as you intend to display per page.
>
> In many cases, though, you'll have a mix of Solr and Primo Central results. So let's say our first page included seven from Solr and three from Primo Central. When the user clicks to the next page, we'll grab 10 more results from Solr (starting with number eight), and 10 more from Primo Central (starting with number four), and do the same operation again.
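The paging scheme just described can be sketched as follows. This is an illustrative Python sketch, not the FCLA code (which is Java); `fetch_solr` and `fetch_primo` are hypothetical stand-ins for the real API calls, each returning `(record, normalized_score)` pairs sorted by relevance:

```python
# Fetch one page's worth of results from each source, merge by normalized
# score, and return per-source offsets so the next page resumes where each
# source left off (e.g. seven Solr hits shown -> next Solr fetch starts at 7).

def merge_page(fetch_solr, fetch_primo, solr_offset, primo_offset, page_size=10):
    solr_hits = fetch_solr(start=solr_offset, rows=page_size)
    primo_hits = fetch_primo(start=primo_offset, rows=page_size)
    tagged = [("solr", rec, score) for rec, score in solr_hits] + \
             [("primo", rec, score) for rec, score in primo_hits]
    tagged.sort(key=lambda t: t[2], reverse=True)   # highest score first
    page = tagged[:page_size]
    # Advance each source's offset by how many of its records were shown.
    solr_used = sum(1 for src, _, _ in page if src == "solr")
    return ([rec for _, rec, _ in page],
            solr_offset + solr_used,
            primo_offset + (len(page) - solr_used))
```

Each call displays one blended page and tells you where to resume in both systems -- which also makes the skip-to-page-eight problem concrete: you can't know the offsets for page eight without walking pages two through seven.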
>
> I'm not sure what you do in the event the user decides to skip to page eight.
>
> I am pretty sure there are more sophisticated ways of doing this, maybe even down at the Solr level, where Josh was working.
>
> One of the lightning talks at Code4Lib this year also talked about this (briefly, of course), and had some interesting ideas. This was with Summon, rather than Primo Central, but the former is also based on Lucene (Solr, in fact), so I think all the ideas would transfer:
>
> http://www.slideshare.net/villadsen/summasummon-something-something
>
> --Dave
>
> ==================
> David Walker
> Library Web Services Manager
> California State University
> http://xerxes.calstate.edu
> ________________________________________
> From: Next generation catalogs for libraries [NGC4LIB_at_LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind [rochkind_at_JHU.EDU]
> Sent: Monday, March 21, 2011 10:36 AM
> To: NGC4LIB_at_LISTSERV.ND.EDU
> Subject: Re: [NGC4LIB] The next generation of discovery tools (new LJ article)
>
> You're doing this with in-house code? Interesting.
>
> So do you grab ALL the results from both Solr and Primo, so you can
> merge them? I'm surprised that doesn't create a performance problem.
> I'm also curious how you manage to normalize relevancy from the two
> systems.
>
> All in all, this is interesting work, with a lot of tricky details, and
> I think you should write it up for the Code4Lib Journal. :)
>
> Jonathan
>
> On 3/21/2011 1:17 PM, Joshua Greben wrote:
>> Hi Jonathan,
>>
>> Yes, the blending in this context is exactly as you describe. More specifically, we take the top relevancy-ranked results from our Solr engine and the top relevancy-ranked results from the Primo Central API. Each set comes with a relevancy score for each document. Naturally, the scores are on different scales, so we normalize them. The Solr results are accessed through a Java object that the Solr software creates (a Collection of Solr document records). The Primo Central results are parsed with the help of a schema that Ex Libris provides as part of the Primo Central API, and each Primo Central record is then converted into a data type matching what is already in the Solr Java object. We insert the Primo Central results into that object, using the normalized relevancy scores to find the right place to insert each record, and that is how we are able to display a combined relevancy-ranked result set.
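The insertion step Josh describes -- placing each converted Primo Central record into the already-ranked local collection at the position its normalized score dictates -- might look roughly like this. A Python sketch only (the FCLA implementation is Java, working against Solr's document Collection); the list-of-tuples representation is an assumption for illustration:

```python
# Insert externally fetched records into a locally ranked, high-to-low
# sorted list, positioning each by its normalized relevancy score.
import bisect

def insert_by_score(solr_docs, primo_docs):
    """solr_docs: (score, doc) pairs sorted high-to-low;
    primo_docs: (score, doc) pairs to interleave by score."""
    merged = list(solr_docs)
    # bisect expects ascending keys, so search on negated scores.
    keys = [-score for score, _ in merged]
    for score, doc in primo_docs:
        pos = bisect.bisect_right(keys, -score)
        keys.insert(pos, -score)
        merged.insert(pos, (score, doc))
    return merged
```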
>>
>> Josh
>>
>>
>> Joshua Greben
>> Systems Librarian/Analyst
>> Florida Center for Library Automation
>> 5830 NW 39th Ave,
>> Gainesville, FL 32606
>> 352-392-9020 ext 246
>> jgreben_at_ufl.edu
>>
>>
>>
>>
>>
>> On Mar 21, 2011, at 12:37 PM, Jonathan Rochkind wrote:
>>
>>> On 3/18/2011 2:41 PM, Jean Phillips wrote:
>>>> At FCLA and other places there are people working on the ability to include megaindexes of articles and others into their locally developed or open source Discovery Tool. We've recently been able to blend the results from Ex Libris's Primo Central Index in with our local repository of metadata from the catalog and digital collections sources.
>>> When you say "blend"... what do you mean exactly? You really mean blend, hits from the external index interspersed with hits from your local repository, with some relevance algorithm merging the two sets?
>>>
>>> How ever did you manage that?
>>>
>
Received on Fri Mar 25 2011 - 13:15:54 EDT