On 3/28/2011 4:20 PM, Karen Coyle wrote:
> Can we just conclude that, with a few exceptions, ranking is a crapshoot?
Well, it depends on what you mean. That's a dangerous statement,
because it sounds like the kind of argument some hardcore old-school
librarians use to say we shouldn't do relevance ranking at all -- I
mean, why provide a feature that's "a crapshoot"? Just sort by year
instead. I don't think it's true that relevancy ranking provides no
value, which is what "a crapshoot" implies.
Instead, relevance ranking, in actual testing (in general; I'm not sure
about the library domain specifically), does _very well_ at putting
first the documents that most users, most of the time, will find most
valuable. It does very well at providing an _order_. Thus the name
"ranking".
It just doesn't do very well at providing an objective _measure_ of
relevance, one that can be compared across searches or corpuses. But
that's not what it's for. It does pretty well at what it is actually
for, ranking: it's not a "crapshoot".
Another thought I had (blog essay coming soon!) is that relevance
ranking is an attempt to get out of the recall vs. precision bind.
Recall and precision are almost always fighting with each other:
increase one and you decrease the other. Relevance ranking lets you get
out of this bind by having a relatively high-recall/low-precision
result set in which the higher-precision hits are likely to come
_first_. The reason you can do this, but can't simply exclude the
low-precision results (without hurting recall), is precisely the
characteristic of relevancy ranking we're talking about --- relevancy
ranking puts the higher-precision results first _without_ having any
idea of exactly where results become "bad". It just knows that the
better the result, the higher it will be in the list; it has no clue
about where "good" becomes "bad".
(If you could know roughly where "good" becomes "bad", you could provide
a search that is both high recall AND high precision, but that's a MUCH
harder problem, one nobody has managed to solve, in the general case at
least. That's not what relevancy ranking does, and that doesn't mean
it's not valuable, or that it's a "crapshoot".)
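To pin down the terms: if R is the set of relevant documents and A is
the set a search actually returns, the standard definitions are

    precision = |R ∩ A| / |A|        recall = |R ∩ A| / |R|

Cast a wider net and A grows: recall can only go up, but the extra
non-relevant hits drag precision down. Ranking lets you keep the big A
and just push the likely members of R to the top of it.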
> kc
>
> Quoting "Walker, David"<dwalker_at_CALSTATE.EDU>:
>
>>> Again, does anyone have data to support/refute this?
>> So I just threw together some charts based on searches I ran in one
>> of our Solr instances. The query I entered is reflected in the name
>> of the file:
>>
>> http://library.calstate.edu/media/htm/charts/teaching-children-autism.html
>> http://library.calstate.edu/media/htm/charts/autism.html
>> http://library.calstate.edu/media/htm/charts/global-warming.html
>>
>> This is for a library catalog of just over a million bib records.
>>
>> --Dave
>>
>> ==================
>> David Walker
>> Library Web Services Manager
>> California State University
>> http://xerxes.calstate.edu
>> ________________________________________
>> From: Next generation catalogs for libraries
>> [NGC4LIB_at_LISTSERV.ND.EDU] On Behalf Of Karen Coyle [lists_at_kcoyle.net]
>> Sent: Friday, March 25, 2011 10:50 AM
>> To: NGC4LIB_at_LISTSERV.ND.EDU
>> Subject: Re: [NGC4LIB] The next generation of discovery tools (new
>> LJ article)
>>
>> Quoting "Walker, David"<dwalker_at_CALSTATE.EDU>:
>>
>>
>>> In order to show the first page of results -- let's assume 10
>>> records per page -- all you need to grab are the first 10 records
>>> from Solr and the first 10 from Primo Central. (This assumes both
>>> result sets are sorted by relevance, of course.)
>> I have always assumed (and I would love for someone to post some real
>> data related to this) that after a very small number of highly ranked
>> results the remainder of the set is "flat" -- that is, many items with
>> the same value. What makes this flat section difficult is that there
>> is no plausible order -- if your set is ranked:
>>
>> 100, 98, 87, 54, 35, 12, 4, 1, 1, 1, 1, 1, 1, 1, 1, ....
>>
>> and you go to pick up results for page 2, they will all have the same
>> rank and be in no useful order (probably FIFO).
>>
>> This, to me, is the essential problem with ranking. We all see it with
>> Google, where by the time you get a few pages in you aren't seeing
>> much utility to the order. Because bibliographic data gives you less
>> information to use in ranking, I suspect you will hit the tail pretty
>> quickly.
>>
>> Again, does anyone have data to support/refute this? I'm guessing the
>> result is Zipfian. (Zipf-ish?)[1]
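>> (For reference, Zipf's law says the value at rank r falls off roughly as
>>
>>      f(r) ∝ 1 / r^s,  with s near 1
>>
>> -- a steep head followed by a long, nearly flat tail.)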
>>
>> kc
>>
>> [1]
>> http://openlibrary.org/books/OL6048217M/Human_behavior_and_the_principle_of_least_effort
>>
>>> Each of those records will have a score -- which is just a
>>> floating-point decimal number. The scores are going to be different
>>> between the two systems, naturally. But, since they are both based
>>> on Lucene, they are not wildly different.
>>>
>>> In fact, I think it's mostly just a matter of scale. If you take
>>> the top possible score in your Solr instance (mine appears to be
>>> 14.0 based on how I've done field boosting and such) and the top
>>> possible score in Primo Central (which appears to be 1.0), you can
>>> work out a simple formula to "normalize" them to the same scale.
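>>>
>>> Roughly, in code (just a sketch -- the class and method names here are
>>> made up, and the max scores are only the examples above, not fixed
>>> constants of either product):
>>>
>>> public class ScoreScale {
>>>
>>>     // Top possible score in each system, as discussed above; yours will differ.
>>>     static final double SOLR_MAX = 14.0;
>>>     static final double PRIMO_CENTRAL_MAX = 1.0;
>>>
>>>     // Divide a raw score by that system's top possible score, putting
>>>     // both systems on a common 0.0 - 1.0 scale.
>>>     static double normalize(double rawScore, double maxScore) {
>>>         return rawScore / maxScore;
>>>     }
>>>
>>>     public static void main(String[] args) {
>>>         // Illustrative raw scores, not real data.
>>>         System.out.println(normalize(9.8, SOLR_MAX));            // a Solr hit
>>>         System.out.println(normalize(0.63, PRIMO_CENTRAL_MAX));  // a Primo Central hit
>>>     }
>>> }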
>>>
>>> You might, in fact, want to boost your local collection a bit, or
>>> otherwise make this more complex (this may be more art than
>>> science). But that seems (to my simple way of thinking, anyway) to be
>>> the basic issue.
>>>
>>> You then order these 20 results (10 from Solr, 10 from Primo
>>> Central) by this "normalized" score. Take the top 10, and display
>>> them.
>>>
>>> In some cases, all 10 results from Solr might score higher than the
>>> 10 from Primo Central -- or vice versa -- which is why we need to
>>> grab as many from each system as we intend to display per page.
>>>
>>> In many cases, though, you'll have a mix of Solr and Primo Central
>>> results. So let's say our first page included seven from Solr and
>>> three from Primo Central. When the user clicks to the next page,
>>> we'll grab 10 more results from Solr (starting with number eight),
>>> and 10 more from Primo Central (starting with number four), and do
>>> the same operation again.
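>>>
>>> As a rough sketch of that merge-and-page step (the Hit class and all
>>> the names here are invented for illustration; real code would carry
>>> the whole record, not just a score):
>>>
>>> import java.util.*;
>>>
>>> public class PageMerge {
>>>
>>>     static class Hit {
>>>         final String source;   // "solr" or "primo"
>>>         final double score;    // already normalized to the common scale
>>>         Hit(String source, double score) { this.source = source; this.score = score; }
>>>     }
>>>
>>>     // Both lists arrive sorted best-first by their own engines. Combine one
>>>     // page's worth from each, re-sort by normalized score, keep the top pageSize.
>>>     static List<Hit> mergePage(List<Hit> solrHits, List<Hit> primoHits, int pageSize) {
>>>         List<Hit> combined = new ArrayList<>();
>>>         combined.addAll(solrHits.subList(0, Math.min(pageSize, solrHits.size())));
>>>         combined.addAll(primoHits.subList(0, Math.min(pageSize, primoHits.size())));
>>>         combined.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
>>>         return combined.subList(0, Math.min(pageSize, combined.size()));
>>>     }
>>>
>>>     public static void main(String[] args) {
>>>         List<Hit> solr = Arrays.asList(new Hit("solr", 0.91), new Hit("solr", 0.74));
>>>         List<Hit> primo = Arrays.asList(new Hit("primo", 0.88), new Hit("primo", 0.52));
>>>         for (Hit h : mergePage(solr, primo, 2)) {
>>>             System.out.println(h.source + " " + h.score);
>>>         }
>>>         // To page forward, count how many displayed hits came from each source
>>>         // and use those counts as the per-source start offsets next time.
>>>     }
>>> }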
>>>
>>> I'm not sure what you do in the event the user decides to skip to
>>> page eight.
>>>
>>> I am pretty sure there are more sophisticated ways of doing this,
>>> maybe even down at the Solr level, where Josh was working.
>>>
>>> One of the lightning talks at Code4Lib this year also talked about
>>> this (briefly, of course), and had some interesting ideas. This was
>>> with Summon, rather than Primo Central, but the former is also based
>>> on Lucene (Solr, in fact), so I think all the ideas would transfer:
>>>
>>> http://www.slideshare.net/villadsen/summasummon-something-something
>>>
>>> --Dave
>>>
>>> ==================
>>> David Walker
>>> Library Web Services Manager
>>> California State University
>>> http://xerxes.calstate.edu
>>> ________________________________________
>>> From: Next generation catalogs for libraries
>>> [NGC4LIB_at_LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind
>>> [rochkind_at_JHU.EDU]
>>> Sent: Monday, March 21, 2011 10:36 AM
>>> To: NGC4LIB_at_LISTSERV.ND.EDU
>>> Subject: Re: [NGC4LIB] The next generation of discovery tools (new
>>> LJ article)
>>>
>>> You're doing this with in-house code? Interesting.
>>>
>>> So do you grab ALL the results from both Solr and Primo, so you can
>>> merge them? I'm surprised that doesn't create a performance problem.
>>> I'm also curious how you manage to normalize relevancy from the two
>>> systems.
>>>
>>> All in all, this is interesting work, with a lot of tricky details, and
>>> I think you should write up the details for the Code4Lib Journal. :)
>>>
>>> Jonathan
>>>
>>> On 3/21/2011 1:17 PM, Joshua Greben wrote:
>>>> Hi Jonathan,
>>>>
>>>> Yes, the blending in this context is exactly as you describe. More
>>>> specifically, we take the top relevancy-ranked results from our
>>>> Solr engine, and the top relevancy-ranked results from the Primo
>>>> Central API. Each set of results comes with a relevancy score for
>>>> each document in the set of results. Naturally, the scores are
>>>> different, so we normalize them. The results that come from the Solr
>>>> engine are accessed through a Java object that the Solr software
>>>> creates (a Collection of Solr document records). The results from
>>>> the Primo Central API are parsed with the help of a schema provided
>>>> by Ex Libris as part of the Primo Central API. The Primo Central
>>>> record is then converted into a data type that matches what is
>>>> already in the Solr Java object. We insert the Primo Central
>>>> results into that object, using the scores from the normalized
>>>> relevancy algorithm to find the right place to insert; that is how we
>>>> are able to display a combined relevancy-ranked result set.
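>>>>
>>>> (Conceptually, that insert step looks something like the sketch below;
>>>> ScoredDoc and the names are stand-ins for illustration, not our actual
>>>> classes.)
>>>>
>>>> import java.util.List;
>>>>
>>>> public class BlendedInsert {
>>>>
>>>>     static class ScoredDoc {
>>>>         final String id;
>>>>         final double normalizedScore;
>>>>         ScoredDoc(String id, double normalizedScore) {
>>>>             this.id = id;
>>>>             this.normalizedScore = normalizedScore;
>>>>         }
>>>>     }
>>>>
>>>>     // "blended" stays sorted by descending normalized score. Walk to the first
>>>>     // entry that scores lower than the incoming record and insert there.
>>>>     static void insertByScore(List<ScoredDoc> blended, ScoredDoc incoming) {
>>>>         int i = 0;
>>>>         while (i < blended.size()
>>>>                 && blended.get(i).normalizedScore >= incoming.normalizedScore) {
>>>>             i++;
>>>>         }
>>>>         blended.add(i, incoming);
>>>>     }
>>>> }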
>>>>
>>>> Josh
>>>>
>>>>
>>>> Joshua Greben
>>>> Systems Librarian/Analyst
>>>> Florida Center for Library Automation
>>>> 5830 NW 39th Ave,
>>>> Gainesville, FL 32606
>>>> 352-392-9020 ext 246
>>>> jgreben_at_ufl.edu
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mar 21, 2011, at 12:37 PM, Jonathan Rochkind wrote:
>>>>
>>>>> On 3/18/2011 2:41 PM, Jean Phillips wrote:
>>>>>> At FCLA and other places there are people working on the ability
>>>>>> to include megaindexes of articles and others into their locally
>>>>>> developed or open source Discovery Tool. We've recently been
>>>>>> able to blend the results from Ex Libris's Primo Central Index in
>>>>>> with our local repository of metadata from the catalog and
>>>>>> digital collections sources.
>>>>> When you say "blend"... what do you mean exactly? You really mean
>>>>> blend, hits from the external index interspersed with hits from
>>>>> your local repository, with some relevance algorithm merging the
>>>>> two sets?
>>>>>
>>>>> How ever did you manage that?
>>>>>
>>
>>
>> --
>> Karen Coyle
>> kcoyle@kcoyle.net http://kcoyle.net
>> ph: 1-510-540-7596
>> m: 1-510-435-8234
>> skype: kcoylenet
>>
>
>