Re: The next generation of discovery tools (new LJ article)

From: Jonathan Rochkind <rochkind_at_nyob>
Date: Mon, 28 Mar 2011 17:16:57 -0400
To: NGC4LIB_at_LISTSERV.ND.EDU
On 3/28/2011 4:47 PM, Karen Coyle wrote:
> I don't think we should give up ranking -- the fact that it works some
> of the time is a reason to keep doing it. But I surely hope that we
> don't consider it the solution to the large retrieved set problem.

Yeah, I consider it _part_ of a solution to that problem, but it really 
shines in providing a solution to a different problem, the recall vs. 
precision problem.  Do you provide a search with very high recall but 
low precision (almost everything you want is in there, but a whole lot 
of things you don't want are too)?  Or do you provide a search with 
very high precision but low recall (almost nothing you don't want is in 
there, but there's a LOT of useful stuff that's not in there either)?

Relevancy ranking lets you provide a high-recall, low-precision search 
(almost everything you want is there, but a whole bunch of things you 
don't want are too), but does a very good job of making it likely the 
stuff you want will be FIRST.  So the first few percent of hits in such 
a search, taken on their own, can be fairly high recall AND high 
precision, something that you can't generally accomplish any other way.
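
To make that concrete, here's a minimal sketch (Python; the document IDs 
and the perfectly-ranked result set are made up for illustration, not 
taken from any real system):

    # Precision: of what the search returned, what fraction did you want?
    def precision(retrieved, relevant):
        return len(set(retrieved) & set(relevant)) / len(retrieved)

    # Recall: of what you wanted, what fraction did the search return?
    def recall(retrieved, relevant):
        return len(set(retrieved) & set(relevant)) / len(relevant)

    relevant_docs = ["doc%d" % i for i in range(10)]

    # High-recall, low-precision search: all 10 relevant docs retrieved,
    # buried among 90 you don't want -- but ranking put them first.
    ranked_hits = relevant_docs + ["noise%d" % i for i in range(90)]

    print(precision(ranked_hits, relevant_docs))        # 0.1 -- low
    print(recall(ranked_hits, relevant_docs))           # 1.0 -- high

    # Take just the top slice of the ranked list; it scores high on BOTH:
    print(precision(ranked_hits[:10], relevant_docs))   # 1.0
    print(recall(ranked_hits[:10], relevant_docs))      # 1.0

A real ranking is never that perfect, of course, so the top slice won't 
actually hit 1.0 on both -- but that's the mechanism.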

You still need other tools for limiting large result sets.  Some of 
the most typical include facet-based 'partitioning' (it relies on having 
good metadata, which is why Google doesn't do much of it; Google does 
some, but people don't generally use 'em much there, which I don't think 
necessarily extends to our databases and users and uses.  We DO have 
decent metadata, and smaller databases than Google, etc.)   Another 
typical one is allowing the user to make their search lower recall but 
higher precision by doing a "fielded" search (just search in Title 
fields, not All Fields); or doing a phrase search; or doing more 
sophisticated, generalized types of 'phrase' searches, like (word A 
within 5 words of word B).  (Most of our interfaces don't have a way to 
enter a search like that, although Solr/Lucene can do such searches; see 
the sketch below.)
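
For the curious, here's roughly what those look like at the Solr/Lucene 
level.  A sketch only -- the field names ("title", "format") are made-up 
examples; what's actually available depends on your schema:

    # Lucene/Solr query syntax for the techniques above (written as
    # Python strings just so they can be annotated; the field names
    # here are hypothetical).

    fielded   = 'title:copyright'        # search only the Title field
    phrase    = '"open access"'          # exact phrase
    proximity = '"ranking relevance"~5'  # the two words within 5 words
                                         # of each other (Lucene "slop")

    # Facet 'partitioning' is just extra params on the Solr request, e.g.
    #   facet=true&facet.field=format

None of that is exotic on the Solr side; the gap is in giving users an 
interface to get at it.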

Every time you increase precision and decrease recall (such as by any of 
the methods above), you may be excluding SOME items that would be 
useful.  But you are getting rid of items that are NOT useful too, to 
give you a more manageable result set.  It is an iterative process, 
interacting with the search interface, changing your search in various 
directions, to eventually find what you want. (See Marcia Bates's 
"Berrypicking" model.) We can try to provide more sophisticated tools 
for this process, and we can make it as easy to do as possible -- I 
don't think we can eliminate it, until we have artificially intelligent 
computers that are also psychic (that is, never).

Sophisticated searching will ALWAYS be a skill. Librarians, take heart. 
We're not going to come up with an interface that reads your mind and 
gives you exactly the documents you want, without need for any search 
skills.

> kc
>
> Quoting "Beacom, Matthew"<matthew.beacom_at_YALE.EDU>:
>
>> Karen,
>>
>> I don't see how the evidence David provided or Jonathan's analysis
>> would lead us to conclude that ranking is a crapshoot. The 1st ten
>> in any half-sensible ranking of a half-sensible search will not be
>> merely as likely to be relevant as the 10th ten (which is what I
>> think you meant by "a crapshoot"); they will be more likely to be
>> relevant.
>>
>> The rankings are crude approximations of relevancy, but they are
>> often pretty helpful. And a savvy searcher, who is after more than
>> the first likely thing that comes up, may be able to sort through
>> what rose up in the rankings to perform more suitable searches or
>> re-sort the results by another vector or reduce the results by some
>> facet or facets.
>>
>> Matthew
>>
>> -----Original Message-----
>> From: Next generation catalogs for libraries
>> [mailto:NGC4LIB_at_LISTSERV.ND.EDU] On Behalf Of Karen Coyle
>> Sent: Monday, March 28, 2011 4:20 PM
>> To: NGC4LIB_at_LISTSERV.ND.EDU
>> Subject: Re: [NGC4LIB] The next generation of discovery tools (new
>> LJ article)
>>
>> Thank you, David. This confirms Jonathan's analysis, that the set is
>> compared to itself and therefore does not flatten out tail-like as I
>> expected. That said, the most important part of what Jonathan said was
>> that there is no particular correlation between Solr's determination
>> of ranking and what the user experiences when looking at the results
>> in a linear fashion.
>>
>> Can we just conclude that, with a few exceptions, ranking is a crapshoot?
>>
>> kc
>>
>
>
Received on Mon Mar 28 2011 - 17:17:58 EDT