Re: Cablegate from Wikileaks: a case study

From: Michele Newberry <fclmin_at_nyob>
Date: Mon, 6 Dec 2010 16:15:13 -0500
To: NGC4LIB_at_LISTSERV.ND.EDU
Jonathan,
   Thanks for this well-articulated explanation.  Because ours is a 
Solr-based system too, it is particularly relevant (in the original 
meaning) to our situation.  But this isn't unique to Solr.  We had 
Endeca underlying our NGC until this past summer, when we swapped out 
the engine but kept the same chassis.  This was purely for cost-cutting 
reasons -- Endeca was performing fine, but it wasn't "free".  To see it 
in action: http://catalog.fcla.edu, where your one-word searches should 
get you what you want every time.  More than one word might get you 
what you really need.  ;-)

  - Michele

On 12/6/2010 3:58 PM, Jonathan Rochkind wrote:
> Ranking in search results anywhere, as with all software, is based on 
> defining certain rules that will be followed to order the results.
>
> Even in Google, citation (link) analysis is actually just one small 
> part of the rules; it's the one they are most famous for, because it's 
> the big innovation that let their search results satisfy users so much 
> better than the competition when they first started. Even then it 
> wasn't the only component in the rules, and since then Google has 
> added all sorts of other tweaks too.  The details are carefully 
> guarded by Google, but if you search around (on Google, heh) you can 
> find various articles (from Google and from third-party observers) 
> explaining some of the basic concepts.  Commercial online businesses 
> are of course especially interested in reverse engineering Google's 
> rules to get their pages to the top.
>
> Even looking at the link analysis aspect of Google, the key innovation 
> there was not actually to put "popular" (oft-linked-to) sites first. 
> They do that, but the real genius of Google's citation/link analysis 
> was the insight that the words people use in a link pointing to a site 
> are very useful metadata describing that site -- the words many people 
> use to describe a site will probably average out to often match the 
> words a searcher might use when looking for content like that site. 
> That's really the genius element there.
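>
> A toy sketch of that idea, purely illustrative (not Google's actual 
> code, and the URLs below are made up): gather the anchor text of every 
> link pointing at a page and index it alongside the page's own text, so 
> a search can match words that other people used to describe the page:
>
>     # Illustrative only: aggregate inbound anchor text per target page.
>     from collections import defaultdict
>
>     links = [
>         # (linking page, target page, anchor text)
>         ("http://a.example", "http://solr.example", "open source search engine"),
>         ("http://b.example", "http://solr.example", "full-text search server"),
>     ]
>
>     anchor_text = defaultdict(list)
>     for _source, target, text in links:
>         anchor_text[target].append(text)
>
>     # The aggregated anchor text becomes an extra searchable field for the
>     # target page, so a query for "search engine" can match it even if the
>     # page itself never uses those exact words.
>     print(anchor_text["http://solr.example"])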
>
> No matter what, this kind of ranking is an attempt to give users 
> something that satisfies their needs when they enter some words in a 
> search box -- all you've got is the words, which are kind of a 
> surrogate for what the user really wants, and you try to figure out 
> how to make most of the users satisfied most of the time. The fact 
> that we all use Google every single day and find it invaluable shows 
> that this is possible.  But it's _definitely_ an 'opinionated' 
> endeavor, trying to make your system satisfy as well as possible in as 
> many searches as possible -- it's not like there is some physical 
> quantity "satisfactoriness" that just has to be measured; it's a set 
> of software tricks to take a user's query and, on average, in as many 
> cases as you can, give the user what will satisfy them. [For that 
> matter, it occurs to me now, this is philosophically very similar to 
> the reference interview -- the user _says_ something, and you need to 
> get at what they really want/need.  The difference is that in software 
> it all has to be encoded into guidelines/rules/instructions/heuristics; 
> you don't get to have a human creatively having a conversation with 
> another human.]
>
> Some folks at the University of Wisconsin provided an accessible 
> summary of how their Solr-based experimental catalog does results 
> ordering, which you may find helpful.  (Solr is open source software 
> for textual search that many of us are using to develop library 
> catalog search interfaces.) 
> http://forward.library.wisconsin.edu/moving-forward/?p=713
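>
> For a concrete flavor of the kind of knobs involved: Solr's 
> dismax/edismax query parsers let you weight fields against each other, 
> e.g. "a title match counts ten times as much as a match anywhere 
> else."  A hypothetical example (the URL, core name, and field names 
> are made up here, not Wisconsin's actual configuration):
>
>     # Hypothetical Solr query showing field boosts with the edismax parser.
>     import requests
>
>     params = {
>         "q": "on the road",
>         "defType": "edismax",
>         # Weight title matches most heavily, then author, then everything else.
>         "qf": "title^10 author^5 subject^2 text^1",
>         "wt": "json",
>     }
>     response = requests.get("http://localhost:8983/solr/catalog/select",
>                             params=params)
>     for doc in response.json()["response"]["docs"]:
>         print(doc.get("title"))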
>
> There are some other things going on that that blog post doesn't 
> mention, too. In particular, one of the key algorithms in Solr (or 
> really in the Lucene software Solr is built on) involves examining 
> term frequency. If your query consists of several words and one of 
> them is very rare across the corpus, matches for that word will be 
> boosted higher.  Also, a document containing that search word many 
> times will be boosted higher than a document containing that word just 
> once. There is an actual, fairly simple mathematical formula behind 
> that, which has turned out, on average, to be valuable in ranking 
> search results to give users what they meant; it's kind of the 
> foundational algorithm in Solr/Lucene. But in a particular 
> application/domain, additional tweaking is often required, and all 
> that stuff mixes together, resulting in something that is described by 
> a fairly complex mathematical formula and is not an exact science 
> (Google's own ranking isn't either!).
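>
> The formula in question is the classic tf-idf weighting that Lucene's 
> default scoring is built on (plus length normalization, coordination, 
> and boosts).  A toy version, just to show the shape of it -- not 
> Lucene's exact formula:
>
>     # Toy tf-idf scoring -- illustrative only, not Lucene's exact formula.
>     import math
>
>     docs = {
>         "doc1": "on the road again and again",
>         "doc2": "the road not taken",
>         "doc3": "cooking on a budget",
>     }
>
>     def tf_idf(term, doc_id):
>         words = docs[doc_id].split()
>         tf = words.count(term)                              # term frequency in this document
>         df = sum(term in d.split() for d in docs.values())  # documents containing the term
>         idf = math.log(len(docs) / (1.0 + df)) + 1.0        # rarer terms weigh more
>         return tf * idf
>
>     for doc_id in docs:
>         print(doc_id, round(tf_idf("road", doc_id), 3))
>
> Rare query words end up dominating the score, which is usually what 
> the searcher intends.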
>
> Additionally, a Solr search engine, when given a multiple-word query, 
> might boost documents higher when those words are found in proximity 
> to each other. Or it might allow results that don't match all the 
> words, but boost documents higher the more of the words they contain.
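>
> In the dismax/edismax parsers those two behaviors map to the "pf" 
> (phrase fields) and "mm" (minimum should match) parameters.  A 
> hypothetical example of that tuning (the field names are made up, but 
> the parameters are real):
>
>     # Hypothetical edismax parameters, sent to the same /select handler
>     # as in the earlier example.
>     params = {
>         "q": "gone with the wind",
>         "defType": "edismax",
>         "qf": "title author text",
>         "pf": "title^20 text^5",  # boost docs where the query words appear close together
>         "mm": "2<75%",            # with more than 2 words, require ~75% of them to match
>         "wt": "json",
>     }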
>
> Anyhow, I've spent some time describing this because it is indeed 
> absolutely crucial that catalogers, reference/research librarians, and 
> pretty much all librarians understand the basic ideas here.  These are 
> the tools of our trade now, whether you're using a free web tool like 
> Google, a licensed vendor database like EBSCO, or a locally installed 
> piece of software (open source or proprietary) like a library catalog.
>
>
> On 12/6/2010 3:20 PM, Weinheimer Jim wrote:
>> Michele Newberry wrote:
>> <snip>
>> Oh my goodness -- isn't this exactly what we do when we "tweak" our
>> relevance ranking algorithms in our own systems?  We call it the "On the
>> road" tweak -- doing what we need to do to make this obvious titles
>> appear on the first page of the results preferably near the very top.
>> You could also call it the "Gone with the wind" tweak or even the
>> "Nature" tweak.
>> </snip>
>>
>> On 12/6/2010 2:10 PM, Jonathan Rochkind wrote:
>> <snip>
>> Of COURSE Google's algorithms are the result of subjective human
>> judgements in how to best apply available technology to meet user
>> needs. This should surprise nobody who knows that software isn't magic;
>> it just does exactly what programmers tell it to.
>> </snip>
>>
>> Interesting reactions. Google very clearly tweaked its results based 
>> on a story from the NY Times, and the purpose was to downgrade 
>> certain results based on what they considered to be the "greater 
>> good" or something like that. The articles very clearly pointed out 
>> that being able to do this is *incredibly powerful* in terms of 
>> societal impact, and I agree. After all, people trust Google.
>>
>> I confess, I have never understood relevance ranking in library 
>> catalogs, although I do understand the concept rather clearly in 
>> general search engines such as Google, which is based on various 
>> types of citation analysis. In this article, Google pretty much 
>> admitted that they tweak results based on political considerations 
>> (i.e. articles in the NY Times). How would Google have tweaked things 
>> during the US Civil War? Or during WWI? What else is Google doing 
>> today that we don't know about? Do libraries tweak results based on 
>> political considerations? I hope not.
>>
>> I brought up these articles as examples of some very difficult 
>> matters that the entire information world needs to deal with today, 
>> since these matters often have very tangible consequences for society.
>>
>> James L. Weinheimer  j.weinheimer_at_aur.edu
>> Director of Library and Information Services
>> The American University of Rome
>> Rome, Italy
>> First Thus: http://catalogingmatters.blogspot.com/
>>
>

-- 
~NOTE EMAIL ADDRESS CHANGE TO FCLMIN_at_UFL.EDU~~~~~~~~~~~~~~~~~~~
Michele Newberry        Assistant Director for Library Services
Florida Center for Library Automation              352-392-9020
5830 NW 39th Avenue                          352-392-9185 (fax)
Gainesville, FL  32606                           fclmin_at_ufl.edu
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Received on Mon Dec 06 2010 - 16:17:20 EST