Re: Cablegate from Wikileaks: a case study

From: Jonathan Rochkind <rochkind_at_nyob>
Date: Mon, 6 Dec 2010 15:58:46 -0500
To: NGC4LIB_at_LISTSERV.ND.EDU
Ranking of search results, anywhere, works the way all software does: you 
define certain rules, and the system follows them to order the results.

Even in Google, citation (link) analysis is actually just one small part 
of the rules; it's the one they are most famous for, because it's the 
big innovation that let their search results satisfy users so much 
better than the competition when they first started. Even then it wasn't 
the only component in the rules, and since then Google has added all 
sorts of other tweaks too. The details are carefully guarded by 
Google, but if you search around (on Google, heh) you can find various 
articles (from Google and from third-party observers) explaining some of 
the basic concepts. Commercial online businesses are of course 
especially interested in reverse-engineering Google's rules to get their 
pages to the top.

Even looking at the link-analysis aspect of Google, the key innovation 
there was not actually to put "popular" (oft-linked-to) sites first. 
They do that, but the actual genius element of Google's citation/link 
analysis was realizing that the words people use in a link pointing to a 
site are very useful metadata describing that site -- the words many 
people use to describe a site will, on average, often match the words a 
searcher might use when looking for content like that site. That's 
really the genius element there.
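
To make that concrete, here is a toy sketch of the idea (my own 
illustration, nothing to do with Google's actual code; the links and 
words are invented):

    from collections import defaultdict

    # Hypothetical inbound links: (anchor text someone wrote, page it points to).
    links = [
        ("world factbook country profiles", "http://example.gov/factbook"),
        ("the cia world factbook", "http://example.gov/factbook"),
        ("pictures of cats", "http://example.com/cats"),
    ]

    # Index each page under the words other people used when linking to it.
    anchor_index = defaultdict(set)
    for anchor_text, url in links:
        for word in anchor_text.lower().split():
            anchor_index[word].add(url)

    # A searcher typing "factbook" can now find the page via how *other
    # people* described it, even if the page itself never uses that word.
    print(anchor_index["factbook"])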

No matter what, this kind of ranking is an attempt to give users 
something that satisfies their needs when they enter some words in a 
search box -- all you've got is the words, which are kind of a surrogate 
for what the user really wants, and you try to figure out how to make 
most of the users satisfied most of the time. The fact that we all use 
Google every single day and find it invaluable shows that this is 
possible.  But it's _definitely_ an 'opinionated' endeavor, trying to 
make your system satisfy as well as possible in as many searches as 
possible -- it's not like there is some physical quantity 
"satisfactoriness" that just has to be measured; it's 
software tricks to try to take a user's query and, on average, in as 
many cases as you can, give the user what will satisfy them. [For that 
matter, it occurs to me now that this is philosophically very similar to 
the reference interview -- the user _says_ something, and you need to get 
at what they really want/need. The difference is that in software it all 
needs to be encoded into guidelines/rules/instructions/heuristics for 
the software; you don't get to have a human creatively having a 
conversation with another human.]

Some folks at the University of Wisconsin provided an accessible summary 
of how their Solr-based experimental catalog does results ordering, which 
you may find helpful. (Solr is open source software for textual search 
that many of us are using to develop library catalog search interfaces.) 
http://forward.library.wisconsin.edu/moving-forward/?p=713

There are some other things going on that that blog post doesn't mention, 
too. In particular, one of the key algorithms in Solr (or really in the 
Lucene software Solr is based on) involves examining term frequency. If 
your query consists of several words, and one of them is very rare 
across the corpus, matches for that word will be boosted higher. 
Also, a document containing that search word many times will be boosted 
higher than a document containing that word just once. There is an 
actual, simple mathematical formula behind this -- one that has turned 
out, in general and on average, to be valuable in ranking search results 
to give users what they meant; it's kind of the foundational algorithm in 
Solr/Lucene. But in a particular application/domain, additional tweaking 
is often required, and all that stuff mixes together, resulting in 
something described by a fairly complex mathematical formula, and 
it is not an exact science (as Google's own ranking is not either!).
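
As a rough sketch of that tf-idf idea -- simplified, and leaving out the 
length normalization and other factors the real Lucene formula includes 
-- it looks something like this (the numbers at the end are made up):

    import math

    def idf(num_docs, doc_freq):
        # Terms that are rare across the corpus (low doc_freq) get a larger
        # weight; roughly the shape used in classic Lucene scoring.
        return 1.0 + math.log(num_docs / (doc_freq + 1.0))

    def tf(term_freq):
        # More occurrences within a document help, with diminishing returns.
        return math.sqrt(term_freq)

    def score(query_terms, doc_term_freqs, corpus_doc_freqs, num_docs):
        # Sum a tf * idf contribution for each query term found in the document.
        return sum(
            tf(doc_term_freqs[t]) * idf(num_docs, corpus_doc_freqs.get(t, 0))
            for t in query_terms
            if doc_term_freqs.get(t, 0) > 0
        )

    # "cablegate" is rare in this imaginary corpus, so a document matching it
    # scores much higher than one matching only the common word "wikileaks".
    print(score(["wikileaks", "cablegate"],
                {"wikileaks": 3, "cablegate": 1},
                {"wikileaks": 500, "cablegate": 5},
                num_docs=100000))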

Additionally, a Solr search engine, when given a multi-word query, 
might boost documents higher when those words are found in proximity to 
each other. Or it might allow results that don't match all the words, but 
boost results higher the more of the words they contain.
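
For instance, with Solr's dismax query parser, a request might be built 
like the following (the field names and boost values are invented for 
illustration; qf, pf, and mm are real dismax parameters, but check the 
Solr docs for your version):

    from urllib.parse import urlencode

    # A hypothetical dismax request. qf lists the fields (with made-up boosts)
    # to search; pf gives an extra boost when the query words occur near each
    # other as a phrase; mm ("minimum should match") says how many of the
    # query words a document must contain, so partial matches are allowed but
    # fuller matches rank higher.
    params = {
        "defType": "dismax",
        "q": "wikileaks diplomatic cables",
        "qf": "title^5 subject^2 text",
        "pf": "title^10 text^2",
        "mm": "2",
    }
    print("http://localhost:8983/solr/select?" + urlencode(params))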

Anyhow, I spent some time describing this because it is indeed 
absolutely crucial that catalogers, reference/research librarians, and 
pretty much all librarians understand the basic ideas here. These are 
the tools of our trade now, whether you're using a free web tool like 
Google, a licensed vendor database like EBSCO, or a locally installed 
piece of software (open source or proprietary) like a library catalog.


On 12/6/2010 3:20 PM, Weinheimer Jim wrote:
> Michele Newberry wrote:
> <snip>
> Oh my goodness -- isn't this exactly what we do when we "tweak" our
> relevance ranking algorithms in our own systems?  We call it the "On the
> road" tweak -- doing what we need to do to make this obvious titles
> appear on the first page of the results preferably near the very top.
> You could also call it the "Gone with the wind" tweak or even the
> "Nature" tweak.
> </snip>
>
> On 12/6/2010 2:10 PM, Jonathan Rochkind wrote:
> <snip>
> Of COURSE Google's algorithms are the result of subjective human
> judgements in how to best apply available technology to meet user
> needs. This should surprise nobody that knows that software isn't magic,
> it just does exactly what programmers tell it to.
> </snip>
>
> Interesting reactions. Google very clearly tweaked its results based on a story from the NY Times, and the purpose was to downgrade certain results based on what they considered to be the "greater good" or something like that. The articles very clearly pointed out that being able to do this is *incredibly powerful* in terms of societal impact, and I agree. After all, people trust Google.
>
> I confess, I have never understood relevance ranking in library catalogs, although I do understand the concept rather clearly in general search engines such as Google, which is based on various types of citation analysis. In this article, Google pretty much admitted that they tweak results based on political considerations (i.e. articles in the NY Times). How would Google have tweaked things during the US Civil War? Or during WWI? What else is Google doing today that we don't know? Do libraries tweak results based on political considerations? I hope not.
>
> I brought up these articles as examples of some very difficult matters that the entire information world needs to deal with today, since these matters often have very tangible consequences for society.
>
> James L. Weinheimer  j.weinheimer_at_aur.edu
> Director of Library and Information Services
> The American University of Rome
> Rome, Italy
> First Thus: http://catalogingmatters.blogspot.com/
>
Received on Mon Dec 06 2010 - 16:00:08 EST