Re: Cablegate from Wikileaks: a case study

From: Karen Coyle <lists_at_nyob>
Date: Mon, 6 Dec 2010 14:34:51 -0800
To: NGC4LIB_at_LISTSERV.ND.EDU
I'm less sanguine about Google's choices. We know for a fact that they  
downgrade the ranking of sexually explicit materials -- NOT because  
they think users don't want those materials (we know they do) but  
because Google doesn't want to be seen as a purveyor of dirty stuff.  
Amazon has done the same thing by singling out sexually explicit  
materials and eliminating any ranking that would allow them to turn up  
on ranked lists. (There was the big flap when Amazon "accidentally"  
eliminated all "gay" literature, whether sexually explicit or not, and  
had to back down because of the outcry.) This, to my mind, goes beyond  
tweaking things to bring in more advertising revenue. Google claims  
that porn searching was burdening the system, which is why they had to  
do that, but I really doubt that's the reason. This is a cultural  
choice and a business choice. The difficulty is that, as private  
corporations, neither of them has to reveal that these choices are  
being made, so there's no way to be aware of them while searching.  
(I'd be happy if Wikileaks took on a few corporations - Google, BP,  
various banks....)

kc

Quoting Jonathan Rochkind <rochkind_at_JHU.EDU>:

> Ranking in search results anywhere, as with all software, is based  
> on defining certain rules that will be followed to order the results.
>
> Even in Google, citation (link) analysis is actually just one small  
> part of the rules; it's the one they are most famous for, because  
> it's the big innovation that let their search results satisfy users  
> so much better than the competition when they first started. Even  
> then it wasn't the only component in the rules, and since then  
> Google has added all sorts of other tweaks too. The details are  
> carefully guarded by Google, but if you search around (on Google,  
> heh) you can find various articles (from Google and from third-party  
> observers) explaining some of the basic concepts. Commercial online  
> businesses are, of course, especially interested in reverse  
> engineering Google's rules to get their pages to the top.
>
> Even looking at the link-analysis aspect of Google, the key  
> innovation there was not actually to put "popular" (oft-linked-to)  
> sites first. They do that, but the real genius of Google's  
> citation/link analysis was the insight that the words people use in  
> a link pointing to a site are very useful metadata describing that  
> site -- the words many people use to describe a site will, on  
> average, often match the words a searcher might use when looking for  
> content like that site.
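>
> To make that concrete, here is a toy sketch in Python (purely  
> illustrative, with made-up URLs -- not Google's actual code). The  
> trick is to index the words of each link's anchor text against the  
> page the link points *to*, not the page the link is on:
>
>     # Each link is a (source_url, target_url, anchor_text) triple.
>     links = [
>         ("blog.example.com", "solr.example.org", "open source search engine"),
>         ("lib.example.edu", "solr.example.org", "full text search software"),
>     ]
>
>     anchor_index = {}
>     for source, target, anchor in links:
>         for word in anchor.lower().split():
>             # Credit each anchor word to the TARGET page, as metadata.
>             anchor_index.setdefault(word, set()).add(target)
>
>     # A query word now finds pages that other people describe with it,
>     # even if the page itself never uses the word.
>     print(anchor_index.get("search"))  # {'solr.example.org'}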
>
> No matter what, this kind of ranking is an attempt to give users  
> something that satisfies their needs when they enter some words in a  
> search box -- all you've got is the words, which are a kind of  
> surrogate for what the user really wants, and you try to figure out  
> how to make most of the users satisfied most of the time. The fact  
> that we all use Google every single day and find it invaluable shows  
> that this is possible. But it's _definitely_ an 'opinionated'  
> endeavor, trying to make your system satisfy as well as possible in  
> as many searches as possible -- it's not as if there were some  
> physical quantity, "satisfactoriness", that just has to be measured;  
> it's software tricks to try to take a user's query and, on average,  
> in as many cases as you can, give the user what will satisfy them.  
> [For that matter, it occurs to me now that this is philosophically  
> very similar to the reference interview -- the user _says_  
> something, and you need to get at what they really want/need. The  
> difference is that in software it all needs to be encoded into  
> guidelines/rules/instructions/heuristics for the software; you don't  
> get to have a human creatively having a conversation with another  
> human.]
>
> Some folks at the University of Wisconsin provided an accessible  
> summary of how their Solr-based experimental catalog does results  
> ordering, which you may find helpful. (Solr is open source software  
> for textual search that many of us are using to develop library  
> catalog search interfaces.)  
> http://forward.library.wisconsin.edu/moving-forward/?p=713
>
> There are some other things going on that that blog post doesn't  
> mention. In particular, one of the key algorithms in Solr (or really  
> in the Lucene software Solr is based on) involves examining term  
> frequency. If your query consists of several words, and one of them  
> is very rare across the corpus, matches for that word will be  
> boosted higher. Also, a document containing that search word many  
> times will be boosted higher than a document containing that word  
> just once. There is an actual, fairly simple mathematical formula  
> behind that -- known as tf-idf -- which has turned out, on average,  
> to be valuable in ranking search results to give users what they  
> meant; it's kind of the foundational algorithm in Solr/Lucene. But  
> in a particular application/domain, additional tweaking is often  
> required, and all that stuff mixes together, resulting in something  
> that is described by a fairly complex mathematical formula and is  
> not an exact science (as Google's own ranking is not either!).
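>
> For the curious, here is a back-of-the-envelope version of that  
> formula in Python (heavily simplified -- Lucene's real scoring adds  
> field norms, boosts, and other factors, but this is the tf-idf  
> core):
>
>     import math
>
>     def score(query_words, doc, corpus):
>         """Rough tf-idf score of one document for a bag-of-words query."""
>         total = 0.0
>         for w in query_words:
>             tf = doc.count(w)                      # occurrences in this doc
>             df = sum(1 for d in corpus if w in d)  # docs containing the word
>             idf = 1.0 + math.log(len(corpus) / (df + 1.0))
>             total += math.sqrt(tf) * idf * idf     # rare words count for more
>         return total
>
> A document containing a query word many times scores higher (the  
> "tf" part), and a word that occurs in few documents counts for much  
> more than a common one (the "idf" part).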
>
> Additionally, a Solr search engine, when given a multiple-word  
> query, might boost documents higher when those words are found in  
> proximity to each other. Or it might allow results that don't match  
> all the words, but rank documents higher the more of the words they  
> contain.
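>
> In Solr that sort of behavior is mostly configuration rather than  
> code. A hypothetical dismax query, with weights I've made up purely  
> for illustration, might look like this (the annotations are mine,  
> not part of the syntax):
>
>     q=civil war letters
>     defType=dismax
>     qf=title^10 subject^5 text   <- fields to search, with relative weights
>     pf=title^20 text^2           <- boost docs containing the words as a phrase
>     mm=2<75%                     <- with more than 2 words, only 75% must match
>
> The pf ("phrase fields") parameter is the proximity boost, and mm  
> ("minimum match") is what lets a document match on only some of the  
> words.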
>
> Anyhow, I've spent some time describing this because it is indeed  
> absolutely crucial that catalogers, reference/research librarians,  
> and pretty much all librarians understand the basic ideas here.  
> These are the tools of our trade now, whether you're using a free  
> web tool like Google, a licensed vendor database like EBSCO, or a  
> locally installed piece of software (open source or proprietary)  
> like a library catalog.
>
>
> On 12/6/2010 3:20 PM, Weinheimer Jim wrote:
>> Michele Newberry wrote:
>> <snip>
>> Oh my goodness -- isn't this exactly what we do when we "tweak" our
>> relevance ranking algorithms in our own systems?  We call it the "On the
>> road" tweak -- doing what we need to do to make this obvious titles
>> appear on the first page of the results preferably near the very top.
>> You could also call it the "Gone with the wind" tweak or even the
>> "Nature" tweak.
>> </snip>
>>
>> On 12/6/2010 2:10 PM, Jonathan Rochkind wrote:
>> <snip>
>> Of COURSE Google's algorithms are the result of subjective human
>> judgements in how to best apply available technology to meet user
>> needs. This should surprise nobody who knows that software isn't magic;
>> it just does exactly what programmers tell it to.
>> </snip>
>>
>> Interesting reactions. Google very clearly tweaked its results  
>> based on a story from the NY Times, and the purpose was to  
>> downgrade certain results based on what they considered to be the  
>> "greater good" or something like that. The articles very clearly  
>> pointed out that being able to do this is *incredibly powerful* in  
>> terms of societal impact, and I agree. After all, people trust  
>> Google.
>>
>> I confess, I have never understood relevance ranking in library  
>> catalogs, although I do understand the concept rather clearly in  
>> general search engines such as Google, which is based on various  
>> types of citation analysis. In this article, Google pretty much  
>> admitted that they tweak results based on political considerations  
>> (i.e. articles in the NY Times). How would Google have tweaked  
>> things during the US Civil War? Or during WWI? What else is Google  
>> doing today that we don't know about? Do libraries tweak results based on  
>> political considerations? I hope not.
>>
>> I brought up these articles as examples of some very difficult  
>> matters that the entire information world needs to deal with today,  
>> since these matters often have very tangible consequences for  
>> society.
>>
>> James L. Weinheimer  j.weinheimer_at_aur.edu
>> Director of Library and Information Services
>> The American University of Rome
>> Rome, Italy
>> First Thus: http://catalogingmatters.blogspot.com/
>>
>



-- 
Karen Coyle
kcoyle@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet
Received on Mon Dec 06 2010 - 17:35:48 EST