Ranking of search results anywhere, as with all software, is based on
defining certain rules that are followed to order the results. Even at
Google, citation (link) analysis is actually just one small part of the
rules; it's the one they are most famous for, because it's the big
innovation that let their search results satisfy users so much better
than the competition's when they first started. Even then it wasn't the
only component in the rules, and since then Google has added all sorts
of other tweaks too. The details are carefully guarded by Google, but
if you search around (on Google, heh) you can find various articles
(from Google and from third-party observers) explaining some of the
basic concepts. Commercial online businesses are of course especially
interested in reverse-engineering Google's rules to get their pages to
the top.
Even looking at the link analysis aspect of Google, the key innovation
there was not actually to put "popular" (oft-linked-to) sites first.
They do that, but the actual genius element of Google's citation/link
analysis was that the words people use in a link pointing to a site are
very useful metadata describing that site -- the words many people use
to describe a site will probably, on average, match the words a
searcher might use when looking for content like that site. That's
really the genius element there.
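(To make that concrete, here's a toy sketch -- not Google's actual
implementation, obviously, just the general idea -- of folding the
anchor text of inbound links into the searchable terms for a target
page. The page names and link data are made up for illustration.)

    from collections import defaultdict

    # Made-up inbound links: (source page, target page, anchor text).
    links = [
        ("blog.example.org", "cooking.example.com", "great pasta recipes"),
        ("food.example.net", "cooking.example.com", "easy weeknight pasta"),
        ("news.example.com", "cooking.example.com", "recipes"),
    ]

    # Fold the words other people use in their links into the index
    # terms for each target page, alongside the page's own text.
    anchor_terms = defaultdict(list)
    for source, target, anchor in links:
        anchor_terms[target].extend(anchor.lower().split())

    # A query word that matches what *other people* call the page can
    # now retrieve it, even if the page itself never uses that word.
    print(anchor_terms["cooking.example.com"])
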
No matter what, this kind of ranking is an attempt to give users
something that satisfies their needs when they enter some words in a
search box -- all you've got is the words, which are kind of a surrogate
for what the user really wants, and you try to figure out how to make
most of the users satisfied most of the time. The fact that we all use
Google every single day and find it invaluable shows that this is
possible. But it's _definitely_ an 'opinionated' endeavor, trying to
make your system satisfy as well as possible in as many searches as
possible -- it's not as if there is some physical quantity
"satisfactoriness" that just has to be measured; it's software
techniques that try to take a user's query and, on average, in as many
cases as you can, give the user what will satisfy them. [For that
matter, it occurs to me now that this is philosophically very similar
to the reference interview -- the user _says_ something, and you need
to get at what they really want or need. The difference is that in
software it all needs to be encoded into guidelines, rules,
instructions, and heuristics for the software; you don't get a human
creatively having a conversation with another human.]
Some folks at the University of Wisconsin provided an accessible
summary of how their Solr-based experimental catalog does result
ordering, which you may find helpful. (Solr is open source software for
textual search that many of us are using to develop library catalog
search interfaces.)
http://forward.library.wisconsin.edu/moving-forward/?p=713
There are some other things going on that that blog post doesn't
mention, too. In particular, one of the key algorithms in Solr (or
really in the Lucene software Solr is built on) involves examining term
frequency. If your query consists of several words, and one of them is
very rare across the corpus, matches for that rare word will be boosted
higher. Also, a document containing a search word many times will be
boosted higher than a document containing that word just once. There is
an actual, simple mathematical formula behind that -- one that has
turned out, in general and on average, to be valuable in ranking search
results to give users what they meant -- and it's kind of the
foundational algorithm in Solr/Lucene. But in a particular
application/domain, additional tweaking is often required, and all of
that mixes together, resulting in something described by a fairly
complex mathematical formula, and it is not an exact science (as
Google's own ranking is not either!).
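(If you're curious about the shape of that formula, here's a rough
Python sketch of classic "tf-idf" scoring. It is a simplification of
what Lucene actually does -- Lucene's formula also adds things like
length normalization and field boosts -- and the tiny "corpus" is made
up, but it shows the two effects described above.)

    import math

    # A tiny made-up corpus: each document is just a list of words.
    corpus = {
        "doc1": "on the road again".split(),
        "doc2": "the open road".split(),
        "doc3": "gone with the wind".split(),
    }

    def tf_idf(term, doc_id):
        doc = corpus[doc_id]
        tf = doc.count(term)  # more occurrences -> higher score
        df = sum(term in words for words in corpus.values())
        if tf == 0 or df == 0:
            return 0.0
        idf = math.log(len(corpus) / df)  # rarer in corpus -> higher score
        return tf * idf

    def score(query, doc_id):
        # Sum tf-idf over the query words; real Lucene scoring is more
        # involved, but this is the core idea.
        return sum(tf_idf(term, doc_id) for term in query.lower().split())

    for doc_id in corpus:
        print(doc_id, round(score("open road", doc_id), 3))
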
Additionally, a Solr search engine, when given a multi-word query,
might boost documents higher when those words are found in proximity to
each other. Or it might allow results that don't match all the words,
but boost a document higher the more of the words it contains.
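(In Solr, that kind of behavior is usually set up through the dismax
query parser's parameters: roughly, "qf" says which fields to match,
"pf" adds a boost when the query words appear close together, and "mm"
says how many of the query words must match. The little sketch below
just assembles such a request against a made-up local Solr instance;
the field names and boost values are invented for illustration, and a
real setup would tune them differently.)

    from urllib.parse import urlencode

    # Hypothetical local Solr instance; field names and boosts are
    # made up for illustration.
    solr_url = "http://localhost:8983/solr/catalog/select"

    params = {
        "defType": "dismax",
        "q": "on the road",
        # Fields to search, with per-field boosts.
        "qf": "title^5 author^2 subject",
        # Extra boost when the words appear near each other in the title.
        "pf": "title^10",
        # At least 2 of the query words must match.
        "mm": "2",
        "fl": "id,title,score",
        "wt": "json",
    }

    print(solr_url + "?" + urlencode(params))
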
Anyhow, I've spent some time describing this because it is indeed
absolutely crucial that catalogers, reference/research librarians, and
pretty much all librarians understand the basic ideas here. These are
the tools of our trade now, whether you're using a free web tool like
Google, a licensed vendor database like EBSCO, or a locally installed
piece of software (open source or proprietary) like a library catalog.
On 12/6/2010 3:20 PM, Weinheimer Jim wrote:
> Michele Newberry wrote:
> <snip>
> Oh my goodness -- isn't this exactly what we do when we "tweak" our
> relevance ranking algorithms in our own systems? We call it the "On the
> road" tweak -- doing what we need to do to make this obvious titles
> appear on the first page of the results preferably near the very top.
> You could also call it the "Gone with the wind" tweak or even the
> "Nature" tweak.
> </snip>
>
> On 12/6/2010 2:10 PM, Jonathan Rochkind wrote:
> <snip>
> Of COURSE Google's algorithms are the result of subjective human
> judgements in how to best apply available technology to meet user
> needs. This should surprise nobody that knows that software isn't magic,
> it just does exactly what programmers tell it to.
> </snip>
>
> Interesting reactions. Google very clearly tweaked its results based on a story from the NY Times, and the purpose was to downgrade certain results based on what they considered to be the "greater good" or something like that. The articles very clearly pointed out that being able to do this is *incredibly powerful* in terms of societal impact, and I agree. After all, people trust Google.
>
> I confess, I have never understood relevance ranking in library catalogs, although I do understand the concept rather clearly in general search engines such as Google, which is based on various types of citation analysis. In this article, Google pretty much admitted that they tweak results based on political considerations (i.e. articles in the NY Times). How would Google have tweaked things during the US Civil War? Or during WWI? What else is Google doing today that we don't know? Do libraries tweak results based on political considerations? I hope not.
>
> I brought up these articles as examples of some very difficult matters that the entire information world needs to deal with today, since these matters often have very tangible consequences for society.
>
> James L. Weinheimer j.weinheimer_at_aur.edu
> Director of Library and Information Services
> The American University of Rome
> Rome, Italy
> First Thus: http://catalogingmatters.blogspot.com/
>