Re: The next generation of discovery tools (new LJ article)

From: Till Kinstler <kinstler_at_nyob> Date: Thu, 31 Mar 2011 13:07:27 +0200 To: NGC4LIB_at_LISTSERV.ND.EDU

On 30.03.2011 22:49, Jonathan Rochkind wrote:
> On 3/30/2011 11:33 AM, Till Kinstler wrote:
>> 1/0) is dating back to the 1970s. And some conclusions from that made it
>> even into libraryland as early as the 1980s (see, for example, writings by
>> Charles Hildreth, one article from 1987 even being titled "Beyond
>> Boolean: ...").
> 
> Thanks for the pseudo-cite, I'll track it down and add it to my white
> paper trying to explain the point of relevancy ranking in a library
> context: 
> http://bibwild.wordpress.com/2011/03/28/information-retrieval-and-relevance-ranking-for-librarians/
> 
> 
> If you have any other such cites,

Hmmm, I am just finishing a book chapter on that, to be published in
http://www.facetpublishing.co.uk/title.php?id=716-6&category_code=820
(yes, sorry, it's a plain old book :-)). I explain the information
retrieval background of search engines there and try to put it into the
context of library search. I think it is helpful to understand the
basics of retrieval models to understand the paradigm shift from Boolean
"exact match" to "best match" searching now finally happening in library
search. And to understand why it makes sense...
I have a list of references for the chapter, so perhaps I'll just share
it (don't have access to it at the moment, but can publish it later).
BTW: Eric has a very nice introduction to TF*IDF weighting on the web:
http://infomotions.com/blog/2009/04/tfidf-in-libraries-part-i-for-librarians/
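
To make the weighting idea concrete, here is a minimal sketch of one
textbook TF*IDF variant (raw term frequency times log(N/df)). The toy
documents and the function name tfidf are made up for illustration, and
this is not Lucene's exact scoring formula, which adds further factors
such as length normalization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document.

    weight = (count of term in doc) * log(N / number of docs containing term)
    This is one textbook variant, not Lucene's exact formula.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy "catalogue" of three titles.
docs = [
    "library search and ranking",
    "boolean search in the library catalogue",
    "ranking cookie recipes",
]
weights = tfidf(docs)
```

A term that occurs in every document gets weight 0 here, while terms
that occur in only one document get the highest weights.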

> Also interested in your opinion
> of my essay in general, Tim.

I'll have a look...

> Although I have to admit, I don't think my time is particularly
> efficiently spent trying to improve on the relevancy ranking algorithm
> itself of lucene 

No, but Lucene and Solr allow lots of improvement/hacking of the basic
term-statistics-based ranking through boosting and additional
calculations, without touching the Lucene code. For example: boosting
certain collections (e.g. those in a branch library based on the user's
location, or cookie recipes before Christmas); boosting online
documents at night (when the library is closed) or for remote access; and
boosting locally available books when someone searches from a local
terminal in a library (online access might be less convenient then than
walking to a shelf...). I don't know, there are many ideas (and of course
they might not make sense); we should just experiment...
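
As a rough sketch of what such context-dependent boosting could look
like, the following builds request parameters for Solr's edismax query
parser, using its bq (boost query) parameter. The field names (access,
collection), the opening hours and the boost factors are all
assumptions made up for this example, not taken from any real index.

```python
from datetime import time

def boost_params(query, now, on_site, branch=None,
                 opens=time(8, 0), closes=time(20, 0)):
    """Build request parameters for Solr's edismax query parser.

    Field names (access, collection) and boost factors are hypothetical.
    """
    params = {"defType": "edismax", "q": query, "bq": []}
    library_open = opens <= now < closes
    if on_site and library_open:
        # Local terminal in an open library: favour items on the shelf.
        params["bq"].append("access:local^2.0")
    else:
        # Remote access or closed library: favour online documents.
        params["bq"].append("access:online^2.0")
    if branch:
        # Favour the collection of the user's branch library.
        params["bq"].append(f'collection:"{branch}"^1.5')
    return params
```

The ranking formula itself stays untouched; only the query sent to Solr
changes with the user's context, which makes this cheap to experiment
with.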

> Work is being done.

Yes, but the general usefulness of relevance ranking is still often
questioned, at least in my environment. I think the question is not
whether to do it at all, but how to do it...

> The "facet
> limit" tools we all provide are one such technique, but I think we can
> make em work better and be more powerful without being more confusing.

Yes, that's another important field, definitely. The common "a bunch of
facets to the right or left of search results" can't be the final thing.
For example, some facet values are better visualized than displayed in
lists, as we did with the year-of-publication slider on
http://finden.nationallizenzen.de/ (I think you did work on that in
Blacklight as well).
Or we use facets to find the most frequent author in a result set, look
the name up in Wikipedia and show the first lines of the article
together with search results "by" that author adjacent to the "normal"
search results. That way you can nicely disambiguate the "works of/works by"
problem in non-fielded search, e.g.:
http://finden.nationallizenzen.de/Search/Results?lookfor=lise+meitner
Here, authority data (or even better: linked data) for linking to
external resources would be very useful...
Or how to design a user interface for the "author facet" that allows
intuitive selection of "works by A AND B" as well as "works by A OR B"?
One problem with facets is library data, again. If the OCLC report on
MARC tag usage says that less than 15% of records have e.g. a DDC number
or any other subject classification, it just doesn't make sense to offer
DDC facets, I think. By browsing the DDC facet, you lose 85% of items
right away (because they are just not browsable by DDC)... OK, local
data should be more homogeneous in terms of classification and
subject headings than the aggregated WorldCat, but still... Be careful
about which "fields" you offer facets on...
Yes, facets are definitely another interesting field. But the same applies
as with ranking: play with them to find out what you can do beyond today's
out-of-the-box solutions...
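
The caution about sparsely populated fields can be checked before
offering a facet. This is only a sketch: the records, the field names
and the 50% threshold are invented for illustration, and a real check
would run against the index rather than a Python list.

```python
def field_coverage(records, field):
    """Fraction of records with a non-empty value for `field`."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field))
    return filled / len(records)

def facetable_fields(records, candidates, min_coverage=0.5):
    """Keep only candidate facet fields populated in enough records."""
    return [f for f in candidates
            if field_coverage(records, f) >= min_coverage]

# Hypothetical records: only one of four carries a DDC number.
records = [
    {"author": "Author A", "year": 1987, "ddc": "025.04"},
    {"author": "Author B", "year": 1999},
    {"author": "Author C", "year": 2005},
    {"author": "Author D", "year": 2011},
]
print(facetable_fields(records, ["author", "year", "ddc"]))
```

With this toy data the DDC field covers only a quarter of the records
and is dropped, so the interface would show author and year facets only.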

Till

-- 
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kinstler@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de