Some interesting discussion on bibtip and recommender systems triggered
a few thoughts:
1. This is a difficult problem, and commercial organisations are willing
to put significant resources into it. Would some library organisation
(OCLC, library consortia, national or international groups of libraries)
be interested in running a competition along the lines of the Netflix
Prize (http://www.netflixprize.com/)?
2. We would be in a much better position to build a critical mass of
recommender (and other) information if we had a single source of bib
records which acted as a hub for linking. OpenLibrary has a goal that
might fulfil this, but Worldcat is clearly a good starting point as
well. We need to exploit linking (another example of high-cost
behaviour, so more likely to be meaningful). Wikipedia entries rank
highly in Google results, and this must be partly because Wikipedia has
become a de facto standard for linking basic reference information. If
we could emulate that, so there was a central resource to which people
'just linked' when they cited bib information, we would really start to
exploit the latent information available in the web. (We are really
late on this one, and have lots of catching up to do - I would guess
Amazon is currently the main receiver of 'bib' linking on the web.)
Kevin said: "When it comes to things that really require critical mass -
like tagging, reviews and ratings - we need to begin to develop
platforms that can link users, usage and bib data across universities (I
am really not qualified to comment on the needs of public libraries in
this context). We have the technology now to begin doing this."
I would argue all the technology is there, and has been for years. If
there were a recognised hub of bib information which people used to
link to and to tag (using existing bookmarking services like delicious,
digg etc.), this would work right now - technology is simply not an
issue. Being slightly less ambitious: lamentably few catalogues offer
the ability to easily bookmark a bib record without session info in the
URL, even though this would be simple from a technology perspective.
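To show quite how simple: a minimal sketch in Python (the parameter
names and URL pattern here are made up, not any particular vendor's)
that strips session parameters to leave a stable, bookmarkable URL:

    from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

    # Parameter names that carry session state rather than identifying
    # the record; these names are hypothetical examples.
    SESSION_PARAMS = {"session", "sid", "jsessionid"}

    def permalink(opac_url):
        parts = urlparse(opac_url)
        # Keep only the query parameters that identify the record itself.
        query = [(k, v) for k, v in parse_qsl(parts.query)
                 if k.lower() not in SESSION_PARAMS]
        return urlunparse(parts._replace(query=urlencode(query)))

    print(permalink("http://opac.example.org/record?id=12345&sid=abc9"))
    # -> http://opac.example.org/record?id=12345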
3. Just to highlight another recommender system under development: bX,
from Herbert van de Sompel and others. I've blogged a recent
announcement about this from Ex Libris at
http://www.meanboyfriend.com/overdue_ideas/2008/05/bx-and-sfx-for.html
(at the bottom). This is an attempt to exploit user behaviour
information gathered by OpenURL resolvers. I suspect it will also be
susceptible to some of the problems discussed by Tim and Kevin;
however, my guess is that it captures higher-cost behaviour than
looking at full OPAC records does.
Owen
-----Original Message-----
From: Next generation catalogs for libraries
[mailto:NGC4LIB_at_LISTSERV.ND.EDU] On Behalf Of Kevin M Kidd
Sent: 17 May 2008 05:09
To: NGC4LIB_at_LISTSERV.ND.EDU
Subject: Re: [NGC4LIB] bibtip (How it works)
Hello Tim,
Thanks for the very interesting reply. I apologize for calling the "ant
navigation" analogy faulty. Reinforcement is indeed something that
needs to be considered in such systems; I am just not sure it is - in
the case of BibTip - as big a problem as you believe. For a detailed
technical explanation of the algorithm used in BibTip, see Research and
Advanced Technology for Digital Libraries, 2003 (Lecture Notes in
Computer Science 2769), by Dr. Andreas Geyer-Schulz et al.
Beyond that, I am not entirely clear what else you are disagreeing with.
My responses are below:
>So, I'm very interested in this topic and think BibTip is an
interesting test. That said, let me disagree with most of your email.
>>> In fact, your "ant navigation" analogy is a faulty one in this case.
BibTip works astoundingly well, and it is not because it simply follows
"where users go". Instead, BibTip uses "Repeat Buying Theory" as a
framework to statistically analyze user search behavior. Repeat Buying
Theory is a highly successful and well-tested statistical framework to
describe the regularity of repeat-buying behavior of consumers within a
distinct period of time.
>So, I don't want to get snippy, but I am pretty familiar with the
statistical problems of this approach and of others, and have done
extensive work-and with appropriately large datasets-on LibraryThing.
Pardon the contradiction, but nothing about my description of the "ant
tracking" problem was faulty, so let me explain it again.
Kevin: Large datasets are obviously better, as is a longer period over
which the co-browsing data has been collected and analyzed. BibTip has
been running at Karlsruhe since 2002, and their cataloged collection
size is more than 15 million documents across 23 libraries. Statistical
problems
aside, it has been quite a successful experiment. I encourage you to
search their catalog at
http://www.ubka.uni-karlsruhe.de/hylib/suchmaske.html
>>> The developers of BibTip at Karlsruhe University very skillfully
adapted this theory to the session-based search behavior of library OPAC
users. The key is that BibTip only records the inspection of the full
details of an individual bib record selected from a larger list of
search results. It does not "follow" the user.
>I understand it doesn't "follow the user" on two legs, but it records
what books discrete users visit and then makes statistical inferences
from that. This amounts to a picture of where users are and where they
go, albeit without the order-of-events data which, actually, would
improve it.
Kevin: As I understand it, the algorithm is not concerned at all about
discrete users. It is concerned simply with discrete pairs of records
and session identifiers. Record pairs are analyzed, no other aspect of
user behavior matters. I would be interested to know why order-of-events
matters in this context.
>>> In this framework, clicking on and reading the full details of a
given record is an economic choice. The choice of one record over all of
the others in a given list is viewed much like an individual's choice to
purchase one thing over another during a given trip to the store. There
is a real cost in time (i.e. an economic cost)
for the user each time he/she selects and views a record. It can be
assumed that the "search cost" to a user is high enough that he/she is
willing only to view the details of a record which is truly of interest.
Users, in effect, are self-selecting. That is, users with common
interests will select the same documents, and, since recommendations are
only provided to users from the full details view, we can surmise that
recommendations are only offered to interested users.
>All this is obvious. But systems built on statistics have flaws. One
of them is reinforcement-the ant problem. The more you recommend
something the more people will follow your recommendations and the
more co-occurrences there will be. Success breeds success. The ant
trail, once started, has a tendency to get stronger. This is true in
any recommendation system, unless you adjust for it explicitly-which
has statistical problems too.
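A rough sketch of the kind of explicit adjustment Tim mentions -
down-weighting co-views that arrive via the recommender's own links, so
that suggestions do not inflate the counts that generate them (the 0.1
factor is purely an assumption):

    from collections import Counter

    pair_weights = Counter()

    def record_coview(record_a, record_b, via_recommendation):
        # Co-views reached through a recommendation link count for less,
        # damping the "ant trail" feedback loop.
        pair = tuple(sorted((record_a, record_b)))
        pair_weights[pair] += 0.1 if via_recommendation else 1.0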
>At base, the quality of the recommendations from a screen-watching
model is related to two factors: (1) how good is the searching? and
(2) how costly is failure?
>Take Amazon purchases. Finding a book on Amazon is easy, and the cost
of a purchase is high. So there are few mistakes-people tend to buy the
books they want, so the signal is strong. The main problems relate to
agency-people buying books for other people. That's hard to correct for, so
you just hope that you can discern signal from noise when the quantity
of data is great.
>Unfortunately, the OPAC is not Amazon. Leaving aside scale-WorldCat
receives 0.7% of the hits Amazon does!-there's the issue of search
quality. Bad OPAC search means that people spend a lot of time on
detail pages they weren't aiming for. Search for "Da Vinci Code" at
SPL, for example, and you get a page full of results without an actual
paper copy of the Da Vinci Code. Results matter. People hate reading
results, they hate revising searches and they hate looking at
subsequent search pages. Many would rather dive into a record quickly
and back out. They'd rather dive in and see if they can leverage
partial success. Personally, I've learned to click on some version I
don't want-the Spanish or the eBook-knowing there's an author link
there I can leverage to get to the paper version I really want. I
suspect I am not alone. And each time I do it, I create noisy data for
a recommendation system.
Kevin: Noise is indeed an issue, though, since BibTip functions without
needing to know how something was searched (e.g. it does not record the
search terms that got a user to a particular record), I am unclear as to
how the quality of a particular search matters. Diving into a record
quickly and backing-out quickly falls well within the repeat-buying
theory model. Indeed, repeat buying theory predicts random co-browsing
(diving-in) very well - in fact, that is the very point of the theory!
Recommendations are based upon those records that fall outside of
regular random co-browsing - the outliers. To quote Dr. Geyer-Schulz
(one of the developers of BibTip):
"Ehrenberg's theory faithfully models the noise part of buying
processes. That is, repeat-buying theory is capable of predicting random
co-purchases of consumer goods. Intentionally bought combinations of
consumer goods--a six-pack of beer, spareribs, potatoes, and barbecue
sauce for dinner, for example--are outliers. In this sense, Ehrenberg's
theory acts as a filter to suppress noise (stochastic regularity) in
buying behavior." [From: Andreas Geyer-Schulz, Andreas Neumann and Anke
Thede. An Architecture for Behavior-Based Library Recommender Systems.
Information Technology and Libraries 22(4), p. 169 (2003).]
That is, *most* of the given transactions are noise. Search terms and
strategies are irrelevant. The co-browsing of records that lies outside
of what is called the logarithmic series distribution is the browsing
that needs to be examined for potential recommendations.
I would point out that your example of clicking on a Spanish or an
e-book version of a record to get to a paper version would not
necessarily constitute noise in this model.
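To sketch the outlier idea (a simplification, not BibTip's actual
implementation): fit the logarithmic series distribution, with
P(K = k) = -q**k / (k * ln(1 - q)), to the observed co-inspection
counts, and flag the pairs that sit far out in its tail:

    import math

    def lsd_pmf(k, q):
        # Logarithmic series distribution: P(K = k) = -q**k / (k * ln(1 - q))
        return -(q ** k) / (k * math.log(1.0 - q))

    def fit_q(mean_count, tol=1e-6):
        # The LSD mean, -q / ((1 - q) * ln(1 - q)), increases with q,
        # so q can be fitted to the observed mean by bisection.
        lo, hi = tol, 1.0 - tol
        while hi - lo > tol:
            q = (lo + hi) / 2.0
            mean = -q / ((1.0 - q) * math.log(1.0 - q))
            lo, hi = (q, hi) if mean < mean_count else (lo, q)
        return (lo + hi) / 2.0

    def outliers(pair_counts, alpha=0.01):
        # Flag record pairs whose co-inspection count is unlikely under
        # the fitted LSD, i.e. unlikely to be random co-browsing.
        q = fit_q(sum(pair_counts.values()) / len(pair_counts))
        return [pair for pair, k in pair_counts.items()
                if 1.0 - sum(lsd_pmf(j, q) for j in range(1, k)) < alpha]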
>>> In order to build relationships among given documents, BibTip
analyzes record pairs.
>That's the beginning of a good system. Among various improvements,
Amazon keeps track of order, because order matters. If 50 people who
look at the Spanish-language Harry Potter also look at the English,
that's interesting. That 49 of 50 went from the Spanish to the
English, that's more interesting. It suggests the Spanish should
recommend the English more highly than the reverse.
Kevin: In this model, both the Spanish and the English versions would be
recommended (and correctly, I think).
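For what it's worth, the order-sensitive refinement Tim describes is
easy to sketch (hypothetical helper names; this is not Amazon's actual
method) - count directed transitions so that X -> Y and Y -> X can be
scored separately:

    from collections import Counter

    directed = Counter()

    def record_session(views):
        # 'views' is the ordered list of records inspected in one session.
        for earlier, later in zip(views, views[1:]):
            directed[(earlier, later)] += 1

    def strength(x, y):
        # Fraction of all transitions out of x that went to y, so the
        # Spanish -> English direction can outscore the reverse.
        out = sum(c for (a, _), c in directed.items() if a == x)
        return directed[(x, y)] / out if out else 0.0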
>>> For each record X that has been viewed in the full details view of
the OPAC, a "purchase history" is built. This is simply a list of all of
the sessions in which record X has been viewed. Record X is then
compared with all other records (Y) which have been viewed in the same
session as X. For each pair of records (X,Y) that have been viewed in
the same session, a second purchase history is built. The number of
users who have viewed record X and another record Y in the same session
is statistically analyzed and the probability of a "co-inspection" of
records X and Y in a given session is calculated. A recommendation for
record X (that is, "users who liked X also liked...") is created when
record Y has been viewed in the same session more often than can be
expected from random selection.
>I do hope they have a threshold-that a single incidence of
co-occurrence will never trigger a suggestion. Otherwise it's a
privacy problem waiting to happen. John Blyberg discussed this problem
when SOPAC was released.
Kevin: See my response about noise above.
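For readers unfamiliar with the mechanics, a minimal sketch of the
purchase-history bookkeeping described above. The min_count threshold is
a naive stand-in for the repeat-buying test, and it also ensures that a
single co-occurrence never triggers a suggestion:

    from collections import defaultdict

    history = defaultdict(set)  # record id -> session ids ("purchase history")

    def record_view(session_id, record_id):
        history[record_id].add(session_id)

    def co_inspections(x, y):
        # Number of sessions in which both X and Y were viewed.
        return len(history[x] & history[y])

    def recommendations(x, min_count=3):
        # Recommend Y for X when the co-inspection count clears an
        # (assumed) threshold; BibTip's real test is statistical.
        return sorted((y for y in history
                       if y != x and co_inspections(x, y) >= min_count),
                      key=lambda y: co_inspections(x, y), reverse=True)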
>>> This "repeat buying theory" is remarkably good at automatically
determining relevant recommendations for a given item. It takes some
time for enough data to be collected so that good recommendations are
available for a substantial part of a collection, but what is the hurry?
Of course, the longer you have the algorithm running, the better your
recommendations become. The more users you have, the better your
recommendations become. But, time is on our side in this case ;-)
>Well, except that you also need to expire data, or weight it less over
time. That ten years ago people were examining the Bible and the Bible
Code together is not an accurate predictor of what Bible readers want
today.
Kevin: I am not sure about the need to create a special process to
expire data in this context. Expiration of the data occurs naturally as
a result of the recalculation of the algorithm (again, see Lecture Notes
in Computer Science 2769). Presumably people have been examining the
Bible and other related materials in the intervening 10 years? Over
time, the recommendations will reflect user preferences.
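If explicit weighting were wanted anyway, it is a small addition - a
sketch only, with the half-life purely an assumption:

    from collections import Counter

    pair_counts = Counter()
    HALF_LIFE_DAYS = 365.0  # assumed half-life; tune to the collection

    def recalc(days_since_last_run):
        # Multiply every co-inspection count by a decay factor at each
        # recalculation so old co-views fade rather than dominate.
        decay = 0.5 ** (days_since_last_run / HALF_LIFE_DAYS)
        for pair in pair_counts:
            pair_counts[pair] *= decay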
>>> Frustratingly, for all the talk here and elsewhere of the features
of next generation catalogs, I rarely find anything that convinces me
that librarians understand that collecting/harvesting and re-using user
(and usage) data is the key to most (if not all) of the services we want
these new catalogs to provide. Without seriously thinking about the
implications of harnessing collective intelligence - and taking steps
*now* to build systems that do - we are not going to get very far.
BibTip as a service is a big step in the right direction.
>Absolutely. I agree. I agree completely. I am a fan of the idea.
Nobody should take my words as a wet blanket on the fires of
experimentation. But moving past my desire to promote interesting
things and into analysis and experience with the topic, I am skeptical
on two fronts:
>1. There are real privacy implications to collecting user data. I
think they can be solved, but they cannot be dismissed. And solving
them hurts your data quality/quantity.
Kevin: My knee-jerk reaction is to dismiss the privacy implications.
But, I know that this cannot be done, as you say. I do believe that,
given the current information environment, our patrons will be much more
amenable to user/usage data collection. There are many, many
possibilities here, from user-built profiles (we have to give them a
reason to want to build those profiles, though) to more
algorithmically-analyzed usage data (BibTip is a great example of this).
>2. I am skeptical that libraries can accumulate enough high-quality
data to compete against other systems.
Kevin: You rightly point out that the critical mass problem is a big
one. But, I don't know that we *really* need to compete with anyone.
There are 14,000 students at Boston College and I can think of a lot of
things we can do with data we could readily begin collecting. When it
comes to things that really require critical mass - like tagging,
reviews and ratings - we need to begin to develop platforms that can
link users, usage and bib data across universities (I am really not
qualified to comment on the needs of public libraries in this context).
We have the technology now to begin doing this.
>For curiosity's sake, I am tempted to try the idea, leveraging the
LibraryThing for Libraries traffic. But, for the reasons above and
having experimented a lot with recommendation data from
different sources and of different qualities, I'm very skeptical that
OPAC-based path-watching will ever be a significant source of
recommendations. But you may label me an interested party-we sell
recommendation data to libraries.
Kevin: As you can tell, I completely disagree that OPAC-derived data
will never be a significant source of recommendations. BibTip is the
proof that a significant system can be built - right now with modest
technology and modest collection sizes.
That said, I would be very interested to know how LibraryThing - a
system which I admire very much - builds its recommendations.
Thanks again,
Kevin
--------------------------------------
Kevin M. Kidd, MA, MLIS
Library Applications & Systems Manager
Boston College Libraries
Phone: 617-552-1359
Fax: 617-552-1089
e-Mail: kevin.kidd_at_bc.edu
Blog: http://datadrivenlibrary.blogspot.com/