[Apologies if this comes through twice. I sent this about 14 hours ago and
haven't seen it arrive yet, so...]
Hi all,
I've been watching this discussion with some interest because I'm the
guy who implemented the browse functionality in the NLA's catalogue. I
just thought I'd jump in and confirm/deny a few things here.
Our title and uniform title browses were among the first browses
we attempted to implement, so they're currently a bit of a legacy
feature. We implemented these using a combination of Solr range queries
and sorting and they mostly sort of work, but perhaps not quite as
smoothly as the other browses (as evidenced by the 'internal server
error' that Owen managed to produce ;o). I'm on holidays at the moment,
but this will probably be revisited when I'm back at work.
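To give a feel for the approach, the basic idea is just a range query
on a title sort key, sorted by that same key, asking Solr for the first
page of titles at or after the browse point. Very roughly (this is only
a sketch; the field name title_sort and the core name biblio are
illustrative, not necessarily what we actually use):

    import java.net.URLEncoder;

    public class TitleBrowseSketch {
        public static void main(String[] args) throws Exception {
            // Browse point entered by the user, normalised the same way
            // as the sort key.
            String from = "great expectations";

            // Range query: everything at or after the browse point in
            // sort-key order.
            String query = "title_sort:[\"" + from + "\" TO *]";

            // Sorting on the same key and taking the first 20 rows gives
            // one page of the title browse.
            String url = "http://localhost:8983/solr/biblio/select"
                    + "?q=" + URLEncoder.encode(query, "UTF-8")
                    + "&sort=" + URLEncoder.encode("title_sort asc", "UTF-8")
                    + "&rows=20";
            System.out.println(url);
        }
    }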
Our other browses (names, subjects, callnumbers and series) make use of
a combination of SQLite databases and Lucene indexes. Each browse
consists of an SQLite database with a single table of two columns: a
sort key and the text of the browse heading. When we receive a request
to browse from a certain point, we can get back the pageful of headings
to display with a simple SQL SELECT statement. For each heading
listed we determine the number of titles matched and any
cross-references by performing Lucene term queries (fast) on indexes of
our bib data and authority data respectively. All of this is handled by
a Solr browse handler I've written, so all our VuFind code has to do is
hit the browse handler and style the XML it gets back.
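If it helps to make that concrete, the shape of it is roughly the
following (a sketch only, with made-up table, column, field and path
names; it also assumes the Xerial SQLite JDBC driver and a recent-ish
Lucene rather than whatever we're actually running):

    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class BrowseSketch {
        public static void main(String[] args) throws Exception {
            // One SQLite database per browse type, one table of
            // (sort_key, heading).
            Connection db = DriverManager.getConnection(
                    "jdbc:sqlite:/data/browse/names.db");

            // One page of headings at or after the requested browse point.
            PreparedStatement page = db.prepareStatement(
                    "SELECT sort_key, heading FROM headings"
                    + " WHERE sort_key >= ? ORDER BY sort_key LIMIT ?");
            page.setString(1, args[0]);  // already-normalised sort key to start from
            page.setInt(2, 20);          // rows per page

            // Lucene index of the bib data, queried per heading for hit
            // counts; the authority index is used the same way for
            // cross-references.
            IndexReader bib = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/data/solr/biblio/index")));

            ResultSet rs = page.executeQuery();
            while (rs.next()) {
                String heading = rs.getString("heading");
                // A term query on an untokenised heading field boils down
                // to a document-frequency lookup, which is why it's fast.
                int count = bib.docFreq(new Term("name_browse", heading));
                System.out.println(heading + " (" + count + ")");
            }
        }
    }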
Regarding scalability, our largest browse is the callnumber browse,
which consists of about 3 million entries (for 4 million bib records).
I've tested this SQLite approach up to 20 million entries and it
continued to perform well, so I'm not terribly worried for now. Finding
the point to browse from is effectively just searching a big sorted text
file, so I would expect lookup times to grow as O(log N) anyway. Plus, our largest
SQLite database still fits entirely in memory, so that's nice too.
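That does assume the sort key column is indexed, of course. Something
along these lines, with the same caveat that the names are invented:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class SchemaSketch {
        public static void main(String[] args) throws Exception {
            Connection db = DriverManager.getConnection(
                    "jdbc:sqlite:/data/browse/names.db");
            Statement s = db.createStatement();
            s.executeUpdate("CREATE TABLE IF NOT EXISTS headings"
                    + " (sort_key TEXT, heading TEXT)");
            // The B-tree index on sort_key is what keeps 'find the browse
            // point' logarithmic instead of a full table scan.
            s.executeUpdate("CREATE INDEX IF NOT EXISTS headings_sort"
                    + " ON headings (sort_key)");
        }
    }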
In terms of indexing performance, the SQLite databases take about 5-10
minutes to build in total, and they're built from scratch every time.
For each type of browse, we pull all the browse headings from our bib and
authority data, remove any duplicates, then load them all into an SQLite
database. My browse handler notices when these databases have been
updated and automatically reopens them, so the update is transparent.
Currently we just do these updates once per night, as this is how often
we update our main bib indexes and it makes sense to keep the updates
synchronised, but I don't see any problem with doing it more often if
there were a need.
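The rebuild itself is nothing fancy. In rough outline (again a sketch
with invented names and paths, not the real build code): dedupe and sort
the headings, bulk-load them into a fresh database file in a single
transaction, then move that file into place for the handler to pick up:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;
    import java.util.Map;
    import java.util.TreeMap;

    public class RebuildSketch {
        public static void main(String[] args) throws Exception {
            // Headings harvested from the bib and authority data; keying a
            // TreeMap on the sort key both dedupes and sorts them.
            TreeMap<String, String> headings = harvestHeadings();

            // Build into a scratch file so the live database stays usable.
            Path scratch = Paths.get("/data/browse/names.db.new");
            Files.deleteIfExists(scratch);
            Connection db = DriverManager.getConnection("jdbc:sqlite:" + scratch);
            Statement ddl = db.createStatement();
            ddl.executeUpdate("CREATE TABLE headings (sort_key TEXT, heading TEXT)");
            ddl.executeUpdate("CREATE INDEX headings_sort ON headings (sort_key)");

            db.setAutoCommit(false);  // one big transaction keeps the load quick
            PreparedStatement ins = db.prepareStatement(
                    "INSERT INTO headings (sort_key, heading) VALUES (?, ?)");
            for (Map.Entry<String, String> e : headings.entrySet()) {
                ins.setString(1, e.getKey());
                ins.setString(2, e.getValue());
                ins.executeUpdate();
            }
            db.commit();
            db.close();

            // Drop the new database over the old one; the handler notices
            // the file has changed and reopens it.
            Files.move(scratch, Paths.get("/data/browse/names.db"),
                    StandardCopyOption.REPLACE_EXISTING,
                    StandardCopyOption.ATOMIC_MOVE);
        }

        static TreeMap<String, String> harvestHeadings() {
            // Placeholder: in reality this walks the bib and authority records.
            return new TreeMap<String, String>();
        }
    }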
I'm happy to answer any questions about our implementation either on or
off list.
Cheers,
Mark
"Stephens, Owen" <o.stephens_at_IMPERIAL.AC.UK> writes:
> Bernhard,
>
> Just to understand what you are looking for in terms of Browse. The
> NLA implementation of VuFind has what I would regard as a Browse
> function - you can Browse the following:
>
> Names at http://catalogue.nla.gov.au/Browse/Names?browse=names&from=
> Subjects at
> http://catalogue.nla.gov.au/Browse/Subjects?browse=subjects&from=
> Callnumbers at
> http://catalogue.nla.gov.au/Browse/Callnumbers?browse=callnumbers&from=
> Series at
> http://catalogue.nla.gov.au/Browse/Series?browse=series&from=
>
> All these options are available in the user interface at
> http://catalogue.nla.gov.au/Browse/Home ('Browse' is an option in the
> horizontal menu under the main 'catalogue' banner)
>
> This page also offers Title and Uniform Title browsing, but these seem
> not to work in the same way at the moment (I've sent feedback about
> this)
>
> Is this browsing as you mean it? If not, what would you require
> additionally?
>
> (also you question the scalability - what scale are you thinking of?
> I'd guess that NLA is reasonably large - but I can't easily find a
> figure for the number of bib records - but obviously it may not be as
> large as other national libraries or consortium collections)