Re: Whose elephant is it, anyway? (the OLE project)

From: Till Kinstler <kinstler_at_nyob>
Date: Fri, 13 Mar 2009 09:26:27 +0100
To: NGC4LIB_at_LISTSERV.ND.EDU
Sorry, this posting is a technical one. But the answers to Bernhard's 
questions may be of interest to others, too...

Bernhard Eversberg wrote:

> How it is affected by the physical growth of the data. Does it get
> slower with every million data, and how much?

Depends on your hardware. I just expanded a Solr index from 5 million to 
about 20 million records. Its size is now about 80 GB (there is a lot of 
redundancy) and at the moment it lives on a USB disk. Searching slowed 
down a bit. With 5 million records and 50 simultaneous users sending 
searches, we had average response times of about 100 to 200 ms; now we 
have 250 to 300 ms. My guess is that the USB interface to the disk is 
the limiting factor; we will investigate that.

> How long is it to create the index?

We are indexing about 200 to 500 (bibliographic) records per second. 
Indexing speed in Solr depends on the amount of text you put into it and 
what processing you do during indexing.
When I add full-text article data (each file about 10 kB of text), the 
indexing rate drops to about 50 to 80 records per second, because there 
is much more text to process than in a bibliographic record.

> Is real-time updating possible?

You can update or add records at any time; it doesn't hurt. But they 
are only findable after sending a "commit" command to Solr. Such a 
commit may (depending on index size, Solr configuration and hardware) 
take anywhere from a few to many seconds (the index stays searchable 
during the commit, so it is not a "system blackout"). So it is not true 
real-time updating, because in a library environment you don't want to 
issue a commit after every single record update. But sending a commit 
every 10 minutes or so would be a good strategy.
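
To illustrate the "commit every 10 minutes or so" idea, here is a minimal 
Python sketch, not our actual setup. The class name PeriodicCommitter and 
the send_commit callable are hypothetical; in a real installation 
send_commit would post a <commit/> message to Solr's update handler. The 
600 second interval is just the 10 minutes mentioned above.

```python
import time
from typing import Callable


class PeriodicCommitter:
    """Count added/updated records and commit at most once per interval.

    Hypothetical helper for illustration only: send_commit stands in for
    whatever actually tells Solr to commit.
    """

    def __init__(self,
                 send_commit: Callable[[], None],
                 interval_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send_commit = send_commit
        self._interval = interval_seconds
        self._clock = clock
        self._last_commit = clock()
        self._pending = 0

    def record_added(self) -> bool:
        """Call after each add or update; returns True if a commit was sent."""
        self._pending += 1
        if self._clock() - self._last_commit >= self._interval:
            self.commit()
            return True
        return False

    def commit(self) -> None:
        """Send a commit if anything is pending, then restart the timer."""
        if self._pending:
            self._send_commit()
        self._pending = 0
        self._last_commit = self._clock()
```

Records added between two commits are simply not findable until the next 
commit, which matches the behaviour described above. Calling commit() once 
more at the end of a batch run makes the last records visible.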

> How many hours per million records for a complete
> re-index?

For indexing rates see above. There is no significant difference whether 
it's re-indexing, updating or new indexing...
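
To turn the rates above into "hours per million records", a small 
back-of-the-envelope calculation in Python may help. It assumes a constant 
indexing rate over the whole run, which is a simplification; the function 
name hours_per_million is made up for this example.

```python
def hours_per_million(records_per_second: float) -> float:
    """Hours needed to index one million records at a constant rate."""
    if records_per_second <= 0:
        raise ValueError("records_per_second must be positive")
    seconds = 1_000_000 / records_per_second
    return seconds / 3600


# Rates quoted in this post (records per second).
rates = {
    "bibliographic, slow": 200,
    "bibliographic, fast": 500,
    "full text, slow": 50,
    "full text, fast": 80,
}

for label, rate in rates.items():
    print(f"{label}: {hours_per_million(rate):.2f} h per million records")
```

With the numbers from this post that comes to roughly 0.6 to 1.4 hours per 
million bibliographic records, and roughly 3.5 to 5.6 hours per million 
full-text records.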

> Does this time grow linearly or exponentially?

About linearly.

Regards and sorry for this rather technical post,
Till

-- 
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kinstler@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de
Received on Fri Mar 13 2009 - 04:29:13 EDT