On Aug 30, 2007, at 5:16 PM, Sperr, Edwin wrote:
> I wonder how well such approaches would work in an environment
> where the length of the texts is variable and the texts
> themselves often meander from point to point? Is there another
> test corpus that models library requirements better? Anybody
> banging at the Project Gutenberg docs yet?
I have done this to a small degree. Here's how:
1. Downloaded about 14,000 texts, mostly from Project Gutenberg.
2. Used a couple of tools to extract relevant sentences and
statistically significant words. (See the sketch after this list.)
3. Updated a database of titles, creators, texts, and keywords
accordingly.
4. Generated XHTML out of the whole thing.
5. Full-text indexed it.
6. Created a browsable/searchable interface.
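For the curious, steps 2 and 5 can be approximated with nothing
fancier than TF-IDF scoring and an inverted index. Here is a minimal
Python sketch under those assumptions; the directory name, tokenizer,
and top-N cutoff are made up for illustration and are not the actual
tools behind Alex:

#!/usr/bin/env python
# Hypothetical sketch: pull "statistically significant" words out of
# a directory of plain-text e-texts with TF-IDF, then invert the
# result into a toy searchable index. Names here are illustrative.

import math
import re
from collections import Counter, defaultdict
from pathlib import Path

WORD = re.compile(r"[a-z]{3,}")  # crude tokenizer: lowercase, 3+ letters

def tokenize(text):
    return WORD.findall(text.lower())

def keywords(corpus_dir, top_n=10):
    """Return {filename: [top_n TF-IDF words]} for every *.txt file."""
    docs = {p.name: tokenize(p.read_text(errors="ignore"))
            for p in Path(corpus_dir).glob("*.txt")}
    n_docs = len(docs)
    # document frequency: how many texts each word appears in at all
    df = Counter()
    for words in docs.values():
        df.update(set(words))
    result = {}
    for name, words in docs.items():
        if not words:
            continue
        tf = Counter(words)
        # TF-IDF: frequent in this text, rare across the corpus
        scores = {w: (count / len(words)) * math.log(n_docs / df[w])
                  for w, count in tf.items()}
        result[name] = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return result

def build_index(keyword_map):
    """Invert the keyword map: word -> set of documents (step 5 in spirit)."""
    index = defaultdict(set)
    for doc, words in keyword_map.items():
        for w in words:
            index[w].add(doc)
    return index

if __name__ == "__main__":
    kw = keywords("etexts")  # hypothetical directory of e-texts
    index = build_index(kw)
    print(sorted(index.get("whale", set())))  # which docs are "about" whales?

A real system would index the full text, not just the keywords, but
the shape of the problem is the same.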
Considering I was only one person with relatively small hardware, I
think I did pretty well. It is not perfect, but it is a definite step
in the right direction. Just think what I could have done with ten or
so people. Hmmm... Try:
http://infomotions.com/alex/
--
Eric Lease Morgan
Head, Digital Access and Information Architecture Department
University Libraries of Notre Dame
(574) 631-8604