Re: indexing word documents using solr

From: Eric Lease Morgan <emorgan_at_nyob>
Date: Wed, 11 Feb 2015 09:32:20 -0500
To: CODE4LIB_at_LISTSERV.ND.EDU

On Feb 10, 2015, at 11:46 AM, Erik Hatcher <erikhatcher_at_MAC.COM> wrote:

> bin/post -c collection_name /path/to/file.doc

The almost trivial command to index a Word document in Solr, above, is most certainly appealing, but I’m wondering about the underlying index’s schema.

Tika makes every effort to extract as much metadata from Word documents as possible. This metadata includes dates, titles, authors, names of applications, last edit, etc. Some of this data can be very useful. The metadata can be packaged up as an XML file/stream and then sent to Solr for indexing. “Tastes great. Less filling.” But my question is, “To what degree does Solr know what to do with the metadata when the (kewl) command, above, is seemingly so generic? Does one need to create a Solr schema to specifically accommodate the Tika-created metadata, or do such things also come for ‘free’?”
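
To make the question concrete, here is a minimal sketch of the explicit alternative: building a request URL for Solr's extracting request handler (Solr Cell) and telling it where the Tika metadata should land. It assumes the handler is mounted at /update/extract, that the schema has a dynamic field matching attr_*, and that body text should go to a field named "text"; the host, collection name, document id, and the helper function itself are hypothetical.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, collection, doc_id, unknown_prefix="attr_"):
    """Build a Solr Cell (/update/extract) URL for one document.

    Assumption: the target schema defines a dynamic field matching
    `unknown_prefix` + "*" to catch Tika metadata it does not know about.
    """
    params = {
        "literal.id": doc_id,       # unique key supplied by the caller
        "uprefix": unknown_prefix,  # prefix for fields missing from the schema
        "fmap.content": "text",     # assumption: body text goes to a "text" field
        "commit": "true",           # make the document searchable right away
    }
    return f"{base_url.rstrip('/')}/{collection}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr instance and collection name.
    print(build_extract_url("http://localhost:8983/solr", "collection_name", "doc1"))
```

The Word file would then be POSTed to that URL as the request body. Whether bin/post applies an equivalent mapping on its own, with no schema work, is exactly what I am asking.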

-- 
Eric Morgan
Received on Wed Feb 11 2015 - 09:32:36 EST