Re: indexing word documents using solr

From: Eric Lease Morgan <emorgan_at_nyob> Date: Wed, 11 Feb 2015 09:32:20 -0500 To: CODE4LIB_at_LISTSERV.ND.EDU

On Feb 10, 2015, at 11:46 AM, Erik Hatcher <erikhatcher_at_MAC.COM> wrote:

> bin/post -c collection_name /path/to/file.doc

The almost trivial command to index a Word document in Solr, above, is most certainly appealing, but I'm wondering about the underlying index's schema.

Tika makes every effort to extract as much metadata from Word documents as possible. This metadata includes dates, titles, authors, names of applications, last edit, etc. Some of this data can be very useful. The metadata can be packaged up as an XML file/stream and then sent to Solr for indexing. "Tastes great. Less filling." But my question is, "To what degree does Solr know what to do with the metadata when the (kewl) command, above, is seemingly so generic? Does one need to create a Solr schema to specifically accommodate the Tika-created metadata, or do such things also come for 'free'?"
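
To make the question concrete, here is a minimal sketch of the kind of request I have in mind. It only builds a URL for Solr's extracting request handler (/update/extract) and sends nothing. The helper name build_extract_url, the host and port, and the "attr_" prefix are my own assumptions for illustration; in particular, whether a matching attr_* dynamic field exists depends on the schema in use, which is exactly what I am unsure about.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, collection, doc_id, unknown_prefix="attr_"):
    """Build a URL for Solr's extracting request handler (hypothetical helper).

    Assumption: the schema defines a dynamic field matching unknown_prefix
    (for example attr_*), so Tika metadata without a named field has a home.
    """
    params = {
        "literal.id": doc_id,       # value for the document's unique key
        "uprefix": unknown_prefix,  # prefix applied to fields absent from the schema
        "commit": "true",           # make the document searchable right away
    }
    return f"{base_url}/{collection}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # Host and port are assumed defaults, not taken from this thread.
    print(build_extract_url("http://localhost:8983/solr", "collection_name", "file.doc"))
```

If something like uprefix is what catches the Tika-created fields, then the remaining question is whether the generic command sets it up for me or whether I have to do so myself.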

-- 
Eric Morgan