Re: indexing word documents using solr [diacritics, resolved (i think) ]

From: Eric Lease Morgan <emorgan_at_nyob>
Date: Mon, 16 Feb 2015 16:58:08 -0500
To: CODE4LIB_at_LISTSERV.ND.EDU

I know the documents I’m indexing are written in Spanish, and by adding the following filters to my field type definition, I believe I have resolved my problem:

  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />

In other words, my searchable content is defined thus:

  <field name="text" type="text_general" indexed="true" stored="true" multiValued="false" />

And “text_general” is defined to include the filters in both the index and query sections:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
    </analyzer>
  </fieldType>
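
If some accented characters still fail to match after stemming, one possible extension (not part of the configuration above, and untested against these documents) is an explicit folding filter. The sketch below assumes it is added as the last filter in both the index and query analyzers, after the Snowball filter, and that folding characters such as ñ to n is acceptable for the collection:

```xml
<!-- Hypothetical addition: fold remaining non-ASCII characters to ASCII equivalents. -->
<!-- Assumed placement: after SnowballPorterFilterFactory in both analyzers. -->
<filter class="solr.ASCIIFoldingFilterFactory" />
```

Any change to the index-time analyzer only takes effect for documents indexed after the change, so the collection would need to be reindexed.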
Received on Mon Feb 16 2015 - 16:58:30 EST