Re: indexing word documents using solr [diacritics, resolved (i think) ]

From: Eric Lease Morgan <emorgan_at_nyob> Date: Fri, 20 Feb 2015 10:40:44 -0600 To: CODE4LIB_at_LISTSERV.ND.EDU

On Feb 16, 2015, at 4:54 PM, Levy, Michael <mlevy_at_ushmm.org> wrote:

> I think you can accomplish what you want by using ICUFoldingFilterFactory
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory
> 
> which should simply perform ICU (cf http://site.icu-project.org/) based character folding (cf. http://www.unicode.org/reports/tr30/tr30-4.html)
> 
> In schema.xml I generally have in both index and query:
> 
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.ICUFoldingFilterFactory" />

For unknown reasons, I was unable to load the ICUFoldingFilterFactory, but nonetheless my interface now works as expected. It took a combination of things. First, I needed to tell the indexer my content was Spanish, and after doing so, Solr parsed things correctly. Second, I needed to explicitly tell my Web browser that the search form and returned content were using UTF-8. This was done in the HTTP content-type header, the HTML meta tag, and even in the HTML form. Geesh! Through this whole process I also learned about Solr's edismax (extended dismax) handler. Edismax supports free-form queries as well as Boolean logic.

solr++  But also solr+- because Solr is getting more and more and more complicated.

--Eric "Lost In Chicago" Morgan
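
The UTF-8 and edismax points above can be sketched roughly as follows. This is only an illustration, not the code behind the interface described in this message: the host, the core name (collection1), and the field names (title, text) are hypothetical placeholders, and the script only builds the request URL without contacting a server. The q, defType, qf, and wt parameters are standard Solr query parameters.

```python
from urllib.parse import urlencode

def build_edismax_url(base_url, query, fields):
    """Build a Solr select URL that uses the edismax query parser.

    The query string is percent-encoded as UTF-8, so accented
    characters such as "ñ" reach Solr as the bytes it expects.
    """
    params = {
        "q": query,               # free-form or Boolean query text
        "defType": "edismax",     # select the extended dismax parser
        "qf": " ".join(fields),   # fields to search (hypothetical names)
        "wt": "json",             # ask for a JSON response
    }
    return base_url + "?" + urlencode(params, encoding="utf-8")

if __name__ == "__main__":
    # Hypothetical local Solr core; adjust to match your own installation.
    url = build_edismax_url(
        "http://localhost:8983/solr/collection1/select",
        "niño AND (escuela OR colegio)",
        ["title", "text"],
    )
    print(url)
```

Encoding the query explicitly as UTF-8 on the client side mirrors the fix described above: the browser, the form, and the server all have to agree on the character encoding, or accented characters will not match what was indexed.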