Some characters break phrase search in text field

Question

I have a text field that contains titles of TV series or movies. In several cases I want to perform a phrase query on what I'd call a pretty normal text field. This works fine for most phrases, but in some reproducible cases it simply returns nothing. It seems to be related to some "special" characters, but not all the special characters I would expect are affected.

Title:("Mission: Impossible") works
Title:("Disney A.N.T.") doesn't work
Title:("Stephen King's Shining") doesn't work
Title:("Irgendwie L. A.") works

After trying several other titles, I assume it is somehow related to the dot (.) and the apostrophe ('), and maybe other characters I don't know about yet. I have no idea where to look now.

relevant schema.xml

<fieldType name="title" class="solr.TextField" sortMissingLast="true" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            preserveOriginal="1"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            generateNumberParts="0"
            catenateWords="1"
            catenateNumbers="0"
            catenateAll="0"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

It's usually a good idea to start with the output of the analysis page in the Solr admin, to see how the text is transformed at query time and at index time, and to check token by token whether the two sides match.
– MatsLindh, Commented Jan 12, 2016 at 14:34

Karsten R. · Accepted Answer · 2016-01-13 13:13:46Z

Your question is about phrase queries on a field where the analyzer of type "index" contains a solr.WordDelimiterFilterFactory but the analyzer of type "query" does not.

As MatsLindh said, the first step is to open the analysis screen.

In this case the position value of each token is what matters.

With your attributes on solr.WordDelimiterFilterFactory, the token "King's" is converted to "king's", "king", "kings" and "s", and the last one, "s", ends up on a second position of its own.

So if you search for the phrase "Stephen King's Shining" without solr.WordDelimiterFilterFactory, the token "Shining" is on position three, but if you index with solr.WordDelimiterFilterFactory, the token "Shining" is on position four. Therefore only "Stephen King's Shining"~2 (with slop) will match, but not "Stephen King's Shining".
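The position mismatch can be illustrated with a small Python sketch. It is only a simplified model of positional phrase matching, not Lucene's actual implementation. The token positions are hard-coded from the explanation above rather than produced by a real analyzer, and the names INDEXED, QUERY and phrase_matches are made up for this example.

```python
from itertools import product

# Positions of the indexed title "Stephen King's Shining" as described above:
# the index-time WordDelimiterFilterFactory adds an extra position for "s".
# (Assumed values, copied from the explanation, not computed by Solr.)
INDEXED = {
    "stephen": [1],
    "king's": [2],
    "king": [2],
    "kings": [2],
    "s": [3],
    "shining": [4],
}

# The same phrase after the query-time analyzer, which has no
# WordDelimiterFilterFactory, so "shining" directly follows "king's".
QUERY = [("stephen", 1), ("king's", 2), ("shining", 3)]


def phrase_matches(indexed, query, slop=0):
    """Return True if all query terms occur and their relative positions
    differ from the query by at most `slop` (a simplified slop model)."""
    candidates = []
    for term, _ in query:
        if term not in indexed:
            return False
        candidates.append(indexed[term])
    for combo in product(*candidates):
        # Offset between where a term sits in the index and in the query.
        offsets = [idx - q_pos for idx, (_, q_pos) in zip(combo, query)]
        if max(offsets) - min(offsets) <= slop:
            return True
    return False


print(phrase_matches(INDEXED, QUERY))          # exact phrase: False
print(phrase_matches(INDEXED, QUERY, slop=2))  # phrase with ~2: True
```

In this model the exact phrase fails because "shining" is one position further away in the index than in the query, while the sloppy phrase tolerates that gap.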

This does not explain your problem with "Disney A.N.T.". But be aware that solr.StandardTokenizerFactory would remove the last dot, whereas solr.WhitespaceTokenizerFactory does not.
