1

I am using the text_general field type to search a Solr index, with the configuration below.

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
    <filter class="org.apache.solr.analysis.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1" preserveOriginal="1" stemEnglishPossessive="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I have indexed a lot of content and am searching with the keywords PLEASE, Please and please.

The query for PLEASE returns a very small result set.

q=%22PLEASE%22&q.op=OR&df=text&qt=%2Fselect&sort=content_name+desc&fq=content_source%3ASharepoint&AuthenticatedUserName=lalit

But Please and please return a large result set.

q=%22please%22&q.op=OR&df=text&qt=%2Fselect&sort=content_name+desc&fq=content_source%3ASharepoint&AuthenticatedUserName=lalit

q=%22Please%22&q.op=OR&df=text&qt=%2Fselect&sort=content_name+desc&fq=content_source%3ASharepoint&AuthenticatedUserName=lalit

Even though I am using WordDelimiterFilterFactory, shouldn't PLEASE, Please and please be treated as the same keyword?

Any ideas?

2 Answers

1

You have a fundamental conflict in the order of your tokenizer and filters. The SnowballPorterFilterFactory requires lowercase input to work correctly, as the Lucene Javadoc for PorterStemFilter puts it:

public final class PorterStemFilter extends TokenFilter

Transforms the token stream as per the Porter stemming algorithm. Note: the input to the stemming filter must already be in lower case, so you will need to use LowerCaseFilter or LowerCaseTokenizer farther down the Tokenizer chain in order for this to work properly!

http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html

So LowerCaseFilterFactory needs to run somewhere before the token stream reaches SnowballPorterFilterFactory.

You are also running WordDelimiterFilterFactory after stemming, which means that any new tokens generated by WordDelimiterFilterFactory will not get stemmed.

Fixing it is not as simple as putting LowerCaseFilterFactory up front: that would fix the SnowballPorterFilterFactory issue, but it would stop WordDelimiterFilterFactory from generating new tokens on case changes.

I'd suggest trying the following order:

1. StandardTokenizerFactory
2. WordDelimiterFilterFactory
3. LowerCaseFilterFactory
4. SynonymFilterFactory
5. StopFilterFactory
6. SnowballPorterFilterFactory
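
As a rough sketch only, the index analyzer from the question could be rearranged into this order as shown here. The attribute values are copied from the question's configuration. The index-time synonym filter is left out on the assumption that synonyms stay query-time only, as the commented-out filter in the question suggests.

```xml
<!-- Sketch: index analyzer for the text_general field type, reordered so that
     lowercasing happens before stemming. Goes inside the existing <fieldType>. -->
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Split on case changes and numerics while the original casing is still available -->
  <filter class="org.apache.solr.analysis.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1" preserveOriginal="1" stemEnglishPossessive="1"/>
  <!-- Normalize case so PLEASE, Please and please become the same token -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
  <!-- Stem last, on lowercase input -->
  <filter class="solr.SnowballPorterFilterFactory"/>
</analyzer>
```

Documents that are already indexed keep the terms produced by the old analyzer, so they would need to be reindexed after a change like this.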

When you use this many filters it is hard to find one perfect order, but I believe this will address your current issues. As always, I'd suggest running plenty of tests with common words from your document set to see how closely the results match your desired output.


0

You are running into this problem because you are stemming before lowercasing. I found that stemmers will stem a word differently depending on its case. Review the output below from the Analysis tab in the Solr Admin UI, with one column each for PLEASE, please and Please. You can see that PLEASE was not stemmed at all, while Please and please were both stemmed to the same root; as a result, PLEASE ends up as a different term and returns a different result set than the other two.

ST   PLEASE | please | Please
SF   PLEASE | pleas  | Pleas
WDF  PLEASE | pleas  | Pleas
SF   PLEASE | pleas  | Pleas
LCF  please | pleas  | pleas

To fix this issue I would recommend running the stemmer as the last step, after lowercasing. Because the lowercase filter still comes after the word delimiter filter, splitting on case changes keeps working.
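
For the query side, a minimal sketch of that reordering might look like the following. It assumes the same filters and files as the question's configuration; only the order changes, with LowerCaseFilterFactory moved ahead of the stemmer and the stemmer moved to the end.

```xml
<!-- Sketch: query analyzer for the text_general field type, with the stemmer
     running last on lowercase input. Goes inside the existing <fieldType>. -->
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Lowercase first so every casing of a word reaches the stemmer in the same form -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <!-- Stemmer moved from the first filter position to the last -->
  <filter class="solr.SnowballPorterFilterFactory"/>
</analyzer>
```

The index analyzer needs the matching change so that both sides produce the same terms, and the result can be checked by running PLEASE, Please and please through the Analysis tab again.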

Hope it helps.
