
I am a new developer working with MongoDB Atlas, currently building text searches over a collection of news articles. At this stage I am putting together the search pipelines, working mainly in Compass, but I am facing a problem: some of the searches contain punctuation. To pick the worst case, one of the search terms is the name of the channel "E!". In Portuguese "e" means "and", so it is a very frequent word; searching for the keyword "e" and filtering the results afterwards is not viable, since I have over 5M articles. I need this filter applied inside the search itself.

Here is a practical example: these are two articles, and I want to match the first one but not the second when filtering by the keyword "E!":

{ "_id": { "$numberLong": "1" }, "newsText": "\"E! News\" EBB Entretenimento. A vida das celebridades no programa - como Tom Cruise - que conta tudo sobre o quem é quem nos EUA.\n", "date": 20240823 }, { "_id": { "$numberLong": "2" }, "newsText": " canal 1. Programa de Entrevistas com Tom Cruise.\n", "date": 20240823 } 

From reading some threads on the MongoDB Community forum and the MongoDB docs, I believe I need a wildcard search, something like this:

{
  $search: {
    index: "default",
    compound: {
      must: [
        {
          wildcard: {
            query: "E!",
            path: "newsText",
            allowAnalyzedField: true
          }
        },
        {
          phrase: {
            path: "newsText",
            query: ["Programa", "Tom Cruise"]
          }
        }
      ]
    },
    sort: { date: -1 }
  }
}

But that alone does not work: I get no results, whereas if I delete the wildcard block I get both articles. I believe I also need to update my search index (correct me if I am wrong).

Currently my index is built to ignore accents (very common in Portuguese), so that when I need to search for a name like "andré", which in some texts is written as "andre", I don't have to search for both forms. This is my current index:

{ "analyzer": "diacriticFolder", "mappings": { "dynamic": false, "fields": { "date": { "type": "number", "representation": "int64", "indexDoubles": false, "indexIntegers": true }, "newsText": { "type": "string", "analyzer": "diacriticFolder", "indexOptions": "offsets", "store": true, "norms": "include" } } }, "analyzers": [ { "name": "diacriticFolder", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "icuFolding" } ] } ] } 

Can someone give me a hand? Thanks in advance!

If any extra info or detail is needed, I will gladly provide it.

2 Answers


Note that there is a Portuguese language analyzer available (lucene.portuguese); however, it removes "stop words" (like a, o, and e, the Portuguese equivalents of English "the" and "and"), so "E!" isn't searchable because the "E" gets dropped.

Here's an example using the building blocks of the Portuguese analyzer, minus the stop word filter: https://search-playground.mongodb.com/tools/code-sandbox/snapshots/68e55ee400e1518b21dc8689

The custom.portuguese analyzer uses the standard tokenizer, Portuguese stemmer, and then lowercases the tokens for case-insensitive search. You won't need to use the wildcard operator for the queries you've described.
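For reference, an index definition along those lines might look like the sketch below. This is a reconstruction based on the description above, not a copy of the linked playground snapshot: the analyzer name custom.portuguese and the exact filter order are assumptions, and since index definitions are plain JSON the caveats live here rather than in comments.

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "newsText": { "type": "string", "analyzer": "custom.portuguese" }
    }
  },
  "analyzers": [
    {
      "name": "custom.portuguese",
      "charFilters": [],
      "tokenizer": { "type": "standard" },
      "tokenFilters": [
        { "type": "snowballStemming", "stemmerName": "portuguese" },
        { "type": "lowercase" }
      ]
    }
  ]
}

Because there is no stop word filter, the single-letter token "e" stays in the index, which is what makes "E!" searchable at all.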


1 Comment

Thanks for the Search Playground, I didn't know about that website. With its help, some ChatGPT, and other posts from the MongoDB Community forum, I found the solution; I will put it in a post now...

OK, after some time in the Search Playground, I am answering my own question.

So first of all, I had to change my index to:

{ "analyzer": "diacriticFolder", "mappings": { "dynamic": false, "fields": { "newsText": { "analyzer": "diacriticFolder", "type": "string", "multi": { "defaultAnalyzer": { "analyzer": "diacriticFolder", "type": "string", "indexOptions": "offsets", "store": true, "norms": "include" }, "wlcdAnalyzer": { "analyzer": "lucene.keyword", "type": "string" } } } } }, "analyzers": [ { "name": "diacriticFolder", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "icuFolding" } ] } ] } 

and after that, the wildcard block in the search also had to be changed:

wildcard: {
  query: "*E!*",
  path: { value: "newsText", multi: "wlcdAnalyzer" },
  allowAnalyzedField: true
}

This not only keeps the accent-ignoring behaviour, but also enforces punctuation (in this case the exclamation mark) in the special cases where it must be respected.
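For completeness, this is roughly how that wildcard block slots back into the compound query from the question (a sketch: the index name, phrase clause, and sort are copied from the original pipeline and assume the rest of the setup is unchanged):

{
  $search: {
    index: "default",
    compound: {
      must: [
        {
          // wildcard runs against the keyword-analyzed multi field,
          // so punctuation such as "!" is preserved
          wildcard: {
            query: "*E!*",
            path: { value: "newsText", multi: "wlcdAnalyzer" },
            allowAnalyzedField: true
          }
        },
        {
          // unchanged phrase clause from the original query
          phrase: {
            path: "newsText",
            query: ["Programa", "Tom Cruise"]
          }
        }
      ]
    },
    // sorting here assumes the date field is still indexed, as in the original index
    sort: { date: -1 }
  }
}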

1 Comment

I would encourage you not to use leading wildcard characters (wildcard: *E!*), as that will effectively perform a full index scan. From your example content above, simply a text query of "E!" should suffice.
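A sketch of what that might look like, assuming the index uses an analyzer that keeps single-letter tokens (such as the stop-word-free Portuguese analyzer from the answer above):

{
  $search: {
    index: "default",
    compound: {
      must: [
        {
          // plain text operator; relies on the analyzer keeping the "e" token
          text: { query: "E!", path: "newsText" }
        },
        {
          phrase: { path: "newsText", query: ["Programa", "Tom Cruise"] }
        }
      ]
    }
  }
}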
