
I am a new developer working with MongoDB Atlas, currently building text searches over a collection of news articles. At this stage I am putting together the search pipelines, working mainly in Compass, but I am facing a problem: some of the searches contain punctuation. To pick the worst case, one of the search terms is the name of the channel "E!". In Portuguese "e" means "and", so it is a very frequent word; searching for the keyword "e" and filtering the results afterwards is not viable, since I have over 5M articles. I need this filter applied inside the search itself.

Here is a practical example: these are two articles, and I want to match the first one but not the second when filtering by the keyword "E!":

{ "_id": { "$numberLong": "1" }, "newsText": "\"E! News\" EBB Entretenimento. A vida das celebridades no programa - como Tom Cruise - que conta tudo sobre o quem é quem nos EUA.\n", "date": 20240823 }, { "_id": { "$numberLong": "2" }, "newsText": " canal 1. Programa de Entrevistas com Tom Cruise.\n", "date": 20240823 } 

From reading some threads on the MongoDB Community forum and the MongoDB docs, I believe I need a wildcard search, something like this:

{
  $search: {
    index: "default",
    compound: {
      must: [
        {
          wildcard: {
            query: "E!",
            path: "newsText",
            allowAnalyzedField: true
          }
        },
        {
          phrase: {
            path: "newsText",
            query: ["Programa", "Tom Cruise"]
          }
        }
      ]
    },
    sort: { date: -1 }
  }
}

But that alone does not work: I get no results, whereas if I delete the wildcard block I get both articles. I believe I also need to update my search index (correct me if I am wrong).

Currently my index is built to ignore accents (very common in Portuguese), so that when I need to search for a name like "andré", which in some texts is written as "andre", I don't have to search for both forms. This is my current index:

{ "analyzer": "diacriticFolder", "mappings": { "dynamic": false, "fields": { "date": { "type": "number", "representation": "int64", "indexDoubles": false, "indexIntegers": true }, "newsText": { "type": "string", "analyzer": "diacriticFolder", "indexOptions": "offsets", "store": true, "norms": "include" } } }, "analyzers": [ { "name": "diacriticFolder", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "icuFolding" } ] } ] } 

Can someone give me a hand? Thanks in advance!

If any extra info or detail is needed, I will gladly provide it.

2 Answers


Note that there is a Portuguese language analyzer available (lucene.portuguese); however, it removes "stop words" (like a, o, and e, the Portuguese equivalents of English "the" and "and"), so "E!" isn't searchable because the "E" gets dropped.

Here's an example using the building blocks of the Portuguese analyzer, minus the stop word filter: https://search-playground.mongodb.com/tools/code-sandbox/snapshots/68e55ee400e1518b21dc8689

The custom.portuguese analyzer uses the standard tokenizer, Portuguese stemmer, and then lowercases the tokens for case-insensitive search. You won't need to use the wildcard operator for the queries you've described.
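For reference, an index definition along those lines might look like the sketch below. This is a reconstruction based on the description above, not a copy of the linked playground snapshot: the analyzer name custom.portuguese and the exact filter order are assumptions, and since index definitions are plain JSON the caveats live here rather than in comments.

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "newsText": { "type": "string", "analyzer": "custom.portuguese" }
    }
  },
  "analyzers": [
    {
      "name": "custom.portuguese",
      "charFilters": [],
      "tokenizer": { "type": "standard" },
      "tokenFilters": [
        { "type": "snowballStemming", "stemmerName": "portuguese" },
        { "type": "lowercase" }
      ]
    }
  ]
}

Because there is no stop word filter, the single-letter token "e" stays in the index, which is what makes "E!" searchable at all.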


1 Comment

Thanks for the Search Playground, I didn't know about that website. With its help, some ChatGPT, and other posts from the MongoDB Community forum, I found the solution; I will put it in a post now...

OK, after some time in the Search Playground, I am answering my own question.

So first of all, I had to change my index to:

{ "analyzer": "diacriticFolder", "mappings": { "dynamic": false, "fields": { "newsText": { "analyzer": "diacriticFolder", "type": "string", "multi": { "defaultAnalyzer": { "analyzer": "diacriticFolder", "type": "string", "indexOptions": "offsets", "store": true, "norms": "include" }, "wlcdAnalyzer": { "analyzer": "lucene.keyword", "type": "string" } } } } }, "analyzers": [ { "name": "diacriticFolder", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "icuFolding" } ] } ] } 

and after that, the wildcard block in the search also had to be changed:

wildcard: {
  query: "*E!*",
  path: { value: "newsText", multi: "wlcdAnalyzer" },
  allowAnalyzedField: true
}

This not only keeps the accent-ignoring behaviour, but also enforces punctuation (in this case the exclamation mark) in the special cases where it must be respected.
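For completeness, this is roughly how that wildcard block slots back into the compound query from the question (a sketch: the index name, phrase clause, and sort are copied from the original pipeline and assume the rest of the setup is unchanged):

{
  $search: {
    index: "default",
    compound: {
      must: [
        {
          // wildcard runs against the keyword-analyzed multi field,
          // so punctuation such as "!" is preserved
          wildcard: {
            query: "*E!*",
            path: { value: "newsText", multi: "wlcdAnalyzer" },
            allowAnalyzedField: true
          }
        },
        {
          // unchanged phrase clause from the original query
          phrase: {
            path: "newsText",
            query: ["Programa", "Tom Cruise"]
          }
        }
      ]
    },
    // sorting here assumes the date field is still indexed, as in the original index
    sort: { date: -1 }
  }
}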

1 Comment

I would encourage you not to use leading wildcard characters (wildcard: *E!*), as that will effectively perform a full index scan. From your example content above, simply a text query of "E!" should suffice.
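A sketch of what that might look like, assuming the index uses an analyzer that keeps single-letter tokens (such as the stop-word-free Portuguese analyzer from the answer above):

{
  $search: {
    index: "default",
    compound: {
      must: [
        {
          // plain text operator; relies on the analyzer keeping the "e" token
          text: { query: "E!", path: "newsText" }
        },
        {
          phrase: { path: "newsText", query: ["Programa", "Tom Cruise"] }
        }
      ]
    }
  }
}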
