I am a new developer working with MongoDB Atlas, and I am currently working on text searches over a collection of news texts. At this stage I am building the pipelines for those searches, mainly in Compass, but I am facing a problem: some of the search terms contain punctuation. Taking the worst case, one of the terms is the channel name "E!". In Portuguese, "e" means "and", so it is a very frequent word. That rules out searching for the keyword "e" and filtering the results afterwards: I have over 5M articles, so I need this filter inside the search itself.
Here is a practical example. Given these two articles, I want to match the first one but not the second when filtering by the keyword "E!":
    {
      "_id": { "$numberLong": "1" },
      "newsText": "\"E! News\" EBB Entretenimento. A vida das celebridades no programa - como Tom Cruise - que conta tudo sobre o quem é quem nos EUA.\n",
      "date": 20240823
    },
    {
      "_id": { "$numberLong": "2" },
      "newsText": " canal 1. Programa de Entrevistas com Tom Cruise.\n",
      "date": 20240823
    }

From reading some threads on the MongoDB community forum and the MongoDB docs, I believe that I need a wildcard search, something like this:
    {
      $search: {
        index: "default",
        compound: {
          must: [
            {
              wildcard: {
                query: "E!",
                path: "newsText",
                allowAnalyzedField: true
              }
            },
            {
              phrase: {
                path: "newsText",
                query: ["Programa", "Tom Cruise"]
              }
            }
          ]
        },
        sort: { date: -1 }
      }
    }

But that alone does not work: I get no results, whereas if I delete the wildcard block I get both articles. I believe that I also need to update my search index (correct me if I am wrong).
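My unconfirmed guess at why the wildcard clause matches nothing, sketched in plain JavaScript: if the analyzer on my index (shown below) splits on punctuation and lowercases, then the index never contains a term "E!", only "e". As far as I understand the docs, the wildcard query itself is not analyzed, so "E!" is compared against those terms as-is. The `roughAnalyze` function is a hypothetical stand-in I wrote for illustration; it is not the real Atlas Search analyzer, and its regular expression only approximates the standard tokenizer.

```javascript
// Hypothetical approximation of a punctuation-splitting, lowercasing analyzer.
// This is NOT the real Atlas Search analyzer; it only illustrates my suspicion.
function roughAnalyze(text) {
  // Keep runs of letters and digits; everything else (including "!") separates tokens.
  const tokens = text.match(/[\p{L}\p{N}]+/gu) || [];
  return tokens.map((t) => t.toLowerCase());
}

const indexedTerms = roughAnalyze('"E! News" EBB Entretenimento.');
console.log(indexedTerms); // [ 'e', 'news', 'ebb', 'entretenimento' ]

// Assumption: the wildcard query string is matched against indexed terms unchanged.
console.log(indexedTerms.includes("E!")); // false, so no document can match
// The bare term "e" would match, but so would every Portuguese "e" ("and").
console.log(indexedTerms.includes("e")); // true
```

If this guess is right, changing the query alone cannot help, because the "!" is already gone by the time the text is indexed.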
Currently my index is built to ignore accents (very common in Portuguese), so that when I search for the name "andré" (for example), which in some texts is written as "andre", I don't need to search for both forms. This is my current index:
    {
      "analyzer": "diacriticFolder",
      "mappings": {
        "dynamic": false,
        "fields": {
          "date": {
            "type": "number",
            "representation": "int64",
            "indexDoubles": false,
            "indexIntegers": true
          },
          "newsText": {
            "type": "string",
            "analyzer": "diacriticFolder",
            "indexOptions": "offsets",
            "store": true,
            "norms": "include"
          }
        }
      },
      "analyzers": [
        {
          "name": "diacriticFolder",
          "charFilters": [],
          "tokenizer": { "type": "standard" },
          "tokenFilters": [ { "type": "icuFolding" } ]
        }
      ]
    }

Can someone give me a hand? Thanks in advance!
If any extra information or detail is needed, I will gladly provide it.
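In case it helps, this is the accent-insensitive behaviour I need to keep, approximated in plain JavaScript. `foldDiacritics` is a hypothetical helper written for illustration; it relies on Unicode NFD normalization and is only a rough stand-in for the `icuFolding` token filter, which does more than this.

```javascript
// Hypothetical helper: a rough stand-in for the icuFolding token filter.
// It decomposes accented characters (NFD), removes the combining marks,
// and lowercases the result.
function foldDiacritics(text) {
  return text
    .normalize("NFD")
    .replace(/\p{M}/gu, "")
    .toLowerCase();
}

console.log(foldDiacritics("André")); // andre
console.log(foldDiacritics("andré") === foldDiacritics("andre")); // true
```

Whatever change is made to support "E!", searches for "andré" and "andre" should still return the same articles.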