Problem
Given a Dictionary with
user_idand a list ofalert_words(words and phrases too look for in an sentence) and a stringcontent. We have to look if thealert_wordsappears in thecontentand return the list ofuser_idswho'salert_wordsappears in thecontentExample
input = { 1 : ['how are you'], 2 : ['you are'] }
content = 'hello, how are you'
output = [1]
user_id= 1 has 'how are you' whileuser_id= 2 has the words but not in the correct order so only user 1 is returned.
Solution
I'm using Google's pygtrie implementation of Trie data structure to achieve this. [pygtrie documentation]
Algorithm:
- For each word in the given sentence
- Check if the word is a key, if yes add it to the list of user_ids
- check if the word has a subtrie i.e. that current word is the starting of a alert_word. So add it to another set
alert_phrases - for each word in
alert_phraseswe check if we can extent with the current word and do the same set of operations if it is a key/subtrie
Code
import pygtrie from typing import Dict, List, Set def build_trie(realm_alert_words : Dict[int, List[int]]) -> pygtrie.StringTrie: trie = pygtrie.StringTrie() for user_id, words in realm_alert_words.items(): for word in words: alert_word = trie._separator.join(word.split()) if trie.has_key(alert_word): user_ids_for_word = trie.get(alert_word) user_ids_for_word.update([user_id]) else: trie[alert_word] = set([user_id]) return trie def get_user_ids_with_alert_words(trie : pygtrie.StringTrie, content : str) -> Set[int]: """Returns the list of user_id's who have alert_words present in content""" content_words = content.split() alert_phrases = set() user_ids_in_messages = set() for possible_alert_word in content_words: #has_node returns 1(HAS_VALUE) if the exact key is found, 2(HAS_SUBTRIE) if the key is a sub trie, # 3 if it's both 0 if it's none #https://pygtrie.readthedocs.io/en/latest/#pygtrie.Trie.has_node alert_word_in_trie = trie.has_node(possible_alert_word) if alert_word_in_trie & pygtrie.Trie.HAS_VALUE: user_ids = trie.get(possible_alert_word) user_ids_in_messages.update(user_ids) deep_copy_alert_phrases = set(alert_phrases) # Check if extending the phrases with the current word in content is a subtrie or key. And # Remove the word if it is not a subtrie as we are interested only in continuos words in the content for alert_phrase in deep_copy_alert_phrases: alert_phrases.remove(alert_phrase) extended_alert_phrase = alert_phrase + trie._separator + possible_alert_word alert_phrase_in_trie = trie.has_node(extended_alert_phrase) if alert_phrase_in_trie & pygtrie.Trie.HAS_VALUE: user_ids = trie.get(extended_alert_phrase) user_ids_in_messages.update(user_ids) if alert_phrase_in_trie & pygtrie.Trie.HAS_SUBTRIE: alert_phrases.add(extended_alert_phrase) if alert_word_in_trie & pygtrie.Trie.HAS_SUBTRIE: alert_phrases.add(possible_alert_word) return user_ids_in_messages Tests
input = {1 : ['hello'], 7 : ['this possible'], 2 : ['hello'], 3 : ['hello'], 5 : ['how are you'], 6 : ['hey']} alert_word_trie = build_trie(input) content = 'hello how is this possible how are you doing today' result = get_user_ids_with_alert_words(alert_word_trie, content) assert(result == set([1, 2, 3, 5, 7])) input = {1 : ['provisioning', 'Prod deployment'], 2 : ['test', 'Prod'], 3 : ['prod'], 4 : ['deployment'] } alert_word_trie = build_trie(input) content = 'Hello, everyone. Prod deployment has been completed' result = get_user_ids_with_alert_words(alert_word_trie, content) assert(result == set([1, 2, 4])) input = {1 : ['provisioning/log.txt'] } alert_word_trie = build_trie(input) content = 'Hello, everyone. Errors logged at provisioning/log.txt ' result = get_user_ids_with_alert_words(alert_word_trie, content) assert(result == set([1])) The two methods are part of a larger classes which have some not so related code. You get a list of user_ids and their alert_words from the database and you process every message content based on the trie already build up.
This is for a chat application so frequency of running the get_user_id_with_alert_words is high when the build_trie is relatively less since it will be cached.
contentisHello, how [some words] are [more words] you, what the result should be? \$\endgroup\$ina list is of linear time complexity. \$\endgroup\$inis O(n) for lists (and strings): wiki.python.org/moin/TimeComplexity \$\endgroup\$O(m)wheremis the length of the word and finding it in a list would beO(n)wherenis the total number of elements in the list which can grow as opposed to the number of words in the message. \$\endgroup\$