To answer your first question, yes, it is a valid approach: it works, and the logic is fairly straightforward.
Looking at the code, you should encapsulate your logic into functions, making it easily reusable, including for testing:
```python
from collections import defaultdict


def build_prefix_dict(words):
    ''' Create a prefix dictionary for the given word list '''
    prefixes = defaultdict(list)
    for word in words:
        for i in range(1, len(word) + 1):
            prefix = word[0:i]
            prefixes[prefix].append(word)
    return prefixes


def find_words(prefix_dict, user_input: str) -> list:
    ''' Find all words in a prefix dictionary starting with the given prefix '''
    return prefix_dict.get(user_input, [])
```
Note that I replaced your comments with proper docstrings and changed the default value for find_words to an empty list, making it a bit friendlier to work with down the line.
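Here is a quick usage sketch (the two functions are repeated so the snippet runs on its own; the small word list is made up for illustration):

```python
from collections import defaultdict

def build_prefix_dict(words):
    ''' Create a prefix dictionary for the given word list '''
    prefixes = defaultdict(list)
    for word in words:
        for i in range(1, len(word) + 1):
            prefixes[word[:i]].append(word)
    return prefixes

def find_words(prefix_dict, user_input: str) -> list:
    ''' Find all words in a prefix dictionary starting with the given prefix '''
    return prefix_dict.get(user_input, [])

words = ['COU', 'COUAC', 'COUCOU', 'CHAT']
prefix_dict = build_prefix_dict(words)
print(find_words(prefix_dict, 'COU'))  # ['COU', 'COUAC', 'COUCOU']
print(find_words(prefix_dict, 'Z'))    # []
```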
Now, how does it compare to other solutions?
Since you mentioned tries, I implemented a simple Trie class:
```python
class Trie:
    ''' A simple prefix tree (or trie) '''

    def __init__(self):
        self._root = dict()
        self._len = 0

    def append(self, word: str):
        ''' Append a word to the Trie '''
        if word in self:
            return
        current_node = self._root
        for letter in word:
            if letter not in current_node:
                current_node[letter] = dict()
            current_node = current_node[letter]
        current_node[None] = None
        self._len += 1

    def __len__(self):
        return self._len

    def __contains__(self, word: str):
        current_node = self._root
        for letter in word:
            if letter not in current_node:
                return False
            current_node = current_node[letter]
        return None in current_node

    def __iter__(self):
        yield from self._iterate(self._root, '')

    def _iterate(self, node, prefix):
        if None in node:
            yield prefix
        for letter, child in node.items():
            if child is None:
                continue
            yield from self._iterate(child, prefix + letter)

    def find_words(self, prefix: str):
        ''' Iterator for words starting with the given prefix '''
        current_node = self._root
        for letter in prefix:
            if letter not in current_node:
                return
            current_node = current_node[letter]
        yield from self._iterate(current_node, prefix)
```
It uses dicts as nodes, with each key representing an edge leaving the node, and `None` used as an end-of-word marker. This makes it easy to enumerate words, starting from the root or from any prefix.
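To make the node layout concrete, here is a hand-rolled illustration (separate from the class) of what the nested dicts look like after inserting two short made-up words:

```python
# Build the same dict-of-dicts structure by hand for the words 'to' and 'tea'.
root = {}
for word in ('to', 'tea'):
    node = root
    for letter in word:
        node = node.setdefault(letter, {})  # create the edge if missing
    node[None] = None  # end-of-word marker

print(root)
# {'t': {'o': {None: None}, 'e': {'a': {None: None}}}}
```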
Finally, a simple solution would be to keep a flat list of words and find words with a given prefix using a list comprehension:
[word for word in words if word.startswith('<prefix>')]
Now, let's have a look at performance. I had a text file with the French Scrabble dictionary on hand (411430 words, from 2 to 15 letters), so I used it for testing.
For benchmarking, I tried finding words starting with the empty string (411430 hits), starting with COU (1637 hits) and with COUAC (2 hits). I also looked at the time taken for building the object:
|             | `''`    | `'COU'` | `'COUAC'` | build  |
|-------------|---------|---------|-----------|--------|
| flat list   | 50ms    | 350ms   | 35ms      | 90ms   |
| trie        | 500ms   | 2ms     | 0.001ms   | 1000ms |
| prefix dict | 0.002ms | 0.003ms | 0.003ms   | 2500ms |
As expected, the prefix dictionary takes roughly constant time no matter the size of the result: it is 4 to 5 orders of magnitude faster than searching a flat list, and 5 orders of magnitude faster than the trie in its worst case (and on par with its best case).
On the other hand, it takes significantly longer than a flat list to build. So, looking at speed alone, the best option could be either a flat list or a prefix dictionary, depending on how often the structure needs to be rebuilt versus how many searches are performed.
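For reference, this kind of build-versus-search measurement can be reproduced with the standard `timeit` module. This is only a sketch of the harness: the word list below is a made-up stand-in, so the numbers will not match the table above:

```python
import timeit
from collections import defaultdict

# Stand-in word list; replace with a real dictionary file for meaningful numbers.
words = ['COU', 'COUAC', 'COUCOU', 'CHAT'] * 1000

def build_prefix_dict(words):
    ''' Create a prefix dictionary for the given word list '''
    prefixes = defaultdict(list)
    for word in words:
        for i in range(1, len(word) + 1):
            prefixes[word[:i]].append(word)
    return prefixes

# Time the one-off build cost, then the per-lookup cost.
build_time = timeit.timeit(lambda: build_prefix_dict(words), number=1)
prefix_dict = build_prefix_dict(words)
lookup_time = timeit.timeit(lambda: prefix_dict.get('COU', []), number=1000)

print(f'build: {build_time:.4f}s, 1000 lookups: {lookup_time:.4f}s')
```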
Memory-wise, the dictionary I used weighs 28MB as a flat list and 185MB as a trie or a prefix dictionary. A 6x memory usage for a 10^5x speedup can definitely be an acceptable tradeoff for the prefix dictionary, depending on your use case.
(I'll admit I'm surprised the trie isn't performing better, but it's basically many nested dictionaries, which imply a lot of overhead in Python.)
As for alternative approaches, you can look into a better implementation of the trie. From previous experience, I know for a fact that memory footprint and speed can be improved by several orders of magnitude by flattening the tree, at the cost of higher complexity and initial building time.
You can also look into directed acyclic word graphs (DAWG), aka deterministic acyclic finite state automata (DAFSA), for a more compact memory footprint and performance similar to a trie, at the cost of even more complexity.