To answer your first question, yes, it is a valid approach: it works, and the logic is fairly straightforward.
Looking at the code, you should encapsulate your logic into functions, making it easily reusable, including for testing:
```python
from collections import defaultdict


def build_prefix_dict(words):
    ''' Create a prefix dictionary for the given word list '''
    prefixes = defaultdict(list)
    for word in words:
        for i in range(1, len(word) + 1):
            prefix = word[0:i]
            prefixes[prefix].append(word)
    return prefixes


def find_words(prefix_dict, user_input: str) -> list:
    ''' Find all words in a prefix dictionary starting with the given prefix '''
    return prefix_dict.get(user_input, [])
```
Note that I replaced your comments with proper docstrings and changed the default value for find_words to an empty list, making it a bit friendlier to work with down the line.
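Here is a quick usage sketch (the two functions are repeated so the snippet runs on its own; the small word list is made up for illustration):

```python
from collections import defaultdict

def build_prefix_dict(words):
    ''' Create a prefix dictionary for the given word list '''
    prefixes = defaultdict(list)
    for word in words:
        for i in range(1, len(word) + 1):
            prefixes[word[:i]].append(word)
    return prefixes

def find_words(prefix_dict, user_input: str) -> list:
    ''' Find all words in a prefix dictionary starting with the given prefix '''
    return prefix_dict.get(user_input, [])

words = ['COU', 'COUAC', 'COUCOU', 'CHAT']
prefix_dict = build_prefix_dict(words)
print(find_words(prefix_dict, 'COU'))  # ['COU', 'COUAC', 'COUCOU']
print(find_words(prefix_dict, 'Z'))    # []
```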
Now, how does it compare to other solutions?
Since you mentioned tries, I implemented a simple Trie class:
```python
class Trie:
    ''' A simple prefix tree (or trie) '''

    def __init__(self):
        self._root = dict()
        self._len = 0

    def append(self, word: str):
        ''' Append a word to the Trie '''
        if word in self:
            return
        current_node = self._root
        for letter in word:
            if letter not in current_node:
                current_node[letter] = dict()
            current_node = current_node[letter]
        current_node[None] = None
        self._len += 1

    def __len__(self):
        return self._len

    def __contains__(self, word: str):
        current_node = self._root
        for letter in word:
            if letter not in current_node:
                return False
            current_node = current_node[letter]
        return None in current_node

    def __iter__(self):
        yield from self._iterate(self._root, '')

    def _iterate(self, node, prefix):
        if None in node:
            yield prefix
        for letter, child in node.items():
            if child is None:
                continue
            yield from self._iterate(child, prefix + letter)

    def find_words(self, prefix: str):
        ''' Iterator for words starting with the given prefix '''
        current_node = self._root
        for letter in prefix:
            if letter not in current_node:
                return
            current_node = current_node[letter]
        yield from self._iterate(current_node, prefix)
```
It uses dicts as nodes, with each key representing an edge leaving the node, and `None` used as an end-of-word marker. This makes it easy to enumerate words, starting from the root or from any prefix.
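To make the node layout concrete, here is a hand-rolled illustration (separate from the class) of what the nested dicts look like after inserting two short made-up words:

```python
# Build the same dict-of-dicts structure by hand for the words 'to' and 'tea'.
root = {}
for word in ('to', 'tea'):
    node = root
    for letter in word:
        node = node.setdefault(letter, {})  # create the edge if missing
    node[None] = None  # end-of-word marker

print(root)
# {'t': {'o': {None: None}, 'e': {'a': {None: None}}}}
```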
Finally, a simple solution would be to keep a flat list of words and find words with a given prefix using a list comprehension:
[word for word in words if word.startswith('<prefix>')]
Now, let's have a look at performance. I had a text file with the French Scrabble dictionary on hand (411430 words, from 2 to 15 letters), so I used it for testing.
For benchmarking, I tried finding words starting with the empty string (411430 hits), starting with COU (1637 hits) and with COUAC (2 hits). I also looked at the time taken for building the object:
|             | `''`    | `'COU'` | `'COUAC'` | build  |
|-------------|---------|---------|-----------|--------|
| flat list   | 50ms    | 350ms   | 35ms      | 90ms   |
| trie        | 500ms   | 2ms     | 0.001ms   | 1000ms |
| prefix dict | 0.002ms | 0.003ms | 0.003ms   | 2500ms |
As expected, the prefix dictionary takes roughly constant time no matter the size of the result: it is 4 to 5 orders of magnitude faster than searching a flat list, and 5 orders of magnitude faster than the trie in its worst case (and on par with its best case).
On the other hand, it takes significantly longer than a flat list to build. So, looking at speed alone, the best option could be either a flat list or a prefix dictionary, depending on how often the structure needs to be rebuilt versus how many searches are performed.
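For reference, this kind of build-versus-search measurement can be reproduced with the standard `timeit` module. This is only a sketch of the harness: the word list below is a made-up stand-in, so the numbers will not match the table above:

```python
import timeit
from collections import defaultdict

# Stand-in word list; replace with a real dictionary file for meaningful numbers.
words = ['COU', 'COUAC', 'COUCOU', 'CHAT'] * 1000

def build_prefix_dict(words):
    ''' Create a prefix dictionary for the given word list '''
    prefixes = defaultdict(list)
    for word in words:
        for i in range(1, len(word) + 1):
            prefixes[word[:i]].append(word)
    return prefixes

# Time the one-off build cost, then the per-lookup cost.
build_time = timeit.timeit(lambda: build_prefix_dict(words), number=1)
prefix_dict = build_prefix_dict(words)
lookup_time = timeit.timeit(lambda: prefix_dict.get('COU', []), number=1000)

print(f'build: {build_time:.4f}s, 1000 lookups: {lookup_time:.4f}s')
```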
Memory-wise, the dictionary I used weighs 28MB as a flat list and 185MB as a trie or a prefix dictionary. A 6x memory usage for a 10^5x speedup can definitely be an acceptable tradeoff for the prefix dictionary, depending on your use case.
(I'll admit I'm surprised the trie isn't performing better, but it's basically many nested dictionaries, which imply a lot of overhead in Python.)
As for alternative approaches, you can look into a better implementation of the trie. From previous experience, I know for a fact that memory footprint and speed can be improved by several orders of magnitude by flattening the tree, at the cost of higher complexity and initial building time.
You can also look into directed acyclic word graphs (DAWG), aka deterministic acyclic finite state automata (DAFSA), for a more compact memory footprint and performance similar to a trie, at the cost of even more complexity.