TLDR
Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP's, it's approximately 2000 times faster than the accepted answer.
If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.
Theory
If your sentences aren't humongous strings, it's probably feasible to process many more than 50 per second.
If you save all the banned words into a set, it will be very fast to check if another word is included in that set.
Pack the logic into a function, give this function as argument to re.sub and you're done!
Code
import re with open('/usr/share/dict/american-english') as wordbook: banned_words = set(word.strip().lower() for word in wordbook) def delete_banned_words(matchobj): word = matchobj.group(0) if word.lower() in banned_words: return "" else: return word sentences = ["I'm eric. Welcome here!", "Another boring sentence.", "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000 word_pattern = re.compile('\w+') for sentence in sentences: sentence = word_pattern.sub(delete_banned_words, sentence)
Converted sentences are:
' . ! . GiraffeElephantBoat sfgsdg sdwerha aswertwe
Note that:
- the search is case-insensitive (thanks to
lower()) - replacing a word with
"" might leave two spaces (as in your code) - With python3,
\w+ also matches accented characters (e.g. "ångström"). - Any non-word character (tab, space, newline, marks, ...) will stay untouched.
Performance
There are a million sentences, banned_words has almost 100000 words and the script runs in less than 7s.
In comparison, Liteye's answer needed 160s for 10 thousand sentences.
With n being the total amound of words and m the amount of banned words, OP's and Liteye's code are O(n*m).
In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).
Regex union test
What's the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?
It's pretty hard to grasp the way the regex engine works, so let's write a simple test.
This code extracts 10**i random english words into a list. It creates the corresponding regex union, and tests it with different words :
- one is clearly not a word (it begins with
#) - one is the first word in the list
- one is the last word in the list
- one looks like a word but isn't
import re import timeit import random with open('/usr/share/dict/american-english') as wordbook: english_words = [word.strip().lower() for word in wordbook] random.shuffle(english_words) print("First 10 words :") print(english_words[:10]) test_words = [ ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"), ("First word", english_words[0]), ("Last word", english_words[-1]), ("Almost a word", "couldbeaword") ] def find(word): def fun(): return union.match(word) return fun for exp in range(1, 6): print("\nUnion of %d words" % 10**exp) union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp])) for description, test_word in test_words: time = timeit.timeit(find(test_word), number=1000) * 1000 print(" %-17s : %.1fms" % (description, time))
It outputs:
First 10 words : ["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime'] Union of 10 words Surely not a word : 0.7ms First word : 0.8ms Last word : 0.7ms Almost a word : 0.7ms Union of 100 words Surely not a word : 0.7ms First word : 1.1ms Last word : 1.2ms Almost a word : 1.2ms Union of 1000 words Surely not a word : 0.7ms First word : 0.8ms Last word : 9.6ms Almost a word : 10.1ms Union of 10000 words Surely not a word : 1.4ms First word : 1.8ms Last word : 96.3ms Almost a word : 116.6ms Union of 100000 words Surely not a word : 0.7ms First word : 0.8ms Last word : 1227.1ms Almost a word : 1404.1ms
So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b' pattern has:
O(1) best case O(n/2) average case, which is still O(n) O(n) worst case
These results are consistent with a simple loop search.
A much faster alternative to a regex union is to create the regex pattern from a trie.
multiprocessing(i.e. multiple Python processes).