205

I have two lists:

  • a list of about 750K "sentences" (long strings)
  • a list of about 20K "words" that I would like to delete from my 750K sentences

So, I have to loop through 750K sentences and perform about 20K replacements, but ONLY if my words are actually "words" and are not part of a larger string of characters.

I am doing this by pre-compiling my words so that they are flanked by the \b word-boundary metacharacter:

compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words] 

Then I loop through my "sentences":

import re

for sentence in sentences:
    for word in compiled_words:
        sentence = re.sub(word, "", sentence)
    # put sentence into a growing list

This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

  • Is there a way to use the str.replace method (which I believe is faster), while still requiring that replacements only happen at word boundaries?

  • Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping over re.sub when my word is longer than my sentence, but it's not much of an improvement.

I'm using Python 3.5.2

  • 2
    The first answer here has some good sample code: stackoverflow.com/questions/2846653/… Just divide up your sentence array by the number of CPU cores you have, then run that many threads. Commented Mar 12, 2017 at 1:53
  • 7
    You can also try a non-regex implementation - traverse your input word by word and match every word with a set. This is single pass and hash lookups are pretty quick. Commented Mar 12, 2017 at 2:45
  • 2
    How long are these sentences, incidentally? 750k lines doesn't sound like a dataset that should be taking hours to process. Commented Mar 12, 2017 at 2:57
  • 2
    @MohammadAli: Don't bother with that example for CPU-bound work. Python has a big lock that it takes when executing bytecode (the Global Interpreter Lock), so you can't benefit from threads for CPU work. You'd need to use multiprocessing (i.e. multiple Python processes). Commented Mar 12, 2017 at 7:09
  • 2
    You need an industrial strength tool to do this. A regex trie is generated from a ternary tree of a list of strings. There are never more than 5 steps to failure, making this the fastest method for this type of matching. Examples: 175,000-word dictionary, or, similar to your banned list, just the 20,000 S-words Commented Jan 3, 2020 at 10:04

10 Answers

210

TLDR

Use this method if you want the fastest regex-based solution. For a dataset similar to the OP's, it's approximately 1000 times faster than the accepted answer.

If you don't care about regex, use this set-based version, which is 2000 times faster than a regex union.

Optimized Regex with Trie

A simple Regex union approach becomes slow with many banned words, because the regex engine doesn't do a very good job of optimizing the pattern.

It's possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex aren't really human-readable, but they do allow for very fast lookup and match.

Example

['foobar', 'foobah', 'fooxar', 'foozap', 'fooza'] 

[Diagram: Regex union]

The list is converted to a trie:

{
    'f': {
        'o': {
            'o': {
                'x': {'a': {'r': {'': 1}}},
                'b': {'a': {'r': {'': 1}, 'h': {'': 1}}},
                'z': {'a': {'': 1, 'p': {'': 1}}}
            }
        }
    }
}

And then to this regex pattern:

r"\bfoo(?:ba[hr]|xar|zap?)\b" 

[Diagram: Regex trie]

The huge advantage is that to test whether zoo matches, the regex engine only needs to compare the first character (it doesn't match), instead of trying each of the 5 words. This preprocessing is overkill for 5 words, but it shows promising results for many thousands of words.
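
As a quick sanity check (not from the original answer), we can confirm the factored pattern above matches exactly the 5 listed words:

import re

pattern = re.compile(r"\bfoo(?:ba[hr]|xar|zap?)\b")
for word in ['foobar', 'foobah', 'fooxar', 'foozap', 'fooza', 'zoo', 'foob']:
    print(word, bool(pattern.match(word)))
# the 5 original words print True; 'zoo' and 'foob' print False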

Note that (?:) non-capturing groups are used because:

  • parentheses are needed anyway, so that the | alternation is scoped correctly
  • capturing groups aren't needed for our purpose at all; they would just slow down the matching and use more memory without benefit

Code

Here's a slightly modified gist, which we can use as a trie.py library:

import re

class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())
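
For instance, rebuilding the example from earlier in this answer (a quick usage check, not part of the original gist):

trie = Trie()
for word in ['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']:
    trie.add(word)

print(trie.pattern())  # -> foo(?:ba[hr]|xar|zap?)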

Test

Here's a small test (the same as this one):

# Encoding: utf-8
import re
import timeit
import random

from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))

It outputs:

TrieRegex of 10 words
  Surely not a word : 0.3ms
  First word : 0.4ms
  Last word : 0.5ms
  Almost a word : 0.5ms

TrieRegex of 100 words
  Surely not a word : 0.3ms
  First word : 0.5ms
  Last word : 0.9ms
  Almost a word : 0.6ms

TrieRegex of 1000 words
  Surely not a word : 0.3ms
  First word : 0.7ms
  Last word : 0.9ms
  Almost a word : 1.1ms

TrieRegex of 10000 words
  Surely not a word : 0.1ms
  First word : 1.0ms
  Last word : 1.2ms
  Almost a word : 1.2ms

TrieRegex of 100000 words
  Surely not a word : 0.3ms
  First word : 1.2ms
  Last word : 0.9ms
  Almost a word : 1.6ms

For info, the regex begins like this:

(?:a(?:(?:\'s|a(?:\'s|chen|liyah(?:\'s)?|r(?:dvark(?:(?:\'s|s))?|on))|b(?:\'s|a(?:c(?:us(?:(?:\'s|es))?|[ik])|ft|lone(?:(?:\'s|s))?|ndon(?:(?:ed|ing|ment(?:\'s)?|s))?|s(?:e(?:(?:ment(?:\'s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\'s)?|[ds]))?|ing|toir(?:(?:\'s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:\'s|es))?|y(?:(?:\'s|s))?)|ot(?:(?:\'s|t(?:\'s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|y(?:\'s)?|\é(?:(?:\'s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|om(?:en(?:(?:\'s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:\'s|s))?)|or(?:(?:\'s|s))?|s))?|l(?:\'s)?))|e(?:(?:\'s|am|l(?:(?:\'s|ard|son(?:\'s)?))?|r(?:deen(?:\'s)?|nathy(?:\'s)?|ra(?:nt|tion(?:(?:\'s|s))?))|t(?:(?:t(?:e(?:r(?:(?:\'s|s))?|d)|ing|or(?:(?:\'s|s))?)|s))?|yance(?:\'s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:\'s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:\'s)?)|gail|l(?:ene|it(?:ies|y(?:\'s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:\'s|s))?|e[ds]?|ing))|l(?:a(?:tive(?:(?:\'s|s))?|ze)|e(?:(?:st|r))?|oom|ution(?:(?:\'s|s))?|y)|m\'s|n(?:e(?:gat(?:e[ds]?|i(?:ng|on(?:\'s)?))|r(?:\'s)?)|ormal(?:(?:it(?:ies|y(?:\'s)?)|ly))?)|o(?:ard|de(?:(?:\'s|s))?|li(?:sh(?:(?:e[ds]|ing))?|tion(?:(?:\'s|ist(?:(?:\'s|s))?))?)|mina(?:bl[ey]|t(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|r(?:igin(?:al(?:(?:\'s|s))?|e(?:(?:\'s|s))?)|t(?:(?:ed|i(?:ng|on(?:(?:\'s|ist(?:(?:\'s|s))?|s))?|ve)|s))?)|u(?:nd(?:(?:ed|ing|s))?|t)|ve(?:(?:\'s|board))?)|r(?:a(?:cadabra(?:\'s)?|d(?:e[ds]?|ing)|ham(?:\'s)?|m(?:(?:\'s|s))?|si(?:on(?:(?:\'s|s))?|ve(?:(?:\'s|ly|ness(?:\'s)?|s))?))|east|idg(?:e(?:(?:ment(?:(?:\'s|s))?|[ds]))?|ing|ment(?:(?:\'s|s))?)|o(?:ad|gat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|upt(?:(?:e(?:st|r)|ly|ness(?:\'s)?))?)|s(?:alom|c(?:ess(?:(?:\'s|e[ds]|ing))?|issa(?:(?:\'s|[es]))?|ond(?:(?:ed|ing|s))?)|en(?:ce(?:(?:\'s|s))?|t(?:(?:e(?:e(?:(?:\'s|ism(?:\'s)?|s))?|d)|ing|ly|s))?)|inth(?:(?:\'s|e(?:\'s)?))?|o(?:l(?:ut(?:e(?:(?:\'s|ly|st?))?|i(?:on(?:\'s)?|sm(?:\'s)?))|v(?:e[ds]?|ing))|r(?:b(?:(?:e(?:n(?:cy(?:\'s)?|t(?:(?:\'s|s))?)|d)|ing|s))?|pti...

It's really unreadable, but for a list of 100000 banned words, this Trie regex is 1000 times faster than a simple regex union!

Here's a diagram of the complete trie, exported with trie-python-graphviz and graphviz twopi:

[Diagram of the complete trie]


16 Comments

It looks like for the original purpose there is no need for a non-capturing group. At least the meaning of the non-capturing group should be mentioned
@XavierCombelle: You're right that I should mention the capturing group: the answer has been updated. I see it the other way around though: parens are needed for regex alternation with | but capturing groups aren't needed for our purpose at all. They'd just slow down the process and use more memory without benefit.
@EricDuminil This post is perfect, thank you so much :)
@MohamedALANI: Compared to which solution?
@PV8: It should only match complete words, yes, thanks to the \b (word boundary). If the list is ['apple', 'banana'], it will replace words which are exactly apple or banana, but not nana, bana or pineapple.
175

TLDR

Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP's, it's approximately 2000 times faster than the accepted answer.

If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.

Theory

If your sentences aren't humongous strings, it's probably feasible to process many more than 50 per second.

If you save all the banned words into a set, it will be very fast to check if another word is included in that set.

Pack the logic into a function, pass this function as the repl argument to re.sub, and you're done!

Code

import re

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = set(word.strip().lower() for word in wordbook)

def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
             "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

word_pattern = re.compile(r'\w+')

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)

Converted sentences are:

' .  !
  .
GiraffeElephantBoat
sfgsdg sdwerha aswertwe

Note that:

  • the search is case-insensitive (thanks to lower())
  • replacing a word with "" might leave two spaces (as in your code)
  • with Python 3, \w+ also matches accented characters (e.g. "ångström")
  • any non-word character (tab, space, newline, marks, ...) will stay untouched

Performance

There are a million sentences, banned_words has almost 100000 words and the script runs in less than 7s.

In comparison, Liteye's answer needed 160s for 10 thousand sentences.

With n being the total amount of words and m the amount of banned words, the OP's and Liteye's code are O(n*m).

In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).

Regex union test

What's the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?

It's pretty hard to grasp the way the regex engine works, so let's write a simple test.

This code extracts 10**i random English words into a list, creates the corresponding regex union, and tests it with different words:

  • one is clearly not a word (it begins with #)
  • one is the first word in the list
  • one is the last word in the list
  • one looks like a word but isn't


import re
import timeit
import random

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook]
random.shuffle(english_words)

print("First 10 words :")
print(english_words[:10])

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", english_words[0]),
    ("Last word", english_words[-1]),
    ("Almost a word", "couldbeaword")
]

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nUnion of %d words" % 10**exp)
    union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %-17s : %.1fms" % (description, time))

It outputs:

First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']

Union of 10 words
  Surely not a word : 0.7ms
  First word : 0.8ms
  Last word : 0.7ms
  Almost a word : 0.7ms

Union of 100 words
  Surely not a word : 0.7ms
  First word : 1.1ms
  Last word : 1.2ms
  Almost a word : 1.2ms

Union of 1000 words
  Surely not a word : 0.7ms
  First word : 0.8ms
  Last word : 9.6ms
  Almost a word : 10.1ms

Union of 10000 words
  Surely not a word : 1.4ms
  First word : 1.8ms
  Last word : 96.3ms
  Almost a word : 116.6ms

Union of 100000 words
  Surely not a word : 0.7ms
  First word : 0.8ms
  Last word : 1227.1ms
  Almost a word : 1404.1ms

So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b' pattern has:

  • O(1) best case
  • O(n/2) average case, which is still O(n)
  • O(n) worst case

These results are consistent with a simple loop search.

A much faster alternative to a regex union is to create the regex pattern from a trie.

10 Comments

You were correct. My indentation was wrong. I fixed it in the original question. As for the comment that 50 sentences / second is slow, all I can say is that I am providing a simplified example. The real data-set is more complicated than I am describing, but it didn't seem relevant. Also, concatenation of my "words" into a single regex massively improved the speed. Also, I am "squeezing" out double-spaces after the replacements.
@user36476 Thanks for the feedback, I removed the corresponding part. Could you please try my suggestion? I dare say it's much faster than the accepted answer.
@user36476: I've seen your comment update now, and changed the intro.
Since you removed that misleading O(1) claim, your answer definitely deserves an up vote.
@idmean: True, that wasn't very clear. It was just referring to the lookup: "Is this word a banned word?".
143

One thing you can try is to compile one single pattern like r"\b(word1|word2|word3)\b" (note the raw string, so that \b is a word boundary rather than a backspace).

Because re relies on C code to do the actual matching, the savings can be dramatic.

As @pvg pointed out in the comments, it also benefits from single pass matching.

If your words are not regex, Eric's answer is faster.
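
A minimal sketch of the single-pattern idea (variable names are assumed; re.escape guards against words that contain regex metacharacters, which also addresses the caveat above):

import re

words = ["word1", "word2", "word3"]                      # the banned words
sentences = ["word1 appears here", "word4 is untouched"]

pattern = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, words)))
cleaned = [pattern.sub("", sentence) for sentence in sentences]
print(cleaned)  # -> [' appears here', 'word4 is untouched']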

18 Comments

It's not just the C impl (which makes a big difference) but you're also matching with a single pass. Variants of this question come up pretty often, it's a little odd there isn't (or maybe there is, hiding somewhere?) a canonical SO answer for it with this pretty sensible idea.
@Liteye your suggestion turned a 4-hour job into a 4-minute job! I was able to join all 20,000+ regexes into a single gigantic regex and my laptop didn't bat an eye. Thanks again.
@Bakuriu: s/They actually use/They actually could in theory sometimes use/. Do you have any reason to believe Python's implementation is doing anything other than a loop here?
@Bakuriu: I'd be really interested to know if that's the case, but I don't think the regex solution takes linear time. If it doesn't build a Trie out of the union, I don't see how it could happen.
@Bakuriu: That's not a reason. I was asking if you have a reason to believe the implementation actually behaves that way, not whether you have a reason to believe it could behave that way. Personally I have yet to come across a single mainstream programming language's regex implementation that works in linear time the same way you'd expect a classical regex to, so if you know Python does this, you should show some evidence.
17

One thing you might want to try is pre-processing the sentences to encode the word boundaries. Basically turn each sentence into a list of words by splitting on word boundaries.

This should be faster, because to process a sentence, you just have to step through each of the words and check if it's a match.

Currently the regex search has to go through the entire string again on each pass, looking for word boundaries, and then "discards" the result of this work before the next pass.
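
A minimal sketch of this idea with assumed variable names (re.split with a capturing group keeps the separators, so each sentence can be rebuilt):

import re

banned_words = {"hello", "word1"}                 # hypothetical banned set
sentences = ["Hello big world!", "no match here"]

splitter = re.compile(r'(\W+)')                   # capturing group keeps separators

cleaned = []
for sentence in sentences:
    tokens = splitter.split(sentence)             # alternating words and separators
    kept = [t for t in tokens if t.lower() not in banned_words]
    cleaned.append(''.join(kept))

print(cleaned)  # -> [' big world!', 'no match here']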

Comments

15

Well, here's a quick and easy solution, with test set.

Best strategy:

  • re.sub(r"\w+", repl, sentence) searches for words.
  • "repl" can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search and replace.
  • This is the simplest and fastest solution (see function replace4 in example code below).

Second best strategy:

  • The idea is to split the sentences into words, using re.split, while conserving the separators to reconstruct the sentences later. Then, replacements are done with a simple dict lookup.
  • Implementation: (see function replace3 in example code below).

Timings for example functions:

replace1:     0.62 sentences/s
replace2:     7.43 sentences/s
replace3: 48498.03 sentences/s
replace4: 61374.97 sentences/s  (...and 240,000/s with PyPy)

...and code:

#! /bin/env python3
# -*- coding: utf-8

import time, random, re

def replace1( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns:
            sentence = re.sub( "\\b"+search+"\\b", repl, sentence )

def replace2( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns_comp:
            sentence = re.sub( search, repl, sentence )

def replace3( sentences ):
    pd = patterns_dict.get
    for n, sentence in enumerate( sentences ):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)
        # and... done.
        sentence = "".join( pd(w,w) for w in words )
        #~ print( n, sentence )

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w,w)
    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)

# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]

def test( func, num ):
    t = time.time()
    func( test_sentences[:num] )
    print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))

print( "Sentences", len(test_sentences) )
print( "Words    ", len(test_words) )

test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )

EDIT: You can also ignore case when checking, if you pass a lowercase list of sentences and edit repl:

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w.lower(),w)

4 Comments

Upvote for the tests. replace4 and my code have similar performances.
Not sure what def repl(m): is doing and how you are assigning m in the function replace4
Also I am getting error: unbalanced parenthesis for the line patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]
While replace3 and replace4 function addresses the original issue (to replace words), replace1 and replace2 are more general-purpose, as those work even if the needle is a phrase (a sequence of words) and not just a single word.
7

Perhaps Python is not the right tool here. Here is one with the Unix toolchain:

sed G file | tr ' ' '\n' | grep -vf blacklist | awk -v RS= -v OFS=' ' '{$1=$1}1' 

assuming your blacklist file is preprocessed with the word boundaries added. The steps are: convert the file to double spaced, split each sentence to one word per line, mass delete the blacklist words from the file, and merge back the lines.

This should run at least an order of magnitude faster.

For preprocessing the blacklist file from words (one word per line)

sed 's/.*/\\b&\\b/' words > blacklist 

Comments

5

How about this:

#!/usr/bin/env python3
from __future__ import unicode_literals, print_function
import re
import time
import io

def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            yield next(words_iter) # yield the word separator

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        while True:
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            yield sentence[last_word_boundary:current_boundary] # yield the separators
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            word = sentence[last_word_boundary:current_boundary]
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b' .')
print('time:', time.time() - start)

These solutions split on word boundaries and look up each word in a set. They should be faster than re.sub with word alternates (Liteye's solution), as these solutions are O(n), where n is the size of the input, thanks to the amortized O(1) set lookup; using regex alternates would cause the regex engine to check for word matches at every character rather than just at word boundaries. My solutions take extra care to preserve the whitespace used in the original text (i.e. they don't compress whitespace, and they preserve tabs, newlines, and other whitespace characters), but if you decide you don't care about that, it should be fairly straightforward to strip it from the output.

I tested on corpus.txt, which is a concatenation of multiple eBooks downloaded from Project Gutenberg, and banned_words.txt is 20000 words randomly picked from Ubuntu's wordlist (/usr/share/dict/american-english). It takes around 30 seconds to process 862462 sentences (and half of that on PyPy). I've defined sentences as anything separated by ". ".

$ # replace_sentences_1()
$ python3 filter_words.py
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py
number of sentences: 862462
time: 15.9370770454

$ # replace_sentences_2()
$ python3 filter_words.py
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py
number of sentences: 862462
time: 13.1190629005

PyPy benefits more from the second approach, while CPython fares better with the first. The above code should work on both Python 2 and 3.

3 Comments

Python 3 is a given in the question. I upvoted this but I think it might be worth sacrificing some of the detail and 'optimal' implementation in this code to make it less verbose.
If I understand it correctly, it's basically the same principle as my answer, but more verbose? Splitting and joining on \W+ is basically like sub on \w+, right?
I wonder if my solution below (function replace4) is faster than pypy ;) I'd like to test on your files!
3

Practical approach

The solution described below uses a lot of memory: it stores all the text in one string to reduce the complexity. If RAM is an issue, think twice before using it.

With join/split tricks you can avoid loops altogether, which should speed up the algorithm (a consolidated sketch follows the list below).

  • Concatenate the sentences with a special delimiter that is not contained in any sentence:

    merged_sentences = ' * '.join(sentences)

  • Compile a single regex for all the words you need to remove from the sentences, using the | "or" regex alternation:

    regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case-insensitive flag

  • Substitute the words using the compiled regex, and split the result on the special delimiter back into separate sentences:

    clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')
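
Put together, a minimal end-to-end sketch of the three steps above (with hypothetical data; re.escape is added as a guard in case the words contain regex metacharacters):

import re

sentences = ["word1 is here", "nothing to remove", "word2 ends this"]
words = ["word1", "word2"]  # hypothetical banned words

merged_sentences = ' * '.join(sentences)
regex = re.compile(r'\b({})\b'.format('|'.join(map(re.escape, words))), re.I)
clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')

print(clean_sentences)  # -> [' is here', 'nothing to remove', ' ends this']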

Performance

"".join complexity is O(n). This is pretty intuitive, but anyway there is a shortened quotation from a source:

for (i = 0; i < seqlen; i++) {
    [...]
    sz += PyUnicode_GET_LENGTH(item);

Therefore with join/split you have O(words) + 2*O(sentences), which is still linear complexity, vs. 2*O(N²) with the initial approach.


By the way, don't use multithreading. The GIL will block each operation, because your task is strictly CPU-bound, so the GIL has no chance to be released; each thread will just send ticks concurrently, which causes extra overhead and can even slow the operation down indefinitely.
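
If you do want parallelism, the comments on the question point to multiprocessing instead, since separate processes are not subject to the GIL. A minimal sketch with hypothetical names:

import re
from multiprocessing import Pool

words = ["word1", "word2"]                        # hypothetical banned words
sentences = ["word1 is here", "untouched"] * 1000 # hypothetical corpus

regex = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, words))), re.I)

def clean(sentence):
    return regex.sub("", sentence)

if __name__ == '__main__':
    with Pool() as pool:  # one worker process per CPU core by default
        clean_sentences = pool.map(clean, sentences, chunksize=100)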

3 Comments

In case the sentences are (were) stored in a text file, they are already separated by a newline. So the whole file could be read in as one big string (or buffer), words removed, and then written back again (or this could be done in the file directly using memory mapping). On the other hand, to remove a word, the remainder of the string has to be moved back to fill the gap, so that would be a problem with one very large string. An alternative would be to write the parts between the words back to another string or file (which would include the newlines), or just move those parts in a mmapped file (1) ..
.. That last approach (moving/writing the parts between the words) combined with Eric Duminil's set lookup could be really fast, perhaps without even using regex at all. (2)
.. Or maybe regex is already optimized to only move those parts when replacing multiple words, I don't know.
1

Concatenate all your sentences into one document. Use any implementation of the Aho-Corasick algorithm (here's one) to locate all your "bad" words. Traverse the file, replacing each bad word and updating the offsets of the found words that follow, etc.
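
Here's a hedged sketch of that approach using the third-party pyahocorasick package (an assumption, not named by the answer); the word-boundary check and span merging are my own additions:

import ahocorasick  # third-party: pip install pyahocorasick

def remove_words(text, banned_words):
    # Build the automaton once; matching is then a single pass over the text.
    automaton = ahocorasick.Automaton()
    for word in banned_words:
        automaton.add_word(word, word)
    automaton.make_automaton()

    # Collect spans of matches that sit on word boundaries.
    spans = []
    for end, word in automaton.iter(text):
        start = end - len(word) + 1
        before = text[start - 1] if start > 0 else ' '
        after = text[end + 1] if end + 1 < len(text) else ' '
        if not before.isalnum() and not after.isalnum():  # crude \b check
            spans.append((start, end + 1))

    # Rebuild the string, skipping matched spans (overlaps resolved naively).
    parts, pos = [], 0
    for start, stop in spans:
        if start >= pos:
            parts.append(text[pos:start])
            pos = stop
    parts.append(text[pos:])
    return ''.join(parts)

print(remove_words("hello foobar world", {"foobar"}))  # -> 'hello  world'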

Comments

0

Use the LRU Cache! I got a 60-fold speed-up for regex searches in a production environment. Caching (aka, memoization) works by saving the results of expensive computations, so you'll get the best speed up if there's duplication in your data set (which there likely is).

import re
from functools import lru_cache

@lru_cache
def expensive_regex(str):
    return re.findall('pattern', str)

Check out: 127x Regex Speed Up in 2 Lines of Code, written by yours truly :)
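
For illustration, a self-contained variant (the pattern and input are placeholders; caching only pays off when the same strings recur):

import re
import timeit
from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_regex(s):
    return re.findall(r'\w+', s)

text = "some repeated input " * 1000
# the regex runs once; the remaining 999 calls are cache lookups
print(timeit.timeit(lambda: expensive_regex(text), number=1000))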

Comments
