List of strings, replace all words from other list [duplicate]

Question

Python novice here.

I have a list of documents, and another list of search terms. I would now like to iterate over each document, and replace all occurrences of any of the search terms with something like <placeholder>. It should, however, only match full words, so text.replace probably does not work?

So, something like this:

    document_list = ['I like apples', 'I like bananas', 'I like apples and bananas and pineapples', 'I like oranges, but not blood oranges.']
    search_list = ['apples', 'bananas', 'blood oranges']

    Out: ['I like <placeholder>', 'I like <placeholder>', 'I like <placeholder> and <placeholder> and pineapples', 'I like oranges, but not <placeholder>.']

Right now, I have something like

    for document in document_list:
        for term in search_list:
            document = re.sub(r'\b{}\b'.format(term), '<placeholder>', document)

This seems to work, but is really (and I mean really) slow. If I were to run this on my full dataset of ~10k documents, with a search_list of probably ~5k terms, it would take several days to finish. Is there a better way to approach this problem and make it faster?

Thanks a lot in advance!

Edit1: Maybe it's worth mentioning that the terms in search_list can also consist of multiple words. Edited the example accordingly.

Edit2: Thanks for pointing to the other thread, I had not found that one before. Sorry about that. As mentioned below, I'd still be curious to hear other, non-regex solutions just to learn about them. The actual problem has been solved through the other thread, though. =)

Sure, I'm open to whatever works best. Regex was just the first (and only) thing that came to my mind.
– I_love_Norway, Commented Nov 10, 2018 at 18:03

javidcf · Accepted Answer · 2018-11-09 17:08:10Z

This is one possibility:

    import re

    document_list = ['I like apples', 'I like bananas', 'I like apples and bananas and pineapples']
    search_list = ['apples', 'bananas']
    # Combine all terms into one alternation so each document is scanned only once;
    # re.escape guards against terms that contain regex metacharacters.
    search_re = re.compile(r'\b(' + '|'.join(map(re.escape, search_list)) + r')\b')
    replacement = r'<placeholder>'
    document_replaced = [search_re.sub(replacement, doc) for doc in document_list]
    print(*document_replaced, sep='\n')

Output:

    I like <placeholder>
    I like <placeholder>
    I like <placeholder> and <placeholder> and pineapples
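Since the question was edited to include multi-word terms such as 'blood oranges', here is a minimal sketch of the same single-pattern idea applied to the full example from the question. The helper names `build_pattern` and `replace_terms` are hypothetical, not part of the answer above. The sketch assumes that terms start and end with word characters (otherwise `\b` will not behave as expected) and sorts terms longest-first so that a longer term is tried before a shorter one that shares its prefix.

```python
import re


def build_pattern(terms):
    """Compile one alternation pattern for all terms (hypothetical helper)."""
    # Longest terms first: regex alternation tries branches left to right,
    # so a longer term should come before a shorter term sharing its prefix.
    ordered = sorted(terms, key=len, reverse=True)
    # re.escape makes every term match literally, including spaces and dots.
    alternation = '|'.join(re.escape(term) for term in ordered)
    return re.compile(r'\b(?:' + alternation + r')\b')


def replace_terms(documents, terms, replacement='<placeholder>'):
    """Return a new list with every whole-word term occurrence replaced."""
    if not terms:
        # An empty alternation would match at every word boundary, so skip it.
        return list(documents)
    pattern = build_pattern(terms)
    return [pattern.sub(replacement, doc) for doc in documents]


document_list = [
    'I like apples',
    'I like bananas',
    'I like apples and bananas and pineapples',
    'I like oranges, but not blood oranges.',
]
search_list = ['apples', 'bananas', 'blood oranges']

result = replace_terms(document_list, search_list)
print(*result, sep='\n')
```

The pattern is compiled once and each document is scanned a single time, instead of one `re.sub` call per term per document as in the nested loop from the question. How much faster this is with ~5k terms depends on the data, so it is worth timing on a sample first.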
