
Python novice here.

I have a list of documents, and another list of search terms. I would now like to iterate over each document, and replace all occurrences of any of the search terms with something like <placeholder>. It should, however, only match full words, so text.replace probably does not work?

So, something like this:

    document_list = ['I like apples', 'I like bananas', 'I like apples and bananas and pineapples', 'I like oranges, but not blood oranges.']
    search_list = ['apples', 'bananas', 'blood oranges']

    Out: ['I like <placeholder>', 'I like <placeholder>', 'I like <placeholder> and <placeholder> and pineapples', 'I like oranges, but not <placeholder>.']

Right now, I have something like

    for document in document_list:
        for term in search_list:
            document = re.sub(r'\b{}\b'.format(term), '<placeholder>', document)

This seems to work, but is really (and I mean really) slow. If I were to run this on my full dataset of ~10k documents, with a search_list of probably ~5k terms, it would take several days to finish. Is there a better way to approach this problem and make it faster?

Thanks a lot in advance!

Edit1: Maybe it's worth mentioning that the terms in search_list can also consist of multiple words. Edited the example accordingly.

Edit2: Thanks for pointing to the other thread, I had not found that one before. Sorry about that. As mentioned below, I'd still be curious to hear other, non-regex solutions just to learn about them. The actual problem has been solved through the other thread, though. =)

  • Are you open to non-regex solutions? Commented Nov 9, 2018 at 17:02
  • Sure, I'm open for whatever works best. Regex was just the first (and only) thing that came to my mind. Commented Nov 10, 2018 at 18:03

1 Answer


This is one possibility:

    import re

    document_list = ['I like apples', 'I like bananas', 'I like apples and bananas and pineapples']
    search_list = ['apples', 'bananas']

    # Join all terms into a single alternation so each document is scanned only once
    search_re = re.compile(r'\b(' + '|'.join(search_list) + r')\b')
    replacement = r'<placeholder>'

    document_replaced = [search_re.sub(replacement, doc) for doc in document_list]
    print(*document_replaced, sep='\n')

Output:

    I like <placeholder>
    I like <placeholder>
    I like <placeholder> and <placeholder> and pineapples
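
For the multi-word terms mentioned in the question's edit, a slightly more defensive variant of the same idea might look like the sketch below. It assumes the terms are literal strings (hence `re.escape`) and that a longer term should win when a shorter term is its prefix (hence the sort by length). The helper name `build_pattern` is made up for this example. Note that `\b` only behaves as a whole-word boundary when a term starts and ends with a word character.

```python
import re

document_list = [
    'I like apples',
    'I like bananas',
    'I like apples and bananas and pineapples',
    'I like oranges, but not blood oranges.',
]
search_list = ['apples', 'bananas', 'blood oranges']


def build_pattern(terms):
    """Compile one regex matching any of the terms as whole words (hypothetical helper)."""
    # Longest terms first, so a multi-word term is tried before a shorter term it starts with.
    ordered = sorted(terms, key=len, reverse=True)
    # re.escape keeps characters such as '.' or '+' inside a term from acting as regex syntax.
    alternation = '|'.join(re.escape(term) for term in ordered)
    return re.compile(r'\b(?:' + alternation + r')\b')


search_re = build_pattern(search_list)
document_replaced = [search_re.sub('<placeholder>', doc) for doc in document_list]
print(*document_replaced, sep='\n')
```

The pattern is compiled once and each document is scanned a single time, instead of running one `re.sub` per term per document as in the nested loop from the question.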