
I'm looking for an efficient way to solve this problem.

Let's say we want to find which words from a list occur in a string, ignoring case, but instead of storing the matched text we want each word with the same casing it has in the original list.

For example:

words_to_match = ['heLLo', 'jumP', 'TEST', 'RESEARCH stuff']
text = 'hello this is jUmp test jump and research stuff'
# Result should be {'TEST', 'heLLo', 'jumP', 'RESEARCH stuff'}

Here is my current approach:

words_to_match = ['heLLo', 'jumP', 'TEST', 'RESEARCH stuff'] 

I convert this to the following regex:

import re
regex = re.compile(r'\bheLLo\b|\bjumP\b|\bTEST\b|\bRESEARCH stuff\b', re.IGNORECASE)

Then

word_founds = re.findall(regex, 'hello this is jUmp test jump and research stuff')
normalization_dict = {w.lower(): w for w in words_to_match}
# normalization_dict : {'hello': 'heLLo', 'jump': 'jumP', 'test': 'TEST', 'research stuff': 'RESEARCH stuff'}
final_list = [normalization_dict[w.lower()] for w in word_founds]
# final_list : ['heLLo', 'jumP', 'TEST', 'jumP', 'RESEARCH stuff']
final_result = set(final_list)
# final_result : {'TEST', 'heLLo', 'jumP', 'RESEARCH stuff'}

This is my expected result; I just want to know whether there is a faster or more elegant way to solve this problem.
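For reference, the approach described above can be sketched end to end as a single function. This is a minimal sketch, not a definitive implementation: the function name `find_words` is hypothetical, the pattern is built from the list instead of being written by hand, and `re.escape` is added on the assumption that entries may contain regex metacharacters.

```python
import re


def find_words(words_to_match, text):
    """Return the entries of words_to_match that occur in text, ignoring case.

    Hypothetical helper name; results keep the casing used in words_to_match.
    """
    if not words_to_match:
        # An empty alternation would match the empty string everywhere.
        return set()
    # Map each lowercased entry back to its original casing.
    normalization_dict = {w.lower(): w for w in words_to_match}
    # One alternation wrapped in word boundaries; re.escape protects
    # entries that contain regex metacharacters.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words_to_match) + r")\b",
        re.IGNORECASE,
    )
    return {normalization_dict[m.lower()] for m in pattern.findall(text)}


words_to_match = ['heLLo', 'jumP', 'TEST', 'RESEARCH stuff']
text = 'hello this is jUmp test jump and research stuff'
print(find_words(words_to_match, text))
```

The text is scanned once regardless of how many entries the list has, which is why this form tends to scale better than one search per word.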

2
  • How many words do you have? Commented Mar 27, 2019 at 20:42
  • To be clear, I got the result that I wanted; I just want to know if there is something more elegant or faster. Commented Mar 27, 2019 at 20:45

2 Answers

2

This can be done in a single line, if you're still okay with using regex.

results = set(word for word in re.findall(r"[\w']+", text) if word.lower() in [w.lower() for w in words_to_match]) 

The regex is only used here to split the text variable into words.

Edit:

You could also use:

import string
results = set(word for word in "".join(c if c not in string.punctuation else " " for c in text).split() if word.lower() in [w.lower() for w in words_to_match])

if you want to avoid importing re, although you then have to import string instead.

Edit 2: (after properly reading the question, hopefully)

results = set(word for word in words_to_match if word.lower() in text.lower()) 

This works with multi-word searches as well.

Edit 3:

results = set(word for word in words_to_match if re.search(r"\b" + re.escape(word.lower()) + r"\b", text.lower()))
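If one search per word becomes slow for a long list, one possible alternative is a single combined pattern with one capturing group per entry, so that the number of the group that matched identifies the original entry and no lookup dictionary is needed. The sketch below is an assumption-laden illustration, not part of the original answer: the function name `match_original_case` is hypothetical, and it assumes every entry starts and ends with a word character so that `\b` behaves as expected.

```python
import re


def match_original_case(words_to_match, text):
    """Return the entries of words_to_match found in text, ignoring case.

    Hypothetical helper: each entry gets its own capturing group, and
    Match.lastindex tells us which group (and so which entry) matched.
    """
    if not words_to_match:
        # An empty pattern would match everywhere with no group set.
        return set()
    # Try longer entries first, so a multi-word entry wins over a
    # shorter entry that is a prefix of it.
    ordered = sorted(words_to_match, key=len, reverse=True)
    pattern = re.compile(
        "|".join(r"\b({})\b".format(re.escape(w)) for w in ordered),
        re.IGNORECASE,
    )
    # Exactly one group participates in each match; lastindex is 1-based.
    return {ordered[m.lastindex - 1] for m in pattern.finditer(text)}


words_to_match = ['heLLo', 'jumP', 'TEST', 'RESEARCH stuff']
text = 'hello this is jUmp test jump and research stuff'
print(match_original_case(words_to_match, text))
```

Matches are non-overlapping, so when one entry is contained in another only the one that matches first at a given position is reported. Whether this is actually faster than the dictionary lookup depends on the list size and would need to be measured.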

5 Comments

Hi Collin, thank you for your solution. It helped me realize that I need to edit my question, because words_to_match can contain multi-word entries.
Even in the single-word case, your first solution returns the casing of the matched string and not that of the original list.
Oh, I'm sorry, I misread the question. That actually makes things a little easier; see my edit.
EDIT 2: Unfortunately it doesn't work (no word boundaries); see the comments on @Philosophist's answer.
Thanks, it works :) ! However, doing a search for each word is quite slow compared to what I proposed. This is a toy example, but in reality the 'words_to_match' list is very long.
0

Try this:

words_to_match = ['heLLo', 'jumP', 'TEST']
text = 'hello this is jUmp test jump'
result = set()
# 'word' is used instead of 'str' to avoid shadowing the built-in type
for word in words_to_match:
    if word.lower() in text.lower():
        result.add(word)

4 Comments

@abcdaire Ah, doh'es me. Fixed.
With your solution there are no word boundaries: if you try it with 'hellouu should not be captured', it will capture 'hellouu' as heLLo.
Ohh... hmm. Yeah, it doesn't get more elegant than using regex for that.
Yes, I guess. I basically wanted to see if someone could come up with a solution that avoids the normalization_dict. Thanks for your effort :)
