
I'm a working professional doing my MTech, and I'm trying to do a project on machine learning. I'm new to Python as well as ML. I have a column called Found that contains multiple values. I want to delete all the rows that do not match a specific condition on the Found column.

    df['Found']
    0    development
    1      func-test
    2      func-test
    3     regression
    4      func-test
    5    integration
    6      func-test
    7      func-test
    8     regression
    9      func-test

I want to keep only the rows whose Found value contains "test" or "regression".

I wrote the following code.

    remove_list = []
    for x in range(df.shape[0]):
        text = df.iloc[x]['Found']
        if not re.search('test|regression', text, re.I):
            remove_list.append(x)
    print(remove_list)
    df.drop(remove_list, inplace = True)
    print(df)

But remove_list is empty. Am I doing anything wrong here? Or is there a better way of achieving this?

    []
      Identifier Status  Priority  Severity        Found  Age  \
    0      Bug 1      V       NaN         2  development    1
    1      Bug 2      R       NaN         6    func-test  203
    2      Bug 3      V       NaN         2    func-test    9
    3      Bug 4      D       NaN         3   regression    4
    4      Bug 5      V       NaN         2    func-test    9

I also tried this, but I get the following error:

    for x in range(df.shape[0]):
        if not re.search('test|regression|customer', df.iloc[x]['Found'], re.I):
            df.drop(x, inplace = True)

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-77-14f97ad6d00a> in <module>
          1 for x in range(df.shape[0]):
    ----> 2     if not re.search('test|regression|customer', df.iloc[x]['Found'], re.I):
          3         df.drop(x, inplace = True)

    ~/Desktop/Anaconda/anaconda3/envs/nlp_course/lib/python3.7/re.py in search(pattern, string, flags)
        183     """Scan through string looking for a match to the pattern, returning
        184     a Match object, or None if no match was found."""
    --> 185     return _compile(pattern, flags).search(string)
        186
        187 def sub(pattern, repl, string, count=0, flags=0):

    TypeError: expected string or bytes-like object

1 Answer


You can do this concisely with .str.contains() and boolean indexing:

    df = df[df['Found'].str.contains('test|regression')]

    #   Identifier Status  Priority  Severity       Found  Age
    # 1      Bug 2      R       NaN         6   func-test  203
    # 2      Bug 3      V       NaN         2   func-test    9
    # 3      Bug 4      D       NaN         3  regression    4
    # 4      Bug 5      V       NaN         2   func-test    9
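If you want to try this outside your notebook, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the five rows printed in the question, so its construction (and the reduced set of columns) is an assumption for illustration, not your actual data:

```python
import pandas as pd

# Hypothetical reconstruction of the sample rows shown in the question
df = pd.DataFrame({
    "Identifier": ["Bug 1", "Bug 2", "Bug 3", "Bug 4", "Bug 5"],
    "Found": ["development", "func-test", "func-test", "regression", "func-test"],
    "Age": [1, 203, 9, 4, 9],
})

# Boolean mask: True where Found contains "test" or "regression"
mask = df["Found"].str.contains("test|regression")

# Keep only the matching rows; the original index labels are preserved
filtered = df[mask]
print(filtered)
```

Because the mask is computed for the whole column at once, there is no need to loop over rows or collect indices to drop.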

If you need to handle NaN values, chain replace(np.nan, '') before .str.contains():

    df = df[df['Found'].replace(np.nan, '').str.contains('test|regression')]
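The TypeError in the question most likely comes from the same missing values: a NaN in a text column is a float, and re.search() only accepts strings or bytes. As an alternative to replace(), str.contains() also takes an na parameter that sets the result for missing entries. A minimal sketch, assuming a Found column with one missing value (the sample data below is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one Found value is missing
df = pd.DataFrame({"Found": ["development", "func-test", np.nan, "regression"]})

# na=False treats missing values as "no match" instead of propagating NaN
mask = df["Found"].str.contains("test|regression", na=False)

# Rows with a missing Found value are dropped along with the non-matching ones
filtered = df[mask]
print(filtered)
```

This only looks at the Found column, so NaN values in other columns such as Priority do not affect which rows are kept.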

And as @sophocles mentioned, you could also make it case-insensitive with case=False:

    df = df[df['Found'].str.contains('test|regression', case=False)]

5 Comments

You can even make it case-insensitive with str.contains('test|regression', case=False).
If I am not mistaken, case=False should go inside the parentheses. Please test it :)
Yes, case=False goes inside the parentheses, and it works fine.
It looks like I have to remove the NaN values before filtering the DataFrame: df = df[df['Found'].notna()] followed by df = df[df['Found'].str.contains('test|regression', case=False)] works. Is there any way to handle this dynamically? I cannot use dropna() since I have a Priority column that is mostly NaN.
@Sai You can handle the NaNs inline by chaining an extra replace() like so: df = df[df['Found'].replace(np.nan, '').str.contains('test|regression')]
