Python - Find a substring within a string using an IF statement when iterating through a pandas DataFrame with a FOR loop

Question

I have a DataFrame that looks like this...

       Variable
    0  Religion - Buddhism
    1  Source: Clickerz
    2  Religion - Islam
    3  Source: SRZ FREE
    4  Ethnicity - Mixed - White & Black African

I want to manipulate the Variable column to create a new column which looks like this...

       Variable                                   New Column
    0  Religion - Buddhism                        Buddhism
    1  Source: Clickerz                           Clickerz
    2  Religion - Islam                           Islam
    3  Source: SRZ FREE                           SRZ FREE
    4  Ethnicity - Mixed - White & Black African  Mixed - White and Black African

So that I can eventually have a DataFrame that looks like this...

       Variable   New Column
    0  Religion   Buddhism
    1  Source     Clickerz
    2  Religion   Islam
    3  Source     SRZ FREE
    4  Ethnicity  Mixed - White and Black African

I want to iterate through the Variable column and manipulate the data to create New Column. I was planning to use multiple if statements to find a specific word, for example 'Ethnicity' or 'Religion', and then apply a manipulation.

For example...

    for row in df['Variable']:
        if 'Religion' in row:
            df['New Column'] = ...
        elif 'Ethnicity' in row:
            df['New Column'] = ...
        elif 'Source' in row:
            df['New Column'] = ...
        else:
            df['New Column'] = 'Not Applicable'

Even though type(row) returns 'str', meaning each row is a string, this code keeps returning a new column in which every entry is 'Not Applicable'. It does not seem to detect any of the strings in any row of the DataFrame, even though I can see they are there.

I am sure there is an easy way to do this...PLEASE HELP!

I have tried the following as well...

    for row in df['Variable']:
        if row.find('Religion') != -1:
            df['New Column'] = ...
        elif row.find('Ethnicity') != -1:
            df['New Column'] = ...
        elif row.find('Source') != -1:
            df['New Column'] = ...
        else:
            df['New Column'] = 'Not Applicable'

And I continue to get all entries of the new column being 'Not Applicable'. Once again it is not finding the string in the existing column.

Is it an issue with the data type or something?

Xiddoc · Accepted Answer · 2021-09-29 19:10:32Z

You could use a nested for loop:

    # For each row in the dataframe (idx is the row label)
    for idx, row in df['column_variable'].items():
        # Set boolean to indicate if a substring was found
        substr_found = False
        # For each substring
        for sub_str in ["substring1", "substring2"]:
            # If the substring is in the row
            if sub_str in row:
                # Execute code... (.loc sets this row only, not the whole column)
                df.loc[idx, 'new_column'] = ...
                # Substring was found!
                substr_found = True
        # If substring was not found
        if not substr_found:
            # Set invalid code...
            df.loc[idx, 'new_column'] = 'Not Applicable'

TCB919 · Accepted Answer · 2021-09-29 21:19:37Z

Updated to match your Dataframe!

import pandas as pd

Your Dataframe

    lst = []
    for i in ['Religion - Buddhism', 'Source: Clickerz', 'Religion - Islam',
              'Source: SRZ FREE', 'Ethnicity - Mixed - White & Black African']:
        item = [i]
        lst.append(item)
    df = pd.DataFrame.from_records(lst)
    df.columns = ['variable']
    print(df)

       variable
    0  Religion - Buddhism
    1  Source: Clickerz
    2  Religion - Islam
    3  Source: SRZ FREE
    4  Ethnicity - Mixed - White & Black African

Using a For Loop and Partial String matching in conjunction with `.loc` to set the new values

    # .items() replaces .iteritems(), which was removed in pandas 2.0
    for x, y in df['variable'].items():
        if 'religion' in y.lower():
            z = y.split('-')
            df.loc[x, 'variable'] = z[0].strip()
            df.loc[x, 'value'] = ''.join(z[1:]).strip()
        if 'source' in y.lower():
            z = y.split(':')
            df.loc[x, 'variable'] = z[0].strip()
            df.loc[x, 'value'] = ''.join(z[1:]).strip()
        if 'ethnicity' in y.lower():
            z = y.split('-')
            df.loc[x, 'variable'] = z[0].strip()
            df.loc[x, 'value'] = ''.join(z[1:]).strip()
    print(df)

       variable   value
    0  Religion   Buddhism
    1  Source     Clickerz
    2  Religion   Islam
    3  Source     SRZ FREE
    4  Ethnicity  Mixed  White & Black African
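If the goal is only to split each entry into a label and a value, a loop-free variant may be worth trying. The sketch below is not part of the answer above. It assumes every entry has the form `label - value` or `label: value`; the regular expression and the name `parts` are illustrative choices, and it relies on `Series.str.extract`, which turns named groups into column names.

```python
import pandas as pd

df = pd.DataFrame({
    "variable": [
        "Religion - Buddhism",
        "Source: Clickerz",
        "Religion - Islam",
        "Source: SRZ FREE",
        "Ethnicity - Mixed - White & Black African",
    ]
})

# Split on the FIRST "-" or ":" only, so later hyphens stay in the value.
# The named groups become the column names of the extracted DataFrame.
pattern = r"^(?P<variable>[^-:]+?)\s*[-:]\s*(?P<value>.*)$"
parts = df["variable"].str.extract(pattern)

print(parts)
```

Entries that do not match the pattern come back as NaN in both columns, so they are easy to spot and handle separately.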

bayesien · Accepted Answer · 2021-09-29 21:27:30Z

As much as possible, you should avoid looping through rows when manipulating a DataFrame. This article explains the more efficient alternatives.

You are basically attempting to translate strings based on some fixed map. Naturally, a dict comes to mind:

    substring_map = {
        "at": "pseudo-cat",
        "dog": "true dog",
        "bre": "something else",
        "na": "not applicable"
    }

This map could be read from a file, e.g., a JSON file, in the scenario where you are handling a large number of substrings.

The substring matching logic can now be decoupled from the map definition:

    def translate_substring(x):
        # Return the translation for the first substring found in x
        for substring, new_string in substring_map.items():
            if substring in x:
                return new_string
        return "not applicable"

Use apply with the 'mapping' function to generate your target column:

    import pandas as pd

    df = pd.DataFrame({"name": ["cat", "dogg", "breeze", "bred", "hat", "misty"]})
    df["new_column"] = df["name"].apply(translate_substring)

    # df:
    #      name  new_column
    # 0  cat     pseudo-cat
    # 1  dogg    true dog
    # 2  breeze  something else
    # 3  bred    something else
    # 4  hat     pseudo-cat
    # 5  misty   not applicable

This code, applied to pd.concat([df] * 10000) (60,000 rows), runs in 42 ms in a Colab notebook. In comparison, using iterrows takes 3.67 s, which makes the apply version about 87x faster.
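For a handful of keywords, a fully vectorized variant is also possible. The following is a minimal sketch rather than part of the answer above: it assumes the keywords from the question ('Religion', 'Ethnicity', 'Source'), uses the keyword itself as the new value, and adds a made-up fourth row to show the default. `Series.str.contains` builds one boolean mask per keyword and `numpy.select` picks the first match.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Variable": [
        "Religion - Buddhism",
        "Source: Clickerz",
        "Ethnicity - Mixed - White & Black African",
        "Something else",  # hypothetical row with no keyword
    ]
})

keywords = ["Religion", "Ethnicity", "Source"]

# One boolean mask per keyword; regex=False treats the keyword as plain text.
conditions = [df["Variable"].str.contains(word, regex=False) for word in keywords]

# The first matching condition wins; rows matching none get the default.
df["Category"] = np.select(conditions, keywords, default="Not Applicable")

print(df)
```

The order of `keywords` matters when a row could match more than one of them, in the same way that the order of the `if`/`elif` branches matters in a loop.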

Andrej Kesely · Accepted Answer · 2021-09-29 19:27:37Z

You can create an empty list, add the new values to it, and then create the new column as the last step:

    all_data = []
    for row in df["column_variable"]:
        if "substring1" in row:
            all_data.append("Found 1")
        elif "substring2" in row:
            all_data.append("Found 2")
        elif "substring3" in row:
            all_data.append("Found 3")
        else:
            all_data.append("Not Applicable")

    # Assign the whole list once, after the loop
    df["new column"] = all_data
    print(df)

Prints:

       column_variable     new column
    0  this is substring1  Found 1
    1  this is substring2  Found 2
    2  this is substring1  Found 1
    3  this is substring3  Found 3

For some reason, when I type in "if 'substring' in row:" it does not find the substring in the row, even though it is clearly there. This is the main problem.
@ElliottDavey Please edit your question with sample of your dataframe.
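One possible explanation for the symptom described in the question, offered as a guess since the full data is not shown: the substring test itself works, but `df['New Column'] = value` inside the loop assigns `value` to every row, so only the branch taken for the last row survives. The sketch below uses made-up data and the hypothetical column names "Whole column" and "Per row" to contrast the two kinds of assignment.

```python
import pandas as pd

df = pd.DataFrame(
    {"Variable": ["Religion - Buddhism", "Source: Clickerz", "No keyword here"]}
)

# Whole-column assignment: each pass overwrites every row,
# so the result reflects only the last row that was visited.
for row in df["Variable"]:
    if "Religion" in row:
        df["Whole column"] = "Religion"
    else:
        df["Whole column"] = "Not Applicable"

# Per-row assignment: start from the default, then use .loc
# to change only the row that is currently being visited.
df["Per row"] = "Not Applicable"
for idx, row in df["Variable"].items():
    if "Religion" in row:
        df.loc[idx, "Per row"] = "Religion"

print(df)
```

Here the first column ends up as 'Not Applicable' in every row because the last row contains no keyword, while the second column keeps 'Religion' for the first row.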

Tommy · Accepted Answer · 2021-09-29 19:43:12Z

Maybe the shortest way I can think of:

    import pandas as pd

    # Dummy DataFrame
    df = pd.DataFrame([[1, "substr1"], [3, "bla"], [5, "bla"]], columns=["abc", "col_to_check"])
    substrings = ["substr1", "substr2", "substr3"]
    content = df["col_to_check"].unique().tolist()  # Unique content of column
    for subs in substrings:  # Go through all your substrings
        if any(subs in value for value in content):  # Check if substring occurs in any cell
            df[subs] = 0  # Fill your new column with whatever you want

cavalcantelucas · Accepted Answer · 2021-10-02 19:59:59Z

I made a function 'string_splitter' and applied it in a lambda function, which solved the issue.

I created the following function to split strings in different ways based on different substrings contained in the cell.

    def string_splitter(cell):
        word_list1 = ['Age', 'Disability', 'Religion', 'Gender']
        word_list2 = ['Number shortlisted', 'Number Hired', 'Number Interviewed']
        if any(word in cell for word in word_list1):
            # e.g. "Religion - Buddhism" -> "Buddhism"
            result = cell.split("-")[1].strip()
        elif 'Source' in cell:
            # e.g. "Source: Clickerz" -> "Clickerz"
            result = cell.split(":")[1].strip()
        elif 'Ethnicity' in cell:
            # Keep both hyphen-separated parts of the value
            result = "-".join(cell.split("-")[1:3]).strip()
        elif any(word in cell for word in word_list2):
            result = cell.split(" ")[1].strip()
        elif 'Number of Applicants' in cell:
            result = cell
        else:
            # Fallback so that result is always defined
            result = 'Not Applicable'
        return result

I then called string_splitter through a lambda passed to apply, which runs the function on each cell of the specified column in turn, as shown below:

df['Answer'] = df['Visual Type'].apply(lambda x: string_splitter(x))

string_splitter allowed me to create the New Column.

I then created another function, column_formatter, to manipulate the Variable column once the New Column had been made. The second function is shown below:

    def column_formatter(cell):
        word_list1 = ['Age', 'Gender', 'Ethnicity', 'Religion']
        word_list2 = ['Number of Applicants', 'Number Hired', 'Number shortlisted', 'Number Interviewed']
        if any(word in cell for word in word_list1):
            result = cell.split("-")[0].strip()
        elif 'Source' in cell:
            result = cell.split(":")[0].strip()
        elif 'Disability' in cell:
            result = cell.split(" ")[0].strip()
        elif any(word in cell for word in word_list2):
            result = 'Number of Applicants'
        else:
            result = 'Something wrong here'
        return result

And then called the function in the same way as follows:

df['Visual Type'] = df['Visual Type'].apply(lambda x: column_formatter(x))
