Remove all rows from pandas dataframe based on a second dataframe

Question

I'm relatively new to pandas, so forgive the possibly simple question. I have two dataframes, one containing all of my data, in this case filenames, and the other containing filenames I want to remove.

What I would like to do is remove all of the rows in the master dataframe where the filename appears in the second dataframe. There are several thousand different filenames, so I'm looking for some kind of generalisation of df = df[df["filename"].str.contains("A") == False] that takes a whole dataframe of filenames rather than a single string. Both dataframes contain lots of duplicate values.

master_df = pd.DataFrame({'filename': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
                          'label': [0, 0, 0, 1, 0, 1, 0, 1, 1]})
files_to_remove = pd.DataFrame({'filename': ['A', 'A', 'A', 'A', 'C', 'C', 'C'],
                                'label': [0, 0, 0, 1, 0, 1, 1]})
desired_result = pd.DataFrame({'filename': ['B', 'B'],
                               'label': [0, 1]})

Thanks for the help!

You can merge the dataframes and exclude the rows where the key coming from the second dataframe is not NaN, i.e. the rows that found a match.
– Gabriel Doretto, Commented May 5, 2022 at 14:39
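
The merge idea from the comment above can be sketched roughly as follows. This is only one possible reading of it: it uses the `indicator=True` option of `DataFrame.merge` instead of checking for NaN directly, and the names `keys`, `merged` and `result` are made up for illustration. The keys are deduplicated first on the assumption that duplicate filenames in `files_to_remove` should not multiply rows in the merge.

```python
import pandas as pd

master_df = pd.DataFrame({'filename': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
                          'label': [0, 0, 0, 1, 0, 1, 0, 1, 1]})
files_to_remove = pd.DataFrame({'filename': ['A', 'A', 'A', 'A', 'C', 'C', 'C'],
                                'label': [0, 0, 0, 1, 0, 1, 1]})

# Keep only the unique filenames so the left merge cannot duplicate master rows.
keys = files_to_remove[['filename']].drop_duplicates()

# indicator=True adds a "_merge" column telling where each row's key was found.
merged = master_df.merge(keys, on='filename', how='left', indicator=True)

# "left_only" marks rows whose filename does not occur in files_to_remove.
result = merged.loc[merged['_merge'] == 'left_only', ['filename', 'label']]

print(result)
```

For a single key column this does the same job as a plain membership test, so the merge is mostly worth considering when rows have to match on several columns at once.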

Achraf Ben Salah · Accepted Answer · 2022-05-05 14:43:57Z

Try it like this:

import pandas as pd

master_df = pd.DataFrame({'filename': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
                          'label': [0, 0, 0, 1, 0, 1, 0, 1, 1]})
files_to_remove = pd.DataFrame({'filename': ['A', 'A', 'A', 'A', 'C', 'C', 'C'],
                                'label': [0, 0, 0, 1, 0, 1, 1]})

# isin() flags rows whose filename occurs in files_to_remove; ~ inverts the mask
print(master_df[~master_df.filename.isin(files_to_remove.filename)])

Output:

  filename  label
4        B      0
5        B      1
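
Note that the filtered rows keep their original index labels (4 and 5), while `desired_result` in the question is numbered from 0. A minimal sketch of bringing the two in line, assuming the old index does not need to be preserved, is to chain `reset_index(drop=True)` onto the same `isin` filter. The names `mask` and `result` are illustrative only.

```python
import pandas as pd

master_df = pd.DataFrame({'filename': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
                          'label': [0, 0, 0, 1, 0, 1, 0, 1, 1]})
files_to_remove = pd.DataFrame({'filename': ['A', 'A', 'A', 'A', 'C', 'C', 'C'],
                                'label': [0, 0, 0, 1, 0, 1, 1]})
desired_result = pd.DataFrame({'filename': ['B', 'B'],
                               'label': [0, 1]})

# isin() compares whole values (not substrings), so duplicates in either
# dataframe do not affect which rows are flagged.
mask = master_df['filename'].isin(files_to_remove['filename'])

# Keep the unflagged rows and renumber the index from 0.
result = master_df[~mask].reset_index(drop=True)

# Compare against the frame given in the question.
print(result.equals(desired_result))
```

Unlike `str.contains("A")`, which would also match a filename such as "AB", `isin` only removes exact matches, which is usually what is wanted for a list of filenames.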
