Eliminate pandas dataframe rows with partial matches

Question

I want to compare two pandas dataframes and find out the rows that are only in df1, by comparing the values in column A and B. I feel like I could somehow perform this by using merge but cannot figure out..

import pandas as pd df1 = pd.DataFrame([[1,11, 111], [2,22, 222], [3, 33, 333]], columns=['A', 'B', 'C']) df2 = pd.DataFrame([[1, 11]], columns=['A', 'B'])

df1

 A B C 0 1 11 111 1 2 22 222 2 3 33 333

df2

 A B 0 1 11

Dataframe I want to see

 A B C 1 2 22 222 2 3 33 333

In the example you are showing, it seems that you want to eliminate rows with partial matches. Is that correct? — nitin
– nitin, Commented Oct 21, 2015 at 21:20

Community · Accepted Answer · 2017-05-23 12:02:02Z

Based on this approach:

import pandas as pd df1 = pd.DataFrame([[1,11, 111], [2,22, 222], [3, 33, 333]], columns=['A', 'B', 'C']) df2 = pd.DataFrame([[1, 11]], columns=['A', 'B'])

Concatenate the dataframes:

df = pd.concat([df1, df2]) df.reset_index(drop=True)

Groupby your desired comparison columns - in your case, A and B:

df_gpby = df.groupby(['A','B'])

Get the indexes for the groups with only one value - i.e. unique A,B pairs:

idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

Subset the concatenated dataframe by the indexes:

df.iloc[idx]

Results in:

 A B C 1 2 22 222 2 3 33 333

oz123 · Accepted Answer · 2015-10-21 21:44:53Z

While vmg's solution is neat, it requires you to know which columns you need to group by. A more generic approach is this:

First subtract one data frame from another:

In [46]: df3 = df1.subtract(df2) In [47]: df3 Out[47]: A B C 0 0 0 NaN 1 NaN NaN NaN 2 NaN NaN NaN

You see that the interesting rows, are those who don't exist in df2, so they are all NaN. Using numpy method you can find those rows:

In [50]: np.isnan(df3.iloc[0]) Out[50]: A False B False C True Name: 0, dtype: bool In [51]: np.isnan(df3.iloc[1]) Out[51]: A True B True C True Name: 1, dtype: bool

Now, that you know how to locate those rows, you can do a crazy one liner:

In [52]: df1.iloc[[idx for idx, row in df3.iterrows() if all(np.isnan(df3.iloc[idx]))]] Out[52]: A B C 1 2 22 222 2 3 33 333

update, let's add a generic function

def substract_dataframe(df1, df2): for i in [df1, df2]: if not isinstance(i, pd.DataFrame): raise ValueError(("Wrong argument given! All arguments must be DataFrame instances")) df = df1.subtract(df2) return df1.iloc[[idx for idx, row in df.iterrows() if all(np.isnan(df.iloc[idx]))]]

testing ...

In [54]: substract_dataframe(df1, df2) Out[54]: A B C 1 2 22 222 2 3 33 333 In [55]: substract_dataframe(df1, 'sdf') --------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-55-6ce801e88ce4> in <module>() ----> 1 substract_dataframe(df1, 'sdf') <ipython-input-53-e5d7db966311> in substract_dataframe(df1, df2) 2 for i in [df1, df2]: 3 if not isinstance(i, pd.DataFrame): ----> 4 raise ValueError("Wrong argument given! All arguments must be DataFrame instances") 5 df = df1.subtract(df2) 6 return df1.iloc[[idx for idx, row in df.iterrows() if all(np.isnan(df.iloc[idx]))]] ValueError: Wrong argument given! All arguments must be DataFrame instances

Zero · Accepted Answer · 2017-10-17 15:10:17Z

Use merge with indicator, was added in 0.17.0 version

In [3544]: df1.merge(df2, how='left', indicator='_b').query('_b != "both"').drop('_b', 1) Out[3544]: A B C 1 2 22 222 2 3 33 333

Details

In [3545]: df1.merge(df2, how='left', indicator='_b') Out[3545]: A B C _b 0 1 11 111 both 1 2 22 222 left_only 2 3 33 333 left_only In [3546]: df1.merge(df2, how='left', indicator='_b').query('_b != "both"') Out[3546]: A B C _b 1 2 22 222 left_only 2 3 33 333 left_only

Collectives™ on Stack Overflow

Eliminate pandas dataframe rows with partial matches

3 Answers 3

Comments

update, let's add a generic function

Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

update, let's add a generic function

Comments

Comments

Linked

Related