How to remove rows from a dataframe based on their column values existence in another df?

Question

Given two dataframes A and B, which both have columns 'x' and 'y', how can I efficiently remove all rows in A whose (x, y) pairs appear in B?

I thought about implementing it using a row iterator on A and then per pair checking if it exists in B but I am guessing this is the least efficient way...

I tried using the .isin function as suggested in "Filter dataframe rows if value in column is in a set list of values", but couldn't make it work for multiple columns.

Example dataframes:

    A = pd.DataFrame([[1, 2], [1, 4], [3, 4], [2, 4]], columns=['x', 'y'])
    B = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])

C should contain [1,4] and [2,4] after the operation.

Are there possible duplicate rows in A itself? And what should happen to them?
– joris, Commented Dec 19, 2013 at 10:19
@TomAugspurger I don't think it will work for this case: it needs to match whole rows (the combination of two values), not just a single value within a column, and it does not need to match on the index.
– joris, Commented Dec 19, 2013 at 13:04

joris · Accepted Answer · 2013-12-19 13:42:22Z

In pandas master (and in the upcoming 0.13 release), isin will also accept DataFrames. The problem is that it only looks at the values in each column, not at the exact row combination across columns.

Taken from @AndyHayden comment here (https://github.com/pydata/pandas/issues/4421#issuecomment-23052472), a similar approach with set:

    In [3]: mask = pd.Series(map(set(B.itertuples(index=False)).__contains__, A.itertuples(index=False)))

    In [4]: A[~mask]
    Out[4]:
       x  y
    1  1  4
    3  2  4

Or a more readable version:

    set_B = set(B.itertuples(index=False))
    mask = [x not in set_B for x in A.itertuples(index=False)]

The possible advantage of this compared to @Acorbe's answer is that this preserves the index of A and does not remove duplicate rows in A (but that depends on what you want of course).
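The readable version can be run end to end as a minimal sketch, using the example frames from the question. The names `rows_in_B`, `keep` and `C` are illustrative, and `name=None` is an assumption that your pandas version supports that `itertuples` argument (it makes the rows plain tuples):

```python
import pandas as pd

A = pd.DataFrame([[1, 2], [1, 4], [3, 4], [2, 4]], columns=['x', 'y'])
B = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])

# Rows of B as plain tuples, stored in a set for fast membership tests.
rows_in_B = set(B.itertuples(index=False, name=None))

# One boolean per row of A: True when the (x, y) pair is absent from B.
keep = [row not in rows_in_B for row in A.itertuples(index=False, name=None)]

# Indexing with the boolean list keeps A's original index labels.
C = A[keep]
print(C)
```

Because the mask is built row by row, any duplicate rows in A that are not in B would all be kept.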

As I said, in 0.13 isin will accept DataFrames. However, I don't think this solves the issue, because the index also has to match:

    In [27]: A.isin(B)
    Out[27]:
           x      y
    0   True   True
    1  False   True
    2  False  False
    3  False  False

You can get around the index requirement by converting B to a dict, but then isin no longer looks at the combination of both columns, only at each column separately:

    In [28]: A.isin(B.to_dict(outtype='list'))
    Out[28]:
           x     y
    0   True  True
    1   True  True
    2   True  True
    3  False  True
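Both isin behaviours can be checked with a short sketch. It assumes a more recent pandas than the 0.13 discussed here, where the `to_dict` keyword is spelled `orient` rather than `outtype`; the names `aligned`, `per_column` and `flagged` are illustrative:

```python
import pandas as pd

A = pd.DataFrame([[1, 2], [1, 4], [3, 4], [2, 4]], columns=['x', 'y'])
B = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])

# DataFrame argument: cells are compared only where index and column labels match.
aligned = A.isin(B)

# Dict argument: each column of A is checked against that column's values in B.
per_column = A.isin(B.to_dict(orient='list'))

# Requiring a match in both columns still flags row 1, (1, 4), which is not a row of B.
flagged = per_column.all(axis=1)

print(aligned)
print(per_column)
print(flagged)
```

Row 1 is flagged because 1 occurs somewhere in B's x column and 4 occurs somewhere in B's y column, even though the pair (1, 4) never occurs together in B.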

Preserving the index is indeed an important detail I haven't taken into account and do need. In the more readable version a conversion of the mask to a pandas series is missing. Thank you!
You're welcome! In the first version, I converted to a Series because I needed the boolean inverted (with ~), and this is not possible with a list. But in the more readable version, this inversion is already included in the list comprehension (not in). But indeed, then just A[mask] is needed without ~.

tandy · Accepted Answer · 2015-01-16 04:36:09Z

For those looking for a single-column solution:

new_df = df1[~df1["column_name"].isin(df2["column_name"])]

The ~ operator inverts the boolean mask element-wise (logical NOT).

So this creates a new dataframe containing only the rows of df1 whose value in "column_name" is not found in df2["column_name"].
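As a small illustration, here is the single-column filter applied to the frames from the question, with 'x' standing in for "column_name" (an arbitrary choice for this sketch):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [1, 4], [3, 4], [2, 4]], columns=['x', 'y'])
df2 = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])

# Keep the rows of df1 whose 'x' value does not occur anywhere in df2['x'].
new_df = df1[~df1['x'].isin(df2['x'])]
print(new_df)
```

Only the row (2, 4) survives. The row (1, 4) is dropped as well, because its x value alone appears in df2, so this form does not answer the two-column question.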

Acorbe · Accepted Answer · 2013-12-19 12:01:58Z

One option would be to generate two sets, say A_set, B_set, whose elements are the rows of the DataFrames. Hence, the fast set difference operation A_set - B_set can be used.

    A_set = set(map(tuple, A.values))  # rows must be hashable before they can go into a set
    B_set = set(map(tuple, B.values))
    C_set = A_set - B_set
    C_set
    {(1, 4), (2, 4)}

    C = pd.DataFrame([c for c in C_set], columns=['x', 'y'])

       x  y
    0  2  4
    1  1  4

This procedure involves some preliminary conversion operations, though.
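A runnable sketch of the set-difference route follows. The extra duplicate row in A and the `sorted` call are additions for illustration only: the duplicate shows what happens to repeated rows, and sorting merely makes the row order reproducible, since sets are unordered.

```python
import pandas as pd

# A contains the row (1, 4) twice to show how duplicates are treated.
A = pd.DataFrame([[1, 2], [1, 4], [1, 4], [3, 4], [2, 4]], columns=['x', 'y'])
B = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])

# Each row becomes a hashable tuple so it can be stored in a set.
A_set = set(map(tuple, A.values))
B_set = set(map(tuple, B.values))

# Set difference, then back to a DataFrame with a fresh 0..n-1 index.
C = pd.DataFrame(sorted(A_set - B_set), columns=['x', 'y'])
print(C)
```

The result has two rows even though (1, 4) appeared twice in A, and A's original index labels are gone.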
