3

Given two dataframes A and B, both with columns 'x' and 'y', how can I efficiently remove all rows in A whose (x, y) pair appears in B?

I thought about iterating over the rows of A and checking, for each pair, whether it exists in B, but I am guessing this is the least efficient way...

I tried using the .isin function as suggested in "Filter dataframe rows if value in column is in a set list of values", but couldn't make it work for multiple columns.

Example dataframes:

    A = pd.DataFrame([[1, 2], [1, 4], [3, 4], [2, 4]], columns=['x', 'y'])
    B = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])

C should contain [1,4] and [2,4] after the operation.

4
  • Are there possible duplicate rows in A itself? And what should be done with them? Commented Dec 19, 2013 at 10:19
  • The isin method will work with DataFrames in 0.13. Commented Dec 19, 2013 at 12:58
  • @TomAugspurger I don't think it will work for this case: the whole row (the combination of the two values) has to match, not just a single value within a column, and the match should not depend on the index. Commented Dec 19, 2013 at 13:04
  • @joris, there are no duplicate rows in A. Commented Dec 19, 2013 at 13:38

3 Answers

4

In pandas master (or in the upcoming 0.13) isin will also accept DataFrames, but the problem is that it only looks at the values in each column, not at the exact row combination of the columns.

Taken from @AndyHayden's comment here (https://github.com/pydata/pandas/issues/4421#issuecomment-23052472), an approach based on a set of row tuples:

    In [3]: mask = pd.Series(map(set(B.itertuples(index=False)).__contains__, A.itertuples(index=False)))

    In [4]: A[~mask]
    Out[4]:
       x  y
    1  1  4
    3  2  4

Or a more readable version:

    set_B = set(B.itertuples(index=False))
    mask = [x not in set_B for x in A.itertuples(index=False)]

The possible advantage of this compared to @Acorbe's answer is that this preserves the index of A and does not remove duplicate rows in A (but that depends on what you want of course).
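As a self-contained sketch of the mask approach above, using the example frames from the question. The helper name `drop_matching_rows` is made up for illustration, and `name=None` is passed to `itertuples` so that plain tuples are compared:

```python
import pandas as pd


def drop_matching_rows(a, b, cols):
    """Return the rows of `a` whose values in `cols` do not occur as a row of `b`."""
    # Hashable row tuples of b, for fast membership tests.
    seen = set(b[cols].itertuples(index=False, name=None))
    # Boolean list: True for the rows of a that should be kept.
    keep = [row not in seen for row in a[cols].itertuples(index=False, name=None)]
    return a[keep]


A = pd.DataFrame([[1, 2], [1, 4], [3, 4], [2, 4]], columns=['x', 'y'])
B = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])

C = drop_matching_rows(A, B, ['x', 'y'])
print(C)
```

Because the mask is applied directly to A, the surviving rows keep their original index labels, and duplicate rows in A would be kept as well.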


As I said, 0.13 will accept DataFrames in isin. However, I don't think this will solve the issue, because the index also has to match:

    In [27]: A.isin(B)
    Out[27]:
           x      y
    0   True   True
    1  False   True
    2  False  False
    3  False  False

You can work around the index requirement by converting B to a dict, but then isin does not look at the combination of both columns, only at each column separately:

    In [28]: A.isin(B.to_dict(orient='list'))  # this argument was named outtype in old pandas versions
    Out[28]:
           x     y
    0   True  True
    1   True  True
    2   True  True
    3  False  True
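One way to get a genuinely row-wise membership test while staying inside pandas is to compare the (x, y) pairs as a MultiIndex. This is only a sketch and not part of the original answer; it assumes a pandas version that provides `MultiIndex.from_frame`, and the names `pairs_A` and `pairs_B` are illustrative:

```python
import pandas as pd

A = pd.DataFrame([[1, 2], [1, 4], [3, 4], [2, 4]], columns=['x', 'y'])
B = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])

# Treat each (x, y) row as a single key.
pairs_A = pd.MultiIndex.from_frame(A[['x', 'y']])
pairs_B = pd.MultiIndex.from_frame(B[['x', 'y']])

# True where the whole (x, y) pair of A also occurs in B.
mask = pairs_A.isin(pairs_B)

C = A[~mask]
print(C)
```

Unlike the per-column dict variant, row 1 of A, (1, 4), is not flagged here, because B contains no (1, 4) pair even though it contains a 1 in x and a 4 in y.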

2 Comments

Preserving the index is indeed an important detail I haven't taken into account and do need. In the more readable version a conversion of the mask to a pandas series is missing. Thank you!
You're welcome! In the first version, I converted to a Series because I needed the boolean inverted (with ~), and this is not possible with a list. But in the more readable version, this inversion is already included in the list comprehension (not in). But indeed, then just A[mask] is needed without ~.
3

For those looking for a single-column solution:

new_df = df1[~df1["column_name"].isin(df2["column_name"])] 

The ~ operator inverts the boolean mask (element-wise NOT).

So this creates a new dataframe containing only the rows of df1 whose values in df1["column_name"] are not found in df2["column_name"].


0

One option would be to generate two sets, say A_set, B_set, whose elements are the rows of the DataFrames. Hence, the fast set difference operation A_set - B_set can be used.

    A_set = set(map(tuple, A.values))  # rows must be hashable (tuples) before they can go in a set
    B_set = set(map(tuple, B.values))
    C_set = A_set - B_set
    C_set
    # {(1, 4), (2, 4)}
    C = pd.DataFrame([c for c in C_set], columns=['x', 'y'])
    C
    #    x  y
    # 0  2  4
    # 1  1  4

This procedure involves some preliminary conversion operations, though.
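If the conversion to sets is undesirable, a merge-based anti-join is another possibility. This is a sketch of a different technique, not the set-difference method described above, and it assumes a pandas version in which `DataFrame.merge` supports `indicator=True`:

```python
import pandas as pd

A = pd.DataFrame([[1, 2], [1, 4], [3, 4], [2, 4]], columns=['x', 'y'])
B = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])

# Left-join A against the distinct rows of B; the extra '_merge' column
# records whether each row was found only in A or in both frames.
merged = A.merge(B.drop_duplicates(), on=['x', 'y'], how='left', indicator=True)

# Keep the rows that have no partner in B and drop the helper column.
C = merged.loc[merged['_merge'] == 'left_only', ['x', 'y']]
print(C)
```

Like the set difference, this does not carry over the original index of A in general, since the merge builds a fresh one, but it keeps the row order of A and any duplicate rows in A.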
