
I need to compare two dataframes of different sizes row-wise and print out the non-matching rows. Let's take the following two:

df1 = DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'], 'Quantity': [18, 3, 5]})
df2 = DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'], 'Quantity': [2, 1, 18, 5]})

What is the most efficient way to iterate row-wise over df2 and print out the rows not in df1, e.g.

Buyer  Quantity
 Carl         2
 Mark         1

Important: I do not want to have the row

Buyer  Quantity
 Carl         3

included in the diff.

I have already tried "Comparing two dataframes of different length row by row and adding columns for each row with equal value" and "Compare two DataFrames and output their differences side-by-side", but these do not match my problem.


8 Answers


Merge the two dfs with how='outer' and pass indicator=True. This will tell you whether each row is present in both, left only, or right only; you can then filter the merged df:

In [22]:
merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] == 'right_only']

Out[22]:
  Buyer  Quantity      _merge
3  Carl         2  right_only
4  Mark         1  right_only

2 Comments

If a row appears once in one dataframe and twice in the other, this method will not detect that. Ideally, one of the entries should be marked as "right_only" and the other as being in both.
See @TobiasBergkvist's answer below to account for duplicate rows.
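Put together end to end, the merge-indicator approach above runs like this on the question's data (a sketch; column order in the output follows pandas defaults):

```python
import pandas as pd

# Example data from the question
df1 = pd.DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'], 'Quantity': [18, 3, 5]})
df2 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'], 'Quantity': [2, 1, 18, 5]})

# indicator=True adds a '_merge' column saying where each row came from:
# 'left_only', 'right_only', or 'both'
merged = df1.merge(df2, indicator=True, how='outer')

# Keep only rows present in df2 but not df1 (a "right anti join")
right_only = merged[merged['_merge'] == 'right_only']
print(right_only[['Buyer', 'Quantity']])
```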

You may find this the simplest:

df2[~df2.isin(df1)].dropna()

1 Comment

This doesn't work if the DataFrame has existing NAs.
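Beyond the NA caveat above, note that `DataFrame.isin` with a DataFrame argument matches values at the *same* index and column labels, so this approach is positional rather than set-based. A sketch on the question's data showing what it actually returns (note it keeps the fourth row, ('Carl', 5), even though that pair exists in df1, because the labels don't line up):

```python
import pandas as pd

df1 = pd.DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'], 'Quantity': [18, 3, 5]})
df2 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'], 'Quantity': [2, 1, 18, 5]})

# df2.isin(df1) is True only where the cell at the same (row label, column)
# holds the same value in both frames; rows beyond df1's index are all False.
# Masking with ~ turns matching cells into NaN, and dropna() removes any row
# that had at least one matching cell.
result = df2[~df2.isin(df1)].dropna()
print(result)
```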

@EdChum's answer is self-explanatory. But using the not-'both' condition makes more sense: you do not need to care about the order of comparison, and this is what a real diff is supposed to be. For the sake of answering your question:

merged = df1.merge(df2, indicator=True, how='outer')
merged.loc = [merged['_merge'] != 'both']

2 Comments

This is a full anti join; the OP asked for a right anti join.
The second line fails and should be changed to: merged.loc[merged['_merge'] != 'both']
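With the comment's correction applied, the full anti join (rows present in exactly one of the two frames) can be sketched as:

```python
import pandas as pd

df1 = pd.DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'], 'Quantity': [18, 3, 5]})
df2 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'], 'Quantity': [2, 1, 18, 5]})

merged = df1.merge(df2, indicator=True, how='outer')

# Full anti join: every row NOT present in both frames,
# using .loc[...] indexing (not .loc = [...], which fails)
diff = merged.loc[merged['_merge'] != 'both']
print(diff)
```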

As of Pandas 1.1.0, there is pandas.DataFrame.compare:

df1.compare(df2) 

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.compare.html

2 Comments

This doesn't work on unequal indices (i.e., two dataframes with different numbers of rows).
Yep, from the notes in the doc page: "Can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames."
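To illustrate the constraint from the comments: `compare` (pandas >= 1.1.0) only works on identically-labelled frames, so here is a small sketch on two same-shape frames rather than the question's data:

```python
import pandas as pd

# Two frames with identical shape and labels, differing in one cell
a = pd.DataFrame({'Buyer': ['Carl', 'Mark'], 'Quantity': [18, 3]})
b = pd.DataFrame({'Buyer': ['Carl', 'Mark'], 'Quantity': [18, 5]})

# Returns only the differing cells, labelled 'self' (a) and 'other' (b)
diff = a.compare(b)
print(diff)
```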
diff = set(zip(df2.Buyer, df2.Quantity)) - set(zip(df1.Buyer, df1.Quantity)) 

This is the first solution that came to mind. You can then put the diff set back in a DF for presentation.
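A sketch of that idea end to end, including putting the diff back into a DataFrame for presentation. Note that sets discard both row order and duplicate rows, so this has the same duplicate caveat as the merge answer:

```python
import pandas as pd

df1 = pd.DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'], 'Quantity': [18, 3, 5]})
df2 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'], 'Quantity': [2, 1, 18, 5]})

# Set difference of (Buyer, Quantity) tuples: pairs in df2 but not df1
diff = set(zip(df2.Buyer, df2.Quantity)) - set(zip(df1.Buyer, df1.Quantity))

# Back into a DataFrame (sorted, since sets are unordered)
diff_df = pd.DataFrame(sorted(diff), columns=['Buyer', 'Quantity'])
print(diff_df)
```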

Comments


An important edge case

Consider the following, where the second dataframe contains an additional duplicate entry, ('Carl', 5):

df1 = DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'],
                 'Quantity': [18, 3, 5]})
df2 = DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl', 'Carl'],
                 'Quantity': [2, 1, 18, 5, 5]})

EdChum's answer will give you the following:

merged = df1.merge(df2, indicator=True, how='outer')
print(merged[merged['_merge'] == 'right_only'])

  Buyer  Quantity      _merge
4  Carl         2  right_only
5  Mark         1  right_only

As you can see, the solution ignores the additional duplicate value, which, depending on what you are doing, may be something you want to avoid.

Here is a solution that more likely does what you want:

df1['duplicate_counter'] = df1.groupby(list(df1.columns)).cumcount()
df2['duplicate_counter'] = df2.groupby(list(df2.columns)).cumcount()
merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] == 'right_only']

  Buyer  Quantity  duplicate_counter      _merge
3  Carl         2                  0  right_only
4  Mark         1                  0  right_only
5  Carl         5                  1  right_only

The duplicate counter ensures that every row is unique, which means that duplicate values are not removed. After merging, you can drop the duplicate_counter.
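Runnable end to end on the duplicate-containing data above, with the helper column dropped afterwards as described:

```python
import pandas as pd

df1 = pd.DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'],
                    'Quantity': [18, 3, 5]})
df2 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl', 'Carl'],
                    'Quantity': [2, 1, 18, 5, 5]})

# Number repeated rows 0, 1, 2, ... so each duplicate becomes a distinct row
df1['duplicate_counter'] = df1.groupby(list(df1.columns)).cumcount()
df2['duplicate_counter'] = df2.groupby(list(df2.columns)).cumcount()

merged = df1.merge(df2, indicator=True, how='outer')
right_only = merged[merged['_merge'] == 'right_only']

# Drop the helper columns once the diff is computed
result = right_only.drop(columns=['duplicate_counter', '_merge'])
print(result)
```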

Comments


Try the following if you only care about adding the new Buyers to the other df:

df_delta = df2[df2['Buyer'].apply(lambda x: x not in df1['Buyer'].values)]
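A sketch of this on the question's data. Because it compares only the Buyer column, it keeps ('Mark', 1) but drops ('Carl', 2), since Carl already appears in df1; a vectorised equivalent uses `Series.isin`:

```python
import pandas as pd

df1 = pd.DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'], 'Quantity': [18, 3, 5]})
df2 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'], 'Quantity': [2, 1, 18, 5]})

# Keep only the rows of df2 whose Buyer never appears in df1
df_delta = df2[df2['Buyer'].apply(lambda x: x not in df1['Buyer'].values)]

# Vectorised equivalent of the apply/lambda above
df_delta_isin = df2[~df2['Buyer'].isin(df1['Buyer'])]
print(df_delta)
```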

Comments


There is also datacompy. It can export a string-based comparison report like this:

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0  original        5     6
1       new        4     5

Column Summary
--------------

Number of columns in common: 4
Number of columns in original but not in new: 1
Number of columns in new but not in original: 0

Row Summary
-----------

Matched on: acct_id
Any duplicates on match values: Yes
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in original but not in new: 1
Number of rows in new but not in original: 0

Number of rows with some compared columns unequal: 5
Number of rows with all compared columns equal: 0

Column Comparison
-----------------

Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 7

Columns with Unequal Values or Types
------------------------------------

       Column original dtype new dtype  # Unequal  Max Diff  # Null Diff
0  dollar_amt        float64   float64          1    0.0500            0
1   float_fld        float64   float64          4    0.0005            3
2        name         object    object          2    0.0000            0

Sample Rows with Unequal Values
-------------------------------

       acct_id  dollar_amt (original)  dollar_amt (new)
0  10000001234                 123.45             123.4

       acct_id  float_fld (original)  float_fld (new)
0  10000001234            14530.1555        14530.155
5  10000001238                   NaN          111.000
2  10000001236                   NaN            1.000
1  10000001235                1.0000              NaN

       acct_id  name (original)            name (new)
0  10000001234   George Maharis  George Michael Bluth
3  10000001237       Bob Loblaw         Robert Loblaw

Sample Rows Only in original (First 10 Columns)
-----------------------------------------------

       acct_id  dollar_amt           name  float_fld    date_fld
4  10000001238        1.05  Lucille Bluth        NaN  2017-01-01

Comments
