
I need to compare two dataframes of different sizes row-wise and print out the non-matching rows. Let's take the following two:

df1 = DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'], 'Quantity': [18, 3, 5]})
df2 = DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'], 'Quantity': [2, 1, 18, 5]})

What is the most efficient way to iterate row-wise over df2 and print out the rows not in df1, e.g.

Buyer  Quantity
 Carl         2
 Mark         1

Important: I do not want to have the row

Buyer  Quantity
 Carl         3

included in the diff.

I have already tried "Comparing two dataframes of different length row by row and adding columns for each row with equal value" and "Compare two DataFrames and output their differences side-by-side", but these do not match my problem.


8 Answers


Merge the two dfs with how='outer' and pass indicator=True. This will tell you whether each row is present in both, left only, or right only; you can then filter the merged df:

In [22]:
merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] == 'right_only']

Out[22]:
  Buyer  Quantity      _merge
3  Carl         2  right_only
4  Mark         1  right_only

2 Comments

If a row appears once in one dataframe and twice in the other, this method will not detect that. Ideally, one of the entries should be marked as "right_only" and the other as being in both.
See @TobiasBergkvist's answer below to account for duplicate rows.
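Put together end to end, the merge-indicator approach above runs like this on the question's data (a sketch; column order in the output follows pandas defaults):

```python
import pandas as pd

# Example data from the question
df1 = pd.DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'], 'Quantity': [18, 3, 5]})
df2 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'], 'Quantity': [2, 1, 18, 5]})

# indicator=True adds a '_merge' column saying where each row came from:
# 'left_only', 'right_only', or 'both'
merged = df1.merge(df2, indicator=True, how='outer')

# Keep only rows present in df2 but not df1 (a "right anti join")
right_only = merged[merged['_merge'] == 'right_only']
print(right_only[['Buyer', 'Quantity']])
```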

You may find this the simplest:

df2[~df2.isin(df1)].dropna()

1 Comment

This doesn't work if the DataFrame has existing NAs.
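Beyond the NA caveat above, note that `DataFrame.isin` with a DataFrame argument matches values at the *same* index and column labels, so this approach is positional rather than set-based. A sketch on the question's data showing what it actually returns (note it keeps the fourth row, ('Carl', 5), even though that pair exists in df1, because the labels don't line up):

```python
import pandas as pd

df1 = pd.DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'], 'Quantity': [18, 3, 5]})
df2 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'], 'Quantity': [2, 1, 18, 5]})

# df2.isin(df1) is True only where the cell at the same (row label, column)
# holds the same value in both frames; rows beyond df1's index are all False.
# Masking with ~ turns matching cells into NaN, and dropna() removes any row
# that had at least one matching cell.
result = df2[~df2.isin(df1)].dropna()
print(result)
```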

@EdChum's answer is self-explanatory. But using the not-'both' condition makes more sense: you do not need to care about the order of comparison, and this is what a real diff is supposed to be. For the sake of answering your question:

merged = df1.merge(df2, indicator=True, how='outer')
merged.loc = [merged['_merge'] != 'both']

2 Comments

This is a full anti join; the OP asked for a right anti join.
The second line fails and should be changed to: merged.loc[merged['_merge'] != 'both']
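With the comment's correction applied, the full anti join (rows present in exactly one of the two frames) can be sketched as:

```python
import pandas as pd

df1 = pd.DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'], 'Quantity': [18, 3, 5]})
df2 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'], 'Quantity': [2, 1, 18, 5]})

merged = df1.merge(df2, indicator=True, how='outer')

# Full anti join: every row NOT present in both frames,
# using .loc[...] indexing (not .loc = [...], which fails)
diff = merged.loc[merged['_merge'] != 'both']
print(diff)
```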

As of Pandas 1.1.0, there is pandas.DataFrame.compare:

df1.compare(df2) 

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.compare.html

2 Comments

This doesn't work on unequal indices (i.e., two dataframes with different numbers of rows).
Yep, from the notes in the doc page: "Can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames."
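To illustrate the constraint from the comments: `compare` (pandas >= 1.1.0) only works on identically-labelled frames, so here is a small sketch on two same-shape frames rather than the question's data:

```python
import pandas as pd

# Two frames with identical shape and labels, differing in one cell
a = pd.DataFrame({'Buyer': ['Carl', 'Mark'], 'Quantity': [18, 3]})
b = pd.DataFrame({'Buyer': ['Carl', 'Mark'], 'Quantity': [18, 5]})

# Returns only the differing cells, labelled 'self' (a) and 'other' (b)
diff = a.compare(b)
print(diff)
```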
diff = set(zip(df2.Buyer, df2.Quantity)) - set(zip(df1.Buyer, df1.Quantity)) 

This is the first solution that came to mind. You can then put the diff set back in a DF for presentation.
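A sketch of that idea end to end, including putting the diff back into a DataFrame for presentation. Note that sets discard both row order and duplicate rows, so this has the same duplicate caveat as the merge answer:

```python
import pandas as pd

df1 = pd.DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'], 'Quantity': [18, 3, 5]})
df2 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'], 'Quantity': [2, 1, 18, 5]})

# Set difference of (Buyer, Quantity) tuples: pairs in df2 but not df1
diff = set(zip(df2.Buyer, df2.Quantity)) - set(zip(df1.Buyer, df1.Quantity))

# Back into a DataFrame (sorted, since sets are unordered)
diff_df = pd.DataFrame(sorted(diff), columns=['Buyer', 'Quantity'])
print(diff_df)
```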

Comments


An important edge case

Consider the following, where the second dataframe contains an additional duplicate entry, ('Carl', 5):

df1 = DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'],
                 'Quantity': [18, 3, 5]})
df2 = DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl', 'Carl'],
                 'Quantity': [2, 1, 18, 5, 5]})

EdChum's answer will give you the following:

merged = df1.merge(df2, indicator=True, how='outer')
print(merged[merged['_merge'] == 'right_only'])

  Buyer  Quantity      _merge
4  Carl         2  right_only
5  Mark         1  right_only

As you can see, the solution ignores the additional duplicate value, which, depending on what you are doing, may be something you want to avoid.

Here is a solution that more likely does what you want:

df1['duplicate_counter'] = df1.groupby(list(df1.columns)).cumcount()
df2['duplicate_counter'] = df2.groupby(list(df2.columns)).cumcount()
merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] == 'right_only']

  Buyer  Quantity  duplicate_counter      _merge
3  Carl         2                  0  right_only
4  Mark         1                  0  right_only
5  Carl         5                  1  right_only

The duplicate counter ensures that every row is unique, which means that duplicate values are not removed. After merging, you can drop the duplicate_counter.
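Runnable end to end on the duplicate-containing data above, with the helper column dropped afterwards as described:

```python
import pandas as pd

df1 = pd.DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'],
                    'Quantity': [18, 3, 5]})
df2 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl', 'Carl'],
                    'Quantity': [2, 1, 18, 5, 5]})

# Number repeated rows 0, 1, 2, ... so each duplicate becomes a distinct row
df1['duplicate_counter'] = df1.groupby(list(df1.columns)).cumcount()
df2['duplicate_counter'] = df2.groupby(list(df2.columns)).cumcount()

merged = df1.merge(df2, indicator=True, how='outer')
right_only = merged[merged['_merge'] == 'right_only']

# Drop the helper columns once the diff is computed
result = right_only.drop(columns=['duplicate_counter', '_merge'])
print(result)
```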

Comments


Try the following if you only care about adding the new Buyers to the other df:

df_delta = df2[df2['Buyer'].apply(lambda x: x not in df1['Buyer'].values)]
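A sketch of this on the question's data. Because it compares only the Buyer column, it keeps ('Mark', 1) but drops ('Carl', 2), since Carl already appears in df1; a vectorised equivalent uses `Series.isin`:

```python
import pandas as pd

df1 = pd.DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'], 'Quantity': [18, 3, 5]})
df2 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'], 'Quantity': [2, 1, 18, 5]})

# Keep only the rows of df2 whose Buyer never appears in df1
df_delta = df2[df2['Buyer'].apply(lambda x: x not in df1['Buyer'].values)]

# Vectorised equivalent of the apply/lambda above
df_delta_isin = df2[~df2['Buyer'].isin(df1['Buyer'])]
print(df_delta)
```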

Comments


There is also datacompy. It can export a string-based comparison report like this:

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0  original        5     6
1       new        4     5

Column Summary
--------------

Number of columns in common: 4
Number of columns in original but not in new: 1
Number of columns in new but not in original: 0

Row Summary
-----------

Matched on: acct_id
Any duplicates on match values: Yes
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in original but not in new: 1
Number of rows in new but not in original: 0

Number of rows with some compared columns unequal: 5
Number of rows with all compared columns equal: 0

Column Comparison
-----------------

Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 7

Columns with Unequal Values or Types
------------------------------------

       Column original dtype new dtype  # Unequal  Max Diff  # Null Diff
0  dollar_amt        float64   float64          1    0.0500            0
1   float_fld        float64   float64          4    0.0005            3
2        name         object    object          2    0.0000            0

Sample Rows with Unequal Values
-------------------------------

       acct_id  dollar_amt (original)  dollar_amt (new)
0  10000001234                 123.45             123.4

       acct_id  float_fld (original)  float_fld (new)
0  10000001234            14530.1555        14530.155
5  10000001238                   NaN          111.000
2  10000001236                   NaN            1.000
1  10000001235                1.0000              NaN

       acct_id  name (original)            name (new)
0  10000001234   George Maharis  George Michael Bluth
3  10000001237       Bob Loblaw         Robert Loblaw

Sample Rows Only in original (First 10 Columns)
-----------------------------------------------

       acct_id  dollar_amt           name  float_fld    date_fld
4  10000001238        1.05  Lucille Bluth        NaN  2017-01-01

Comments
