Comparing two pandas dataframes for differences

Question

I've got a script updating 5-10 columns worth of data , but sometimes the start csv will be identical to the end csv so instead of writing an identical csvfile I want it to do nothing...

How can I compare two dataframes to check if they're the same or not?

csvdata = pandas.read_csv('csvfile.csv') csvdata_old = csvdata # ... do stuff with csvdata dataframe if csvdata_old != csvdata: csvdata.to_csv('csvfile.csv', index=False)

Any ideas?

Not an expert at Pandas but shouldn't a normal equality comparison work? — aIKid
– aIKid, Commented Nov 11, 2013 at 23:21
I've had a look at equality but I'm unsure how to use that function, I'm also unable to find anything in search :( — Ryflex
– Ryflex, Commented Nov 11, 2013 at 23:28

Community · Accepted Answer · 2017-05-23 12:25:36Z

90

You also need to be careful to create a copy of the DataFrame, otherwise the csvdata_old will be updated with csvdata (since it points to the same object):

csvdata_old = csvdata.copy()

To check whether they are equal, you can use assert_frame_equal as in this answer:

from pandas.util.testing import assert_frame_equal assert_frame_equal(csvdata, csvdata_old)

You can wrap this in a function with something like:

try: assert_frame_equal(csvdata, csvdata_old) return True except: # appeantly AssertionError doesn't catch all return False

There was discussion of a better way...

edited May 23, 2017 at 12:25

CommunityBot

11 silver badge

answered Nov 12, 2013 at 0:40

Andy Hayden

378k110 gold badges640 silver badges546 bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

Ryflex Over a year ago

For some reason I get the exception: Exception: Can only compare identically-labeled DataFrame objects

TomAugspurger Over a year ago

That gives you the answer then, right? It's saying that the labels (row and column names/values) aren't identical so then the DataFrames can't be identical. Add another except line to Andy's wrapper: except Exception: return False

Andy Hayden Over a year ago

@Hyflex yep, as Tom points out, just remove the AssertionError (before it was only stopping on AssertionError and raising on anything else)...

Ryflex Over a year ago

Okay, I fixed the lables (didn't realise I had to reorder the label order of csvdat_old and I've tried incorporating this into my actual data but it updates even though the data is identical when it's output, is there a way to find/see what exactly is "different"?

Gabi Over a year ago

pandas.util.testing doesn't work anymore, you now should use pandas.testing

|

Tom Chapin · Accepted Answer · 2017-11-02 22:50:57Z

Not sure if this is helpful or not, but I whipped together this quick python method for returning just the differences between two dataframes that both have the same columns and shape.

def get_different_rows(source_df, new_df): """Returns just the rows from the new dataframe that differ from the source dataframe""" merged_df = source_df.merge(new_df, indicator=True, how='outer') changed_rows_df = merged_df[merged_df['_merge'] == 'right_only'] return changed_rows_df.drop('_merge', axis=1)

This is the answer I was actually looking for. Thanks so much
This is awesome! but may I ask how can it not ignore the data types? Like for example if there's a value 4 on the source_df and 4.0 on the new_df it should still return a dataframe since it would mean there's a difference, Thanks!
Thank you for this. And the utility function is really useful !

sobes · Accepted Answer · 2016-03-12 00:31:15Z

24

Not sure if this existed at the time the question was posted, but pandas now has a built-in function to test equality between two dataframes: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.equals.html.

answered Mar 12, 2016 at 0:31

sobes

4063 silver badges7 bronze badges

6 Comments

Ryflex Over a year ago

Nope that never existed, it works just as good. (Better because it's a single line vs 5 lines

jf328 Over a year ago

How does it work on floats? I hope there is a precision parameter.

Nickpick Over a year ago

There has to be something better than that. Just a false is not a satisfactory answer

alpha_989 Over a year ago

Not sure whether this is what most people would want.. DataFrame.equals does a very deep comparison. For example.. I had 2 dataframes whose values are equal.. but it turns out that DataFrames also have some parameters called Axis 1, IntBlock and ObjectBlock. These are defined under pd._data. If these are not the same between the 2 objects.. it will result in False

citynorman Over a year ago

df.eq(df2).all().all() helped overcome the df.equals issue to just compare contents

|

Surya Chhetri · Accepted Answer · 2017-02-15 06:34:21Z

Check using: df_1.equals(df_2) # Returns True or False, details herebelow

In [45]: import numpy as np In [46]: import pandas as pd In [47]: np.random.seed(5) In [48]: df_1= pd.DataFrame(np.random.randn(3,3)) In [49]: df_1 Out[49]: 0 1 2 0 0.441227 -0.330870 2.430771 1 -0.252092 0.109610 1.582481 2 -0.909232 -0.591637 0.187603 In [50]: np.random.seed(5) In [51]: df_2= pd.DataFrame(np.random.randn(3,3)) In [52]: df_2 Out[52]: 0 1 2 0 0.441227 -0.330870 2.430771 1 -0.252092 0.109610 1.582481 2 -0.909232 -0.591637 0.187603 In [53]: df_1.equals(df_2) Out[53]: True In [54]: df_3= pd.DataFrame(np.random.randn(3,3)) In [55]: df_3 Out[55]: 0 1 2 0 -0.329870 -1.192765 -0.204877 1 -0.358829 0.603472 -1.664789 2 -0.700179 1.151391 1.857331 In [56]: df_1.equals(df_3) Out[56]: False

There is one catch: index.names are not compared. See my answer for details: stackoverflow.com/a/43420842/304209

Dennis Golomazov · Accepted Answer · 2017-04-15 00:25:14Z

A more accurate comparison should check for index names separately, because DataFrame.equals does not test for that. All the other properties (index values (single/multiindex), values, columns, dtypes) are checked by it correctly.

df1 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'name']) df1 = df1.set_index('name') df2 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'another_name']) df2 = df2.set_index('another_name') df1.equals(df2) True df1.index.names == df2.index.names False

Note: using index.names instead of index.name makes it work for multi-indexed dataframes as well.

alpha_989 · Accepted Answer · 2018-07-05 03:18:18Z

In my case, I had a weird error, whereby even though the indices, column-names and values were same, the DataFrames didnt match. I tracked it down to the data-types, and it seems pandas can sometimes use different datatypes, resulting in such problems

For example:

param2 = pd.DataFrame({'a': [1]}) param1 = pd.DataFrame({'a': [1], 'b': [2], 'c': [2], 'step': ['alpha']})

if you check param1.dtypes and param2.dtypes, you will find that 'a' is of type object for param1 and is of type int64 for param2. Now, if you do some manipulation using a combination of param1 and param2, other parameters of the dataframe will deviate from the default ones.

So after the final dataframe is generated, even though the actual values that are printed out are same, final_df1.equals(final_df2), may turn out to be not-equal, because those samll parameters like Axis 1, ObjectBlock, IntBlock maynot be the same.

A easy way to get around this and compare the values is to use

final_df1==final_df2.

However, this will do a element by element comparison, so it wont work if you are using it to assert a statement for example in pytest.

TL;DR

What works well is

all(final_df1 == final_df2).

This does a element by element comparison, while neglecting the parameters not important for comparison.

TL;DR2

If your values and indices are same, but final_df1.equals(final_df2) is showing False, you can use final_df1._data and final_df2._data to check the rest of the elements of the dataframes.

Does anyone know why a data frame read in from a CSV would show as different from a data frame read in from an Excel workbook, even when they are identical?

leerssej · Accepted Answer · 2019-09-05 20:41:11Z

To pull out the symmetric differences:

df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)

For example:

df1 = pd.DataFrame({ 'num': [1, 4, 3], 'name': ['a', 'b', 'c'], }) df2 = pd.DataFrame({ 'num': [1, 2, 3], 'name': ['a', 'b', 'd'], })

Will yield:

Note: until the next release of pandas, to avoid the warning about how the sort argument will be set in the future, just add the sort=False argument. As below:

df_diff = pd.concat([df1,df2], sort=False).drop_duplicates(keep=False)

Tristan Forward · Accepted Answer · 2018-04-23 22:27:15Z

This compares the values of two dataframes note the number of row/columns needs to be the same between tables

comparison_array = table.values == expected_table.values print (comparison_array) >>>[[True, True, True] [True, False, True]] if False in comparison_array: print ("Not the same") #Return the position of the False values np.where(comparison_array==False) >>>(array([1]), array([1]))

You could then use this index information to return the value that does not match between tables. Since it's zero indexed, it's referring to the 2nd array in the 2nd position which is correct.

The main benefit here is you can you the index position that is returned by np.where to determine the exact position the table does not match and alert the user to that location.

Nitesh Yaduvanshi · Accepted Answer · 2021-08-10 07:12:54Z

I hope this below code snippet helps you!

import pandas as pd import datacompy df_old_original = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2], [7, 7, 7, 7], [3, 3, 3, 3], [4, 4, 4, 4], [7, 7, 7, 7], [5, 5, 5, 5], [6, 6, 6, 6]], columns=['A', 'B', 'C', 'D'], index=[0, 1, 2, 3, 4, 5, 6, 7], dtype=object) df_new_original = pd.DataFrame([[None, None, None, None], [1, 1, 1, 1], [2, 2, 2, 2], [8, 8, 8, 8], [3, 3, 3, 3], [4, 4, 4, 4], [7, 7, 7, 7], [5, 5, 5, 5], [None, None, None, None]], columns=['A', 'B', 'C', 'D'], index=[0, 1, 2, 3, 4, 5, 6, 7, 8], dtype=object) compare = datacompy.Compare(df_old_original, df_new_original, join_columns=['A', 'B', 'C', 'D'], abs_tol=0, rel_tol=0, df1_name='Old', df2_name='New') changes_in_old_df = compare.df1_unq_rows changes_in_new_df = compare.df2_unq_rows print(changes_in_old_df) print(changes_in_new_df) print(Compare.report())

Aelius · Accepted Answer · 2023-04-03 13:56:37Z

A summary and analysis of some of the answers provided to date

Some of the answers provided fail to detect differences in datasets because they don't consider index values. Let's build two simple datasets and analyze some of the proposed approaches:

import pandas as pd df_1 = pd.DataFrame({'a': [1,1], 'b': [2,1]}) df_1

	a	b
0	1	2
1	1	1

df_2 = pd.DataFrame({'a': [1,1], 'b': [1,2]}) df_2

	a	b
0	1	1
1	1	2

These two dataframes are different because the first row in the df_1 is clearly not the same as the first row in the df_2. If we apply the method proposed by Tom Chapin, it will not detect the difference because it only compares the combination of values for each column in each row, regardless of the row position (index). So in this example, since the combination (1,2) for columns a and b is already present in df_1, it's considered a duplicate when merging with df_2, even if it shows up at index 1 instead of 0. For the same reason the method proposed by leersej fails as well (you can try to apply these methods and see the results by yourself). The .equals method proposed by Surya is able to detect the difference, returning False in the comparison, but as pointed out by Dennis Golomazov, if you're interested in also checking the equality of indexes, you should add another check, as showed in his answer. In the end if you want to ensure that dataframes have the same column values at the same indexes, I think that the best method is to perform a direct match, as suggested by alpha_989, however the usage of the all() method to check the final equality between the two fails, because it iterates over column names and not the boolean outcomes resulting from the comparison. all(df_1 == df_2) in fact, returns wrongly True. At the end of the day, to safely check if two dataframes are equal (using only the pandas library), you should:

# 1 compare the two datasets >>> comparison = (df_1 == df_2) # 2 - see what columns are matching and what aren't >>> comparison.all() a True b False dtype: bool # 3 - compare if all columns are matching >>> comparison.all().all() # equivalent all(comparison.all()) False # BONUS - row indexes that aren't matching (more info here: https://stackoverflow.com/a/52173171/11764049) >>> comparison.T.all()[~comparison.T.all()].index Int64Index([0, 1], dtype='int64')

TL;DR

In order to check the equality of two dataframes, considering also row indexes you can use the .equals method as here, or:

comparison = (df_1 == df_2) equality = comparison.all().all() not_matching_row_idxs = comparison.T.all()[~comparison.T.all()].index

This approach also allows you to find the row indexes that are not matching.

Note: This method only works with dataframes having same index values.

Collectives™ on Stack Overflow

Comparing two pandas dataframes for differences

10 Answers 10

7 Comments

3 Comments

6 Comments

1 Comment

Comments

TL;DR

TL;DR2

1 Comment

Comments

1 Comment

Comments

A summary and analysis of some of the answers provided to date

TL;DR

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

10 Answers 10

7 Comments

3 Comments

6 Comments

1 Comment

Comments

TL;DR

TL;DR2

1 Comment

Comments

1 Comment

Comments

A summary and analysis of some of the answers provided to date

TL;DR

Comments

Linked

Related