2

i have two dataframes df1 and df2. Same index and same column names. how to construct a dataframe which shows difference, but only rows which have at least one different cell? if row has different cells, but some are same, keep same cells intact.

example:

df1=pd.DataFrame({1:['a','a'],2:['c','c']}) df2=pd.DataFrame({1:['a','a'],2:['d','c']}) 

output needed:

pd.DataFrame({1:['a'],2:['c->d']},index=[0]) 

output in this example should be one row dataframe, not dataframe including same rows

NB: output should only contain full rows which has at least one difference in cell

i'd like an efficient solution without iterating by rows , and without creating special-strings in DataFrame

2 Answers 2

1

You can use this brilliant solution:

def report_diff(x): return x[0] if x[0] == x[1] else '{}->{}'.format(*x) In [70]: pd.Panel(dict(df1=df1,df2=df2)).apply(report_diff, axis=0) Out[70]: 1 2 0 a c->d 1 a c 

For bit more complex DataFrames:

In [73]: df1 Out[73]: A B C 0 a c 1 1 a c 2 2 1 2 3 In [74]: df2 Out[74]: A B C 0 a d 1 1 a c 2 2 1 2 4 In [75]: pd.Panel(dict(df1=df1,df2=df2)).apply(report_diff, axis=0) Out[75]: A B C 0 a c->d 1 1 a c 2 2 1 2 3->4 

UPDATE: showing only changed/different rows:

In [54]: mask = df1.ne(df2).any(1) In [55]: mask Out[55]: 0 True 1 False 2 True dtype: bool In [56]: pd.Panel(dict(df1=df1[mask],df2=df2[mask])).apply(report_diff, axis=0) Out[56]: A B C 0 a c->d 1 2 1 2 3->4 
Sign up to request clarification or add additional context in comments.

3 Comments

I like it. Much more pandastic
yes i like report_diff solution , but i'd like my output dataframe without same rows. i edited the question to highlight it. dataframe can be 100MB and can contain any string or number
@alexprice, please see UPDATE
1

How about a good ole list comprehension on the flattened contents...

import pandas as pd import numpy as np df1=pd.DataFrame({1:['a','a'],2:['c','c']}) df2=pd.DataFrame({1:['a','a'],2:['d','c']}) rows_different_mask = (df1 != df2).any(axis=1) pairs = zip(df1.values.reshape(1, -1)[0], df2.values.reshape(1, -1)[0]) new_elems = ["%s->%s" %(old, new) if (old != new) else new for old, new in pairs] df3 = pd.DataFrame(np.reshape(new_elems, df1.values.shape)) print df3 0 1 0 a c->d 1 a c 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.