Show complete rows highliting difference between dataframes df1 , df2 , but only when difference in row's cell exists

Question

i have two dataframes df1 and df2. Same index and same column names. how to construct a dataframe which shows difference, but only rows which have at least one different cell? if row has different cells, but some are same, keep same cells intact.

example:

df1=pd.DataFrame({1:['a','a'],2:['c','c']}) df2=pd.DataFrame({1:['a','a'],2:['d','c']})

output needed:

pd.DataFrame({1:['a'],2:['c->d']},index=[0])

output in this example should be one row dataframe, not dataframe including same rows

NB: output should only contain full rows which has at least one difference in cell

i'd like an efficient solution without iterating by rows , and without creating special-strings in DataFrame

Community · Accepted Answer · 2017-05-23 10:29:54Z

You can use this brilliant solution:

def report_diff(x): return x[0] if x[0] == x[1] else '{}->{}'.format(*x) In [70]: pd.Panel(dict(df1=df1,df2=df2)).apply(report_diff, axis=0) Out[70]: 1 2 0 a c->d 1 a c

For bit more complex DataFrames:

In [73]: df1 Out[73]: A B C 0 a c 1 1 a c 2 2 1 2 3 In [74]: df2 Out[74]: A B C 0 a d 1 1 a c 2 2 1 2 4 In [75]: pd.Panel(dict(df1=df1,df2=df2)).apply(report_diff, axis=0) Out[75]: A B C 0 a c->d 1 1 a c 2 2 1 2 3->4

UPDATE: showing only changed/different rows:

In [54]: mask = df1.ne(df2).any(1) In [55]: mask Out[55]: 0 True 1 False 2 True dtype: bool In [56]: pd.Panel(dict(df1=df1[mask],df2=df2[mask])).apply(report_diff, axis=0) Out[56]: A B C 0 a c->d 1 2 1 2 3->4

yes i like report_diff solution , but i'd like my output dataframe without same rows. i edited the question to highlight it. dataframe can be 100MB and can contain any string or number

dataflow · Accepted Answer · 2017-03-02 22:54:39Z

How about a good ole list comprehension on the flattened contents...

import pandas as pd import numpy as np df1=pd.DataFrame({1:['a','a'],2:['c','c']}) df2=pd.DataFrame({1:['a','a'],2:['d','c']}) rows_different_mask = (df1 != df2).any(axis=1) pairs = zip(df1.values.reshape(1, -1)[0], df2.values.reshape(1, -1)[0]) new_elems = ["%s->%s" %(old, new) if (old != new) else new for old, new in pairs] df3 = pd.DataFrame(np.reshape(new_elems, df1.values.shape)) print df3 0 1 0 a c->d 1 a c

Collectives™ on Stack Overflow

Show complete rows highliting difference between dataframes df1 , df2 , but only when difference in row's cell exists

2 Answers 2

3 Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

Comments

Linked

Related