Comparing two data frames with different columns and getting the differences

Question

I have a similar question as here Comparing two data frames and getting the differences But columns in df1 is a subset of columns in df2.

df1: Date Fruit Num Color 2013-11-24 Banana 22.1 Yellow 2013-11-24 Orange 8.6 Orange 2013-11-24 Apple 7.6 Green 2013-11-24 Celery 10.2 Green df2: Date Fruit Num Color A 2013-11-24 Banana 22.1 Yellow 1 2013-11-24 Orange 8.6 Orange 2 2013-11-24 Apple 7.6 Green 3 2013-11-24 Celery 10.2 Green 4 2013-11-25 Apple 22.1 Red 5 2013-11-25 Orange 8.6 Orange 6

I would like to get the difference the two df by comparing those columns in common only. So the result I expect to get is

 Date Fruit Num Color A 4 2013-11-25 Apple 22.1 Red 5 5 2013-11-25 Orange 8.6 Orange 6

Is there a way to do so? Any help is appreciated.

Timus · Accepted Answer · 2022-09-13 10:23:53Z

If you don't have duplicates in df1 or df2[df1.columns] you could try to use .drop_duplicates with keep=False:

res = pd.concat([df1, df2[df1.columns]]).drop_duplicates(keep=False)

Result for your sample dataframes:

 Date Fruit Num Color 4 2013-11-25 Apple 22.1 Red 5 2013-11-25 Orange 8.6 Orange

PS: As far as I can see the other answer also covers only the non-duplicate case.

If you do have duplicates in df1 or df2[df1.columns] and want to preserve them in the result you could try:

res = pd.concat( [df1.drop_duplicates(), df2[df1.columns].drop_duplicates()] ).drop_duplicates(keep=False).merge( pd.concat([df1, df2[df1.columns]]), on=list(df1.columns), how="inner" )

First drop the duplicates in the originals, and then the duplicates on the concatenated dataframes: This will give you the rows that are not in both.
Then fetch from the originals the right amount of rows by an inner merge with the result of the first step.

For example, if df2 would look like

Date Fruit Num Color A 2013-11-24 Banana 22.1 Yellow 1 2013-11-24 Orange 8.6 Orange 2 2013-11-24 Apple 7.6 Green 3 2013-11-24 Celery 10.2 Green 4 2013-11-25 Apple 22.1 Red 5 2013-11-25 Apple 22.1 Red 5 2013-11-25 Orange 8.6 Orange 6

then the result would be

 Date Fruit Num Color 0 2013-11-25 Apple 22.1 Red 1 2013-11-25 Apple 22.1 Red 2 2013-11-25 Orange 8.6 Orange

If you don't want to preserve the duplicates in the result then just do

res = pd.concat( [df1.drop_duplicates(), df2[df1.columns].drop_duplicates()] ).drop_duplicates(keep=False)

metc · Accepted Answer · 2022-09-13 08:01:41Z

First you get the column names of df1

df1_columns = df1.columns # ["Date", "Fruit", "Num", "Color"]

Now you create a new df2 dataframe with only df1 columns

df2_filtered = df2[df1_columns]

And now you can apply the solution from this other question.

#concatenate both dataframes df = pd.concat([df1, df2_filtered]) df = df.reset_index(drop=True) #group by df_gpby = df.groupby(list(df.columns)) # get index of unique records idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #filter df.reindex(idx)

Hope it helps!

Collectives™ on Stack Overflow

Comparing two data frames with different columns and getting the differences

2 Answers 2

Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Linked

Related