88

I want to perform a join/merge/append operation on a dataframe with datetime index.

Let's say I have df1 and I want to add df2 to it. df2 can have fewer or more columns, and overlapping indexes. For all rows where the indexes match, if df2 has the same column as df1, I want the values of df1 be overwritten with those from df2.

How can I obtain the desired result?

0

4 Answers 4

78

How about: df2.combine_first(df1)?

In [33]: df2 Out[33]: A B C D 2000-01-03 0.638998 1.277361 0.193649 0.345063 2000-01-04 -0.816756 -1.711666 -1.155077 -0.678726 2000-01-05 0.435507 -0.025162 -1.112890 0.324111 2000-01-06 -0.210756 -1.027164 0.036664 0.884715 2000-01-07 -0.821631 -0.700394 -0.706505 1.193341 2000-01-10 1.015447 -0.909930 0.027548 0.258471 2000-01-11 -0.497239 -0.979071 -0.461560 0.447598 In [34]: df1 Out[34]: A B C 2000-01-03 2.288863 0.188175 -0.040928 2000-01-04 0.159107 -0.666861 -0.551628 2000-01-05 -0.356838 -0.231036 -1.211446 2000-01-06 -0.866475 1.113018 -0.001483 2000-01-07 0.303269 0.021034 0.471715 2000-01-10 1.149815 0.686696 -1.230991 2000-01-11 -1.296118 -0.172950 -0.603887 2000-01-12 -1.034574 -0.523238 0.626968 2000-01-13 -0.193280 1.857499 -0.046383 2000-01-14 -1.043492 -0.820525 0.868685 In [35]: df2.comb df2.combine df2.combineAdd df2.combine_first df2.combineMult In [35]: df2.combine_first(df1) Out[35]: A B C D 2000-01-03 0.638998 1.277361 0.193649 0.345063 2000-01-04 -0.816756 -1.711666 -1.155077 -0.678726 2000-01-05 0.435507 -0.025162 -1.112890 0.324111 2000-01-06 -0.210756 -1.027164 0.036664 0.884715 2000-01-07 -0.821631 -0.700394 -0.706505 1.193341 2000-01-10 1.015447 -0.909930 0.027548 0.258471 2000-01-11 -0.497239 -0.979071 -0.461560 0.447598 2000-01-12 -1.034574 -0.523238 0.626968 NaN 2000-01-13 -0.193280 1.857499 -0.046383 NaN 2000-01-14 -1.043492 -0.820525 0.868685 NaN 

Note that it takes the values from df1 for indices that do not overlap with df2. If this doesn't do exactly what you want I would be willing to improve this function / add options to it.

Sign up to request clarification or add additional context in comments.

Comments

56

For a merge like this, the update method of a DataFrame is useful.

Taking the examples from the documentation:

import pandas as pd import numpy as np df1 = pd.DataFrame([[np.nan, 3., 5.], [-4.6, 2.1, np.nan], [np.nan, 7., np.nan]]) df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5., 1.6, 4]], index=[1, 2]) 

Data before the update:

>>> df1 0 1 2 0 NaN 3.0 5.0 1 -4.6 2.1 NaN 2 NaN 7.0 NaN >>> >>> df2 0 1 2 1 -42.6 NaN -8.2 2 -5.0 1.6 4.0 

Let's update df1 with data from df2:

df1.update(df2) 

Data after the update:

>>> df1 0 1 2 0 NaN 3.0 5.0 1 -42.6 2.1 -8.2 2 -5.0 1.6 4.0 

Remarks:

  • It's important to notice that this is an operation "in place", modifying the DataFrame that calls update.
  • Also note that non NaN values in df1 are not overwritten with NaN values in df2

3 Comments

This is more intuitive than combine_first because it acts exactly as the update method we know from dicts.
Note, that DataFrame.update() is a bit different than dict because it doesn't add new index items the same way dicts add new keys. Example: gist.github.com/flutefreak7/b4eb6a93565c375d79b791c2bbd672b1
you should use update only if you are happy to exclude indexes unique to df2 Use combine_first if you want to include them
1

I've written a function, which should handle all cases (different lengths, nan values) and could be used inside an appending loop. However, it will prefer first df if data is conflicting.

def real_merge(df1,df2,on): if len(df1)==0: return df2 elif len(df2)==0: return df1 else: df3=pd.merge(df1,df2,on=on,how='outer') #now we have a mess to fix cols=[x[:-2] for x in df3.columns if x.endswith('_x')] for i_col in cols: df3.loc[:,i_col+'_x']=df3[i_col+'_x'].combine_first(df3[i_col+'_y']) df3.rename(columns={i_col+'_x':i_col},inplace=True) df3.drop(columns=[i_col+'_y'],inplace=True) return df3 

Why is there no native pandas function for that? merge/update/cobmine_first are not sufficient for such a task which is requested so often on SE.

Comments

0

I did not find a satisfactory answer for this task yet where one does not need to pre-specify which columns should be used. I am doing it by combining merge, rename, and drop. df_some_wrong is a dataframe with important new columns. However, it has some old columns that have wrong values and should be overwritten with the correct values contained in df_correct. Only the columns with wrong values in df_some_wrong are shared between df_some_wrong and df_correct.

wrong_suffix = "_wrong" correct_suffix = "_correct" # merge, rename shared columns with suffixes merged_df = df_some_wrong.merge( df_correct. left_index=True, right_index=True, suffixes=(wrong_suffix, correct_suffix), ) # find all the renamed columns from df_correct rename_and_drop_cols = [ col for col in merged_df.columns if col.endswith(correct_suffix) ] # dict that describes how correct columns should be renamed # to old column name rename_cols = { key: key.replace(correct_suffix, "") for key in rename_and_drop_cols } # list describing the wrong columns that can be dropped drop = [ col.replace(correct_suffix, wrong_suffix) for col in rename_and_drop_cols ] # finally rename and drop merged_df = merged_df.rename(rename, axis=1).drop(drop, axis=1) 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.