Pandas - Merge two dataframes with different number of rows

Question

I have the following two dataframes:

df:

                value
    period
    2000-01-01    100
    2000-04-01    200
    2000-07-01    300
    2000-10-01    400
    2001-01-01    500

df1:

                value
    period
    2000-07-01    350
    2000-10-01    450
    2001-01-01    550
    2001-04-01    600
    2001-07-01    700

This is the desired output:

df:

                value
    period
    2000-01-01    100
    2000-04-01    200
    2000-07-01    350
    2000-10-01    450
    2001-01-01    550
    2001-04-01    600
    2001-07-01    700

I have called set_index(['period']) on both dataframes. I also tried a few things, including concat and a where statement after creating a new column, but nothing works as expected. My first dataframe is the primary one. The second is a kind of update: it should replace the corresponding values in the first one and, at the same time, add any new records.

How can I do this?

It looks like a simple concatenate. Can you elaborate on "nothing works as expected"? – Andrew L, commented May 8, 2017 at 20:49
@AlIvon Feel free to up vote the accepted answer and any others you found useful. – piRSquared, commented May 8, 2017 at 21:43

jezrael · Accepted Answer · 2017-05-08 21:13:36Z

You can use combine_first. If the dtype of either index is object, convert it with to_datetime first. This works nicely if df1.index is always in df.index:

    print (df.index.dtype)
    object

    print (df1.index.dtype)
    object

    df.index = pd.to_datetime(df.index)
    df1.index = pd.to_datetime(df1.index)

    df = df1.combine_first(df)
    #if necessary int columns
    #df = df1.combine_first(df).astype(int)
    print (df)
                value
    period
    2000-01-01  100.0
    2000-04-01  200.0
    2000-07-01  350.0
    2000-10-01  450.0
    2001-01-01  550.0
    2001-04-01  600.0
    2001-07-01  700.0

If not, it is necessary to filter by the intersection first:

df = df1.loc[df1.index.intersection(df.index)].combine_first(df)
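As a self-contained sketch of the combine_first approach, the snippet below rebuilds the two frames from the values printed in the question. The constructor calls and the name result are assumptions for illustration; only the data come from the question.

```python
import pandas as pd

# Rebuild the question's frames (construction code is assumed;
# the question only shows the printed values).
df = pd.DataFrame(
    {"value": [100, 200, 300, 400, 500]},
    index=pd.DatetimeIndex(
        ["2000-01-01", "2000-04-01", "2000-07-01", "2000-10-01", "2001-01-01"],
        name="period",
    ),
)
df1 = pd.DataFrame(
    {"value": [350, 450, 550, 600, 700]},
    index=pd.DatetimeIndex(
        ["2000-07-01", "2000-10-01", "2001-01-01", "2001-04-01", "2001-07-01"],
        name="period",
    ),
)

# Values from df1 take priority; rows that exist only in df fill the gaps.
result = df1.combine_first(df)

# Aligning the two indices can upcast the column to float,
# so cast back to int when whole numbers are expected.
result = result.astype(int)
print(result)
```

Calling combine_first on df1 rather than on df is what makes the update frame win wherever both frames share a period.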

Another solution with numpy.setdiff1d and concat:

    df = pd.concat([df.loc[np.setdiff1d(df.index, df1.index)], df1])
    print (df)
                value
    period
    2000-01-01    100
    2000-04-01    200
    2000-07-01    350
    2000-10-01    450
    2001-01-01    550
    2001-04-01    600
    2001-07-01    700

MaxU - stand with Ukraine · Accepted Answer · 2017-05-08 21:01:35Z

Is that what you want?

    In [151]: pd.concat([df1, df.loc[df.index.difference(df1.index)]]).sort_index()
    Out[151]:
                value
    period
    2000-01-01    100
    2000-04-01    200
    2000-07-01    350
    2000-10-01    450
    2001-01-01    550
    2001-04-01    600
    2001-07-01    700

PS: make sure that both indices have the same dtype. It is better to convert them to datetime dtype using pd.to_datetime().

TypeError: unorderable types: datetime.date() > str(). When I remove .sort_index(), the last row, 2001-07-01, is missing.
@AlIvon, one of your indices has object dtype, hence this error
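The dtype mismatch discussed in these comments can be sketched as follows. The setup is hypothetical: it assumes one index holds strings and the other holds datetime.date objects, which is one way to end up with the error quoted above. Converting both with pd.to_datetime before combining avoids the comparison between incompatible types.

```python
import datetime

import pandas as pd

# Hypothetical setup: df is indexed by strings, df1 by datetime.date
# objects, so neither frame has a real datetime index.
df = pd.DataFrame(
    {"value": [100, 200, 300, 400, 500]},
    index=pd.Index(
        ["2000-01-01", "2000-04-01", "2000-07-01", "2000-10-01", "2001-01-01"],
        name="period",
    ),
)
df1 = pd.DataFrame(
    {"value": [350, 450, 550, 600, 700]},
    index=pd.Index(
        [
            datetime.date(2000, 7, 1),
            datetime.date(2000, 10, 1),
            datetime.date(2001, 1, 1),
            datetime.date(2001, 4, 1),
            datetime.date(2001, 7, 1),
        ],
        name="period",
    ),
)

# Convert both indices to datetime so labels compare and sort consistently.
df.index = pd.to_datetime(df.index)
df1.index = pd.to_datetime(df1.index)

# Keep every df1 row, add the df rows whose period is absent from df1,
# then restore chronological order.
only_in_df = df.index.difference(df1.index)
result = pd.concat([df1, df.loc[only_in_df]]).sort_index()
print(result)
```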

piRSquared · Accepted Answer · 2017-05-08 21:43:13Z

Another option with append and index.duplicated:

    # DataFrame.append was removed in pandas 2.0; use pd.concat([df1, df]) there
    d1 = df1.append(df)
    d1[~d1.index.duplicated()]

                value
    period
    2000-07-01    350
    2000-10-01    450
    2001-01-01    550
    2001-04-01    600
    2001-07-01    700
    2000-01-01    100
    2000-04-01    200
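Since DataFrame.append is no longer available in pandas 2.0 and later, the same idea can be sketched with pd.concat. The frame construction, the names stacked and result, and the final sort_index call are assumptions added for illustration.

```python
import pandas as pd

# Rebuild the question's frames (construction code is assumed).
df = pd.DataFrame(
    {"value": [100, 200, 300, 400, 500]},
    index=pd.DatetimeIndex(
        ["2000-01-01", "2000-04-01", "2000-07-01", "2000-10-01", "2001-01-01"],
        name="period",
    ),
)
df1 = pd.DataFrame(
    {"value": [350, 450, 550, 600, 700]},
    index=pd.DatetimeIndex(
        ["2000-07-01", "2000-10-01", "2001-01-01", "2001-04-01", "2001-07-01"],
        name="period",
    ),
)

# df1 goes first so that its rows win when an index label repeats.
stacked = pd.concat([df1, df])

# duplicated() flags the second and later occurrences of each label,
# so negating the mask keeps only the first one, i.e. the df1 row.
result = stacked[~stacked.index.duplicated()]

# Optional: restore chronological order.
result = result.sort_index()
print(result)
```

The order of the frames passed to concat matters here: with df first, the old values would be kept instead of the updates.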

Mr.Pacman · Accepted Answer · 2017-05-08 22:22:13Z

I used the pd.concat() function to concatenate the data frames, then dropped the duplicates to get the results.

    df_con = pd.concat([df, df1])
    df_con.drop_duplicates(subset="period", keep="last", inplace=True)
    print(df_con)

           period  value
    0  2000-01-01    100
    1  2000-04-01    200
    0  2000-07-01    350
    1  2000-10-01    450
    2  2001-01-01    550
    3  2001-04-01    600
    4  2001-07-01    700

To set "period" back as the index, just call set_index:

    print(df_con.set_index("period"))

                value
    period
    2000-01-01    100
    2000-04-01    200
    2000-07-01    350
    2000-10-01    450
    2001-01-01    550
    2001-04-01    600
    2001-07-01    700
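This answer treats "period" as a regular column, whereas the question says it was already set as the index. A minimal sketch of the same idea starting from indexed frames could look like this; the frame construction and the names combined and result are assumptions.

```python
import pandas as pd

# Rebuild the question's frames with "period" as the index
# (construction code is assumed).
df = pd.DataFrame(
    {"value": [100, 200, 300, 400, 500]},
    index=pd.DatetimeIndex(
        ["2000-01-01", "2000-04-01", "2000-07-01", "2000-10-01", "2001-01-01"],
        name="period",
    ),
)
df1 = pd.DataFrame(
    {"value": [350, 450, 550, 600, 700]},
    index=pd.DatetimeIndex(
        ["2000-07-01", "2000-10-01", "2001-01-01", "2001-04-01", "2001-07-01"],
        name="period",
    ),
)

# Move "period" into a column so drop_duplicates can use it as a subset.
combined = pd.concat([df.reset_index(), df1.reset_index()], ignore_index=True)

# keep="last" retains the df1 row because df1 was concatenated second.
combined = combined.drop_duplicates(subset="period", keep="last")

# Put "period" back as the index.
result = combined.set_index("period").sort_index()
print(result)
```

Reassigning the result instead of passing inplace=True keeps each step explicit, which is a stylistic choice rather than a requirement.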
