Concatenate pandas dataframes with varying rows per index

Question

I have two dataframes df1 and df2 with key as index.

    import pandas as pd

    dict_1 = {'key': [1, 1, 1, 2, 2, 3], 'col1': ['a1', 'b1', 'c1', 'd1', 'e1', 'f1']}
    df1 = pd.DataFrame(dict_1).set_index('key')
    dict_2 = {'key': [1, 1, 2], 'col2': ['a2', 'b2', 'c2']}
    df2 = pd.DataFrame(dict_2).set_index('key')

df1:

        col1
    key
    1     a1
    1     b1
    1     c1
    2     d1
    2     e1
    3     f1

df2:

        col2
    key
    1     a2
    1     b2
    2     c2

Note that the number of rows per index value differs between the two dataframes. I want to concatenate them so that I get the following dataframe (say df3).

df3:

        col1 col2
    key
    1     a1   a2
    1     b1   b2
    2     d1   c2

i.e. concatenate the two columns so that, for each index value, the new dataframe has the smaller of the two row counts (from df1 and df2).

I tried

    pd.concat([df1, df2], axis=1)

but I get the following error:

    ValueError: Shape of passed values is (2,17), indices imply (2,7)

My question: How can I concatenate df1 and df2 to get df3? Should I use DataFrame.merge instead? If so, how?

ptrj · Accepted Answer · 2016-05-14 03:10:12Z

Merge/join alone will get you a lot of (hard to get rid of) duplicates. But a little trick will help:

    df1['count1'] = 1
    df1['count1'] = df1['count1'].groupby(df1.index).cumsum()
    df1
    Out[198]:
        col1  count1
    key
    1     a1       1
    1     b1       2
    1     c1       3
    2     d1       1
    2     e1       2
    3     f1       1

The same thing for df2:

    df2['count2'] = 1
    df2['count2'] = df2['count2'].groupby(df2.index).cumsum()

And finally:

    df_aligned = df1.reset_index().merge(df2.reset_index(),
                                         left_on=['key', 'count1'],
                                         right_on=['key', 'count2'])
    df_aligned
    Out[199]:
       key col1  count1 col2  count2
    0    1   a1       1   a2       1
    1    1   b1       2   b2       2
    2    2   d1       1   c2       1

Now you can restore the index with set_index('key') and drop the helper columns count1 and count2, which are no longer needed.
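For reference, the steps above can be condensed into one helper. This is only a sketch of the same idea, not the answer's exact code: it swaps the manual `cumsum` counter for `groupby(...).cumcount()`, which numbers rows within each key starting at 0. The function name `align_by_position` and the temporary column `_pos` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 1, 1, 2, 2, 3],
                    'col1': ['a1', 'b1', 'c1', 'd1', 'e1', 'f1']}).set_index('key')
df2 = pd.DataFrame({'key': [1, 1, 2],
                    'col2': ['a2', 'b2', 'c2']}).set_index('key')


def align_by_position(left, right):
    """Pair the n-th row of each key in `left` with the n-th row of that key in `right`.

    Hypothetical helper; assumes both frames share the same named index.
    """
    key = left.index.name
    l = left.reset_index()
    r = right.reset_index()
    # Position of each row within its key group: 0, 1, 2, ...
    l['_pos'] = l.groupby(key).cumcount()
    r['_pos'] = r.groupby(key).cumcount()
    # The inner merge keeps only (key, position) pairs present in both frames,
    # so each key ends up with the smaller of the two row counts.
    merged = l.merge(r, on=[key, '_pos'], how='inner')
    return merged.drop(columns='_pos').set_index(key)


df3 = align_by_position(df1, df2)
print(df3)
```

With the sample data this should give the three rows asked for in the question (a1/a2, b1/b2 and d1/c2). Key 3 disappears because it has no rows in df2.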

Thanks for the 'little trick'. That's the missing link that I couldn't comprehend. I accept this as the answer.

DataSwede · Accepted Answer · 2016-05-13 23:00:39Z

The main reason you can't line the two up the way you want is that your keys are duplicated. How are you going to line up the a1 value in df1 with the a2 value in df2 when a1, a2, b1, b2, and c1 all have the same key?

Using merge is what you'll want if you can resolve the key issues:

    df3 = df1.merge(df2, left_index=True, right_index=True, how='inner')

You can use inner, outer, left or right for how.
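To make the duplicate-key problem concrete, here is a small sketch using the question's sample data (the variable name `plain` is just for illustration). A plain index merge pairs every df1 row with every df2 row that shares its key, so the result is larger than either input.

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 1, 1, 2, 2, 3],
                    'col1': ['a1', 'b1', 'c1', 'd1', 'e1', 'f1']}).set_index('key')
df2 = pd.DataFrame({'key': [1, 1, 2],
                    'col2': ['a2', 'b2', 'c2']}).set_index('key')

# Inner merge on the index: rows are matched on key alone,
# so duplicated keys produce every left/right combination.
plain = df1.merge(df2, left_index=True, right_index=True, how='inner')

# key 1: 3 rows x 2 rows = 6, key 2: 2 rows x 1 row = 2, key 3: no match
print(len(plain))  # 8
```

This is why the keys need some extra disambiguating column (such as a per-key row counter) before merge can produce a one-to-one pairing.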

Duplicate keys are part of the dataframe. Here is the larger context: 'id1' is the index of a buffer id (say, 0 to 255), and 'col1' and 'col2' are timestamps when the buffer is acquired and released. Buffers are acquired and released over a large time window, and ptrj's answer helped me reorder the dataframe so that I can calculate how long each buffer is in use.
