Combine dataframes on columns containing repeated values

Question

I am trying to combine two dataframes which both contain a column of repeated values but not the same number of repeats.

import pandas as pd df1 = pd.DataFrame({'col1':[1, 1, 2, 2, 3, 3, 3], 'col2':[1.1, 1.3, 2.1, 2.3, 3.1, 3.3, 3.5]}) df2 = pd.DataFrame({'col1':[1, 2, 2, 3, 3, 3], 'col2':[1.2, 2.2, 2.4, 3.2, 3.4, 3.6]}) df1 col1 col2 0 1 1.1 1 1 1.3 2 2 2.1 3 2 2.3 4 3 3.1 5 3 3.3 6 3 3.5 df2 col1 col2 0 1 1.2 1 2 2.2 2 2 2.4 3 3 3.2 4 3 3.4 5 3 3.6

The desired output would be for example:

desired_result = pd.DataFrame({'col1': [1, 1, 2, 2, 3, 3, 3], 'col2_x':[1.1, 1.3, 2.1, 2.3, 3.1, 3.3, 3.5], 'col2_y':[1.2, 'NaN' , 2.2, 2.4, 3.2, 3.4, 3.6]}) desired_result col1 col2_x col2_y 0 1 1.1 1.2 1 1 1.3 NaN 2 2 2.1 2.2 3 2 2.3 2.4 4 3 3.1 3.2 5 3 3.3 3.4 6 3 3.5 3.6

The problem is the ambiguity in how to combine the two dataframes on col1 which contains repeated values and a direct matching is not possible (and also not necessary).

Ami Tavory · Accepted Answer · 2016-08-11 09:59:22Z

You can groupby and horizontal-concat each of the groups. After that, it's just some column/index manipulation:

In [75]: merged = df1.groupby(df1.col1).apply(lambda g: pd.concat([g[['col2']].reset_index(), df2[['col2']][df2.col1 == g.col1.values[0]].reset_index()], axis=1)) In [76]: merged.columns = ['_', 'col2_x', '__', 'col2_y'] In [77]: merged.reset_index()[['col1', 'col2_x', 'col2_y']] Out[77]: col1 col2_x col2_y 0 1 1.1 1.2 1 1 1.3 NaN 2 2 2.1 2.2 3 2 2.3 2.4 4 3 3.1 3.2 5 3 3.3 3.4 6 3 3.5 3.6

Thank you very much @ami-tavory for you help. The first step with groupby and concat is great. However, the removal of the NaN's removes some of the data. In the case of col1==2 we have 3 numbers and 1 NaN in both col_2x and col_2y. With the current method to remove NaNs that simply drops all rows containing a NaN we loose some data.
@adabsurdum I forgot to reset_index when concating. See update.

Collectives™ on Stack Overflow

Combine dataframes on columns containing repeated values

1 Answer 1

2 Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Related