Merging two DataFrames without losing information

Question

I have two DataFrames like these:

df1 =

    sent  token  token2
    0     a      b
    0     a      c
    0     b      d
    1     g      h
    1     h      k
    1     h      i
    1     g      i
    1     g      k

df2 =

    sent  token  token2  rel
    0     a      b       A
    1     g      h       B
    1     k      g       C

Now I want to merge those two DataFrames into one which should look like this:

df_new =

    sent  token  token2  rel
    0     a      b       A
    0     a      c       NaN
    0     b      d       NaN
    1     g      h       B
    1     h      k       NaN
    1     h      i       NaN
    1     g      i       NaN
    1     g      k       C

However, when I merge the DataFrames like this:

    df_new = df1.merge(df2, on=["sent","token","token2"], how="left")

I get the output I want, except that the last value in ["rel"] is wrong:

df_new =

    sent  token  token2  rel
    0     a      b       A
    0     a      c       NaN
    0     b      d       NaN
    1     g      h       B
    1     h      k       NaN
    1     h      i       NaN
    1     g      i       NaN
    1     g      k       NaN

This is due to the order of the tokens in df1. Since the value in ["rel"] depends on the direction ["token"] -> ["token2"], the merge can't apply it when the order is reversed. Is there any way to handle this in the merging process without creating a new version of df1?
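For anyone who wants to reproduce this, here is a minimal sketch of the setup. The DataFrame contents are copied from the tables in the question; the variable name `unmatched` is only an illustration and not part of the original post.

```python
import pandas as pd

# Data copied from the question
df1 = pd.DataFrame({
    "sent":   [0, 0, 0, 1, 1, 1, 1, 1],
    "token":  ["a", "a", "b", "g", "h", "h", "g", "g"],
    "token2": ["b", "c", "d", "h", "k", "i", "i", "k"],
})
df2 = pd.DataFrame({
    "sent":   [0, 1, 1],
    "token":  ["a", "g", "k"],
    "token2": ["b", "h", "g"],
    "rel":    ["A", "B", "C"],
})

# A left merge keeps every row of df1 and matches on exact key equality
df_new = df1.merge(df2, on=["sent", "token", "token2"], how="left")

# Rows of df1 that found no partner in df2. This includes (1, g, k),
# because df2 stores that pair in the opposite order, as (1, k, g).
unmatched = df_new[df_new["rel"].isna()]
print(unmatched)
```

The merge compares key columns position by position, so a pair stored as (k, g) never equals (g, k).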

No, you'll have to do some sort of manipulation on df1 or df2 to get the results you desire; there is no parameter in merge that will let you change the merge keys.
– Scott Boston, Commented May 15, 2018 at 14:09
Yes, I had to improve my df1 to include all possible combinations of token and token2.
– Mi., Commented May 15, 2018 at 14:21

BENY · Accepted Answer · 2018-05-15 14:12:13Z

You can do this with np.sort:

    import numpy as np

    # Put each (token, token2) pair of df2 into sorted order, then merge
    df2[['token', 'token2']] = np.sort(df2[['token', 'token2']].values, axis=1)
    df1.merge(df2, on=["sent", "token", "token2"], how="left")

    Out[398]:
       sent token token2  rel
    0     0     a      b    A
    1     0     a      c  NaN
    2     0     b      d  NaN
    3     1     g      h    B
    4     1     h      k  NaN
    5     1     h      i  NaN
    6     1     g      i  NaN
    7     1     g      k    C
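A self-contained sketch of the np.sort approach, assuming the sample data from the question. It sorts a copy (`df2_sorted`, a name introduced here for illustration) instead of overwriting df2. It also assumes that every (token, token2) pair in df1 is already in sorted order, which holds for the sample data.

```python
import numpy as np
import pandas as pd

# Data copied from the question
df1 = pd.DataFrame({
    "sent":   [0, 0, 0, 1, 1, 1, 1, 1],
    "token":  ["a", "a", "b", "g", "h", "h", "g", "g"],
    "token2": ["b", "c", "d", "h", "k", "i", "i", "k"],
})
df2 = pd.DataFrame({
    "sent":   [0, 1, 1],
    "token":  ["a", "g", "k"],
    "token2": ["b", "h", "g"],
    "rel":    ["A", "B", "C"],
})

# Work on a copy so the original df2 keeps its token order
df2_sorted = df2.copy()

# Sort each row's pair: (k, g) becomes (g, k), already-sorted pairs stay as they are
df2_sorted[["token", "token2"]] = np.sort(
    df2[["token", "token2"]].to_numpy(), axis=1
)

df_new = df1.merge(df2_sorted, on=["sent", "token", "token2"], how="left")
print(df_new)
```

Sorting the pair throws away the direction of the relation, so this only fits data where (g, k) and (k, g) are meant to be the same key.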

2 Comments

Mi. Over a year ago

Sorry, I should have been clearer in my initial post. While this solution works for this case, it breaks all the other cases where the order was already right. The order does not depend on alphabetical order.

BENY Over a year ago

@ThelMi Where are the other cases? You should include them in your question.

Mi. · Accepted Answer · 2018-05-15 14:35:43Z

Solution:

I had to include all possible combinations of token and token2 in the first DataFrame, since the value of rel depends on the correct order of the two values. This means my desired outcome was wrong to begin with. I had to delete this line in the creation of df1:

df1[['token','token2']]=np.sort(df1[['token','token2']],1)

Now I get the desired version of df1 for this task:

df1 =

    sent  token  token2
    0     a      b
    0     a      c
    0     a      d
    0     b      a
    0     b      c
    0     b      d
    ...
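A minimal sketch of how such a df1 could be built, assuming the tokens of each sentence are available as a dict. The name `tokens_by_sent` is hypothetical and its contents are inferred from the sample data, not given in the post. `itertools.permutations(tokens, 2)` yields every ordered pair, so both (g, k) and (k, g) are present and the left merge can match the direction stored in df2 exactly.

```python
from itertools import permutations

import pandas as pd

# Hypothetical input: the tokens of each sentence (inferred from the sample data)
tokens_by_sent = {
    0: ["a", "b", "c", "d"],
    1: ["g", "h", "i", "k"],
}

# One row per ordered pair of distinct tokens within a sentence
rows = [
    (sent, t1, t2)
    for sent, tokens in tokens_by_sent.items()
    for t1, t2 in permutations(tokens, 2)
]
df1 = pd.DataFrame(rows, columns=["sent", "token", "token2"])

# df2 copied from the question, left in its original order
df2 = pd.DataFrame({
    "sent":   [0, 1, 1],
    "token":  ["a", "g", "k"],
    "token2": ["b", "h", "g"],
    "rel":    ["A", "B", "C"],
})

df_new = df1.merge(df2, on=["sent", "token", "token2"], how="left")

# Only the pairs whose direction matches df2 receive a rel value
matched = df_new.dropna(subset=["rel"])
print(matched)
```

With n tokens per sentence this produces n * (n - 1) rows, so df1 grows quadratically with sentence length.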
