
I have two DataFrames like these:

df1 =
    sent  token  token2
    0     a      b
    0     a      c
    0     b      d
    1     g      h
    1     h      k
    1     h      i
    1     g      i
    1     g      k

df2 =
    sent  token  token2  rel
    0     a      b       A
    1     g      h       B
    1     k      g       C

Now I want to merge those two DataFrames into one which should look like this:

df_new =
    sent  token  token2  rel
    0     a      b       A
    0     a      c       NaN
    0     b      d       NaN
    1     g      h       B
    1     h      k       NaN
    1     h      i       NaN
    1     g      i       NaN
    1     g      k       C

However merging the DataFrames like this

df_new = df1.merge(df2, on=["sent","token","token2"], how="left")

I get the output I want, except that the last value in the ["rel"] column is wrong:

df_new =
    sent  token  token2  rel
    0     a      b       A
    0     a      c       NaN
    0     b      d       NaN
    1     g      h       B
    1     h      k       NaN
    1     h      i       NaN
    1     g      i       NaN
    1     g      k       NaN

This is due to the order of the tokens: df2 has the pair as k, g while df1 has it as g, k. Since the value in ["rel"] depends on the direction ["token"] -> ["token2"], the merge can't match the row when the order is reversed. Is there any way to handle this in the merging process without creating a new version of df1?

  • No, you'll have to do some sort of manipulation on df1 or df2 to get the results you desire; there is no parameter in merge that will let you change the merge keys. Commented May 15, 2018 at 14:09
  • Yes, I had to improve my df1 to include all possible combinations of token and token2. Commented May 15, 2018 at 14:21
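As a rough illustration of the kind of manipulation the first comment describes, one option is to leave df1 untouched and extend df2 with a copy whose token columns are swapped, so that a pair matches in either order. This is only a sketch under the assumption that the relation can be treated as direction-insensitive, which is what the desired output in the question implies; the names `swapped` and `lookup` are made up for this example.

```python
import pandas as pd

# Data from the question.
df1 = pd.DataFrame({
    "sent":   [0, 0, 0, 1, 1, 1, 1, 1],
    "token":  ["a", "a", "b", "g", "h", "h", "g", "g"],
    "token2": ["b", "c", "d", "h", "k", "i", "i", "k"],
})
df2 = pd.DataFrame({
    "sent":   [0, 1, 1],
    "token":  ["a", "g", "k"],
    "token2": ["b", "h", "g"],
    "rel":    ["A", "B", "C"],
})

keys = ["sent", "token", "token2"]

# Copy of df2 with the two token columns swapped (hypothetical helper frame).
swapped = df2.rename(columns={"token": "token2", "token2": "token"})

# Original rows come first, so they win if df2 already holds both orientations.
lookup = (
    pd.concat([df2, swapped[df2.columns]], ignore_index=True)
      .drop_duplicates(subset=keys, keep="first")
)

# df1 is not modified; only the lookup table was extended.
df_new = df1.merge(lookup, on=keys, how="left")
print(df_new)
```

With this lookup, the row 1, g, k picks up rel C from the swapped copy of 1, k, g. If the direction of the relation matters, this approach is not appropriate, because it deliberately discards that information.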

2 Answers


You can do this with np.sort:

import numpy as np

df2[['token','token2']] = np.sort(df2[['token','token2']].values, axis=1)
df1.merge(df2, on=["sent","token","token2"], how="left")
Out[398]:
   sent token token2  rel
0     0     a      b    A
1     0     a      c  NaN
2     0     b      d  NaN
3     1     g      h    B
4     1     h      k  NaN
5     1     h      i  NaN
6     1     g      i  NaN
7     1     g      k    C

2 Comments

Sorry, I should have been clearer in my initial post. While this solution works for this case, it breaks all the other cases where the order was already right. The order is not dependent on alphabetical order.
@ThelMi Where are the other cases? You should include them in your question.

Solution:

I had to include all possible combinations of token and token2 in the first DataFrame, since the value of rel depends on the correct order of the two values. This means my desired outcome was wrong to begin with. I had to delete this line in the creation of df1:

df1[['token','token2']]=np.sort(df1[['token','token2']],1) 

So I get the desired version of df1 for this task.

df1 =
    sent  token  token2
    0     a      b
    0     a      c
    0     a      d
    0     b      a
    0     b      c
    0     b      d
    ...
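For readers who want to reproduce this, here is a minimal sketch of one way to build such a df1 with every ordered token pair per sentence and then run the original merge. The answer does not show how df1 was actually created, so the `tokens_by_sent` dictionary and the use of `itertools.permutations` are assumptions made for illustration.

```python
from itertools import permutations

import pandas as pd

# Assumed token lists per sentence (hypothetical input, using the question's tokens).
tokens_by_sent = {0: ["a", "b", "c", "d"], 1: ["g", "h", "i", "k"]}

# Every ordered pair (token, token2) within a sentence; no sorting of the pair.
rows = [
    (sent, t1, t2)
    for sent, tokens in tokens_by_sent.items()
    for t1, t2 in permutations(tokens, 2)
]
df1 = pd.DataFrame(rows, columns=["sent", "token", "token2"])

# df2 from the question.
df2 = pd.DataFrame({
    "sent":   [0, 1, 1],
    "token":  ["a", "g", "k"],
    "token2": ["b", "h", "g"],
    "rel":    ["A", "B", "C"],
})

# The plain left merge now finds k -> g because that ordered pair exists in df1.
df_new = df1.merge(df2, on=["sent", "token", "token2"], how="left")
print(df_new.dropna(subset=["rel"]))
```

Because both orders of each pair are present, rel C attaches to the row 1, k, g and the row 1, g, k stays NaN, which keeps the direction of the relation intact.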
