1

I am having an issue merging two frames with a different amount of rows. The first dataframe has 5K rows, and the second dataframe has 20K rows. There is a column "id" in both frames, and all 5K "id" values will occur in the frame with 20K rows.

first frame "df"

 A B id A_1 B_1 0 1 1 1 0.5 0.5 1 3 2 2 0.2 0.4 2 3 4 3 0.8 0.9 

second frame "df_2"

 A B id 0 1 1 1 1 3 2 2 2 3 4 3 3 1 2 4 4 3 1 5 

Hopeful output frame "df_out"

 A B id A_1 B_1 0 1 1 1 0.5 0.5 1 3 2 2 0.2 0.4 2 3 4 3 0.8 0.9 3 1 2 4 na na 4 3 1 5 na na 

My attempts to merge on 'id' have left me with only the 5k rows. The operation I am seeking is to preserve all the rows of the large dataframe, and stick Nan values for the data that does not exist in the large frame.

Thanks

1
  • 1
    Use the how='outer' option of pd.merge. Commented Feb 17, 2017 at 1:55

1 Answer 1

3

Just specify how=outer to df.merge so that you use the union of both DataFrames.

>>> df.merge(df_2, how='outer') A A_1 B B_1 id 0 1.0 0.5 1.0 0.5 1.0 1 3.0 0.2 2.0 0.4 2.0 2 3.0 0.8 4.0 0.9 3.0 3 1.0 NaN 2.0 NaN 4.0 4 3.0 NaN 1.0 NaN 5.0 
Sign up to request clarification or add additional context in comments.

4 Comments

@JeffSaltfist Glad I could help!
quick question, does the 'outer' argument signify the last column in the frame?
@JeffSaltfist It specifies that the union of keys from both DataFrames should be used to merge, as opposed to the intersection or the entirety of one or the other (see the docs I linked).
Thank you. I was able to solve the issue with your code and by also including left_index=False and right_index=False in the code.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.