Pandas: how to merge two dataframes on a column by keeping the information of the first one?

Question

I have two dataframes df1 and df2. df1 contains the information of the age of people, while df2 contains the information of the sex of people. Not all the people are in df1 nor in df2

df1 Name Age 0 Tom 34 1 Sara 18 2 Eva 44 3 Jack 27 4 Laura 30 df2 Name Sex 0 Tom M 1 Paul M 2 Eva F 3 Jack M 4 Michelle F

I want to have the information of the sex of the people in df1 and setting NaN if I do not have this information in df2. I tried to do df1 = pd.merge(df1, df2, on = 'Name', how = 'outer') but I keep the information of some people in df2 that I don't want.

df1 Name Age Sex 0 Tom 34 M 1 Sara 18 NaN 2 Eva 44 F 3 Jack 27 M 4 Laura 30 NaN

jezrael · Accepted Answer · 2019-02-26 11:46:08Z

Sample:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 'Age': [34, 18, 44, 27, 30]}) #print (df1) df3 = df1.copy() df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 'Sex': ['M', 'M', 'F', 'M', 'F']}) #print (df2)

Use map by Series created by set_index:

df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex']) print (df1) Name Age Sex 0 Tom 34 M 1 Sara 18 NaN 2 Eva 44 F 3 Jack 27 M 4 Laura 30 NaN

Alternative solution with merge with left join:

df = df3.merge(df2[['Name','Sex']], on='Name', how='left') print (df) Name Age Sex 0 Tom 34 M 1 Sara 18 NaN 2 Eva 44 F 3 Jack 27 M 4 Laura 30 NaN

If need map by multiple columns (e.g. Year and Code) need merge with left join:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 'Year':[2000,2003,2003,2004,2007], 'Code':[1,2,3,4,4], 'Age': [34, 18, 44, 27, 30]}) print (df1) Name Year Code Age 0 Tom 2000 1 34 1 Sara 2003 2 18 2 Eva 2003 3 44 3 Jack 2004 4 27 4 Laura 2007 4 30 df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 'Sex': ['M', 'M', 'F', 'M', 'F'], 'Year':[2001,2003,2003,2004,2007], 'Code':[1,2,3,5,3], 'Val':[21,34,23,44,67]}) print (df2) Name Sex Year Code Val 0 Tom M 2001 1 21 1 Paul M 2003 2 34 2 Eva F 2003 3 23 3 Jack M 2004 5 44 4 Michelle F 2007 3 67

#merge by all columns df = df1.merge(df2, on=['Year','Code'], how='left') print (df) Name_x Year Code Age Name_y Sex Val 0 Tom 2000 1 34 NaN NaN NaN 1 Sara 2003 2 18 Paul M 34.0 2 Eva 2003 3 44 Eva F 23.0 3 Jack 2004 4 27 NaN NaN NaN 4 Laura 2007 4 30 NaN NaN NaN #specified columns - columns for join (Year, Code) need always + appended columns (Val) df = df1.merge(df2[['Year','Code', 'Val']], on=['Year','Code'], how='left') print (df) Name Year Code Age Val 0 Tom 2000 1 34 NaN 1 Sara 2003 2 18 34.0 2 Eva 2003 3 44 23.0 3 Jack 2004 4 27 NaN 4 Laura 2007 4 30 NaN

If get error with map it means duplicates by columns of join, here Name:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 'Age': [34, 18, 44, 27, 30]}) print (df1) Name Age 0 Tom 34 1 Sara 18 2 Eva 44 3 Jack 27 4 Laura 30 df3, df4 = df1.copy(), df1.copy() df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'], 'Val': [1,2,3,4,5]}) print (df2) Name Val 0 Tom 1 <-duplicated name Tom 1 Tom 2 <-duplicated name Tom 2 Eva 3 3 Jack 4 4 Michelle 5 s = df2.set_index('Name')['Val'] df1['New'] = df1['Name'].map(s) print (df1)

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Solutions are removed duplicates by DataFrame.drop_duplicates, or use map by dict for last dupe match:

#default keep first value s = df2.drop_duplicates('Name').set_index('Name')['Val'] print (s) Name Tom 1 Eva 3 Jack 4 Michelle 5 Name: Val, dtype: int64 df1['New'] = df1['Name'].map(s) print (df1) Name Age New 0 Tom 34 1.0 1 Sara 18 NaN 2 Eva 44 3.0 3 Jack 27 4.0 4 Laura 30 NaN

#add parameter for keep last value s = df2.drop_duplicates('Name', keep='last').set_index('Name')['Val'] print (s) Name Tom 2 Eva 3 Jack 4 Michelle 5 Name: Val, dtype: int64 df3['New'] = df3['Name'].map(s) print (df3) Name Age New 0 Tom 34 2.0 1 Sara 18 NaN 2 Eva 44 3.0 3 Jack 27 4.0 4 Laura 30 NaN

#map by dictionary d = dict(zip(df2['Name'], df2['Val'])) print (d) {'Tom': 2, 'Eva': 3, 'Jack': 4, 'Michelle': 5} df4['New'] = df4['Name'].map(d) print (df4) Name Age New 0 Tom 34 2.0 1 Sara 18 NaN 2 Eva 44 3.0 3 Jack 27 4.0 4 Laura 30 NaN

hello, how to usedf1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex']) when second dataframe have different number of rows? i am using it on my dataset and i am receiving results only for first row, thanks
@sygneto - It should working, values matched? What return print (df1['Sex'].unique()) vs print (df2['Sex'].unique()) ?
i have all uniqe values, but in my case this column df1['sex'] already exist and have in each row value =0, do you think is the way how to replace it? or maybe delete this column before map?
@sygneto - Not easy for me see problem, because cannot see your data. :(
i think the reason is that because i already have ['sex'] column in both dataframes, how i can replace it or append?

Xiaoyu Lu · Accepted Answer · 2018-10-26 14:09:25Z

You can also use the join method:

df1.set_index("Name").join(df2.set_index("Name"), how="left")

edit: added set_index("Name")

tdy · Accepted Answer · 2021-08-29 09:53:27Z

Reindexing hasn't been mentioned yet, but it's very fast and can auto-fill missing values if desired.

`DataFrame.reindex`

Use the common key (Name) as the index of the mapping dataframe (df2):

If df2's index is already Name, just reindex directly:
```
df2['Sex'].reindex(df1['Name']) 
```

Otherwise set_index beforehand:

df2.set_index('Name')['Sex'].reindex(df1['Name'])

Note that when assigning into an existing dataframe, the reindexed index will be misaligned, so assign only the array values:

df1['Sex'] = df2.set_index('Name')['Sex'].reindex(df1['Name']).array # Name Age Sex # 0 Tom 34 M # 1 Sara 18 NaN # 2 Eva 44 F # 3 Jack 27 M # 4 Laura 30 NaN

Also I've noticed a common assumption that reindexing is slow, but it's actually fast(est):

To fill missing values

reindex supports auto-filling missing values:

fill_value: static replacement
method: algorithmic replacement (ffill, bfill, or nearest) given monotonic index

For example, to fill empty Sex values with Prefer not to say (PNS):

df2.set_index('Name')['Sex'].reindex(df1['Name'], fill_value='PNS') # Name Age Sex # 0 Tom 34 M # 1 Sara 18 PNS # 2 Eva 44 F # 3 Jack 27 M # 4 Laura 30 PNS

Reindexing with fill_value is faster than chaining fillna:

To handle duplicates

The mapping dataframe (df2) cannot have duplicate keys, so drop_duplicates if applicable:

df2.drop_duplicates('Name').set_index('Name')['Sex'].reindex(df1['Name'])

_{Timing data:}

''' Note: This is python code in a js snippet, so "run code snippet" will not work. The snippet is just to avoid cluttering the main post with supplemental code. ''' df1 = pd.DataFrame({'Name': np.arange(n), 'Age': np.random.randint(100, size=n)}).sample(frac=1).reset_index(drop=True) df2 = pd.DataFrame({'Name': np.arange(n) + int(n * 0.5), 'Sex': np.random.choice(list('MF'), size=n)}).sample(frac=1).reset_index(drop=True) def reindex_(df1, df2): df1['Sex'] = df2.set_index('Name')['Sex'].reindex(df1['Name']).array return df1 def map_(df1, df2): df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex']) return df1 def dict_(df1, df2): df1['Sex'] = df1['Name'].map(dict(zip(df2['Name'], df2['Sex']))) return df1 def merge_(df1, df2): return df1.merge(df2[['Name', 'Sex']], left_on='Name', right_on='Name', how='left') def join_(df1, df2): return df1.set_index('Name').join(df2.set_index('Name'), how='left').reset_index() reindex_fill_value_ = lambda df1, df2: df2.set_index('Name')['Sex'].reindex(df1['Name'], fill_value='PNTS') reindex_fillna_ = lambda df1, df2: df2.set_index('Name')['Sex'].reindex(df1['Name']).fillna('PNTS') map_fillna_ = lambda df1, df2: df1['Name'].map(df2.set_index('Name')['Sex']).fillna('PNTS')

Bhanuchander Udhayakumar · Accepted Answer · 2021-04-01 05:38:44Z

Simple addition to @jezrael answer for creating dictionary from dataframe.

It may be helpful..

Python:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 'Age': [34, 18, 44, 27, 30]}) df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Paul', 'Jack', 'Michelle', 'Tom'], 'Something': ['M', 'M', 'F', 'M', 'A', 'F', 'B']}) df1_dict = pd.Series(df1.Age.values, index=df1.Name).to_dict() df2['Age'] = df2['Name'].map(df1_dict) print(df2)

Output:

 Name Something Age 0 Tom M 34.0 1 Paul M NaN 2 Eva F 44.0 3 Paul M NaN 4 Jack A 27.0 5 Michelle F NaN 6 Tom B 34.0

Collectives™ on Stack Overflow

Pandas: how to merge two dataframes on a column by keeping the information of the first one?

4 Answers 4

8 Comments

Comments

`DataFrame.reindex`

To fill missing values

To handle duplicates

Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

8 Comments

Comments

To fill missing values

To handle duplicates

Comments

Comments

Linked

Related