
Okay, so I have two DataFrames that share one common column ('ID'). I would like to add all keys and their values from df_2 to df_1, matched on their respective 'ID'. I have added a minimal reproducible example below. In this case the IDs happen to be in the same order, but that is not necessarily the case. Both DataFrames always contain the same IDs. Importantly, the IDs are not unique; however, the order of the occurrences of a given ID is the same in both DataFrames.

import pandas as pd

x = {'ID': [1]*50 + [2]*50,
     'row': [9, 2, 8, 4, 2, 5, 2, 1, 1, 6, 5, 6, 3, 1, 1, 2, 1, 2, 3, 8, 3, 2, 8, 8, 7, 4, 2, 3, 7, 7, 7, 4, 3, 7, 5, 6, 8, 7, 4, 5, 7, 1, 8, 9, 4, 1, 6, 1, 7, 1, 5, 9, 9, 9, 4, 6, 5, 1, 7, 9, 5, 9, 6, 8, 6, 7, 4, 3, 4, 4, 7, 2, 7, 8, 6, 6, 8, 5, 4, 9, 9, 4, 3, 9, 7, 8, 3, 5, 1, 6, 1, 6, 8, 2, 1, 4, 3, 7, 7, 6],
     'block': [2, 8, 8, 2, 4, 1, 6, 4, 2, 7, 7, 1, 3, 1, 4, 8, 4, 2, 5, 2, 4, 9, 3, 6, 6, 1, 4, 2, 7, 4, 4, 1, 9, 8, 5, 1, 1, 4, 6, 4, 9, 1, 7, 4, 5, 3, 6, 2, 6, 8, 5, 9, 5, 8, 1, 8, 3, 3, 5, 7, 3, 2, 5, 9, 6, 4, 3, 9, 7, 5, 2, 1, 2, 4, 6, 7, 7, 8, 2, 2, 7, 7, 9, 3, 8, 7, 3, 3, 1, 4, 6, 2, 6, 2, 8, 1, 1, 7, 5, 1]}
y = {'ID': [1]*50 + [2]*50,
     'Var1': [8, 3, 8, 3, 3, 2, 7, 8, 9, 6, 5, 6, 6, 3, 8, 3, 6, 4, 6, 2, 6, 2, 7, 3, 4, 7, 4, 9, 1, 5, 5, 5, 2, 9, 2, 5, 5, 1, 9, 9, 8, 2, 4, 2, 8, 7, 5, 8, 3, 9, 5, 9, 1, 1, 9, 9, 6, 3, 7, 1, 9, 8, 3, 3, 5, 7, 3, 8, 9, 1, 5, 2, 3, 3, 5, 5, 9, 6, 7, 4, 4, 1, 9, 7, 9, 8, 5, 4, 2, 5, 5, 8, 7, 5, 9, 9, 4, 5, 4, 8],
     'Var2': [3, 3, 3, 8, 9, 5, 7, 5, 5, 3, 7, 5, 7, 7, 2, 1, 3, 9, 8, 5, 5, 1, 1, 8, 5, 3, 1, 5, 4, 3, 5, 2, 3, 2, 7, 3, 9, 4, 8, 4, 6, 6, 2, 3, 8, 3, 3, 6, 6, 4, 8, 4, 1, 9, 8, 9, 5, 7, 6, 9, 4, 6, 8, 5, 7, 2, 2, 8, 9, 1, 9, 4, 8, 8, 6, 8, 1, 1, 4, 7, 8, 4, 1, 2, 2, 9, 5, 3, 7, 4, 5, 4, 5, 1, 2, 3, 5, 9, 2, 4]}
df_1 = pd.DataFrame(x)
df_2 = pd.DataFrame(y)

My approach:

def add_analysis(data, ana):
    # Copy every column of ana (except 'ID') into data, matched by ID.
    for key in ana.keys():
        if key == 'ID':
            continue
        add_db_key(ana, data, key)

def add_db_key(data_db, data_loaded, key):
    # For each ID, write that ID's values from data_db into the matching rows of data_loaded.
    for ID in data_db.ID:
        data_loaded.loc[data_loaded['ID'] == ID, key] = data_db[data_db.ID == ID][key].values

add_analysis(df_1, df_2)

which works fine; however, for large DataFrames it takes forever, since I am iterating through every column and every row. This feels very inefficient, and there is probably a merge operation that would do the trick; most importantly, though, the IDs have to match and the order has to be preserved. Any pointers to the right function would be highly appreciated. Thanks in advance.
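For concreteness, here is a tiny made-up pair of frames (not the data above) showing the intended alignment when my functions are applied; note df_2's rows are deliberately in a different order than df_1's:

```python
import pandas as pd

# Hypothetical 3-row frames: ID 1 occurs twice, ID 2 once.
df_1 = pd.DataFrame({'ID': [1, 2, 1], 'row': [9, 2, 8]})
df_2 = pd.DataFrame({'ID': [2, 1, 1], 'Var1': [30, 10, 11]})

def add_analysis(data, ana):
    # Copy every column of ana (except 'ID') into data, matched by ID.
    for key in ana.keys():
        if key == 'ID':
            continue
        add_db_key(ana, data, key)

def add_db_key(data_db, data_loaded, key):
    # For each ID, write that ID's values from data_db into the matching rows of data_loaded.
    for ID in data_db.ID:
        data_loaded.loc[data_loaded['ID'] == ID, key] = data_db[data_db.ID == ID][key].values

add_analysis(df_1, df_2)
# df_1 now has a 'Var1' column: the k-th occurrence of each ID in df_1
# received the k-th occurrence of that ID in df_2.
```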

1 Answer


Your functions do not work for me (they have no return value). However, if I understand correctly, you can solve the problem with something like this:

df_1['ID_idx'] = df_1.reset_index().apply(lambda x: f"{x['ID']}_{x['index']}", axis=1)
df_2['ID_idx'] = df_2.reset_index().apply(lambda x: f"{x['ID']}_{x['index']}", axis=1)
df = df_1.merge(df_2, left_on='ID_idx', right_on='ID_idx')

and then drop the columns you don't want (e.g. the ID_idx helper column).
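A variant of the same idea, sketched here on tiny made-up frames, numbers each ID's occurrences with groupby(...).cumcount() instead of building string keys; merging on both columns then pairs the k-th occurrence of each ID even when the rows sit at different positions in the two frames:

```python
import pandas as pd

df_1 = pd.DataFrame({'ID': [1, 2, 1], 'row': [9, 2, 8]})
df_2 = pd.DataFrame({'ID': [2, 1, 1], 'Var1': [30, 10, 11]})

# Number each ID's occurrences (0, 1, ...) within its own frame.
df_1['occ'] = df_1.groupby('ID').cumcount()
df_2['occ'] = df_2.groupby('ID').cumcount()

# Inner merge on (ID, occurrence) preserves df_1's row order.
df = df_1.merge(df_2, on=['ID', 'occ']).drop(columns='occ')
```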


1 Comment

Well, the function is not supposed to return anything, as it works in place. I will check whether your solution works and whether it is faster. Anyhow, thanks for your contribution :)
