20

This may look like a duplicate of a thorough explanation of various approaches; however, I can't find a solution to my problem there because I have a larger number of DataFrames.

I have multiple Data Frames (more than 10), each differing in one column VARX. This is just a quick and oversimplified example:

    import pandas as pd

    df1 = pd.DataFrame({'depth': [0.500000, 0.600000, 1.300000],
                        'VAR1': [38.196202, 38.198002, 38.200001],
                        'profile': ['profile_1', 'profile_1', 'profile_1']})

    df2 = pd.DataFrame({'depth': [0.600000, 1.100000, 1.200000],
                        'VAR2': [0.20440, 0.20442, 0.20446],
                        'profile': ['profile_1', 'profile_1', 'profile_1']})

    df3 = pd.DataFrame({'depth': [1.200000, 1.300000, 1.400000],
                        'VAR3': [15.1880, 15.1820, 15.1820],
                        'profile': ['profile_1', 'profile_1', 'profile_1']})

Each df has the same or different depths for the same profiles, so I need to create a new DataFrame that merges all the separate ones. The key columns for the operation are depth and profile, and every depth value that appears for a profile should be kept.

The VARX value should therefore be NaN wherever that variable has no measurement at that depth for that profile.

The result should thus be a new, compressed DataFrame with all VARX as additional columns next to the depth and profile ones, something like this:

    name_profile     depth       VAR1     VAR2     VAR3
    profile_1     0.500000  38.196202      NaN      NaN
    profile_1     0.600000  38.198002  0.20440      NaN
    profile_1     1.100000        NaN  0.20442      NaN
    profile_1     1.200000        NaN  0.20446  15.1880
    profile_1     1.300000  38.200001      NaN  15.1820
    profile_1     1.400000        NaN      NaN  15.1820

Note that the actual number of profiles is much, much bigger.

Any ideas?

    Take a look at the "Generalizing: mergeing multiple DataFrames" section, there is a partial solution explained there. If that was not helpful, please let me know how I might improve the post so it is clearer. Thanks! Commented Apr 15, 2019 at 2:35

5 Answers

23

Consider setting the index on each data frame and then running a horizontal merge with pd.concat:

    dfs = [df.set_index(['profile', 'depth']) for df in [df1, df2, df3]]

    print(pd.concat(dfs, axis=1).reset_index())
    #      profile  depth       VAR1     VAR2    VAR3
    # 0  profile_1    0.5  38.196202      NaN     NaN
    # 1  profile_1    0.6  38.198002  0.20440     NaN
    # 2  profile_1    1.1        NaN  0.20442     NaN
    # 3  profile_1    1.2        NaN  0.20446  15.188
    # 4  profile_1    1.3  38.200001      NaN  15.182
    # 5  profile_1    1.4        NaN      NaN  15.182
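Since the question mentions many profiles, here is a minimal sketch of the same idea wrapped in a function and run on two profiles. The helper name combine_on_keys and the tiny frames a and b are made up for illustration. It assumes each frame has at most one row per (profile, depth) pair, and checks that up front, because aligning frames with duplicated keys is ambiguous.

```python
import pandas as pd


def combine_on_keys(frames, keys=("profile", "depth")):
    """Outer-align frames on the key columns (hypothetical helper)."""
    indexed = []
    for frame in frames:
        idx = frame.set_index(list(keys))
        # Assumption: one row per key pair in every frame.
        if not idx.index.is_unique:
            raise ValueError("duplicate key rows; aggregate them first")
        indexed.append(idx)
    # axis=1 places the frames side by side, taking the union of all keys.
    return pd.concat(indexed, axis=1).sort_index().reset_index()


# Illustrative data with two profiles (not from the question).
a = pd.DataFrame({"profile": ["p1", "p2"], "depth": [0.5, 0.5], "VAR1": [1.0, 2.0]})
b = pd.DataFrame({"profile": ["p1", "p2"], "depth": [0.5, 0.7], "VAR2": [10.0, 20.0]})

combined = combine_on_keys([a, b])
print(combined)
```

Keys that appear in only one frame get NaN in the other frames' columns, which is the behaviour the question asks for.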

15

A simple way is with a combination of functools.partial/reduce.

First, partial lets you "freeze" some of a function's arguments and/or keywords, producing a new object with a simplified signature. Then reduce applies that partial object cumulatively to the items of an iterable (here, the list of DataFrames):

    from functools import partial, reduce

    dfs = [df1, df2, df3]
    merge = partial(pd.merge, on=['depth', 'profile'], how='outer')
    reduce(merge, dfs)

       depth       VAR1    profile     VAR2    VAR3
    0    0.5  38.196202  profile_1      NaN     NaN
    1    0.6  38.198002  profile_1  0.20440     NaN
    2    1.3  38.200001  profile_1      NaN  15.182
    3    1.1        NaN  profile_1  0.20442     NaN
    4    1.2        NaN  profile_1  0.20446  15.188
    5    1.4        NaN  profile_1      NaN  15.182
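To make the reduce step more concrete, the sketch below shows that it is just a chain of nested two-frame merges, and then tidies the result into the layout the question asks for. The names outer_merge, nested and tidy are illustrative, and the final sort and column reordering are an optional extra, not part of the answer above.

```python
from functools import partial, reduce

import pandas as pd

df1 = pd.DataFrame({"depth": [0.5, 0.6, 1.3],
                    "VAR1": [38.196202, 38.198002, 38.200001],
                    "profile": ["profile_1"] * 3})
df2 = pd.DataFrame({"depth": [0.6, 1.1, 1.2],
                    "VAR2": [0.20440, 0.20442, 0.20446],
                    "profile": ["profile_1"] * 3})
df3 = pd.DataFrame({"depth": [1.2, 1.3, 1.4],
                    "VAR3": [15.188, 15.182, 15.182],
                    "profile": ["profile_1"] * 3})

keys = ["profile", "depth"]
outer_merge = partial(pd.merge, on=keys, how="outer")

# reduce folds the list from the left ...
merged = reduce(outer_merge, [df1, df2, df3])
# ... which is the same as nesting the two-frame merges by hand.
nested = outer_merge(outer_merge(df1, df2), df3)

# Optional tidy-up: key columns first, rows ordered by profile and depth.
value_cols = [c for c in merged.columns if c not in keys]
tidy = merged.sort_values(keys).reset_index(drop=True)[keys + value_cols]
print(tidy)
```

The same pattern works for a list of any length, so more than 10 frames need no extra code.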

1 Comment

Is it possible to also keep the column names? When I use this, the column names are gone.
2

I would use append.

    >>> df1.append(df2).append(df3).sort_values('depth')
            VAR1     VAR2    VAR3  depth    profile
    0  38.196202      NaN     NaN    0.5  profile_1
    1  38.198002      NaN     NaN    0.6  profile_1
    0        NaN  0.20440     NaN    0.6  profile_1
    1        NaN  0.20442     NaN    1.1  profile_1
    2        NaN  0.20446     NaN    1.2  profile_1
    0        NaN      NaN  15.188    1.2  profile_1
    2  38.200001      NaN     NaN    1.3  profile_1
    1        NaN      NaN  15.182    1.3  profile_1
    2        NaN      NaN  15.182    1.4  profile_1

Obviously if you have a lot of dataframes, just make a list and loop through them.
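Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the stacking step has to go through pd.concat instead. The sketch below shows that equivalent. It also adds one step that is not part of this answer: collapsing the stacked rows to one row per (profile, depth) with groupby(...).first(), which takes the first non-null value in each column. That step assumes each variable has at most one measurement per key pair.

```python
import pandas as pd

df1 = pd.DataFrame({"depth": [0.5, 0.6, 1.3],
                    "VAR1": [38.196202, 38.198002, 38.200001],
                    "profile": ["profile_1"] * 3})
df2 = pd.DataFrame({"depth": [0.6, 1.1, 1.2],
                    "VAR2": [0.20440, 0.20442, 0.20446],
                    "profile": ["profile_1"] * 3})
df3 = pd.DataFrame({"depth": [1.2, 1.3, 1.4],
                    "VAR3": [15.188, 15.182, 15.182],
                    "profile": ["profile_1"] * 3})

frames = [df1, df2, df3]  # any number of frames works here

# Stack the frames vertically; missing columns are filled with NaN.
stacked = pd.concat(frames).sort_values("depth")

# Added step (assumption: one measurement per variable and key pair):
# keep the first non-null value of every column within each key group.
collapsed = stacked.groupby(["profile", "depth"], as_index=False).first()
print(collapsed)
```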

4 Comments

thank you! @BlivetWidget, how do you sort it both by depth AND profile? each profile has a set of depths and each dataframe has a bunch of profiles?
@PEBKAC you can sort it by however many parameters you want, in whatever order you want. .sort_values(['depth', 'profile']) or .sort_values(['profile', 'depth']). You can check the help on df1.sort_values to learn how to change the sort order, to sort in place, and various other optional parameters.
@BlivetWidget, I'm sure this is a silly question, but I have 3 dataframes I need to join - does it matter if 1) they have no data in common and 2) their columns are not in the same order? By 1, I mean that they have the same columns, but each row contains different data (this is 3 different infection types). And, yes, I'm going to try it and see what happens.
@DataGirl that's not a problem, but as with anything else, only you can say if the result is what you wanted it to be. Saying "join" in English is not very specific. There are other functions like df.combine(), df.join(), and df.merge() that are worth looking into if df.append() doesn't do what you want.
1

Why not concatenate all the Data Frames, melt, then reform them using your ids? There might be a more efficient way to do this, but this works.

    df = pd.melt(pd.concat([df1, df2, df3]), id_vars=['profile', 'depth'])
    df_pivot = df.pivot_table(index=['profile', 'depth'], columns='variable', values='value')

Where df_pivot will be

    variable              VAR1     VAR2    VAR3
    profile   depth
    profile_1 0.5    38.196202      NaN     NaN
              0.6    38.198002  0.20440     NaN
              1.1          NaN  0.20442     NaN
              1.2          NaN  0.20446  15.188
              1.3    38.200001      NaN  15.182
              1.4          NaN      NaN  15.182
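The pivoted result keeps profile and depth in a MultiIndex and labels the column axis "variable". If a flat frame like the one in the question is wanted, a sketch along these lines should work (the names long, wide and flat are illustrative). One thing to keep in mind: pivot_table aggregates with the mean by default, so repeated measurements of a variable at the same profile and depth would be averaged rather than kept separate.

```python
import pandas as pd

df1 = pd.DataFrame({"depth": [0.5, 0.6, 1.3],
                    "VAR1": [38.196202, 38.198002, 38.200001],
                    "profile": ["profile_1"] * 3})
df2 = pd.DataFrame({"depth": [0.6, 1.1, 1.2],
                    "VAR2": [0.20440, 0.20442, 0.20446],
                    "profile": ["profile_1"] * 3})
df3 = pd.DataFrame({"depth": [1.2, 1.3, 1.4],
                    "VAR3": [15.188, 15.182, 15.182],
                    "profile": ["profile_1"] * 3})

# Long format: one row per (profile, depth, variable) combination.
long = pd.melt(pd.concat([df1, df2, df3]), id_vars=["profile", "depth"])

# Wide format again, one column per variable (default aggregation: mean).
wide = long.pivot_table(index=["profile", "depth"],
                        columns="variable", values="value")

# Turn the index levels back into ordinary columns and drop the axis label.
flat = wide.reset_index()
flat.columns.name = None
print(flat)
```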


1

You can also use:

    dfs = [df1, df2, df3]
    df = pd.merge(dfs[0], dfs[1], left_on=['depth', 'profile'],
                  right_on=['depth', 'profile'], how='outer')
    for d in dfs[2:]:
        df = pd.merge(df, d, left_on=['depth', 'profile'],
                      right_on=['depth', 'profile'], how='outer')

       depth       VAR1    profile     VAR2    VAR3
    0    0.5  38.196202  profile_1      NaN     NaN
    1    0.6  38.198002  profile_1  0.20440     NaN
    2    1.3  38.200001  profile_1      NaN  15.182
    3    1.1        NaN  profile_1  0.20442     NaN
    4    1.2        NaN  profile_1  0.20446  15.188
    5    1.4        NaN  profile_1      NaN  15.182
