
This question refers to the previous post.

The solutions proposed there worked very well for a smaller data set. Here I'm working with 7 .txt files totalling 750 MB, which shouldn't be too big, so I must be doing something wrong in the process.

df1 = pd.read_csv('Data1.txt', skiprows=0, delimiter=' ', usecols=[1, 2, 5, 7, 8, 10, 12, 13, 14])
df2 = pd.read_csv('Data2.txt', skiprows=0, delimiter=' ', usecols=[1, 2, 5, 7, 8, 10, 12, 13, 14])
df3 = ...
df4 = ...
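As an aside, since all seven files share the same read_csv arguments, the calls can be collapsed into a loop. A minimal sketch, using in-memory stand-ins for the Data*.txt files (the file contents here are made up for illustration):

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-ins for the seven Data*.txt files.
fake_files = [StringIO("name_profile depth VAR%d\nprofile_1 0.6 0.2044\n" % i)
              for i in range(1, 8)]

# One read_csv call in a loop instead of seven copy-pasted lines;
# with real files this would be pd.read_csv(f"Data{i}.txt", ...).
dfs = [pd.read_csv(f, delimiter=" ") for f in fake_files]
print(len(dfs), list(dfs[0].columns))  # 7 ['name_profile', 'depth', 'VAR1']
```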

This is what one of my dataframes (df1) looks like (head):

        name_profile  depth    VAR1  ...  year  month  day
    0      profile_1    0.6  0.2044  ...  2012     11   26
    1      profile_1    0.6  0.2044  ...  2012     11   26
    2      profile_1    1.1  0.2044  ...  2012     11   26
    3      profile_1    1.2  0.2044  ...  2012     11   26
    4      profile_1    1.4  0.2044  ...  2012     11   26
    ...

And tail:

            name_profile       depth     VAR1  ...  year  month  day
    955281  profile_1300  194.600006  0.01460  ...  2015      3   20
    955282  profile_1300  195.800003  0.01095  ...  2015      3   20
    955283  profile_1300  196.899994  0.01095  ...  2015      3   20
    955284  profile_1300  198.100006  0.00730  ...  2015      3   20
    955285  profile_1300  199.199997  0.01825  ...  2015      3   20

I followed a suggestion and dropped duplicates:

df1 = df1.drop_duplicates()
...
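One detail worth noting: drop_duplicates returns a new DataFrame rather than modifying the frame in place, so its result must be reassigned (or inplace=True passed). A toy sketch with hypothetical data:

```python
import pandas as pd

# Toy frame with one fully duplicated row (made-up values).
df1 = pd.DataFrame({
    "name_profile": ["profile_1", "profile_1", "profile_1"],
    "depth": [0.6, 0.6, 1.1],
    "VAR1": [0.2044, 0.2044, 0.2044],
})

# Reassign: a bare df1.drop_duplicates() would discard the result.
df1 = df1.drop_duplicates()
print(len(df1))  # 2
```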

etc.

Similarly, df2 has VAR2, df3 has VAR3, and so on.

The solution is modified according to one of the answers from the previous post.

The aim is to create a new, merged DataFrame with every VARX (one per dfX) as an additional column alongside depth, name_profile and the other three key columns, so I tried something like this:

dfs = [df.set_index(['depth', 'name_profile', 'year', 'month', 'day'])
       for df in [df1, df2, df3, df4, df5, df6, df7]]
df_merged = pd.concat(dfs, axis=1).reset_index()

The current error is:

ValueError: cannot handle a non-unique multi-index!

What am I doing wrong?
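For context, the failure can be reproduced with two tiny frames (made-up values): when the key columns repeat, the resulting multi-index has duplicate entries, and concat along axis=1 cannot align the frames against it. A minimal sketch:

```python
import pandas as pd

# Hypothetical toy frames whose key columns repeat, so the
# multi-index built from them contains duplicate entries.
a = pd.DataFrame({"depth": [0.6, 0.6], "name_profile": ["p1", "p1"],
                  "VAR1": [1.0, 2.0]})
b = pd.DataFrame({"depth": [0.6, 1.1], "name_profile": ["p1", "p1"],
                  "VAR2": [3.0, 4.0]})

dfs = [df.set_index(["depth", "name_profile"]) for df in (a, b)]
try:
    pd.concat(dfs, axis=1)
    aligned = True
except Exception:  # exact error message varies by pandas version
    aligned = False
print(aligned)  # False: duplicate index entries cannot be aligned
```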

  • You don't need Dask for this, the file size is trivial for any modern system. Commented Apr 12, 2019 at 17:00
  • reduce is a very intensive process as it nests with each iteration. Use concat instead. Commented Apr 12, 2019 at 17:01
  • The problem is here: dfs2 = [dfs1, df3]. dfs1 is, itself, a list of dataframes. You perhaps wanted to extend the list or append to it, not nest it. Commented Apr 12, 2019 at 17:06
  • Once again, de-dupe your data on the keys with drop_duplicates(...) or run an aggregation to pick the first pairing: groupby(...).first(). Commented Apr 12, 2019 at 17:09
  • Please show your data, attempted code, and errors/undesired results. Commented Apr 12, 2019 at 17:25
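The de-duplication suggested in the comments can take either of the two forms mentioned there. A sketch on hypothetical data, where two rows share the same key columns but differ in VAR1 (so a full-row drop_duplicates() would keep both, and the subset= argument matters):

```python
import pandas as pd

keys = ["name_profile", "depth", "year", "month", "day"]

# Two rows with identical keys but different VAR1 (made-up values).
df = pd.DataFrame({
    "name_profile": ["profile_1", "profile_1"],
    "depth": [0.6, 0.6],
    "year": [2012, 2012],
    "month": [11, 11],
    "day": [26, 26],
    "VAR1": [0.2044, 0.2050],
})

# Option 1: keep the first row per key combination.
deduped = df.drop_duplicates(subset=keys, keep="first")

# Option 2: aggregate per key -- equivalent here, and lets you
# swap in .mean() or another reducer if that fits the data better.
aggregated = df.groupby(keys, as_index=False).first()

print(len(deduped), len(aggregated))  # 1 1
```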

1 Answer


Consider again the horizontal concatenation with pandas.concat. Because you have multiple rows sharing the same profile, depth, year, month, and day, add a running count to the multi-index, calculated with groupby().cumcount():

grp_cols = ['depth', 'name_profile', 'year', 'month', 'day']

dfs = [(df.assign(grp_count = df.groupby(grp_cols).cumcount())
          .set_index(grp_cols + ['grp_count'])
       ) for df in [df1, df2, df3, df4, df5, df6, df7]]

df_merged = pd.concat(dfs, axis=1).reset_index()

print(df_merged)
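To see why this resolves the error, here is a toy run with made-up values standing in for df1 and df2: cumcount numbers the repeats within each key group (0, 1, 2, ...), which makes the combined index unique so concat can align the frames.

```python
import pandas as pd

grp_cols = ["depth", "name_profile", "year", "month", "day"]

# Hypothetical stand-ins for df1 and df2 with repeated key rows.
df1 = pd.DataFrame({"depth": [0.6, 0.6], "name_profile": ["p1", "p1"],
                    "year": [2012, 2012], "month": [11, 11], "day": [26, 26],
                    "VAR1": [0.2044, 0.2044]})
df2 = pd.DataFrame({"depth": [0.6, 0.6], "name_profile": ["p1", "p1"],
                    "year": [2012, 2012], "month": [11, 11], "day": [26, 26],
                    "VAR2": [0.0146, 0.0109]})

# grp_count is 0 for the first row in each key group, 1 for the
# second, and so on -- the index becomes unique and concat works.
dfs = [df.assign(grp_count=df.groupby(grp_cols).cumcount())
         .set_index(grp_cols + ["grp_count"])
       for df in (df1, df2)]
df_merged = pd.concat(dfs, axis=1).reset_index()
print(df_merged.shape)  # (2, 8): 6 key columns plus VAR1 and VAR2
```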

2 Comments

I think this might be it, yes!!! THANK YOU SO MUCH! I'll take a closer look tomorrow, now I'm KO, but that might be it! What's the point of this cumcount function by the way?
Please read my opening text: cumcount is there to resolve your repeated profile/depth rows.
