I have three dataframes with 556, 555, and ~1600 columns each. I want to horizontally stack them, while merging the like columns. How would I do this with so many columns? I tried re-indexing so that the indices ran from 0-252 on the first df, 232-2518 on the second df, and 2519 to ~4000 on the final one, but I'm still getting the following error:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects 

Is it better to use merge or join over concat in this case?

The data can be found here: https://github.com/eoefelein/sample_data

Thank you so much!

2 Answers

Do you have a unique identifier across each dataframe to join them on?

If not, I think you just want a plain pd.concat, which takes the union of your dataframes: the total number of columns will be the count of distinct column names across all 3 dataframes.

import pandas as pd

df1 = pd.read_csv('sample_data/final_pre_rfe_fiverr.csv')
df2 = pd.read_csv('sample_data/final_pre_rfe_freelancer.csv')
df3 = pd.read_csv('sample_data/final_pre_rfe_pph.csv')

pd.concat((df1, df2, df3))

Notice in the output below that new columns are horizontally stacked while old ones are merged.

Output:

     Unnamed: 0              title  .net  360 photography  2d animation  \
0           253             mobile   0.0              0.0             0
1           254  quality assurance   0.0              0.0             0
2           255     data scientist   0.0              0.0             0
3           256     data scientist   0.0              0.0             0
4           257  quality assurance   0.0              0.0             0
..          ...                ...   ...              ...           ...
248         248     data scientist   NaN              NaN             0
249         249          fullstack   NaN              NaN             0
250         250          fullstack   NaN              NaN             0
251         251          fullstack   NaN              NaN             0
252         252          fullstack   NaN              NaN             0

     3d modelling  3d rendering  3d texturing  3ddesign  3dmodeling  ...  \
0             0.0             0           0.0       0.0         0.0  ...
1             0.0             0           0.0       0.0         0.0  ...
2             0.0             0           0.0       0.0         0.0  ...
3             0.0             0           0.0       0.0         0.0  ...
4             0.0             0           0.0       0.0         0.0  ...
..            ...           ...           ...       ...         ...  ...
248           NaN             0           NaN       NaN         NaN  ...
249           NaN             0           NaN       NaN         NaN  ...
250           NaN             0           NaN       NaN         NaN  ...
251           NaN             0           NaN       NaN         NaN  ...
252           NaN             0           NaN       NaN         NaN  ...

     webui studio 2013 for asp.net  windows administration  \
0                              NaN                     NaN
1                              NaN                     NaN
2                              NaN                     NaN
3                              NaN                     NaN
4                              NaN                     NaN
..                             ...                     ...
248                            0.0                     0.0
249                            0.0                     0.0
250                            0.0                     0.0
251                            0.0                     0.0
252                            0.0                     0.0

     windows powershell programming language.1  wordpress e-commerce  \
0                                          NaN                   NaN
1                                          NaN                   NaN
2                                          NaN                   NaN
3                                          NaN                   NaN
4                                          NaN                   NaN
..                                         ...                   ...
248                                        0.0                   0.0
249                                        0.0                   0.0
250                                        0.0                   0.0
251                                        0.0                   0.0
252                                        0.0                   0.0

     wordpress plugin.1  wordpress template  worpress migration  zapier  \
0                   NaN                 NaN                 NaN     NaN
1                   NaN                 NaN                 NaN     NaN
2                   NaN                 NaN                 NaN     NaN
3                   NaN                 NaN                 NaN     NaN
4                   NaN                 NaN                 NaN     NaN
..                  ...                 ...                 ...     ...
248                 0.0                 0.0                 0.0     0.0
249                 0.0                 0.0                 0.0     0.0
250                 0.0                 0.0                 0.0     0.0
251                 0.0                 0.0                 0.0     0.0
252                 0.0                 0.0                 0.0     0.0

     zend framework  zimbra
0               NaN     NaN
1               NaN     NaN
2               NaN     NaN
3               NaN     NaN
4               NaN     NaN
..              ...     ...
248             0.0     0.0
249             0.0     0.0
250             0.0     0.0
251             0.0     0.0
252             0.0     0.0

[4194 rows x 2194 columns]
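
The same behaviour is easier to see on a small scale. The sketch below uses two made-up frames (the names a, b and their columns are assumptions for illustration, not taken from the linked CSVs): the shared column is merged, and columns present in only one frame are filled with NaN for the other frame's rows.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real CSVs.
a = pd.DataFrame({"title": ["mobile", "fullstack"], "python": [1, 0]})
b = pd.DataFrame({"title": ["data scientist"], "sql": [1]})

# Default axis=0: rows are appended and columns are aligned by name.
combined = pd.concat([a, b])

print(combined.shape)          # (3, 3)
print(list(combined.columns))  # ['title', 'python', 'sql']
print(combined)
```

Note that the original row labels are kept, so the combined index here is 0, 1, 0, which matches the repeated row labels visible in the large output above.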

Hope it helps! If not, would you mind clarifying a little more what you're looking for?

2 Comments

That worked, but I have to read the CSVs from where I had output them, rather than from where they had been created in my notebook? Strange, but glad it worked! Thank you!
No, you absolutely do not have to do that :). I just wanted to show you that I was actually using the CSVs you linked. You should be able to do the same thing in your notebook. It's possible you'll have to run reset_index() on each dataframe.
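
A possible cleanup step before concatenating in the notebook might look like the sketch below. It assumes the InvalidIndexError comes from repeated labels, which is a common cause of that message but has not been confirmed for the asker's data. The helper name prepare and the toy frames are hypothetical.

```python
import pandas as pd

# Hypothetical frames: df1 has a repeated column name and a non-default row index.
df1 = pd.DataFrame([["mobile", 1, 1]],
                   columns=["title", "python", "python"],
                   index=[253])
df2 = pd.DataFrame({"title": ["fullstack"], "sql": [0]})

def prepare(df):
    """Drop repeated column labels (keeping the first) and renumber rows from 0."""
    deduped = df.loc[:, ~df.columns.duplicated()]
    return deduped.reset_index(drop=True)

# List the column labels that occur more than once.
dupes = df1.columns[df1.columns.duplicated()].tolist()
print(dupes)  # ['python']

combined = pd.concat([prepare(df1), prepare(df2)], ignore_index=True)
print(combined.shape)  # (2, 3)
```

Keeping only the first of each repeated column is a design choice for the sketch; if the duplicates hold different data, renaming them would be safer than dropping them.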

Use pd.concat() with axis=1 and ignore_index=True:

Assuming you have already read the CSV files into dataframes df1, df2, df3:

df_out = pd.concat([df1, df2, df3], axis=1, ignore_index=True) 

Edit

I might have overlooked that you want to stack the rows while merging the like columns. In that case, just use the default axis=0, which appends rows and aligns columns by name:

df_out = pd.concat([df1, df2, df3], ignore_index=True) 

Keep ignore_index=True so that the row index is renumbered from 0.
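
The difference between the two calls can be checked on small made-up frames (a and b below are assumptions for illustration only). With axis=1, ignore_index=True replaces the column names with integers, so like columns are not merged; with the default axis=0 it only renumbers the rows.

```python
import pandas as pd

# Hypothetical one-row frames sharing the "title" column.
a = pd.DataFrame({"title": ["mobile"], "python": [1]})
b = pd.DataFrame({"title": ["fullstack"], "sql": [0]})

# axis=0 (default): append rows, align columns by name, renumber rows.
rows = pd.concat([a, b], ignore_index=True)

# axis=1: place frames side by side; ignore_index discards the column names.
side_by_side = pd.concat([a, b], axis=1, ignore_index=True)

print(rows.shape)                  # (2, 3)
print(list(rows.index))            # [0, 1]
print(side_by_side.shape)          # (1, 4)
print(list(side_by_side.columns))  # [0, 1, 2, 3]
```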
