33

I am trying to concat dataframes from the following two csv files:

df_a: https://www.dropbox.com/s/slcu7o7yyottujl/df_current.csv?dl=0
df_b: https://www.dropbox.com/s/laveuldraurdpu1/df_climatology.csv?dl=0

Both of these have the same number and names of columns. However, when I do this:

pandas.concat([df_a, df_b]) 

I get the error:

AssertionError: Number of manager items must equal union of block items # manager items: 20, # tot_items: 21 

How to fix this?

5 Comments
  • Just tried with your data and pandas==0.17.1 and concat works fine. Commented Feb 1, 2016 at 18:54
  • Hmm, not sure what is happening. I still get the error, and I am using pandas==0.17.1 as well. Commented Feb 1, 2016 at 18:59
  • I'm using pandas 0.17.1, Python 2.7.11 on Ubuntu 14.04, and for me it is working fine also. Commented Feb 1, 2016 at 19:13
  • I checked the column names with print df_a.columns == df_b.columns and the output is: [ True True True True True True True True True True True True True True False False True False True False False] Commented Feb 1, 2016 at 19:17
  • thanks @jezrael, the column names are not in the same order, but they are all present. Commented Feb 1, 2016 at 19:21

4 Answers

45

I believe that this error occurs if the following two conditions are met:

  1. The data frames have different columns (i.e. (df1.columns == df2.columns) is not all True).
  2. The columns contain a repeated name.

Basically, if you concat dataframes with columns [A,B,C] and [B,C,D], pandas can work out one series for each distinct column name. But if I then try to join a third dataframe with columns [B,B,C], it does not know which column to append to and ends up with fewer distinct columns than it thinks it needs.

If your dataframes are such that df1.columns == df2.columns, then it will work anyway. So you can join [B,B,C] to [B,B,C], but not to [C,B,B]; when the columns are identical, it probably just matches them by integer position.
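The two conditions above can be checked before calling concat. The following is only a sketch: the helper name diagnose_columns and the sample frames are made up for illustration, and it relies on the standard Index.equals and Index.duplicated methods.

```python
import pandas as pd

def diagnose_columns(df1, df2):
    """Report the two conditions described above (hypothetical helper)."""
    return {
        # True only if both frames have the same labels in the same order
        "same_columns_same_order": df1.columns.equals(df2.columns),
        # Labels that occur more than once within each frame
        "duplicates_in_first": df1.columns[df1.columns.duplicated()].unique().tolist(),
        "duplicates_in_second": df2.columns[df2.columns.duplicated()].unique().tolist(),
    }

# Illustrative frames mirroring the [B,B,C] vs [C,B,B] case
left = pd.DataFrame([[1, 2, 3]], columns=["B", "B", "C"])
right = pd.DataFrame([[4, 5, 6]], columns=["C", "B", "B"])
report = diagnose_columns(left, right)
print(report)
```

If the first entry is False and either duplicate list is non-empty, the frames match the failing case described in this answer.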


3 Comments

I was having a problem in the spatial extension geopandas where the .overlay() operation was failing due to an error very similar to the original post. It seems that if you have the same column name in both geodataframes, it will enumerate them in the output dataframe ONLY ONCE. On the third overlay operation, it will throw this error. So if you are making a chain-overlay, make sure the column names are different for each geodataframe in the chain.
Thanks! FYI, to find duplicate columns: duplicates = df.columns.duplicated(keep=False); [x[0] for x in zip(df.columns, duplicates) if x[1]]
Repeated Columns! Of course, thanks a lot for the clear answer !
9

The answers here did not solve my issue, but this answer did.

The issue was duplicated columns in one or both DataFrames.

Here's a fix for duplicated columns (as per the answer above):

df = df.loc[:,~df.columns.duplicated()] 
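As a sketch of how that one-liner might be applied to both frames before concatenating: the helper name drop_duplicate_columns and the sample data are assumptions for illustration. Note that this keeps only the first column for each repeated label, so the data in the later duplicates is discarded.

```python
import pandas as pd

def drop_duplicate_columns(df):
    """Keep only the first column for each repeated label (hypothetical helper)."""
    return df.loc[:, ~df.columns.duplicated()]

# Illustrative frames, each with a repeated 'id' label
df1 = pd.DataFrame([["a", "b", "id", 1], ["a", "b", "id", 2]],
                   columns=["A", "B", "id", "id"])
df2 = pd.DataFrame([["b", "c", "id", 1], ["b", "c", "id", 2]],
                   columns=["B", "C", "id", "id"])

# Deduplicate every frame first, then concatenate as usual
cleaned = [drop_duplicate_columns(df) for df in (df1, df2)]
result = pd.concat(cleaned, ignore_index=True)
print(result)
```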


6

You can get around this issue with a 'manual' concatenation, in this case your

list_of_dfs = [df_a, df_b] 

And instead of running

giant_concat_df = pd.concat(list_of_dfs, axis=0)

You can turn all of the dataframes into lists of dictionaries and then make a new data frame from these lists (merged with chain)

from itertools import chain
list_of_dicts = [cur_df.T.to_dict().values() for cur_df in list_of_dfs]
giant_concat_df = pd.DataFrame(list(chain(*list_of_dicts)))
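A variant of the same idea, sketched with to_dict("records") instead of transposing. The function name concat_via_records and the sample frames are illustrative only. It assumes each frame has unique column labels (a dictionary cannot hold two values under one key), and the original row index is discarded.

```python
from itertools import chain

import pandas as pd

def concat_via_records(frames):
    """Rebuild one frame from the row dictionaries of several frames (hypothetical helper)."""
    # Each frame becomes a list of {column: value} dicts, one per row
    records = chain.from_iterable(df.to_dict("records") for df in frames)
    return pd.DataFrame(list(records))

# Illustrative frames with partially overlapping columns
df_x = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df_y = pd.DataFrame({"B": [5, 6], "C": [7, 8]})
combined = concat_via_records([df_x, df_y])
print(combined)
```

Columns missing from a given frame are filled with NaN in the rows that came from it.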

1 Comment

Please be aware that this solution may take significantly longer to complete and will also consume a significant amount of memory on large data frames.
2

Unfortunately, the source files are no longer available, so I can't check my solution against your case. In my case the error occurred when:

  1. The data frames have two columns with the same name (I had ID and id columns, which I then converted to lower case, so they became the same)
  2. The value types of the same-named columns are different

Here is an example which gives me the error in question:

df1 = pd.DataFrame(
    data=[['a', 'b', 'id', 1], ['a', 'b', 'id', 2]],
    columns=['A', 'B', 'id', 'id'])
df2 = pd.DataFrame(
    data=[['b', 'c', 'id', 1], ['b', 'c', 'id', 2]],
    columns=['B', 'C', 'id', 'id'])
pd.concat([df1, df2])
>>> AssertionError: Number of manager items must equal union of block items # manager items: 4, # tot_items: 5

Removing or renaming one of the columns makes this code work.
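If both columns need to be kept, renaming might look like the sketch below. The helper make_unique and its suffix scheme (_1, _2, ...) are assumptions for illustration, not a pandas API, and it does not guard against a suffixed name colliding with a label that already exists.

```python
import pandas as pd

def make_unique(labels):
    """Append _1, _2, ... to repeated labels (hypothetical helper)."""
    seen = {}
    unique = []
    for label in labels:
        count = seen.get(label, 0)
        # First occurrence keeps its name; later ones get a numeric suffix
        unique.append(label if count == 0 else f"{label}_{count}")
        seen[label] = count + 1
    return unique

df1 = pd.DataFrame([["a", "b", "id", 1], ["a", "b", "id", 2]],
                   columns=["A", "B", "id", "id"])
df2 = pd.DataFrame([["b", "c", "id", 1], ["b", "c", "id", 2]],
                   columns=["B", "C", "id", "id"])

# Rename in place so every label is distinct, then concatenate
for df in (df1, df2):
    df.columns = make_unique(df.columns)

result = pd.concat([df1, df2], ignore_index=True)
print(result)
```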

