pd.concat() crashes if dataframe contains duplicate indices but not df.join()

I just found out that when two DataFrames are concatenated horizontally and one of them has duplicate index values, pd.concat() raises a ValueError, whereas df.join() does not. Instead, df.join() spreads the values across all rows that share the same index value. Is this behavior by design? Thanks!

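    import numpy as np
    import pandas as pd

    # Each frame has one duplicated index label (3 in df1, 2 in df2).
    df1 = pd.DataFrame(np.random.randn(5), index=[0,1,2,3,3], columns=['a'])
    df2 = pd.DataFrame(np.random.randn(5), index=[0,1,2,2,4], columns=['b'])

    # join() succeeds; display() is the Jupyter/IPython built-in.
    dfj = df1.join(df2, how='outer')
    display(df1, df2, dfj)

    # concat() along axis=1 raises the ValueError shown in the traceback.
    dfc = pd.concat([df1, df2], axis=1)
    display(dfc)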

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-40-150ce3e802ea> in <module>
          3 dfj = df1.join(df2, how='outer')
          4 display(df1, df2, dfj)
    ----> 5 dfc = pd.concat([df1, df2], axis=1)
          6 display(dfc)

    ~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
        282     )
        283
    --> 284     return op.get_result()
        285
        286

    ~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py in get_result(self)
        495
        496             new_data = concatenate_block_managers(
    --> 497                 mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy
        498             )
        499             if not self.copy:

    ~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
       2025         blocks.append(b)
       2026
    -> 2027     return BlockManager(blocks, axes)

    ~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in __init__(self, blocks, axes, do_integrity_check)
        137
        138         if do_integrity_check:
    --> 139             self._verify_integrity()
        140
        141         self._consolidate_check()

    ~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in _verify_integrity(self)
        332         for block in self.blocks:
        333             if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
    --> 334                 construction_error(tot_items, block.shape[1:], self.axes)
        335         if len(self.items) != tot_items:
        336             raise AssertionError(

    ~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in construction_error(tot_items, block_shape, axes, e)
       1692     if block_shape[0] == 0:
       1693         raise ValueError("Empty data passed with indices specified.")
    -> 1694     raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
       1695
       1696

    ValueError: Shape of passed values is (9, 2), indices imply (7, 2)
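For anyone reproducing this, here is a minimal sketch of the join() side of the comparison with fixed values instead of random ones, so the row counts are easy to check. It also shows one way to get an explicit error for duplicated labels today, using the `validate` argument of `pd.merge`. The variable `merge_error` is only an illustrative name, and the concat() call is left out because the exception it raises may differ between pandas versions.

```python
import pandas as pd

# Fixed values (instead of np.random.randn) so the result is reproducible.
df1 = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=[0, 1, 2, 3, 3])
df2 = pd.DataFrame({"b": [10.0, 20.0, 30.0, 40.0, 50.0]}, index=[0, 1, 2, 2, 4])

# Each index repeats one label, so neither is unique.
print(df1.index.is_unique, df2.index.is_unique)

# join() matches rows by label: every left row is paired with every
# right row carrying the same label, and unmatched labels get NaN.
dfj = df1.join(df2, how="outer")
print(dfj)

# merge() can be asked to reject duplicated keys instead of expanding them.
try:
    pd.merge(df1, df2, left_index=True, right_index=True,
             how="outer", validate="one_to_one")
    merge_error = None
except pd.errors.MergeError as exc:
    merge_error = exc
print(merge_error)
```

With these inputs the joined frame has 7 rows: label 2 appears twice because df2 has two rows for it, and label 3 appears twice because df1 does.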

Ideally, when the DataFrames have duplicate indices, pd.concat() would behave like df.join(); at the very least, it should NOT crash.
I suggest introducing an additional argument to handle duplicate indices. For example, if the same index value has X (> 0) rows in df1 and Y (> 0) rows in df2, a dup_index argument could take one of the following values:

combinatorial: after merging, there are X*Y rows, one for every combination of a df1 row with a df2 row.
outer-top-align: after merging, there are max(X, Y) rows, with the rows aligned from the top.
outer-bottom-align: after merging, there are max(X, Y) rows, with the rows aligned from the bottom.
inner-top-align: after merging, there are min(X, Y) rows, with the rows aligned from the top.
inner-bottom-align: after merging, there are min(X, Y) rows, with the rows aligned from the bottom.
raise: raise an exception with an explanatory message.
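To make the top-align options concrete, here is a rough sketch of how they could be emulated with existing pandas calls. The helper name `concat_top_align` and the temporary level name `_occ` are hypothetical and not part of pandas. The idea is to number the repeats of each label with `groupby(level=0).cumcount()`, which makes the (label, occurrence) pairs unique, then concatenate and drop the counter again. Note that this sketch sorts the result by label, which the proposal itself does not require.

```python
import pandas as pd


def concat_top_align(left, right, how="outer"):
    """Hypothetical helper: align duplicate labels by order of occurrence.

    how="outer" gives max(X, Y) rows per label, how="inner" gives min(X, Y).
    """
    def with_occurrence(df):
        # 0, 1, 2, ... for each repeat of the same index label.
        occ = df.groupby(level=0).cumcount().to_numpy()
        out = df.copy()
        out.index = pd.MultiIndex.from_arrays(
            [df.index, occ], names=[df.index.name, "_occ"]
        )
        return out

    # (label, occurrence) pairs are unique, so concat can align them.
    combined = pd.concat(
        [with_occurrence(left), with_occurrence(right)], axis=1, join=how
    )
    # Sort for a deterministic order, then drop the temporary counter level.
    return combined.sort_index().droplevel(-1)


df1 = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=[0, 1, 2, 3, 3])
df2 = pd.DataFrame({"b": [10.0, 20.0, 30.0, 40.0, 50.0]}, index=[0, 1, 2, 2, 4])

outer = concat_top_align(df1, df2, how="outer")
inner = concat_top_align(df1, df2, how="inner")
print(outer)
print(inner)
```

The combinatorial option corresponds to what df.join() already does, and the bottom-align variants could presumably be built the same way by counting occurrences from the end of each group (`cumcount(ascending=False)`).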

pd.concat() crashes if dataframe contains duplicate indices but not df.join() #36263
