
So, by using

df_ab = pd.concat([df_a, df_b], axis=1, join='inner') 

I get a DataFrame that looks like this:

   A  A   B   B
0  5  5  10  10
1  6  6  19  19

and I want to remove the duplicated columns:

   A   B
0  5  10
1  6  19

Because df_a and df_b are subsets of the same DataFrame, I know that columns with the same name hold identical values in every row. I have a working solution:

df_ab = df_ab.T.drop_duplicates().T 

but I have a large number of rows, so this is very slow. Does anyone have a faster solution? I would prefer a solution that doesn't require explicit knowledge of the column names.


4 Answers


The easiest way is:

df = df.loc[:,~df.columns.duplicated()] 

One line of code can change everything
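For anyone who wants to see it in action, here is a minimal sketch; the frame is rebuilt from the question's made-up data:

import pandas as pd

# Rebuild a frame like the one in the question (duplicate column names,
# identical values under each repeated name).
df = pd.DataFrame([[5, 5, 10, 10], [6, 6, 19, 19]], columns=['A', 'A', 'B', 'B'])

# columns.duplicated() marks every repeat of a column name after its first
# occurrence; ~ inverts the mask, so only the first of each name survives.
df = df.loc[:, ~df.columns.duplicated()]
print(df)
#    A   B
# 0  5  10
# 1  6  19

Note that this dedupes by column name only; if two columns share a name but hold different values, the later one is silently dropped.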


4 Comments

This should be the accepted answer, as it doesn't require ALL of the duplicated columns to hold exactly the same values.
This fails for a large number of columns. I get this error: MemoryError: Unable to allocate 480. GiB for an array with shape (87494, 736334) and data type object. My dataframe's shape is (736334, 1312).
If I were you, I would not read all the data at once; read it in chunks. E.g., split the columns into N groups and operate on the smaller chunks, or read a small random sample of rows, say shape (736334, 5), remove the duplicate columns there, get the remaining columns as a list, and then reread your data keeping only those columns (a rough sketch follows after these comments). Also look at pandas-like libraries that handle large data, such as Modin, Dask, Ray, or Blaze, and check out pandas.pydata.org/pandas-docs/stable/user_guide/scale.html
Plus, if you have GPUs, see cuDF.
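To make the chunked-reading idea above concrete, here is a rough, hedged sketch; 'big.csv' and the 1000-row sample size are assumptions for illustration, not part of the original comment:

import pandas as pd

# Sample a few rows, find value-duplicate columns there, then reread only
# the surviving columns. Columns that merely coincide within the sample
# would be dropped too, so pick a sample large enough for your data.
sample = pd.read_csv('big.csv', nrows=1000)
keep_pos = [i for i, dup in enumerate(sample.T.duplicated()) if not dup]

# usecols also accepts column positions, which sidesteps pandas' renaming
# of repeated CSV headers (A, A.1, ...).
df = pd.read_csv('big.csv', usecols=keep_pos)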

Perhaps you would be better off avoiding the problem altogether by using pd.merge instead of pd.concat:

df_ab = pd.merge(df_a, df_b, how='inner') 

This will merge df_a and df_b on all of the columns they share.
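A minimal sketch with made-up subsets, just to show the behaviour (the data is invented for illustration):

import pandas as pd

df_a = pd.DataFrame({'A': [5, 6], 'B': [10, 19]})
df_b = pd.DataFrame({'A': [5, 6], 'B': [10, 19]})

# With no `on=` given, merge joins on every column the two frames share,
# so the duplicated A/B pairs collapse into single columns.
df_ab = pd.merge(df_a, df_b, how='inner')
print(df_ab)
#    A   B
# 0  5  10
# 1  6  19

One caveat, echoed in the comments below: an inner merge emits every matching pair of rows, so if the shared rows are not unique you can get duplicate rows that concat would not produce.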

3 Comments

Yes, that's actually better than my concat :D thanks.
Although concat can take more than two at a time.
Merging generated duplicate rows for me and I couldn't figure out why; concatenation doesn't do that.

You may use np.unique to get the indices of the unique columns, and then use .iloc:

>>> df
   A  A   B   B
0  5  5  10  10
1  6  6  19  19
>>> _, i = np.unique(df.columns, return_index=True)
>>> df.iloc[:, i]
   A   B
0  5  10
1  6  19
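One caveat worth adding: np.unique returns indices in the sorted order of the column names, which can reorder the frame. A small sketch of the fix, assuming you want to preserve the original column order:

import numpy as np

# np.unique sorts its output, so sort the first-occurrence indices back
# into positional order before slicing to keep the original column layout.
_, i = np.unique(df.columns, return_index=True)
df = df.iloc[:, np.sort(i)]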

1 Comment

I don't know how pandas compares on speed, but the docs claim the built-in unique method is much faster: Index.unique() pandas.pydata.org/pandas-docs/version/0.17/generated/…

For those who skip the question and look straight at the answers: the simplest way for me is the OP's own solution (assuming you don't run into the same performance issues they did): transpose the DataFrame, use drop_duplicates, and then transpose it again:

df.T.drop_duplicates().T 
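If the double transpose is too slow or memory-hungry (transposing a mixed-dtype frame upcasts everything to object), a possible middle ground, my own sketch rather than part of this answer, is to build the duplicate mask from the transpose but slice the original frame:

# Compute the value-duplicate mask via one transpose, but slice the
# ORIGINAL frame so the result keeps its dtypes.
mask = df.T.duplicated().to_numpy()  # True for columns repeating an earlier column's values
df = df.loc[:, ~mask]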

3 Comments

This worked for me, but was very slow. Answer from @Prayson W. Daniel was a fraction of the speed.
That answer only works if the column names are identical. If you have identical column values under different names, you'd want the transpose solution.
This may be slower because it creates a new object instead of operating on a view. With the solution of @Prayson W. Daniel, I kept getting the SettingWithCopyWarning.
