
So, by using

df_ab = pd.concat([df_a, df_b], axis=1, join='inner') 

I get a DataFrame that looks like this:

   A  A   B   B
0  5  5  10  10
1  6  6  19  19

and I want to remove the duplicated columns:

   A   B
0  5  10
1  6  19

Because df_a and df_b are subsets of the same DataFrame, I know that columns with the same name hold identical values in every row. I have a working solution:

df_ab = df_ab.T.drop_duplicates().T 

but I have a large number of rows, so this is very slow. Does anyone have a faster solution? I would prefer a solution that doesn't require explicit knowledge of the column names.


4 Answers


The easiest way is:

df = df.loc[:,~df.columns.duplicated()] 

One line of code can change everything
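For anyone who wants to see it in action, here is a minimal sketch; the frame is rebuilt from the question's made-up data:

import pandas as pd

# Rebuild a frame like the one in the question (duplicate column names,
# identical values under each repeated name).
df = pd.DataFrame([[5, 5, 10, 10], [6, 6, 19, 19]], columns=['A', 'A', 'B', 'B'])

# columns.duplicated() marks every repeat of a column name after its first
# occurrence; ~ inverts the mask, so only the first of each name survives.
df = df.loc[:, ~df.columns.duplicated()]
print(df)
#    A   B
# 0  5  10
# 1  6  19

Note that this dedupes by column name only; if two columns share a name but hold different values, the later one is silently dropped.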


4 Comments

This should be the accepted answer, as it doesn't require ALL of the duplicated columns to hold exactly the same values.
This fails for a large number of columns. I get this error: MemoryError: Unable to allocate 480. GiB for an array with shape (87494, 736334) and data type object. My dataframe's shape is (736334, 1312).
If I were you, I would not read all the data at once; read it in chunks. E.g., split the columns into N groups and operate on the smaller chunks, or read a small random sample of rows, say shape (736334, 5), remove the duplicate columns there, get the remaining columns as a list, and then reread your data keeping only those columns (a rough sketch follows after these comments). Also look at pandas-like libraries that handle large data, such as Modin, Dask, Ray, or Blaze, and check out pandas.pydata.org/pandas-docs/stable/user_guide/scale.html
Plus, if you have GPUs, see cuDF.
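To make the chunked-reading idea above concrete, here is a rough, hedged sketch; 'big.csv' and the 1000-row sample size are assumptions for illustration, not part of the original comment:

import pandas as pd

# Sample a few rows, find value-duplicate columns there, then reread only
# the surviving columns. Columns that merely coincide within the sample
# would be dropped too, so pick a sample large enough for your data.
sample = pd.read_csv('big.csv', nrows=1000)
keep_pos = [i for i, dup in enumerate(sample.T.duplicated()) if not dup]

# usecols also accepts column positions, which sidesteps pandas' renaming
# of repeated CSV headers (A, A.1, ...).
df = pd.read_csv('big.csv', usecols=keep_pos)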

Perhaps you would be better off avoiding the problem altogether by using pd.merge instead of pd.concat:

df_ab = pd.merge(df_a, df_b, how='inner') 

This will merge df_a and df_b on all of the columns they share.
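A minimal sketch with made-up subsets, just to show the behaviour (the data is invented for illustration):

import pandas as pd

df_a = pd.DataFrame({'A': [5, 6], 'B': [10, 19]})
df_b = pd.DataFrame({'A': [5, 6], 'B': [10, 19]})

# With no `on=` given, merge joins on every column the two frames share,
# so the duplicated A/B pairs collapse into single columns.
df_ab = pd.merge(df_a, df_b, how='inner')
print(df_ab)
#    A   B
# 0  5  10
# 1  6  19

One caveat, echoed in the comments below: an inner merge emits every matching pair of rows, so if the shared rows are not unique you can get duplicate rows that concat would not produce.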

3 Comments

Yes, that's actually better than my concat :D thanks.
Although concat can take more than two at a time.
Merging generated duplicate rows for me and I couldn't figure out why; concatenation doesn't do that.

You may use np.unique to get the indices of the unique columns, and then use .iloc:

>>> df
   A  A   B   B
0  5  5  10  10
1  6  6  19  19
>>> _, i = np.unique(df.columns, return_index=True)
>>> df.iloc[:, i]
   A   B
0  5  10
1  6  19
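One caveat worth adding: np.unique returns indices in the sorted order of the column names, which can reorder the frame. A small sketch of the fix, assuming you want to preserve the original column order:

import numpy as np

# np.unique sorts its output, so sort the first-occurrence indices back
# into positional order before slicing to keep the original column layout.
_, i = np.unique(df.columns, return_index=True)
df = df.iloc[:, np.sort(i)]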

1 Comment

I don't know how pandas compares on speed, but the docs claim the built-in unique method is much faster: Index.unique() pandas.pydata.org/pandas-docs/version/0.17/generated/…

For those who skip the question and look straight at the answers: the simplest way for me is the OP's own solution (assuming you don't run into the same performance issues they did): transpose the DataFrame, use drop_duplicates, and then transpose it again:

df.T.drop_duplicates().T 
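If the double transpose is too slow or memory-hungry (transposing a mixed-dtype frame upcasts everything to object), a possible middle ground, my own sketch rather than part of this answer, is to build the duplicate mask from the transpose but slice the original frame:

# Compute the value-duplicate mask via one transpose, but slice the
# ORIGINAL frame so the result keeps its dtypes.
mask = df.T.duplicated().to_numpy()  # True for columns repeating an earlier column's values
df = df.loc[:, ~mask]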

3 Comments

This worked for me, but was very slow. Answer from @Prayson W. Daniel was a fraction of the speed.
That answer only works if the column names are identical. If you have identical column values under different names, you'd want the transpose solution.
This may be slower because it creates a new object instead of operating on a view. With the solution of @Prayson W. Daniel, I kept getting the SettingWithCopyWarning.
