333

What is the easiest way to remove duplicate columns from a dataframe?

I am reading a text file that has duplicate columns via:

import pandas as pd
df = pd.read_table(fname)

The column names are:

Time, Time Relative, N2, Time, Time Relative, H2, etc... 

All the Time and Time Relative columns contain the same data. I want:

Time, Time Relative, N2, H2 

All my attempts at dropping, deleting, etc., such as:

df=df.T.drop_duplicates().T 

Result in uniquely valued index errors:

Reindexing only valid with uniquely valued index objects 

Sorry for being a Pandas noob. Any suggestions would be appreciated.


Additional Details

Pandas version: 0.9.0
Python Version: 2.7.3
Windows 7
(installed via Pythonxy 2.7.3.0)

data file (note: in the real file, columns are separated by tabs, here they are separated by 4 spaces):

Time    Time Relative [s]    N2[%]    Time    Time Relative [s]    H2[ppm]
2/12/2013 9:20:55 AM    6.177    9.99268e+001    2/12/2013 9:20:55 AM    6.177    3.216293e-005
2/12/2013 9:21:06 AM    17.689    9.99296e+001    2/12/2013 9:21:06 AM    17.689    3.841667e-005
2/12/2013 9:21:18 AM    29.186    9.992954e+001    2/12/2013 9:21:18 AM    29.186    3.880365e-005
... etc ...
2/12/2013 2:12:44 PM    17515.269    9.991756+001    2/12/2013 2:12:44 PM    17515.269    2.800279e-005
2/12/2013 2:12:55 PM    17526.769    9.991754e+001    2/12/2013 2:12:55 PM    17526.769    2.880386e-005
2/12/2013 2:13:07 PM    17538.273    9.991797e+001    2/12/2013 2:13:07 PM    17538.273    3.131447e-005

18 Answers

842

Here's a one-line solution to remove columns based on duplicate column names:

df = df.loc[:,~df.columns.duplicated()].copy() 

How it works:

Suppose the columns of the data frame are ['alpha','beta','alpha']

df.columns.duplicated() returns a boolean array: a True or False for each column. If it is False, the column name is unique up to that point; if it is True, the column name duplicates one that appeared earlier. For the example above, the returned value would be [False, False, True].

Pandas allows one to index using boolean values, whereby it selects only the True values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (i.e. [True, True, False] = ~[False, False, True]).

Finally, df.loc[:,[True,True,False]] selects only the non-duplicated columns using the aforementioned indexing capability.

The final .copy() copies the dataframe, (mostly) to avoid errors about trying to modify an existing dataframe further down the line.
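For illustration, here is a minimal, self-contained sketch of the walk-through above (the values are made up; only the column names come from the example):

import pandas as pd

# three columns, two of which share the name 'alpha'
df = pd.DataFrame([[1, 2, 3]], columns=['alpha', 'beta', 'alpha'])

print(df.columns.duplicated())     # [False False  True]
print(~df.columns.duplicated())    # [ True  True False]

df = df.loc[:, ~df.columns.duplicated()].copy()
print(df.columns.tolist())         # ['alpha', 'beta']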

Note: the above only checks columns names, not column values.

To remove duplicated indexes

Since it is similar enough, do the same thing on the index:

df = df.loc[~df.index.duplicated(),:].copy() 

To remove duplicates by checking values without transposing

Update and caveat: please be careful in applying this. Per the counter-example provided by DrWhat in the comments, this solution may not have the desired outcome in all cases.

df = df.loc[:,~df.apply(lambda x: x.duplicated(),axis=1).all()].copy() 

This avoids the issue of transposing. Is it fast? No. Does it work? In some cases. Here, try it on this:

# create a large(ish) dataframe
ldf = pd.DataFrame(np.random.randint(0, 100, size=(736334, 1312)))

# to see size in gigs
# ldf.memory_usage().sum()/1e9   # it's about 3 gigs

# duplicate a column
ldf.loc[:, 'dup'] = ldf.loc[:, 101]

# take out duplicated columns by values
ldf = ldf.loc[:, ~ldf.apply(lambda x: x.duplicated(), axis=1).all()].copy()
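To see the caveat in action, here is a minimal sketch built from the counter-example DrWhat describes in the comments below: col3 duplicates neither col1 nor col2, yet it is dropped because each of its values repeats an earlier value within its own row.

import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 0],
                   'col2': [1, 0, 1],
                   'col3': [1, 1, 0]})

out = df.loc[:, ~df.apply(lambda x: x.duplicated(), axis=1).all()].copy()
print(out.columns.tolist())   # ['col1', 'col2'] -- col3 was removed even though it is unique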

9 Comments

An ideal answer would also work for duplicated values, not just names.
@GrimSqueaker: If you want to consider whether the values are duplicated, you want something like df.T.drop_duplicates().T.
@JohnZwinck: this only works for small dataframes, as there is a limit to the number of columns you can have. For me it failed for a dataframe with 100,000 rows for instance, as this yields 100,000 columns after transposing, which is not possible
NOTE: the 'remove duplicates by checking values without transposing' is cool, but it does not check each column against another single column, but against all other columns. I think this example is correct: if col1 is [0,1,0], col2 [1,0,1] and col3 [1,1,0], then col3 will be deleted.
@drwhat that's a good point and may arise in practice, especially when data are discrete or binary. I'll add the caveat for now until a better alternative can be devised
50

It sounds like you already know the unique column names. If that's the case, then df = df['Time', 'Time Relative', 'N2'] would work.

If not, your solution should work:

In [101]: vals = np.random.randint(0,20, (4,3))
          vals
Out[101]:
array([[ 3, 13,  0],
       [ 1, 15, 14],
       [14, 19, 14],
       [19,  5,  1]])

In [106]: df = pd.DataFrame(np.hstack([vals, vals]),
                            columns=['Time', 'H1', 'N2', 'Time Relative', 'N2', 'Time'])
          df
Out[106]:
   Time  H1  N2  Time Relative  N2  Time
0     3  13   0              3  13     0
1     1  15  14              1  15    14
2    14  19  14             14  19    14
3    19   5   1             19   5     1

In [107]: df.T.drop_duplicates().T
Out[107]:
   Time  H1  N2
0     3  13   0
1     1  15  14
2    14  19  14
3    19   5   1

You probably have something specific to your data that's messing it up. We could give more help if there's more details you could give us about the data.

Edit: Like Andy said, the problem is probably with the duplicate column titles.

For a sample table file 'dummy.csv' I made up:

Time  H1  N2  Time  N2  Time Relative
3     13  13  3     13  0
1     15  15  1     15  14
14    19  19  14    19  14
19    5   5   19    5   1

using read_table gives unique columns and works properly:

In [151]: df2 = pd.read_table('dummy.csv')
          df2
Out[151]:
   Time  H1  N2  Time.1  N2.1  Time Relative
0     3  13  13       3    13              0
1     1  15  15       1    15             14
2    14  19  19      14    19             14
3    19   5   5      19     5              1

In [152]: df2.T.drop_duplicates().T
Out[152]:
   Time  H1  Time Relative
0     3  13              0
1     1  15             14
2    14  19             14
3    19   5              1

If your version doesn't let you, you can hack together a solution to make them unique:

In [169]: df2 = pd.read_table('dummy.csv', header=None)
          df2
Out[169]:
      0   1   2     3   4              5
0  Time  H1  N2  Time  N2  Time Relative
1     3  13  13     3  13              0
2     1  15  15     1  15             14
3    14  19  19    14  19             14
4    19   5   5    19   5              1

In [171]: from collections import defaultdict
          col_counts = defaultdict(int)
          col_ix = df2.first_valid_index()

In [172]: cols = []
          for col in df2.ix[col_ix]:
              cnt = col_counts[col]
              col_counts[col] += 1
              suf = '_' + str(cnt) if cnt else ''
              cols.append(col + suf)
          cols
Out[172]: ['Time', 'H1', 'N2', 'Time_1', 'N2_1', 'Time Relative']

In [174]: df2.columns = cols
          df2 = df2.drop([col_ix])

In [177]: df2
Out[177]:
   Time  H1  N2  Time_1  N2_1  Time Relative
1     3  13  13       3    13              0
2     1  15  15       1    15             14
3    14  19  19      14    19             14
4    19   5   5      19     5              1

In [178]: df2.T.drop_duplicates().T
Out[178]:
   Time  H1  Time Relative
1     3  13              0
2     1  15             14
3    14  19             14
4    19   5              1

5 Comments

Unfortunately df['Time'] selects all Time series (i.e. returns a DataFrame), and df['Time', ..] will return the entire DataFrame.
Yeah, it's pretty tedious...hopefully it's just a version difference.
Using double transposes could have unintended side effects like converting numeric types to objects in the case that you have a df with mixed types. See: stackoverflow.com/questions/24682396/…
This solution gives me problems on large dataframes: RecursionError: maximum recursion depth exceeded
Transposing a large data frame will be a slow process
19

Transposing is inefficient for large DataFrames. Here is an alternative:

def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        dcols = frame[v].to_dict(orient="list")

        vs = list(dcols.values())   # wrap in list() so indexing also works on Python 3
        ks = list(dcols.keys())
        lvs = len(vs)

        for i in range(lvs):
            for j in range(i+1, lvs):
                if vs[i] == vs[j]:
                    dups.append(ks[i])
                    break

    return dups

Use it like this:

dups = duplicate_columns(frame)
frame = frame.drop(dups, axis=1)

Edit

A memory-efficient version that treats NaNs like any other value:

from pandas.core.common import array_equivalent

def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []

    for t, v in groups.items():

        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)

        for i in range(lcs):
            ia = vs.iloc[:, i].values
            for j in range(i+1, lcs):
                ja = vs.iloc[:, j].values
                if array_equivalent(ia, ja):
                    dups.append(cs[i])
                    break

    return dups
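Note that array_equivalent was later deprecated (see the comment below). As a possible update, here is an untested sketch that swaps in Series.equals, which likewise treats NaNs in the same positions as equal:

def duplicate_columns(frame):
    # group columns by dtype so only comparable columns are checked against each other
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)
        for i in range(lcs):
            for j in range(i + 1, lcs):
                # Series.equals considers NaNs in the same location to be equal
                if vs.iloc[:, i].equals(vs.iloc[:, j]):
                    dups.append(cs[i])
                    break
    return dups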

1 Comment

array_equivalent is deprecated. Would love an update for this code.
15

If I'm not mistaken, the following does what was asked without the memory problems of the transpose solution and with fewer lines than @kalu's function, keeping the first of any similarly named columns.

Cols = list(df.columns)
for i, item in enumerate(df.columns):
    if item in df.columns[:i]:
        Cols[i] = "toDROP"
df.columns = Cols
df = df.drop("toDROP", 1)

2 Comments

Your solution doesn't work in my case, it shows me: "ValueError: labels ['toDROP'] not contained in axis" after executing the last line
It worked for me, but it came with this warning: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only. df1 = df1.drop("toDROP",1)
6

It looks like you were on the right path. Here is the one-liner you were looking for:

df.reset_index().T.drop_duplicates().T 

But since there is no example data frame that produces the referenced error message Reindexing only valid with uniquely valued index objects, it is tough to say exactly what would solve the problem. If restoring the original index is important to you, do this:

original_index = df.index.names
df.reset_index().T.drop_duplicates().reset_index(original_index).T

Comments

5

Note that Gene Burinsky's answer (the selected answer at the time of writing) keeps the first of each duplicated column. To keep the last:

df=df.loc[:, ~df.columns[::-1].duplicated()[::-1]] 

1 Comment

You don't have to do [::-1]; check out the Pandas documentation on duplicated. Instead, just do df.columns.duplicated(keep='last') for a simpler solution of df = df.loc[:,~df.columns.duplicated(keep='last')]
4

An update on @kalu's answer, which uses the latest pandas:

def find_duplicated_columns(df):
    dupes = []

    columns = df.columns
    for i in range(len(columns)):
        col1 = df.iloc[:, i]
        for j in range(i + 1, len(columns)):
            col2 = df.iloc[:, j]
            # break early if dtypes aren't the same (helps deal with
            # categorical dtypes)
            if col1.dtype is not col2.dtype:
                break
            # otherwise compare values
            if col1.equals(col2):
                dupes.append(columns[i])
                break

    return dupes

2 Comments

This is around 10 times slower than @kalu's answer
Slow is a feature of Python, not a bug. ;-)
4

Although @Gene Burinsky's answer is great, it has a potential problem in that the reassigned df may be either a copy or a view of the original df. This means that subsequent assignments like df['newcol'] = 1 generate a SettingWithCopy warning and may fail (https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing). The following solution prevents that issue:

duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)

1 Comment

In practice, I found using drop is a pain on larger data frames because it is slow. To avoid the warning, just append the command with .copy() like so df.loc[:,~df.columns.duplicated()].copy()
2

I ran into this problem where the one liner provided by the first answer worked well. However, I had the extra complication where the second copy of the column had all of the data. The first copy did not.

The solution was to create two data frames by splitting the original data frame in two, toggling the negation operator. Once I had the two data frames, I ran a join statement using lsuffix. This way, I could then reference and delete the column without the data (see the sketch after this answer).

- E

March 2021 update

The subsequent post by @CircArgs may have provided a succinct one-liner to accomplish what I described here.
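Here is a minimal sketch of the split-and-join approach described above; the column names and the assumption that the second copy of each column holds the data are illustrative only.

import pandas as pd

# toy frame with duplicated column names where only the second copy holds data
df = pd.DataFrame([[None, None, '9:20:55', 6.177],
                   [None, None, '9:21:06', 17.689]],
                  columns=['Time', 'Time Relative', 'Time', 'Time Relative'])

mask = df.columns.duplicated()       # True for the second occurrence of each name
first_copies = df.loc[:, ~mask]      # first copies (empty in this scenario)
second_copies = df.loc[:, mask]      # second copies (hold the real data)

# join the two halves; lsuffix marks the overlapping (empty) columns for removal
joined = first_copies.join(second_copies, lsuffix='_empty')
cleaned = joined.drop(columns=[c for c in joined.columns if c.endswith('_empty')])
print(cleaned)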

Comments

0

First step: read only the first row (i.e. all the column names) and remove the duplicate columns.

Second step: read the file again, keeping only those columns.

cols = pd.read_csv("file.csv", header=None, nrows=1).iloc[0].drop_duplicates()
df = pd.read_csv("file.csv", usecols=cols)

Comments

0

The approach below identifies duplicate column names so you can review what went wrong when the dataframe was originally built.

dupes = pd.DataFrame(df.columns)
dupes[dupes.duplicated()]

Comments

0

Just in case somebody is still looking for an answer on how to find duplicated values across columns of a pandas DataFrame in Python, I came up with this solution:

def get_dup_columns(m):
    '''
    This will check every column in the data frame and verify whether you have
    duplicated columns. It can help whenever you are cleaning big data sets of
    50+ columns and clean up a little bit for you.
    The result will be a list of tuples showing which columns are duplicates,
    for example ('column A', 'column C'). That means that column A is
    duplicated with column C.
    More info go to https://wanatux.com
    '''
    headers_list = [x for x in m.columns]
    duplicate_col2 = []
    y = 0
    while y <= len(headers_list) - 1:
        # compare the first remaining column against every other remaining one
        # (the range runs to len(headers_list) so the last column is checked too)
        for x in range(1, len(headers_list)):
            if m[headers_list[y]].equals(m[headers_list[x]]) == False:
                continue
            else:
                duplicate_col2.append((headers_list[y], headers_list[x]))
        headers_list.pop(0)
    return duplicate_col2

And you can call the function like this:

duplicate_col = get_dup_columns(pd_excel) 

It will show a result like the following:

 [('column a', 'column k'), ('column a', 'column r'), ('column h', 'column m'), ('column k', 'column r')] 

Comments

0

I am not sure why Gene Burinsky's answer did not work for me; I was getting the same original dataframe with duplicated columns. My workaround was to force the selection on the underlying ndarray and rebuild the dataframe.

df = pd.DataFrame(df.values[:, ~df.columns.duplicated()],
                  columns=df.columns[~df.columns.duplicated()])

Comments

0

A simple column-wise comparison is the most efficient way (in terms of memory and time) to check for duplicated columns by value. Here is an example:

import numpy as np
import pandas as pd
from itertools import combinations as combi

df = pd.DataFrame(np.random.uniform(0, 1, (100, 4)), columns=['a', 'b', 'c', 'd'])
df['a'] = df['d'].copy()  # column 'a' is equal to column 'd'

# to keep the first
dupli_cols = [cc[1] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]

# to keep the last
dupli_cols = [cc[0] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]

df = df.drop(columns=dupli_cols)

Comments

0

In case you want to check for duplicate columns, this code can be useful

columns_to_drop = []
for cname in sorted(list(df)):
    for cname2 in sorted(list(df))[::-1]:
        if df[cname].equals(df[cname2]) and cname != cname2 and cname not in columns_to_drop:
            columns_to_drop.append(cname2)
            print(cname, cname2, 'Are equal')

df = df.drop(columns_to_drop, axis=1)

Comments

0

If you're sure that the values in the columns are the same and you only want to filter based on the column name, you can try this:

# generate a list of unique column names
l_h = list(set(df.columns.tolist()))

# determine list length to be used for filtering the data frame
index = len(l_h)

# filter dataframe
df[l_h].iloc[:, -index:]

Comments

0

Maybe out of scope, but if we're allowed to use numpy, here's a simple solution.

import numpy as np
import pandas as pd

colsToKeep = np.unique(df.columns, return_index=True)[1]
df = df.iloc[:, colsToKeep]

Comments

-2

Fast and easy way to drop the duplicated columns by their values:

df = df.T.drop_duplicates().T 

More info: Pandas DataFrame drop_duplicates manual.

Comments
