3

How can I remove duplicates column by column in a pandas DataFrame, so that:

    set1    set2    set3    set4
    apple   apple   orange  orange
    apple   orange  banana  orange
    orange  banana  pear
    banana  banana  lemon
    pear            lemon
    grape           lemon

becomes:

    set1    set2    set3    set4
    apple   apple   orange  orange
    orange  orange  banana
    banana  banana  pear
    pear            lemon
    grape

If you want the unique values from a column, you can always do df['column_name'].unique(). Commented Aug 30, 2019 at 13:00
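The comment's suggestion can be extended to every column at once. The following is a minimal sketch, assuming the sample frame is rebuilt as `df` with NaN in the blank cells (the exact blank positions are an assumption reconstructed from the question's tables). It relies on `Series.unique()` returning values in order of first appearance.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's sample data (blank cells as NaN)
df = pd.DataFrame({
    "set1": ["apple", "apple", "orange", "banana", "pear", "grape"],
    "set2": ["apple", "orange", "banana", "banana", np.nan, np.nan],
    "set3": ["orange", "banana", "pear", "lemon", "lemon", "lemon"],
    "set4": ["orange", "orange", np.nan, np.nan, np.nan, np.nan],
})

# One Series of unique, non-missing values per column
uniques = {col: pd.Series(df[col].dropna().unique()) for col in df.columns}

# Concatenating side by side aligns on the 0..n-1 index and pads short columns
result = pd.concat(uniques, axis=1)
print(result)
```

Shorter columns are padded with missing values because `pd.concat` aligns the Series on their index.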

4 Answers

3

Use:

    m = df.apply(lambda x: dict.fromkeys(x).keys())
    pd.DataFrame(m.values.tolist(), index=m.index).T

Or a better way courtesy @piRSquared:

    pd.DataFrame.from_dict({k: {*df[k].dropna()} for k in df}, orient='index').T

         set1    set2    set3    set4
    0   apple   apple  orange  orange
    1  orange  orange  banana     NaN
    2  banana  banana    pear    None
    3    pear     NaN   lemon    None
    4   grape    None    None    None
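Python sets do not guarantee any particular order, so the set-based one-liner may return each column's values shuffled. If the order of first appearance matters, one possible variant (a sketch, not the answerer's code) combines `dict.fromkeys` from the first snippet with `dropna()` from the second. The sample `df` below is an assumed reconstruction of the question's data.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's sample data (blank cells as NaN)
df = pd.DataFrame({
    "set1": ["apple", "apple", "orange", "banana", "pear", "grape"],
    "set2": ["apple", "orange", "banana", "banana", np.nan, np.nan],
    "set3": ["orange", "banana", "pear", "lemon", "lemon", "lemon"],
    "set4": ["orange", "orange", np.nan, np.nan, np.nan, np.nan],
})

# dict.fromkeys keeps the first occurrence of each value, in insertion order
deduped = {col: list(dict.fromkeys(df[col].dropna())) for col in df}

# orient='index' accepts lists of unequal length and pads them; .T restores columns
result = pd.DataFrame.from_dict(deduped, orient="index").T
print(result)
```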

3

Here is another way, using melt and pivot:

    df.melt().dropna().drop_duplicates(['variable','value']).\
        assign(key=lambda x : x.groupby('variable').cumcount()).pivot(index='key',columns='variable',values='value')
    Out[806]:
    variable    set1    set2    set3    set4
    key
    0          apple   apple  orange  orange
    1         orange  orange  banana     NaN
    2         banana  banana    pear     NaN
    3           pear     NaN   lemon     NaN
    4          grape     NaN     NaN     NaN
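The chained expression is easier to follow when split into named steps. This is a sketch of the same pipeline, assuming the sample data is reconstructed as `df` below; the intermediate names `long_form`, `unique_pairs`, `numbered`, and `wide` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's sample data (blank cells as NaN)
df = pd.DataFrame({
    "set1": ["apple", "apple", "orange", "banana", "pear", "grape"],
    "set2": ["apple", "orange", "banana", "banana", np.nan, np.nan],
    "set3": ["orange", "banana", "pear", "lemon", "lemon", "lemon"],
    "set4": ["orange", "orange", np.nan, np.nan, np.nan, np.nan],
})

# 1. Long format: one row per (column name, value) pair, missing values dropped
long_form = df.melt().dropna()

# 2. Keep the first occurrence of each value within each original column
unique_pairs = long_form.drop_duplicates(["variable", "value"])

# 3. Number the surviving values 0, 1, 2, ... inside each original column
numbered = unique_pairs.assign(key=unique_pairs.groupby("variable").cumcount())

# 4. Back to wide format, using that running number as the new row index
wide = numbered.pivot(index="key", columns="variable", values="value")
print(wide)
```

The `key` column is what lets `pivot` stack each column's surviving values from row 0 downward instead of leaving them at their original row positions.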


3

itertools.zip_longest

    from itertools import zip_longest

    pd.DataFrame(
        [*zip_longest(*({*df[c].dropna()} for c in df))],
        columns=[*df]
    )

         set1    set2    set3    set4
    0  banana  orange  banana  orange
    1   grape  banana   lemon    None
    2    pear   apple    pear    None
    3   apple    None  orange    None
    4  orange    None    None    None
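The output above is shuffled because sets are unordered. As a sketch of how `zip_longest` does the padding, and of one way to keep the order of first appearance, the set can be swapped for `dict.fromkeys`. This substitution is not part of the answer, and the sample `df` is an assumed reconstruction of the question's data.

```python
from itertools import zip_longest

import numpy as np
import pandas as pd

# Assumed reconstruction of the question's sample data (blank cells as NaN)
df = pd.DataFrame({
    "set1": ["apple", "apple", "orange", "banana", "pear", "grape"],
    "set2": ["apple", "orange", "banana", "banana", np.nan, np.nan],
    "set3": ["orange", "banana", "pear", "lemon", "lemon", "lemon"],
    "set4": ["orange", "orange", np.nan, np.nan, np.nan, np.nan],
})

# Ordered unique values per column (dict keys keep insertion order)
columns = [list(dict.fromkeys(df[c].dropna())) for c in df]

# zip_longest turns the per-column lists into rows, filling gaps with None
rows = list(zip_longest(*columns))

result = pd.DataFrame(rows, columns=list(df))
print(result)
```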

collections.defaultdict and itertools.count

    # %%timeit
    from collections import defaultdict
    from itertools import count

    i = defaultdict(count)
    pd.DataFrame({c: {next(i[c]): v for v in {*df[c].dropna()}} for c in df})

         set1    set2    set3    set4
    0    pear   apple  orange  orange
    1   grape  banana   lemon     NaN
    2   apple  orange  banana     NaN
    3  banana     NaN    pear     NaN
    4  orange     NaN     NaN     NaN


1

You can also use drop_duplicates:

df.apply(lambda x : x.drop_duplicates().reset_index(drop=True)) 


         set1    set2    set3    set4
    0   apple   apple  orange  orange
    1  orange  orange  banana     NaN
    2  banana  banana    pear     NaN
    3    pear     NaN   lemon     NaN
    4   grape     NaN     NaN     NaN
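One detail worth noting: `drop_duplicates` treats NaN as a value and keeps its first occurrence. That is invisible in the output above because the missing cells sit at the bottom of each column, but a gap in the middle of a column would survive. The sketch below uses a hypothetical frame `gappy` (not from the question) to show the difference when `dropna()` is added first.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing value in the middle of column "x"
gappy = pd.DataFrame({
    "x": ["a", np.nan, "a", "b"],
    "y": ["c", "c", "d", "e"],
})

# As in the answer: the first NaN is kept and stays between "a" and "b"
without_dropna = gappy.apply(
    lambda col: col.drop_duplicates().reset_index(drop=True)
)

# Dropping missing values first pushes all padding to the bottom
with_dropna = gappy.apply(
    lambda col: col.dropna().drop_duplicates().reset_index(drop=True)
)

print(without_dropna)
print(with_dropna)
```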

