60

Starting from this dataframe df:

import pandas as pd

df = pd.DataFrame({'c':[1,1,1,2,2,2],'l1':['a','a','b','c','c','b'],'l2':['b','d','d','f','e','f']})

   c l1 l2
0  1  a  b
1  1  a  d
2  1  b  d
3  2  c  f
4  2  c  e
5  2  b  f

I would like to perform a groupby over the c column to get the unique values of the l1 and l2 columns. For one column I can do:

g = df.groupby('c')['l1'].unique() 

that correctly returns:

c
1    [a, b]
2    [c, b]
Name: l1, dtype: object

but using:

g = df.groupby('c')[['l1','l2']].unique()

returns:

AttributeError: 'DataFrameGroupBy' object has no attribute 'unique' 

I know I can get the unique values for the two columns with (among others):

In [12]: np.unique(df[['l1','l2']])
Out[12]: array(['a', 'b', 'c', 'd', 'e', 'f'], dtype=object)

Is there a way to apply this method to the groupby in order to get something like:

c
1    [a, b, d]
2    [c, b, e, f]
Name: l1, dtype: object
Is there a way you can have the output as distinct columns instead of one cell having a list? Commented Oct 9, 2020 at 4:45

4 Answers

70

Alternatively, you can use agg:

g = df.groupby('c')[['l1','l2']].agg(['unique'])
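For reference, here is a minimal sketch of what the agg approach produces on the question's dataframe. Note that it gives the unique values per column rather than one combined list per group. The variable names `per_column` and `by_dict` are only for illustration, and the dict form is an assumed equivalent spelling, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'c': [1, 1, 1, 2, 2, 2],
    'l1': ['a', 'a', 'b', 'c', 'c', 'b'],
    'l2': ['b', 'd', 'd', 'f', 'e', 'f'],
})

# Double brackets select a list of columns from the groupby object.
# Passing a list of function names yields MultiIndex columns such as ('l1', 'unique').
per_column = df.groupby('c')[['l1', 'l2']].agg(['unique'])

# A dict maps each column to its aggregation and keeps flat column names.
by_dict = df.groupby('c').agg({'l1': 'unique', 'l2': 'unique'})

# Each cell holds the unique values of that column within the group.
print(per_column)
print(by_dict)
```

If a single combined list per group is what you need, the cells of the two columns still have to be merged afterwards.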

6 Comments

how would you combine 'unique' and let's say '.join' in the same agg?
You can write a custom function and apply it the same way. For example: f = lambda arr: ','.join(np.unique(arr)) --> then .agg([f]) or, if you want to label it: .agg([('MyName', f)])
@YaakovBressler how do you actually get the resulting values in order?
You could sort the data at any point! Best performance would be to sort after the aggregation -> df.groupby(...).agg().sort_values() More context + options here: pandas groupby, then sort within groups @josepmaria
Visiting this in 2023, this is the correct answer. While you CAN use apply, this approach with agg is much more readable and flexible.
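
The custom-function idea from the comments can be sketched roughly as follows. The helper name `join_unique` is hypothetical, and sorting inside the helper is just one way to get a deterministic order; it is not necessarily what the commenters had in mind.

```python
import pandas as pd

df = pd.DataFrame({
    'c': [1, 1, 1, 2, 2, 2],
    'l1': ['a', 'a', 'b', 'c', 'c', 'b'],
    'l2': ['b', 'd', 'd', 'f', 'e', 'f'],
})

def join_unique(values):
    """Join the distinct values of one column within one group (hypothetical helper)."""
    # set() drops duplicates; sorted() fixes the order before joining.
    return ','.join(sorted(set(values)))

# agg calls the function once per column and per group.
joined = df.groupby('c')[['l1', 'l2']].agg(join_unique)
print(joined)
```

Here every cell is a plain string, which is usually easier to export than a cell holding a list or array.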
63

You can do it with apply:

import numpy as np

g = df.groupby('c')[['l1','l2']].apply(lambda x: list(np.unique(x)))
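
A self-contained sketch of the apply approach on the question's data, in case it helps to see the result. The name `combined` is illustrative, and `to_numpy()` is used here only to make the conversion to an array explicit; note that `np.unique` returns sorted values, so the order differs from the order of appearance.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'c': [1, 1, 1, 2, 2, 2],
    'l1': ['a', 'a', 'b', 'c', 'c', 'b'],
    'l2': ['b', 'd', 'd', 'f', 'e', 'f'],
})

# Each group arrives as a two-column DataFrame; np.unique flattens it
# and returns the sorted distinct values across both columns.
combined = (
    df.groupby('c')[['l1', 'l2']]
      .apply(lambda group: list(np.unique(group.to_numpy())))
)
print(combined)
```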

18

One more alternative is to use GroupBy.agg with set

df.groupby('c').agg(set)

       l1      l2
c
1  {a, b}  {d, b}
2  {c, b}  {e, f}

2 Comments

You might get into trouble with this when the values in l1 and l2 aren't hashable (ex timestamps). Otherwise, solid solution.
Beautiful solution but it doesn't work for nan.
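
Regarding the NaN comment, one possible workaround is to drop missing values before building the set. This is only a sketch under the assumption that missing values should be ignored; the NaN entries in the sample data below were added for illustration and are not in the original question.

```python
import numpy as np
import pandas as pd

# Same shape as the question's data, with two NaN values added for illustration.
df_nan = pd.DataFrame({
    'c': [1, 1, 1, 2, 2, 2],
    'l1': ['a', 'a', 'b', 'c', 'c', 'b'],
    'l2': ['b', np.nan, 'd', 'f', 'e', np.nan],
})

# dropna() removes missing values from each group's column before set() sees them.
sets_no_nan = df_nan.groupby('c')[['l1', 'l2']].agg(lambda s: set(s.dropna()))
print(sets_no_nan)
```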
0

A shorter version without the lambda function:

df.groupby('c').apply(np.unique)
# or
df.groupby('c')[['l1','l2']].apply(np.unique)

Output:

c
1       [a, b, d]
2    [b, c, e, f]
dtype: object
