
I have the following dataframe:

import pandas as pd

df = pd.DataFrame([(1, 1, 'term1'), (1, 2, 'term2'), (1, 1, 'term1'),
                   (1, 1, 'term2'), (2, 2, 'term3'), (2, 3, 'term1'),
                   (2, 2, 'term1')], columns=['id', 'group', 'term'])

I want to group it by id and group and count the occurrences of each term for each id-group pair.

So in the end I want to get something like this:

term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0

Is there any way I can achieve this without looping?

6 Answers


I use groupby and size:

df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0) 

term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0


Timing


1,000,000 rows

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(id=np.random.choice(100, 1000000),
                       group=np.random.choice(20, 1000000),
                       term=np.random.choice(10, 1000000)))

(timing plot comparing the approaches on the 1,000,000-row frame)
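To reproduce the comparison outside IPython, a minimal timing harness of my own (the timeit setup below is an assumption for illustration, not part of the original answer):

import timeit

import numpy as np
import pandas as pd

# same 1,000,000-row frame as above
df = pd.DataFrame(dict(id=np.random.choice(100, 1000000),
                       group=np.random.choice(20, 1000000),
                       term=np.random.choice(10, 1000000)))

candidates = {
    'groupby.size': lambda: df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0),
    'pivot_table': lambda: df.pivot_table(index=['id', 'group'], columns='term',
                                          aggfunc='size', fill_value=0),
    'crosstab': lambda: pd.crosstab([df.id, df.group], df.term),
}

for name, fn in candidates.items():
    # best of 3 single runs, mirroring %timeit's old "best of 3" reporting
    best = min(timeit.repeat(fn, number=1, repeat=3))
    print(f'{name:12s} {best:.3f} s')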


7 Comments

@jezrael thx, size is quicker too. crosstab is oddly inefficient
And I am surprised that crosstab is so lazy ;)
@jezrael, crosstab uses pivot_table internally... ;)
@piRSquared - can you add df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0) to the timings? It seems faster for me. Thanks.
@piRSquared - I tried it on a larger df and it is a bit faster (0.2 ms; maybe it is the same ;))

Using the pivot_table() method:

In [22]: df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
Out[22]:
term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0

Timing against a 700K-row DF:

In [24]: df = pd.concat([df] * 10**5, ignore_index=True)

In [25]: df.shape
Out[25]: (700000, 3)

In [3]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)
1 loop, best of 3: 226 ms per loop

In [4]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
1 loop, best of 3: 236 ms per loop

In [5]: %timeit pd.crosstab([df.id, df.group], df.term)
1 loop, best of 3: 355 ms per loop

In [6]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int)
1 loop, best of 3: 232 ms per loop

In [7]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
1 loop, best of 3: 231 ms per loop

Timing against a 7M-row DF:

In [9]: df = pd.concat([df] * 10, ignore_index=True)

In [10]: df.shape
Out[10]: (7000000, 3)

In [11]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)
1 loop, best of 3: 2.27 s per loop

In [12]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
1 loop, best of 3: 2.3 s per loop

In [13]: %timeit pd.crosstab([df.id, df.group], df.term)
1 loop, best of 3: 3.37 s per loop

In [14]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int)
1 loop, best of 3: 2.28 s per loop

In [15]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
1 loop, best of 3: 1.89 s per loop

9 Comments

I was just trying to update timings with larger sample :-)
wow! pivot seems just as efficient at larger scales. I'll have to remember that. I'd give you +1 but I already did a while ago.
So size was the alias that we forgot here :)
@ayhan, very strange - this time the solution with df.assign(ones = np.ones(len(df))).pivot_table(index=['id','group'], columns='term', values = 'ones', aggfunc=np.sum, fill_value=0) is a bit slower - 1 loop, best of 3: 2.55 s per loop
I think it is because you used len there, instead of 'size'. len is a Python function but the functions we pass as strings are aliases to optimized C functions.
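To check that claim yourself, a small sketch of my own contrasting aggfunc=len with aggfunc='size' (the ones column is just a dummy to give len something to count; none of this is from the thread):

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(id=np.random.choice(100, 700000),
                       group=np.random.choice(20, 700000),
                       term=np.random.choice(10, 700000)))

def py_len():
    # aggfunc=len applies the Python built-in once per group
    return df.assign(ones=1).pivot_table(index=['id', 'group'], columns='term',
                                         values='ones', aggfunc=len, fill_value=0)

def c_size():
    # aggfunc='size' is a string alias dispatched to pandas' optimized count path
    return df.pivot_table(index=['id', 'group'], columns='term',
                          aggfunc='size', fill_value=0)

print('len  :', min(timeit.repeat(py_len, number=1, repeat=3)), 's')
print('size :', min(timeit.repeat(c_size, number=1, repeat=3)), 's')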

Instead of remembering lengthy solutions, how about the one that pandas has built in for you:

df.groupby(['id', 'group', 'term']).count() 

2 Comments

Maybe this used to work before, but it doesn't return any columns in pandas 1.5.2
@alibakhtiari, I would love to see what columns your dataframe has; groupby count has been working since python existed and still does.
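For what it's worth, count() aggregates whatever columns are left over after grouping, so grouping by all three columns leaves nothing to count, which matches the first comment. A minimal sketch of the size()-based alternative, which needs no leftover column:

import pandas as pd

df = pd.DataFrame([(1, 1, 'term1'), (1, 2, 'term2'), (1, 1, 'term1'),
                   (1, 1, 'term2'), (2, 2, 'term3'), (2, 3, 'term1'),
                   (2, 2, 'term1')], columns=['id', 'group', 'term'])

# size() counts rows per group regardless of which columns remain
counts = df.groupby(['id', 'group', 'term']).size()
print(counts)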

You can use crosstab:

print(pd.crosstab([df.id, df.group], df.term))

term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0

Another solution: groupby with size aggregation, reshaping with unstack:

df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)

term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0

Timings:

df = pd.concat([df]*10000).reset_index(drop=True)

In [48]: %timeit (df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0))
100 loops, best of 3: 12.4 ms per loop

In [49]: %timeit (df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0))
100 loops, best of 3: 12.2 ms per loop

3 Comments

wow wow wow, you are amazing. And it took you only 3 minutes (the same time it took me to write a loop, and less time than it took me to write this question). I would really appreciate it if you could write some explanation of why this works, but most probably I will be able to understand it by myself in a few minutes.
In your case crosstab is better than pivot_table, because the default aggregation function is len (it is the same as size) and I think it is also the faster solution. crosstab uses the first argument as the index and the second as the columns. Give me some time, I will try to add timings.
But I think it is better explained in the docs.
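For the explanation requested above, a small sketch of my own stepping through the groupby variant:

import pandas as pd

df = pd.DataFrame([(1, 1, 'term1'), (1, 2, 'term2'), (1, 1, 'term1'),
                   (1, 1, 'term2'), (2, 2, 'term3'), (2, 3, 'term1'),
                   (2, 2, 'term1')], columns=['id', 'group', 'term'])

# Step 1: size() yields one row count per unique (id, group, term) triple,
# as a Series with a three-level MultiIndex.
sizes = df.groupby(['id', 'group', 'term']).size()

# Step 2: unstack() pivots the innermost level ('term') into columns;
# fill_value=0 replaces the NaN that a missing (id, group, term)
# combination would otherwise produce.
table = sizes.unstack(fill_value=0)
print(table)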

If you want to use value_counts, you can apply it to a given series as follows:

df.groupby(["id", "group"])["term"].value_counts().unstack(fill_value=0) 

or in an equivalent fashion, using the .agg method:

df.groupby(["id", "group"]).agg({"term": "value_counts"}).unstack(fill_value=0) 

Another option is to directly use value_counts on the DataFrame itself without resorting to groupby:

df.value_counts().unstack(fill_value=0) 
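Note that DataFrame.value_counts was only added in pandas 1.1.0. It counts unique rows across all columns, so here it counts (id, group, term) triples; a quick self-contained check (the sort_index call is mine, since value_counts orders by count):

import pandas as pd

df = pd.DataFrame([(1, 1, 'term1'), (1, 2, 'term2'), (1, 1, 'term1'),
                   (1, 1, 'term2'), (2, 2, 'term3'), (2, 3, 'term1'),
                   (2, 2, 'term1')], columns=['id', 'group', 'term'])

# DataFrame.value_counts (pandas >= 1.1) returns a Series indexed by
# (id, group, term); unstack pivots the last level into columns.
table = df.value_counts().unstack(fill_value=0)
print(table.sort_index())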



Another alternative:

df.assign(count=1).groupby(['id', 'group', 'term']).sum().unstack(fill_value=0).xs("count", axis=1)

term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
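A variant of the same idea (my rewrite, not part of the original answer) that avoids the trailing .xs by naming the values column directly in pivot_table:

import pandas as pd

df = pd.DataFrame([(1, 1, 'term1'), (1, 2, 'term2'), (1, 1, 'term1'),
                   (1, 1, 'term2'), (2, 2, 'term3'), (2, 3, 'term1'),
                   (2, 2, 'term1')], columns=['id', 'group', 'term'])

# With an explicit values= column there is only one column level in the
# result, so no cross-section step is needed afterwards.
table = (df.assign(count=1)
           .pivot_table(index=['id', 'group'], columns='term',
                        values='count', aggfunc='sum', fill_value=0))
print(table)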

