Status: Closed
Labels: Groupby, Performance (memory or execution speed), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Description
xref #14376
```
# from the asv
In [10]: n = 10000
    ...: df = DataFrame({'key1': randint(0, 500, size=n),
    ...:                 'key2': randint(0, 100, size=n),
    ...:                 'ints': randint(0, 1000, size=n),
    ...:                 'ints2': randint(0, 1000, size=n), })
    ...:

In [11]: %timeit df.groupby(['key1', 'key2']).nunique()
1 loop, best of 3: 4.25 s per loop

In [12]: result = df.groupby(['key1', 'key2']).nunique()

In [13]: g = df.groupby(['key1', 'key2'])

In [14]: expected = pd.concat([getattr(g, col).nunique() for col in g._selected_obj.columns], axis=1)

In [15]: result.equals(expected)
Out[15]: True

In [16]: %timeit pd.concat([getattr(g, col).nunique() for col in g._selected_obj.columns], axis=1)
100 loops, best of 3: 6.94 ms per loop
```

Series.groupby.nunique has a very performant implementation, but DataFrame.groupby.nunique is implemented via .apply, which ends up in a Python-level loop over the groups and nullifies that advantage.
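For reference, here is the same column-wise workaround written against public API only (no private `g._selected_obj` attribute). This is a sketch of the idea, not the eventual fix; the column names and sizes are taken from the example above, and the note about grouping keys is an assumption since that behavior has varied across pandas versions:

```python
import numpy as np
import pandas as pd

# Same setup as the report (randint above is numpy.random.randint,
# DataFrame is pandas.DataFrame).
n = 10000
df = pd.DataFrame({'key1': np.random.randint(0, 500, size=n),
                   'key2': np.random.randint(0, 100, size=n),
                   'ints': np.random.randint(0, 1000, size=n),
                   'ints2': np.random.randint(0, 1000, size=n)})

g = df.groupby(['key1', 'key2'])

# Slow path: the DataFrame-level nunique dispatches through .apply,
# i.e. a Python loop over each of the ~50,000 (key1, key2) groups.
slow = g.nunique()

# Fast path: call the performant SeriesGroupBy.nunique once per column
# and concatenate. g[col] is the public spelling of the
# getattr(g, col) / g._selected_obj access used in the report.
fast = pd.concat([g[col].nunique() for col in ['ints', 'ints2']], axis=1)

# Whether `slow` also reports nunique of the grouping keys themselves
# depends on the pandas version, so compare the value columns only.
assert fast.equals(slow[['ints', 'ints2']])
```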
It should be straightforward to fix this. We need to make sure to test with as_index=True/False; a sketch of such a check follows.
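A minimal sketch of that check, reusing `df` from above (the exact expected layout is an assumption, since as_index handling has differed across versions):

```python
# With as_index=False the grouping keys should come back as ordinary
# columns, not as levels of a MultiIndex on the result.
res = df.groupby(['key1', 'key2'], as_index=False).nunique()
assert {'key1', 'key2'}.issubset(res.columns)
assert res.index.nlevels == 1  # flat default index, not (key1, key2)
```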