PERF: faster grouping #14294

jreback · 2016-09-24T13:53:05Z

closes #14293

jreback · 2016-09-24T13:53:22Z

cc @mrocklin
@wesm

mrocklin · 2016-09-24T14:28:10Z

Does the performance regression pointed out in #14293 (comment) still hold here?

jreback · 2016-09-24T15:43:11Z

so that was a bug, now fixed.

this is size=2**21
large is 10k groups, small is 100

· Running 8 total benchmarks (2 commits * 1 environments * 4 benchmarks)
[ 0.00%] · For pandas commit hash 45b79968:
[ 0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 12.50%] ··· Running groupby.groupby_groups.time_groupby_groups_int64_large 279.51ms
[ 25.00%] ··· Running groupby.groupby_groups.time_groupby_groups_int64_small 104.08ms
[ 37.50%] ··· Running groupby.groupby_groups.time_groupby_groups_object_large 374.57ms
[ 50.00%] ··· Running groupby.groupby_groups.time_groupby_groups_object_small 151.01ms
[ 50.00%] · For pandas commit hash d9e51fe7:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 62.50%] ··· Running groupby.groupby_groups.time_groupby_groups_int64_large 856.54ms
[ 75.00%] ··· Running groupby.groupby_groups.time_groupby_groups_int64_small 674.41ms
[ 87.50%] ··· Running groupby.groupby_groups.time_groupby_groups_object_large 392.38ms
[100.00%] ··· Running groupby.groupby_groups.time_groupby_groups_object_small 262.49ms

    before      after     ratio
  [d9e51fe7]  [45b79968]
-   856.54ms    279.51ms   0.33  groupby.groupby_groups.time_groupby_groups_int64_large
-   674.41ms    104.08ms   0.15  groupby.groupby_groups.time_groupby_groups_int64_small

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

These are faster for object dtypes (and about 2x for _small), though the margin isn't as big as it should be I think.
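For anyone wanting to reproduce the shape of these numbers, the setup described above (size=2**21, 100 groups for small, 10k for large) could be sketched roughly as follows. This is not the asv benchmark from the pandas repository: `make_frame`, `time_groups`, the column names, and the seed are hypothetical names chosen for illustration, and absolute timings will differ on other machines and pandas versions.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(size, ngroups, dtype="int64", seed=0):
    """Build a frame whose 'key' column has exactly `ngroups` distinct values."""
    rng = np.random.default_rng(seed)
    # Cycle through the group ids, then shuffle so the groups are interleaved.
    keys = rng.permutation(np.arange(size) % ngroups)
    if dtype == "object":
        # String keys stand in for the object-dtype case of the benchmark.
        keys = np.array(["g%d" % k for k in keys], dtype=object)
    return pd.DataFrame({"key": keys, "value": np.arange(size)})


def time_groups(df, repeat=3):
    """Best-of-`repeat` wall time, in seconds, for building `.groups`."""
    # A fresh groupby object is created on every call, so nothing is cached.
    return min(
        timeit.repeat(lambda: df.groupby("key").groups, number=1, repeat=repeat)
    )
```

Under these assumptions, `time_groups(make_frame(2**21, 100))` would correspond to the int64 small case and `time_groups(make_frame(2**21, 10_000, dtype="object"))` to the object large case.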

codecov-io · 2016-09-24T16:07:20Z

Current coverage is 85.26% (diff: 100%)

Merging #14294 into master will decrease coverage by <.01%

@@             master   #14294   diff @@
==========================================
  Files         140      140
  Lines       50593    50599     +6
  Methods         0        0
  Messages        0        0
  Branches        0        0
==========================================
+ Hits        43137    43142     +5
- Misses       7456     7457     +1
  Partials        0        0

Powered by Codecov. Last update b81d444...82d19dd

jreback · 2016-09-24T23:16:01Z

these are size=2**22, with 100 and 10000 groups for small/large

    before      after     ratio
  [d9e51fe7]  [87db6a4f]
-   136.59ms    107.11ms   0.78  groupby.groupby_indices.time_groupby_indices
-   996.17ms    699.80ms   0.70  groupby.groupby_groups.time_groupby_groups_object_large
-   606.62ms    319.25ms   0.53  groupby.groupby_groups.time_groupby_groups_object_small
-      1.80s    540.05ms   0.30  groupby.groupby_groups.time_groupby_groups_int64_large
-      1.49s    207.64ms   0.14  groupby.groupby_groups.time_groupby_groups_int64_small

jreback · 2016-09-26T10:46:51Z

@wesm
@jorisvandenbossche

any comments

chris-b1 · 2016-09-26T15:44:49Z

pandas/core/groupby.py

I might be missing something, but wouldn't it be possible to re-use the possibly existing factorization here?

def _groupby_indices(codes : Grouping.labels, cats : Grouping.group_index)

we are re-using the existing factorization (if it's categorical already, it's unchanged).

I meant in the non Categorical case, if Grouping.labels is already populated, could skip factorizing again in the Categorical?

actually I see you proposed something else. .group_index is not defined except for a single Grouping. If we had a MultiGrouping, then yes I think that would work (right now that's basically a list of Groupings).

Oh, I see it now, I went one class too deep. I do think you could speed up the single grouping case here - https://github.com/jreback/pandas/blob/580237924022eb74575420ad4433952c8de318dd/pandas/core/groupby.py#L2345 - make it:

return self.index.groupby(Categorical.from_codes(self.labels, self.group_index))
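To make the suggestion concrete, here is a minimal sketch of reusing an existing factorization through `Categorical.from_codes`. It uses only public pandas calls; the sample data is made up, and `codes` and `uniques` merely stand in for `Grouping.labels` and `Grouping.group_index`.

```python
import numpy as np
import pandas as pd

values = np.array(["b", "a", "b", "c", "a"], dtype=object)

# Factorize once: `codes` plays the role of Grouping.labels and
# `uniques` the role of Grouping.group_index in the discussion above.
codes, uniques = pd.factorize(values)

# from_codes wraps the existing codes instead of factorizing the values again.
cat = pd.Categorical.from_codes(codes, categories=uniques)

# Group the index labels by the categorical, mirroring the suggested one-liner.
index = pd.Index([10, 11, 12, 13, 14])
groups = index.groupby(cat)
```

Each key of `groups` is a category and each value is an Index holding the labels that fall in that group.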

wesm · 2016-09-26T16:22:08Z

pandas/algos.pyx

We already have groupsort_indexer, does this do anything different?

let me see...
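For context on the question above, the idea behind a group-sort indexer is a counting sort over integer labels. The following is a pure-Python sketch of that idea only, not the Cython routine in pandas/algos.pyx; the name `groupsort_indexer_sketch` is hypothetical, and it assumes every label lies in [0, ngroups), so missing (-1) labels are not handled.

```python
def groupsort_indexer_sketch(labels, ngroups):
    """Return (indexer, counts) for integer labels in [0, ngroups).

    indexer lists row positions ordered by label (stable within a label);
    counts[g] is the number of rows carrying label g.
    """
    counts = [0] * ngroups
    for lab in labels:
        counts[lab] += 1

    # starts[g] is the first output slot reserved for group g.
    starts = [0] * ngroups
    for g in range(1, ngroups):
        starts[g] = starts[g - 1] + counts[g - 1]

    # Drop each row position into the next free slot of its group.
    indexer = [0] * len(labels)
    next_slot = list(starts)
    for pos, lab in enumerate(labels):
        indexer[next_slot[lab]] = pos
        next_slot[lab] += 1
    return indexer, counts
```

The two passes are linear in the number of rows plus the number of groups, which is why this approach tends to beat a general comparison sort when labels are already small integers.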

wesm · 2016-09-26T16:23:06Z

pandas/indexes/base.py

Are there benefits to doing this lazily?

this is not lazy, so not sure what you mean.

I'm just wondering whether producing a fully-boxed Index for each group up front has memory use or performance implications. This is something I guess we'll address in much more detail in pandas 2.0

AFAICT, this is only used on grouped.groups, so by definition we need to box (and that's the ONLY benchmark that changed with this PR). I thought this would have broader implications (on the good side), but no-go.
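The boxing point can be seen from the public API: `.groups` hands back one Index of row labels per group, while `.indices` hands back plain integer arrays. A small illustration with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"key": ["x", "y", "x", "y", "x"], "value": range(5)},
    index=list("abcde"),
)
grouped = df.groupby("key")

# .groups maps each key to an Index of row labels (one boxed Index per group).
groups = grouped.groups
# .indices maps each key to a plain ndarray of integer positions.
indices = grouped.indices
```

Code that only needs positions can use `.indices` and skip constructing an Index per group.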

jreback · 2016-09-26T22:57:47Z

roughly the same perf benefits here. with 5802379

  [7dedbed8]  [a731e2dd]
-   554.34ms    269.39ms   0.49  groupby.groupby_groups.time_groupby_groups_object_small
-      1.78s    430.09ms   0.24  groupby.groupby_groups.time_groupby_groups_int64_large
-      1.32s    180.49ms   0.14  groupby.groupby_groups.time_groupby_groups_int64_small

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

remove pandas.core.groupby._groupby_indices to use algos.groupsort_indexer
add Categorical._reverse_indexer to facilitate
closes pandas-dev#14293
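As a rough picture of what a reverse indexer produces, the sketch below maps each category to the positions where its code occurs. It is an assumption-laden stand-in, not the pandas implementation of `Categorical._reverse_indexer`: it swaps in a stable NumPy argsort where the commit uses algos.groupsort_indexer, the name `reverse_indexer_sketch` is hypothetical, and it assumes there are no missing (-1) codes.

```python
import numpy as np


def reverse_indexer_sketch(codes, categories):
    """Map each category to the integer positions where it occurs."""
    codes = np.asarray(codes, dtype=np.intp)
    # A stable sort keeps positions ascending within each category.
    order = np.argsort(codes, kind="stable")
    counts = np.bincount(codes, minlength=len(categories))
    # Split the sorted positions at the boundaries between categories.
    bounds = np.cumsum(counts)[:-1]
    return dict(zip(categories, np.split(order, bounds)))
```

Taking these positions from the original index, one category at a time, gives the per-group labels that `.groups` returns.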

wesm · 2016-09-27T13:44:17Z

lgtm

jreback added the Groupby and Performance (memory or execution speed performance) labels Sep 24, 2016

jreback added this to the 0.19.0 milestone Sep 24, 2016

jreback force-pushed the groupby branch from d9c7ce5 to 1f4e97f Compare September 24, 2016 13:56

jreback force-pushed the groupby branch from 1f4e97f to 45b7996 Compare September 24, 2016 15:10

jreback force-pushed the groupby branch 2 times, most recently from 5f82e6f to 1ba34fb Compare September 24, 2016 17:02

chris-b1 reviewed Sep 26, 2016

wesm reviewed Sep 26, 2016

jreback force-pushed the groupby branch from 87db6a4 to a731e2d Compare September 26, 2016 22:37

jreback force-pushed the groupby branch from a731e2d to 5802379 Compare September 26, 2016 23:05

PERF: faster grouping

82d19dd

remove pandas.core.groupby._groupby_indices to use algos.groupsort_indexer
add Categorical._reverse_indexer to facilitate
closes pandas-dev#14293

jreback force-pushed the groupby branch from 669d0dc to 82d19dd Compare September 27, 2016 10:39

jreback closed this in 71df09c Sep 27, 2016