Skip to content

Conversation

@jreback
Copy link
Contributor

@jreback jreback commented Sep 24, 2016

closes #14293

@jreback jreback added Groupby Performance Memory or execution speed performance labels Sep 24, 2016
@jreback jreback added this to the 0.19.0 milestone Sep 24, 2016
@jreback
Copy link
Contributor Author

jreback commented Sep 24, 2016

@mrocklin
Copy link
Contributor

Does the performance regression pointed out in #14293 (comment) still hold here?

@jreback
Copy link
Contributor Author

jreback commented Sep 24, 2016

so that was a bug, now fixed.

this is size=2**21
large is 10k groups, small is 100

· Running 8 total benchmarks (2 commits * 1 environments * 4 benchmarks) [ 0.00%] · For pandas commit hash 45b79968: [ 0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt... [ 0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt [ 12.50%] ··· Running groupby.groupby_groups.time_groupby_groups_int64_large 279.51ms [ 25.00%] ··· Running groupby.groupby_groups.time_groupby_groups_int64_small 104.08ms [ 37.50%] ··· Running groupby.groupby_groups.time_groupby_groups_object_large 374.57ms [ 50.00%] ··· Running groupby.groupby_groups.time_groupby_groups_object_small 151.01ms [ 50.00%] · For pandas commit hash d9e51fe7: [ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt... [ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt [ 62.50%] ··· Running groupby.groupby_groups.time_groupby_groups_int64_large 856.54ms [ 75.00%] ··· Running groupby.groupby_groups.time_groupby_groups_int64_small 674.41ms [ 87.50%] ··· Running groupby.groupby_groups.time_groupby_groups_object_large 392.38ms [100.00%] ··· Running groupby.groupby_groups.time_groupby_groups_object_small 262.49ms before after ratio [d9e51fe7] [45b79968] - 856.54ms 279.51ms 0.33 groupby.groupby_groups.time_groupby_groups_int64_large - 674.41ms 104.08ms 0.15 groupby.groupby_groups.time_groupby_groups_int64_small SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY. 

These are faster for object dtypes (and about 2x for _small), though the margin isn't as big as it should be I think.

@codecov-io
Copy link

codecov-io commented Sep 24, 2016

Current coverage is 85.26% (diff: 100%)

Merging #14294 into master will decrease coverage by <.01%

@@ master #14294 diff @@ ========================================== Files 140 140 Lines 50593 50599 +6 Methods 0 0 Messages 0 0 Branches 0 0 ========================================== + Hits 43137 43142 +5  - Misses 7456 7457 +1  Partials 0 0 

Powered by Codecov. Last update b81d444...82d19dd

@jreback jreback force-pushed the groupby branch 2 times, most recently from 5f82e6f to 1ba34fb Compare September 24, 2016 17:02
@jreback
Copy link
Contributor Author

jreback commented Sep 24, 2016

these are 2*22 with 100, 10000 for small/large

 before after ratio [d9e51fe7] [87db6a4f] - 136.59ms 107.11ms 0.78 groupby.groupby_indices.time_groupby_indices - 996.17ms 699.80ms 0.70 groupby.groupby_groups.time_groupby_groups_object_large - 606.62ms 319.25ms 0.53 groupby.groupby_groups.time_groupby_groups_object_small - 1.80s 540.05ms 0.30 groupby.groupby_groups.time_groupby_groups_int64_large - 1.49s 207.64ms 0.14 groupby.groupby_groups.time_groupby_groups_int64_small 
@jreback
Copy link
Contributor Author

jreback commented Sep 26, 2016

@wesm
@jorisvandenbossche

any comments

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing something, but wouldn't it possible to re-use the possibly existing factorization here?

def _groupby_indices(codes : Grouping.labels, cats : Grouping.group_index) 
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are re-using the existing factorization (if its categorical already its unchanged).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant in the non Categorical case, if Grouping.labels is already populated, could skip factorizing again in the Categorical?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I see you proposed something else. .group_index is not defined except for a single Grouping. If we had a MultiGrouping, then yes I think that would work (right now that basically a list of Groupings).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see it now, I went one class too deep. I do think you could speed up the single grouping case here - https://github.com/jreback/pandas/blob/580237924022eb74575420ad4433952c8de318dd/pandas/core/groupby.py#L2345 - make it:

return self.index.groupby(Categorical.from_codes(self.labels, self.group_index)) 
pandas/algos.pyx Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have groupsort_indexer, does this do anything different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me see...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there benefits to doing this lazily?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not lazy, so not sure what you mean.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just wondering whether producing a fully-boxed Index for each group up front has memory use or performance implications. This is something I guess we'll address in much more detail in pandas 2.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT, this is only used on grouped.groups, so by-definition we need to box. (and that's the ONLY benchmark that changed with this PR). I though this would have broader implications (on the good side), but no-go.

@jreback
Copy link
Contributor Author

jreback commented Sep 26, 2016

roughly the same perf benefits here. with 5802379

 [7dedbed8] [a731e2dd] - 554.34ms 269.39ms 0.49 groupby.groupby_groups.time_groupby_groups_object_small - 1.78s 430.09ms 0.24 groupby.groupby_groups.time_groupby_groups_int64_large - 1.32s 180.49ms 0.14 groupby.groupby_groups.time_groupby_groups_int64_small SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY. 
remove pandas.core.groupby._groupby_indices to use algos.groupsort_indexer add Categorical._reverse_indexer to facilitate closes pandas-dev#14293
@wesm
Copy link
Member

wesm commented Sep 27, 2016

lgtm

@jreback jreback closed this in 71df09c Sep 27, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Groupby Performance Memory or execution speed performance

5 participants