-
- Notifications
You must be signed in to change notification settings - Fork 19.4k
REF: Compute complete result_index upfront in groupby #55738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…into gb_observed_pre � Conflicts: � pandas/core/groupby/ops.py
6ee8fcb to da9169d Compare …into gb_observed_pre # Conflicts: # pandas/core/groupby/ops.py
| taker = [0, 2, 1, 3] | ||
| else: | ||
| taker = [2, 3, 0, 1] | ||
| taker = [2, 1, 0, 3] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ref: #56016 (comment)
| @jbrockmendel @mroeschke - this is now ready |
| Does this happen to fix #35202 |
No, see the bottom half of #55738 (comment). |
# Conflicts: # doc/source/whatsnew/v3.0.0.rst
mroeschke left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| Thanks @rhshadrach |
| @rhshadrach is the following change in behaviour intentional? df = pd.DataFrame( { "key_cat": pd.Categorical(["a", "a", "b", "b"]), "key_noncat": [1, 1, 1, 2], "values": [1.0, 2.0, 3.0, 4.0], } ) df.groupby(["key_cat", "key_noncat"], observed=False)["values"].agg(lambda x: x.sum())Using released pandas, this injected NaNs for the unobserved values: # using pandas 2.2.0 >>> df.groupby(["key_cat", "key_noncat"], observed=False)["values"].agg(lambda x: x.sum()) key_cat key_noncat a 1 3.0 2 NaN b 1 3.0 2 4.0 Name: values, dtype: float64versus after (I assume) this PR on pandas main: # using pandas main >>> df.groupby(["key_cat", "key_noncat"], observed=False)["values"].agg(lambda x: x.sum()) key_cat key_noncat a 1 3.0 2 0.0 b 1 3.0 2 4.0 Name: values, dtype: float64The reason this now gives a 0 is because this actually calls the UDF on the empty group, and the sum of an empty Series is 0 (while before, the UDF was never called for the unobserved group). I don't think I have a strong opinion on what the behaviour should be, but if it is intentional, this probably warrants a whatsnew note (it was confusing to debug my failing tests). Apart of the different result, this also means that your UDF needs to be able to handle empty input (before, the UDF essentially never got empty input for those cases, I think?) |
| I also see that your top post has a long list of changes / fixes:
I think those are mostly added to the whatsnew, but it might be clearer to have a separate section explaining the general change (unobserved groups are now properly included through all APIs), and then list some of the consequence of it (instead of those bullet points that will become part of a large list of bullet points in the groupby section) |
| Thanks @jorisvandenbossche - indeed this is intentional; xref #36698. I'll |
* REF: Compute correct result_index upfront in groupby * Refinements * Refinements * Refinements * Restore inferring index dtype * Test fixups * Refinements * Refinements * fixup * fixup * fixup * Fix sorting and non-sorting * Cleanup * Call ensure_plantform_int last * fixup * fixup * REF: Compute correct result_index upfront in groupby * Add test * Remove test * Move unobserved to the end * cleanup * cleanup * cleanup * Merge fixup * fixup * fixup * Fixup and test * whatsnew * type ignore * Refactor & type annotations * Better bikeshed
groupbymethod #55919doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.Just a PoC at this point, need to try to move behavior changes out of here as much as possible.The main change this makes is it moves the computation of unobserved groups upfront. Currently, we only include unobserved groups upfront if there is a single grouping (e.g.
df.groupby('a')) but not two or more (e.g.df.groupby(['a', 'b'])). When there are two or more, we go through the groupby computations with only observed groups, and then tack on the unobserved groups at the end. By always including unobserved groups upfront, we can simplify the logic in the groupby code. Having unobserved groups sometimes included and sometimes not included upfront is also a footgun.In order to make this change, I found I needed to rework a bit of how NA values are handled in
Grouping._codes_and_uniques. This in turn fixed some NA bugs.This PR fixes a number of issues that stem from this (shown in the test changes)
groupby.applyif there is more than one groupinggroubpy.groupslen(df.groupby(..., dropna=True))counts groups that have NA valuesdf.groupby(..., dropna=True).groupshas groups that have NA values.allfor empty groups wasnp.nanbut should beTrue.anyfor empty groups wasnp.nanbut should beFalse.prodfor empty groups wasnp.nanbut should be1In addition, it allows us to remove various similar objects throughout groupby:
BaseGrouper.group_keys_seqBaseGrouper.reconstructed_codesGrouping.group_indexGrouping.result_indexGrouping.group_arraylikeBinGrouper.reconstructed_codesand a few methods
GroupBy._reindex_outputBaseGrouper._sort_idxBaseGrouper._get_compressed_codesBut it does add
BaseGrouper.result_index_and_codes. This computes the (aggregated) result index that takes into accountdropnaandobserved, along with the codes for the groups themselves.ASVs; no performance regressions (with the standard 10% cutoff) other than groupby transform operations with multiple categorical groupings.
I haven't verified this, but I believe this would also make #55261 trivial to implement just by changing a few lines of
result_index_and_ids