-
- Notifications
You must be signed in to change notification settings - Fork 19.4k
BUG: preserve categorical & sparse types when grouping / pivot & preserve dtypes on ufuncs #26550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@ ## master #26550 +/- ## ========================================== - Coverage 91.77% 91.74% -0.03% ========================================== Files 174 174 Lines 50642 50666 +24 ========================================== + Hits 46476 46484 +8 - Misses 4166 4182 +16
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@ ## master #26550 +/- ## ========================================== - Coverage 91.97% 91.95% -0.03% ========================================== Files 180 180 Lines 50756 50789 +33 ========================================== + Hits 46685 46705 +20 - Misses 4071 4084 +13
Continue to review full report at Codecov.
|
| obj = self.obj[data.items[locs]] | ||
| s = groupby(obj, self.grouper) | ||
| result = s.aggregate(lambda x: alt(x, axis=self.axis)) | ||
| try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this for? Not immediately obvious the link between this and the overall PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this handles blocks that return NotImplementedError and then cann't be aggregated, e.g. Categoricals with string categories, aggregating with mean for exampel (and numeric_only=False) is passed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give an example? I'm also having trouble seeing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If those return NotImplementedError can we limit the scope of the catching to just that?
| @jreback rough sketch of the idea in #26550 (comment) diff --git a/pandas/core/generic.py b/pandas/core/generic.py index ce1cb37ab..898d03e6c 100644 --- a/pandas/core/generic.py +++ b/pandas/core/generic.py @@ -1945,6 +1945,10 @@ class NDFrame(PandasObject, SelectionMixin): # GH#23114 Ensure ndarray.__op__(DataFrame) returns NotImplemented __array_priority__ = 1000 + def __array_ufunc__(self, ufunc, method, *inputs, **kwargs): + result = self._data.map_blocks(ufunc, **kwargs) + return type(self)(result) + def __array__(self, dtype=None): return com.values_from_object(self) diff --git a/pandas/core/internals/managers.py b/pandas/core/internals/managers.py index 0b63588c9..9c7cc4bd5 100644 --- a/pandas/core/internals/managers.py +++ b/pandas/core/internals/managers.py @@ -117,6 +117,10 @@ class BlockManager(PandasObject): self._rebuild_blknos_and_blklocs() + def map_blocks(self, func, **kwargs): + newbs = [make_block(func(block.values, **kwargs), block.mgr_locs) for block in self.blocks] + return type(self)(newbs, self.axes) + def make_empty(self, axes=None): """ return an empty BlockManager with the items axis of len 0 """ if axes is None:usage: In [1]: import numpy as np; import pandas as pd In [2]: df = pd.DataFrame({"A": [1, 0]}, dtype=pd.SparseDtype('int')) In [3]: np.sin(df) Out[3]: A 0 0.841471 1 0.000000 In [4]: np.sin(df).dtypes Out[4]: A Sparse[float64, 0.0] dtype: objectThe On this branch, the casting seems problematic: In [10]: df = pd.DataFrame({"A": [1, 2, 3]}, dtype=pd.SparseDtype('int')) In [11]: np.sqrt(df) Out[11]: A 0 1 1 1 2 1 |
| yeah i can do via blocks array_ufunc is orthogonal here - i don’t think we have any implementation yet (should though) |
| fixed up the casting, a bit tricky. |
| **kwargs) | ||
| return self._constructor(new_data).__finalize__(self) | ||
| | ||
| if not results: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What case hits this? I'm not immediately seeing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
empty frames :->
| obj = self.obj[data.items[locs]] | ||
| s = groupby(obj, self.grouper) | ||
| result = s.aggregate(lambda x: alt(x, axis=self.axis)) | ||
| try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give an example? I'm also having trouble seeing this.
| """ | ||
| try: | ||
| result = self._holder._from_sequence( | ||
| np.asarray(result).ravel(), dtype=dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the asarray and ravel needed? _from_sequence should take any sequence, so the trip through NumPy shouldn't be needed. Can result be 2-D here? I can't imagine us wanting to ravel a 2-D array, since the length will change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some comments (short answer is this could be a 2-D numpy array or a EA which is already 1-D)
| 'std', | ||
| 'var', | ||
| 'sem', | ||
| pytest.param('median', marks=pytest.mark.xfail( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a know issue we have an issue open for already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no was discovered here, but I can't repro locally
| obj = self.obj[data.items[locs]] | ||
| s = groupby(obj, self.grouper) | ||
| result = s.aggregate(lambda x: alt(x, axis=self.axis)) | ||
| try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If those return NotImplementedError can we limit the scope of the catching to just that?
pandas/core/groupby/groupby.py Outdated
| lambda x: npfunc(x, axis=self.axis)) | ||
| | ||
| # coerce the columns if we can | ||
| if isinstance(result, DataFrame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this logic be in _try_cast? OK as a follow up just want to confirm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, _try_cast is for 1d results, maybe I can make this cleaner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sligthly refactored
| push should make this pass, will address comments soon. |
preserve dtypes when applying a ufunc to a sparse dtype closes pandas-dev#18502 closes pandas-dev#23743
| replacing with a PR just with the categorical grouping changes; the sparse ufuncs is conflicting with what @TomAugspurger is doing and is messy. |
closes #18502
closes #23743