Skip to content

Conversation

@jreback
Copy link
Contributor

@jreback jreback commented May 29, 2019

closes #18502
closes #23743

@jreback jreback added Bug Groupby Sparse Sparse Data Type Categorical Categorical Data Type ExtensionArray Extending pandas with custom dtypes or arrays. labels May 29, 2019
@jreback jreback added this to the 0.25.0 milestone May 29, 2019
@jreback jreback changed the title BUG: preserve categorical & sparse types when grouping / pivot BUG: preserve categorical & sparse types when grouping / pivot & preserve dtypes on ufuncs May 29, 2019
@codecov
Copy link

codecov bot commented May 29, 2019

Codecov Report

Merging #26550 into master will decrease coverage by 0.02%.
The diff coverage is 88.88%.

Impacted file tree graph

@@ Coverage Diff @@ ## master #26550 +/- ## ========================================== - Coverage 91.77% 91.74% -0.03%  ========================================== Files 174 174 Lines 50642 50666 +24 ========================================== + Hits 46476 46484 +8  - Misses 4166 4182 +16
Flag Coverage Δ
#multiple 90.28% <88.88%> (-0.02%) ⬇️
#single 41.66% <44.44%> (-0.12%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/generic.py 88.07% <100%> (-0.89%) ⬇️
pandas/core/series.py 93.63% <100%> (+0.01%) ⬆️
pandas/core/frame.py 97.02% <100%> (-0.11%) ⬇️
pandas/core/groupby/ops.py 96% <100%> (ø) ⬆️
pandas/core/nanops.py 94.11% <100%> (ø) ⬆️
pandas/core/groupby/groupby.py 97.39% <100%> (+0.16%) ⬆️
pandas/core/dtypes/cast.py 91.54% <100%> (ø) ⬆️
pandas/core/generic.py 93.42% <25%> (-0.15%) ⬇️
pandas/core/internals/blocks.py 93.94% <71.42%> (-0.19%) ⬇️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7629a18...f397908. Read the comment docs.

@codecov
Copy link

codecov bot commented May 29, 2019

Codecov Report

Merging #26550 into master will decrease coverage by 0.02%.
The diff coverage is 92.3%.

Impacted file tree graph

@@ Coverage Diff @@ ## master #26550 +/- ## ========================================== - Coverage 91.97% 91.95% -0.03%  ========================================== Files 180 180 Lines 50756 50789 +33 ========================================== + Hits 46685 46705 +20  - Misses 4071 4084 +13
Flag Coverage Δ
#multiple 90.55% <92.3%> (-0.02%) ⬇️
#single 41.81% <41.53%> (-0.1%) ⬇️
Impacted Files Coverage Δ
pandas/core/sparse/frame.py 95.67% <100%> (+0.03%) ⬆️
pandas/core/groupby/generic.py 88.48% <100%> (-0.86%) ⬇️
pandas/core/series.py 93.69% <100%> (+0.02%) ⬆️
pandas/core/dtypes/cast.py 90.53% <100%> (ø) ⬆️
pandas/core/nanops.py 94.76% <100%> (ø) ⬆️
pandas/core/frame.py 96.9% <100%> (-0.11%) ⬇️
pandas/core/groupby/ops.py 96% <100%> (ø) ⬆️
pandas/core/groupby/groupby.py 97.46% <100%> (+0.29%) ⬆️
pandas/core/arrays/sparse.py 93.88% <100%> (+0.15%) ⬆️
pandas/core/internals/construction.py 96.19% <100%> (+0.25%) ⬆️
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b9b081d...4bd486e. Read the comment docs.

obj = self.obj[data.items[locs]]
s = groupby(obj, self.grouper)
result = s.aggregate(lambda x: alt(x, axis=self.axis))
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this for? Not immediately obvious the link between this and the overall PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this handles blocks that return NotImplementedError and then cann't be aggregated, e.g. Categoricals with string categories, aggregating with mean for exampel (and numeric_only=False) is passed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give an example? I'm also having trouble seeing this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If those return NotImplementedError can we limit the scope of the catching to just that?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented May 30, 2019

@jreback rough sketch of the idea in #26550 (comment)

diff --git a/pandas/core/generic.py b/pandas/core/generic.py index ce1cb37ab..898d03e6c 100644 --- a/pandas/core/generic.py +++ b/pandas/core/generic.py @@ -1945,6 +1945,10 @@ class NDFrame(PandasObject, SelectionMixin): # GH#23114 Ensure ndarray.__op__(DataFrame) returns NotImplemented __array_priority__ = 1000 + def __array_ufunc__(self, ufunc, method, *inputs, **kwargs): + result = self._data.map_blocks(ufunc, **kwargs) + return type(self)(result) + def __array__(self, dtype=None): return com.values_from_object(self) diff --git a/pandas/core/internals/managers.py b/pandas/core/internals/managers.py index 0b63588c9..9c7cc4bd5 100644 --- a/pandas/core/internals/managers.py +++ b/pandas/core/internals/managers.py @@ -117,6 +117,10 @@ class BlockManager(PandasObject): self._rebuild_blknos_and_blklocs() + def map_blocks(self, func, **kwargs): + newbs = [make_block(func(block.values, **kwargs), block.mgr_locs) for block in self.blocks] + return type(self)(newbs, self.axes) + def make_empty(self, axes=None): """ return an empty BlockManager with the items axis of len 0 """ if axes is None:

usage:

In [1]: import numpy as np; import pandas as pd In [2]: df = pd.DataFrame({"A": [1, 0]}, dtype=pd.SparseDtype('int')) In [3]: np.sin(df) Out[3]: A 0 0.841471 1 0.000000 In [4]: np.sin(df).dtypes Out[4]: A Sparse[float64, 0.0] dtype: object

The __array_ufunc__ isn't really correct yet, but does make sense? I think that's much cleaner than doing the op and trying to cast back.

On this branch, the casting seems problematic:

In [10]: df = pd.DataFrame({"A": [1, 2, 3]}, dtype=pd.SparseDtype('int')) In [11]: np.sqrt(df) Out[11]: A 0 1 1 1 2 1
@jreback
Copy link
Contributor Author

jreback commented May 30, 2019

yeah i can do via blocks

array_ufunc is orthogonal here - i don’t think we have any implementation yet (should though)

@jreback
Copy link
Contributor Author

jreback commented Jun 2, 2019

fixed up the casting, a bit tricky.

**kwargs)
return self._constructor(new_data).__finalize__(self)

if not results:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What case hits this? I'm not immediately seeing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty frames :->

obj = self.obj[data.items[locs]]
s = groupby(obj, self.grouper)
result = s.aggregate(lambda x: alt(x, axis=self.axis))
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give an example? I'm also having trouble seeing this.

"""
try:
result = self._holder._from_sequence(
np.asarray(result).ravel(), dtype=dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the asarray and ravel needed? _from_sequence should take any sequence, so the trip through NumPy shouldn't be needed. Can result be 2-D here? I can't imagine us wanting to ravel a 2-D array, since the length will change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments (short answer is this could be a 2-D numpy array or a EA which is already 1-D)

'std',
'var',
'sem',
pytest.param('median', marks=pytest.mark.xfail(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a know issue we have an issue open for already?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no was discovered here, but I can't repro locally

obj = self.obj[data.items[locs]]
s = groupby(obj, self.grouper)
result = s.aggregate(lambda x: alt(x, axis=self.axis))
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If those return NotImplementedError can we limit the scope of the catching to just that?

lambda x: npfunc(x, axis=self.axis))

# coerce the columns if we can
if isinstance(result, DataFrame):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this logic be in _try_cast? OK as a follow up just want to confirm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, _try_cast is for 1d results, maybe I can make this cleaner

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sligthly refactored

@jreback
Copy link
Contributor Author

jreback commented Jun 8, 2019

push should make this pass, will address comments soon.

@jreback
Copy link
Contributor Author

jreback commented Jun 27, 2019

replacing with a PR just with the categorical grouping changes; the sparse ufuncs is conflicting with what @TomAugspurger is doing and is messy.

@jreback jreback closed this Jun 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug Categorical Categorical Data Type ExtensionArray Extending pandas with custom dtypes or arrays. Groupby Sparse Sparse Data Type

3 participants