@auderson
Contributor

@auderson auderson commented Feb 27, 2025

This PR is a one-line change to the code here:

pandas/pandas/core/generic.py

Lines 9735 to 9737 in d1ec1a4

for _dt in cond.dtypes:
    if not is_bool_dtype(_dt):
        raise TypeError(msg.format(dtype=_dt))

Performance comparison:

Test script:

import numpy as np
import pandas as pd
import timeit

for width in [10, 1000, 1000_00, 1000_0000]:
    df = pd.DataFrame(np.random.randn(1, width))
    mask = df > 0.5
    tm = timeit.timeit("df.where(mask)", number=10, globals=globals())
    print(width, tm)

for _dt in cond.dtypes:

10 0.002963045029900968
1000 0.006705133942887187
100000 0.40306550299283117
10000000 46.55275956704281

for _dt in cond.dtypes.unique():

10 0.0028260269900783896
1000 0.002695770002901554
100000 0.042065858957357705
10000000 6.068146598991007

for _dt in [blk.dtype for blk in cond._mgr.blocks]:

10 0.0009857049444690347
1000 0.0011893719201907516
100000 0.003112988080829382
10000000 0.13763279700651765
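
The three variants timed above can be compared side by side. The following is a minimal sketch, not the merged code: the helper function names are hypothetical, and `_mgr.blocks` is private pandas API that may change between versions. It only illustrates that all three loops reach the same verdict while visiting very different numbers of dtypes.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_bool_dtype


def all_bool_per_column(cond: pd.DataFrame) -> bool:
    # One dtype per column: the work grows with the frame width.
    return all(is_bool_dtype(dt) for dt in cond.dtypes)


def all_bool_unique(cond: pd.DataFrame) -> bool:
    # Still builds the full dtypes Series before deduplicating it.
    return all(is_bool_dtype(dt) for dt in cond.dtypes.unique())


def all_bool_per_block(cond: pd.DataFrame) -> bool:
    # Assumption: relies on the private `_mgr.blocks` attribute.
    # One dtype per internal block instead of one per column.
    return all(is_bool_dtype(blk.dtype) for blk in cond._mgr.blocks)


df = pd.DataFrame(np.random.randn(1, 1000))
mask = df > 0.5

# Every check accepts the boolean mask and rejects the float frame.
checks = (all_bool_per_column, all_bool_unique, all_bool_per_block)
mask_results = [check(mask) for check in checks]
float_results = [check(df) for check in checks]
```

A frame built from a single 2-D array typically holds far fewer blocks than columns, which would explain why the block-based loop scales so much better in the timings above.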

@mroeschke mroeschke added the Performance (memory or execution speed performance) label Feb 27, 2025
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
@auderson
Contributor Author

@mroeschke
Hi, do I need to add this in whatsnew? 2.3.0 or 3.0.0?

@auderson auderson changed the title from "PERF: use blk.dtype in where()" to "PERF: use blk.dtype in where() & _setitem_frame()" Feb 28, 2025
@auderson
Contributor Author

This pattern is also found in _setitem_frame:

pandas/pandas/core/frame.py

Lines 4276 to 4289 in 5da9eb7

def _setitem_frame(self, key, value) -> None:
    # support boolean setting with DataFrame input, e.g.
    # df[df > df2] = 0
    if isinstance(key, np.ndarray):
        if key.shape != self.shape:
            raise ValueError("Array conditional must be same shape as self")
        key = self._constructor(key, **self._construct_axes_dict(), copy=False)

    if key.size and not all(is_bool_dtype(dtype) for dtype in key.dtypes):
        raise TypeError(
            "Must pass DataFrame or 2-d ndarray with boolean values only"
        )

    self._where(-key, value, inplace=True)
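
For context, this is the user-facing path that reaches the dtype check quoted above. The snippet is a small sketch using only public pandas API; the frame contents and the `error_message` variable are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(6, dtype="float64").reshape(2, 3), columns=list("abc")
)

# Boolean DataFrame key: cells where the mask is True are overwritten.
df[df > 2] = 0

# Non-boolean DataFrame key: the dtype check rejects it with a TypeError.
error_message = ""
try:
    df[df.astype("int64")] = 0
except TypeError as exc:
    error_message = str(exc)
```

Because the check runs on every such assignment, its cost on very wide frames is paid before any values are written.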

Before: [image]

After: [image]

@mroeschke
Member

> Hi, do I need to add this in whatsnew? 2.3.0 or 3.0.0?

v3.0.0.rst please

@mroeschke mroeschke added this to the 3.0 milestone Feb 28, 2025
@mroeschke mroeschke merged commit 928fb7e into pandas-dev:main Feb 28, 2025
42 checks passed
@mroeschke
Member

Thanks @auderson

