(wip) BUG: replacing one column's values was changing other columns' dtypes #31216
Conversation
I think the flattening can be done later in the code, will update this later. EDIT: update done.
Actually, I think we can still do better - will update it later this week. EDIT: done.
```python
    def test_replace_with_nan(self):
        # GH30512
        df = pd.DataFrame({"A": [np.nan, 3], "B": [1.096, 1.086], "C": [1, 2]})
```
Does this only happen if A and B are both float dtypes before replacement?
I think so, because then they're both part of the same FloatBlock and the whole thing gets cast to object, but I'll check
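As an aside, the block layout being described can be seen by poking at the private block manager. A rough sketch, assuming the internal _data attribute, block class names and mgr_locs of that pandas version - these internals move around between releases:

```python
# Rough sketch: inspect pandas' private block manager to see that columns of
# the same dtype are consolidated into a single block.  `_data`, the block
# class names and `mgr_locs` are internal details that vary across versions.
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, 3], "B": [1.096, 1.086], "C": [1, 2]})
for block in df._data.blocks:
    cols = list(df.columns[block.mgr_locs.as_array])
    print(type(block).__name__, block.dtype, cols)
# Roughly: one float64 block holding both "A" and "B", and one int64 block
# holding "C" -- so an upcast of the float block touches both "A" and "B".
```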
Update: it happens with other dtypes, e.g.
```python
>>> data = pd.DataFrame({'B': 2, 'C': 1}, index=[0])
>>> data.dtypes
B    int64
C    int64
dtype: object
>>> data.replace(to_replace={1: None}).dtypes
B    object
C    object
dtype: object
```
and the nan doesn't have anything to do with the issue. Will update test / whatsnew entry tomorrow.
@WillAyd have updated tests with other dtypes and reworded whatsnew
doc/source/whatsnew/v1.1.0.rst Outdated
```rst
- Bug in :class:`Series` construction from NumPy array with big-endian ``datetime64`` dtype (:issue:`29684`)
-
- Bug in :meth:`DataFrame.replace` was casting all columns of the same dtype to the target value's dtype, even for columns which didn't contain
  the value to be replaced (:issue:`30512`)
```
can you remove the newline (just making it one long note is ok)
pandas/core/internals/managers.py Outdated
```python
)
if m.any() or convert:
    new_rb = _extend_blocks(result, new_rb)
    if m.ndim > 1:
```
so you can use _block_shape to make all blocks conform to ndim=2, then you can simply separate this cleanly. this is too nested now
also would be ok to move parts of this to a helper function as this is getting hard to read.
further you might be able to discriminate the blocks on line 603 rather than here.
Thanks for your review - I've tried using _block_shape (from pandas.core.internals.blocks import _block_shape), but then I run into AttributeErrors:
```python
>>> b
IntBlock: 4 dtype: int64
>>> b.values
array([1, 2, 3, 4])
>>> _block_shape(b, ndim=2)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/marco/pandas/pandas/core/internals/blocks.py", line 3053, in _block_shape
    values = values.reshape(tuple((1,) + shape))
AttributeError: 'IntBlock' object has no attribute 'reshape'
```
Sure, will work on the rest.
block_shape operates on an ndarray.
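In other words, a minimal sketch of what that means, assuming the helper is importable from pandas.core.internals.blocks as in the pandas version under discussion; as an internal API, its import path and signature may differ elsewhere:

```python
# Minimal sketch: _block_shape reshapes an ndarray (not a Block) up to the
# requested ndim, so it should be fed block.values rather than the block
# itself.  Import path and signature are assumptions tied to this era of
# pandas internals.
import numpy as np
from pandas.core.internals.blocks import _block_shape

arr = np.array([1, 2, 3, 4])        # e.g. b.values for a 1-D IntBlock
arr_2d = _block_shape(arr, ndim=2)  # reshaped to shape (1, 4)
print(arr_2d.shape)
```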
Thanks for clarifying that. I've spent some time today thinking about this, but it's not clear to me how to "use _block_shape to make all blocks conform to ndim=2" if _block_shape only operates on ndarrays. I can only see how to use it to ensure m is 2-dimensional.
If anyone following along can see how to do this, feel free to take over, it might currently be too advanced for me :) Else I'll keep working on it, it just might take a bit longer.
For now, I've split this part of the code into a helper function.
```python
        expected = df.dtypes.drop("A")
        tm.assert_series_equal(result, expected)

    def test_replace_datetime64_dont_change_other_columns(self, datetime64_dtype):
```
can you parameterize all of these cases into a single test
I thought about that, but these test functions take fixtures as arguments (e.g. datetime64_dtype is defined as a fixture in pandas.conftest.py). Is it possible to parametrize over fixtures? I tried looking around for a way but couldn't find one
Is it possible to parametrize over fixtures?
I think tests.io.excel.test_readers.engine_and_read_ext might be doing something similar to what you need
or tests.arrays.test_integer.all_data might be a better fit
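For context, the pattern in that fixture is roughly the following - a paraphrase rather than an exact copy of the pandas test code, and it assumes data and data_missing fixtures are defined in the same conftest:

```python
# Rough paraphrase of the referenced pattern: a fixture parametrized over the
# *names* of other fixtures it also requests, returning whichever one
# request.param selects.  Assumes `data` and `data_missing` fixtures exist.
import pytest


@pytest.fixture(params=["data", "data_missing"])
def all_data(request, data, data_missing):
    """Return the 'data' or 'data_missing' fixture according to request.param."""
    if request.param == "data":
        return data
    return data_missing
```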
@jbrockmendel thanks for the tip, have updated the test - is this what you had in mind?
update: what I've done is not correct, it generates nearly 90,000 tests (when I made commit 61b8b2c there were only 101) in pandas/tests/frame/methods/test_replace.py
@jbrockmendel @jreback if I write
```python
@pytest.fixture(
    params=[
        "any_real_dtype",
        "string_dtype",
        "datetime64_dtype",
        "timedelta64_dtype",
        "bool",
        "complex_dtype",
        "bytes_dtype",
        "tz_aware_fixture",
    ]
)
def replace_value_a_value_b(
    request,
    any_real_dtype,
    string_dtype,
    datetime64_dtype,
    timedelta64_dtype,
    complex_dtype,
    bytes_dtype,
    tz_aware_fixture,
):
    if request.param == "any_real_dtype":
        return any_real_dtype, 2, 3
    elif request.param == "string_dtype":
        return string_dtype, "a", "b"
    elif request.param == "datetime64_dtype":
        return datetime64_dtype, pd.Timestamp("2020-01-24"), pd.Timestamp("2020-01-25")
    elif request.param == "timedelta64_dtype":
        return timedelta64_dtype, pd.Timedelta(3), pd.Timedelta(4)
    elif request.param == "bool":
        return bool, 1, 0
    elif request.param == "complex_dtype":
        return complex_dtype, 1 + 2j, 1 + 3j
    elif request.param == "bytes_dtype":
        return bytes_dtype, b"a", b"b"
    elif request.param == "tz_aware_fixture":
        return (
            None,
            Timestamp("20130102", tz=tz_aware_fixture),
            Timestamp("20130103", tz=tz_aware_fixture),
        )
```
then nearly 90,000 tests are collected.
An alternative would involve importing the types in here (e.g. ALL_REAL_DTYPES) and iterating over them but, as I understand, importing from conftest is discouraged.
Any other tips on how to do this? It seems parametrisation over fixtures has been proposed in pytest, but it's not there yet - https://docs.pytest.org/en/latest/proposals/parametrize_with_fixtures.html
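One alternative worth noting, sketched here as an illustration and not as the PR's actual test (the test name, dtypes and values below are made up): parametrize the test directly over explicit (dtype, value_a, value_b) triples, so the number of collected tests stays proportional to the length of the list rather than the product of all the fixtures.

```python
# Illustrative sketch only (not the PR's test): parametrize over explicit
# (dtype, value_a, value_b) triples instead of over fixtures.
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "dtype,value_a,value_b",
    [
        ("int64", 2, 3),
        ("float64", 2.0, 3.0),
        ("complex128", 1 + 2j, 1 + 3j),
        ("datetime64[ns]", pd.Timestamp("2020-01-24"), pd.Timestamp("2020-01-25")),
        ("timedelta64[ns]", pd.Timedelta(3), pd.Timedelta(4)),
    ],
)
def test_replace_does_not_touch_other_columns(dtype, value_a, value_b):
    df = pd.DataFrame(
        {"A": pd.Series([value_a, value_b], dtype=dtype), "B": [0, 1]}
    )
    result = df.replace({"A": {value_a: value_b}})
    # GH30512: replacing values in "A" must not change the dtype of "B"
    assert result["B"].dtype == df["B"].dtype
```

The trade-off is that the explicit list covers fewer dtype variants than the conftest fixtures do, which is presumably why fixtures were attempted in the first place.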
pandas/core/internals/managers.py Outdated
```python
new_rb = _extend_blocks(result, new_rb)
if m.ndim > 1:
    to_replace = m.any(axis=1)
    subblocks = [
```
here you should instead use _split_op_result which creates the individual split blocks
this is too nested and not understandable.
pandas/core/internals/managers.py Outdated
```python
subblocks = [block]
subblock_masks = [mask]

for subblock, subblock_mask in zip(subblocks, subblock_masks):
```
any chance we can use Block.split_and_operate for some of this?
@jbrockmendel thanks for the suggestion - just looked into it, and its docstring says
```
f : callable accepting (1d-mask, 1d values, indexer)
```
while here we may be operating on multidimensional blocks, so it's not obvious to me how to use this here. Have marked this WIP while I update according to your other comments and think about this one.
```python
        "bool",
        "complex_dtype",
        "bytes_dtype",
        "tz_aware_fixture",
```
i think i see what this is doing, but at first glance it looks like a weird usage. can you add a docstring for when i inevitably forget?
pandas/core/internals/blocks.py Outdated
```python
        )
        return self

    def _discriminate_replace_and_coerce_block(
```
could we re-use _split_and_operate here?
we also do something kinda-similar at the end of Block.where with for m in [mask, ~mask], maybe something can be shared
@jbrockmendel would there be performance implications to operating on each column separately (which, as I understand, split_and_operate would do)?
Also, if I call split_and_operate, then it will operate on self.values, which is a numpy array. But the method I need to apply (_replace_and_coerce) is a block method - so, within the function f which split_and_operate accepts, would I need to convert the values it accepts from a numpy array to a block?
I would rather take a perf hit (maybe) than add a ton more code.
would there be performance implications to operating on each column separately (which, as I understand, split_and_operate would do)?
Only a small one. Building on what @jreback said, the replace_foo code is already obscenely complicated, so a small perf hit is better than making that worse.
Thanks for the explanation - I've tried writing this using split_and_operate, though it seems a bit complicated - is this kind of what you had in mind @jbrockmendel ?
Currently, this works for:
- any_real_dtype
- datetime64_dtype
- timedelta64_dtype
- complex_dtype
doesn't work for:
- string_dtype
- bytes_dtype
- tz_aware_fixture
What I'm struggling with is that the method I'd like to apply is a block method, but inside the f function in split_and_operate, I would only have access to an ndarray
but inside the f function in split_and_operate, I would only have access to an ndarray
Are you sure? Looking at Block.fillna it passes a function f to split_and_operate that operates on blocks.
@jbrockmendel thanks for pointing me to that, that was useful.
Something I'm stuck on is that if I try writing this as
```python
def f(mask, values, idx):
    if idx is not None:
        block = b.getitem_block(slice(idx, idx + 1))
    else:
        block = b
    return block._replace_coerce(
        mask=mask.reshape(block.shape),
        to_replace=s,
        value=d,
        inplace=inplace,
        convert=convert,
        regex=regex,
    )

if m.any() or convert:
    result = b.split_and_operate(m, f, inplace)
    new_rb = _extend_blocks(result, new_rb)
else:
    new_rb.append(b)
```
then I run into the following problem: if b is an ObjectBlock, then b.values is also an ObjectBlock, which causes a problem when it's passed to split_and_operate:
```
(Pdb) !b
ObjectBlock: slice(1, 2, 1), 1 x 4, dtype: object
(Pdb) !b.values
ObjectBlock: slice(1, 2, 1), 1 x 4, dtype: object
(Pdb) !b.values[0]
*** TypeError: 'ObjectBlock' object is not subscriptable
```
This is in contrast to when b is an IntBlock, when it works fine:
```
(Pdb) !b
IntBlock: slice(0, 1, 1), 1 x 4, dtype: int64
(Pdb) !b.values
array([[0, 1, 2, 3]])
(Pdb) !b.values[0]
array([0, 1, 2, 3])
```
The code, as it is, would solve issue 30512, but there are problems that arise in other tests because of the above, which I still need to make sense of.