(wip) BUG: replacing one column's values was changing other columns' dtypes #31216
Conversation
I think the flattening can be done later in the code, will update this later. EDIT: update done.
Actually, I think we can still do better - will update it later this week. EDIT: done.
```python
    def test_replace_with_nan(self):
        # GH30512
        df = pd.DataFrame({"A": [np.nan, 3], "B": [1.096, 1.086], "C": [1, 2]})
```
Does this only happen if A and B are both float dtypes before replacement?
I think so, because then they're both part of the same FloatBlock and the whole thing gets cast to object, but I'll check
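As an aside, the block layout being described can be seen by poking at the private block manager. A rough sketch, assuming the internal _data attribute, block class names and mgr_locs of that pandas version - these internals move around between releases:

```python
# Rough sketch: inspect pandas' private block manager to see that columns of
# the same dtype are consolidated into a single block.  `_data`, the block
# class names and `mgr_locs` are internal details that vary across versions.
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, 3], "B": [1.096, 1.086], "C": [1, 2]})
for block in df._data.blocks:
    cols = list(df.columns[block.mgr_locs.as_array])
    print(type(block).__name__, block.dtype, cols)
# Roughly: one float64 block holding both "A" and "B", and one int64 block
# holding "C" -- so an upcast of the float block touches both "A" and "B".
```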
Update: it happens with other dtypes, e.g.
```python
>>> data = pd.DataFrame({'B': 2, 'C': 1}, index=[0])
>>> data.dtypes
B    int64
C    int64
dtype: object
>>> data.replace(to_replace={1: None}).dtypes
B    object
C    object
dtype: object
```
and the nan doesn't have anything to do with the issue. Will update test / whatsnew entry tomorrow.
@WillAyd have updated tests with other dtypes and reworded whatsnew
doc/source/whatsnew/v1.1.0.rst Outdated
```rst
- Bug in :class:`Series` construction from NumPy array with big-endian ``datetime64`` dtype (:issue:`29684`)
-
- Bug in :meth:`DataFrame.replace` was casting all columns of the same dtype to the target value's dtype, even for columns which didn't contain
  the value to be replaced (:issue:`30512`)
```
can you remove the newline (just making it one long note is ok)
pandas/core/internals/managers.py Outdated
```python
)
if m.any() or convert:
    new_rb = _extend_blocks(result, new_rb)
    if m.ndim > 1:
```
so you can use _block_shape to make all blocks conform to ndim=2, then you can simply separate this cleanly. this is too nested now
also would be ok to move parts of this to a helper function as this is getting hard to read.
further you might be able to discriminate the blocks on line 603 rather than here.
Thanks for your review - I've tried using _block_shape (from pandas.core.internals.blocks import _block_shape), but then I run into AttributeErrors:
```python
>>> b
IntBlock: 4 dtype: int64
>>> b.values
array([1, 2, 3, 4])
>>> _block_shape(b, ndim=2)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/marco/pandas/pandas/core/internals/blocks.py", line 3053, in _block_shape
    values = values.reshape(tuple((1,) + shape))
AttributeError: 'IntBlock' object has no attribute 'reshape'
```
Sure, will work on the rest.
block_shape operates on an ndarray.
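In other words, a minimal sketch of what that means, assuming the helper is importable from pandas.core.internals.blocks as in the pandas version under discussion; as an internal API, its import path and signature may differ elsewhere:

```python
# Minimal sketch: _block_shape reshapes an ndarray (not a Block) up to the
# requested ndim, so it should be fed block.values rather than the block
# itself.  Import path and signature are assumptions tied to this era of
# pandas internals.
import numpy as np
from pandas.core.internals.blocks import _block_shape

arr = np.array([1, 2, 3, 4])        # e.g. b.values for a 1-D IntBlock
arr_2d = _block_shape(arr, ndim=2)  # reshaped to shape (1, 4)
print(arr_2d.shape)
```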
Thanks for clarifying that. I've spent some time today thinking about this, but it's not clear to me how to "use _block_shape to make all blocks conform to ndim=2" if _block_shape only operates on ndarrays. I can only see how to use it to ensure m is 2-dimensional.
If anyone following along can see how to do this, feel free to take over, it might currently be too advanced for me :) Else I'll keep working on it, it just might take a bit longer.
For now, I've split this part of the code into a helper function.
```python
        expected = df.dtypes.drop("A")
        tm.assert_series_equal(result, expected)

    def test_replace_datetime64_dont_change_other_columns(self, datetime64_dtype):
```
can you parameterize all of these cases into a single test
I thought about that, but these test functions take fixtures as arguments (e.g. datetime64_dtype is defined as a fixture in pandas.conftest.py). Is it possible to parametrize over fixtures? I tried looking around for a way but couldn't find one
Is it possible to parametrize over fixtures?
I think tests.io.excel.test_readers.engine_and_read_ext might be doing something similar to what you need
or tests.arrays.test_integer.all_data might be a better fit
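For context, the pattern in that fixture is roughly the following - a paraphrase rather than an exact copy of the pandas test code, and it assumes data and data_missing fixtures are defined in the same conftest:

```python
# Rough paraphrase of the referenced pattern: a fixture parametrized over the
# *names* of other fixtures it also requests, returning whichever one
# request.param selects.  Assumes `data` and `data_missing` fixtures exist.
import pytest


@pytest.fixture(params=["data", "data_missing"])
def all_data(request, data, data_missing):
    """Return the 'data' or 'data_missing' fixture according to request.param."""
    if request.param == "data":
        return data
    return data_missing
```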
@jbrockmendel thanks for the tip, have updated the test - is this what you had in mind?
update: what I've done is not correct, it generates nearly 90,000 tests (when I made commit 61b8b2c there were only 101) in pandas/tests/frame/methods/test_replace.py
@jbrockmendel @jreback if I write
```python
@pytest.fixture(
    params=[
        "any_real_dtype",
        "string_dtype",
        "datetime64_dtype",
        "timedelta64_dtype",
        "bool",
        "complex_dtype",
        "bytes_dtype",
        "tz_aware_fixture",
    ]
)
def replace_value_a_value_b(
    request,
    any_real_dtype,
    string_dtype,
    datetime64_dtype,
    timedelta64_dtype,
    complex_dtype,
    bytes_dtype,
    tz_aware_fixture,
):
    if request.param == "any_real_dtype":
        return any_real_dtype, 2, 3
    elif request.param == "string_dtype":
        return string_dtype, "a", "b"
    elif request.param == "datetime64_dtype":
        return datetime64_dtype, pd.Timestamp("2020-01-24"), pd.Timestamp("2020-01-25")
    elif request.param == "timedelta64_dtype":
        return timedelta64_dtype, pd.Timedelta(3), pd.Timedelta(4)
    elif request.param == "bool":
        return bool, 1, 0
    elif request.param == "complex_dtype":
        return complex_dtype, 1 + 2j, 1 + 3j
    elif request.param == "bytes_dtype":
        return bytes_dtype, b"a", b"b"
    elif request.param == "tz_aware_fixture":
        return (
            None,
            Timestamp("20130102", tz=tz_aware_fixture),
            Timestamp("20130103", tz=tz_aware_fixture),
        )
```
then nearly 90,000 tests are collected.
An alternative would involve importing the types in here (e.g. ALL_REAL_DTYPES) and iterating over them but, as I understand, importing from conftest is discouraged.
Any other tips on how to do this? It seems parametrisation over fixtures has been proposed in pytest, but it's not there yet - https://docs.pytest.org/en/latest/proposals/parametrize_with_fixtures.html
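One alternative worth noting, sketched here as an illustration and not as the PR's actual test (the test name, dtypes and values below are made up): parametrize the test directly over explicit (dtype, value_a, value_b) triples, so the number of collected tests stays proportional to the length of the list rather than the product of all the fixtures.

```python
# Illustrative sketch only (not the PR's test): parametrize over explicit
# (dtype, value_a, value_b) triples instead of over fixtures.
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "dtype,value_a,value_b",
    [
        ("int64", 2, 3),
        ("float64", 2.0, 3.0),
        ("complex128", 1 + 2j, 1 + 3j),
        ("datetime64[ns]", pd.Timestamp("2020-01-24"), pd.Timestamp("2020-01-25")),
        ("timedelta64[ns]", pd.Timedelta(3), pd.Timedelta(4)),
    ],
)
def test_replace_does_not_touch_other_columns(dtype, value_a, value_b):
    df = pd.DataFrame(
        {"A": pd.Series([value_a, value_b], dtype=dtype), "B": [0, 1]}
    )
    result = df.replace({"A": {value_a: value_b}})
    # GH30512: replacing values in "A" must not change the dtype of "B"
    assert result["B"].dtype == df["B"].dtype
```

The trade-off is that the explicit list covers fewer dtype variants than the conftest fixtures do, which is presumably why fixtures were attempted in the first place.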
pandas/core/internals/managers.py Outdated
```python
new_rb = _extend_blocks(result, new_rb)
if m.ndim > 1:
    to_replace = m.any(axis=1)
    subblocks = [
```
here you should instead use _split_op_result which creates the individual split blocks
this is too nested and not understandable.
pandas/core/internals/managers.py Outdated
```python
subblocks = [block]
subblock_masks = [mask]

for subblock, subblock_mask in zip(subblocks, subblock_masks):
```
any chance we can use Block.split_and_operate for some of this?
@jbrockmendel thanks for the suggestion - just looked into it, and its docstring says
```
f : callable accepting (1d-mask, 1d values, indexer)
```
while here we may be operating on multidimensional blocks, so it's not obvious to me how to use this here. Have marked this WIP while I update according to your other comments and think about this one.
```python
        "bool",
        "complex_dtype",
        "bytes_dtype",
        "tz_aware_fixture",
```
i think i see what this is doing, but at first glance it looks like a weird usage. can you add a docstring for when i inevitably forget?
pandas/core/internals/blocks.py Outdated
```python
        )
        return self

    def _discriminate_replace_and_coerce_block(
```
could we re-use _split_and_operate here?
we also do something kinda-similar at the end of Block.where with for m in [mask, ~mask], maybe something can be shared
@jbrockmendel would there be performance implications to operating on each column separately (which, as I understand, split_and_operate would do)?
Also, if I call split_and_operate, then it will operate on self.values, which is a numpy array. But the method I need to apply (_replace_and_coerce) is a block method - so, within the function f which split_and_operate accepts, would I need to convert the values it accepts from a numpy array to a block?
I would rather take a perf hit (maybe) than add a ton more code.
would there be performance implications to operating on each column separately (which, as I understand, split_and_operate would do)?
Only a small one. Building on what @jreback said, the replace_foo code is already obscenely complicated, so a small perf hit is better than making that worse.
Thanks for the explanation - I've tried writing this using split_and_operate, though it seems a bit complicated - is this kind of what you had in mind @jbrockmendel ?
Currently, this works for:
- any_real_dtype
- datetime64_dtype
- timedelta64_dtype
- complex_dtype
doesn't work for:
- string_dtype
- bytes_dtype
- tz_aware_fixture
What I'm struggling with is that the method I'd like to apply is a block method, but inside the f function in split_and_operate, I would only have access to an ndarray
but inside the f function in split_and_operate, I would only have access to an ndarray
Are you sure? Looking at Block.fillna it passes a function f to split_and_operate that operates on blocks.
@jbrockmendel thanks for pointing me to that, that was useful.
Something I'm stuck on is that if I try writing this as
```python
def f(mask, values, idx):
    if idx is not None:
        block = b.getitem_block(slice(idx, idx + 1))
    else:
        block = b
    return block._replace_coerce(
        mask=mask.reshape(block.shape),
        to_replace=s,
        value=d,
        inplace=inplace,
        convert=convert,
        regex=regex,
    )

if m.any() or convert:
    result = b.split_and_operate(m, f, inplace)
    new_rb = _extend_blocks(result, new_rb)
else:
    new_rb.append(b)
```
then I run into the following problem: if b is an ObjectBlock, then b.values is also an ObjectBlock, which causes a problem when it's passed to split_and_operate:
```
(Pdb) !b
ObjectBlock: slice(1, 2, 1), 1 x 4, dtype: object
(Pdb) !b.values
ObjectBlock: slice(1, 2, 1), 1 x 4, dtype: object
(Pdb) !b.values[0]
*** TypeError: 'ObjectBlock' object is not subscriptable
```
This is in contrast to when b is an IntBlock, when it works fine:
```
(Pdb) !b
IntBlock: slice(0, 1, 1), 1 x 4, dtype: int64
(Pdb) !b.values
array([[0, 1, 2, 3]])
(Pdb) !b.values[0]
array([0, 1, 2, 3])
```
The code, as it is, would solve issue 30512, but there are problems that arise in other tests because of the above, which I still need to make sense of.