
Conversation

@MarcoGorelli MarcoGorelli commented Jan 22, 2020

…s' dtypes


MarcoGorelli commented Jan 23, 2020

I think the flattening can be done later in the code, will update this later

EDIT

update done

@MarcoGorelli MarcoGorelli changed the title [BUG] replacing one columns values with nan was changing other column… (wip) [BUG] replacing one columns values with nan was changing other column… Jan 23, 2020
@MarcoGorelli MarcoGorelli changed the title (wip) [BUG] replacing one columns values with nan was changing other column… [BUG] replacing one columns values with nan was changing other column… Jan 23, 2020
@MarcoGorelli MarcoGorelli force-pushed the issue-30512 branch 4 times, most recently from c4fbf42 to 274d0cd Compare January 23, 2020 13:33

MarcoGorelli commented Jan 23, 2020

Actually, I think we can still do better - will update it later this week

EDIT

done


def test_replace_with_nan(self):
# GH30512
df = pd.DataFrame({"A": [np.nan, 3], "B": [1.096, 1.086], "C": [1, 2]})
Member

Does this only happen if A and B are both float dtypes before replacement?

Member Author

I think so, because then they're both part of the same FloatBlock and the whole thing gets cast to object, but I'll check

Member Author

@MarcoGorelli MarcoGorelli Jan 23, 2020

Update: it happens with other dtypes, e.g.

```python
>>> data = pd.DataFrame({'B': 2, 'C': 1}, index=[0])
>>> data.dtypes
B    int64
C    int64
dtype: object
>>> data.replace(to_replace={1: None}).dtypes
B    object
C    object
dtype: object
```

and the nan doesn't have anything to do with the issue. Will update test / whatsnew entry tomorrow
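The mechanism described here — columns of one dtype sharing a single block, so an upcast triggered by one column drags its siblings along — can be sketched with a toy model (purely illustrative; the class and names below are made up, not pandas' actual internals):

```python
# Toy model of pandas' block consolidation (illustrative only -- the real
# internals live in pandas/core/internals, and ToyBlock is a made-up name).
class ToyBlock:
    def __init__(self, columns, values, dtype):
        self.columns = columns  # column names sharing this single block
        self.values = values    # one list of values per column
        self.dtype = dtype      # ONE dtype for the whole block

    def replace(self, to_replace, value):
        # If any column in the block contains the target and the new value
        # doesn't fit the block's dtype, the WHOLE block is upcast to
        # object -- including columns that never contained the value.
        hit = any(v == to_replace for col in self.values for v in col)
        if hit and not isinstance(value, int):
            self.dtype = "object"
        self.values = [
            [value if v == to_replace else v for v in col] for col in self.values
        ]


blk = ToyBlock(["B", "C"], [[2], [1]], "int64")  # mirrors the example above
blk.replace(1, None)   # only column C contains 1...
print(blk.dtype)       # 'object' -- ...but B's dtype changed too
```

This is why the bug is independent of nan: any replacement value that forces an upcast affects every column consolidated into the same block.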

Member Author

@WillAyd have updated tests with other dtypes and reworded whatsnew

@WillAyd WillAyd added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Jan 23, 2020
@MarcoGorelli MarcoGorelli changed the title [BUG] replacing one columns values with nan was changing other column… (wip) [BUG] replacing one columns values with nan was changing other column… Jan 23, 2020
@MarcoGorelli MarcoGorelli changed the title (wip) [BUG] replacing one columns values with nan was changing other column… (wip) [BUG] replacing one columns values was changing other columns' dtypes Jan 23, 2020
@MarcoGorelli MarcoGorelli changed the title (wip) [BUG] replacing one columns values was changing other columns' dtypes (wip) [BUG] replacing one column's values was changing other columns' dtypes Jan 23, 2020
@MarcoGorelli MarcoGorelli changed the title (wip) [BUG] replacing one column's values was changing other columns' dtypes [BUG] replacing one column's values was changing other columns' dtypes Jan 24, 2020
@MarcoGorelli MarcoGorelli changed the title [BUG] replacing one column's values was changing other columns' dtypes BUG: replacing one column's values was changing other columns' dtypes Jan 25, 2020
- Bug in :class:`Series` construction from NumPy array with big-endian ``datetime64`` dtype (:issue:`29684`)
-
- Bug in :meth:`DataFrame.replace` was casting all columns of the same dtype to the target value's dtype, even for columns which didn't contain
the value to be replaced (:issue:`30512`)
Contributor

can you remove the newline (just making it one long note is ok)

)
if m.any() or convert:
new_rb = _extend_blocks(result, new_rb)
if m.ndim > 1:
Contributor

so you can use _block_shape to make all blocks conform to ndim=2, then you can simply separate this cleanly. this is too nested now

also would be ok to move parts of this to a helper function as this is getting hard to read.

further you might be able to discriminate the blocks on line 603 rather than here.

Member Author

Thanks for your review - I've tried using `_block_shape` (`from pandas.core.internals.blocks import _block_shape`), but then I run into AttributeErrors:

```python
>>> b
IntBlock: 4 dtype: int64
>>> b.values
array([1, 2, 3, 4])
>>> _block_shape(b, ndim=2)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/marco/pandas/pandas/core/internals/blocks.py", line 3053, in _block_shape
    values = values.reshape(tuple((1,) + shape))
AttributeError: 'IntBlock' object has no attribute 'reshape'
```

sure, will work on the rest

Contributor

`_block_shape` operates on an ndarray.

Member Author

@MarcoGorelli MarcoGorelli Jan 26, 2020

Thanks for clarifying that. I've spent some time today thinking about this, but it's not clear to me how to

> use _block_shape to make all blocks conform to ndim=2

if _block_shape only operates on ndarrays. I can only see how to use it to ensure m is 2-dimensional.

If anyone following along can see how to do this, feel free to take over, it might currently be too advanced for me :) Else I'll keep working on it, it just might take a bit longer.

For now, I've split this part of the code into a helper function.

expected = df.dtypes.drop("A")
tm.assert_series_equal(result, expected)

def test_replace_datetime64_dont_change_other_columns(self, datetime64_dtype):
Contributor

can you parameterize all of these cases into a single test

Member Author

@MarcoGorelli MarcoGorelli Jan 26, 2020

I thought about that, but these test functions take fixtures as arguments (e.g. `datetime64_dtype` is defined as a fixture in pandas/conftest.py). Is it possible to parametrize over fixtures? I tried looking around for a way but couldn't find one.

Member

> Is it possible to parametrize over fixtures?

I think tests.io.excel.test_readers.engine_and_read_ext might be doing something similar to what you need

Member

or tests.arrays.test_integer.all_data might be a better fit

Member Author

@MarcoGorelli MarcoGorelli Feb 25, 2020

@jbrockmendel thanks for the tip, have updated the test - is this what you had in mind?

Member Author

update: what I've done is not correct, it generates nearly 90,000 tests (when I made commit 61b8b2c there were only 101) in pandas/tests/frame/methods/test_replace.py


Member Author

@jbrockmendel @jreback if I write

```python
@pytest.fixture(
    params=[
        "any_real_dtype",
        "string_dtype",
        "datetime64_dtype",
        "timedelta64_dtype",
        "bool",
        "complex_dtype",
        "bytes_dtype",
        "tz_aware_fixture",
    ]
)
def replace_value_a_value_b(
    request,
    any_real_dtype,
    string_dtype,
    datetime64_dtype,
    timedelta64_dtype,
    complex_dtype,
    bytes_dtype,
    tz_aware_fixture,
):
    if request.param == "any_real_dtype":
        return any_real_dtype, 2, 3
    elif request.param == "string_dtype":
        return string_dtype, "a", "b"
    elif request.param == "datetime64_dtype":
        return datetime64_dtype, pd.Timestamp("2020-01-24"), pd.Timestamp("2020-01-25")
    elif request.param == "timedelta64_dtype":
        return timedelta64_dtype, pd.Timedelta(3), pd.Timedelta(4)
    elif request.param == "bool":
        return bool, 1, 0
    elif request.param == "complex_dtype":
        return complex_dtype, 1 + 2j, 1 + 3j
    elif request.param == "bytes_dtype":
        return bytes_dtype, b"a", b"b"
    elif request.param == "tz_aware_fixture":
        return (
            None,
            Timestamp("20130102", tz=tz_aware_fixture),
            Timestamp("20130103", tz=tz_aware_fixture),
        )
```

then nearly 90,000 tests are collected.

An alternative would involve importing the types in here (e.g. ALL_REAL_DTYPES) and iterating over them but, as I understand, importing from conftest is discouraged.

Any other tips on how to do this? It seems parametrisation over fixtures has been proposed in pytest, but it's not there yet - https://docs.pytest.org/en/latest/proposals/parametrize_with_fixtures.html
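The collection explosion is simply pytest computing the cartesian product of every parametrized fixture the meta-fixture requests: one test per combination. A back-of-the-envelope sketch (the per-fixture parameter counts below are hypothetical, chosen only to show the order of magnitude, not pandas' actual conftest counts):

```python
from math import prod

# Hypothetical parameter counts for the requested fixtures (illustrative only).
param_counts = {
    "any_real_dtype": 12,
    "string_dtype": 2,
    "datetime64_dtype": 2,
    "timedelta64_dtype": 2,
    "complex_dtype": 2,
    "bytes_dtype": 2,
    "tz_aware_fixture": 10,
}

# The meta-fixture adds 8 params of its own, and every test using it is
# parametrized over the PRODUCT of all requested fixtures' params,
# even though each test only ever uses one of them.
collected = 8 * prod(param_counts.values())
print(collected)  # 30720 -- tens of thousands of tests from a handful of fixtures
```

Because the fixture arguments are resolved eagerly, requesting all of them multiplies the matrix regardless of the `request.param` branch actually taken.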


new_rb = _extend_blocks(result, new_rb)
if m.ndim > 1:
to_replace = m.any(axis=1)
subblocks = [
Contributor

here you should instead use _split_op_result which creates the individual split blocks

this is too nested and not understandable.

@MarcoGorelli MarcoGorelli force-pushed the issue-30512 branch 3 times, most recently from 30d901a to 0f6172f Compare February 2, 2020 08:27
@MarcoGorelli MarcoGorelli force-pushed the issue-30512 branch 2 times, most recently from dc8a336 to 6ea3a7a Compare February 19, 2020 13:24
subblocks = [block]
subblock_masks = [mask]

for subblock, subblock_mask in zip(subblocks, subblock_masks):
Member

any chance we can use Block.split_and_operate for some of this?

Member Author

@MarcoGorelli MarcoGorelli Feb 22, 2020

@jbrockmendel thanks for the suggestion - just looked into it and it says

> f : callable accepting (1d-mask, 1d values, indexer)

while here we may be operating on multidimensional blocks, so it's not obvious to me how to use this here. Have marked this WIP while I update according to your other comments and think about this one
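For anyone following along, the generic shape of that pattern — splitting a 2-D block into 1-D rows and operating on only the rows that need it — can be sketched independently of pandas' actual `Block.split_and_operate` (names here are hypothetical, with plain lists standing in for Block objects):

```python
def split_and_operate(values, mask, f):
    """Split a 2-D values/mask pair into 1-D rows, apply f only to rows
    where the mask has a hit, and collect the results.

    f : callable accepting (1d-mask, 1d-values, row-index)
    """
    results = []
    for idx, (row, mask_row) in enumerate(zip(values, mask)):
        if any(mask_row):
            # only rows with a hit are transformed (and may change dtype)...
            results.append(f(mask_row, row, idx))
        else:
            # ...untouched rows keep their original values as-is
            results.append(row)
    return results


values = [[0, 1, 2], [3, 4, 5]]
mask = [[False, True, False], [False, False, False]]

# Replace masked positions with None; only row 0 is rewritten.
out = split_and_operate(
    values, mask, lambda m, row, i: [None if hit else v for hit, v in zip(m, row)]
)
print(out)  # [[0, None, 2], [3, 4, 5]]
```

The per-row split is what isolates the upcast: a row with no match is passed through untouched, which is exactly the behaviour the fix is after.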

@MarcoGorelli MarcoGorelli changed the title BUG: replacing one column's values was changing other columns' dtypes (wip) BUG: replacing one column's values was changing other columns' dtypes Feb 22, 2020
"bool",
"complex_dtype",
"bytes_dtype",
"tz_aware_fixture",
Member

i think i see what this is doing, but at first glance it looks like a weird usage. can you add a docstring for when i inevitably forget?

)
return self

def _discriminate_replace_and_coerce_block(
Member

could we re-use _split_and_operate here?

Member

we also do something kinda-similar at the end of Block.where with `for m in [mask, ~mask]`, maybe something can be shared

Member Author

@MarcoGorelli MarcoGorelli Apr 10, 2020

@jbrockmendel would there be performance implications to operating on each column separately (which, as I understand, split_and_operate would do)?

Also, if I call split_and_operate, then it will operate on self.values, which is a numpy array. But the method I need to apply (_replace_and_coerce) is a block method - so, within the function f which split_and_operate accepts, would I need to convert the values it accepts from a numpy array to a block?

Contributor

I would rather take a perf hit (maybe) than add a ton more code.

Member

> would there be performance implications to operating on each column separately (which, as I understand, split_and_operate would do)?

Only a small one. Building on what @jreback said, the replace_foo code is already obscenely complicated, so a small perf hit is better than making that worse.

Member Author

Thanks for the explanation - I've tried writing this using split_and_operate, though it seems a bit complicated - is this kind of what you had in mind @jbrockmendel ?

Currently, this works for:

  • any_real_dtype
  • datetime64_dtype
  • timedelta64_dtype
  • complex_dtype

doesn't work for:

  • string_dtype
  • bytes_dtype
  • tz_aware_fixture
Member Author

What I'm struggling with is that method I'd like to apply is a block method, but inside the f function in split_and_operate, I would only have access to an ndarray

Member

> but inside the f function in split_and_operate, I would only have access to an ndarray

Are you sure? Looking at Block.fillna it passes a function f to split_and_operate that operates on blocks.

Member Author

@MarcoGorelli MarcoGorelli Jun 14, 2020

@jbrockmendel thanks for pointing me to that, that was useful.

Something I'm stuck on is that if I try writing this as

```python
def f(mask, values, idx):
    if idx is not None:
        block = b.getitem_block(slice(idx, idx + 1))
    else:
        block = b
    return block._replace_coerce(
        mask=mask.reshape(block.shape),
        to_replace=s,
        value=d,
        inplace=inplace,
        convert=convert,
        regex=regex,
    )

if m.any() or convert:
    result = b.split_and_operate(m, f, inplace)
    new_rb = _extend_blocks(result, new_rb)
else:
    new_rb.append(b)
```

then I run into the following problem: if `b` is an ObjectBlock, then `b.values` is also an ObjectBlock, which causes a problem when it's passed to split_and_operate:

```
(Pdb) !b
ObjectBlock: slice(1, 2, 1), 1 x 4, dtype: object
(Pdb) !b.values
ObjectBlock: slice(1, 2, 1), 1 x 4, dtype: object
(Pdb) !b.values[0]
*** TypeError: 'ObjectBlock' object is not subscriptable
```

This is in contrast to when `b` is an IntBlock, where it works fine:

```
(Pdb) !b
IntBlock: slice(0, 1, 1), 1 x 4, dtype: int64
(Pdb) !b.values
array([[0, 1, 2, 3]])
(Pdb) !b.values[0]
array([0, 1, 2, 3])
```

The code, as it is, would solve issue 30512, but there are problems that arise in other tests because of the above, which I still need to make sense of

@MarcoGorelli MarcoGorelli marked this pull request as draft June 13, 2020 10:49
@MarcoGorelli MarcoGorelli deleted the issue-30512 branch October 10, 2020 14:14

Labels

Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

4 participants