Skip to content

Conversation

@lukemanley
Copy link
Member

@lukemanley lukemanley commented Feb 25, 2023

import pandas as pd import numpy as np df = pd.DataFrame(np.random.randn(1_000_000, 10), dtype="float64[pyarrow]") %timeit df.fillna(0.0) 
25 ms ± 5.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) -> main 76.4 µs ± 4.43 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each) -> PR 
@lukemanley lukemanley added Performance Memory or execution speed performance Arrow pyarrow functionality labels Feb 25, 2023
@phofl
Copy link
Member

phofl commented Feb 25, 2023

Could you test this on #51411 as well? This should address the problem for CoW

@lukemanley
Copy link
Member Author

Could you test this on #51411 as well? This should address the problem for CoW

I tested locally on top of your branch in #51411 and it seems to work. What "problem" are you referring to?

@phofl
Copy link
Member

phofl commented Feb 25, 2023

Performance, we are avoiding the copy on the block level there

@phofl phofl added this to the 2.1 milestone Feb 27, 2023
@lukemanley
Copy link
Member Author

@phofl - hmmm... clearly this doesn't play well with #51411

Not sure what I did yesterday to convince myself it did... sorry

@lukemanley
Copy link
Member Author

actually, what is the expected behavior for COW wrt pyarrow? Since arrow arrays are immutable, ArrowExtensionArray.copy doesn't actually copy the underlying pyarrow array. I think tests like this (which are failing) are expected to fail:

import pandas as pd import numpy as np arr = pd.array([1, 2, 3], dtype="int64[pyarrow]") arr2 = arr.copy() assert not np.shares_memory(arr, arr2) # -> raises 
@phofl
Copy link
Member

phofl commented Feb 27, 2023

CoW handles the copies globally on the block level, hence we can remove the check above.

You can adjust the tests, eas made to many defensive copies without CoW, you got rid of one of them, which is a good thing

@lukemanley
Copy link
Member Author

Ok will do. Is the comment you added still relevant? (# TODO(CoW): Not necessary anymore when CoW is the default)

I would think this fastpath can still apply after CoW is the default (e.g. for whatever reason CoW is not being used)

@phofl
Copy link
Member

phofl commented Feb 27, 2023

We are not planning on having a non CoW mode when CoW is the default

@mroeschke
Copy link
Member

Merge when ready @phofl

@mroeschke
Copy link
Member

Could you rebase @lukemanley

@mroeschke mroeschke merged commit 04ec93e into pandas-dev:main Mar 17, 2023
@mroeschke
Copy link
Member

Thanks @lukemanley

@lukemanley lukemanley deleted the arrow-fillna-no-nulls branch March 18, 2023 11:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Arrow pyarrow functionality Performance Memory or execution speed performance

3 participants