@lukemanley
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Performance improvement for DataFrame.values when the frame is backed by a single pyarrow numeric dtype without any nulls. I realize this is a narrow use case, so I'm happy to close this PR if it isn't worth special-casing. The current slowness is due to DataFrame.values always casting to object dtype for ExtensionArray (EA)-backed frames. Unfortunately, a single null anywhere in the DataFrame misses this optimization, since pd.NA is used as the null representation in the ndarray.

    import pandas as pd
    import numpy as np

    data = np.random.randn(100_000, 20)
    df = pd.DataFrame(data, dtype="float64[pyarrow]")

    %timeit df.values
    # 98.7 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) <- main
    # 3.56 ms ± 96.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <- PR
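
For context, a caller who already knows the frame is numeric can get a float64 array rather than an object one by passing the NumPy dtype explicitly to DataFrame.to_numpy. The following is only a minimal sketch, not part of this PR: the helper name as_float_ndarray is hypothetical, mapping nulls to NaN is an assumption about what the caller wants, and the fallback to the masked "Float64" dtype is there only so the snippet runs when pyarrow is not installed.

```python
import numpy as np
import pandas as pd

try:
    import pyarrow  # noqa: F401  (only needed for the "[pyarrow]" dtype)
    DTYPE = "float64[pyarrow]"
except ImportError:
    # Assumption: fall back to pandas' masked float dtype if pyarrow is absent.
    DTYPE = "Float64"


def as_float_ndarray(df: pd.DataFrame) -> np.ndarray:
    """Hypothetical helper: return a float64 ndarray, with nulls mapped to NaN."""
    # Passing dtype explicitly requests a float64 result instead of object.
    # na_value replaces pd.NA with NaN so that nulls fit in a float64 array.
    return df.to_numpy(dtype="float64", na_value=np.nan)


df_no_nulls = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, dtype=DTYPE)
df_with_null = pd.DataFrame({"a": [1.0, pd.NA], "b": [3.0, 4.0]}, dtype=DTYPE)

arr_no_nulls = as_float_ndarray(df_no_nulls)
arr_with_null = as_float_ndarray(df_with_null)

print(arr_no_nulls.dtype, arr_with_null.dtype)
```

Unlike the special case proposed here, this leaves the behavior of DataFrame.values untouched. The trade-off is that pd.NA and NaN are no longer distinguishable in the result.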
@lukemanley lukemanley added Performance Memory or execution speed performance Arrow pyarrow functionality labels Apr 1, 2023
@phofl
Member

phofl commented Apr 2, 2023

This also changes behavior (e.g. getting float instead of object). Personally, I think this is fine but we have an unresolved discussion about this somewhere. We should decide there first before special casing here I'd say

@lukemanley
Member Author

> This also changes behavior (e.g. getting float instead of object).

Yes, the performance improvement is due to avoiding the cast to object. Note, this behavior actually already exists on main for a DataFrame with a single column:

Behavior on main:

    import pandas as pd

    df1 = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, dtype="float64[pyarrow]")
    df2 = pd.DataFrame({"a": [1.0, 2.0, pd.NA]}, dtype="float64[pyarrow]")

    print(df1.values.dtype)  # float64
    print(df2.values.dtype)  # object

> Personally, I think this is fine but we have an unresolved discussion about this somewhere. We should decide there first before special casing here I'd say

Sure, I think you might be referring to #22791

@lukemanley
Member Author

Closing for now, pending further discussion in #22791.

@lukemanley lukemanley closed this Apr 9, 2023
@lukemanley lukemanley deleted the perf-df-values-arrow branch April 18, 2023 11:03
