Skip to content

Conversation

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Feb 16, 2024

xref #57431

I think we are sure that the values from self.get_new_values(..), which we pass here to the DataFrame constructor, are always new values owned by us (not shared by any other object), and so we can safely set copy=False in that case.

Mimicking the time_unstack benchmark:

m = 100 n = 1000 levels = np.arange(m) index = pd.MultiIndex.from_product([levels] * 2) columns = np.arange(n) values = np.arange(m * m * n).reshape(m * m, n) df = pd.DataFrame(values, index, columns) 
In [2]: %timeit df.unstack() 316 ms ± 44.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # main 228 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # PR 
@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance Copy / view semantics labels Feb 16, 2024
Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this should be fine

should be good to backport as well

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are sure that the values from self.get_new_values(..)... are always new values owned by us

I'm not sure that's the case. This path:

sorted_values = self._make_sorted_values(values)
# place the values
length, width = self.full_shape
stride = values.shape[1]
result_width = width * stride
result_shape = (length, result_width)
mask = self.mask
mask_all = self.mask_all
# we can simply reshape if we don't have a mask
if mask_all and len(values):
# TODO: Under what circumstances can we rely on sorted_values
# matching values? When that holds, we can slice instead
# of take (in particular for EAs)
new_values = (
sorted_values.reshape(length, width, stride)
.swapaxes(1, 2)
.reshape(result_shape)
)
new_mask = np.ones(result_shape, dtype=bool)
return new_values, new_mask

with self.sort=False only does reshape and swapaxes. Both of those can be no copy, right?

@jorisvandenbossche
Copy link
Member Author

I'm not sure that's the case. This path:

Ah, yes, I missed the early return there.

Continued in -> #57487

@jorisvandenbossche jorisvandenbossche deleted the cow-perf-fixes branch February 19, 2024 11:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Copy / view semantics Performance Memory or execution speed performance

3 participants