@mroeschke mroeschke commented Aug 28, 2024

If a CategoricalDtype is passed to CategoricalDtype.update_dtype, the method unnecessarily re-validates its categories whenever they are not None.

CategoricalDtype.update_dtype is called in constructors such as Categorical.__init__ and Categorical._simple_new, which use it to update the passed dtype with ordered=False if ordered was None. A fully validated CategoricalDtype should simply be returned as-is when passed to update_dtype.
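To illustrate the idea, here is a minimal sketch of the short-circuit using a toy class. `ToyCategoricalDtype`, its `_validate` helper, and the `validations` counter are hypothetical names invented for this example; this is not the pandas implementation, and the uniqueness check only stands in for whatever validation the real dtype performs.

```python
class ToyCategoricalDtype:
    """Hypothetical stand-in for CategoricalDtype (not the pandas code)."""

    validations = 0  # counts how often the (assumed expensive) check runs

    def __init__(self, categories=None, ordered=None):
        if categories is not None:
            categories = self._validate(categories)
        self.categories = categories
        self.ordered = ordered

    @classmethod
    def _validate(cls, categories):
        # Stand-in validation: materialize the categories and require uniqueness.
        cls.validations += 1
        cats = tuple(categories)
        if len(set(cats)) != len(cats):
            raise ValueError("categories must be unique")
        return cats

    def update_dtype(self, dtype):
        # Fast path: a dtype with both categories and ordered set was already
        # validated when it was constructed, so hand it back unchanged.
        if dtype.categories is not None and dtype.ordered is not None:
            return dtype
        # Otherwise fill in whichever attribute the passed dtype left as None.
        categories = dtype.categories if dtype.categories is not None else self.categories
        ordered = dtype.ordered if dtype.ordered is not None else self.ordered
        return ToyCategoricalDtype(categories, ordered)


base = ToyCategoricalDtype(ordered=False)
full = ToyCategoricalDtype(categories=range(1000), ordered=True)

count_before = ToyCategoricalDtype.validations
result = base.update_dtype(full)

# The fully specified dtype is returned as-is, with no extra validation pass.
assert result is full
assert ToyCategoricalDtype.validations == count_before
```

In this sketch, a dtype that leaves `ordered` as None still goes through the slower path, which mirrors the constructor case described above where ordered=False has to be filled in.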

In [1]: import pandas as pd
In [2]: cdtype = pd.CategoricalDtype(categories=list(range(100_000)), ordered=True)
In [3]: base_dtype = pd.CategoricalDtype(ordered=False)
In [4]: %timeit base_dtype.update_dtype(cdtype)
2.5 μs ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [4]: %timeit base_dtype.update_dtype(cdtype)
865 ns ± 2.26 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

@mroeschke mroeschke added the Performance (Memory or execution speed performance) and Categorical (Categorical Data Type) labels Aug 28, 2024
@mroeschke mroeschke added this to the 3.0 milestone Aug 28, 2024
@galipremsagar
Thanks for the fix @mroeschke !

@mroeschke
Member Author

Looks like tests are passing here, so merging. Happy to follow up if needed.

@mroeschke mroeschke merged commit 85be99e into pandas-dev:main Sep 5, 2024
@mroeschke mroeschke deleted the perf/categoricaldtype/update_dtype branch September 5, 2024 17:24