-
- Notifications
You must be signed in to change notification settings - Fork 19.4k
PERF/ENH: add fast astyping for Categorical #37355
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| This changes the behaviour unfortunately: >>> n = 100_000 >>> cat = pd.Categorical(["a", "b"] * n) >>> ser = pd.Series(cat) >>> ser.astype(str).dtype dtype('O') # master CategoricalDtype(categories=['a', 'b'], ordered=False) # this PR, different dtypeCan you make it work in some different way? |
Thanks for the catch! Fixed this now |
pandas/core/internals/blocks.py Outdated
| | ||
| elif ( # GH8628 | ||
| is_categorical_dtype(self.values.dtype) | ||
| and not (is_object_dtype(dtype) or is_string_like_dtype(dtype)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could define a new method for this in core/dtypes/common.py (is_string_like_or_object_dtype) ?
| It seems like adding this to astype has a lot of repercussions downstream, casting objects in undesirable ways. The special casing needed to make the tests pass is a code smell. I wonder if it might make sense, instead of adding this to |
| see also #37371; this might be the same path. |
pandas/core/internals/blocks.py Outdated
| or is_datetime_or_timedelta_dtype(dtype) | ||
| ) | ||
| and copy is True | ||
| ): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems really convoluted, in a method that is already too complicated as it is (xref #22369)
Do you have a good idea where the perf improvement comes from? e.g. could we push this down into Categorical.astype?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that the amount of special casing here is not good (and even with that a bunch of tests are still failing)
The perf improvement is from astyping just the category labels instead of astyping each array entry separately.
Categorical.astype seems like a good location for this - will give that a go
pandas/core/arrays/categorical.py Outdated
| | ||
| new_categories = self.categories.astype(dtype) | ||
| obj = Categorical.from_codes(self.codes, categories=new_categories) | ||
| return np.array(obj.categories[self.codes], copy=copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to pass dtype here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think no because we've already astyped the category a few lines up
asv_bench/benchmarks/categoricals.py Outdated
| for col in self.df.columns: | ||
| self.df[col] = self.df[col].astype("category") | ||
| | ||
| def astype_unicode(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add benchmarks for other types of categories (int, dti) for example. show the results of the benchmarks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok!
Posted int benchmark in main thread + will add/post more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right pls update these for int,float,string,datetime
| Re: review comments: |
| this is not showing much of an improvement. can you try someother dtypes as well. |
pandas/core/arrays/categorical.py Outdated
| try: | ||
| astyped_cats = self.categories.astype(dtype=dtype, copy=copy) | ||
| except (TypeError, ValueError): | ||
| raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why change TypeError?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's to fix the error message for CategoricalIndex. If we don't catch TypeError we end up with TypeError: Cannot cast Index to dtype float64 (below) versus something like TypeError: Cannot cast object to dtype float64
In [2]: idx = pd.CategoricalIndex(["a", "b", "c", "a", "b", "c"]) In [3]: idx.astype('float') --------------------------------------------------------------------------- ValueError Traceback (most recent call last) /workspaces/pandas-arw2019/pandas/core/indexes/base.py in astype(self, dtype, copy) 700 try: --> 701 casted = self._values.astype(dtype, copy=copy) 702 except (TypeError, ValueError) as err: ValueError: could not convert string to float: 'a' The above exception was the direct cause of the following exception: TypeError Traceback (most recent call last) <ipython-input-4-38d56ec15c36> in <module> ----> 1 idx.astype('float') /workspaces/pandas-arw2019/pandas/core/indexes/category.py in astype(self, dtype, copy) 369 @doc(Index.astype) 370 def astype(self, dtype, copy=True): --> 371 res_data = self._data.astype(dtype, copy=copy) 372 return Index(res_data, name=self.name) 373 /workspaces/pandas-arw2019/pandas/core/arrays/categorical.py in astype(self, dtype, copy) 427 # GH8628 (PERF): astype category codes instead of astyping array 428 try: --> 429 astyped_cats = self.categories.astype(dtype=dtype, copy=copy) 430 except (ValueError): 431 raise ValueError( /workspaces/pandas-arw2019/pandas/core/indexes/base.py in astype(self, dtype, copy) 701 casted = self._values.astype(dtype, copy=copy) 702 except (TypeError, ValueError) as err: --> 703 raise TypeError( 704 f"Cannot cast {type(self).__name__} to dtype {dtype}" 705 ) from err TypeError: Cannot cast Index to dtype float64There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok this is fine, but can you add a comment for this then right here (so future readers understand)
| ci / checks failing |
Hmmm the mypy complaint looks unrelated (also getting the same thing on my other PRs) Anyway will keep looking and ping when green |
oh i think that's fixed on master |
jreback left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comment, ping on green.
pandas/core/arrays/categorical.py Outdated
| try: | ||
| astyped_cats = self.categories.astype(dtype=dtype, copy=copy) | ||
| except (TypeError, ValueError): | ||
| raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok this is fine, but can you add a comment for this then right here (so future readers understand)
| @jreback Green + addressed comment |
| thanks @arw2019 very nice |
| thanks @jreback @jbrockmendel for the reviews! |
black pandasgit diff upstream/master -u -- "*.py" | flake8 --diffTo illustrate the speed-up, the set-up is (from OP):
On master
versus on this branch