Skip to content

Conversation

@arw2019
Copy link
Contributor

@arw2019 arw2019 commented Oct 23, 2020

  • closes ENH: decode for Categoricals #8628
  • tests added / passed
  • benchmarks added
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

To illustrate the speed-up, the set-up is (from OP):

import numpy as np import pandas as pd rng = np.random.default_rng() df = pd.DataFrame( rng.choice(np.array(list("abcde")), 4_000_000).reshape(1_000_000, 4), columns=list("ABCD"), ) for col in df.columns: df[col] = df[col].astype("category")

On master

In [5]: %timeit [df[col].astype('unicode') for col in df.columns] 250 ms ± 601 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

versus on this branch

In [5]: %timeit [df[col].astype('unicode') for col in df.columns] 5.38 ms ± 47 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@arw2019 arw2019 changed the title PERF/ENH: add fast astyping for categorical input PERF/ENH: add fast astyping for Categorical Oct 23, 2020
@topper-123
Copy link
Contributor

This changes the behaviour unfortunately:

>>> n = 100_000 >>> cat = pd.Categorical(["a", "b"] * n) >>> ser = pd.Series(cat) >>> ser.astype(str).dtype dtype('O') # master CategoricalDtype(categories=['a', 'b'], ordered=False) # this PR, different dtype

Can you make it work in some different way?

@arw2019
Copy link
Contributor Author

arw2019 commented Oct 23, 2020

This changes the behaviour unfortunately:

>>> n = 100_000 >>> cat = pd.Categorical(["a", "b"] * n) >>> ser = pd.Series(cat) >>> ser.astype(str).dtype dtype('O') # master CategoricalDtype(categories=['a', 'b'], ordered=False) # this PR, different dtype

Can you make it work in some different way?

Thanks for the catch! Fixed this now


elif ( # GH8628
is_categorical_dtype(self.values.dtype)
and not (is_object_dtype(dtype) or is_string_like_dtype(dtype))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could define a new method for this in core/dtypes/common.py (is_string_like_or_object_dtype) ?

@arw2019
Copy link
Contributor Author

arw2019 commented Oct 23, 2020

It seems like adding this to astype has a lot of repercussions downstream, casting objects in undesirable ways. The special casing needed to make the tests pass is a code smell.

I wonder if it might make sense, instead of adding this to astype, to create a dedicated method for astyping a Categorical via this method?

In []: cat = pd.Categorical(["a", "b", "a", "a"] ) ...: s = pd.Series(cat) ...: s.astype_categories(dtype=str) Out[]: 0 a 1 b 2 a 3 a dtype: object 
@jreback
Copy link
Contributor

jreback commented Oct 23, 2020

see also #37371; this might be the same path.

or is_datetime_or_timedelta_dtype(dtype)
)
and copy is True
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems really convoluted, in a method that is already too complicated as it is (xref #22369)

Do you have a good idea where the perf improvement comes from? e.g. could we push this down into Categorical.astype?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that the amount of special casing here is not good (and even with that a bunch of tests are still failing)

The perf improvement is from astyping just the category labels instead of astyping each array entry separately.

Categorical.astype seems like a good location for this - will give that a go

@jreback jreback added Categorical Categorical Data Type Performance Memory or execution speed performance labels Oct 26, 2020

new_categories = self.categories.astype(dtype)
obj = Categorical.from_codes(self.codes, categories=new_categories)
return np.array(obj.categories[self.codes], copy=copy)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to pass dtype here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think no because we've already astyped the category a few lines up

for col in self.df.columns:
self.df[col] = self.df[col].astype("category")

def astype_unicode(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add benchmarks for other types of categories (int, dti) for example. show the results of the benchmarks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok!

Posted int benchmark in main thread + will add/post more

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right pls update these for int,float,string,datetime

@arw2019
Copy link
Contributor Author

arw2019 commented Oct 29, 2020

Re: review comments:

int benchmark results

In [1]: import numpy as np ...: import pandas as pd ...: ...: rng = np.random.default_rng() ...: ...: df = pd.DataFrame( ...: rng.choice(np.arange(8), 4_000_000).reshape(1_000_000, 4), ...: columns=list("ABCD"), ...: ) ...: ...: for col in df.columns: ...: df[col] = df[col].astype("category") ...: 

On master:

In [2]: %timeit [df[col].astype('int') for col in df.columns] 20.9 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 

On this branch:

In [4]: %timeit [df[col].astype('int') for col in df.columns] 18.7 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 
@arw2019 arw2019 marked this pull request as draft October 31, 2020 05:30
@jreback
Copy link
Contributor

jreback commented Oct 31, 2020

this is not showing much of an improvement. can you try someother dtypes as well.

try:
astyped_cats = self.categories.astype(dtype=dtype, copy=copy)
except (TypeError, ValueError):
raise ValueError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change TypeError?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to fix the error message for CategoricalIndex. If we don't catch TypeError we end up with TypeError: Cannot cast Index to dtype float64 (below) versus something like TypeError: Cannot cast object to dtype float64

In [2]: idx = pd.CategoricalIndex(["a", "b", "c", "a", "b", "c"]) In [3]: idx.astype('float') --------------------------------------------------------------------------- ValueError Traceback (most recent call last) /workspaces/pandas-arw2019/pandas/core/indexes/base.py in astype(self, dtype, copy) 700 try: --> 701 casted = self._values.astype(dtype, copy=copy) 702 except (TypeError, ValueError) as err: ValueError: could not convert string to float: 'a' The above exception was the direct cause of the following exception: TypeError Traceback (most recent call last) <ipython-input-4-38d56ec15c36> in <module> ----> 1 idx.astype('float') /workspaces/pandas-arw2019/pandas/core/indexes/category.py in astype(self, dtype, copy) 369 @doc(Index.astype) 370 def astype(self, dtype, copy=True): --> 371 res_data = self._data.astype(dtype, copy=copy) 372 return Index(res_data, name=self.name) 373 /workspaces/pandas-arw2019/pandas/core/arrays/categorical.py in astype(self, dtype, copy) 427 # GH8628 (PERF): astype category codes instead of astyping array 428 try: --> 429 astyped_cats = self.categories.astype(dtype=dtype, copy=copy) 430 except (ValueError): 431 raise ValueError( /workspaces/pandas-arw2019/pandas/core/indexes/base.py in astype(self, dtype, copy) 701 casted = self._values.astype(dtype, copy=copy) 702 except (TypeError, ValueError) as err: --> 703 raise TypeError( 704 f"Cannot cast {type(self).__name__} to dtype {dtype}" 705 ) from err TypeError: Cannot cast Index to dtype float64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is fine, but can you add a comment for this then right here (so future readers understand)

@jreback
Copy link
Contributor

jreback commented Nov 18, 2020

ci / checks failing

@arw2019
Copy link
Contributor Author

arw2019 commented Nov 18, 2020

ci / checks failing

Hmmm the mypy complaint looks unrelated (also getting the same thing on my other PRs)

mypy --version mypy 0.782 Performing static analysis using mypy pandas/core/indexes/datetimelike.py:776: error: Argument 1 to "_simple_new" of "DatetimeIndexOpsMixin" has incompatible type "Union[ExtensionArray, Any]"; expected "Union[DatetimeArray, TimedeltaArray, PeriodArray]" [arg-type] Found 1 error in 1 file (checked 1119 source files) Performing static analysis using mypy DONE 

Anyway will keep looking and ping when green

@jreback
Copy link
Contributor

jreback commented Nov 18, 2020

ci / checks failing

Hmmm the mypy complaint looks unrelated (also getting the same thing on my other PRs)

mypy --version mypy 0.782 Performing static analysis using mypy pandas/core/indexes/datetimelike.py:776: error: Argument 1 to "_simple_new" of "DatetimeIndexOpsMixin" has incompatible type "Union[ExtensionArray, Any]"; expected "Union[DatetimeArray, TimedeltaArray, PeriodArray]" [arg-type] Found 1 error in 1 file (checked 1119 source files) Performing static analysis using mypy DONE 

Anyway will keep looking and ping when green

oh i think that's fixed on master

@arw2019 arw2019 closed this Nov 18, 2020
@arw2019 arw2019 reopened this Nov 18, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment, ping on green.

try:
astyped_cats = self.categories.astype(dtype=dtype, copy=copy)
except (TypeError, ValueError):
raise ValueError(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is fine, but can you add a comment for this then right here (so future readers understand)

@arw2019
Copy link
Contributor Author

arw2019 commented Nov 18, 2020

@jreback Green + addressed comment

@jreback jreback merged commit cc957d1 into pandas-dev:master Nov 18, 2020
@jreback
Copy link
Contributor

jreback commented Nov 18, 2020

thanks @arw2019 very nice

@arw2019
Copy link
Contributor Author

arw2019 commented Nov 18, 2020

thanks @jreback @jbrockmendel for the reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Categorical Categorical Data Type Performance Memory or execution speed performance

5 participants