PERF: removed coercion to int64 for arrays of ints in Categorical.from_codes #21000

nlee737 · 2018-05-10T14:38:06Z

closes #18501 (Categorical.from_codes shouldn't coerce to int64)
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

    In [3]: arr = np.ones(10000000,dtype='int8')
    # master
    In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
    44.2 ms ± 545 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    # after patch
    In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
    9 ms ± 54.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
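One plausible reading of the speedup above: forcing int64 makes NumPy allocate and fill a new buffer eight times the size of the int8 input, whereas `np.asarray` without a dtype hands an existing array back untouched. A minimal sketch of that difference, independent of pandas internals (the array size here is an arbitrary choice, not the one used in the timings):

```python
import numpy as np

# Illustrative size only; the timings above used 10,000,000 elements.
codes = np.ones(1_000_000, dtype="int8")

# No dtype requested: an ndarray input is returned as-is, without a copy.
unchanged = np.asarray(codes)

# Forcing int64 allocates a new buffer and converts every element.
widened = np.asarray(codes, dtype=np.int64)

print(unchanged is codes)               # whether the same object came back
print(widened.nbytes // codes.nbytes)   # size ratio of the int64 copy
```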

        before          after        ratio
      [6d5d7015]      [fb1f7b84]
        9.24±1ms      9.94±0.6ms      1.08  categoricals.Concat.time_concat
      5.52±0.1ms     5.41±0.05ms      0.98  categoricals.Concat.time_union
      32.0±0.3ms      32.3±0.3ms      1.01  categoricals.Constructor.time_all_nan
     1.32±0.02ms     1.28±0.01ms      0.97  categoricals.Constructor.time_datetimes
     1.26±0.01ms     1.29±0.02ms      1.02  categoricals.Constructor.time_datetimes_with_nat
         354±3μs         349±7μs      0.99  categoricals.Constructor.time_fastpath
     20.0±0.08ms      20.1±0.3ms      1.01  categoricals.Constructor.time_regular
         185±1ms       186±0.6ms      1.01  categoricals.Constructor.time_with_nan
          10.1ms          10.1ms      0.99  categoricals.Isin.time_isin_categorical('int64')
     10.7±0.08ms     10.8±0.07ms      1.00  categoricals.Isin.time_isin_categorical('object')
      9.11±0.1ms      9.04±0.2ms      0.99  categoricals.Rank.time_rank_int
      9.33±0.1ms      9.37±0.1ms      1.00  categoricals.Rank.time_rank_int_cat
      9.13±0.1ms     8.97±0.05ms      0.98  categoricals.Rank.time_rank_int_cat_ordered
       141±0.9ms         136±1ms      0.97  categoricals.Rank.time_rank_string
      11.2±0.2ms      11.1±0.1ms      0.99  categoricals.Rank.time_rank_string_cat
      9.04±0.1ms      9.23±0.1ms      1.02  categoricals.Rank.time_rank_string_cat_ordered
         592±5μs         586±3μs      0.99  categoricals.Repr.time_rendering
        32.8±2ms      28.4±0.6ms     ~0.86  categoricals.SetCategories.time_set_categories
        31.8±2ms      29.6±0.1ms      0.93  categoricals.ValueCounts.time_value_counts(False)
      30.7±0.1ms      29.3±0.2ms      0.96  categoricals.ValueCounts.time_value_counts(True)

TomAugspurger · 2018-05-10T15:52:10Z

pandas/core/arrays/categorical.py

 """
 try:
- codes = np.asarray(codes, np.int64)
+ if (type(codes) == np.ndarray


I wonder, is it sufficient to call coerce_indexer_dtype(codes, categories)? If so, how's the performance of that?

    arr = np.ones(10000000, 'int8')
    # patch
    In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
    9.13 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    # coerce_indexer_dtype
    In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
    9.14 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But for lists,

    arr = [1 for x in range(10000000)]
    # patch
    In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
    600 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    # coerce_indexer_dtype
    In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
    2.32 s ± 23.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I don't think coerce_indexer_dtype turns codes into a NumPy array.

diff --git a/pandas/core/arrays/categorical.py b/pandas/core/arrays/categorical.py
index 93e7aa7ca..8a096e4c8 100644
--- a/pandas/core/arrays/categorical.py
+++ b/pandas/core/arrays/categorical.py
@@ -578,18 +578,14 @@ class Categorical(ExtensionArray, PandasObject):
             unordered.
         """
         try:
-            if (type(codes) == np.ndarray
-                    and np.issubdtype(codes.dtype, np.integer)):
-                codes = np.asarray(codes)
-            else:
-                codes = np.asarray(codes, np.int64)
+            coerce_indexer_dtype(codes, categories)
         except (ValueError, TypeError):
             raise ValueError(
                 "codes need to be convertible to an arrays of integers")
         categories = CategoricalDtype.validate_categories(categories)
-        if len(codes) and (codes.max() >= len(categories) or codes.min() < -1):
+        if len(codes) and (np.max(codes) >= len(categories) or np.min(codes) < -1):
             raise ValueError("codes need to be between -1 and "
                              "len(categories)-1")
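The bounds check at the end of the diff can be illustrated outside pandas. This is a minimal sketch, assuming only the convention stated in the error message (valid codes run from -1 to len(categories)-1); `validate_codes` is a hypothetical name, not a pandas function:

```python
import numpy as np


def validate_codes(codes, categories):
    """Hypothetical stand-in for the bounds check shown in the diff."""
    codes = np.asarray(codes)  # convert once so max/min run on an ndarray
    if len(codes) and (codes.max() >= len(categories) or codes.min() < -1):
        raise ValueError("codes need to be between -1 and "
                         "len(categories)-1")
    return codes


validate_codes([0, 1, -1], ["foo", "bar"])   # within range, returns an array

try:
    validate_codes([0, 2], ["foo", "bar"])   # 2 exceeds len(categories)-1
except ValueError as exc:
    print(exc)
```

Converting up front likely matters for list input: `np.max` and `np.min` each have to turn a Python list into an array before reducing it, so an unconverted list pays that cost on every call.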

Indeed. And if you combine the two approaches like

codes = coerce_indexer_dtype(np.asarray(codes), categories)

does that work?

Yep, it works and performance seems on par.

    In [3]: arr = np.ones(10000000, 'int8')
    # patch
    In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
    9.06 ms ± 28.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    # coerce_indexer_dtype + np.asarray
    In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
    9.03 ms ± 37.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    In [5]: lst = [1 for x in range(10000000)]
    # patch
    In [6]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
    588 ms ± 3.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    # coerce_indexer_dtype + np.asarray
    In [6]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
    583 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Thoughts on that implementation? I slightly prefer it to the if / else version.

I prefer it as well. I pushed a commit with your implementation.
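For readers unfamiliar with the helper, here is a rough sketch of the idea behind the combined call: convert to an array once, then narrow to a small integer dtype chosen from the number of categories. The function names and thresholds below are assumptions made for illustration; this is not the code of pandas' `coerce_indexer_dtype`.

```python
import numpy as np


def smallest_code_dtype(n_categories):
    """Hypothetical helper: narrowest signed int dtype able to hold
    every code from -1 up to n_categories - 1."""
    for dtype in (np.int8, np.int16, np.int32):
        if n_categories <= np.iinfo(dtype).max:
            return dtype
    return np.int64


def coerce_codes(codes, categories):
    """Illustrative equivalent of asarray followed by dtype narrowing."""
    arr = np.asarray(codes)  # lists become arrays; arrays pass through
    # copy=False avoids a copy when the dtype already matches.
    # Note: astype would silently truncate floats; real code should
    # reject non-integer input first.
    return arr.astype(smallest_code_dtype(len(categories)), copy=False)


from_list = coerce_codes([0, 1, 1], ["foo", "bar"])
int8_input = np.ones(10, dtype="int8")
from_array = coerce_codes(int8_input, ["foo", "bar"])
print(from_list.dtype, from_array.dtype)
```

With this shape, an int8 array passes through without being widened, and a list is converted exactly once, which is consistent with the two approaches timing the same in the session above.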

TomAugspurger · 2018-05-11T21:18:38Z

Could you also add a release note (performance improvement)?

codecov · 2018-05-12T12:53:47Z

Codecov Report

Merging #21000 into master will decrease coverage by <.01%.
The diff coverage is 100%.

    @@            Coverage Diff             @@
    ##           master   #21000      +/-   ##
    ==========================================
    - Coverage   91.82%   91.82%   -0.01%
    ==========================================
      Files         153      153
      Lines       49505    49502       -3
    ==========================================
    - Hits        45460    45457       -3
      Misses       4045     4045

Flag	Coverage Δ
#multiple	`90.22% <100%> (-0.01%)`	⬇️
#single	`41.88% <100%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/arrays/categorical.py	`95.67% <100%> (ø)`	⬆️
pandas/core/dtypes/concat.py	`99.17% <0%> (-0.01%)`	⬇️
pandas/core/indexes/base.py	`96.64% <0%> (-0.01%)`	⬇️
pandas/io/parsers.py	`95.46% <0%> (ø)`	⬆️
pandas/core/internals.py	`95.59% <0%> (ø)`	⬆️
pandas/core/window.py	`96.28% <0%> (ø)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3d03fdb...75da8a3. Read the comment docs.

jreback

Can you add an asv for this?

nlee737 · 2018-05-12T14:06:06Z

Updated ASV:

        before          after        ratio
      [3d03fdb2]      [d83abca4]
      9.27±0.6ms      9.13±0.3ms      0.98  categoricals.Concat.time_concat
      5.80±0.1ms     6.22±0.03ms      1.07  categoricals.Concat.time_union
      32.7±0.6ms      32.8±0.4ms      1.00  categoricals.Constructor.time_all_nan
     1.26±0.03ms     1.24±0.01ms      0.99  categoricals.Constructor.time_datetimes
        1.28±0ms     1.28±0.01ms      1.00  categoricals.Constructor.time_datetimes_with_nat
         364±8μs         348±5μs      0.96  categoricals.Constructor.time_fastpath
     19.9±0.08ms     20.0±0.06ms      1.00  categoricals.Constructor.time_regular
       181±0.5ms         182±2ms      1.00  categoricals.Constructor.time_with_nan
          10.2ms          10.3ms      1.01  categoricals.Isin.time_isin_categorical('int64')
      10.8±0.1ms     10.5±0.09ms      0.97  categoricals.Isin.time_isin_categorical('object')
     8.89±0.08ms      8.94±0.1ms      1.01  categoricals.Rank.time_rank_int
      9.33±0.2ms     9.44±0.05ms      1.01  categoricals.Rank.time_rank_int_cat
     8.95±0.08ms     8.95±0.05ms      1.00  categoricals.Rank.time_rank_int_cat_ordered
         135±2ms         140±2ms      1.03  categoricals.Rank.time_rank_string
     11.0±0.06ms      11.1±0.2ms      1.01  categoricals.Rank.time_rank_string_cat
      9.14±0.2ms      9.11±0.1ms      1.00  categoricals.Rank.time_rank_string_cat_ordered
         590±7μs        599±20μs      1.01  categoricals.Repr.time_rendering
      27.7±0.5ms      29.1±0.2ms      1.05  categoricals.SetCategories.time_set_categories
      30.0±0.4ms      29.6±0.4ms      0.99  categoricals.ValueCounts.time_value_counts(False)
      29.6±0.4ms      29.6±0.2ms      1.00  categoricals.ValueCounts.time_value_counts(True)

TomAugspurger · 2018-05-12T18:43:22Z

I think @jreback was requesting a new ASV that measures Categorical.from_codes with an int8 array.

It can go in asv_bench/benchmarks/categoricals.py under Constructor.

nlee737 · 2018-05-12T19:51:13Z

Sorry, I misunderstood. Thanks for the clarification. Is the following asv sufficient?

diff --git a/asv_bench/benchmarks/categoricals.py b/asv_bench/benchmarks/categoricals.py
index 0ffd5f881..ae1d70292 100644
--- a/asv_bench/benchmarks/categoricals.py
+++ b/asv_bench/benchmarks/categoricals.py
@@ -51,6 +51,7 @@ class Constructor(object):
         self.values_some_nan = list(np.tile(self.categories + [np.nan], N))
         self.values_all_nan = [np.nan] * len(self.values)
+        self.values_all_int8 = np.ones(N, 'int8')
     def time_regular(self):
         pd.Categorical(self.values, self.categories)
@@ -70,6 +71,9 @@ class Constructor(object):
     def time_all_nan(self):
         pd.Categorical(self.values_all_nan)
+    def time_from_codes_all_int8(self):
+        pd.Categorical.from_codes(self.values_all_int8, self.categories)
+
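To try the new benchmark without running the full asv suite, something like the following standalone variant may help. It is a sketch only: the class name, `N`, and the category list are assumptions chosen for illustration and are not taken from the pandas benchmark file.

```python
import timeit

import numpy as np
import pandas as pd


class FromCodesInt8:
    """Hypothetical standalone variant of the benchmark in the diff."""

    def setup(self):
        N = 10 ** 5                       # assumed size, for illustration
        self.categories = list("abcde")   # assumed categories
        self.values_all_int8 = np.ones(N, "int8")

    def time_from_codes_all_int8(self):
        pd.Categorical.from_codes(self.values_all_int8, self.categories)


bench = FromCodesInt8()
bench.setup()
runs = 10
elapsed = timeit.timeit(bench.time_from_codes_all_int8, number=runs)
print(f"{elapsed / runs * 1e3:.3f} ms per call")
```

asv itself calls `setup` before timing each `time_*` method, so the manual `bench.setup()` call here only mimics that behavior.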

TomAugspurger

Perfect, thanks.

nlee737 mentioned this pull request May 10, 2018

PERF: removed coercion to int64 in Categorical.from_codes #20961

Closed

4 tasks

TomAugspurger reviewed May 10, 2018

PERF: removed coercion to int64 for arrays of ints in Categorical.from_codes

49ffb8f

CLN: using coerce_indexer_dtype for codes in Categorical.from_codes

d83abca

jreback requested changes May 12, 2018

jreback added the Performance (memory or execution speed performance) and Categorical (categorical data type) labels May 12, 2018

PERF: added asv for Categorical.from_codes (#18501)

75da8a3

TomAugspurger approved these changes May 15, 2018

TomAugspurger added this to the 0.23.0 milestone May 15, 2018

TomAugspurger merged commit 363426f into pandas-dev:master May 15, 2018
