Skip to content

Conversation

@nlee737
Copy link
Contributor

@nlee737 nlee737 commented May 10, 2018

In [3]: arr = np.ones(10000000,dtype='int8') # master In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 44.2 ms ± 545 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) # after patch In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 9 ms ± 54.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 
 before after ratio [6d5d7015] [fb1f7b84] 9.24±1ms 9.94±0.6ms 1.08 categoricals.Concat.time_concat 5.52±0.1ms 5.41±0.05ms 0.98 categoricals.Concat.time_union 32.0±0.3ms 32.3±0.3ms 1.01 categoricals.Constructor.time_all_nan 1.32±0.02ms 1.28±0.01ms 0.97 categoricals.Constructor.time_datetimes 1.26±0.01ms 1.29±0.02ms 1.02 categoricals.Constructor.time_datetimes_with_nat 354±3μs 349±7μs 0.99 categoricals.Constructor.time_fastpath 20.0±0.08ms 20.1±0.3ms 1.01 categoricals.Constructor.time_regular 185±1ms 186±0.6ms 1.01 categoricals.Constructor.time_with_nan 10.1ms 10.1ms 0.99 categoricals.Isin.time_isin_categorical('int64') 10.7±0.08ms 10.8±0.07ms 1.00 categoricals.Isin.time_isin_categorical('object') 9.11±0.1ms 9.04±0.2ms 0.99 categoricals.Rank.time_rank_int 9.33±0.1ms 9.37±0.1ms 1.00 categoricals.Rank.time_rank_int_cat 9.13±0.1ms 8.97±0.05ms 0.98 categoricals.Rank.time_rank_int_cat_ordered 141±0.9ms 136±1ms 0.97 categoricals.Rank.time_rank_string 11.2±0.2ms 11.1±0.1ms 0.99 categoricals.Rank.time_rank_string_cat 9.04±0.1ms 9.23±0.1ms 1.02 categoricals.Rank.time_rank_string_cat_ordered 592±5μs 586±3μs 0.99 categoricals.Repr.time_rendering 32.8±2ms 28.4±0.6ms ~0.86 categoricals.SetCategories.time_set_categories 31.8±2ms 29.6±0.1ms 0.93 categoricals.ValueCounts.time_value_counts(False) 30.7±0.1ms 29.3±0.2ms 0.96 categoricals.ValueCounts.time_value_counts(True) 
"""
try:
codes = np.asarray(codes, np.int64)
if (type(codes) == np.ndarray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, is it sufficient to call coerce_indexer_dtype(codes, categories)? If so, how's the performance of that?

Copy link
Contributor Author

@nlee737 nlee737 May 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arr = np.ones(10000000, 'int8') # patch In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 9.13 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) # coerce_indexer_dtype In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 9.14 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 

But for lists,

arr = [1 for x in range(10000000)] # patch In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 600 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # coerce_indexer_dtype In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 2.32 s ± 23.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 

I don't think coerce_indexer_dtype turns codes into a Numpy array.

diff --git a/pandas/core/arrays/categorical.py b/pandas/core/arrays/categorical.py index 93e7aa7ca..8a096e4c8 100644 --- a/pandas/core/arrays/categorical.py +++ b/pandas/core/arrays/categorical.py @@ -578,18 +578,14 @@ class Categorical(ExtensionArray, PandasObject): unordered. """ try: - if (type(codes) == np.ndarray - and np.issubdtype(codes.dtype, np.integer)): - codes = np.asarray(codes) - else: - codes = np.asarray(codes, np.int64) + coerce_indexer_dtype(codes, categories) except (ValueError, TypeError): raise ValueError( "codes need to be convertible to an arrays of integers") categories = CategoricalDtype.validate_categories(categories) - if len(codes) and (codes.max() >= len(categories) or codes.min() < -1): + if len(codes) and (np.max(codes) >= len(categories) or np.min(codes) < -1): raise ValueError("codes need to be between -1 and " "len(categories)-1") 
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. And if you combine the two approaches like

codes = coerce_indexer_dtype(np.asarray(codes), categories)

does that work?

Copy link
Contributor Author

@nlee737 nlee737 May 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it works and performance seems on par.

In [3]: arr = np.ones(10000000, 'int8') # patch In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 9.06 ms ± 28.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) # coerce_indexer_dtype + np.asarray In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 9.03 ms ± 37.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [5]: lst = [1 for x in range(10000000)] # patch In [6]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 588 ms ± 3.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # coerce_indexer_dtype + np.asarray In [6]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 583 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on that implementation? I slightly prefer it to the if / else version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer it as well. I pushed a commit with your implementation.

@TomAugspurger
Copy link
Contributor

Could you also add a release note (performance improvement)?

@codecov
Copy link

codecov bot commented May 12, 2018

Codecov Report

Merging #21000 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@ Coverage Diff @@ ## master #21000 +/- ## ========================================== - Coverage 91.82% 91.82% -0.01%  ========================================== Files 153 153 Lines 49505 49502 -3 ========================================== - Hits 45460 45457 -3  Misses 4045 4045
Flag Coverage Δ
#multiple 90.22% <100%> (-0.01%) ⬇️
#single 41.88% <100%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.67% <100%> (ø) ⬆️
pandas/core/dtypes/concat.py 99.17% <0%> (-0.01%) ⬇️
pandas/core/indexes/base.py 96.64% <0%> (-0.01%) ⬇️
pandas/io/parsers.py 95.46% <0%> (ø) ⬆️
pandas/core/internals.py 95.59% <0%> (ø) ⬆️
pandas/core/window.py 96.28% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3d03fdb...75da8a3. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u add an asv for this

@jreback jreback added Performance Memory or execution speed performance Categorical Categorical Data Type labels May 12, 2018
@nlee737
Copy link
Contributor Author

nlee737 commented May 12, 2018

Updated ASV:

 before after ratio [3d03fdb2] [d83abca4] 9.27±0.6ms 9.13±0.3ms 0.98 categoricals.Concat.time_concat 5.80±0.1ms 6.22±0.03ms 1.07 categoricals.Concat.time_union 32.7±0.6ms 32.8±0.4ms 1.00 categoricals.Constructor.time_all_nan 1.26±0.03ms 1.24±0.01ms 0.99 categoricals.Constructor.time_datetimes 1.28±0ms 1.28±0.01ms 1.00 categoricals.Constructor.time_datetimes_with_nat 364±8μs 348±5μs 0.96 categoricals.Constructor.time_fastpath 19.9±0.08ms 20.0±0.06ms 1.00 categoricals.Constructor.time_regular 181±0.5ms 182±2ms 1.00 categoricals.Constructor.time_with_nan 10.2ms 10.3ms 1.01 categoricals.Isin.time_isin_categorical('int64') 10.8±0.1ms 10.5±0.09ms 0.97 categoricals.Isin.time_isin_categorical('object') 8.89±0.08ms 8.94±0.1ms 1.01 categoricals.Rank.time_rank_int 9.33±0.2ms 9.44±0.05ms 1.01 categoricals.Rank.time_rank_int_cat 8.95±0.08ms 8.95±0.05ms 1.00 categoricals.Rank.time_rank_int_cat_ordered 135±2ms 140±2ms 1.03 categoricals.Rank.time_rank_string 11.0±0.06ms 11.1±0.2ms 1.01 categoricals.Rank.time_rank_string_cat 9.14±0.2ms 9.11±0.1ms 1.00 categoricals.Rank.time_rank_string_cat_ordered 590±7μs 599±20μs 1.01 categoricals.Repr.time_rendering 27.7±0.5ms 29.1±0.2ms 1.05 categoricals.SetCategories.time_set_categories 30.0±0.4ms 29.6±0.4ms 0.99 categoricals.ValueCounts.time_value_counts(False) 29.6±0.4ms 29.6±0.2ms 1.00 categoricals.ValueCounts.time_value_counts(True) 
@TomAugspurger
Copy link
Contributor

I think @jbreack was requesting a new ASV that measures Categorical.from_codes with an int8 type array.

It can go in asv_bench/benchmarks/categoricals.py under Constructor.

@nlee737
Copy link
Contributor Author

nlee737 commented May 12, 2018

Sorry, I misunderstood. Thanks for the clarification. Is the following asv sufficient?

diff --git a/asv_bench/benchmarks/categoricals.py b/asv_bench/benchmarks/categoricals.py index 0ffd5f881..ae1d70292 100644 --- a/asv_bench/benchmarks/categoricals.py +++ b/asv_bench/benchmarks/categoricals.py @@ -51,6 +51,7 @@ class Constructor(object): self.values_some_nan = list(np.tile(self.categories + [np.nan], N)) self.values_all_nan = [np.nan] * len(self.values) + self.values_all_int8 = np.ones(N, 'int8') def time_regular(self): pd.Categorical(self.values, self.categories) @@ -70,6 +71,9 @@ class Constructor(object): def time_all_nan(self): pd.Categorical(self.values_all_nan) + def time_from_codes_all_int8(self): + pd.Categorical.from_codes(self.values_all_int8, self.categories) + 
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, thanks.

@TomAugspurger TomAugspurger added this to the 0.23.0 milestone May 15, 2018
@TomAugspurger TomAugspurger merged commit 363426f into pandas-dev:master May 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Categorical Categorical Data Type Performance Memory or execution speed performance

3 participants