@TomAugspurger
Contributor

@TomAugspurger TomAugspurger commented Sep 13, 2017

Master:

In [1]: import pandas as pd; import numpy as np

In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10, size=500000)]; s = pd.Series(arr).astype('category')

In [3]: %timeit s.cat.set_categories(s.cat.categories)
68.5 ms ± 846 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

HEAD:

In [1]: import pandas as pd; import numpy as np

In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10, size=500000)]; s = pd.Series(arr).astype('category')

In [3]: %timeit s.cat.set_categories(s.cat.categories)
7.43 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Closes #17508

I'll rebase #16015 on top of this. Running an ASV now.
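For anyone who wants to rerun the comparison outside IPython, here is a rough sketch using the standard-library `timeit` module. The smaller size `N`, the fixed seed, and the helper name `run` are assumptions made so the script finishes quickly; absolute timings will differ from the numbers quoted above. It also checks that passing the existing categories back leaves values and codes unchanged.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: a smaller N than the 500000 used in the description,
# and a fixed seed so the run is repeatable.
N = 50_000
rng = np.random.RandomState(0)
arr = ['s%04d' % i for i in rng.randint(0, N // 10, size=N)]
s = pd.Series(arr).astype('category')


def run():
    # The call being benchmarked: set the categories to what they already are.
    return s.cat.set_categories(s.cat.categories)


result = run()

# Re-setting identical categories should not change values or codes.
same_values = bool(result.astype(object).equals(s.astype(object)))
same_codes = bool((result.cat.codes == s.cat.codes).all())

# Best of 3 repeats, 10 calls each, reported per call.
best = min(timeit.repeat(run, number=10, repeat=3)) / 10
print(f"set_categories: {best * 1e3:.2f} ms per call")
```

Running the same script on both branches gives a before/after pair comparable in spirit to the `%timeit` output.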

@TomAugspurger TomAugspurger added the Categorical (Categorical Data Type) and Performance (Memory or execution speed performance) labels Sep 13, 2017
@TomAugspurger
Contributor Author

Here's the ASV:

[100.00%] ··· Running categoricals.Categoricals2.time_set_categories    80.8±1ms

    before     after       ratio
  [f11bbf2f] [6d72836e]
-  80.8±1ms   15.4±0.3ms      0.19  categoricals.Categoricals2.time_set_categories

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
@TomAugspurger TomAugspurger force-pushed the set_categories-perf branch 2 times, most recently from 17dd9f9 to f74da7e Compare September 13, 2017 19:46
@pep8speaks

pep8speaks commented Sep 13, 2017

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on September 14, 2017 at 10:55 Hours UTC
@codecov

codecov bot commented Sep 13, 2017

Codecov Report

Merging #17515 into master will decrease coverage by 0.04%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17515      +/-   ##
==========================================
- Coverage   91.24%    91.2%   -0.05%
==========================================
  Files         163      163
  Lines       49582    49586       +4
==========================================
- Hits        45242    45225      -17
- Misses       4340     4361      +21
| Flag | Coverage Δ | |
|---|---|---|
| #multiple | 88.99% <100%> (-0.03%) | ⬇️ |
| #single | 40.2% <18.18%> (-0.06%) | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/core/dtypes/concat.py | 98.26% <100%> (-0.03%) | ⬇️ |
| pandas/core/categorical.py | 95.57% <100%> (+0.04%) | ⬆️ |
| pandas/io/gbq.py | 25% <0%> (-58.34%) | ⬇️ |
| pandas/plotting/_converter.py | 63.23% <0%> (-1.82%) | ⬇️ |
| pandas/core/frame.py | 97.77% <0%> (-0.1%) | ⬇️ |

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 97abd2c...d238e3e. Read the comment docs.

@jreback jreback added this to the 0.21.0 milestone Sep 13, 2017
Examples
--------
>>> old_cat = pd.Index(['b', 'a', 'c'])
Contributor

isn't this just a special case of union_categoricals?

Contributor

these both remap codes; seems like they should share code

Contributor Author

I see. There was a little bit I could extract from union_categoricals. Simplified things a bit too by using take_1d. See my latest commit.
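To illustrate the idea of remapping codes discussed in this thread, here is a minimal sketch. It is not the code from this PR: the function name `recode_codes` is hypothetical, and it uses the public `Index.get_indexer` plus plain NumPy indexing in place of the internal `take_1d`. The point is that only the (small) set of categories is looked up, and the per-element work is a single integer take over the codes.

```python
import numpy as np
import pandas as pd


def recode_codes(codes, old_categories, new_categories):
    """Hypothetical helper: translate codes from old to new categories."""
    codes = np.asarray(codes)
    # Position of every old category inside the new categories,
    # or -1 when the old category is not present in the new set.
    indexer = pd.Index(new_categories).get_indexer(pd.Index(old_categories))
    # Append a -1 sentinel so that a missing-value code of -1 indexes the
    # last slot and stays -1 after the take.
    lookup = np.append(indexer, -1)
    return lookup[codes]


old_cat = pd.Index(['b', 'a', 'c'])
new_cat = pd.Index(['a', 'b'])
codes = np.array([0, 1, 1, 2, -1], dtype='int8')

new_codes = recode_codes(codes, old_cat, new_cat)
print(new_codes.tolist())  # [1, 0, 0, -1, -1]

# Cross-check against the public API on the same data.
expected = pd.Categorical.from_codes(codes, categories=old_cat)
expected = expected.set_categories(new_cat).codes
```

Here `'c'` is dropped from the new categories, so its code becomes -1, and the existing missing value stays -1.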

Contributor

looks nice.

@TomAugspurger
Contributor Author

TomAugspurger commented Sep 14, 2017 via email

Master:

```python
In [1]: import pandas as pd; import numpy as np

In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10, size=500000)]; s = pd.Series(arr).astype('category')

In [3]: %timeit s.cat.set_categories(s.cat.categories)
68.5 ms ± 846 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

HEAD:

```python
In [1]: import pandas as pd; import numpy as np

In [2]: arr = ['s%04d' % i for i in np.random.randint(0, 500000 // 10, size=500000)]; s = pd.Series(arr).astype('category')

In [3]: %timeit s.cat.set_categories(s.cat.categories)
7.43 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Closes pandas-dev#17508
@jreback jreback merged commit 0097cb7 into pandas-dev:master Sep 14, 2017
@jreback
Contributor

jreback commented Sep 14, 2017

@TomAugspurger TomAugspurger deleted the set_categories-perf branch September 14, 2017 23:30
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017