@immerrr
Contributor

immerrr commented Jul 1, 2014

This PR speeds up MultiIndex.from_product by exploiting the fact that operating on categorical codes is faster than operating on the values themselves.

This yields about a 2x improvement in the benchmark:
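To illustrate the idea, here is a minimal sketch (not the code from this PR) of building the product from integer codes rather than from the values. The helper name `from_product_via_codes` is hypothetical, and the sketch assumes the `codes` keyword of current pandas, which was called `labels` when this PR was written.

```python
import math

import numpy as np
import pandas as pd


def from_product_via_codes(iterables):
    """Hypothetical helper: Cartesian-product MultiIndex built from codes."""
    levels, small_codes = [], []
    for values in iterables:
        # Factorize each input once, while it is still small.
        codes, uniques = pd.factorize(pd.Index(values), sort=True)
        levels.append(uniques)
        small_codes.append(codes)

    sizes = [len(c) for c in small_codes]
    full_codes = []
    for i, codes in enumerate(small_codes):
        inner = math.prod(sizes[i + 1:])  # how often each element repeats
        outer = math.prod(sizes[:i])      # how often the whole block repeats
        # Only cheap integer arrays are expanded, never the values.
        full_codes.append(np.tile(np.repeat(codes, inner), outer))

    return pd.MultiIndex(levels=levels, codes=full_codes)


data = [["b", "a", "c"], [1.5, 0.5]]
result = from_product_via_codes(data)
print(result.equals(pd.MultiIndex.from_product(data)))
```

The expensive hashing step touches each input only once; everything after that is integer `repeat`/`tile` work.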

In [1]: import pandas.util.testing as tm
In [2]: data = [tm.makeStringIndex(10000), tm.makeFloatIndex(20)]
In [3]: %timeit pd.MultiIndex.from_product(data)
100 loops, best of 3: 10.6 ms per loop
In [4]: %timeit pd.MultiIndex.from_arrays(pd.tools.util.cartesian_product(data))
10 loops, best of 3: 23.4 ms per loop

It's only marginally slower for small inputs:

In [1]: data = [np.arange(20).astype(object), np.arange(20)]
In [2]: %timeit pd.MultiIndex.from_product(data)
1000 loops, best of 3: 317 µs per loop
In [3]: %timeit pd.MultiIndex.from_arrays(pd.tools.util.cartesian_product(data))
1000 loops, best of 3: 308 µs per loop
In [4]: data_int = [np.arange(20), np.arange(20)]
In [5]: %timeit pd.MultiIndex.from_product(data_int)
1000 loops, best of 3: 285 µs per loop
In [6]: %timeit pd.MultiIndex.from_arrays(pd.tools.util.cartesian_product(data_int))
1000 loops, best of 3: 269 µs per loop

This next case came as a surprise: the Cartesian product itself is very fast in both the old and new versions, but profiling showed that factorization is a lot faster when done on a smaller array:

In [7]: data_large = [np.arange(10000), np.arange(20)]
In [8]: %timeit pd.MultiIndex.from_arrays(pd.tools.util.cartesian_product(data_large))
100 loops, best of 3: 9.88 ms per loop
In [9]: %timeit pd.MultiIndex.from_product(data_large)
100 loops, best of 3: 2.74 ms per loop
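As a rough illustration of why the order of operations matters (an assumption-labelled sketch, not the profiling done for this PR): factorizing the 10000-element input and then repeating its codes gives the same result as factorizing the expanded 200000-element array, but the factorization step sees 20 times fewer elements. The variable names below are made up for the example.

```python
import numpy as np
import pandas as pd

left = np.arange(10000)
right = np.arange(20)

# Old order: expand the values first, then factorize 200000 elements.
expanded = np.repeat(left, len(right))
codes_late, uniques_late = pd.factorize(expanded, sort=True)

# New order: factorize 10000 elements, then expand the integer codes.
codes_small, uniques_early = pd.factorize(left, sort=True)
codes_early = np.repeat(codes_small, len(right))

# Both orders should agree on codes and on the unique values.
same = (np.array_equal(codes_late, codes_early)
        and np.array_equal(uniques_late, uniques_early))
print(same)
```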
@immerrr immerrr changed the title from "(WIP) PERF: optimize MultiIndex.from_product" to "PERF: optimize MultiIndex.from_product" Jul 1, 2014
@shoyer
Member

shoyer commented Jul 1, 2014

Looks like a nice speedup, but could you please verify that #6439 (cartesian product of a DatetimeIndex) is still fixed?

e.g.:

import pandas as pd
idx = pd.MultiIndex.from_product([[1, 2], pd.date_range('2000-01-01', periods=2)]).values
print [x.day for _, x in idx]  # should print [1, 2, 1, 2]

In retrospect, I should have added a test for MultiIndex.from_product in #6451.
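For reference, a regression test along these lines might look like the sketch below. It is written for Python 3 and current pandas rather than the 2014 code base, the test name is hypothetical, and it is not the test that was eventually added in this PR.

```python
import pandas as pd


def test_from_product_datetimeindex():
    # Hypothetical test name; mirrors the snippet above in Python 3 syntax.
    dates = pd.date_range("2000-01-01", periods=2)
    mi = pd.MultiIndex.from_product([[1, 2], dates])

    # Each element of .values is a (int, Timestamp) tuple; the dates
    # should cycle through day 1 and day 2 for every first-level value.
    days = [ts.day for _, ts in mi.values]
    assert days == [1, 2, 1, 2]


test_from_product_datetimeindex()
```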

@jorisvandenbossche
Member

@shoyer You can always still add a test for that in a new PR

@immerrr
Contributor Author

immerrr commented Jul 1, 2014

@shoyer works for me

In [24]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import pandas as pd
:idx = pd.MultiIndex.from_product([[1, 2], pd.date_range('2000-01-01', periods=2)]).values
:print [x.day for _, x in idx]
:--
[1, 2, 1, 2]
@jreback
Contributor

jreback commented Jul 1, 2014

IIRC there are tests for using DatetimeIndex in MultiIndex.from_product. @immerrr if there's not enough test coverage, pls add (otherwise ok)

@jreback jreback added this to the 0.14.1 milestone Jul 1, 2014
@jreback
Contributor

jreback commented Jul 1, 2014

@immerrr looks ok, pls verify test coverage then can merge

@immerrr
Contributor Author

immerrr commented Jul 1, 2014

Ok, added the test

@jreback
Contributor

jreback commented Jul 1, 2014

ok, ping when green

@immerrr
Contributor Author

immerrr commented Jul 1, 2014

good to go

jreback added a commit that referenced this pull request Jul 1, 2014
@jreback jreback merged commit c8a3eba into pandas-dev:master Jul 1, 2014
@jreback
Contributor

jreback commented Jul 1, 2014

thanks!

@immerrr immerrr deleted the perf-multiindex-fromproduct branch July 1, 2014 11:38
@shoyer
Member

shoyer commented Jul 1, 2014

@immerrr Thanks for adding that test!

@immerrr
Contributor Author

immerrr commented Jul 1, 2014

@shoyer you're welcome :)


Labels

MultiIndex Performance Memory or execution speed performance

4 participants