PERF: use .values in index difference #11279

max-sixty · 2015-10-10T01:16:40Z

The existing .difference method 'unboxed' all the objects, which has a severe performance impact on PeriodIndex in particular.

In [3]: long_index = pd.period_range(start='2000', freq='s', periods=1000)
In [4]: empty_index = pd.PeriodIndex([],freq='s')
In [24]: %timeit long_index.difference(empty_index)
# existing: 1 loops, best of 1: 1.02 s per loop
# updated:  1000 loops, best of 3: 538 µs per loop

...so around 2000x

I haven't worked with asv or the like - is this a case where a test like that is required?
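For anyone wanting to reproduce the timing outside IPython, here is a minimal sketch using the standard-library `timeit` module rather than asv. The function name `run_difference` is only an illustration, and the absolute numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Same inputs as the IPython session in this comment.
long_index = pd.period_range(start="2000", freq="s", periods=1000)
empty_index = pd.PeriodIndex([], freq="s")


def run_difference():
    # Hypothetical wrapper so timeit can call the operation directly.
    return long_index.difference(empty_index)


# Best of three single runs, roughly what %timeit reports.
best = min(timeit.repeat(run_difference, number=1, repeat=3))
print(f"best of 3: {best * 1e3:.3f} ms per call")
```

Comparing this figure before and after the change gives a rough speed-up ratio without any benchmark infrastructure.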

jorisvandenbossche · 2015-10-10T10:25:07Z

pandas/core/index.py

maybe nice to add a test for this one? (that it keeps the correct class)
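A test along these lines might cover the class-preservation point. This is only a sketch: `check_difference_keeps_class` is a hypothetical helper, not an existing pandas test utility, and it assumes the result of `difference` should have exactly the caller's index type.

```python
import pandas as pd


def check_difference_keeps_class(index, other):
    """Hypothetical helper: difference should return the caller's index class."""
    result = index.difference(other)
    assert type(result) is type(index), (type(result), type(index))
    return result


periods = pd.period_range(start="2000", freq="s", periods=5)
dates = pd.date_range(start="2000", freq="D", periods=5)

# Remove the first two entries from each index and check the result type.
period_result = check_difference_keeps_class(periods, periods[:2])
date_result = check_difference_keeps_class(dates, dates[:2])
```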

jreback · 2015-10-10T18:53:58Z

there are quite a number of tests in tseries/tests/test_base for this type of behavior FYI

max-sixty · 2015-10-10T19:28:21Z

OK cheers @jreback. At the moment I'm getting a number of failures similar to the one below - I think it's where this operates on MultiIndexes.
I don't know how well multi_index._shallow_copy(multi_index.values) == multi_index works?
I can branch the logic depending on whether it's a MultiIndex or not - unless you have an alternative?

======================================================================
ERROR: test_stack_partial_multiIndex (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/tests/test_frame.py", line 13998, in test_stack_partial_multiIndex
    _test_stack_with_multiindex(full_multiindex[multiindex_columns])
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/tests/test_frame.py", line 13969, in _test_stack_with_multiindex
    result = df.stack(level=level, dropna=False)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 3745, in stack
    return stack(self, level, dropna=dropna)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/reshape.py", line 481, in stack
    return _stack_multi_columns(frame, level_num=level_num, dropna=dropna)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/reshape.py", line 648, in _stack_multi_columns
    result = DataFrame(new_data, index=new_index, columns=new_columns)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 227, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 322, in _init_dict
    data = dict((k, v) for k, v in compat.iteritems(data)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 323, in <genexpr>
    if k in columns)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/index.py", line 1116, in __contains__
    return key in self._engine
  File "pandas/index.pyx", line 99, in pandas.index.IndexEngine.__contains__ (pandas/index.c:2749)
  File "pandas/index.pyx", line 261, in pandas.index.IndexEngine._ensure_mapping_populated (pandas/index.c:5304)
  File "pandas/index.pyx", line 267, in pandas.index.IndexEngine.initialize (pandas/index.c:5408)
  File "pandas/hashtable.pyx", line 703, in pandas.hashtable.PyObjectHashTable.map_locations (pandas/hashtable.c:12850)
ValueError: Does not understand character buffer dtype format string ('w')

jreback · 2015-10-10T19:38:29Z

looks like something else is going on
_shallow_copy should work; it's overridden for MultiIndex

jreback · 2015-10-15T22:26:21Z

any progress?

max-sixty · 2015-10-15T22:37:58Z

@jreback not yet - will look at it this weekend. Thanks for the ping

jreback · 2015-11-18T20:16:12Z

@MaximilianR if you'd like to update would be gr8

max-sixty · 2015-11-19T01:16:06Z

I had a go at debugging this. But I'm struggling, since the errors happen on the Cython side - I need to get up to speed on how to debug those.
If anyone has any guidance, I'm very open to ideas. Otherwise it'll be a few weeks at least, I think.

jreback · 2015-12-06T19:17:55Z

@MaximilianR can you rebase / update

max-sixty · 2015-12-08T04:07:51Z

I still get this error below. I'm really not sure how to debug the pyx files - although keen to learn. Any guidance?

======================================================================
ERROR: test_stack_partial_multiIndex (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/tests/test_frame.py", line 14305, in test_stack_partial_multiIndex
    _test_stack_with_multiindex(full_multiindex[multiindex_columns])
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/tests/test_frame.py", line 14276, in _test_stack_with_multiindex
    result = df.stack(level=level, dropna=False)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 3803, in stack
    return stack(self, level, dropna=dropna)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/reshape.py", line 481, in stack
    return _stack_multi_columns(frame, level_num=level_num, dropna=dropna)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/reshape.py", line 648, in _stack_multi_columns
    result = DataFrame(new_data, index=new_index, columns=new_columns)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 226, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 323, in _init_dict
    data = dict((k, v) for k, v in compat.iteritems(data)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 324, in <genexpr>
    if k in columns)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/index.py", line 1161, in __contains__
    return key in self._engine
  File "pandas/index.pyx", line 99, in pandas.index.IndexEngine.__contains__ (pandas/index.c:2749)
  File "pandas/index.pyx", line 261, in pandas.index.IndexEngine._ensure_mapping_populated (pandas/index.c:5304)
  File "pandas/index.pyx", line 267, in pandas.index.IndexEngine.initialize (pandas/index.c:5408)
  File "pandas/hashtable.pyx", line 703, in pandas.hashtable.PyObjectHashTable.map_locations (pandas/hashtable.c:12518)
ValueError: Does not understand character buffer dtype format string ('w')
----------------------------------------------------------------------

jreback · 2015-12-09T15:12:17Z

go up the stack when debugging. somehow new_columns is created with a dtype of S1, which is invalid; this violates some guarantees there. So you have to trace where this is happening (prob the _shallow_copy may need a hint)

> /Users/jreback/pandas/pandas/core/reshape.py(648)_stack_multi_columns()
-> result = DataFrame(new_data, index=new_index, columns=new_columns)
(Pdb) p new_data
{'A': array([ nan, 2., nan, nan, 5., nan, nan, 8., nan]), 'B': array([ 0., nan, 1., 3., nan, 4., 6., nan, 7.])}
(Pdb) p new_index
MultiIndex(levels=[[0, 1, 2], [u'u', u'x', u'y', u'z']], labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [1, 2, 3, 1, 2, 3, 1, 2, 3]], names=[None, u'Lower'])
(Pdb) p new_columns
Index([u'A', u'B'], dtype='|S1', name=u'Upper')
(Pdb) !new_columns = Index(new_columns.values,name=new_columns.name)
*** NameError: name 'Index' is not defined
(Pdb) from pandas import Index
(Pdb) !new_columns = Index(new_columns.values,name=new_columns.name)
(Pdb) p new_columns
Index([u'A', u'B'], dtype='object', name=u'Upper')
(Pdb) p DataFrame(new_data, index=new_index, columns=new_columns)
Upper      A   B
  Lower
0 x      NaN   0
  y        2 NaN
  z      NaN   1
1 x      NaN   3
  y        5 NaN
  z      NaN   4
2 x      NaN   6
  y        8 NaN
  z      NaN   7
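The session above gets a working index by rebuilding it from its values. One way to express the same idea defensively is sketched below; `to_object_labels` is a hypothetical helper, not pandas API and not the fix adopted in this PR. It assumes the labels arrive as a fixed-width NumPy string array and casts them to object before the index is built.

```python
import numpy as np
import pandas as pd


def to_object_labels(values):
    """Hypothetical helper: cast fixed-width string arrays ('S'/'U') to object."""
    values = np.asarray(values)
    if values.dtype.kind in ("S", "U"):
        return values.astype(object)
    return values


# A fixed-width byte-string array, like the '|S1' dtype seen in the session.
fixed = np.array([b"A", b"B"], dtype="S1")
as_object = to_object_labels(fixed)

# Build the index from the object array, keeping the name from the session.
new_columns = pd.Index(as_object, name="Upper")
print(fixed.dtype, as_object.dtype, new_columns.dtype)
```

Where such a cast would belong (for example inside `_shallow_copy`, as suggested above) is left open here.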

max-sixty · 2015-12-09T15:34:38Z

OK thanks, I'll try that angle

jreback · 2016-01-06T17:18:33Z

@MaximilianR pls reopen if you would like to update

max-sixty · 2016-01-06T17:33:28Z

OK, I will aim to come back to this one at some point

jreback · 2016-01-06T17:35:24Z

np. just trying to keep outstanding PRs to a minimum.

max-sixty changed the title ~~PER: use .values in index difference~~ PERF: use .values in index difference Oct 10, 2015

max-sixty force-pushed the index-setops-speed branch from c610191 to d483846 Compare October 10, 2015 03:21

jorisvandenbossche reviewed Oct 10, 2015

jreback added the Indexing, Performance, and Period labels Oct 10, 2015

max-sixty force-pushed the index-setops-speed branch from d483846 to b3fbdd5 Compare October 10, 2015 19:05

max-sixty force-pushed the index-setops-speed branch from b3fbdd5 to 224791a Compare October 17, 2015 18:51

use .values in index difference

19cc65d

max-sixty force-pushed the index-setops-speed branch from 224791a to 19cc65d Compare December 8, 2015 04:02

jreback closed this Jan 6, 2016

jreback mentioned this pull request Jan 15, 2016

Index.difference performance #12044

Closed

max-sixty deleted the index-setops-speed branch December 22, 2016 05:57
