Conversation

@max-sixty
Contributor

The existing .difference method 'unboxed' all the objects, which has a severe performance impact on PeriodIndex in particular.

In [3]: long_index = pd.period_range(start='2000', freq='s', periods=1000)
In [4]: empty_index = pd.PeriodIndex([], freq='s')
In [24]: %timeit long_index.difference(empty_index)
# existing: 1 loops, best of 1: 1.02 s per loop
# updated:  1000 loops, best of 3: 538 µs per loop

...so around 2000x faster (1.02 s vs. 538 µs)

I haven't worked with asv or the like - is this a case where a test like that is required?
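For reference, the timing above could be expressed as an asv-style benchmark. This is a minimal sketch, not the benchmark from the pandas suite: the class name `PeriodIndexDifference` and the method name `time_difference_empty` are hypothetical, and the only asv convention relied on is that `setup()` runs before each `time_*` method is timed. The `timeit` call at the end lets the class be exercised without asv installed.

```python
import timeit

import pandas as pd


class PeriodIndexDifference:
    """asv-style benchmark: asv calls setup(), then times each time_* method."""

    def setup(self):
        # Same inputs as the IPython session in the PR description.
        self.long_index = pd.period_range(start="2000", freq="s", periods=1000)
        self.empty_index = pd.PeriodIndex([], freq="s")

    def time_difference_empty(self):
        self.long_index.difference(self.empty_index)


# Run the benchmark body by hand with timeit (no asv required).
bench = PeriodIndexDifference()
bench.setup()
runs = 100
seconds = timeit.timeit(bench.time_difference_empty, number=runs)
print(f"{seconds / runs * 1e6:.0f} µs per call")
```

Absolute numbers will differ from the ones quoted in the description, since they depend on the pandas version and the machine.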

@max-sixty max-sixty changed the title from "PER: use .values in index difference" to "PERF: use .values in index difference" on Oct 10, 2015
Member

maybe nice to add a test for this one? (that it keeps the correct class)
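A test along the lines suggested here might look like the following. It is a sketch under assumptions: the test name is hypothetical, it uses only the public `difference`, `equals`, and `freq` API, and it is not taken from the existing tests in tseries/tests/test_base.

```python
import pandas as pd


def test_difference_preserves_class_and_freq():
    # Hypothetical test: difference against an empty PeriodIndex should
    # return a PeriodIndex with the same freq and the same elements.
    long_index = pd.period_range(start="2000", freq="s", periods=10)
    empty_index = pd.PeriodIndex([], freq="s")

    result = long_index.difference(empty_index)

    assert isinstance(result, pd.PeriodIndex)
    assert result.freq == long_index.freq
    assert result.equals(long_index)


test_difference_preserves_class_and_freq()
```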

@jreback
Contributor

jreback commented Oct 10, 2015

there are quite a number of tests in tseries/tests/test_base for this type of behavior FYI

@jreback jreback added the Indexing, Performance, and Period labels on Oct 10, 2015
@max-sixty
Contributor Author

OK cheers @jreback. At the moment I'm getting a number of failures similar to the one below - I think it's where this operates on MultiIndexes.
I don't know how well multi_index._shallow_copy(multi_index.values) == multi_index works?
I can branch the logic depending on whether it's a MultiIndex or not - unless you have an alternative?
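The round-trip question can be checked with public API only. The sketch below does not call the private `_shallow_copy`; it assumes that rebuilding with `MultiIndex.from_tuples` is a fair stand-in for "reconstruct the index from `.values`", which may not match what `_shallow_copy` did in the pandas version under discussion.

```python
import pandas as pd

# A small MultiIndex; the level values and names are illustrative only.
mi = pd.MultiIndex.from_product([[0, 1], ["x", "y"]], names=["upper", "lower"])

# .values on a MultiIndex is a 1-D object array of tuples.
values = mi.values
print(values.dtype, values.ndim)

# Rebuild from the tuples and compare with the original.
rebuilt = pd.MultiIndex.from_tuples(list(values), names=mi.names)
print(rebuilt.equals(mi))
```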

======================================================================
ERROR: test_stack_partial_multiIndex (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/tests/test_frame.py", line 13998, in test_stack_partial_multiIndex
    _test_stack_with_multiindex(full_multiindex[multiindex_columns])
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/tests/test_frame.py", line 13969, in _test_stack_with_multiindex
    result = df.stack(level=level, dropna=False)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 3745, in stack
    return stack(self, level, dropna=dropna)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/reshape.py", line 481, in stack
    return _stack_multi_columns(frame, level_num=level_num, dropna=dropna)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/reshape.py", line 648, in _stack_multi_columns
    result = DataFrame(new_data, index=new_index, columns=new_columns)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 227, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 322, in _init_dict
    data = dict((k, v) for k, v in compat.iteritems(data)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 323, in <genexpr>
    if k in columns)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/index.py", line 1116, in __contains__
    return key in self._engine
  File "pandas/index.pyx", line 99, in pandas.index.IndexEngine.__contains__ (pandas/index.c:2749)
  File "pandas/index.pyx", line 261, in pandas.index.IndexEngine._ensure_mapping_populated (pandas/index.c:5304)
  File "pandas/index.pyx", line 267, in pandas.index.IndexEngine.initialize (pandas/index.c:5408)
  File "pandas/hashtable.pyx", line 703, in pandas.hashtable.PyObjectHashTable.map_locations (pandas/hashtable.c:12850)
ValueError: Does not understand character buffer dtype format string ('w')
@jreback
Contributor

jreback commented Oct 10, 2015

looks like something else is going on
_shallow_copy should work; it's overridden for MultiIndex

@jreback
Contributor

jreback commented Oct 15, 2015

any progress?

@max-sixty
Contributor Author

@jreback not yet - will look at it this weekend. Thanks for the ping

@jreback
Contributor

jreback commented Nov 18, 2015

@MaximilianR if you'd like to update would be gr8

@max-sixty
Contributor Author

I had a go at debugging this. But I'm struggling, since the errors happen on the Cython side - I need to get up to speed on how to debug those.
If anyone has any guidance, I'm very open to ideas. Otherwise it'll be a few weeks at least, I think.

@jreback
Contributor

jreback commented Dec 6, 2015

@MaximilianR can you rebase / update

@max-sixty
Contributor Author

I still get this error below. I'm really not sure how to debug the pyx files - although keen to learn. Any guidance?

======================================================================
ERROR: test_stack_partial_multiIndex (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/tests/test_frame.py", line 14305, in test_stack_partial_multiIndex
    _test_stack_with_multiindex(full_multiindex[multiindex_columns])
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/tests/test_frame.py", line 14276, in _test_stack_with_multiindex
    result = df.stack(level=level, dropna=False)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 3803, in stack
    return stack(self, level, dropna=dropna)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/reshape.py", line 481, in stack
    return _stack_multi_columns(frame, level_num=level_num, dropna=dropna)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/reshape.py", line 648, in _stack_multi_columns
    result = DataFrame(new_data, index=new_index, columns=new_columns)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 226, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 323, in _init_dict
    data = dict((k, v) for k, v in compat.iteritems(data)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/frame.py", line 324, in <genexpr>
    if k in columns)
  File "/Users/maximilianroos/Dropbox/workspace/pandas/pandas/core/index.py", line 1161, in __contains__
    return key in self._engine
  File "pandas/index.pyx", line 99, in pandas.index.IndexEngine.__contains__ (pandas/index.c:2749)
  File "pandas/index.pyx", line 261, in pandas.index.IndexEngine._ensure_mapping_populated (pandas/index.c:5304)
  File "pandas/index.pyx", line 267, in pandas.index.IndexEngine.initialize (pandas/index.c:5408)
  File "pandas/hashtable.pyx", line 703, in pandas.hashtable.PyObjectHashTable.map_locations (pandas/hashtable.c:12518)
ValueError: Does not understand character buffer dtype format string ('w')
----------------------------------------------------------------------
@jreback
Contributor

jreback commented Dec 9, 2015

Go up the stack when debugging. Somehow new_columns is created with a dtype of S1, which is invalid; this violates some guarantees there. So you have to trace where this is happening (probably the _shallow_copy needs a hint).

> /Users/jreback/pandas/pandas/core/reshape.py(648)_stack_multi_columns()
-> result = DataFrame(new_data, index=new_index, columns=new_columns)
(Pdb) p new_data
{'A': array([ nan, 2., nan, nan, 5., nan, nan, 8., nan]), 'B': array([ 0., nan, 1., 3., nan, 4., 6., nan, 7.])}
(Pdb) p new_index
MultiIndex(levels=[[0, 1, 2], [u'u', u'x', u'y', u'z']], labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [1, 2, 3, 1, 2, 3, 1, 2, 3]], names=[None, u'Lower'])
(Pdb) p new_columns
Index([u'A', u'B'], dtype='|S1', name=u'Upper')
(Pdb) !new_columns = Index(new_columns.values,name=new_columns.name)
*** NameError: name 'Index' is not defined
(Pdb) from pandas import Index
(Pdb) !new_columns = Index(new_columns.values,name=new_columns.name)
(Pdb) p new_columns
Index([u'A', u'B'], dtype='object', name=u'Upper')
(Pdb) p DataFrame(new_data, index=new_index, columns=new_columns)
Upper      A   B
  Lower
0 x      NaN   0
  y        2 NaN
  z      NaN   1
1 x      NaN   3
  y        5 NaN
  z      NaN   4
2 x      NaN   6
  y        8 NaN
  z      NaN   7
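The pdb session above shows that an Index built through the public constructor ends up with object dtype. The following is a rough analogue on Python 3 with a current pandas, not a reproduction of the original failure (which occurred on Python 2 with the pandas of the time). It assumes that the `Index` constructor converts NumPy fixed-width bytes arrays to object dtype; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Fixed-width bytes array, analogous to the '|S1' values seen in the pdb session.
raw = np.array(["A", "B"], dtype="S1")

# Build the Index through the public constructor, passing the name along,
# in the same spirit as Index(new_columns.values, name=new_columns.name).
columns = pd.Index(raw, name="Upper")

print(raw.dtype)      # NumPy dtype of the raw array
print(columns.dtype)  # dtype pandas chose for the Index
```

Comparing the two printed dtypes is a quick way to see whether a code path bypassed the constructor's dtype handling, which is the kind of check suggested by "go up the stack".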
@max-sixty
Contributor Author

OK thanks, I'll try that angle

@jreback
Contributor

jreback commented Jan 6, 2016

@MaximilianR pls reopen if you would like to update

@jreback jreback closed this Jan 6, 2016
@max-sixty
Contributor Author

OK, I will aim to come back to this one at some point

@jreback
Contributor

jreback commented Jan 6, 2016

np. just trying to keep outstanding PRs to a minimum.

@max-sixty max-sixty deleted the index-setops-speed branch December 22, 2016 05:57
Labels

Indexing (related to indexing on series/frames, not to indexes themselves), Performance (memory or execution speed performance), Period (Period data type)

3 participants