Skip to content

Conversation

@chris-b1
Copy link
Contributor

@chris-b1 chris-b1 commented Aug 12, 2016

AFAIK this only affects 64 bit python on Windows.

numpy wants an np.intp (i8 on Windows) as a indexer for take, but pandas defines a "platform int" as a np.int_ (i4 on Windows). This hits performance twice, because we often start with i8, cast to i4, then numpy will cast back to i8 in its take.

This is an alternative to #13924 - there I explored replacing ndarray.take with our take_nd fully, but this approach solves the perf problem without much pain or risking new segfaults.

I'd still need to adjust a bunch of tests here to pass on Windows. This is an API change for "advanced" methods like get_indexer, but I don't think anything is necessary beyond a doc note?

ASV:

 before after ratio [4a80521 ] [3ced5d5 ] 3.49ms 2.86ms 0.82 indexing.series_take_dtindex.time_series_take_dtindex 3.28ms 3.03ms 0.92 indexing.series_take_intindex.time_series_take_intindex 15.16ms 14.57ms 0.96 join_merge.join_dataframe_index_single_key_bigger.time_join_dataframe_index_single_key_bigger 19.81ms 16.62ms 0.84 join_merge.join_dataframe_index_single_key_bigger_sort.time_join_dataframe_index_single_key_bigger_sort 14.05ms 13.63ms 0.97 join_merge.join_dataframe_index_single_key_small.time_join_dataframe_index_single_key_small 4.83ms 4.64ms 0.96 join_merge.join_dataframe_integer_2key.time_join_dataframe_integer_2key 1.80ms 1.63ms 0.90 join_merge.join_dataframe_integer_key.time_join_dataframe_integer_key 633.31us 570.37us 0.90 join_merge.join_non_unique_equal.time_join_non_unique_equal 2.39s 1.36s 0.57 join_merge.left_outer_join_index.time_left_outer_join_index 
@codecov-io
Copy link

codecov-io commented Aug 12, 2016

Current coverage is 85.29% (diff: 100%)

Merging #13972 into master will increase coverage by 0.02%

@@ master #13972 diff @@ ========================================== Files 139 139 Lines 50219 50245 +26 Methods 0 0 Messages 0 0 Branches 0 0 ========================================== + Hits 42819 42854 +35  + Misses 7400 7391 -9  Partials 0 0 

Powered by Codecov. Last update 4a80521...322b11a

@sinhrks sinhrks added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode Windows Windows OS labels Aug 12, 2016
- Bug in single row slicing on multi-type ``SparseDataFrame``s, types were previously forced to float (:issue:`13917`)

Indexer dtype Changes
^^^^^^^^^^^^^^^^^^^^^
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue ref here.

@jreback
Copy link
Contributor

jreback commented Aug 12, 2016

yes this looks reasonable. go ahead an make ready on windows. Also add an asv as above.

@jreback
Copy link
Contributor

jreback commented Aug 13, 2016

pls update the docs as indicated. otherwise lgtm.

@chris-b1
Copy link
Contributor Author

What else were you looking for in the docs? I did add an issue ref in the first paragraph.

@jreback
Copy link
Contributor

jreback commented Aug 15, 2016

ok this looks fine. can you run this on 32-bit linux as well and see that it comes up clean (IIRC you tests on windows so that should be good).

@jreback jreback added this to the 0.19.0 milestone Aug 15, 2016
@chris-b1
Copy link
Contributor Author

The changes I just pushed were to get this passing on 32 bit windows. I don't have anything handy to test 32 bit Linux, I can probably set up a VM, although I think it should generally act the same 32 bit Windows (np.int_ == np.intp == np.int32)

@jreback
Copy link
Contributor

jreback commented Aug 15, 2016

ok I'll run on 32 bit tomorrow

I just use a macosx / vm

@jreback
Copy link
Contributor

jreback commented Aug 16, 2016

This currently fails on 32-bit linux (before your PR). I recall had to do something on windows to get this to pass. Something wrong somewhere. Your PR fixes this! yeah!

====================================================================== FAIL: test_alignment_non_pandas (pandas.tests.frame.test_operators.TestDataFrameOperators) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/jreback/pandas/pandas/tests/frame/test_operators.py", line 1210, in test_alignment_non_pandas Series([1, 2, 3], index=df.index)) File "/home/jreback/pandas/pandas/util/testing.py", line 1154, in assert_series_equal assert_attr_equal('dtype', left, right) File "/home/jreback/pandas/pandas/util/testing.py", line 878, in assert_attr_equal left_attr, right_attr) File "/home/jreback/pandas/pandas/util/testing.py", line 1018, in raise_assert_detail raise AssertionError(msg) AssertionError: Attributes are different Attribute "dtype" are different [left]: int32 [right]: int64 

So as-is this passes on linux32 (and fixes the above).

lmk when good to go.

@chris-b1
Copy link
Contributor Author

Cool - fixed the lint issue so this should be good to go.

@chris-b1 chris-b1 changed the title PERF/COMPAT: define platform int to np.intp (WIP) PERF/COMPAT: define platform int to np.intp Aug 16, 2016
@jreback jreback closed this in 0780443 Aug 17, 2016
@jreback
Copy link
Contributor

jreback commented Aug 17, 2016

thanks!

@chris-b1 chris-b1 deleted the platform-int branch August 17, 2016 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode Windows Windows OS

4 participants