
Conversation

@stephenwlin
Contributor

As per #3089, for now I'm putting in a "magic" lower limit of 256 bytes for the memmove path (the affected tests were copying only 32 bytes), which seems reasonable from local testing (::dubious::), unless someone on StackOverflow gives me a better idea of what to do.

Also, I realized that only the stride in the dimension of the copy matters (i.e. the entire array doesn't have to be contiguous, only the copied subarrays do), so I relaxed that requirement. Non-contiguous cases don't seem to be exercised by our performance suite, which is unfortunately pretty shallow, but they do happen often in practice; this should be addressed by #3114.
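For illustration, the guard amounts to something like this (a Python-level sketch of the condition only; the real change lives in the Cython templates in pandas/src/generate_code.py, shown in the diff later in this thread):

```python
import numpy as np

def memmove_fast_path_ok(values, out, copy_axis, min_bytes=256):
    """Illustrative check: only delegate to memmove when the copied
    sub-arrays are contiguous in both source and destination (stride equals
    itemsize in the copied dimension) and each copied run is >= min_bytes."""
    itemsize = out.dtype.itemsize
    run_len = values.shape[copy_axis]          # elements copied per memmove call
    return (values.strides[copy_axis] == itemsize and
            out.strides[copy_axis] == itemsize and
            itemsize * run_len >= min_bytes)

# Taking rows out of a C-contiguous float64 array copies along axis 1:
values = np.zeros((1000, 4))                   # each row is 4 * 8 = 32 bytes
out = np.empty((10, 4))
print(memmove_fast_path_ok(values, out, copy_axis=1))  # False: 32 < 256, so use the plain loop
```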

Here are the vbench results on the low (improved) end (<90%):

Test name                          | target[ms] |  base[ms] | ratio
series_drop_duplicates_string           0.7474      1.0182    0.7340
frame_multi_and_st                    763.4459   1008.7359    0.7568
reindex_frame_level_align               1.5111      1.9937    0.7579
reindex_frame_level_reindex             1.5304      1.9711    0.7764
panel_from_dict_equiv_indexes          34.4702     42.6086    0.8090
groupby_frame_cython_many_columns       6.4044      7.4999    0.8539
read_csv_comment2                      40.2233     45.7705    0.8788
groupby_series_simple_cython            7.5779      8.4962    0.8919

and the high (regressed) end (>105%, as there were no cases of >110%):

Test name              | target[ms] |  base[ms] | ratio
frame_ctor_nested_dict     108.7902    103.3220   1.0529
stats_rolling_mean           2.8128      2.6283   1.0702

I suspect the last results are just noise.

This is 32-bit Linux with GCC 4.6.3, so mileage may vary (I still haven't set up a 64-bit environment); if anyone else could test this commit too, that would be great.


EDIT

repeat run results:

Test name                          | target[ms] |  base[ms] | ratio
reindex_frame_level_reindex             1.4861      1.9335    0.7686
reindex_frame_level_align               1.5578      1.9846    0.7849
panel_from_dict_equiv_indexes          35.0858     42.0521    0.8343
groupby_frame_cython_many_columns       6.5504      7.8101    0.8387
read_csv_comment2                      40.2616     46.1578    0.8723
groupby_last_float32                    7.7265      8.6995    0.8882

and

Test name                      | target[ms] |  base[ms] | ratio
frame_reindex_columns               0.4452      0.4237    1.0507
append_frame_single_homogenous      0.4923      0.4669    1.0543
reshape_pivot_time_series         292.3670    277.2169    1.0547
join_dataframe_integer_key          2.8652      2.6741    1.0714
@ghost

ghost commented Mar 21, 2013

64 bit linux, gcc 4.7.2, sse2

./test_perf.sh -t e852b36 -b 4d3e34e -r reindex
-----------------------------------------------------------
Test name                          | target[ms] |  base[ms] | ratio
-----------------------------------------------------------
frame_reindex_upcast                   21.6771     32.5228    0.6665
frame_reindex_both_axes_ix              0.4005      0.6000    0.6674
frame_reindex_axis0                     0.8208      0.9360    0.8769
reindex_multiindex                      1.0762      1.1021    0.9765
frame_reindex_both_axes                 0.3261      0.3307    0.9862
reindex_daterange_pad                   0.1768      0.1786    0.9901
reindex_frame_level_align               0.9876      0.9950    0.9925
reindex_fillna_backfill_float32         0.0964      0.0971    0.9927
reindex_frame_level_reindex             0.9515      0.9580    0.9931
reindex_fillna_backfill                 0.1036      0.1042    0.9940
reindex_fillna_pad_float32              0.0964      0.0969    0.9951
reindex_fillna_pad                      0.1033      0.1037    0.9965
frame_reindex_columns                   0.3024      0.3012    1.0041
dataframe_reindex                       0.3868      0.3837    1.0080
reindex_daterange_backfill              0.1661      0.1639    1.0135
frame_reindex_axis1                     1.9141      1.6649    1.1497
-----------------------------------------------------------
Ratio < 1.0 means the target commit is faster than the baseline.
Target [e852b36] : PERF: Limit memmove to >= 256 bytes, relax contiguity requirements (only the stride in the dimension of the copy matters)
Base   [4d3e34e] : Merge pull request #3127 from jreback/reshape

repeated run

-----------------------------------------------------------
Test name                          | target[ms] |  base[ms] | ratio
-----------------------------------------------------------
reindex_frame_level_reindex             0.9444      1.0020    0.9426
frame_reindex_upcast                   23.8891     24.6829    0.9678
frame_reindex_columns                   0.3012      0.3107    0.9695
reindex_multiindex                      1.1992      1.2264    0.9778
reindex_daterange_backfill              0.1637      0.1666    0.9827
reindex_fillna_pad_float32              0.0959      0.0974    0.9842
reindex_fillna_backfill_float32         0.0957      0.0966    0.9906
reindex_daterange_pad                   0.1759      0.1774    0.9916
reindex_fillna_pad                      0.1029      0.1034    0.9944
reindex_fillna_backfill                 0.1035      0.1035    0.9993
frame_reindex_both_axes_ix              0.3980      0.3977    1.0008
frame_reindex_both_axes                 0.3220      0.3217    1.0011
frame_reindex_axis1                     1.6690      1.6468    1.0135
frame_reindex_axis0                     0.8296      0.8147    1.0183
dataframe_reindex                       0.3955      0.3852    1.0269
reindex_frame_level_align               1.2472      1.0184    1.2247
-----------------------------------------------------------

third run

-----------------------------------------------------------
Test name                          | target[ms] |  base[ms] | ratio
-----------------------------------------------------------
reindex_daterange_backfill              0.1659      0.2478    0.6693
frame_reindex_both_axes_ix              0.4036      0.5960    0.6772
dataframe_reindex                       0.4049      0.5648    0.7169
frame_reindex_upcast                   21.8160     30.0523    0.7259
reindex_frame_level_align               1.0126      1.3830    0.7322
frame_reindex_axis1                     1.8061      2.0984    0.8607
frame_reindex_axis0                     0.8023      0.9220    0.8702
reindex_frame_level_reindex             0.9319      0.9538    0.9770
frame_reindex_both_axes                 0.3199      0.3229    0.9908
frame_reindex_columns                   0.3020      0.3034    0.9953
reindex_daterange_pad                   0.1769      0.1769    0.9997
reindex_fillna_pad                      0.1032      0.1029    1.0028
reindex_fillna_backfill_float32         0.0962      0.0957    1.0055
reindex_fillna_backfill                 0.1045      0.1034    1.0101
reindex_fillna_pad_float32              0.0969      0.0956    1.0136
reindex_multiindex                      1.2075      1.0829    1.1151
-----------------------------------------------------------

still wondering about repeatability of vbenches...

@stephenwlin
Contributor Author

@y-p, weird, it doesn't revert the 50% regression (on 64-bit) in reindex_frame_level_align and reindex_frame_level_reindex? Or maybe your machine doesn't have the regression to begin with? Were you able to reproduce the regression found by @jreback with my original memmove commit? Also, I guess 256 bytes might be too restrictive, since that means we're undoing an optimization we want; maybe I can bump it down to something lower (or make it machine-word-size dependent...). Actually, even weirder is that reindex_frame_level_align goes from 95% to 122% on the repeat run?

@jreback
Contributor

jreback commented Mar 21, 2013

Results:
                                 t_head  t_baseline   ratio
name
reindex_frame_level_reindex      0.5425      1.0084  0.5379
reindex_frame_level_align        0.5625      1.0328  0.5447
groupby_simple_compress_timing  25.1635     31.8418  0.7903

Target   [4bc26b4] : PERF: Limit memmove to >= 64 bytes, relax contiguity requirements (only the stride in the dimension of the copy matters)
Baseline [a6db105] : Merge pull request #3103 from jreback/ensure_clean

Linux goat 2.6.32-5-amd64 #1 SMP Sun Sep 23 10:07:46 UTC 2012 x86_64 GNU/Linux
gcc version 4.4.5 (Debian 4.4.5-8)
@stephenwlin
Contributor Author

@y-p @jreback, I suspect you have processor/compiler differences that are relevant for reindex_frame_level_reindex and reindex_frame_level_align then

@jreback
Contributor

jreback commented Mar 21, 2013

@y-p isn't the point of your build cacher to avoid rebuilding the Cython code? So we can't use it here, because we WANT to rebuild.

@jreback
Contributor

jreback commented Mar 21, 2013

@stephenwlin time to upgrade my gcc I guess!

@stephenwlin
Contributor Author

@jreback, nothing regressed on the high end, right?

@stephenwlin
Contributor Author

@jreback anyway, I guess we reverted the regression found originally, without causing another one or undoing the optimizations in the cases where they're useful (as far as we know...), so that's good; we definitely need to find ways to be more systematic about low-level testing, though.

@jreback
Contributor

jreback commented Mar 21, 2013

i'll post my 2nd run in a min

@jreback
Contributor

jreback commented Mar 21, 2013

no negative results:

Results:
                               t_head  t_baseline   ratio
name
reindex_frame_level_reindex    0.5394      1.0058  0.5363
reindex_frame_level_align      0.5718      1.0237  0.5585

even 20% can be random (because of the random number generation):

groupby_simple_compress_timing    32.0354    32.5414    0.9845
@jreback
Contributor

jreback commented Mar 21, 2013

running the same commits as @y-p (as I was running your 1st memmove commit), though I think you said just the comment changed... will post in a few

@stephenwlin
Contributor Author

@jreback just the comment changed, I diffed to double check

@jreback
Contributor

jreback commented Mar 21, 2013

virtually identical to my above post

4bc26b48d9eec28efe0b43e1e314940052de8c1d to a6db105

@ghost

ghost commented Mar 21, 2013

modulo bugs, the cython cache hack hashes all the input files that go into building the .so,
so changes should trigger a rebuild, and anecdotally, I've confirmed that indeed it does.
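(For the record, the idea is roughly the following; a hypothetical sketch, not the actual build-cache script:)

```python
import hashlib

def cache_key(input_paths):
    """Hash every input that feeds into a built .so; if any of them change,
    the key changes and the cached extension is treated as stale."""
    h = hashlib.md5()
    for path in sorted(input_paths):
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()

# e.g. if cache_key over the generator script and the .pyx sources differs from
# the stored key, the cached .so is not reused and a rebuild is triggered.
```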

@ghost

ghost commented Mar 21, 2013

@stephenwlin, give me a commit hash and I'll rerun an average of 3.

@ghost

ghost commented Mar 21, 2013

Re: the difference between Jeff's machine and mine:
numpy/numpy#324
#2490

demonstrated that there are explicit differences in the memmove family between the Ubuntu
and Debian libc. I'm on Debian testing and I suspect Jeff is on Ubuntu, so that might be a factor.

@stephenwlin
Contributor Author

the original regression was from 759b0db (good) to 10f65e0 (bad); fix should be from 4d3e34e (bad) to e852b36 (good)

@wesm
Member

wesm commented Mar 21, 2013

Nice work. Maybe in like 2017 we'll have an ATLAS-like tuning system for pandas

@stephenwlin
Contributor Author

@wesm low-level tuning is fun :)

my senior thesis was on compiler optimization, specifically specialization based on propagating non-constant but range-limited value information (virtual table pointers in my case, but it's the same general concept). I'm very, very disappointed with gcc here that it can't figure this out; I might just go patch gcc (or clang, if it doesn't do this either; the codebase is much cleaner) for this. That doesn't help us anytime soon, but it could by 2017.

@jreback
Contributor

jreback commented Mar 22, 2013

for consistency, 4bc26b4 to 05a737d (v0.10.1):

Results:
                                            t_head  t_baseline   ratio
name
frame_get_dtype_counts                      0.0971    207.3898  0.0005
frame_wide_repr                             0.5320    207.6731  0.0026
groupby_first_float32                       3.0259    335.9749  0.0090
groupby_last_float32                        3.2037    334.8918  0.0096
frame_to_csv2                             188.0970   2191.5438  0.0858
indexing_dataframe_boolean                 12.8686    126.6370  0.1016
write_csv_standard                         36.8581    232.5151  0.1585
frame_reindex_axis0                         0.3152      1.1079  0.2845
frame_to_csv_mixed                        349.8561   1112.4861  0.3145
frame_to_csv                              116.9171    218.2992  0.5356
frame_add                                  22.0810     34.8116  0.6343
frame_mult                                 21.9506     34.0294  0.6450
frame_reindex_upcast                       11.7105     17.3680  0.6743
frame_fancy_lookup_all                     14.2124     19.7532  0.7195
reindex_frame_level_reindex                 0.5592      0.7404  0.7553
reindex_frame_level_align                   0.5713      0.7551  0.7567
series_string_vector_slice                146.6720    178.7109  0.8207
join_dataframe_index_single_key_bigger     12.6925     14.7876  0.8583
join_dataframe_index_single_key_bigger      5.7026      6.6035  0.8636
dataframe_reindex                           0.3706      0.4290  0.8638
frame_fancy_lookup                          1.6022      1.8267  0.8771
groupby_multi_different_numpy_functions     9.9613     11.2368  0.8865
frame_boolean_row_select                    0.2046      0.2289  0.8937
groupby_multi_different_functions           9.9884     11.1221  0.8981
@ghost

ghost commented Mar 22, 2013

I'm getting too much variability in results on my machine and the iteration
turnaround is killing me. There must be a rogue process hiding on my machine.

Just go with Jeff's results, sorry.

@stephenwlin
Contributor Author

btw, these results:

reindex_frame_level_reindex    0.5592    0.7404    0.7553
reindex_frame_level_align      0.5713    0.7551    0.7567

are most likely due to keeping the output F-contiguous when the input is F-contiguous: I found that this helps even without the memmove optimization, likely because of cache effects. That's why putting that change in was a win independently of the memmove change (both had noticeable positive effects on their own; they only had a negative effect when combined).
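For illustration (a small numpy sketch of the layout point, not the pandas internals): allocating the output with the same memory order as the input keeps each copied column contiguous in both arrays, so the copy walks memory sequentially instead of striding across rows.

```python
import numpy as np

values = np.asfortranarray(np.random.randn(100000, 4))      # F-contiguous input
indexer = np.random.randint(0, values.shape[0], size=50000)

# Match the input's memory order when allocating the output; with order='C'
# every column would have to be written with a large stride instead.
order = 'F' if values.flags.f_contiguous else 'C'
out = np.empty((len(indexer), values.shape[1]), dtype=values.dtype, order=order)
out[:] = values[indexer]
```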

@jreback
Contributor

jreback commented Mar 22, 2013

ok... unless there are any objections, @y-p, @wesm, going to merge this...

@ghost

ghost commented Mar 22, 2013

Nope, great job stephen.

I split up the cleanups from the real changes in a branch on my fork, stephenwlin-memmove-limit,
and added pointers to this issue, since this is somewhat magical. Suggest you merge from
there once travis passes. I preserved authorship, of course.

λ gd e852b36 bef52df
diff --git a/pandas/src/generate_code.py b/pandas/src/generate_code.py
index b80555d..2d58733 100644
--- a/pandas/src/generate_code.py
+++ b/pandas/src/generate_code.py
@@ -93,6 +93,7 @@ def take_2d_axis0_%(name)s_%(dest)s(ndarray[%(c_type_in)s, ndim=2] values,
     cdef:
         %(c_type_out)s *v, *o

+    #GH3130
     if (values.strides[1] == out.strides[1] and
         values.strides[1] == sizeof(%(c_type_out)s) and
         sizeof(%(c_type_out)s) * n >= 256):
@@ -141,6 +142,7 @@ def take_2d_axis1_%(name)s_%(dest)s(ndarray[%(c_type_in)s, ndim=2] values,
     cdef:
         %(c_type_out)s *v, *o

+    #GH3130
     if (values.strides[0] == out.strides[0] and
         values.strides[0] == sizeof(%(c_type_out)s) and
         sizeof(%(c_type_out)s) * n >= 256):
@wesm
Member

wesm commented Mar 22, 2013

I say bombs away whenever you guys are satisfied with this.

@jreback
Contributor

jreback commented Mar 22, 2013

bow to git-fu master @y-p
go ahead and merge when you're ready

@ghost

ghost commented Mar 22, 2013

why, thank you grasshopper.

I use some tools which might make you more productive, but the health and
safety people are afraid your keyboard will catch on fire, with the associated
liability issues etc., so I'm holding off.

will merge.

@jreback
Contributor

jreback commented Mar 22, 2013

@y-p slightly OT, but to avoid merge conflicts on the release notes... should I always add the issue reference at the end of the list (even though it makes it out of order)?

@ghost ghost merged commit bef52df into pandas-dev:master Mar 22, 2013
@ghost

ghost commented Mar 22, 2013

Exactly the opposite: the diff context is shared by multiple commits that way,
which creates the conflict. Randomize to avoid hitting the worst case on average.

@jreback
Contributor

jreback commented Mar 22, 2013

@y-p great thxs

@stephenwlin stephenwlin deleted the memmove-limit branch March 22, 2013 17:41
@stephenwlin
Contributor Author

@y-p I don't know what's up, but by the way, I seem to be getting very different and inconsistent results when using the cache directory... could be a fluke, and YMMV, of course

@ghost

ghost commented Mar 22, 2013

oh no. how do I repro?

@ghost

ghost commented Mar 22, 2013

can you try a sanity check:

λ md5sum /tmp/.pandas_build_cache/* | sort
λ md5sum pandas/*.so | sort

the latter should be a subset of the former.

@stephenwlin
Contributor Author

hmm, seems ok, could be a fluke

@ghost

ghost commented Mar 23, 2013

I've added and pushed a random-seed option to test_perf. Grepping for "seed"
in the vbench repo came up empty, so that ought to have taken care of it
(it seeds Python's random as well as numpy.random).
All runs now use a default seed if not specified explicitly.
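(For illustration, the seeding amounts to something like this; a minimal sketch, not the actual test_perf.py code, and the default seed value here is made up:)

```python
import random
import numpy as np

def seed_rngs(seed=1234):
    """Seed both generators the benchmarks draw data from, so repeated runs
    of the same vbench see identical random inputs."""
    random.seed(seed)
    np.random.seed(seed)

seed_rngs()   # called once, with the default seed, before the suite runs
```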

I'm still seeing the same variability I've always seen with vb.


Labels: Performance (Memory or execution speed performance)
