PERF: pd.Index.is_all_dates doesn't require full-scan inference #6341

immerrr · 2014-02-13T13:59:25Z

While working on #6328 I've stumbled upon this piece that deserves optimization.

Current implementation Index.is_all_dates invokes lib.infer_dtype which does a full-scan dtype inference which is not not always necessary. If I read cython sources correctly, new code is equivalent to the old one and the performance increase comes from short-circuiting the check on the first non-datetime element.

This change cuts roughly 50% of runtime of the following snippet:

import numpy as np import pandas as pd NUM_ELEMENTS = 1e7 s = pd.Series(np.arange(NUM_ELEMENTS), index=['a%s' % i for i in np.arange(NUM_ELEMENTS)]) print("pd.version", pd.__version__) %timeit s[:int(NUM_ELEMENTS/2)]

On recent master:

$ ipython test_index_is_all_dates.ipy pd.version 0.13.1-108-g87b4308 10 loops, best of 3: 55.5 ms per loop

On this branch:

$ ipython test_index_is_all_dates.ipy pd.version 0.13.1-109-g3b43e2b 10 loops, best of 3: 28 ms per loop

jreback · 2014-02-13T14:01:32Z

fyi...this is a cached attribute so only hit once, but it is valid

the infer_dtype routines does do this optimization however

immerrr · 2014-02-13T14:09:37Z

fyi...this is a cached attribute so only hit once, but it is valid

I know that. Unfortunately, the cache doesn't help in case of slicing: the result is always a new object (unless the slice was itself cached), the cache is not calculated yet and the value is required during construction unconditionally (fastpath=True doesn't help).

the infer_dtype routines does do this optimization however

I'm not sure what you mean. lib.infer_dtype needs to know the exact type and thus is forced to scan all the elements. is_all_dates is only interested if all of the series elements are datetimes and the first element which doesn't fit is enough to have the answer.

jreback · 2014-02-13T14:40:45Z

look at the impl of infer_dtype it fails fast. (as does is_datetime_array)

immerrr · 2014-02-13T14:57:42Z

Hmmm... can it be that you confuse these two cases:

In [50]: arr_s = np.array(map(str, range(10000))) In [51]: arr_o = np.array(map(str, range(10000)), dtype=object) In [52]: timeit pd.lib.infer_dtype(arr_s) 1000000 loops, best of 3: 351 ns per loop In [53]: timeit pd.lib.infer_dtype(arr_o) 10000 loops, best of 3: 42 µs per loop In [54]: pd.lib.infer_dtype(arr_s) Out[54]: 'string' In [55]: pd.lib.infer_dtype(arr_o) Out[55]: 'string'

Because, I cannot see how infer_dtype can return 'string' for arr_o without scanning the whole array at line 124 and returning the value at line 125:

 elif PyString_Check(val): if is_string_array(values): return 'string'

jreback · 2014-02-13T15:08:05Z

no that is correct. The is_string_array like all of the array functions fails fast. Of course they DO scan the entire array; but some cases are very optimized, e.g. if its already dtype = datetime64[ns] then it doesn't even need to do anything.

Pls run a perf-check with the change and see if anything changes. and post the results

test_perf.sh

immerrr · 2014-02-13T18:50:05Z

Here's the result. Is there a benchmark for slice indexing?

jreback · 2014-02-13T19:03:15Z

look at vb_suite/indexing.py

so you prob need to create a benchmark which tests this change....nothing is obviously improved (but its a worthwhile change though)

you can use the example from above that you did

immerrr · 2014-02-14T06:28:03Z

Done and updated the gist.

jreback · 2014-02-14T11:24:24Z

nice!

ok just need an entry in release notes (ref this pr as their is no associated issue) and good 2 go

The patch avoids scanning the whole slice in Index.is_all_dates check.

immerrr · 2014-02-14T11:41:38Z

Done.

jreback · 2014-02-14T12:02:24Z

@immerrr thanks.....

sorry about being pendantic....but expect a lot of contributions from you!

jreback added the Performance label Feb 13, 2014

jreback added this to the 0.14.0 milestone Feb 13, 2014

PERF: speed up slice indexing on Series with string keys

e3f666f

The patch avoids scanning the whole slice in Index.is_all_dates check.

jreback merged commit e3f666f into pandas-dev:master Feb 14, 2014

immerrr deleted the fix-generic-is-all-dates-performance branch February 14, 2014 12:35

immerrr mentioned this pull request Feb 16, 2014

PERF: fix performance for series slices (even more) #6372

Merged

dbew mentioned this pull request Jun 11, 2014

series created with DatetimeIndex when index is [None] #7431

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

PERF: pd.Index.is_all_dates doesn't require full-scan inference #6341

PERF: pd.Index.is_all_dates doesn't require full-scan inference #6341

Uh oh!

immerrr commented Feb 13, 2014

jreback commented Feb 13, 2014

immerrr commented Feb 13, 2014

jreback commented Feb 13, 2014

immerrr commented Feb 13, 2014

jreback commented Feb 13, 2014

immerrr commented Feb 13, 2014

jreback commented Feb 13, 2014

immerrr commented Feb 14, 2014

jreback commented Feb 14, 2014

immerrr commented Feb 14, 2014

jreback commented Feb 14, 2014

Labels

2 participants

Uh oh!

PERF: pd.Index.is_all_dates doesn't require full-scan inference #6341

PERF: pd.Index.is_all_dates doesn't require full-scan inference #6341

Uh oh!

Conversation

immerrr commented Feb 13, 2014

jreback commented Feb 13, 2014

immerrr commented Feb 13, 2014

jreback commented Feb 13, 2014

immerrr commented Feb 13, 2014

jreback commented Feb 13, 2014

immerrr commented Feb 13, 2014

jreback commented Feb 13, 2014

immerrr commented Feb 14, 2014

jreback commented Feb 14, 2014

immerrr commented Feb 14, 2014

jreback commented Feb 14, 2014

Labels

2 participants