Skip to content

Conversation

@immerrr
Copy link
Contributor

@immerrr immerrr commented Feb 13, 2014

While working on #6328 I've stumbled upon this piece that deserves optimization.

Current implementation Index.is_all_dates invokes lib.infer_dtype which does a full-scan dtype inference which is not not always necessary. If I read cython sources correctly, new code is equivalent to the old one and the performance increase comes from short-circuiting the check on the first non-datetime element.

This change cuts roughly 50% of runtime of the following snippet:

import numpy as np import pandas as pd NUM_ELEMENTS = 1e7 s = pd.Series(np.arange(NUM_ELEMENTS), index=['a%s' % i for i in np.arange(NUM_ELEMENTS)]) print("pd.version", pd.__version__) %timeit s[:int(NUM_ELEMENTS/2)]

On recent master:

$ ipython test_index_is_all_dates.ipy pd.version 0.13.1-108-g87b4308 10 loops, best of 3: 55.5 ms per loop 

On this branch:

$ ipython test_index_is_all_dates.ipy pd.version 0.13.1-109-g3b43e2b 10 loops, best of 3: 28 ms per loop 
@jreback
Copy link
Contributor

jreback commented Feb 13, 2014

fyi...this is a cached attribute so only hit once, but it is valid

the infer_dtype routines does do this optimization however

@jreback jreback added this to the 0.14.0 milestone Feb 13, 2014
@immerrr
Copy link
Contributor Author

immerrr commented Feb 13, 2014

fyi...this is a cached attribute so only hit once, but it is valid

I know that. Unfortunately, the cache doesn't help in case of slicing: the result is always a new object (unless the slice was itself cached), the cache is not calculated yet and the value is required during construction unconditionally (fastpath=True doesn't help).

the infer_dtype routines does do this optimization however

I'm not sure what you mean. lib.infer_dtype needs to know the exact type and thus is forced to scan all the elements. is_all_dates is only interested if all of the series elements are datetimes and the first element which doesn't fit is enough to have the answer.

@jreback
Copy link
Contributor

jreback commented Feb 13, 2014

look at the impl of infer_dtype it fails fast. (as does is_datetime_array)

@immerrr
Copy link
Contributor Author

immerrr commented Feb 13, 2014

Hmmm... can it be that you confuse these two cases:

In [50]: arr_s = np.array(map(str, range(10000))) In [51]: arr_o = np.array(map(str, range(10000)), dtype=object) In [52]: timeit pd.lib.infer_dtype(arr_s) 1000000 loops, best of 3: 351 ns per loop In [53]: timeit pd.lib.infer_dtype(arr_o) 10000 loops, best of 3: 42 µs per loop In [54]: pd.lib.infer_dtype(arr_s) Out[54]: 'string' In [55]: pd.lib.infer_dtype(arr_o) Out[55]: 'string'

Because, I cannot see how infer_dtype can return 'string' for arr_o without scanning the whole array at line 124 and returning the value at line 125:

 elif PyString_Check(val): if is_string_array(values): return 'string'
@jreback
Copy link
Contributor

jreback commented Feb 13, 2014

no that is correct. The is_string_array like all of the array functions fails fast. Of course they DO scan the entire array; but some cases are very optimized, e.g. if its already dtype = datetime64[ns] then it doesn't even need to do anything.

Pls run a perf-check with the change and see if anything changes. and post the results

test_perf.sh

@immerrr
Copy link
Contributor Author

immerrr commented Feb 13, 2014

Here's the result. Is there a benchmark for slice indexing?

@jreback
Copy link
Contributor

jreback commented Feb 13, 2014

look at vb_suite/indexing.py

so you prob need to create a benchmark which tests this change....nothing is obviously improved (but its a worthwhile change though)

you can use the example from above that you did

@immerrr
Copy link
Contributor Author

immerrr commented Feb 14, 2014

Done and updated the gist.

@jreback
Copy link
Contributor

jreback commented Feb 14, 2014

nice!

ok just need an entry in release notes (ref this pr as their is no associated issue) and good 2 go

The patch avoids scanning the whole slice in Index.is_all_dates check.
@immerrr
Copy link
Contributor Author

immerrr commented Feb 14, 2014

Done.

@jreback jreback merged commit e3f666f into pandas-dev:master Feb 14, 2014
@jreback
Copy link
Contributor

jreback commented Feb 14, 2014

@immerrr thanks.....

sorry about being pendantic....but expect a lot of contributions from you!

@immerrr immerrr deleted the fix-generic-is-all-dates-performance branch February 14, 2014 12:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Performance Memory or execution speed performance

2 participants