Closed
Description
Found when tracking down what was going on with this question about performance.
First the case that makes sense:

    >>> s = pd.Series(range(10**3), dtype=np.int32)
    >>> s.dtype
    dtype('int32')
    >>> s.dtype.type
    <type 'numpy.int32'>
    >>> s.dtype.type in pd.lib._TYPE_MAP
    True
    >>>
    >>> orig_sum_type = (s+s).dtype.type
    >>> orig_sum_type
    <type 'numpy.int32'>
    >>> orig_sum_type in pd.lib._TYPE_MAP
    True

Now let's increase the length of the series.

    >>> s = pd.Series(range(10**5), dtype=np.int32)
    >>> s.dtype
    dtype('int32')
    >>> s.dtype.type
    <type 'numpy.int32'>
    >>> s.dtype.type in pd.lib._TYPE_MAP
    True
    >>>
    >>> new_sum_type = (s+s).dtype.type
    >>> new_sum_type
    <type 'numpy.int32'>
    >>> new_sum_type in pd.lib._TYPE_MAP
    False

.. wait, what?

    >>> orig_sum_type, new_sum_type
    (<type 'numpy.int32'>, <type 'numpy.int32'>)
    >>> orig_sum_type == new_sum_type
    False
    >>> orig_sum_type is new_sum_type
    False
    >>> np.int32 is orig_sum_type
    True
    >>> np.int32 is new_sum_type
    False

We've now got a new numpy.int32 type floating around, not equal to the one in numpy. The crossover seems to be at 10k:

    >>> def find_first():
    ...     for i in range(1, 10**5):
    ...         s = pd.Series(range(i), dtype=np.int32)
    ...         if (s+s).dtype.type not in pd.lib._TYPE_MAP:
    ...             return i
    ...
    >>> find_first()
    10001

ISTM that this lack of recognition of the dtype as in _TYPE_MAP prevents the early exit from being taken in infer_dtype upon recognition that it's an integer dtype, and that slows things down considerably.
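
To make the suspected mechanism concrete, here is a minimal sketch in plain Python, with no numpy or pandas involved. Everything in it is a hypothetical stand-in: `make_scalar_type`, `type_map`, and `infer_kind` are names invented for illustration and are not the real `numpy.int32`, `pd.lib._TYPE_MAP`, or `infer_dtype`. It only shows that a dict keyed by type objects misses a second type object with the same name, so a fast path guarded by that lookup is skipped.

```python
# Hypothetical stand-ins: these are NOT numpy's scalar types or pandas' _TYPE_MAP.
def make_scalar_type(name):
    """Build a fresh class object; every call returns a distinct type."""
    return type(name, (int,), {})


canonical_int32 = make_scalar_type("int32")
duplicate_int32 = make_scalar_type("int32")  # same name, different object

# A lookup table keyed by type object, in the spirit of a _TYPE_MAP.
type_map = {canonical_int32: "integer"}


def infer_kind(scalar_type):
    """Return the mapped kind via the fast path, else report the fallback."""
    if scalar_type in type_map:  # early exit, taken only for the exact key object
        return type_map[scalar_type]
    return "slow path"  # stands in for a per-element inspection


print(canonical_int32.__name__ == duplicate_int32.__name__)  # True
print(canonical_int32 == duplicate_int32)                    # False
print(infer_kind(canonical_int32))                           # integer
print(infer_kind(duplicate_int32))                           # slow path
```

Type objects hash and compare by identity, so the lookup has no way to tell that the duplicate is "really" the same integer type. If that is what happens here, a check that does not depend on the identity of the type object would presumably sidestep the problem, though I haven't verified that against the actual infer_dtype code.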