Skip to content

BUG: Infinite recursion on IntervalIndex constructor due to integer overflow #25485

@kingsykes

Description

@kingsykes

Code Sample:

import numpy as np import pandas as pd # construct IntervalIndex from timestamps # --> Length of index is > 100, the default leaf_size # --> index is heavily skewed towards Timstamp.max idx = pd.IntervalIndex.from_arrays(pd.date_range('2018-01-01', '2018-12-31', freq='D'), [pd.Timestamp.max]*365) # type conversion from IntervalIndex._engine left = idx._maybe_convert_i8(idx.left) right = idx._maybe_convert_i8(idx.right) # pivot point calculation from IntervalNode constructor, when n_elements > leaf_size pivot = np.median(left + right) / 2 print(pd.Timestamp(pivot)) # >>> 1848-02-11 00:06:21.572612096 # This is below the minimum left-side value, which should not be print(pd.Timestamp(np.median(idx.mid.asi8))) # >>> 2140-05-21 23:53:38.427387904 # This is the correct midpoint # NB: with my own data, accessing `mid` throws OverflowError: Overflow in int64 addition # proposed alternate logic for midpoint alternate = np.median(left/2 + right/2) print(pd.Timestamp(alternate) # >>> 2140-05-21 23:53:38.427387904 # Matches the correct midpoint, and should never over/under flow

Description

Current behavior causes a hard crash in the python process under certain conditions:

  • Length of the index is > 100, the default leaf_size of the IntervalTree
  • The intervals are heavily skewed towards the upper bound of the type (for signed types I believe this can also be the lower bound, but I haven't verified it)

Having an index length greater than the leaf_size parameter, causes the the IndexNode constructor to create child nodes. Currently the default leaf_size is 100 and is not exposed to the user API. Instantiating an IntervalTree directly and setting the leaf_size > index length does not lead to a crash.

By emulating the IntervalNode.classify_intervals method in the REPL, I found that the entire index was assigned to the right-hand child node, because the pivot was calculated on an array of overflowed integers, and that the constructor would never stop recursing.

As an additional note, I also encountered an OverflowError when accessing the mid attribute of the IntervalIndex with my own data, but this does not occur in the code sample above. I traced it back through the length attribute, where left bounds are subtracted from right bounds.

idx.right - idx.left # >>> Traceback (most recent call last): # >>> File "<input>", line 1, in <module> # >>> File "~\pandas\core\indexes\datetimelike.py", line 501, in __sub__ # >>> result = self._data.__sub__(maybe_unwrap_index(other)) # >>> File "~\pandas\core\arrays\datetimelike.py", line 1275, in __sub__ # >>> result = self._sub_datetime_arraylike(other) # >>> File "~\pandas\core\arrays\datetimes.py", line 724, in _sub_datetime_arraylike # >>> arr_mask=self._isnan) # >>> File "~\pandas\core\algorithms.py", line 938, in checked_add_with_arr # >>> raise OverflowError("Overflow in int64 addition") # >>> OverflowError: Overflow in int64 addition

Expected Output

Changing the pivot calculation from
self.pivot = np.median(left + right) / 2
to
self.pivot = np.median(left/2 + right/2)
should protect against over and underflows.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.6.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None ------------------ pandas: 0.24.1 pytest: None pip: 10.0.1 setuptools: 40.2.0 Cython: None numpy: 1.16.1 scipy: None pyarrow: None xarray: None IPython: None sphinx: None patsy: None dateutil: 2.7.3 pytz: 2018.5 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: None openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml.etree: None bs4: None html5lib: None sqlalchemy: 1.1.13 pymysql: None psycopg2: None jinja2: None s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None gcsfs: None 

Metadata

Metadata

Assignees

No one assigned

    Labels

    IntervalInterval data typeNumeric OperationsArithmetic, Comparison, and Logical operations

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions