-
- Notifications
You must be signed in to change notification settings - Fork 19.4k
Description
Code Sample:
import numpy as np import pandas as pd # construct IntervalIndex from timestamps # --> Length of index is > 100, the default leaf_size # --> index is heavily skewed towards Timstamp.max idx = pd.IntervalIndex.from_arrays(pd.date_range('2018-01-01', '2018-12-31', freq='D'), [pd.Timestamp.max]*365) # type conversion from IntervalIndex._engine left = idx._maybe_convert_i8(idx.left) right = idx._maybe_convert_i8(idx.right) # pivot point calculation from IntervalNode constructor, when n_elements > leaf_size pivot = np.median(left + right) / 2 print(pd.Timestamp(pivot)) # >>> 1848-02-11 00:06:21.572612096 # This is below the minimum left-side value, which should not be print(pd.Timestamp(np.median(idx.mid.asi8))) # >>> 2140-05-21 23:53:38.427387904 # This is the correct midpoint # NB: with my own data, accessing `mid` throws OverflowError: Overflow in int64 addition # proposed alternate logic for midpoint alternate = np.median(left/2 + right/2) print(pd.Timestamp(alternate) # >>> 2140-05-21 23:53:38.427387904 # Matches the correct midpoint, and should never over/under flowDescription
Current behavior causes a hard crash in the python process under certain conditions:
- Length of the index is > 100, the default leaf_size of the IntervalTree
- The intervals are heavily skewed towards the upper bound of the type (for signed types I believe this can also be the lower bound, but I haven't verified it)
Having an index length greater than the leaf_size parameter, causes the the IndexNode constructor to create child nodes. Currently the default leaf_size is 100 and is not exposed to the user API. Instantiating an IntervalTree directly and setting the leaf_size > index length does not lead to a crash.
By emulating the IntervalNode.classify_intervals method in the REPL, I found that the entire index was assigned to the right-hand child node, because the pivot was calculated on an array of overflowed integers, and that the constructor would never stop recursing.
As an additional note, I also encountered an OverflowError when accessing the mid attribute of the IntervalIndex with my own data, but this does not occur in the code sample above. I traced it back through the length attribute, where left bounds are subtracted from right bounds.
idx.right - idx.left # >>> Traceback (most recent call last): # >>> File "<input>", line 1, in <module> # >>> File "~\pandas\core\indexes\datetimelike.py", line 501, in __sub__ # >>> result = self._data.__sub__(maybe_unwrap_index(other)) # >>> File "~\pandas\core\arrays\datetimelike.py", line 1275, in __sub__ # >>> result = self._sub_datetime_arraylike(other) # >>> File "~\pandas\core\arrays\datetimes.py", line 724, in _sub_datetime_arraylike # >>> arr_mask=self._isnan) # >>> File "~\pandas\core\algorithms.py", line 938, in checked_add_with_arr # >>> raise OverflowError("Overflow in int64 addition") # >>> OverflowError: Overflow in int64 additionExpected Output
Changing the pivot calculation from
self.pivot = np.median(left + right) / 2
to
self.pivot = np.median(left/2 + right/2)
should protect against over and underflows.
Output of pd.show_versions()
INSTALLED VERSIONS ------------------ commit: None python: 3.5.6.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None ------------------ pandas: 0.24.1 pytest: None pip: 10.0.1 setuptools: 40.2.0 Cython: None numpy: 1.16.1 scipy: None pyarrow: None xarray: None IPython: None sphinx: None patsy: None dateutil: 2.7.3 pytz: 2018.5 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: None openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml.etree: None bs4: None html5lib: None sqlalchemy: 1.1.13 pymysql: None psycopg2: None jinja2: None s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None gcsfs: None