Skip to content

Conversation

@realead
Copy link
Contributor

@realead realead commented Aug 12, 2018

it is more or less the clean-up after PR #21904 and PR #22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.

As a collateral, the mangling of NaNs and NaT is fixed.

@realead
Copy link
Contributor Author

realead commented Aug 12, 2018

Performance tests:

asv continuous -f 1.01 upstream/master HEAD -b ^series_methods -b ^algorithms -b ^categorial

· Creating environments · Discovering benchmarks ·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt ·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.. · Running 66 total benchmarks (2 commits * 1 environments * 33 benchmarks) [ 0.00%] · For pandas commit hash e14fd992: [ 0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt... [ 0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt [ 0.00%] ··· Running benchmarks............................ [ 0.00%] ··· Setting up algorithms.py:95 [ 0.00%] ··· Running benchmarks......... [ 1.52%] ··· algorithms.Duplicated.time_duplicated_float 19.1±1ms;... [ 3.03%] ··· algorithms.Duplicated.time_duplicated_int 10.5±0.9ms;... [ 4.55%] ··· algorithms.Duplicated.time_duplicated_string 11.0±0.2μs;... [ 6.06%] ··· algorithms.DuplicatedUniqueIndex.time_duplicated_unique_int 31.5±0.2μs [ 7.58%] ··· algorithms.Factorize.time_factorize_float 67.7±8ms;... [ 9.09%] ··· algorithms.Factorize.time_factorize_int 23.1±4ms;... [ 10.61%] ··· algorithms.Factorize.time_factorize_string 162±20ms;... [ 12.12%] ··· algorithms.Match.time_match_string 376±4μs [ 13.64%] ··· series_methods.Clip.time_clip 126±3μs [ 15.15%] ··· series_methods.Dir.time_dir_strings 2.06±0.03ms [ 16.67%] ··· series_methods.Dropna.time_dropna 2.85±0.08ms;... [ 18.18%] ··· series_methods.IsIn.time_isin 1.67±0.05ms;... [ 19.70%] ··· series_methods.IsInFloat64.time_isin_few_different 39.5±2ms [ 21.21%] ··· series_methods.IsInFloat64.time_isin_many_different 48.9±9ms [ 22.73%] ··· series_methods.IsInFloat64.time_isin_nan_values 38.8±0.3ms [ 24.24%] ··· series_methods.IsInForObjects.time_isin_long_series_long_values 3.62±0.1ms [ 25.76%] ··· series_methods.IsInForObjects.time_isin_long_series_long_values_floats 6.46±0.09ms [ 27.27%] ··· series_methods.IsInForObjects.time_isin_long_series_short_values 2.37±0.05ms [ 28.79%] ··· series_methods.IsInForObjects.time_isin_nans 717±40μs [ 30.30%] ··· series_methods.IsInForObjects.time_isin_short_series_long_values 1.41±0.1ms [ 31.82%] ··· series_methods.Map.time_map 1.10±0.05ms;... [ 33.33%] ··· series_methods.NSort.time_nlargest 2.89±0.1ms;... [ 34.85%] ··· series_methods.NSort.time_nsmallest 2.42±0.07ms;... [ 36.36%] ··· series_methods.SeriesConstructor.time_constructor 163±3μs;... [ 37.88%] ··· series_methods.SeriesGetattr.time_series_datetimeindex_repr 3.19±0.09μs [ 39.39%] ··· series_methods.ValueCounts.time_value_counts 2.19±0.3ms;... [ 40.91%] ··· algorithms.Hashing.time_frame 22.5±0.6ms [ 42.42%] ··· algorithms.Hashing.time_series_categorical 5.35±0.1ms [ 43.94%] ··· algorithms.Hashing.time_series_dates 3.56±0.03ms [ 45.45%] ··· algorithms.Hashing.time_series_float 3.59±0.05ms [ 46.97%] ··· algorithms.Hashing.time_series_int 3.37±0.2ms [ 48.48%] ··· algorithms.Hashing.time_series_string 14.0±0.9ms [ 50.00%] ··· algorithms.Hashing.time_series_timedeltas 3.42±0.06ms [ 50.00%] · For pandas commit hash 03707400: [ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt... [ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt [ 50.00%] ··· Running benchmarks............................ [ 50.00%] ··· Setting up algorithms.py:95 [ 50.00%] ··· Running benchmarks......... [ 51.52%] ··· algorithms.Duplicated.time_duplicated_float 15.5±0.3ms;... [ 53.03%] ··· algorithms.Duplicated.time_duplicated_int 11.3±2ms;... [ 54.55%] ··· algorithms.Duplicated.time_duplicated_string 10.6±0.2μs;... [ 56.06%] ··· algorithms.DuplicatedUniqueIndex.time_duplicated_unique_int 31.3±0.3μs [ 57.58%] ··· algorithms.Factorize.time_factorize_float 63.8±3ms;... [ 59.09%] ··· algorithms.Factorize.time_factorize_int 22.6±3ms;... [ 60.61%] ··· algorithms.Factorize.time_factorize_string 145±6ms;... [ 62.12%] ··· algorithms.Match.time_match_string 367±3μs [ 63.64%] ··· series_methods.Clip.time_clip 120±2μs [ 65.15%] ··· series_methods.Dir.time_dir_strings 2.03±0.1ms [ 66.67%] ··· series_methods.Dropna.time_dropna 3.03±0.1ms;... [ 68.18%] ··· series_methods.IsIn.time_isin 1.68±0.01ms;... [ 69.70%] ··· series_methods.IsInFloat64.time_isin_few_different 39.0±0.5ms [ 71.21%] ··· series_methods.IsInFloat64.time_isin_many_different 43.6±7ms [ 72.73%] ··· series_methods.IsInFloat64.time_isin_nan_values 38.9±0.1ms [ 74.24%] ··· series_methods.IsInForObjects.time_isin_long_series_long_values 3.55±0.05ms [ 75.76%] ··· series_methods.IsInForObjects.time_isin_long_series_long_values_floats 6.12±0.1ms [ 77.27%] ··· series_methods.IsInForObjects.time_isin_long_series_short_values 2.25±0.03ms [ 78.79%] ··· series_methods.IsInForObjects.time_isin_nans 680±10μs [ 80.30%] ··· series_methods.IsInForObjects.time_isin_short_series_long_values 1.28±0.1ms [ 81.82%] ··· series_methods.Map.time_map 1.11±0.04ms;... [ 83.33%] ··· series_methods.NSort.time_nlargest 2.94±0.06ms;... [ 84.85%] ··· series_methods.NSort.time_nsmallest 2.45±0.09ms;... [ 86.36%] ··· series_methods.SeriesConstructor.time_constructor 169±10μs;... [ 87.88%] ··· series_methods.SeriesGetattr.time_series_datetimeindex_repr 3.53±0.2μs [ 89.39%] ··· series_methods.ValueCounts.time_value_counts 2.17±0.2ms;... [ 90.91%] ··· algorithms.Hashing.time_frame 21.4±0.4ms [ 92.42%] ··· algorithms.Hashing.time_series_categorical 5.31±0.09ms [ 93.94%] ··· algorithms.Hashing.time_series_dates 3.51±0.1ms [ 95.45%] ··· algorithms.Hashing.time_series_float 3.35±0.03ms [ 96.97%] ··· algorithms.Hashing.time_series_int 3.31±0.06ms [ 98.48%] ··· algorithms.Hashing.time_series_string 15.2±0.4ms [100.00%] ··· algorithms.Hashing.time_series_timedeltas 3.44±0.07ms 
BENCHMARKS NOT SIGNIFICANTLY CHANGED.
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u run perf to make sure this is ok

@gfyoung gfyoung added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Aug 13, 2018
@realead
Copy link
Contributor Author

realead commented Aug 15, 2018

@jreback, @jorisvandenbossche, @mroeschke, @WillAyd wanted to ask for your advice, concerning this PR.

There is currently a problem, that np.unique() mangles pd.NaT and np.nan but not None:

 if not checknull(val) or val is None: .... 

on the other hand other functions of hash_table mangle all three of them, for example map_locations:

if val != val or val is None: val = na_sentinel .... 

Now, this leads to problems as soon as these two definition of unique meet, for example #22332, where mangling/not mangling None and np.nan is a problem.

This PR would not change the situation much: None+np.nan stays a problem, but the PR also demangles np.nan and pd.NaTin pd.unique() and thus the pair np.nan+pd.NaT becomes also a problem in situation #22332.

I see the following options:

  1. leave it as it is/do nothing
  2. mangle None with others in pd.unique(). i.e. rollback the fix for Change in behaviour of unique method regarding None values #20866
  3. go on with this PR, accept that also [np.nan, pd.NaT] will be a problem similar to [None, np.nan] for pd.to_datetime(). Disabling/changing the one failing unit test (see related #22305, right now [None`, pd.NaT] is also problem, but not tested in the test-suite)
  4. Unmangle None, np.nan and pd.NaT everywhere. But this mangling is almost everywhere...

What would you suggest?

@WillAyd
Copy link
Member

WillAyd commented Aug 15, 2018

No idea how much effort this is signing up for but option 4 imo seems to be the best and cleanest

@mroeschke
Copy link
Member

Option 4 would be very nice but as you mention would take a lot of untangling. Additionally, there's already some institutional mangling of None and np.nan when it comes to converting None to np.nan https://pandas.pydata.org/pandas-docs/stable/missing_data.html#inserting-missing-data

@codecov
Copy link

codecov bot commented Aug 17, 2018

Codecov Report

Merging #22296 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@ Coverage Diff @@ ## master #22296 +/- ## ========================================== + Coverage 92.16% 92.16% +<.01%  ========================================== Files 169 169 Lines 50708 50711 +3 ========================================== + Hits 46734 46737 +3  Misses 3974 3974
Flag Coverage Δ
#multiple 90.57% <100%> (ø) ⬆️
#single 42.35% <80%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/base.py 96.44% <ø> (-0.01%) ⬇️
pandas/core/indexes/numeric.py 97.3% <100%> (+0.06%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 996f361...b6459f3. Read the comment docs.

@realead
Copy link
Contributor Author

realead commented Aug 17, 2018

@jreback After running asv for the whole suite and rerunning for flashing test cases, that is what is left:

 before after ratio [03707400] [a6eb063c] - 29.7±1ms 25.6±0.5ms 0.86 index_object.SetOperations.time_operation('strings', 'symmetric_difference') - 41.0±1ms 35.2±0.5ms 0.86 index_object.SetOperations.time_operation('strings', 'intersection') - 179±3ms 150±2ms 0.84 index_object.Indexing.time_get_loc_non_unique('String') - 32.0±0.5ms 26.4±0.5ms 0.82 index_object.SetOperations.time_operation('date_string', 'symmetric_difference') SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY. 

The patch is slightly bigger than I would like, but one has to change everything at once in order not to get inconsistencies. In the end it was not as much work as expected to unmangle NA-values for object-type:

  1. Don't mangle NA-values in the hash-map and in the Base-Index
  2. Coarse NA-values to nans in Numeric-Index (to keep the current behavior), which probably makes sense.

There are probably still some hiccups, but at least the test suite doesn't show anything.

@realead realead force-pushed the cleanup_nans_in_unique branch 2 times, most recently from 8227280 to a608644 Compare August 23, 2018 20:26
@realead
Copy link
Contributor Author

realead commented Aug 26, 2018

@jreback pinging you to ask, whether you are Ok with the result of the performance test suite, or maybe I didn't understand your request correctly?

Btw. do I have to retrigger the CI, because there are some errors but unrelated to my changes?

@realead realead closed this Sep 5, 2018
@realead realead reopened this Sep 5, 2018
@pep8speaks
Copy link

Hello @realead! Thanks for updating the PR.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls rebase as well

- Bug when indexing :class:`DatetimeIndex` with nanosecond resolution dates and timezones (:issue:`11679`)
- Bug where indexing with a Numpy array containing negative values would mutate the indexer (:issue:`21867`)
- ``Float64Index.get_loc`` now raises ``KeyError`` when boolean key passed. (:issue:`19087`)
- :class:`Index` no longer mangles None, NaN and NaT (:issue:`22332`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you be a little more explicit, use double back-ticks around None, NaN, NaT.

- Bug in :func:`DataFrame.fillna` where a ``ValueError`` would raise when one column contained a ``datetime64[ns, tz]`` dtype (:issue:`15522`)
- Bug in :func:`Series.hasnans` that could be incorrectly cached and return incorrect answers if null elements are introduced after an initial call (:issue:`19700`)
- :func:`Series.isin` now treats all nans as equal also for `np.object`-dtype. This behavior is consistent with the behavior for float64 (:issue:`22119`)
- :func:`unique` no longer manges NaN-float-values and `pd.NaT` (:issue:`22295`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double back-ticks around NaT (no pd.) required

def insert(self, loc, item):
# treat NA values as nans:
if is_scalar(item) and isna(item):
item = np.nan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use self._na_value, hmm isna not already imported?

# GH 22332
# check pairwise, that no pair of na values
# is mangled
na_values = [None, np.nan, pd.NaT]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the nulls_fixture

struct.pack("d", result[0]))[0]
assert result_nan_bits == bits_for_nan1

def test_do_not_mangle_na_values(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the fixture

e1 = np.array([1, 3, -1], dtype=np.intp)
assert_almost_equal(r1, e1)

def test_get_indexer_with_NA_values(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move near the other get_loc tests

@jreback jreback added this to the 0.24.0 milestone Sep 5, 2018
@realead realead force-pushed the cleanup_nans_in_unique branch from a608644 to 0e80f70 Compare September 5, 2018 21:04
@realead
Copy link
Contributor Author

realead commented Sep 7, 2018

@jreback done & green.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments, ping on green.

- Bug in :func:`DataFrame.fillna` where a ``ValueError`` would raise when one column contained a ``datetime64[ns, tz]`` dtype (:issue:`15522`)
- Bug in :func:`Series.hasnans` that could be incorrectly cached and return incorrect answers if null elements are introduced after an initial call (:issue:`19700`)
- :func:`Series.isin` now treats all nans as equal also for `np.object`-dtype. This behavior is consistent with the behavior for float64 (:issue:`22119`)
- :func:`unique` no longer manges NaN-float-values and ``NaT`` (:issue:`22295`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

manges -> mangles

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the same sentence as above note (the i.e. part)


@pytest.fixture(params=[None, np.nan, pd.NaT])
def unique_nulls_fixture(request):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't you use the existing nulls_fixture? (ahh or that has the float NaN which are the same as np.nan)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I could have used nulls_fixture, but then as you already have pointed out, I would have to filter out [np.nan, float('nan')] & Co. (which are not list of unique elements) at several places in tests, so I assumed adding unique_nulls_fixture would be cleaner.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this is ok (i see how you are using it below).

def test_insert_missing(self, nulls_fixture):
# GH 18295 (test missing)
expected = Index(['a', np.nan, 'b', 'c'])
# test there is no mangling of NA values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number as a comment

- Bug where indexing with a Numpy array containing negative values would mutate the indexer (:issue:`21867`)
- Bug where mixed indexes wouldn't allow integers for ``.at`` (:issue:`19860`)
- ``Float64Index.get_loc`` now raises ``KeyError`` when boolean key passed. (:issue:`19087`)
- :class:`Index` no longer mangles ``None``, ``NaN`` and ``NaT``, i.e. they are treated as three different keys. However, for :class:`NumericIndex` all three are still coarsed to a ``NaN`` (:issue:`22332`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use "numeric Index" instead of "NumericIndex" ? (such a class is not public so not known to users)

"""
return False

def insert(self, loc, item):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a docstring. I think you can add @Appender(Index.insert.__doc__) as decorator to do it (but best check in an interactive session that it indeed works)

- Bug in :func:`DataFrame.fillna` where a ``ValueError`` would raise when one column contained a ``datetime64[ns, tz]`` dtype (:issue:`15522`)
- Bug in :func:`Series.hasnans` that could be incorrectly cached and return incorrect answers if null elements are introduced after an initial call (:issue:`19700`)
- :func:`Series.isin` now treats all nans as equal also for `np.object`-dtype. This behavior is consistent with the behavior for float64 (:issue:`22119`)
- :func:`unique` no longer manges NaN-float-values and ``NaT`` (:issue:`22295`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new addition (unique not mangling different NAs) seems in constrast with isin actually treating them equal (the line above) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The isin-fix is about isin treating different nan-float-object as the same entity, i.e.

algos.isin([float('nan')], [float('nan')]) # different nan-objects

should yield [True] (and not [False] as it is in 0.23).

The unique-fix is about treating np.nan and pd.NaT as two different entities, ie.

pd.unique([np.nan, pd.NaT])

should yield [np.nan, pd.NaT] (and not [np.nan] as in 0.23). It still treats all nan-object as the same, i.e.:

pd.unique([np.nan, float('nan'), -float('nan')] ) # three different nan-objects

results in [np.nan]

I hope it is clearer with the updated more detailed description for the unique-change.

@realead realead force-pushed the cleanup_nans_in_unique branch from 6e02bd4 to b6459f3 Compare September 8, 2018 05:14
@realead
Copy link
Contributor Author

realead commented Sep 8, 2018

@jreback @jorisvandenbossche your remarks are adopted, PTAL.

@realead realead force-pushed the cleanup_nans_in_unique branch from b6459f3 to 8760066 Compare September 12, 2018 20:13
@jreback
Copy link
Contributor

jreback commented Sep 15, 2018

can you rebase

@realead realead force-pushed the cleanup_nans_in_unique branch from 8760066 to a258997 Compare September 15, 2018 18:51
it is more or less the clean-up after PR pandas-dev#21904 and PR pandas-dev#22207, the underlying hash-map handles all cases correctly out-of-the box and thus no special handling is needed.
@realead realead force-pushed the cleanup_nans_in_unique branch from a258997 to 356b8aa Compare September 17, 2018 19:01
@realead
Copy link
Contributor Author

realead commented Sep 18, 2018

@jreback rebased & green.

@jreback jreback merged commit d39249a into pandas-dev:master Sep 18, 2018
@jreback
Copy link
Contributor

jreback commented Sep 18, 2018

thankks @realead nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

7 participants