BUG: Fix nanvar for large float16 arrays #40738

trentondelahaye · 2021-04-01T19:15:53Z

Summary of bug:

Above a certain length, calling .std() and .var() on a Series or DataFrame gives the wrong result for float types. This is easily shown with float16 in the following example:

>>> import pandas as pd
>>> import numpy as np
>>> zeros = 32760
>>> ones = 32759
>>> pd.Series([0] * zeros + [1] * ones).astype("float16").var()
0.25
>>> zeros = 32760
>>> ones = 32760
>>> pd.Series([0] * zeros + [1] * ones).astype("float16").var()
0.0

What is happening here is that both count and d overflow to inf, as 65519 is the largest integer that float16 can still hold as a finite value (it rounds to 65504, the float16 maximum). Example:

>>> import numpy as np
>>> np.float16(65519)
65500.0
>>> np.float16(65520)
inf
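The threshold can also be checked against the float16 limits that numpy reports. The snippet below is a minimal sketch: `to_float16` is a hypothetical helper introduced only for illustration, and the cast goes through float64 so that the overflow shows up as inf rather than as a warning.

```python
import numpy as np

# Largest finite float16 value; anything that rounds above it becomes inf.
max_finite = float(np.finfo(np.float16).max)


def to_float16(n):
    """Hypothetical helper: cast an integer count to float16."""
    with np.errstate(over="ignore"):  # silence the overflow warning on the cast
        return np.float64(n).astype(np.float16)


for n in (65504, 65519, 65520):
    # 65519 still rounds down to the float16 maximum, 65520 rounds up to inf
    print(n, float(to_float16(n)))
```

The printed repr differs from the session above (numpy shortens float16 scalars when displaying them), but the finite/inf boundary is the same.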

This makes the variance go to 0.0. Since the avg calculation on line 927 of this file casts to float64 anyway, it seems like there is no downside to not casting count and d. It will also improve accuracy in cases where the overflow does not happen.
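To make the mechanism concrete, here is a simplified sketch of a variance computation in which only the dtype of count and d changes. It is not the pandas implementation: `var_with_count_dtype` and its parameters are hypothetical names, and the structure (count, d, avg, sum of squares) only mirrors the description above.

```python
import numpy as np


def var_with_count_dtype(values, count_dtype, ddof=1):
    """Sample variance where count and d are stored as `count_dtype`.

    Hypothetical helper for illustration only, not the pandas code.
    """
    n = values.shape[0]
    with np.errstate(over="ignore"):
        count = np.float64(n).astype(count_dtype)  # inf for float16 once n >= 65520
        d = count - count_dtype(ddof)              # inf - 1 is still inf
    data = values.astype(np.float64)
    avg = data.sum() / np.float64(count)           # a finite sum divided by inf is 0.0
    sqr = ((data - avg) ** 2).sum()
    return float(sqr / np.float64(d))              # dividing by inf again gives 0.0


values = np.array([0] * 32760 + [1] * 32760, dtype=np.float16)
print(var_with_count_dtype(values, np.float16))  # count and d overflow
print(var_with_count_dtype(values, np.float64))  # count and d stay finite
```

With float16 counts the result collapses to 0.0, while float64 counts give a value close to 0.25 for the same data.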

INSTALLED VERSIONS

commit : f2c8480
python : 3.8.2.final.0
python-bits : 64
OS : Darwin
OS-release : 20.3.0
Version : Darwin Kernel Version 20.3.0: Thu Jan 21 00:07:06 PST 2021; root:xnu-7195.81.3~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.2.3
numpy : 1.20.1

closes #xxxx
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry


trentondelahaye · 2021-04-01T20:04:01Z

It seems the unit test failures are related to matplotlib, not this change, so I do not think this logic is tested.

If someone agrees this is a useful fix, I am happy to make an issue, add tests and make a whatsnew entry.

jreback

we have almost 0 float16 support, what exactly are you trying to fix?

trentondelahaye · 2021-04-01T20:24:44Z

> we have almost 0 float16 support, what exactly are you trying to fix?

@jreback I do agree with you that float16 is very niche, but since this isn't a huge change and is something that is supported in numpy (see below), I think it's worth changing.

The problem is that once a DataFrame/Series has more than 65519 entries, the entry count is cast to float16. That count cannot be represented, so it becomes inf, and since the result is divided by it, the standard deviation and variance come out as 0.0, which is incorrect and a bug. Note that this does not occur using numpy, and it is definitely not intended or expected by users:

>>> import numpy as np
>>> zeros = 32760
>>> ones = 32760
>>> np.array([0] * zeros + [1] * ones).astype("float16").var()
0.2498

It is also not an issue with astype in pandas; I've isolated it to this line of code.

trentondelahaye · 2021-04-01T20:27:41Z

To clarify the bug in one line:

Any Series or DataFrame with type float16 that has more than 65519 entries will unexpectedly return 0.0 variance and standard deviation, no matter their values and not in accordance with numpy.

jreback · 2021-04-01T20:33:38Z

pandas/core/nanops.py

     if mask is not None:
         values[mask] = np.nan

-    if is_float_dtype(values.dtype):


if it doesn't break anything else would take the change, pls add a test and a whatsnew note
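A regression test for this could take roughly the following shape. This is only a sketch under assumptions: `make_half_zero_half_one` and `assert_var_matches_float64` are hypothetical helpers, not part of the pandas test suite, and the reference value is simply the variance computed in float64 with numpy.

```python
import numpy as np


def make_half_zero_half_one(n_each=32760):
    """Hypothetical fixture: float16 data long enough to trigger the overflow."""
    return np.array([0] * n_each + [1] * n_each, dtype=np.float16)


def assert_var_matches_float64(var_func, values, rtol=1e-2):
    """Check that `var_func(values)` agrees with a float64 reference variance."""
    expected = values.astype(np.float64).var(ddof=1)
    result = float(var_func(values))
    assert np.isclose(result, expected, rtol=rtol), (result, expected)


values = make_half_zero_half_one()
# Reference implementation passes; a function returning 0.0 would fail the check.
assert_var_matches_float64(lambda v: v.astype(np.float64).var(ddof=1), values)
```

In a pandas test, the callable would presumably be something like `lambda v: pd.Series(v).var()`; that usage is an assumption here and is left out so the sketch depends on numpy alone.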

github-actions · 2021-05-02T00:28:14Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jreback · 2021-05-02T23:48:38Z

closing this as stale

jreback · 2021-05-02T23:48:50Z

ping if you would like to address and can reopen

jreback requested changes Apr 1, 2021

jreback added the Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) and Dtype Conversions (unexpected or buggy dtype conversions) labels Apr 1, 2021

jreback reviewed Apr 1, 2021

github-actions bot added the Stale label May 2, 2021

jreback closed this May 2, 2021

tushushu mentioned this pull request Oct 8, 2021

BUG: Fix the Float type overflow issue for statistical functions. #43929

Open
