@trentondelahaye trentondelahaye commented Apr 1, 2021

Summary of bug:

Above a certain length, calling .std() and .var() on a Series or DataFrame returns an incorrect result for float types. This is easily shown with float16 in the following example:

>>> import pandas as pd
>>> import numpy as np
>>> zeros = 32760
>>> ones = 32759
>>> pd.Series([0] * zeros + [1] * ones).astype("float16").var()
0.25
>>> zeros = 32760
>>> ones = 32760
>>> pd.Series([0] * zeros + [1] * ones).astype("float16").var()
0.0

What is happening here is that both count and d overflow to inf, as 65519 is the largest integer that still converts to a finite float16 (it rounds to 65504, the largest finite float16 value). Example:

>>> import numpy as np
>>> np.float16(65519)
65500.0
>>> np.float16(65520)
inf

This makes the variance go to 0.0. Since the avg calculation on line 927 of this file casts to float64 anyway, it seems like there is no downside to not casting count and d. It will also improve accuracy in cases where the overflow does not happen.
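
The float16 limit described above can be checked directly. The snippet below is a minimal sketch using only numpy; the variable names are illustrative and not taken from the pandas source.

```python
import numpy as np

# float16 has a 10-bit fraction, so its largest finite value is 65504.
largest_finite = float(np.finfo(np.float16).max)

# Integer counts just below and at the overflow threshold.
counts = np.array([65519, 65520], dtype=np.int64)

# Casting rounds to the nearest float16; errstate silences the overflow
# warning that newer numpy versions emit for the second value.
with np.errstate(over="ignore"):
    as_half = counts.astype(np.float16)

print(largest_finite)        # 65504.0
print(as_half[0] == 65504)   # True: 65519 rounds down to the maximum
print(np.isinf(as_half[1]))  # True: 65520 overflows to inf
```

The 65500.0 shown in the example above is the shortened display of 65504.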

INSTALLED VERSIONS

commit : f2c8480
python : 3.8.2.final.0
python-bits : 64
OS : Darwin
OS-release : 20.3.0
Version : Darwin Kernel Version 20.3.0: Thu Jan 21 00:07:06 PST 2021; root:xnu-7195.81.3~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.2.3
numpy : 1.20.1

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry
@trentondelahaye
Author

Seems like the unit test failures are related to matplotlib, not this change, so I do not think this logic is tested.

If someone agrees this is a useful fix, I am happy to make an issue, add tests and make a whatsnew entry.

Contributor

@jreback jreback left a comment

we have almost 0 float16 support, what exactly are you trying to fix?

@trentondelahaye
Author

> we have almost 0 float16 support, what exactly are you trying to fix?

@jreback I do agree with you that float16 is very niche, but since this isn't a huge change and is something that is supported in numpy (see below), I think it's worth changing.

The problem is that once a DataFrame/Series has more than 65519 entries, the entry count is cast to float16. The count cannot be represented, so it becomes inf, and because the sums are divided by it, the standard deviation and variance come out as 0.0, which is incorrect and a bug. Note that this does not occur with numpy, and it is definitely not what users intend or expect:

>>> import numpy as np
>>> zeros = 32760
>>> ones = 32760
>>> np.array([0] * zeros + [1] * ones).astype("float16").var()
0.2498

It is also not an issue with astype in Pandas, I've isolated it to this line of code.
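
To make the mechanism concrete, here is a simplified sketch of a two-pass variance in which the entry count is cast to a chosen dtype. The function `var_with_count_dtype` is hypothetical and only imitates the shape of the calculation discussed here; it is not the pandas implementation.

```python
import numpy as np


def var_with_count_dtype(values, count_dtype, ddof=1):
    """Hypothetical stand-in: variance with the count held in count_dtype."""
    with np.errstate(over="ignore"):
        # For float16, any count above 65519 becomes inf at this cast.
        count = np.array(values.size).astype(count_dtype)
        d = count - np.array(ddof).astype(count_dtype)
    # The sums are accumulated in float64, as the description says of avg.
    avg = values.sum(dtype=np.float64) / count
    # Simplification: deviations are computed in float64 as well.
    sqr = (avg - values.astype(np.float64)) ** 2
    return float(sqr.sum(dtype=np.float64) / d)


values = np.array([0] * 32760 + [1] * 32760, dtype=np.float16)

broken = var_with_count_dtype(values, np.float16)  # count and d are inf
fixed = var_with_count_dtype(values, np.float64)   # count and d stay finite

print(broken)           # 0.0
print(round(fixed, 4))  # 0.25
```

With an infinite count the mean collapses to 0.0, and dividing the sum of squares by an infinite d then gives 0.0 whatever the data are. Keeping count and d in float64 avoids both steps.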

@trentondelahaye
Author

To clarify the bug in one line:

Any Series or DataFrame with type float16 that has more than 65519 entries will unexpectedly return 0.0 variance and standard deviation, no matter their values and not in accordance with numpy.
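
Until a fix lands, one possible workaround is to upcast before reducing. This is an assumption based on the analysis in this thread rather than something proposed in the PR, and the variable names are illustrative.

```python
import pandas as pd

n = 32760  # 2 * n = 65520 entries, one past the float16 limit
half = pd.Series([0] * n + [1] * n).astype("float16")

# Assumed workaround: upcast before reducing so the count stays finite.
variance = half.astype("float64").var()
print(round(variance, 4))  # 0.25
```

The upcast costs extra memory for the temporary float64 copy, which may matter if float16 was chosen to save space.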

@jreback jreback added the Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) and Dtype Conversions (Unexpected or buggy dtype conversions) labels Apr 1, 2021
if mask is not None:
    values[mask] = np.nan

if is_float_dtype(values.dtype):
Contributor

if it doesn't break anything else would take the change, pls add a test and a whatsnew note

@github-actions
Contributor

github-actions bot commented May 2, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 2, 2021
@jreback
Contributor

jreback commented May 2, 2021

closing this as stale

@jreback jreback closed this May 2, 2021
@jreback
Contributor

jreback commented May 2, 2021

ping if you would like to address and can reopen
