-
- Notifications
You must be signed in to change notification settings - Fork 19.4k
Implement some reductions for string Series #31757
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 29 commits
c87ce47 5365fa1 460030f 6497e16 c052feb 4312b4f b67cc56 214995f 95c83fb 4082ae3 12a2323 4a61b81 c6465e5 5c1a5fb ca99650 18800e1 3eddb73 0c1a2a3 5714c06 5e01c3f 0e4a4e2 ed01138 1128950 7c0f05d d84f5bd 7db2052 e19bbe1 df99b4f ba20705 d69b587 File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | @@ -228,7 +228,9 @@ def _maybe_get_mask( | |
| # Boolean data cannot contain nulls, so signal via mask being None | ||
| return None | ||
| | ||
| if skipna: | ||
| if skipna or is_object_dtype(values.dtype): | ||
| # The masking in nanminmax does not work for object dtype, so always | ||
| Contributor There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. rather than do this, what exactly s the issue? 'does not work' is not very descriptive and generally we don't put comments like this, we simply fix it Contributor Author There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So what nanops.nanminmax appears to do when taking the min or max in the presence of missing values is to fill them with an appropriate infinite number that has the effect of ignoring those missing values (if we're taking the minimum replace with positive infinity, if we're taking the max replace with negative infinity). The problem is that this makes no sense for strings (there is as far as I know no "infinite string"), and that's why we get the error about comparing floats (infinity) and strings. The easiest workaround seems to be to mask them out instead. To make things more complicated the Contributor There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. len let's actualy fix this properly. this is going to need either a branch for object dtypes and perf tests. Member There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Similarly as #30982, I would personally rather have a separate implementation for one using masks instead of the filling logic of the nanops (with sharing helper functions where appropriate), instead of trying to fit it into the existing nanops code (which gives those special cases / complex if checks like the one below) Contributor Author There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jorisvandenbossche Fair point for the string dtype, although I think some kind of logic like this would be necessary to fix Edit: Actually looks like maybe you're already addressing this in your PR. Member There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yep, indeed, that's what you mentioned as argument before as well for doing it in nanops, so the non-extension array object dtype would benefit as well. For this, we could also add a check for object dtype, and then calculate the mask and use the masked op implementation. Contributor Author There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you think it might also be worth trying to refactor nanminmax in nanops to use masking in general instead of the current filling approach (from what I could tell this was only really needed for arrays with more than one dimension)? Contributor There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes | ||
| # compute a mask for use later when skipna is False | ||
| mask = isna(values) | ||
| | ||
| return mask | ||
| | @@ -865,6 +867,17 @@ def reduction( | |
| result.fill(np.nan) | ||
| except (AttributeError, TypeError, ValueError): | ||
| result = np.nan | ||
| elif ( | ||
| is_object_dtype(dtype) | ||
| Contributor There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why do you need all of these condtions? this is complicated Contributor Author There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Only looking at objects is for the reason above, I can try to be a bit more explicit in the comments if that would be helpful. | ||
| and values.ndim == 1 | ||
| and mask is not None | ||
| and mask.any() | ||
| ): | ||
| # Need to explicitly mask NA values for object dtypes | ||
| Contributor There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why? | ||
| if skipna: | ||
| result = getattr(values[~mask], meth)(axis) | ||
| else: | ||
| result = np.nan | ||
| else: | ||
| result = getattr(values, meth)(axis) | ||
| | ||
| | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | @@ -77,7 +77,20 @@ class TestMissing(base.BaseMissingTests): | |
| | ||
| | ||
| class TestNoReduce(base.BaseNoReduceTests): | ||
| pass | ||
| @pytest.mark.parametrize("skipna", [True, False]) | ||
| def test_reduce_series_numeric(self, data, all_numeric_reductions, skipna): | ||
| if isinstance(data, pd.arrays.StringArray) and all_numeric_reductions in [ | ||
| "min", | ||
| "max", | ||
| ]: | ||
| # Tested in pandas/tests/arrays/string_/test_string.py | ||
| pytest.skip("These reductions are implemented") | ||
| Member There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add here a comment saying that those are tested in tests/arrays/test_string.py ? | ||
| | ||
| op_name = all_numeric_reductions | ||
| s = pd.Series(data) | ||
| | ||
| with pytest.raises(TypeError): | ||
| getattr(s, op_name)(skipna=skipna) | ||
| | ||
| | ||
| class TestMethods(base.BaseMethodsTests): | ||
| | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | @@ -1406,8 +1406,8 @@ def test_apply_datetime_tz_issue(self): | |
| @pytest.mark.parametrize("method", ["min", "max", "sum"]) | ||
| def test_consistency_of_aggregates_of_columns_with_missing_values(self, df, method): | ||
| # GH 16832 | ||
| none_in_first_column_result = getattr(df[["A", "B"]], method)() | ||
| none_in_second_column_result = getattr(df[["B", "A"]], method)() | ||
| none_in_first_column_result = getattr(df[["A", "B"]], method)().sort_index() | ||
| none_in_second_column_result = getattr(df[["B", "A"]], method)().sort_index() | ||
| Contributor Author There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Previously the column with the missing value was getting dropped from the result so it only had a single row and the order didn't matter jorisvandenbossche marked this conversation as resolved. Show resolved Hide resolved | ||
| | ||
| tm.assert_series_equal( | ||
| none_in_first_column_result, none_in_second_column_result | ||
| | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is likely too invasive for 1.02, move to 1.1