-
- Notifications
You must be signed in to change notification settings - Fork 19.4k
Convert DataFrame.rename to keyword only; simplify axis validation #29140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| [("copy", True), ("inplace", False), ("level", None), ("errors", "ignore")], | ||
| ) | ||
| def rename(self, *args, **kwargs): | ||
| def rename( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change here was to be explicit about what is accepted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to being explicit. The only reason we had *args, **kwargs in the first place was to support the ambiguous case.
pandas/core/generic.py Outdated
| raise TypeError("Cannot specify both 'mapper' and any of 'index' or 'columns'") | ||
| else: | ||
| # use the mapper argument | ||
| if axis in {1, "columns"}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We accept 3 arguments, mapper, index and columns. mapper can conflict with index or columns if either is provided, which the checks before this take care of. At this point if there aren't any conflicts, I think easiest to just combine the object
Note that this is arguably a little too strict, as this is no longer valid:
df.rename({"key": "val"}, axis=1, index={"key2": "val2"})But I don't think there is a lot to gain in supporting that given the variety of other options available here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this is arguably a little too strict, as this is no longer valid:
I don't recall if that case was originally discussed, but I think it's OK to break it.
| for axis in range(self._AXIS_LEN): | ||
| v = axes.get(self._AXIS_NAMES[axis]) | ||
| if v is None: | ||
| for axis_no, replacements in enumerate((index, columns)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ultimately I am trying to get rid of all of the internal _AXIS_* functions and mappings in core.generic, which seem like relics of the panel days
If we buy into just having index and columns being the potential axes in that order things can be greatly simplified
| ) | ||
| | ||
| def rename(self, index=None, **kwargs): | ||
| def rename( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Series method doesn't completely align with the DataFrame method, but I think this is an easy way to go about it while maintain backwards compat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the value in accepting axis here, if we don't also have mapper and columns?
I don't see an easy way to accept mapper. That's essentially fulfilled by index.
So maybe my question is: should we also accept columns=? (Not arguing for this btw. I don't know what's best here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we validate axis at all, or just ignore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here axis gets ignored. Looks like this was implemented as part of #18589
So maybe my question is: should we also accept
columns=? (Not arguing for this btw. I don't know what's best here).
Yea I'm not sure myself. I was thinking ideally it would have been great to align the signatures better but for where we are that would cause some API churn. So I guess best to leave unless there's a ringing need to do something here (?)
| result = result.reindex(columns=df.columns) | ||
| tm.assert_frame_equal(result, df) | ||
| | ||
| def test_ambiguous_warns(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These use cases are now covered by the test_rename_positional_raises test case above, so deleted as duplicative
| s.rename({}, axis="index") | ||
| with pytest.raises(ValueError, match="No axis named 5"): | ||
| s.rename({}, axis=5) | ||
| # TODO: clean up shared index validation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented this out for now as I didn't think it was hugely critical and seemed somewhat orthoganal to the test, but can add back in if there is a huge objection.
Ideally though we would have a centralized place for index validation that doesn't mutate the arguments of the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine to move this elsewhere. I do have a slight preference for not completely ignoring arguments that are there just for compatibility, but I don't know what our general rule is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea let me take another look I think should be able to keep this. Both self._get_axis_number and self._get_axis_name have some kind of validation in generic.
I don't know if there is one consistent way we do this right now, but want to march towards that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the preference should be that the axis parameter should be validated if specified on a Series for consistency with most Series methods that accept axis (e.g. fillna) and all DataFrame methods that accept axis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both
self._get_axis_numberandself._get_axis_namehave some kind of validation in generic.
normally you can just pass axis to super and it'll get validated? in this case there is also the call to self._set_name (cause of #28920). so probably will need the call to self._get_axis_number.
| 2 2 5 | ||
| 4 3 6 | ||
| """ | ||
| axes = validate_axis_style_args(self, args, kwargs, "mapper", "rename") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note - there is only one other use now of this validate_axis_style_args function (see DataFrame.reindex). The intent here is good, but I think the fact that it mutates the args and kwargs being passed it makes it very difficult to reason about. As a follow up to this would like to replace entirely with something simple that just operates on the axis argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything blocking a similar PR for .reindex?
As a follow up to this would like to replace entirely with something simple that just operates on the axis argument
Is this referring to DataFrame.reindex, or validate_axis_style_args? Ideally we can completely remove validate_axis_style_args.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't look before because I was trying to keep this focused, but it might not be too much extra effort. I'll take a look and if it gets out of hand let you know
pandas/core/frame.py Outdated
| Union[Mapping[Hashable, Hashable], Callable[[Hashable], Hashable]] | ||
| ] = None, | ||
| columns: Optional[ | ||
| Union[Mapping[Hashable, Hashable], Callable[[Hashable], Hashable]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it worth making a name for this in _typing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was thinking the same thing. Any suggestions? LabelReplacement?
| [("copy", True), ("inplace", False), ("level", None), ("errors", "ignore")], | ||
| ) | ||
| def rename(self, *args, **kwargs): | ||
| def rename( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to being explicit. The only reason we had *args, **kwargs in the first place was to support the ambiguous case.
| 2 2 5 | ||
| 4 3 6 | ||
| """ | ||
| axes = validate_axis_style_args(self, args, kwargs, "mapper", "rename") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything blocking a similar PR for .reindex?
As a follow up to this would like to replace entirely with something simple that just operates on the axis argument
Is this referring to DataFrame.reindex, or validate_axis_style_args? Ideally we can completely remove validate_axis_style_args.
pandas/core/generic.py Outdated
| raise TypeError("Cannot specify both 'mapper' and any of 'index' or 'columns'") | ||
| else: | ||
| # use the mapper argument | ||
| if axis in {1, "columns"}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this is arguably a little too strict, as this is no longer valid:
I don't recall if that case was originally discussed, but I think it's OK to break it.
| ) | ||
| | ||
| def rename(self, index=None, **kwargs): | ||
| def rename( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the value in accepting axis here, if we don't also have mapper and columns?
I don't see an easy way to accept mapper. That's essentially fulfilled by index.
So maybe my question is: should we also accept columns=? (Not arguing for this btw. I don't know what's best here).
| ) | ||
| | ||
| def rename(self, index=None, **kwargs): | ||
| def rename( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we validate axis at all, or just ignore?
| # https://github.com/pandas-dev/pandas/issues/12392 | ||
| df = DataFrame({"a": [1, 2], "b": [1, 2]}, index=["X", "Y"]) | ||
| result = df.rename(str.lower, columns=str.upper) | ||
| result = df.rename(index=str.lower, columns=str.upper) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So did this break? If so, is it solely because of the keyword only change to index and columns?
If so, I might prefer a proper deprecation warning for this change... But could be convinced otherwise, since I'd consider the old code bad style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be related to https://github.com/pandas-dev/pandas/pull/29140/files#r337242915
This is a little tough to support without keeping the ambiguity, and it doesn't really work outside of this one niche application.
For instance, if you tried to apply str.lower to the columns and str.upper to the index you would get the following error:
>>> df.rename(str.lower, index=str.upper, axis=1) TypeError: Cannot specify both 'axis' and any of 'index' or 'columns'.Even explicitly stating the axis in this case would raise an error:
>>> df.rename(str.lower, columns=str.upper, axis=0) TypeError: Cannot specify both 'axis' and any of 'index' or 'columns'.So we could still support this but it would take some gymnastics and not sure it makes things clearer, though open to the thoughts of others.
Maybe can be clarified further in the whatsnew to mitigate confusion?
| s.rename({}, axis="index") | ||
| with pytest.raises(ValueError, match="No axis named 5"): | ||
| s.rename({}, axis=5) | ||
| # TODO: clean up shared index validation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine to move this elsewhere. I do have a slight preference for not completely ignoring arguments that are there just for compatibility, but I don't know what our general rule is.
| Closing to clear queue; will pick back up when I get a little more free time |
| Re-open? This looks like a nice bugfix |
| Yea sure I'll try to take another look |
| | ||
| mapper = {self._info_axis_name: f} | ||
| return self.rename(**mapper) | ||
| return self.rename(**mapper) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't seem to get dict unpacking to place nicely with mypy. @simonjayhawkins
jreback left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. @TomAugspurger
TomAugspurger left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glancing through, https://github.com/pandas-dev/pandas/pull/29140/files#r337263499 is I think my main outstanding concern.
df.reanme(str.lower, columns=str.upper)will I think be broken, which is unfortunate. I think I'm OK with it if others are though, since we'd recommend .rename(index=str.lower, columns=str.upper) anyway.
| @WillAyd conflict and some comments |
| @TomAugspurger yea that is the one case where this would break things, but the flip side is that also invites ambiguous uses like |
| Mmm I'd forgotten that case. So +1 to whatever the group thinks here :) |
| Planning to merge this in an hour, just before tagging the RC. |
| Thanks @WillAyd! |
black pandasgit diff upstream/master -u -- "*.py" | flake8 --diffI think we can greatly simplify core if we update our assumptions around the axes of dimensions we deal with. I came to this change trying to get rid of
NDFrame._construct_axes_from_argumentswhich appears to do some argument mutation that isn't very clear. This was one of the few methods that called that, while also callingvalidate_axis_style_argswhich is guilty of the same kind of mutationBy being more explicit about the signature and converting this to keyword only arguments (which a FutureWarning detailed anyway) we can simplify the code and I think make more readable
cc @TomAugspurger