-
- Notifications
You must be signed in to change notification settings - Fork 19.4k
Refactor string methods for StringArray + return IntegerArray for numeric results #29640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 7 commits
14ca804 07fc53d ddef378 a147081 36b1bf7 d01ddfe 22a2c79 686de87 0ea2d6b 358ce59 7b25d8d 46dfc20 1b96e53 3b20ffc e4bae5e File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | @@ -63,7 +63,7 @@ Previously, strings were typically stored in object-dtype NumPy arrays. | |
| ``StringDtype`` is currently considered experimental. The implementation | ||
| and parts of the API may change without warning. | ||
| | ||
| The text extension type solves several issues with object-dtype NumPy arrays: | ||
| The ``'string'`` extension type solves several issues with object-dtype NumPy arrays: | ||
| Contributor There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. does this need any updating an/or reference to your new section that you added in text.rst? Member There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @TomAugspurger you can probably update the sentence "The usual string accessor methods work. Where appropriate, the return type of the Series or columns of a DataFrame will also have string dtype." 20 lines below this line, to include that it can also return IntegerDtype in certain cases. | ||
| | ||
| 1. You can accidentally store a *mixture* of strings and non-strings in an | ||
| ``object`` dtype array. A ``StringArray`` can only store strings. | ||
| | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | @@ -2,7 +2,7 @@ | |
| from functools import wraps | ||
| import re | ||
| import textwrap | ||
| from typing import Dict, List | ||
| from typing import TYPE_CHECKING, Any, Callable, Dict, List | ||
| import warnings | ||
| | ||
| import numpy as np | ||
| | @@ -15,10 +15,14 @@ | |
| ensure_object, | ||
| is_bool_dtype, | ||
| is_categorical_dtype, | ||
| is_extension_array_dtype, | ||
| is_integer, | ||
| is_integer_dtype, | ||
| is_list_like, | ||
| is_object_dtype, | ||
| is_re, | ||
| is_scalar, | ||
| is_string_dtype, | ||
| ) | ||
| from pandas.core.dtypes.generic import ( | ||
| ABCDataFrame, | ||
| | @@ -28,9 +32,14 @@ | |
| ) | ||
| from pandas.core.dtypes.missing import isna | ||
| | ||
| from pandas._typing import ArrayLike, Dtype | ||
| from pandas.core.algorithms import take_1d | ||
| from pandas.core.base import NoNewAttributesMixin | ||
| import pandas.core.common as com | ||
| from pandas.core.construction import extract_array | ||
| | ||
| if TYPE_CHECKING: | ||
| from pandas.arrays import StringArray | ||
| | ||
| _cpython_optimized_encoders = ( | ||
| "utf-8", | ||
| | @@ -109,9 +118,51 @@ def cat_safe(list_of_columns: List, sep: str): | |
| | ||
| def _na_map(f, arr, na_result=np.nan, dtype=object): | ||
| # should really _check_ for NA | ||
| if is_extension_array_dtype(arr.dtype): | ||
| Contributor There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you sync the signatures up with _map, also rename _map -> _map_arraylike (or similar) Contributor Author There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, don't follow this. I think they do match, aside from @jbrockmendel's request to call it I guess Contributor There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am really -1 on 2 different branches here. If they have exactly the same signature a little less negative. again I would rename _map to be more informative here Member There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What do you mean with "-1 on 2 different branches here" ? Contributor Author There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's why my initial PR was so much more complex, since it tried to handle both cases similarly. I think that was more complex than this. As Joris says, the main point of divergence is that for StringArray we usually know the result dtype exactly. It doesn't depend on the presence of NAs. Additionally,
Contributor There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok, at the very least these signatures for what you call _map and _stringarray_map should be exactly the same. I think this is crucial for not adding technical debt. Member There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jreback is your suggestion to add Contributor Author There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
What's the technical debt we're adding here? By definition, we need to handle StringArray differently, since its result type doesn't depend on the presence of NAs. Member There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
And there is also a good reason for that, as Contributor There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you create an issue to clean this up (_map and _na_map), and/or refactor of this, post this PR. TomAugspurger marked this conversation as resolved. Show resolved Hide resolved | ||
| arr = extract_array(arr) | ||
| return _stringarray_map(f, arr, na_value=na_result, dtype=dtype) | ||
| ||
| return _map(f, arr, na_mask=True, na_value=na_result, dtype=dtype) | ||
| | ||
| | ||
| def _stringarray_map( | ||
| func: Callable[[str], Any], arr: "StringArray", na_value: Any, dtype: Dtype | ||
| Member There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. na_value restricted to scalar? Contributor Author There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No, I don't think so. | ||
| ) -> ArrayLike: | ||
| from pandas.arrays import IntegerArray, StringArray | ||
| | ||
| mask = isna(arr) | ||
| | ||
| assert isinstance(arr, StringArray) | ||
| arr = arr._ndarray | ||
jorisvandenbossche marked this conversation as resolved. Outdated Show resolved Hide resolved | ||
| | ||
| if is_integer_dtype(dtype): | ||
TomAugspurger marked this conversation as resolved. Show resolved Hide resolved | ||
| na_value_is_na = isna(na_value) | ||
| if na_value_is_na: | ||
| na_value = 1 | ||
| result = lib.map_infer_mask( | ||
| arr, | ||
| func, | ||
| mask.view("uint8"), | ||
| convert=False, | ||
| na_value=na_value, | ||
| dtype=np.dtype("int64"), | ||
| ) | ||
| | ||
| if not na_value_is_na: | ||
| mask[:] = False | ||
| | ||
| return IntegerArray(result, mask) | ||
| | ||
| elif is_string_dtype(dtype) and not is_object_dtype(dtype): | ||
| # i.e. StringDtype | ||
| result = lib.map_infer_mask( | ||
| arr, func, mask.view("uint8"), convert=False, na_value=na_value | ||
| ) | ||
| return StringArray(result) | ||
| # TODO: BooleanArray | ||
| else: | ||
TomAugspurger marked this conversation as resolved. Show resolved Hide resolved | ||
| return lib.map_infer_mask(arr, func, mask.view("uint8")) | ||
| | ||
| | ||
| def _map(f, arr, na_mask=False, na_value=np.nan, dtype=object): | ||
TomAugspurger marked this conversation as resolved. Outdated Show resolved Hide resolved | ||
| if not len(arr): | ||
| return np.ndarray(0, dtype=dtype) | ||
| | @@ -634,7 +685,7 @@ def str_replace(arr, pat, repl, n=-1, case=None, flags=0, regex=True): | |
| raise ValueError("Cannot use a callable replacement when regex=False") | ||
| f = lambda x: x.replace(pat, repl, n) | ||
| | ||
| return _na_map(f, arr) | ||
| return _na_map(f, arr, dtype=str) | ||
| | ||
| | ||
| def str_repeat(arr, repeats): | ||
| | @@ -685,7 +736,7 @@ def scalar_rep(x): | |
| except TypeError: | ||
| return str.__mul__(x, repeats) | ||
| | ||
| return _na_map(scalar_rep, arr) | ||
| return _na_map(scalar_rep, arr, dtype=str) | ||
| else: | ||
| | ||
| def rep(x, r): | ||
| | @@ -1150,7 +1201,7 @@ def str_join(arr, sep): | |
| 4 NaN | ||
| dtype: object | ||
| """ | ||
| return _na_map(sep.join, arr) | ||
| return _na_map(sep.join, arr, dtype=str) | ||
| | ||
| | ||
| def str_findall(arr, pat, flags=0): | ||
| | @@ -1381,7 +1432,7 @@ def str_pad(arr, width, side="left", fillchar=" "): | |
| else: # pragma: no cover | ||
| raise ValueError("Invalid side") | ||
| | ||
| return _na_map(f, arr) | ||
| return _na_map(f, arr, dtype=str) | ||
| | ||
| | ||
| def str_split(arr, pat=None, n=None): | ||
| | @@ -1487,7 +1538,7 @@ def str_slice(arr, start=None, stop=None, step=None): | |
| """ | ||
| obj = slice(start, stop, step) | ||
| f = lambda x: x[obj] | ||
| return _na_map(f, arr) | ||
| return _na_map(f, arr, dtype=str) | ||
| | ||
| | ||
| def str_slice_replace(arr, start=None, stop=None, repl=None): | ||
| | @@ -1578,7 +1629,7 @@ def f(x): | |
| y += x[local_stop:] | ||
| return y | ||
| | ||
| return _na_map(f, arr) | ||
| return _na_map(f, arr, dtype=str) | ||
| | ||
| | ||
| def str_strip(arr, to_strip=None, side="both"): | ||
| | @@ -1603,7 +1654,7 @@ def str_strip(arr, to_strip=None, side="both"): | |
| f = lambda x: x.rstrip(to_strip) | ||
| else: # pragma: no cover | ||
| raise ValueError("Invalid side") | ||
| return _na_map(f, arr) | ||
| return _na_map(f, arr, dtype=str) | ||
| | ||
| | ||
| def str_wrap(arr, width, **kwargs): | ||
| | @@ -1667,7 +1718,7 @@ def str_wrap(arr, width, **kwargs): | |
| | ||
| tw = textwrap.TextWrapper(**kwargs) | ||
| | ||
| return _na_map(lambda s: "\n".join(tw.wrap(s)), arr) | ||
| return _na_map(lambda s: "\n".join(tw.wrap(s)), arr, dtype=str) | ||
| | ||
| | ||
| def str_translate(arr, table): | ||
| | @@ -1687,7 +1738,7 @@ def str_translate(arr, table): | |
| ------- | ||
| Series or Index | ||
| """ | ||
| return _na_map(lambda x: x.translate(table), arr) | ||
| return _na_map(lambda x: x.translate(table), arr, dtype=str) | ||
| | ||
| | ||
| def str_get(arr, i): | ||
| | @@ -3025,7 +3076,7 @@ def normalize(self, form): | |
| import unicodedata | ||
| | ||
| f = lambda x: unicodedata.normalize(form, x) | ||
| result = _na_map(f, self._parent) | ||
| result = _na_map(f, self._parent, dtype=str) | ||
| return self._wrap_result(result) | ||
| | ||
| _shared_docs[ | ||
| | @@ -3223,31 +3274,37 @@ def rindex(self, sub, start=0, end=None): | |
| lambda x: x.lower(), | ||
| name="lower", | ||
| docstring=_shared_docs["casemethods"] % _doc_args["lower"], | ||
| dtype=str, | ||
| ) | ||
| upper = _noarg_wrapper( | ||
| lambda x: x.upper(), | ||
| name="upper", | ||
| docstring=_shared_docs["casemethods"] % _doc_args["upper"], | ||
| dtype=str, | ||
| ) | ||
| title = _noarg_wrapper( | ||
| lambda x: x.title(), | ||
| name="title", | ||
| docstring=_shared_docs["casemethods"] % _doc_args["title"], | ||
| dtype=str, | ||
| ) | ||
| capitalize = _noarg_wrapper( | ||
| lambda x: x.capitalize(), | ||
| name="capitalize", | ||
| docstring=_shared_docs["casemethods"] % _doc_args["capitalize"], | ||
| dtype=str, | ||
| ) | ||
| swapcase = _noarg_wrapper( | ||
| lambda x: x.swapcase(), | ||
| name="swapcase", | ||
| docstring=_shared_docs["casemethods"] % _doc_args["swapcase"], | ||
| dtype=str, | ||
| ) | ||
| casefold = _noarg_wrapper( | ||
| lambda x: x.casefold(), | ||
| name="casefold", | ||
| docstring=_shared_docs["casemethods"] % _doc_args["casefold"], | ||
| dtype=str, | ||
| ) | ||
| | ||
| _shared_docs[ | ||
| | ||
Uh oh!
There was an error while loading. Please reload this page.