Merged
15 changes: 15 additions & 0 deletions asv_bench/benchmarks/categoricals.py
@@ -148,3 +148,18 @@ def time_rank_int_cat(self):

    def time_rank_int_cat_ordered(self):
        self.s_int_cat_ordered.rank()


class IsIn(object):
Contributor

Can you rename this to Isin?


    goal_time = 0.2

    def setup(self):
        n = 5 * 10**5
        sample_size = 100
        arr = ['s%04d' % i for i in np.random.randint(0, n // 10, size=n)]
        self.sample = np.random.choice(arr, sample_size)
        self.ts = pd.Series(arr).astype('category')

    def time_isin_categorical_strings(self):
Contributor

There are 4 cases in the original issue; can you cover them?

        self.ts.isin(self.sample)
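
For anyone who wants to try this benchmark outside asv, here is a minimal standalone sketch. It assumes only that numpy and pandas are installed. It mirrors the setup above with a smaller n and checks that the categorical and object-dtype paths give the same answer. The helper name build_benchmark_data, the seed, and the reduced size are illustrative choices, not part of this PR.

```python
import timeit

import numpy as np
import pandas as pd


def build_benchmark_data(n=50000, sample_size=100, seed=0):
    """Build a categorical Series and a sample of values, as in setup()."""
    rng = np.random.RandomState(seed)  # fixed seed is an assumption, for repeatability
    arr = ['s%04d' % i for i in rng.randint(0, n // 10, size=n)]
    sample = rng.choice(arr, sample_size)
    ts = pd.Series(arr).astype('category')
    return ts, sample


ts, sample = build_benchmark_data()

# Ask the same membership question of the categorical and object-dtype Series.
as_category = ts.isin(sample)
as_object = ts.astype(object).isin(sample)

# Timing is machine-dependent, so it is printed rather than asserted.
elapsed = timeit.timeit(lambda: ts.isin(sample), number=10)
print('10 calls on categorical: %.4f s' % elapsed)
```

Comparing against the object-dtype result is a cheap sanity check that a faster path has not changed behaviour.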
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.0.txt
@@ -896,6 +896,7 @@ Performance Improvements
- Improved performance of :func:`pandas.core.groupby.GroupBy.ffill` and :func:`pandas.core.groupby.GroupBy.bfill` (:issue:`11296`)
- Improved performance of :func:`pandas.core.groupby.GroupBy.any` and :func:`pandas.core.groupby.GroupBy.all` (:issue:`15435`)
- Improved performance of :func:`pandas.core.groupby.GroupBy.pct_change` (:issue:`19165`)
- Improved performance of :func:`Series.isin` in the case of categorical dtypes (:issue:`20003`)

.. _whatsnew_0230.docs:

11 changes: 11 additions & 0 deletions pandas/core/arrays/categorical.py
@@ -39,6 +39,8 @@
from pandas.util._decorators import (
    Appender, cache_readonly, deprecate_kwarg, Substitution)

import pandas.core.algorithms as algorithms

from pandas.io.formats.terminal import get_terminal_size
from pandas.util._validators import validate_bool_kwarg, validate_fillna_kwargs
from pandas.core.config import get_option
@@ -2216,6 +2218,15 @@ def _concat_same_type(self, to_concat):
    def _formatting_values(self):
        return self

    def isin(self, values):
        from pandas.core.series import _sanitize_array
Contributor

Can you add a docstring here?

Contributor Author

Sure.

Contributor

Can you move the import to the top of the function?

        values = _sanitize_array(values, None, None)
        null_mask = isna(values)
        code_values = self.categories.get_indexer(values)
Contributor

@TomAugspurger is there an EA (ExtensionArray) version of this in the API? (Maybe it should raise NotImplementedError, which could be caught and replaced with a default implementation.)

Contributor

There isn't yet, but I was thinking about adding it. I actually define it on IPArray, since an IP address can be "in" a network.

I'll open an issue (it shouldn't delay the changes here).

        code_values = code_values[null_mask | (code_values >= 0)]
        return algorithms.isin(self.codes, code_values)
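
The approach in this method can be sketched with public pandas and NumPy calls only. This is an illustration under stated assumptions, not the code being merged: categorical_isin_sketch is a hypothetical name, np.asarray stands in for the private _sanitize_array, and np.isin stands in for algorithms.isin. The idea is to translate the lookup values into category codes once, then test membership on the small integer codes array instead of on the original values.

```python
import numpy as np
import pandas as pd


def categorical_isin_sketch(cat, values):
    """Membership test on the integer codes of a Categorical (illustrative)."""
    values = np.asarray(values, dtype=object)
    null_mask = pd.isna(values)
    # Position of each value among the categories; -1 means "not a category".
    code_values = cat.categories.get_indexer(values)
    # Keep real categories, and keep -1 only where the looked-up value was
    # null, because missing entries are stored with code -1.
    code_values = code_values[null_mask | (code_values >= 0)]
    return np.isin(cat.codes, code_values)


cat = pd.Categorical(["a", "b", np.nan])
with_nan = categorical_isin_sketch(cat, ["a", np.nan])
unknown = categorical_isin_sketch(cat, ["a", "c"])
```

The mask matters because -1 is ambiguous: it marks both a missing entry in the codes and a lookup value that is not a category. Without the filter, asking for an unknown value such as "c" would wrongly match the NaN entries.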


# The Series.cat accessor


5 changes: 4 additions & 1 deletion pandas/core/series.py
@@ -3564,7 +3564,10 @@ def isin(self, values):
        5    False
        Name: animal, dtype: bool
        """
        result = algorithms.isin(com._values_from_object(self), values)
Contributor

I wonder if the _values_from_object call can be moved into algorithms.isin? Then this could just be result = algorithms.isin(self, values).

Contributor

Yes, let's try to do this. @Ma3aXaKa can you make this change?

        if is_categorical_dtype(self):
            result = self._values.isin(values)
Contributor

I think a more duck-like check would work here, something like:

    if hasattr(self._values, 'isin'):
        result = self._values.isin(values)
    else:
        ......

Contributor Author

@jreback What's the purpose? Is it better for performance?

Contributor

It's more generic. You don't have to know it's a categorical here.

        else:
            result = algorithms.isin(com._values_from_object(self), values)
        return self._constructor(result, index=self.index).__finalize__(self)
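
The duck-typed check suggested in the review thread above can be sketched in isolation as follows. This is only an illustration: isin_dispatch is a hypothetical helper, and np.isin stands in for pandas' internal algorithms.isin as the generic fallback. It relies on pd.Categorical having an isin method (which this PR adds) and on plain NumPy arrays not having one.

```python
import numpy as np
import pandas as pd


def isin_dispatch(array, values):
    """Use the array's own isin() if it has one, else a generic fallback."""
    if hasattr(array, 'isin'):
        # Array types that know how to test membership efficiently.
        return array.isin(values)
    # Generic path for plain NumPy arrays, which have no isin() method.
    return np.isin(array, values)


from_categorical = isin_dispatch(pd.Categorical(["a", "b", "a"]), ["a"])
from_ndarray = isin_dispatch(np.array([1, 2, 3]), [2, 3])
```

The caller never names a dtype, so any array type that later gains an isin method would be picked up without changing the dispatch code. That is the "more generic" property the reviewer describes.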

    def between(self, left, right, inclusive=True):
21 changes: 21 additions & 0 deletions pandas/tests/categorical/test_algos.py
@@ -47,3 +47,24 @@ def test_factorized_sort_ordered():

    tm.assert_numpy_array_equal(labels, expected_labels)
    tm.assert_categorical_equal(uniques, expected_uniques)


def test_isin_cats():
    cat = pd.Categorical(["a", "b", np.nan])

Contributor

Can you add the issue number here as a comment?

    result = cat.isin(["a", np.nan])
    expected = np.array([True, False, True], dtype=bool)
    tm.assert_numpy_array_equal(expected, result)

    result = cat.isin(["a", "c"])
    expected = np.array([True, False, False], dtype=bool)
Contributor

There are a couple of tests in pandas/tests/test_algos.py that test cats for isin; can you move them here?

    tm.assert_numpy_array_equal(expected, result)


@pytest.mark.parametrize("empty", [[], pd.Series(), np.array([])])
def test_isin_empty(empty):
    s = pd.Categorical(["a", "b"])
    expected = np.array([False, False], dtype=bool)

    result = s.isin(empty)
    tm.assert_numpy_array_equal(expected, result)