How to test if a string contains one of the substrings in a list, in pandas?

Question

Is there any function that would be the equivalent of a combination of df.isin() and df[col].str.contains()?

For example, say I have the series s = pd.Series(['cat','hat','dog','fog','pet']) and I want to find all places where s contains any of ['og', 'at']. I would want to get everything but 'pet'.

I have a solution, but it's rather inelegant:

searchfor = ['og', 'at']
found = [s.str.contains(x) for x in searchfor]
result = pd.DataFrame(found)
result.any()

Is there a better way to do this?

Note: There is a solution described by @unutbu which is more efficient than using pd.Series.str.contains. If performance is an issue, then this may be worth investigating.
– jpp, Commented May 6, 2018 at 22:09
Highly recommend checking out this answer for partial string search using multiple keywords/regexes (scroll down to the "Multiple Substring Search" subheading).
– cs95, Commented Apr 7, 2019 at 21:04
In the specific example in the question, you could use pd.Series.str.endswith with a tuple argument: pandas.pydata.org/docs/reference/api/…
– user7868, Commented Oct 20, 2022 at 0:19

Alex Riley · Accepted Answer · 2017-02-28 20:42:50Z

One option is just to use the regex | character to try to match each of the substrings in the words in your Series s (still using str.contains).

You can construct the regex by joining the words in searchfor with |:

>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0    cat
1    hat
2    dog
3    fog
dtype: object

As @AndyHayden noted in the comments below, take care if your substrings have special characters such as $ and ^ which you want to match literally. These characters have specific meanings in the context of regular expressions and will affect the matching.

You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:

>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']

The strings in this new list will match each character literally when used with str.contains.
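A minimal sketch of how the two steps above might be combined: escape each substring, join the results with |, and pass the pattern to str.contains. The sample Series here is made up for illustration and is not part of the original answer.

```python
import re

import pandas as pd

# Illustrative data (an assumption): two strings contain regex metacharacters.
s = pd.Series(['$money', 'x^y', 'money', 'xy'])
matches = ['$money', 'x^y']

# Unescaped, "$" and "^" act as anchors, so neither alternative can match.
unescaped_mask = s.str.contains('|'.join(matches))

# Escaped, every character is matched literally.
pattern = '|'.join(re.escape(m) for m in matches)
escaped_mask = s.str.contains(pattern)

print(s[escaped_mask])
```

With the escaped pattern only the first two entries are selected, while the unescaped pattern selects nothing.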

Maybe good to add this link too: pandas.pydata.org/pandas-docs/stable/… Starting from pandas 0.15, the string operations are even easier.
One thing you have to take care with is if a string in searchfor has special regex characters (you can map with re.escape).
I don't know why your method doesn't work with "str.startswith('|'.join(searchfor))".
In this case I understand we use "|" for OR. How could we use AND?

l'L'l · Accepted Answer · 2014-10-26 21:33:30Z

110

You can use str.contains alone with a regex pattern using OR (|):

s[s.str.contains('og|at')]

Or you could add the series to a dataframe then use str.contains:

df = pd.DataFrame(s)
df[s.str.contains('og|at')]

Output:

0    cat
1    hat
2    dog
3    fog
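One caveat worth sketching, as an assumption about data this answer does not cover: if the Series has missing values, the mask returned by str.contains can itself contain missing entries and may not be usable for indexing. The na parameter of str.contains sets the result for those entries. The Series below is hypothetical.

```python
import pandas as pd

# Hypothetical Series with a missing value (not from the original question).
s = pd.Series(['cat', 'hat', None, 'pet'])

# na=False treats missing entries as "no match", so the mask is purely boolean.
mask = s.str.contains('og|at', na=False)

print(s[mask])
```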

answered Oct 26, 2014 at 21:33


3 Comments

JacoSolari Over a year ago

how to do it for AND?

James Over a year ago

@JacoSolari check out this answer stackoverflow.com/questions/37011734/…

JacoSolari Over a year ago

@James yes, thanks. For completeness, here is the most upvoted one-liner in that answer: df.col.str.contains(r'(?=.*apple)(?=.*banana)', regex=True)
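As a rough sketch of the AND case, the lookahead pattern from the comment above can be built from a list, or the per-word masks can be combined with &. The sample data and the variable names (required, mask_regex, mask_all) are illustrative assumptions.

```python
import re

import pandas as pd

# Illustrative data (an assumption, not from the linked answer).
s = pd.Series(['apple banana', 'banana apple pie', 'apple', 'banana'])
required = ['apple', 'banana']

# Option 1: one lookahead per word, so every word must occur somewhere.
pattern = ''.join(f'(?=.*{re.escape(word)})' for word in required)
mask_regex = s.str.contains(pattern, regex=True)

# Option 2: AND together one literal-substring mask per word.
mask_all = pd.Series(True, index=s.index)
for word in required:
    mask_all &= s.str.contains(word, regex=False)

print(s[mask_regex])
```

Both masks select only the rows that contain every word in the list. The second option avoids regular expressions entirely, which may be easier to read when the list is long.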

Grant Shannon · Accepted Answer · 2020-04-01 21:30:05Z

Here is a one-line lambda that also works:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Input:

searchfor = ['og', 'at']
df = pd.DataFrame([('cat', 1000.0), ('hat', 2000000.0), ('dog', 1000.0), ('fog', 330000.0), ('pet', 330000.0)], columns=['col1', 'col2'])

  col1       col2
0  cat     1000.0
1  hat  2000000.0
2  dog     1000.0
3  fog   330000.0
4  pet   330000.0

Apply Lambda:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Output:

  col1       col2  TrueFalse
0  cat     1000.0          1
1  hat  2000000.0          1
2  dog     1000.0          1
3  fog   330000.0          1
4  pet   330000.0          0
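For comparison, here is a sketch of how the same 0/1 column might be produced without apply, by combining str.contains with an escaped, |-joined pattern. This is an alternative, not the approach used in this answer; the names by_apply and by_contains are hypothetical.

```python
import re

import pandas as pd

searchfor = ['og', 'at']
df = pd.DataFrame({'col1': ['cat', 'hat', 'dog', 'fog', 'pet']})

# Row-wise check with a lambda, as in the answer.
by_apply = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

# Vectorized check: escape each term, join with "|", cast booleans to 0/1.
pattern = '|'.join(re.escape(term) for term in searchfor)
by_contains = df['col1'].str.contains(pattern).astype(int)

print(by_apply.tolist())
print(by_contains.tolist())
```

Both columns agree on this data. The lambda performs plain substring checks, so it needs no escaping, while the str.contains version relies on re.escape to get the same literal matching.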

I did it as df.loc[df.col1.apply(lambda x: True if any(i in x for i in searchfor) else False)] and it went well, thanks.

Suraj Rao · Accepted Answer · 2022-12-16 04:41:58Z

1

Had the same issue. Without making it too complex, you can add | between each entry; for example, fieldname.str.contains("cat|dog") works.

edited Dec 16, 2022 at 4:41

Suraj Rao


answered Dec 16, 2022 at 4:26

Mammatt


1 Comment

Alexander L. Hayes Over a year ago

Hi there 👋 This solution was already provided (stackoverflow.com/a/26578218/12439119), try not to duplicate answers.
