Fastest way to filter out pandas dataframe rows containing special characters [duplicate]

Asked 7 years, 10 months ago

Modified 7 years, 10 months ago

Viewed 9k times

4

I have a list of special characters. For example:

BAD_CHARS = ['.', '&', '\(', '\)', ';', '-']

I want to remove all rows of a pandas dataframe whose column value contains any of these special characters. Currently I am doing the following:

    df = '''
          words  frequency
              &         11
      CONDUCTED          3
         (E.G.,          5
     EXPERIMENT          6
           (VS.          5
          (WARD          3
              -         14
          2006;          3
             3D          5
           ABLE          5
       ABSTRACT          3
    ACCOMPANIED          5
       ACTIVITY         11
             AD          5
         ADULTS          6
    '''

    for char in BAD_CHARS:
        df = df[~df['word'].str.contains(char)]

    # Expected Result
          words  frequency
      CONDUCTED          3
     EXPERIMENT          6
             3D          5
           ABLE          5
       ABSTRACT          3
    ACCOMPANIED          5
       ACTIVITY         11
             AD          5
         ADULTS          6

First, it is not working, and second, I guess it is not fast. How can I do this in a faster way? Thanks

asked Jan 17, 2018 at 13:14


@Zero mark it, please.

– cs95

2018-01-17 13:17:57 +00:00
Commented Jan 17, 2018 at 13:17
1

First, don't escape the braces. BAD_CHARS = ['.', '&', '(', ')', ';', '-']. Next, you can either use a character class, or use re.escape. Something like this. df[~df['words'].str.contains("[{}]".format(''.join(BAD_CHARS)))]

– cs95

2018-01-17 13:20:51 +00:00
Commented Jan 17, 2018 at 13:20
If you have issues copying that, just type it out.

– cs95

2018-01-17 13:24:43 +00:00
Commented Jan 17, 2018 at 13:24
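The character-class suggestion from the comment above can be sketched as follows. This is only an illustration under a few assumptions: the DataFrame is rebuilt by hand from the sample data in the question, the column is called `words` (as in the printed table), and `re.escape` is applied to every character so none of them can act as a metacharacter inside the class. The name `clean` is made up for the example.

```python
import re

import pandas as pd

BAD_CHARS = ['.', '&', '(', ')', ';', '-']

# Sample data copied from the question (assumed column name: 'words').
df = pd.DataFrame({
    'words': ['&', 'CONDUCTED', '(E.G.,', 'EXPERIMENT', '(VS.', '(WARD', '-',
              '2006;', '3D', 'ABLE', 'ABSTRACT', 'ACCOMPANIED', 'ACTIVITY',
              'AD', 'ADULTS'],
    'frequency': [11, 3, 5, 6, 5, 3, 14, 3, 5, 5, 3, 5, 11, 5, 6],
})

# Escape each character, then put them all in a single character class.
pattern = '[{}]'.format(''.join(re.escape(c) for c in BAD_CHARS))

# One regex pass over the column instead of one pass per character.
clean = df[~df['words'].str.contains(pattern, regex=True)]
print(clean)
```

Compared with the loop in the question, this evaluates one pattern over the column once, which is likely to be faster, although no timings were measured here.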


1 Answer

5

I believe you first need to escape the values and then join them with |. Also, as @cᴏʟᴅsᴘᴇᴇᴅ pointed out, remove the \ from the values in BAD_CHARS:

    import re

    BAD_CHARS = ['.', '&', '(', ')', ';', '-']
    pat = '|'.join(['({})'.format(re.escape(c)) for c in BAD_CHARS])

    df = df[~df['words'].str.contains(pat)]
    print (df)
              words  frequency
    1     CONDUCTED          3
    3    EXPERIMENT          6
    8            3D          5
    9          ABLE          5
    10     ABSTRACT          3
    11  ACCOMPANIED          5
    12     ACTIVITY         11
    13           AD          5
    14       ADULTS          6

because joining the unescaped values returns an empty frame (the bare . matches any character, so every row is filtered out):

    df[~df['words'].str.contains('|'.join(BAD_CHARS))]
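A small check with the standard `re` module may make the difference clearer. It is a sketch, not part of the original answer: the `words` list is a hand-picked subset of the question's data, and `re.search` is used as a stand-in for the per-value test that `str.contains` performs on the column.

```python
import re

BAD_CHARS = ['.', '&', '(', ')', ';', '-']

# Joined as-is, '.' stays a wildcard that matches any character.
unescaped = '|'.join(BAD_CHARS)
# Escaped first, every entry only matches itself literally.
escaped = '|'.join(re.escape(c) for c in BAD_CHARS)

# Hand-picked subset of the question's data (illustrative only).
words = ['CONDUCTED', '(E.G.,', '3D', '2006;']

flagged_unescaped = [w for w in words if re.search(unescaped, w)]
flagged_escaped = [w for w in words if re.search(escaped, w)]

print(flagged_unescaped)  # every word is flagged
print(flagged_escaped)    # only words that contain a listed character
```

With the unescaped pattern every row is flagged as bad, so negating the mask leaves nothing, which matches the empty frame described above.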

edited Jan 17, 2018 at 13:26

answered Jan 17, 2018 at 13:15


4 Comments

muazfaiz Over a year ago

It returns empty frame :(

cs95 Over a year ago

The question was closed as a dupe, and I've addressed the specifics of their question in a comment. Or else, I could have posted the answer myself :/

muazfaiz Over a year ago

Thanks. How easy it was :)

jezrael Over a year ago

@cᴏʟᴅsᴘᴇᴇᴅ - I don't understand "Or else, I could have posted the answer myself :/". Do you think I copied your comment as my answer? I used only one part of the comment (don't escape the values), and I added a mention of it.
