I need to write a regular expression that keeps only the characters in the list below (that is, removes every character not in the list).

allow_characters = "#.-_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789" 

I don't know how to do it. Should I use re.match, re.findall, or re.sub?

Thanks a lot in advance.

1 Answer

Don't use regular expressions at all. First convert allow_characters to a set, then use ''.join() with a generator expression that strips out the unwanted characters. Assuming the string you are transforming is called s:

allow_char_set = set(allow_characters)
s = ''.join(c for c in s if c in allow_char_set)
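
As a minimal runnable sketch of the set-based approach, the two statements can be wrapped in a function. The name keep_allowed and the sample string are assumptions made for illustration; they are not part of the original answer.

```python
allow_characters = "#.-_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

# Build the set once; membership tests on a set are constant time on average.
allow_char_set = set(allow_characters)


def keep_allowed(s):
    """Return s with every character outside allow_char_set removed."""
    return ''.join(c for c in s if c in allow_char_set)


print(keep_allowed("hello, world! #tag_1-2.txt"))  # helloworld#tag_1-2.txt
```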

That being said, here is how this might look with regex:

s = re.sub(r'[^#.\-_a-zA-Z0-9]+', '', s) 

You could convert your allow_characters string into this regex, but I think the first solution is significantly more straightforward.
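
To answer the question of which function to use: re.sub fits, because the goal is to replace every disallowed run with an empty string. Here is a small self-contained sketch of the regex version. The names DISALLOWED and keep_allowed_re are illustrative assumptions.

```python
import re

# Negated character class: matches runs of anything NOT in the allowed list.
# The dash is escaped so it is read literally rather than as a range separator.
DISALLOWED = re.compile(r'[^#.\-_a-zA-Z0-9]+')


def keep_allowed_re(s):
    """Return s with every disallowed run replaced by the empty string."""
    return DISALLOWED.sub('', s)


print(keep_allowed_re("a b/c:d#e-f_g.h"))  # abcd#e-f_g.h
```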

Edit: As pointed out by DSM in comments, str.translate() is often a very good way to do something like this. In this case it is slightly complicated but you can still use it like this:

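import string

allow_characters = "#.-_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
# Python 2 API: string.maketrans and the two-argument str.translate are not available in Python 3.
all_characters = string.maketrans('', '')  # identity table, i.e. all 256 characters
delete_characters = all_characters.translate(None, allow_characters)  # everything not allowed
s = s.translate(None, delete_characters)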
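
The snippet above relies on Python 2's string.maketrans and the two-argument form of str.translate. On Python 3, str.translate takes a single table argument that is indexed by code point, and a lookup that returns None deletes the character. One possible sketch, assuming Python 3, follows. The KeepOnly class and the function name are hypothetical helpers, not something from the original answer.

```python
allow_characters = "#.-_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


class KeepOnly(dict):
    """Translation table: allowed code points map to themselves, all others are deleted."""

    def __missing__(self, code_point):
        # str.translate drops a character when the table lookup returns None.
        return None


# Keys are code points (ints); values are the replacement strings.
table = KeepOnly((ord(c), c) for c in allow_characters)


def keep_allowed_translate(s):
    """Return s with every character missing from the table removed."""
    return s.translate(table)


print(keep_allowed_translate("a b!c#1"))  # abc#1
```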

7 Comments

"convert your allow_characters string into this regex" isn't as complicated as it sounds; it's just r'[^{}]+'.format(allow_characters). But I agree that the first solution is clearer.
@F.J The set idea is extremely easy to understand, but is performance also a consideration here? Is one much faster than the other? Also, is there an easy way to figure out which characters should be escaped in the regular expression, and why does the dash need to be escaped?
Dash needs to be escaped if it is within a character class because otherwise it will be interpreted as a range separator. For example, [a-z] matches all lowercase letters, while [a\-z] matches "a", "-", or "z". You can use re.escape() to do the escaping automatically; just make sure this is the first step in the transformation (before you add the [^ and ]+ to the ends).
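
A short sketch of that transformation, escaping first and wrapping second, is shown next to the unescaped version for comparison. The variable and function names are assumptions for illustration. In the unescaped pattern, ".-_" is read as a range from "." to "_", so characters such as "/" and ":" slip through while the literal "-" is removed.

```python
import re

allow_characters = "#.-_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

# Escape first, then wrap in the negated class, so '-' stays a literal dash.
escaped = re.compile('[^{}]+'.format(re.escape(allow_characters)))

# No escaping: '.-_' becomes a character range instead of three literals.
unescaped = re.compile('[^{}]+'.format(allow_characters))


def keep_allowed_built(s):
    """Strip disallowed characters using the pattern built from allow_characters."""
    return escaped.sub('', s)


print(keep_allowed_built("a/b:c-d"))   # abc-d
print(unescaped.sub('', "a/b:c-d"))    # a/b:cd
```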
Also I wouldn't be too surprised if the set solution was faster than regex anyway.
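
Rather than guessing, the two approaches can be timed with timeit. The sketch below is a rough benchmark only: the sample input is invented, and the outcome will vary with the Python version and with how many characters actually need removing, so no result is claimed here.

```python
import re
import timeit

allow_characters = "#.-_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
allow_char_set = set(allow_characters)
pattern = re.compile(r'[^#.\-_a-zA-Z0-9]+')

# Hypothetical sample input; real data may behave quite differently.
sample = "some text, with (punctuation) & spaces! #tag_1-2.txt " * 1000


def with_set(s):
    return ''.join(c for c in s if c in allow_char_set)


def with_regex(s):
    return pattern.sub('', s)


# Both approaches must agree before the timings mean anything.
assert with_set(sample) == with_regex(sample)

for func in (with_set, with_regex):
    seconds = timeit.timeit(lambda: func(sample), number=100)
    print('{:<12} {:.4f} s'.format(func.__name__, seconds))
```
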
@DSM Nope, the wording of the question is slightly confusing but the OP wants to keep all characters in allow_characters by removing everything else.