Remove all special characters, punctuation and spaces from string

Question

I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.

wjandrea · Accepted Answer · 2019-09-26 22:48:45Z

570

This can be done without regex:

    >>> string = "Special $#! characters spaces 888323"
    >>> ''.join(e for e in string if e.isalnum())
    'Specialcharactersspaces888323'

You can use str.isalnum:

    S.isalnum() -> bool

    Return True if all characters in S are alphanumeric and there is at least one character in S, False otherwise.

If you insist on using regex, other solutions will do fine. However note that if it can be done without using a regular expression, that's the best way to go about it.

edited Sep 26, 2019 at 22:48

wjandrea


answered Apr 30, 2011 at 17:47

user225312


7 Comments

Chris Dutrow Over a year ago

What is the reason not using regex as a rule of thumb?

Diego Navarro Over a year ago

@ChrisDutrow regex are slower than python string built-in functions

Francisco Over a year ago

@DiegoNavarro except that's not true, I benchmarked both the isalnum() and regex versions, and the regex one is 50-75% faster

576i Over a year ago

Tried this in Python3 - it accepts unicode chars so it's useless to me. Try string = "B223323\§§§$3\u445454" as an example. The result? 'B2233233䑔54'

Antti Haapala Over a year ago

Additionally: "For 8-bit strings, this method is locale-dependent."! Thus the regex alternative is strictly better!
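The Unicode concern raised in these comments can be handled without regex. A minimal sketch, assuming Python 3.7+ (for str.isascii) and a hypothetical helper name keep_ascii_alnum: plain isalnum() keeps letters such as ï or é, while the extra isascii() check narrows the result to ASCII letters and digits.

```python
def keep_ascii_alnum(text: str) -> str:
    """Keep only ASCII letters and digits (hypothetical helper name)."""
    # str.isalnum() alone is Unicode-aware; str.isascii() narrows it to ASCII.
    return "".join(ch for ch in text if ch.isascii() and ch.isalnum())


sample = "naïve café 42!"
unicode_aware = "".join(ch for ch in sample if ch.isalnum())
ascii_only = keep_ascii_alnum(sample)

print(unicode_aware)  # naïvecafé42
print(ascii_only)     # navecaf42
```

Which variant is right depends on whether accented letters count as "special characters" for your data.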


wjandrea · Accepted Answer · 2019-09-26 22:51:22Z

408

Here is a regex to match a string of characters that are not letters or numbers:

[^A-Za-z0-9]+

Here is the Python command to do a regex substitution:

re.sub('[^A-Za-z0-9]+', '', mystring)

edited Sep 26, 2019 at 22:51

wjandrea


answered Apr 30, 2011 at 17:46

Andy White


7 Comments

Reihan_amn Over a year ago

this also removes the spaces between words, "great place" -> "greatplace". How to avoid it?

sougonde Over a year ago

@Reihan_amn Simply add a space to the regex, so it becomes: [^A-Za-z0-9 ]+

HuLu ViCa Over a year ago

I guess this doesn't work with modified character in other languages, like á, ö, ñ, etc. Am I right? If so, how would it be the regex for it?

tommy.carstensen Over a year ago

This doesn't work for Spanish, German, Danish and other languages.

EMT Over a year ago

just add the special characters of that particular language. For example, for German text, the expression re.sub('[^A-Za-z0-9 ,._\'äöüÄÖÜß-]+', '', sample_text) can be used.
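One pitfall when extending a character class with extra literals: a hyphen between two characters forms a range. The sketch below (the sample text and variable names are made up for illustration) compares a class where the hyphen sits between '.' and '_' with one where it is placed last.

```python
import re

text = "Grüße @ home: ok-fine"

# A hyphen between '.' and '_' forms a range (U+002E to U+005F), so '@' and ':'
# count as allowed characters, while the literal '-' does not.
with_range = re.sub(r"[^A-Za-z0-9 ,.-_'äöüÄÖÜß]+", "", text)

# A hyphen placed last in the class is a literal hyphen.
hyphen_last = re.sub(r"[^A-Za-z0-9 ,._'äöüÄÖÜß-]+", "", text)

print(with_range)   # Grüße @ home: okfine
print(hyphen_last)  # Grüße  home ok-fine
```

Placing the hyphen first or last in the class, or escaping it, avoids the accidental range.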


tuxErrante · Accepted Answer · 2014-08-07 13:26:24Z

83

Shorter way :

    import re
    cleanString = re.sub('\W+','', string )

If you want spaces between words and numbers substitute '' with ' '

answered Aug 7, 2014 at 13:26

tuxErrante


6 Comments

kkurian Over a year ago

Except that _ is in \w and is a special character in the context of this question.

Echelon Over a year ago

Depends on the context - underscore is very useful for filenames and other identifiers, to the point that I don't treat it as a special character but rather a sanitised space. I generally use this method myself.

Bob Stein Over a year ago

r'\W+' - slightly off topic (and very pedantic) but I suggest a habit that all regex patterns be raw strings

Md Sabbir Ahmed Over a year ago

This procedure does not treat underscore(_) as a special character.

Tomerikoo Over a year ago

A simple change to remove _ as well: r"[^A-Za-z0-9]+" instead of r"\W+"
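The underscore point from these comments can be checked directly. A small sketch (the sample text is made up) comparing \W+ with a class that also lists the underscore:

```python
import re

text = "snake_case name, v2!"

# '_' is part of \w, so \W+ leaves it in place.
non_word_removed = re.sub(r"\W+", "", text)

# Adding '_' to the class strips it as well.
underscore_removed = re.sub(r"[\W_]+", "", text)

print(non_word_removed)    # snake_casenamev2
print(underscore_removed)  # snakecasenamev2
```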


mbeacom · Accepted Answer · 2021-09-18 20:39:16Z

TLDR

I timed the provided answers.

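    import re
    re.sub('\W+','', string)

is roughly 3x faster than the isalnum() approach (Example 1) and about 1.5x to 2.3x faster than the explicit character-class regex (Example 2) in the timings below.

Caution should be taken when using this option: \w is Unicode-aware in Python 3, so some special characters (e.g. ø) may not be stripped using this method, and the underscore is kept as well.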

After seeing this, I was interested in expanding on the provided answers by finding out which executes in the least amount of time, so I went through and checked some of the proposed answers with timeit against two of the example strings:

string1 = 'Special $#! characters spaces 888323'
string2 = 'how much for the maple syrup? $20.99? That s ridiculous!!!'

Example 1

''.join(e for e in string if e.isalnum())

string1 - Result: 10.7061979771
string2 - Result: 7.78372597694

Example 2

    import re
    re.sub('[^A-Za-z0-9]+', '', string)

string1 - Result: 7.10785102844
string2 - Result: 4.12814903259

Example 3

    import re
    re.sub('\W+','', string)

string1 - Result: 3.11899876595
string2 - Result: 2.78014397621

Each result above is the lowest total time, in seconds, among the runs returned by timeit's repeat(3, 2000000), i.e. three repetitions of 2,000,000 executions each.

Example 3 can be 3x faster than Example 1.
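For anyone who wants to rerun the comparison, here is a rough sketch of such a benchmark. The function names and the much smaller number=20_000 are my own choices for a quick run, not the setup used for the figures above, so absolute numbers will differ by machine and Python version.

```python
import re
import timeit

sample = "Special $#! characters spaces 888323"


def with_isalnum(text):
    return "".join(ch for ch in text if ch.isalnum())


def with_char_class(text):
    return re.sub(r"[^A-Za-z0-9]+", "", text)


def with_non_word(text):
    return re.sub(r"\W+", "", text)


candidates = {
    "isalnum": with_isalnum,
    "char_class": with_char_class,
    "non_word": with_non_word,
}

results = {}
for name, func in candidates.items():
    # repeat=3 runs of number=20_000 calls each; keep the lowest total time.
    timings = timeit.repeat(lambda: func(sample), repeat=3, number=20_000)
    results[name] = min(timings)

for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.4f} s")
```

Taking the minimum of the repeats is the usual way to reduce noise from other processes.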

@kkurian If you read the beginning of my answer, this is merely a comparison of the previously proposed solutions above. You might want to comment on the originating answer... stackoverflow.com/a/25183802/2560922
Third solution didn't remove the "ø" but the second one did remove it.

Grijesh Chauhan · Accepted Answer · 2019-04-26 13:18:31Z

34

Python 2.*

I think just filter(str.isalnum, string) works

    In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
    Out[20]: 'stringwithspecialcharslikeetcs'

Python 3.*

In Python 3, filter() returns an iterator (instead of a string, as in Python 2 above). You have to join it back to get a string:

''.join(filter(str.isalnum, string))

or, to pass a list to join (not sure, but it may be slightly faster):

''.join([*filter(str.isalnum, string)])

Note: unpacking with [*args] is valid from Python 3.5 onward.

edited Apr 26, 2019 at 13:18

answered Apr 14, 2016 at 9:32

Grijesh Chauhan


4 Comments

Grijesh Chauhan Over a year ago

@Alexey correct, in Python 3 map and filter return iterators instead. Still, in Python 3+ I prefer ''.join(filter(str.isalnum, string)) (or, to pass a list to join, ''.join([*filter(str.isalnum, string)])) over the accepted answer.

TheProletariat Over a year ago

I'm not certain ''.join(filter(str.isalnum, string)) is an improvement on filter(str.isalnum, string), at least to read. Is this really the Pythreenic (yeah, you can use that) way to do this?

Grijesh Chauhan Over a year ago

@TheProletariat The point is that filter(str.isalnum, string) alone does not return a string in Python 3, because filter() in Python 3 returns an iterator rather than the argument type, unlike in Python 2.

mwfearnley Over a year ago

@GrijeshChauhan, I think you should update your answer to include both your Python2 and Python3 recommendations.

pkm · Accepted Answer · 2014-05-25 09:28:49Z

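    #!/usr/bin/python
    import re

    strs = "how much for the maple syrup? $20.99? That's ridiculous!!!"
    print(strs)

    # Remove a chosen set of special characters ('|' is not needed inside [...])
    nstr = re.sub(r'[?$.!]', '', strs)
    print(nstr)

    # Then drop everything that is not a letter, digit or space
    nestr = re.sub(r'[^a-zA-Z0-9 ]', '', nstr)
    print(nestr)

You can add more special characters to the first character class; each match is replaced by '' (the empty string), i.e. it is removed.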

Vlad Bezden · Accepted Answer · 2020-03-17 15:21:30Z

string.punctuation contains following characters:

'!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~'

You can use translate and maketrans functions to map punctuations to empty values (replace)

    import string
    'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))

Output:

'This is A test'

Andrea · Accepted Answer · 2018-09-05 10:02:41Z

Rather than enumerating the characters I don't want, as some of the regex answers do, I would exclude every character that is not what I want.

For example, if I want only characters from 'a to z' (upper and lower case) and numbers, I would exclude everything else:

    import re
    s = re.sub(r"[^a-zA-Z0-9]","",s)

This means "substitute every character that is not a number, or a character in the range 'a to z' or 'A to Z' with an empty string".

The special character ^ placed first inside the square brackets negates the character class.

Extra tip: if you also need the result in lowercase, lowercase the string first; the regex then no longer needs the uppercase range, which makes it simpler and possibly a little faster.

    import re
    s = re.sub(r"[^a-z0-9]","",s.lower())

Zoe - Save the data dump · Accepted Answer · 2018-06-15 19:00:00Z

15

s = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s)

edited Jun 15, 2018 at 19:00

Zoe - Save the data dump


answered Jun 15, 2018 at 12:09

sneha


John Machin · Accepted Answer · 2011-04-30 21:07:48Z

Assuming you want to use a regex and you want/need Unicode-cognisant 2.x code that is 2to3-ready:

    >>> import re
    >>> rx = re.compile(u'[\W_]+', re.UNICODE)
    >>> data = u''.join(unichr(i) for i in range(256))
    >>> rx.sub(u'', data)
    u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
    >>>
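In Python 3 the same pattern is Unicode-aware by default for str input, and the re.ASCII flag switches it back to ASCII-only matching. A short sketch (the sample text and variable names are illustrative):

```python
import re

text = "Grüße_123 señor!"

# Default for str patterns in Python 3: \W is Unicode-aware.
unicode_rx = re.compile(r"[\W_]+")
# re.ASCII restricts \w and \W to ASCII semantics.
ascii_rx = re.compile(r"[\W_]+", re.ASCII)

unicode_result = unicode_rx.sub("", text)
ascii_result = ascii_rx.sub("", text)

print(unicode_result)  # Grüße123señor
print(ascii_result)    # Gre123seor
```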

BioGeek · Accepted Answer · 2012-05-18 13:28:01Z

The most generic approach is using the 'categories' of the unicodedata table which classifies every single character. E.g. the following code filters only printable characters based on their category:

    import unicodedata

    # Strip unwanted characters (based on the Unicode database categorization:
    # http://www.sql-und-xml.de/unicode-database/#kategorien)
    PRINTABLE = {'Lu', 'Ll', 'Nd', 'Zs'}

    def filter_non_printable(s):
        result = []
        for c in s:
            # Keep the character if its category is allowed, otherwise mark it with '#'
            result.append(c if unicodedata.category(c) in PRINTABLE else u'#')
        # Every '#' (marker or literal) becomes a space
        return u''.join(result).replace(u'#', u' ')

Look at the given URL above for all related categories. You also can of course filter by the punctuation categories.
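Filtering by the punctuation categories could look like the sketch below. It assumes that "punctuation" means every Unicode category starting with "P" (Pc, Pd, Ps, Pe, Pi, Pf, Po); the helper name strip_punctuation is hypothetical.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop every character whose Unicode category starts with 'P'."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


print(strip_punctuation("¡Hola, mundo! «ok»"))  # Hola mundo ok
# Symbols such as '$' are category 'Sc', not punctuation, so they survive.
print(strip_punctuation("$5.00"))  # $500
```

Unlike string.punctuation, this also covers non-ASCII punctuation such as ¡ and «», but it leaves symbols and whitespace alone.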

petezurich · Accepted Answer · 2020-06-27 10:00:21Z

For other languages like German, Spanish, Danish, French etc that contain special characters (like German "Umlaute" as ü, ä, ö) simply add these to the regex search string:

Example for German:

re.sub('[^A-Za-z0-9ÄÖÜäöüß]+', '', mystring)

Shubham Shewdikar · Accepted Answer · 2021-05-11 08:29:55Z

The first two approaches below remove all special characters, punctuation, and spaces from a string, leaving only letters and numbers; the last two remove only the characters listed in special_char_list.

    import re

    sample_str = "Hel&&lo %% Wo$#rl@d"

    # using isalnum()
    print("".join(k for k in sample_str if k.isalnum()))

    # using regex (digits included in the allowed set)
    op2 = re.sub("[^A-Za-z0-9]", "", sample_str)
    print("op2 =", op2)

    special_char_list = ["$", "@", "#", "&", "%"]

    # using list comprehension (spaces are kept)
    op1 = "".join([k for k in sample_str if k not in special_char_list])
    print("op1 =", op1)

    # using lambda function (spaces are kept)
    op3 = "".join(filter(lambda x: x not in special_char_list, sample_str))
    print("op3 =", op3)

jjmurre · Accepted Answer · 2016-03-23 19:37:46Z

4

Use translate:

    import string

    def clean(instr):
        return instr.translate(None, string.punctuation + ' ')

Caveat: Only works on ascii strings.

answered Mar 23, 2016 at 19:37

jjmurre


2 Comments

matt wilkie Over a year ago

Version difference? I get TypeError: translate() takes exactly one argument (2 given) with py3.4

duburcqa Over a year ago

It is only working with Python2.7. See below answer for using translate with Python3.
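For Python 3, where str.translate takes a single table argument, a rough equivalent of the function above might look like this. The module-level name _DELETE_TABLE is my own; the third argument of str.maketrans lists the characters to delete.

```python
import string

# Build the deletion table once: ASCII punctuation plus the space character.
_DELETE_TABLE = str.maketrans("", "", string.punctuation + " ")


def clean(instr: str) -> str:
    """Remove ASCII punctuation and spaces; other characters pass through."""
    return instr.translate(_DELETE_TABLE)


print(clean("This, is. A test!"))  # ThisisAtest
```

Non-ASCII letters and punctuation are left untouched, since only the characters listed in the table are removed.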

Dharman · Accepted Answer · 2021-02-01 21:07:25Z

This will remove all non-alphanumeric characters except spaces.

    string = "Special $#! characters spaces 888323"
    ''.join(e for e in string if (e.isalnum() or e.isspace()))

Output (the spaces on both sides of the removed $#! are kept):

    'Special  characters spaces 888323'

Vinay Kumar Kuresi · Accepted Answer · 2018-07-16 11:52:40Z

    import re

    my_string = """Strings are amongst the most popular data types in Python. We can create the strings by enclosing characters in quotes. Python treats single quotes the
    same as double quotes."""

    # Assumption: the original snippet never defined `text`;
    # a lowercased list of words is the likely intent.
    text = my_string.lower().split()

    # Count the word "python" whether or not it ends with ',' or '.'
    for index, word in enumerate(text):
        if word.endswith((".", ",")):
            text[index] = re.sub(r"^([a-z]+)[.,]?$", r"\1", word)
    print("The count of Python : ", text.count("python"))

Viren Ramani · Accepted Answer · 2021-10-27 13:21:16Z

Ten years later, I believe the approach below is the best solution. It can clean special characters, punctuation, ASCII characters and spaces from the string.

    from clean_text import clean

    string = 'Special $#! characters spaces 888323'
    new = clean(string, lower=False, no_currency_symbols=True, no_punct=True, replace_with_currency_symbol='')
    print(new)
    # Output ==> 'Special characters spaces 888323'

    # You can remove the spaces as well if you want:
    update = new.replace(' ', '')
    print(update)
    # Output ==> 'Specialcharactersspaces888323'

ArtBindu · Accepted Answer · 2022-04-06 15:02:44Z

    function regexFuntion(st) {
      const regx = /[^\w\s]/gi; // allow: [a-zA-Z0-9_] and whitespace
      st = st.replace(regx, ''); // remove everything else
      st = st.replace(/\s\s+/g, ' '); // collapse multiple spaces into one
      return st;
    }

    console.log(regexFuntion('$Hello; # -world--78asdf+-===asdflkj******lkjasdfj67;'));
    // Output: Hello world78asdfasdflkjlkjasdfj67

Dsw Wds · Accepted Answer · 2016-02-25 08:20:57Z

-4

    import re

    abc = "askhnl#$%askdjalsdk"
    ddd = abc.replace("#$%","")
    print (ddd)

and you shall see your result as

    askhnlaskdjalsdk

edited Feb 25, 2016 at 8:20

answered Feb 25, 2016 at 8:00

Dsw Wds


1 Comment

JChao Over a year ago

wait.... you imported re but never used it. Your replace criteria only works for this specific string. What if your string is abc = "askhnl#$%!askdjalsdk"? I don't think it will work on anything other than the #$% pattern. Might wanna tweak it
