0

I want to split a string by removing everything except alphabetical characters.

By default, split only splits on whitespace between words. But I want to split on everything except alphabetical characters. How can I give split multiple delimiters?

For example:

    word1 = input().lower().split()
    # if you input " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."
    # the result will be ['has', '15', 'science@and^engineering--departments,', 'affiliated', 'centers,', 'bandar', 'abbas&&and', 'mahshahr.']

But I am looking for this kind of result:

['has', '15', 'science', 'and', 'engineering', 'departments', 'affiliated', 'centers', 'bandar', 'abbas', 'and', 'mahshahr'] 
  • Also stackoverflow.com/questions/1059559/… Commented Jul 15, 2018 at 14:38
  • You could do import re and words = re.findall(r"\w+", input().lower()). Commented Jul 15, 2018 at 14:40
  • @jonrsharpe, I think this is a different question. I believe OP is trying to split by all alphanumerical characters. Not split by selected characters only. There may be another dup but I couldn't find it. Commented Jul 15, 2018 at 14:41
  • @jpp, if problem is to split on alphanumeric, wouldn't there be non-alphanumeric characters in the result? It seems that splitting on multiple delimiters is a duplicate regardless of which set of delimiters are used for the split - the only difference in a regex solution would be the pattern used. Commented Jul 15, 2018 at 14:58
  • @wwii, See my answer, seems to solve the problem without being an answer to the proposed duplicate. Although everyone seems to prefer regex. Possibly the question needs more clarity, but then it's unclear / too broad rather than a dup. Commented Jul 15, 2018 at 15:00

2 Answers

5

For performance, you should use regex as per the marked duplicate. See benchmarking below.

groupby + str.isalnum

You can use itertools.groupby with str.isalnum as the key to group consecutive characters by whether they are alphanumeric, then keep only the alphanumeric groups.

With this solution you do not have to worry about splitting by explicitly specified characters.

    from itertools import groupby

    x = " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."

    res = [''.join(j) for i, j in groupby(x, key=str.isalnum) if i]

    print(res)

    ['has', '15', 'science', 'and', 'engineering', 'departments', 'affiliated', 'centers', 'Bandar', 'Abbas', 'and', 'Mahshahr']
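To match the lowercase output asked for in the question, the input can be lowercased before grouping. The following is a minimal sketch; the helper name `split_alnum` is hypothetical and is not part of the answer above.

```python
from itertools import groupby


def split_alnum(text):
    """Return the lowercase runs of alphanumeric characters in text."""
    # groupby yields (key, group) pairs for consecutive characters that share
    # the same str.isalnum result; keep only the runs where the key is True.
    return [''.join(group)
            for is_alnum, group in groupby(text.lower(), key=str.isalnum)
            if is_alnum]


x = " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."
print(split_alnum(x))
```

Lowercasing first mirrors the `input().lower()` call in the question, so the words come out as 'bandar', 'abbas' and 'mahshahr' rather than capitalized.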

Benchmarking vs regex

Some performance benchmarking versus regex solutions (tested on Python 3.6.5):

    from itertools import groupby
    import re

    x = " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."
    z = x*10000

    %timeit [''.join(j) for i, j in groupby(z, key=str.isalnum) if i]  # 184 ms
    %timeit list(filter(None, re.sub(r'\W+', ',', z).split(',')))      # 82.1 ms
    %timeit list(filter(None, re.split(r'\W+', z)))                    # 63.6 ms
    %timeit [_ for _ in re.split(r'\W', z) if _]                       # 62.9 ms
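The `%timeit` lines only work in IPython or Jupyter. As a rough sketch, a similar comparison can be run in plain Python with the standard `timeit` module. The function names and the `bench` helper here are hypothetical, `re.findall` is the variant suggested in the question's comments, and timings will differ by machine and Python version.

```python
import re
import timeit
from itertools import groupby

x = " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."


def with_groupby(text):
    return [''.join(j) for i, j in groupby(text, key=str.isalnum) if i]


def with_split(text):
    # re.split leaves empty strings at the ends, so filter them out.
    return [w for w in re.split(r'\W+', text) if w]


def with_findall(text):
    return re.findall(r'\w+', text)


def bench(text, number=3):
    """Return the best-of-3 time in seconds per call for each approach."""
    results = {}
    for func in (with_groupby, with_split, with_findall):
        timer = timeit.Timer(lambda: func(text))
        results[func.__name__] = min(timer.repeat(repeat=3, number=number)) / number
    return results


if __name__ == "__main__":
    for name, seconds in bench(x * 10000).items():
        print(f"{name}: {seconds * 1000:.1f} ms")
```

Note that `\w` also matches the underscore while `str.isalnum` does not, so the regex and groupby versions can disagree on input containing `_`; they agree on the sample string used here.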

2 Comments

What if we also want to remove the numbers?
Possibly str.isalpha.
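The suggestion in this comment can be sketched as follows: with `str.isalpha` as the key, digits count as separators, so only runs of letters are kept. This differs from the question's expected output, which keeps '15'.

```python
from itertools import groupby

x = " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."

# str.isalpha is False for digits, so " 15 " becomes a single separator run.
letters_only = [''.join(j) for i, j in groupby(x, key=str.isalpha) if i]
print(letters_only)
```

One side effect to be aware of: a mixed token such as "abc123def" is split into "abc" and "def" rather than dropped or kept whole.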
2

You can replace each run of non-alphanumeric characters with a single character (I'm using a comma):

    import re

    s = 'has15science@and^engineering--departments,affiliatedcenters,bandarabbas&&andmahshahr.'
    alphanumeric = re.sub(r'\W+', ',', s)

and then split it on the comma:

    splitted = alphanumeric.split(',')

Edit:

As suggested by @DeepSpace, this can be done in a single statement:

    splitted = re.split(r'\W+', s)
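One detail worth checking with this approach: when the string starts or ends with a non-word character, `re.split` returns an empty string at that end. A minimal sketch of filtering those out, using the same `s` as above (the names `parts` and `words` are only for illustration):

```python
import re

s = 'has15science@and^engineering--departments,affiliatedcenters,bandarabbas&&andmahshahr.'

parts = re.split(r'\W+', s)
# The trailing "." is a separator, so the last element is an empty string.
words = [p for p in parts if p]
print(words)
```

Also note that `\W` does not treat the underscore as a separator, because `\w` matches letters, digits and `_`. If underscores should split words too, a pattern such as `[\W_]+` is one option.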

2 Comments

Or simply use re.split
@DeepSpace, Thanks, updated my answer :)
