How to split a string by any characters but letters? In other words, I want only words from the text, nothing else.
    s = "(This# is an5example!)"
    what_i_want = ['This', 'is', 'an', 'example']

Please take @Andj's comment below into account. I've copied the relevant part here:

"français".isalpha() will return True while "franc\u0327ais".isalpha() will return False. For many languages, for longer strings, it will always return False. Python's definition of Alphabetic differs from Unicode's definition.
You can iterate over the characters in the string and test if they are alphabetical characters. Add some logic so you don't put empty strings in the result list and you're done:
    s = "(This# is an5example!)"
    word = ""
    word_list = []
    for character in s:
        if character.isalpha():
            # Letter: extend the current word
            word += character
        elif word:
            # Non-letter: close off the word collected so far
            word_list.append(word)
            word = ""
    if word:
        # Flush a trailing word when the string ends with a letter
        word_list.append(word)
    print(word_list)

Output:

    ['This', 'is', 'an', 'example']

str.isalpha() is fine for English, but can be dangerous for many other languages. "français".isalpha() will return True while "franc\u0327ais".isalpha() will return False. For many languages, for longer strings, it will always return False. Python's definition of Alphabetic differs from Unicode's definition.
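To make the combining-character concern concrete, here is a minimal sketch of one possible workaround, not a definitive fix. It classifies characters with unicodedata.category instead of str.isalpha(), and treats a combining mark (category M*) as part of the word it follows. The function name split_words is made up for illustration, and the rule "letters plus trailing marks" is an assumption about what should count as a word.

```python
import unicodedata


def split_words(text):
    """Split text into runs of letters, keeping combining marks attached.

    Assumption: a "word" is a run of Unicode letters (category L*),
    plus any combining marks (category M*) that follow a letter.
    """
    words = []
    current = []
    for ch in text:
        category = unicodedata.category(ch)
        is_letter = category.startswith("L")
        # A mark only counts when a word has already started
        is_mark = category.startswith("M") and bool(current)
        if is_letter or is_mark:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        # Flush a trailing word when the text ends inside one
        words.append("".join(current))
    return words


print(split_words("(This# is an5example!)"))
# The decomposed form (c + U+0327) stays in one piece here,
# even though "franc\u0327ais".isalpha() is False.
print(split_words("franc\u0327ais"))
```

Whether marks should be kept, and whether ideographs count as "letters", depends on the text you are processing, so treat the category test as a starting point.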
Simply:

    import re
    re.findall("[a-zA-Z]+", s)

With join:

    ''.join([c if c.isalpha() else '\n' for c in s]).split()

[A-Za-z], in POSIX notation as a Unicode set [[:alpha:]&[:ASCII:]]. Although technically, letters could be anything in [\p{L}] or, maybe more importantly, [\p{Alphabetic}], although that raises the question of whether you include ideographs, etc. in the mix. In the strictest sense it would be [\p{L}]. But then, I tend to expect a high level of imprecision in terminology from Python developers, since Python itself tends towards the same imprecisions.
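To illustrate the difference between the ASCII-only class and a broader one, here is a rough sketch using only the standard library. The stdlib re module has no \p{L}, so the pattern [^\W\d_] (a word character that is neither a digit nor an underscore) stands in as an approximation. That substitution is my assumption, not something from the answers above. It still splits on decomposed combining marks and lets through a few numeric characters that are not letters. The third-party regex package does accept \p{L} if you need the exact Unicode property. The helper names ascii_words and unicode_words are made up for this example.

```python
import re

# ASCII letters only, as in the re.findall answer
ASCII_WORD = re.compile(r"[a-zA-Z]+")

# Assumption: "[^\W\d_]" is used as a stdlib approximation of a
# Unicode letter class; it is not identical to \p{L}.
UNICODE_WORD = re.compile(r"[^\W\d_]+")


def ascii_words(text):
    """Return runs of ASCII letters."""
    return ASCII_WORD.findall(text)


def unicode_words(text):
    """Return runs of (approximately) Unicode letters."""
    return UNICODE_WORD.findall(text)


s = "(This# is an5example!)"
print(ascii_words(s))
print(unicode_words(s))

# The precomposed "ç" (U+00E7) is outside [a-zA-Z], so the ASCII
# pattern cuts the word in two; the broader pattern keeps it whole.
print(ascii_words("fran\u00e7ais"))
print(unicode_words("fran\u00e7ais"))
```

For plain English input the two patterns give the same result, so the choice only matters once the text contains non-ASCII letters.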