I'm trying to get a list of words from a string. Sounds like an easy task for Mathematica. I have the following code:
    text = "Merçi d'avoir pris le temps.";
    ToLowerCase[#] & /@ StringSplit[text, Except[WordCharacter] ..]

However, the output is

    {"merçi", "d", "avoir", "pris", "le", "temps"}

and not

    {"merçi", "d'avoir", "pris", "le", "temps"}

because the ' is not a word character. Hence, I'd like the ' (and likewise the -) not to be treated as a separator. Any idea on how to do that?
[...] é? WordCharacter does match it on my machine, as ToUpperCase/ToLowerCase work fine on it.

[...] ' and -), as there are some pretty weird ones in these articles. I'd rather not have to specify them all manually.

é is matched by WordCharacter; I must've had some other error in my code. The other problem persists: I don't want ' and - taken out. Is there any way I can change Except[WordCharacter].. to something similar to Except[{WordCharacter,Characters["'-"]}]..?

Except[WordCharacter | "'" | "-"]?
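For what it's worth, the pattern suggested in the last comment can be dropped straight into the code from the question. This is only a minimal sketch, assuming that ' and - are the only extra characters that should stay inside words; the variable name words is just for illustration.

```mathematica
(* Sample sentence from the question *)
text = "Merçi d'avoir pris le temps.";

(* Split on runs of characters that are neither word characters, ' nor -. *)
(* Assumption: ' and - are the only non-word characters to keep in words. *)
words = ToLowerCase[#] & /@
  StringSplit[text, Except[WordCharacter | "'" | "-"] ..]

(* Desired result from the question: *)
(* {"merçi", "d'avoir", "pris", "le", "temps"} *)
```

Any further characters would have to be added to the alternatives by hand, so this does not by itself cover the wish to avoid listing them all manually.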