I have a string "Mikael Håfström" which contains some special characters. How do I remove them using Python?
- Is your string a unicode string? Do you want to remove the characters, or rather replace them with "standard" characters? – Sven Marnach, Mar 10, 2011 at 10:54
- Every character is special in its own way. – Ignacio Vazquez-Abrams, Mar 10, 2011 at 10:59
- Related: What is the best way to remove accents in a python unicode string? – Sven Marnach, Mar 10, 2011 at 10:59
3 Answers
You can use the unicodedata module to normalize unicode strings and encode them in their ASCII form like so:
>>> import unicodedata
>>> source = u'Mikael Håfström'
>>> unicodedata.normalize('NFKD', source).encode('ascii', 'ignore')
'Mikael Hafstrom'

One notable exception is that the letters 'đ' and 'Đ' are not converted to 'd' by this approach, so they are simply omitted from the result. The letter represents a voiced alveolo-palatal affricate in the Latin alphabets of some SEE languages, so it may or may not concern you, depending on your audience and on whether you're providing full support for the Latin-1 character set. I currently have Python 2.6.5 (Mar 19 2010) running locally and the issue is present, though it may have been resolved in newer releases.
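For anyone on Python 3, here is a minimal sketch of the same idea. The function name `strip_accents` is just an illustrative choice, and the `.decode("ascii")` step is added only because `encode` returns `bytes` in Python 3. It also shows the 'đ' case mentioned above: the letter has no decomposition, so it is dropped rather than turned into 'd'.

```python
import unicodedata


def strip_accents(text):
    """Return text with accents removed; undecomposable non-ASCII is dropped."""
    # NFKD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" discards the combining marks and any other
    # non-ASCII character; decode again to get a str instead of bytes.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("Mikael H\u00e5fstr\u00f6m"))  # Mikael Hafstrom
print(strip_accents("\u0110or\u0111e"))            # ore  (both Đ and đ are lost)
```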
4 Comments
unicodedata functions get their data directly from tables provided by unicode.org. There is no "issue".

For example, using the encode method: u"Mikael Håfström".encode("ascii", "ignore")
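Note that encoding on its own removes the accented letters entirely rather than replacing them with their base letters. A small Python 3 sketch, assuming the string is stored in precomposed form (written with escapes here to make that explicit); the variable names are only illustrative:

```python
name = "Mikael H\u00e5fstr\u00f6m"  # "Mikael Håfström", precomposed characters

# errors="ignore" silently drops every character that is not ASCII.
dropped = name.encode("ascii", "ignore").decode("ascii")

# errors="replace" substitutes a question mark for each such character.
replaced = name.encode("ascii", "replace").decode("ascii")

print(dropped)   # Mikael Hfstrm
print(replaced)  # Mikael H?fstr?m
```

If you want 'Hafstrom' rather than 'Hfstrm', the string has to be normalized to a decomposed form before encoding.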
See this effbot article (includes code). It makes reasonable transliterations into ASCII characters where possible. It is possible to extend the built-in conversion table to handle many other characters (e.g. those used in Eastern European languages) that don't have a canonical decomposition.
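The article's code is not reproduced here, but the general idea of an extended conversion table can be sketched roughly as follows. This is not the effbot implementation: `EXTRA_MAP` and `to_ascii` are hypothetical names, and the table contents are only a small hand-picked sample that you would extend for your own data.

```python
import unicodedata

# Hypothetical hand-written table for letters that have no Unicode
# decomposition and would otherwise be dropped. Extend as needed.
EXTRA_MAP = str.maketrans({
    "\u0111": "d", "\u0110": "D",    # đ, Đ
    "\u00f8": "o", "\u00d8": "O",    # ø, Ø
    "\u0142": "l", "\u0141": "L",    # ł, Ł
    "\u00df": "ss",                  # ß
    "\u00e6": "ae", "\u00c6": "AE",  # æ, Æ
})


def to_ascii(text):
    """Transliterate text to ASCII using the table plus NFKD decomposition."""
    # Apply the explicit replacements first, before anything is discarded.
    text = text.translate(EXTRA_MAP)
    # Then decompose the remaining accented letters and drop the marks.
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")


print(to_ascii("\u0110or\u0111e \u0141\u00f3d\u017a"))  # Dorde Lodz
```

Characters that are neither in the table nor decomposable are still silently dropped, so it may be worth logging them while you build up the table.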