Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.
    import unicodedata, re, itertools, sys

    # sys.maxunicode is itself a valid code point, so the range has to include it
    all_chars = (chr(i) for i in range(sys.maxunicode + 1))
    categories = {'Cc'}
    control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
    # or equivalently and much more efficiently
    control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))

    control_char_re = re.compile('[%s]' % re.escape(control_chars))

    def remove_control_chars(s):
        return control_char_re.sub('', s)
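As a quick check of the Python 3 approach, here is a self-contained sketch that applies the compiled character class to a sample string and compares it against a character-by-character baseline. The name `remove_control_chars_loop` and the sample text are made up for illustration, and the timings will vary by machine and Python version, so no numbers are claimed here.

```python
import itertools
import re
import timeit
import unicodedata

# Cc covers U+0000-U+001F, U+007F and U+0080-U+009F
control_chars = ''.join(map(chr, itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0))))
control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

def remove_control_chars_loop(s):
    # Hypothetical baseline: test every character individually
    return ''.join(c for c in s if unicodedata.category(c) != 'Cc')

sample = 'line one\x00\x07\nline two\x9f!'
cleaned = remove_control_chars(sample)
print(repr(cleaned))  # 'line oneline two!'

# Both approaches should agree; only the speed differs
text = sample * 10000
assert remove_control_chars(text) == remove_control_chars_loop(text)
print('regex:', timeit.timeit(lambda: remove_control_chars(text), number=10))
print('loop: ', timeit.timeit(lambda: remove_control_chars_loop(text), number=10))
```

Note that newline and tab are in category Cc too, so they are removed along with everything else. If you need to keep them, drop them from `control_chars` before compiling the pattern.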
For Python 2:
    import unicodedata, re, sys

    # sys.maxunicode is itself a valid code point, so the range has to include it
    all_chars = (unichr(i) for i in xrange(sys.maxunicode + 1))
    categories = {'Cc'}
    control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
    # or equivalently and much more efficiently
    control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))

    control_char_re = re.compile('[%s]' % re.escape(control_chars))

    def remove_control_chars(s):
        return control_char_re.sub('', s)
For some use cases, additional categories (e.g. all of the C* categories, i.e. the "Other" group that Cc belongs to) might be preferable, although this might slow down processing and increase memory usage significantly. Number of characters per category:
Cc (control): 65
Cf (format): 161
Cs (surrogate): 2048
Co (private-use): 137468
Cn (unassigned): 836601
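To reproduce these counts, and to keep the pattern small when larger categories are included, something like the following sketch might help. The helper names `category_counts` and `build_char_class` are hypothetical. The Cf and Cn counts depend on the Unicode version bundled with your Python, so they may not match the figures above. Instead of listing every character, the class is written as `\UXXXXXXXX` ranges, which the `re` module accepts in string patterns.

```python
import re
import sys
import unicodedata
from collections import Counter

def category_counts():
    """Count code points per general category for this Python's Unicode data."""
    return Counter(unicodedata.category(chr(i)) for i in range(sys.maxunicode + 1))

def build_char_class(categories):
    """Return a regex character class matching every code point whose
    category is in `categories`, written as ranges of consecutive code points."""
    parts = []
    start = None
    # Go one past the last code point so a run reaching the end is flushed
    for i in range(sys.maxunicode + 2):
        inside = i <= sys.maxunicode and unicodedata.category(chr(i)) in categories
        if inside and start is None:
            start = i
        elif not inside and start is not None:
            parts.append('\\U%08x-\\U%08x' % (start, i - 1))
            start = None
    return '[%s]' % ''.join(parts)

counts = category_counts()
for cat in ('Cc', 'Cf', 'Cs', 'Co', 'Cn'):
    print(cat, counts[cat])

# Example: strip control and format characters in one pass
pattern = re.compile(build_char_class({'Cc', 'Cf'}))
print(pattern.sub('', 'a\u200bb\x00c'))  # abc
```

Building the class walks the whole code space once, so it is best done at import time and the compiled pattern reused.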
Edit: Adding suggestions from the comments.
With the third-party regex module, which supports POSIX character classes, this can be written as regex.sub(r'[^[:print:]]+', '', text). But of course, there are a lot of alternatives.