115

I used to run

$s =~ s/[^[:print:]]//g; 

in Perl to get rid of non-printable characters.

In Python there's no POSIX regex classes, and I can't write [:print:] having it mean what I want. I know of no way in Python to detect if a character is printable or not.

What would you do?

EDIT: It has to support Unicode characters as well. The string.printable way will happily strip them out of the output. curses.ascii.isprint will return false for any unicode character.

1
  • With the PyPi regex module, it is as easy as regex.sub(r'[^[:print:]]+', '', text). But of course, there are a lot of alternatives. Commented Jul 11, 2023 at 14:06

16 Answers

103

Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.

    import unicodedata, re, itertools, sys

    all_chars = (chr(i) for i in range(sys.maxunicode))
    categories = {'Cc'}
    control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
    # or equivalently and much more efficiently
    control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))

    control_char_re = re.compile('[%s]' % re.escape(control_chars))

    def remove_control_chars(s):
        return control_char_re.sub('', s)

For Python 2:

    import unicodedata, re, sys

    all_chars = (unichr(i) for i in xrange(sys.maxunicode))
    categories = {'Cc'}
    control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
    # or equivalently and much more efficiently
    control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))

    control_char_re = re.compile('[%s]' % re.escape(control_chars))

    def remove_control_chars(s):
        return control_char_re.sub('', s)

For some use cases, additional categories (e.g. all of the "Other" categories, C*) might be preferable, although this may slow down processing and increase memory usage significantly. Number of characters per category:

  • Cc (control): 65
  • Cf (format): 161
  • Cs (surrogate): 2048
  • Co (private-use): 137468
  • Cn (unassigned): 836601
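
The counts above depend on the Unicode version bundled with the interpreter, so it can be useful to re-derive them locally. A minimal sketch, assuming Python 3 (the name category_counts is made up for illustration):

```python
import sys
import unicodedata
from collections import Counter

# Tally the general category of every code point the interpreter supports.
category_counts = Counter(
    unicodedata.category(chr(i)) for i in range(sys.maxunicode + 1)
)

# Print only the "Other" (C*) categories listed above.
for cat in ('Cc', 'Cf', 'Cs', 'Co', 'Cn'):
    print(cat, category_counts[cat])
```

The Cf and Cn figures in particular may differ from the list above on a newer or older Python, because each Unicode release assigns previously unassigned code points.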

Edit: Added suggestions from the comments.


12 Comments

Is 'Cc' enough here? I don't know, I'm just asking -- it seems to me that some of the other 'C' categories may be candidates for this filter as well.
This function, as published, removes half of the Hebrew characters. I get the same effect for both of the methods given.
From performance perspective, wouldn't string.translate() work faster in this case? See stackoverflow.com/questions/265960/…
Use all_chars = (unichr(i) for i in xrange(sys.maxunicode)) to avoid the narrow build error.
For me control_chars == '\x00-\x1f\x7f-\x9f' (tested on Python 3.5.2)
89

As far as I know, the most pythonic/efficient method would be:

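    import string

    # Python 2: filter() on a str returns a str.
    # Python 3: filter() returns an iterator, so wrap the call in ''.join(...).
    filtered_string = filter(lambda x: x in string.printable, myStr)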

Note: string.printable - String of ASCII characters which are considered printable.

Here is an example for Unicode:

    def remove_non_printable(value: str) -> str:
        return ''.join(i for i in value if i.isprintable())
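
To make the difference between the two approaches concrete, here is a small side-by-side sketch. The function names and the sample string are made up for illustration:

```python
import string

def ascii_printable_only(text: str) -> str:
    # Keeps only characters listed in string.printable (ASCII only).
    return ''.join(c for c in text if c in string.printable)

def unicode_printable_only(text: str) -> str:
    # Keeps every character that str.isprintable() accepts.
    return ''.join(c for c in text if c.isprintable())

# "Café", a NUL byte, a space, then a Hebrew word.
sample = 'Caf\u00e9\x00 \u05e9\u05dc\u05d5\u05dd'

print(repr(ascii_printable_only(sample)))    # the accented and Hebrew letters are lost
print(repr(unicode_printable_only(sample)))  # only the NUL byte is removed
```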

12 Comments

You probably want filtered_string = ''.join(filter(lambda x: x in string.printable, myStr)) so that you get back a string.
Sadly string.printable does not contain unicode characters, and thus ü or ó will not be in the output... maybe there is something else?
You should be using a list comprehension or generator expressions, not filter + lambda. One of these will 99.9% of the time be faster. ''.join(s for s in myStr if s in string.printable)
@AaronGallagher: 99.9% faster? From whence do you pluck that figure? The performance comparison is nowhere near that bad.
Hi William. This method seems to remove all non-ASCII characters. There are many printable non-ASCII characters in Unicode!
20

You could try setting up a filter using the unicodedata.category() function:

    import unicodedata

    printable = {'Lu', 'Ll'}

    def filter_non_printable(str):
        return ''.join(c for c in str if unicodedata.category(c) in printable)

See Table 4-9 on page 175 in the Unicode database character properties for the available categories
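
Keeping only Lu and Ll drops digits, punctuation, spaces, and letters from scripts without case. A possible broader whitelist, as a sketch only: it assumes that "printable" means letters, marks, numbers, punctuation, symbols, and space separators, and the names KEEP_MAJOR_CLASSES and keep_visible are hypothetical.

```python
import unicodedata

# Assumption for illustration: keep the major classes L (letters), M (marks),
# N (numbers), P (punctuation), S (symbols), plus the space separators (Zs).
KEEP_MAJOR_CLASSES = set('LMNPS')

def keep_visible(text: str) -> str:
    return ''.join(
        c for c in text
        if unicodedata.category(c)[0] in KEEP_MAJOR_CLASSES
        or unicodedata.category(c) == 'Zs'
    )

print(keep_visible('H\u00e9llo w\u00f6rld 123!\x00\u200b'))
```

Including the M classes keeps combining marks attached to their base letters, which matters for scripts that rely on them.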

12 Comments

you started a list comprehension which did not end in your final line. I suggest you remove the opening bracket completely.
This seems the most direct, straightforward method. Thanks.
it should be printable = set(['Lu', 'Ll']) shouldn't it ?
@CsabaToth All three are valid and yield the same set. Yours is maybe the nicest way to specify a set literal.
@AnubhavJhalani You can add more Unicode categories to the filter. To preserve spaces and digits in addition to letters use printable = {'Lu', 'Ll', 'Zs', 'Nd'}
19

The following will work with Unicode input and is rather fast...

    import sys

    # build a table mapping all non-printable characters to None
    NOPRINT_TRANS_TABLE = {
        i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
    }

    def make_printable(s):
        """Replace non-printable characters in a string."""

        # the translate method on str removes characters
        # that map to None from the string
        return s.translate(NOPRINT_TRANS_TABLE)

    assert make_printable('Café') == 'Café'
    assert make_printable('\x00\x11Hello') == 'Hello'
    assert make_printable('') == ''

My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.
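
For anyone who wants to check the speed claim on their own data, a rough timing harness could look like the sketch below. The sample string and the function names are made up, and the numbers will vary by machine and input, so treat it as a starting point rather than a benchmark.

```python
import sys
import timeit

# Table mapping every non-printable code point to None (same idea as above).
TABLE = {i: None for i in range(sys.maxunicode + 1) if not chr(i).isprintable()}

def strip_with_translate(s):
    return s.translate(TABLE)

def strip_with_join(s):
    return ''.join(c for c in s if c.isprintable())

# Hypothetical sample input; real timings depend on the data and the machine.
sample = 'Caf\u00e9 \x00\x11 text ' * 1000

for fn in (strip_with_translate, strip_with_join):
    seconds = timeit.timeit(lambda: fn(sample), number=20)
    print(fn.__name__, round(seconds, 4))
```

Note that building the table has a one-time cost of scanning the whole code point range, which only pays off when the function is called repeatedly or on long strings.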

2 Comments

This is the only answer that works for me with unicode characters. Awesome that you provided test cases!
If you want to allow for line breaks, add LINE_BREAK_CHARACTERS = set(["\n", "\r"]) and add the clause "and not chr(i) in LINE_BREAK_CHARACTERS" to the condition when building the table.
14

In Python 3,

    def filter_nonprintable(text):
        import itertools
        # Use characters of control category
        nonprintable = itertools.chain(range(0x00,0x20),range(0x7f,0xa0))
        # Use translate to remove all non-printable characters
        return text.translate({character:None for character in nonprintable})

See this StackOverflow post on removing punctuation for how .translate() compares to regex & .replace()

The ranges can be generated via nonprintable = (ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if unicodedata.category(c)=='Cc') using the Unicode character database categories as shown by @Ants Aasma.
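
As a quick sanity check, the hard-coded ranges can be compared with what the Unicode database reports for category Cc. This is a sketch; the variable names scanned and hard_coded are made up for illustration.

```python
import itertools
import sys
import unicodedata

# Code points whose general category is Cc, found by scanning every code point.
scanned = {
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)) == 'Cc'
}

# The two hard-coded ranges used in the function above.
hard_coded = set(itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0)))

print(scanned == hard_coded)
```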

1 Comment

It would be better to use Unicode ranges (see @Ants Aasma's answer). The result would be text.translate({c:None for c in itertools.chain(range(0x00,0x20),range(0x7f,0xa0))}).
8

This function uses a generator expression and str.join, so it runs in linear time instead of the O(n^2) of repeated string concatenation:

    from curses.ascii import isprint

    def printable(input):
        return ''.join(char for char in input if isprint(char))

Comments

7

Yet another option in python 3:

    import re, string
    re.sub(f'[^{re.escape(string.printable)}]', '', my_string)

3 Comments

for some reason this works great on windows but cant use it on linux, i had to change the f for an r but i am not sure that is the solution.
Sounds like your Linux Python was too old to support f-strings then. r-strings are quite different, though you could say r'[^' + re.escape(string.printable) + r']'. (I don't think re.escape() is entirely correct here, but if it works...)
Sadly string.printable does not contain unicode characters, and thus ü or ó will not be in the output...
6

Based on @Ber's answer, I suggest removing only the characters in the "Other" (C*) categories of the Unicode character database, which include the control characters:

    import unicodedata

    def filter_non_printable(s):
        return ''.join(c for c in s if not unicodedata.category(c).startswith('C'))
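
If the per-character category lookup turns out to be a bottleneck, one option is to memoize the decision per character. This is a sketch only; whether it is actually faster is an assumption to verify by measurement, and the names _is_other and filter_non_printable_cached are hypothetical.

```python
import functools
import unicodedata

@functools.lru_cache(maxsize=None)
def _is_other(c: str) -> bool:
    # True for any general category starting with 'C' (Cc, Cf, Cs, Co, Cn).
    return unicodedata.category(c).startswith('C')

def filter_non_printable_cached(s: str) -> str:
    # Each distinct character is classified once; repeats hit the cache.
    return ''.join(c for c in s if not _is_other(c))

print(filter_non_printable_cached('a\u200bb\x00c\ue000'))
```

The cache grows with the number of distinct characters seen, which is usually small for real text.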

2 Comments

You may be on to something with startswith('C') but this was far less performant in my testing than any other solution.
big-mclargehuge: The goal of my solution was the combination of completeness and simplicity/readability. You could try to use if unicodedata.category(c)[0] != 'C' instead. Does it perform better? If you prefer execution speed over memory requirements, one can pre-compute the table as shown in stackoverflow.com/a/93029/3779655
5

An elegant, Pythonic way to strip non-printable characters from a string in Python is to use the str.isprintable() method together with a generator expression or a list comprehension, depending on the use case (i.e. the size of the string):

 ''.join(c for c in my_string if c.isprintable()) 

str.isprintable() Return True if all characters in the string are printable or the string is empty, False otherwise. Nonprintable characters are those characters defined in the Unicode character database as “Other” or “Separator”, excepting the ASCII space (0x20) which is considered printable. (Note that printable characters in this context are those which should not be escaped when repr() is invoked on a string. It has no bearing on the handling of strings written to sys.stdout or sys.stderr.)
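
One consequence of the definition quoted above is that newlines and tabs count as non-printable, so a plain isprintable() filter also joins lines together. A sketch of a variant that keeps them, assuming tab and newline are the only whitespace worth preserving (KEEP_WHITESPACE and strip_keep_newlines are made-up names):

```python
# Assumption: tab and newline should survive even though isprintable() rejects them.
KEEP_WHITESPACE = {'\t', '\n'}

def strip_keep_newlines(text: str) -> str:
    return ''.join(c for c in text if c.isprintable() or c in KEEP_WHITESPACE)

print(' '.isprintable())       # True: the ASCII space is the documented exception
print('\n'.isprintable())      # False
print('\u00a0'.isprintable())  # False: a separator other than the ASCII space
print(repr(strip_keep_newlines('a\tb\nc\x00d')))
```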

Comments

4

The best I've come up with now is (thanks to the python-izers above)

    def filter_non_printable(str):
        return ''.join([c for c in str if ord(c) > 31 or ord(c) == 9])

This is the only way I've found out that works with Unicode characters/strings
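
The check above still lets DEL (0x7f) and the C1 control characters (0x80 to 0x9f) through. A sketch that also excludes that range while keeping tab and all other Unicode characters (the name filter_control_chars is made up for illustration):

```python
def filter_control_chars(text: str) -> str:
    # Drop C0 controls (except tab), DEL, and the C1 controls 0x80-0x9f.
    return ''.join(
        c for c in text
        if c == '\t' or (ord(c) > 31 and not 0x7f <= ord(c) <= 0x9f)
    )

print(repr(filter_control_chars('a\x7fb\x85c\td\u00e9\x00')))
```

This only covers category Cc; format characters such as zero-width spaces are not touched.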

Any better options?

4 Comments

Unless you're on python 2.3, the inner []s are redundant. "return ''.join(c for c ...)"
Not quite redundant—they have different meanings (and performance characteristics), though the end result is the same.
Should the other end of the range not be protected too?: "ord(c) <= 126"
But there are Unicode characters which are not printable, too.
3

In Python there's no POSIX regex classes

There are when using the regex library: https://pypi.org/project/regex/

It is well maintained and supports Unicode regex, Posix regex and many more. The usage (method signatures) is very similar to Python's re.

From the documentation:

[[:alpha:]]; [[:^alpha:]]

POSIX character classes are supported. These are normally treated as an alternative form of \p{...}.

(I'm not affiliated, just a user.)

Comments

2

The one below performs faster than the others above. Take a look

''.join([x if x in string.printable else '' for x in Str]) 

1 Comment

"".join([c if 0x21<=ord(c) and ord(c)<=0x7e else "" for c in ss])
2

Adapted from answers by Ants Aasma and shawnrad:

    nonprintable = set(map(chr, list(range(0,32)) + list(range(127,160))))
    ord_dict = {ord(character):None for character in nonprintable}

    def filter_nonprintable(text):
        return text.translate(ord_dict)

    #use
    str = "this is my string"
    str = filter_nonprintable(str)
    print(str)

tested on Python 3.7.7

Comments

1

To remove 'whitespace',

    import re

    t = """ \n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p> """
    pat = re.compile(r'[\t\n]')
    print(pat.sub("", t))

1 Comment

Actually you don't need the square brackets either then.
1
  1. Error description: running copied-and-pasted Python code reports:

Python invalid non-printable character U+00A0

  2. Cause of the error: the space in the copied code is not the ordinary space character that Python expects; U+00A0 is a non-breaking space.

  3. Solution: delete the abnormal space and retype it with an ordinary space, then run the code again.

Source : Python invalid non-printable character U+00A0
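
When there are many such spaces, retyping each one by hand gets tedious. A small sketch that locates and replaces them in pasted source text (the helper names find_nbsp and replace_nbsp and the sample snippet are made up for illustration):

```python
NBSP = '\u00a0'

def find_nbsp(source: str):
    # Return (line_number, column) pairs, both 1-based, for each U+00A0.
    return [
        (line_no, col)
        for line_no, line in enumerate(source.splitlines(), start=1)
        for col, ch in enumerate(line, start=1)
        if ch == NBSP
    ]

def replace_nbsp(source: str) -> str:
    # Swap every non-breaking space for an ordinary ASCII space.
    return source.replace(NBSP, ' ')

pasted = 'x\u00a0= 1\nprint(x)'
print(find_nbsp(pasted))
print(replace_nbsp(pasted))
```

Be careful when the source contains string literals that are meant to include a non-breaking space, since a blanket replace changes those too.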

Comments

0

I used this:

    import sys
    import unicodedata

    # the test string has embedded characters, \u2069 \u2068
    test_string = """"ABC⁩.⁨ 6", "}"""

    nonprintable = list((ord(c) for c in (chr(i) for i in range(sys.maxunicode))
                         if unicodedata.category(c) in ['Cc','Cf']))
    translate_dict = {character: None for character in nonprintable}

    print("Before translate, using repr()", repr(test_string))
    print("After translate, using repr()", repr(test_string.translate(translate_dict)))

Comments
