Stripping non printable characters from a string in python

Question

I used to run

$s =~ s/[^[:print:]]//g;

in Perl to get rid of non-printable characters.

In Python there's no POSIX regex classes, and I can't write [:print:] having it mean what I want. I know of no way in Python to detect if a character is printable or not.

What would you do?

EDIT: It has to support Unicode characters as well. The string.printable way will happily strip them out of the output. curses.ascii.isprint will return false for any unicode character.

With the PyPI regex module, it is as easy as regex.sub(r'[^[:print:]]+', '', text). But of course, there are a lot of alternatives.
– Wiktor Stribiżew, Commented Jul 11, 2023 at 14:06

darkdragon · Accepted Answer · 2020-06-24 21:54:34Z

Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.

    import unicodedata, re, itertools, sys

    all_chars = (chr(i) for i in range(sys.maxunicode))
    categories = {'Cc'}
    control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
    # or equivalently and much more efficiently
    control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))

    control_char_re = re.compile('[%s]' % re.escape(control_chars))

    def remove_control_chars(s):
        return control_char_re.sub('', s)

For Python 2:

    import unicodedata, re, sys

    all_chars = (unichr(i) for i in xrange(sys.maxunicode))
    categories = {'Cc'}
    control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
    # or equivalently and much more efficiently
    control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))

    control_char_re = re.compile('[%s]' % re.escape(control_chars))

    def remove_control_chars(s):
        return control_char_re.sub('', s)

For some use cases, additional categories (e.g. all from the control group) might be preferable, although this might slow down processing and increase memory usage significantly. Number of characters per category:

Cc (control): 65
Cf (format): 161
Cs (surrogate): 2048
Co (private-use): 137468
Cn (unassigned): 836601
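The counts above depend on the Unicode version bundled with the interpreter, so the Cf and Cn figures in particular may differ on your machine. As a rough sketch (the function name count_other_categories is only illustrative), they could be recomputed like this:

```python
import sys
import unicodedata
from collections import Counter


def count_other_categories():
    """Count code points per general category in the 'C' (Other) group."""
    counts = Counter()
    # sys.maxunicode is itself a valid code point, hence the + 1.
    for codepoint in range(sys.maxunicode + 1):
        category = unicodedata.category(chr(codepoint))
        if category.startswith('C'):
            counts[category] += 1
    return counts


counts = count_other_categories()
for category in sorted(counts):
    print(category, counts[category])
```

The loop visits every code point once, so it takes a moment to run; it is meant for a one-off check rather than for use inside a filter.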

Edit: adding suggestions from the comments.

Is 'Cc' enough here? I don't know, I'm just asking -- it seems to me that some of the other 'C' categories may be candidates for this filter as well.
This function, as published, removes half of the Hebrew characters. I get the same effect for both of the methods given.
From performance perspective, wouldn't string.translate() work faster in this case? See stackoverflow.com/questions/265960/…
Use all_chars = (unichr(i) for i in xrange(sys.maxunicode)) to avoid the narrow build error.
For me control_chars == '\x00-\x1f\x7f-\x9f' (tested on Python 3.5.2)
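As the last comment notes, category Cc boils down to two contiguous ranges, so the character class can also be written out by hand. A minimal sketch, assuming only Cc characters should be removed (the constant name is illustrative, not from the answer):

```python
import re

# U+0000-U+001F and U+007F-U+009F are exactly the 65 'Cc' code points.
CONTROL_CHAR_RE = re.compile(r'[\x00-\x1f\x7f-\x9f]')


def remove_control_chars(s):
    """Return s with all Cc (control) characters removed."""
    return CONTROL_CHAR_RE.sub('', s)


print(repr(remove_control_chars('a\x00b\x9fc\tü')))
```

Note that tab, newline and carriage return are control characters too, so this variant removes them as well.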

Vladimir · Accepted Answer · 2024-10-04 04:29:40Z

89

As far as I know, the most pythonic/efficient method would be:

    import string
    filtered_string = filter(lambda x: x in string.printable, myStr)

Note: string.printable - String of ASCII characters which are considered printable.

Here is an example for Unicode:

    def remove_non_printable(value: str) -> str:
        return ''.join(i for i in value if i.isprintable())

edited Oct 4, 2024 at 4:29 by Vladimir; answered Sep 18, 2008 at 13:23 by William Keller

12 Comments

Nathan Shively-Sanders Over a year ago

You probably want filtered_string = ''.join(filter(lambda x:x in string.printable, myStr)) so that you get back a string.

Vinko Vrsalovic Over a year ago

Sadly string.printable does not contain unicode characters, and thus ü or ó will not be in the output... maybe there is something else?

habnabit Over a year ago

You should be using a list comprehension or generator expressions, not filter + lambda. One of these will 99.9% of the time be faster. ''.join(s for s in myStr if s in string.printable)

Chris Morgan Over a year ago

@AaronGallagher: 99.9% faster? From whence do you pluck that figure? The performance comparison is nowhere near that bad.

dotancohen Over a year ago

Hi William. This method seems to remove all non-ASCII characters. There are many printable non-ASCII characters in Unicode!
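To make the difference raised in these comments concrete, here is a small side-by-side sketch of the two filters from the answer. The function names are made up for the comparison:

```python
import string


def keep_ascii_printable(text):
    """Keep only characters listed in string.printable (ASCII only)."""
    return ''.join(c for c in text if c in string.printable)


def keep_unicode_printable(text):
    """Keep only characters for which str.isprintable() is true."""
    return ''.join(c for c in text if c.isprintable())


sample = 'Zür\x00ich\n'
print(repr(keep_ascii_printable(sample)))    # 'Zrich\n'
print(repr(keep_unicode_printable(sample)))  # 'Zürich'
```

The two also disagree about whitespace: string.printable includes tab, newline and the other ASCII whitespace characters, while isprintable() rejects all of them except the plain space.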


Ber · Accepted Answer · 2019-06-04 09:57:40Z

20

You could try setting up a filter using the unicodedata.category() function:

    import unicodedata

    printable = {'Lu', 'Ll'}

    def filter_non_printable(str):
        return ''.join(c for c in str if unicodedata.category(c) in printable)

See Table 4-9 on page 175 in the Unicode database character properties for the available categories.

edited Jun 4, 2019 at 9:57; answered Sep 18, 2008 at 15:25 by Ber

12 Comments

tzot Over a year ago

you started a list comprehension which did not end in your final line. I suggest you remove the opening bracket completely.

dotancohen Over a year ago

This seems the most direct, straightforward method. Thanks.

zillo Over a year ago

it should be printable = set(['Lu', 'Ll']) shouldn't it ?

Ber Over a year ago

@CsabaToth All three are valid and yield the same set. Yours is maybe the nicest way to specify a set literal.

Ber Over a year ago

@AnubhavJhalani You can add more Unicode categories to the filter. To preserve spaces and digits in addition to letters use printable = {'Lu', 'Ll', 'Zs', 'Nd'}
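A short sketch of the extended whitelist from the last comment, showing what survives. The names KEEP_CATEGORIES and keep_categories are illustrative, and the parameter is called text here to avoid shadowing the built-in str:

```python
import unicodedata

# Uppercase letters, lowercase letters, space separators, decimal digits.
KEEP_CATEGORIES = {'Lu', 'Ll', 'Zs', 'Nd'}


def keep_categories(text, categories=KEEP_CATEGORIES):
    """Keep only characters whose general category is whitelisted."""
    return ''.join(c for c in text if unicodedata.category(c) in categories)


print(repr(keep_categories('Hello, World 42!\n')))  # 'Hello World 42'
```

Because this is a whitelist, punctuation (categories starting with 'P') and symbols ('S') are dropped along with the control characters, which may or may not be what you want.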


ChrisP · Accepted Answer · 2019-01-31 00:58:09Z

The following will work with Unicode input and is rather fast...

    import sys

    # build a table mapping all non-printable characters to None
    NOPRINT_TRANS_TABLE = {
        i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
    }

    def make_printable(s):
        """Replace non-printable characters in a string."""

        # the translate method on str removes characters
        # that map to None from the string
        return s.translate(NOPRINT_TRANS_TABLE)


    assert make_printable('Café') == 'Café'
    assert make_printable('\x00\x11Hello') == 'Hello'
    assert make_printable('') == ''

My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.

This is the only answer that works for me with unicode characters. Awesome that you provided test cases!
If you want to allow for line breaks, add LINE_BREAK_CHARACTERS = set(["\n", "\r"]) and append the condition "and not chr(i) in LINE_BREAK_CHARACTERS" when building the table.
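Putting the line-break suggestion from the comment together with the table from the answer might look like the sketch below. The names NOPRINT_KEEP_BREAKS and make_printable_keep_breaks are made up for this example:

```python
import sys

LINE_BREAK_CHARACTERS = {'\n', '\r'}

# Map every non-printable code point to None, except the line breaks
# that should survive the filter.
NOPRINT_KEEP_BREAKS = {
    i: None
    for i in range(sys.maxunicode + 1)
    if not chr(i).isprintable() and chr(i) not in LINE_BREAK_CHARACTERS
}


def make_printable_keep_breaks(s):
    """Remove non-printable characters but keep newlines and carriage returns."""
    return s.translate(NOPRINT_KEEP_BREAKS)


print(repr(make_printable_keep_breaks('a\x00b\nc\r\n\td')))  # 'ab\nc\r\nd'
```

Tabs are still removed in this version; add '\t' to LINE_BREAK_CHARACTERS if they should be kept. Building the table scans every code point once, so it is best done at import time rather than per call.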

darkdragon · Accepted Answer · 2020-06-24 21:59:58Z

In Python 3,

    def filter_nonprintable(text):
        import itertools
        # Use characters of control category
        nonprintable = itertools.chain(range(0x00,0x20),range(0x7f,0xa0))
        # Use translate to remove all non-printable characters
        return text.translate({character:None for character in nonprintable})

See this StackOverflow post on removing punctuation for how .translate() compares to regex & .replace()

The ranges can be generated via nonprintable = (ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if unicodedata.category(c)=='Cc') using the Unicode character database categories as shown by @Ants Aasma.

It would be better to use Unicode ranges (see @Ants Aasma's answer). The result would be text.translate({c:None for c in itertools.chain(range(0x00,0x20),range(0x7f,0xa0))}).
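Several answers in this thread make speed claims about translate(), regular expressions and str.join. Rather than taking them on trust, they can be measured locally. The sketch below is one way to do that with timeit; the function names and the sample string are assumptions made for the benchmark, and the numbers will vary by machine and Python version:

```python
import itertools
import re
import timeit

CONTROL_CODEPOINTS = list(itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0)))
TABLE = dict.fromkeys(CONTROL_CODEPOINTS)            # code point -> None
CONTROL_SET = frozenset(map(chr, CONTROL_CODEPOINTS))
CONTROL_RE = re.compile(r'[\x00-\x1f\x7f-\x9f]')


def with_translate(text):
    return text.translate(TABLE)


def with_regex(text):
    return CONTROL_RE.sub('', text)


def with_join(text):
    return ''.join(c for c in text if c not in CONTROL_SET)


sample = 'caf\u00e9\x00 \x1fdata\x7f\n' * 1000

for func in (with_translate, with_regex, with_join):
    seconds = timeit.timeit(lambda: func(sample), number=100)
    print('%-15s %.4f s' % (func.__name__, seconds))
```

All three functions remove the same 65 control characters, so any difference in the printed timings reflects the technique rather than the result.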

rmmh · Accepted Answer · 2012-01-14 03:52:05Z

This function uses a generator expression and str.join, so it runs in linear time instead of O(n^2):

    from curses.ascii import isprint

    def printable(input):
        return ''.join(char for char in input if isprint(char))

Alex Myers · Accepted Answer · 2018-09-27 17:22:36Z

7

Yet another option in python 3:

re.sub(f'[^{re.escape(string.printable)}]', '', my_string)

edited Sep 27, 2018 at 17:22 by Alex Myers; answered Sep 27, 2018 at 15:16 by c6401

3 Comments

Chop Labalagun Over a year ago

For some reason this works great on Windows but I can't use it on Linux; I had to change the f to an r, but I am not sure that is the solution.

tripleee Over a year ago

Sounds like your Linux Python was too old to support f-strings then. r-strings are quite different, though you could say r'[^' + re.escape(string.printable) + r']'. (I don't think re.escape() is entirely correct here, but if it works...)

the_economist Over a year ago

Sadly string.printable does not contain unicode characters, and thus ü or ó will not be in the output...

darkdragon · Accepted Answer · 2020-06-24 07:09:57Z

6

Based on @Ber's answer, I suggest removing only control characters as defined in the Unicode character database categories:

    import unicodedata

    def filter_non_printable(s):
        return ''.join(c for c in s if not unicodedata.category(c).startswith('C'))

edited Jun 24, 2020 at 7:09; answered Jun 23, 2020 at 8:25 by darkdragon

2 Comments

Big McLargeHuge Over a year ago

You may be on to something with startswith('C') but this was far less performant in my testing than any other solution.

darkdragon Over a year ago

@Big McLargeHuge: The goal of my solution was the combination of completeness and simplicity/readability. You could try to use if unicodedata.category(c)[0] != 'C' instead. Does it perform better? If you prefer execution speed over memory requirements, you can pre-compute the table as shown in stackoverflow.com/a/93029/3779655

Thomas Juul Dyhr · Accepted Answer · 2022-02-13 23:30:11Z

An elegant, Pythonic way to strip non-printable characters from a string is to use the isprintable() string method together with a generator expression or a list comprehension, depending on the use case (i.e. the size of the string):

 ''.join(c for c in my_string if c.isprintable())

str.isprintable() Return True if all characters in the string are printable or the string is empty, False otherwise. Nonprintable characters are those characters defined in the Unicode character database as “Other” or “Separator”, excepting the ASCII space (0x20) which is considered printable. (Note that printable characters in this context are those which should not be escaped when repr() is invoked on a string. It has no bearing on the handling of strings written to sys.stdout or sys.stderr.)
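One consequence of the definition quoted above is that tabs and newlines count as non-printable, so the one-liner also flattens multi-line text. A hedged variant with an explicit allow-list is sketched below; the function name and the keep parameter are assumptions for illustration:

```python
def strip_non_printable(text, keep='\n\t'):
    """Drop non-printable characters, except those listed in keep."""
    return ''.join(c for c in text if c.isprintable() or c in keep)


print(repr(strip_non_printable('a\tb\nc\x00d')))      # 'a\tb\ncd'
print(repr(strip_non_printable('a\tb\nc', keep='')))  # 'abc'
```

Passing keep='' reproduces the behaviour of the original one-liner.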

Vinko Vrsalovic · Accepted Answer · 2008-09-18 13:47:28Z

4

The best I've come up with now is (thanks to the python-izers above)

    def filter_non_printable(str):
        return ''.join([c for c in str if ord(c) > 31 or ord(c) == 9])

This is the only way I've found out that works with Unicode characters/strings

Any better options?

edited Sep 18, 2008 at 13:47; answered Sep 18, 2008 at 13:17 by Vinko Vrsalovic

4 Comments

habnabit Over a year ago

Unless you're on python 2.3, the inner []s are redundant. "return ''.join(c for c ...)"

Miles Over a year ago

Not quite redundant—they have different meanings (and performance characteristics), though the end result is the same.

Gearoid Murphy Over a year ago

Should the other end of the range not be protected too?: "ord(c) <= 126"

tripleee Over a year ago

But there are Unicode characters which are not printable, too.
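The last two comments can be checked directly. The sketch below applies the same ord() test as the answer and then lists the characters that pass it even though str.isprintable() rejects them; the helper names are made up for the demonstration:

```python
def filter_by_ord(text):
    # Same test as in the answer: drop code points 0-31, except tab (9).
    return ''.join(c for c in text if ord(c) > 31 or ord(c) == 9)


def survivors(text):
    """Characters that pass the ord() test but are not printable."""
    return [c for c in filter_by_ord(text) if not c.isprintable()]


sample = 'a\tb\x00c\x7fd\u200be'
print(repr(filter_by_ord(sample)))               # 'a\tbc\x7fd\u200be'
print([hex(ord(c)) for c in survivors(sample)])  # ['0x9', '0x7f', '0x200b']
```

The tab is kept on purpose, but DEL (U+007F) and the zero-width space (U+200B) slip through as well, which is the gap the comments point out.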

Risadinha · Accepted Answer · 2018-07-05 07:04:04Z

In Python there's no POSIX regex classes

There are when using the regex library: https://pypi.org/project/regex/

It is well maintained and supports Unicode regex, POSIX regex and many more. The usage (method signatures) is very similar to Python's re.

From the documentation:

[[:alpha:]]; [[:^alpha:]]

POSIX character classes are supported. These are normally treated as an alternative form of \p{...}.

(I'm not affiliated, just a user.)

Nilav Baran Ghosh · Accepted Answer · 2018-01-07 02:45:48Z

2

The one below performs faster than the others above. Take a look

''.join([x if x in string.printable else '' for x in Str])

edited Jan 7, 2018 at 2:45; answered Jan 7, 2018 at 2:13 by Nilav Baran Ghosh

1 Comment

evandrix Over a year ago

"".join([c if 0x21<=ord(c) and ord(c)<=0x7e else "" for c in ss])

Joe · Accepted Answer · 2020-06-17 19:42:17Z

Adapted from answers by Ants Aasma and shawnrad:

    nonprintable = set(map(chr, list(range(0,32)) + list(range(127,160))))
    ord_dict = {ord(character):None for character in nonprintable}
    def filter_nonprintable(text):
        return text.translate(ord_dict)

    #use
    str = "this is my string"
    str = filter_nonprintable(str)
    print(str)

tested on Python 3.7.7

knowingpark · Accepted Answer · 2017-09-11 05:22:02Z

To remove 'whitespace',

    import re
    t = """ \n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p> """
    pat = re.compile(r'[\t\n]')
    print(pat.sub("", t))

thrinadhn · Accepted Answer · 2022-07-06 07:05:49Z

Error description: running copied-and-pasted Python code reports:

Python invalid non-printable character U+00A0

Cause of the error: the space in the copied code is not the ordinary space character that Python expects (U+00A0 is a no-break space).
Solution: delete the space and type it again. For example, the red part in the above picture is an abnormal space; delete and re-enter it, then run the code.

Source: Python invalid non-printable character U+00A0
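When the pasted snippet is long, retyping each space by hand gets tedious. A rough sketch of doing the same thing programmatically follows; the helper names are invented for this example, and the replacement is applied blindly, so it would also change any U+00A0 that appears inside a string literal on purpose:

```python
NO_BREAK_SPACE = '\u00a0'


def find_no_break_spaces(source):
    """Return 1-based (line, column) pairs for every U+00A0 in source."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col_no, char in enumerate(line, start=1):
            if char == NO_BREAK_SPACE:
                hits.append((line_no, col_no))
    return hits


def replace_no_break_spaces(source):
    """Replace every U+00A0 with an ordinary space."""
    return source.replace(NO_BREAK_SPACE, ' ')


pasted = 'x\u00a0= 1\nif x:\n\u00a0\u00a0\u00a0\u00a0print(x)\n'
print(find_no_break_spaces(pasted))
print(replace_no_break_spaces(pasted))
```

Listing the positions first makes it possible to review what will change before the replacement is applied.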

Tim Richardson · Accepted Answer · 2022-11-24 04:20:17Z

I used this:

    import sys
    import unicodedata

    # the test string has embedded characters, \u2069 \u2068
    test_string = """"ABC⁩.⁨ 6", "}"""
    nonprintable = list((ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if unicodedata.category(c) in ['Cc','Cf']))

    translate_dict = {character: None for character in nonprintable}

    print("Before translate, using repr()", repr(test_string))
    print("After translate, using repr()", repr(test_string.translate(translate_dict)))
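Before deleting invisible characters like the ones embedded in the test string, it can help to see exactly what they are. The following sketch lists them with their Unicode names; describe_hidden and the '<unnamed>' placeholder are assumptions introduced for this example:

```python
import unicodedata


def describe_hidden(text, categories=('Cc', 'Cf')):
    """List (code point, name, category) for characters in the given categories."""
    found = []
    for char in text:
        category = unicodedata.category(char)
        if category in categories:
            # Control characters have no name in the database, hence the default.
            name = unicodedata.name(char, '<unnamed>')
            found.append(('U+%04X' % ord(char), name, category))
    return found


for entry in describe_hidden('ABC\u2069.\u2068 6'):
    print(entry)
```

For the sample above this reports U+2069 and U+2068, both bidirectional isolate controls in category Cf, which is why a filter limited to Cc leaves them in place.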
