Removing hexadecimal characters from a unicode object

Question

I am trying to remove the hexadecimal characters \xef\xbb\xbf from my string, but I am getting the following error and am not sure how to resolve it.

    >>> x = u'\xef\xbb\xbfHello'
    >>> x
    u'\xef\xbb\xbfHello'
    >>> type(x)
    <type 'unicode'>
    >>> print x
    ï»¿Hello
    >>> print x.replace('\xef\xbb\xbf', '')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

Sebastian Wozny · Accepted Answer · 2016-11-09 12:29:46Z

You need to pass unicode objects to replace. Otherwise Python 2 will attempt to decode the str argument with the ascii codec so that it can search for it in x, and that implicit decoding fails on the byte 0xef.

    >>> x = u'\xef\xbb\xbfHello'
    >>> x
    u'\xef\xbb\xbfHello'
    >>> print(x.replace(u'\xef\xbb\xbf',u''))
    Hello

This only applies to Python 2. In Python 3 both versions work, because an unprefixed string literal is already a unicode string.
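As a quick check of the Python 3 claim, here is a minimal sketch, assuming a Python 3 interpreter (3.3 or later, where the u'' prefix is accepted again). The variable names are illustrative only.

```python
# Python 3: '...' and u'...' are both str (unicode), so no implicit
# ascii decoding takes place and both calls behave the same way.
x = u'\xef\xbb\xbfHello'

cleaned_plain = x.replace('\xef\xbb\xbf', '')       # unprefixed literals
cleaned_prefixed = x.replace(u'\xef\xbb\xbf', u'')  # u-prefixed literals

print(cleaned_plain)
print(cleaned_plain == cleaned_prefixed)
```

Under Python 2 the first replace call would raise the UnicodeDecodeError shown in the question, which is why the u'' prefix matters there.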

Community · Accepted Answer · 2017-05-23 12:13:26Z

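Try using either the decode method or the unicode function. Note that both apply to a byte string (str); calling decode on an object that is already unicode makes Python 2 encode it with the ascii codec first, which fails for non-ASCII characters: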

x.decode('utf-8')

or

unicode(string, 'utf-8')

Source: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1
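To illustrate the decoding approach, here is a small sketch that assumes the data is still available as raw bytes (the name raw and its contents are made up for the example). It should run on both Python 2 and Python 3.

```python
# Assumed input: undecoded bytes that begin with the UTF-8 BOM.
raw = b'\xef\xbb\xbfHello'

# Decoding as UTF-8 turns the three BOM bytes into one character, U+FEFF.
text = raw.decode('utf-8')

# The BOM is then a single code point that can be stripped from the front.
stripped = text.lstrip(u'\ufeff')

print(stripped)
```

Decoding the bytes with the right codec up front means there are no stray \xef\xbb\xbf characters to remove afterwards.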

Mark Tolonen · Accepted Answer · 2016-11-09 16:32:16Z

The real problem is that your Unicode string was decoded incorrectly in the first place. Those characters are the UTF-8 encoding of a byte order mark (BOM), U+FEFF, mis-decoded as (likely) latin-1 or cp1252.

Ideally, fix how they were decoded, but you can reverse the error by re-encoding as latin1 and decoding correctly:

    >>> x = u'\xef\xbb\xbfHello'
    >>> x.encode('latin1').decode('utf8')  # decode correctly, U+FEFF is a BOM.
    u'\ufeffHello'
    >>> x.encode('latin1').decode('utf-8-sig')  # decode and handle BOM.
    u'Hello'
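Fixing the decoding at the source could look like the sketch below. It assumes the text came from a file; the question does not say where the string originated, so the temporary file here is purely hypothetical.

```python
import io
import os
import tempfile

# Hypothetical sample file: UTF-8 text preceded by a BOM.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'\xef\xbb\xbfHello')

# Wrong codec: each BOM byte becomes its own character, as in the question.
with io.open(path, encoding='latin-1') as f:
    wrong = f.read()

# 'utf-8-sig' decodes UTF-8 and drops a leading BOM if one is present.
with io.open(path, encoding='utf-8-sig') as f:
    right = f.read()

os.remove(path)  # clean up the sample file

print(repr(wrong))
print(repr(right))
```

io.open accepts an encoding argument on both Python 2 and Python 3, so the same read works on either version.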
