removing characters like '\u0152\xe6' from string

Question

I am trying to convert a string so that it contains only English characters, numbers and punctuation, but I am getting an encoding/decoding error.

The original string is: "DD-XBS 2 1/2x 17 LCLŒæ 3-pack"

The code I wrote to tackle this issue is:

try:
    each = str(each.decode('ascii'))
except UnicodeDecodeError:
    each = str(each.decode('utf-8').encode('ascii', errors='ignore'))

but I am getting an error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8c in position 16: invalid start byte

How can I solve this?

I tried your code in IPython and did not get any error.

– pnv, Commented Oct 17, 2017 at 13:16
@pnv Python 2.7 or 3.x?

– bazinga, Commented Oct 17, 2017 at 13:48
I used Python 2.7.

– pnv, Commented Oct 17, 2017 at 13:50

MaximTitarenko · Accepted Answer · 2017-10-17 16:57:36Z

Judging from your question, I assume that you use Python 2.7.

The error has the following cause:

Your source code is not in UTF-8; it is almost certainly in cp1252.
In cp1252 the 'Œ' character is the byte '\x8c', and that byte is not a valid start byte in UTF-8.
In the 'except' branch you decode the string as UTF-8, which fails on that byte.

To see this, look at the following:

>>> u = '\x8c'.decode('cp1252')
>>> u
u'\u0152'

So, decoding the '\x8c' byte with cp1252 gives a Unicode code point, whose name is:

>>> import unicodedata
>>> unicodedata.name(u)
'LATIN CAPITAL LIGATURE OE'

However, if we try to decode with UTF-8, we'll get an error:

>>> u = '\x8c'.decode('utf-8')
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8c ...

So a lone '\x8c' byte is incompatible with UTF-8: it can only appear as a continuation byte, never as the first byte of a character.
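
To make the "invalid start byte" message more concrete, here is a rough sketch that classifies a single byte value by its high bits, following the UTF-8 layout. The helper name utf8_byte_role is made up for this illustration; it is not part of any library.

```python
def utf8_byte_role(byte_value):
    """Classify one byte value (0-255) by the role it can play in UTF-8."""
    if byte_value < 0x80:
        return 'ascii'          # 0xxxxxxx: a complete one-byte character
    if byte_value < 0xC0:
        return 'continuation'   # 10xxxxxx: may only follow a lead byte
    if byte_value in (0xC0, 0xC1) or byte_value > 0xF4:
        return 'invalid'        # never appears in well-formed UTF-8
    return 'lead'               # starts a multi-byte sequence


print(utf8_byte_role(0x8C))  # continuation
print(utf8_byte_role(0xE6))  # lead
```

The byte 0x8c falls in the continuation range, so a decoder that meets it where a new character should begin has to reject it.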

To fix the problem, you can decode with the actual encoding:

each = str(each.decode('cp1252').encode('ascii', errors='ignore'))

Or tell the UTF-8 decoder to skip the bytes it cannot decode:

each = str(each.decode('utf-8', errors='ignore').encode('ascii', errors='ignore'))
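
If this is needed in more than one place, the decode-then-encode step could be wrapped in a small helper. This is only a sketch: the name to_ascii and the cp1252 default are my assumptions, and the byte string below is the question's sample re-typed as cp1252 bytes. It is written so that it should run on both Python 2.7 and Python 3.

```python
def to_ascii(raw, encoding='cp1252'):
    """Decode raw bytes with the given encoding, then drop every non-ASCII character."""
    text = raw.decode(encoding)                        # bytes -> unicode text
    ascii_bytes = text.encode('ascii', 'ignore')       # silently skip anything above 127
    return ascii_bytes.decode('ascii')


# The sample string from the question, assuming it was stored as cp1252 bytes.
raw = b'DD-XBS 2 1/2x 17 LCL\x8c\xe6 3-pack'
print(to_ascii(raw))  # DD-XBS 2 1/2x 17 LCL 3-pack
```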

In your case you can also filter the characters with ord():

my_str = 'DD-XBS 2 1/2x 17 LCLŒæ 3-pack'
ascii_str = ''
for sign in my_str:
    # keep only characters in the ASCII range
    if ord(sign) < 128:
        ascii_str += sign
print(ascii_str)  # DD-XBS 2 1/2x 17 LCL 3-pack

But possibly the best solution is to convert your source file to UTF-8.
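
Most editors can re-save a file as UTF-8, but it can also be done with a few lines of code. The following is a minimal sketch, assuming the file really is cp1252; the function name reencode_file and the file names in the usage comment are hypothetical. Note that in Python 2 a UTF-8 source file containing non-ASCII characters also needs a coding declaration such as # -*- coding: utf-8 -*- on its first or second line.

```python
import io


def reencode_file(src_path, dst_path, src_encoding='cp1252'):
    """Read src_path as src_encoding and write the same text to dst_path as UTF-8."""
    # newline='' leaves the original line endings untouched in both directions
    with io.open(src_path, 'r', encoding=src_encoding, newline='') as src:
        text = src.read()
    with io.open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        dst.write(text)


# Hypothetical usage:
# reencode_file('script_cp1252.py', 'script_utf8.py')
```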

Just one question: how did you identify the encoding as 'cp1252'? Is there any method available for this?
From the error report and the look of the 'Œ' char, I guessed that the actual encoding is one of the most common Western European encodings. Then I checked what the '\x8c' byte looks like after decoding with cp1252, latin_1 and mac_roman, using print('\x8c'.decode('cp1252')) and so on. Only the cp1252-decoded result matched. So, to detect the right encoding when we have a char's appearance and its byte value, I suggest making a list of possible encodings and printing the decoded byte in a loop over that list. Also, if we somehow know the Unicode code point of the char, we can skip the print and break the loop on the first match.
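
The loop described in the comment above might look roughly like this. The function name matching_encodings and the candidate list are assumptions for illustration, and this variant collects every match instead of breaking on the first one.

```python
def matching_encodings(raw_byte, expected_char, candidates):
    """Return the candidate encodings that decode raw_byte to expected_char."""
    matches = []
    for name in candidates:
        try:
            decoded = raw_byte.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the byte at all
        if decoded == expected_char:
            matches.append(name)
    return matches


candidates = ['cp1252', 'latin_1', 'mac_roman', 'utf-8']
print(matching_encodings(b'\x8c', u'\u0152', candidates))  # ['cp1252']
```

A single byte is weak evidence on its own, so it helps to repeat the check with other non-ASCII bytes from the same data before settling on an encoding.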
