
I am trying to convert a string so that it contains only English characters, numbers and punctuation, but I am running into an encoding/decoding error.

The original string is: "DD-XBS 2 1/2x 17 LCLξ 3-pack"

The code I wrote to tackle this issue is:

try:
    each = str(each.decode('ascii'))
except UnicodeDecodeError:
    each = str(each.decode('utf-8').encode('ascii', errors='ignore'))

but I am getting an error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8c in position 16: invalid start byte 

How can I solve this?

  • I tried your code in ipython, not getting any error. Commented Oct 17, 2017 at 13:16
  • @pnv python 2.7 or 3.x? Commented Oct 17, 2017 at 13:48
  • i used python2.7 Commented Oct 17, 2017 at 13:50

1 Answer

Judging from your question, I assume that you are using Python 2.7.


The error occurs because:

  1. Your source is not encoded in UTF-8; it is almost certainly cp1252.
  2. In cp1252 the 'Œ' character is the byte '\x8c', and that byte is not a valid start byte in UTF-8.
  3. You specified UTF-8 as the encoding for decoding your string in the 'except' branch.

To see this more clearly, consider:

>>> u = '\x8c'.decode('cp1252')
>>> u
u'\u0152'

So decoding the byte '\x8c' with cp1252 yields a Unicode code point, whose name is:

>>> import unicodedata
>>> unicodedata.name(u)
'LATIN CAPITAL LIGATURE OE'

However, if we try to decode with UTF-8, we'll get an error:

>>> u = '\x8c'.decode('utf-8')
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8c ...

So the byte '\x8c' on its own cannot be decoded as UTF-8.


To fix the problem you can try this:

each = str(each.decode('cp1252').encode('ascii', errors='ignore')) 

Or this:

each = str(each.decode('utf-8', errors='ignore').encode('ascii', errors='ignore')) 
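The two fixes above can be folded into one helper. What follows is only a minimal sketch, written for Python 3, where byte strings are the explicit bytes type. The name to_ascii and its fallback parameter are hypothetical and do not come from the original code, and cp1252 as the fallback is the same guess made in this answer.

```python
def to_ascii(raw, fallback="cp1252"):
    """Return raw bytes as an ASCII-only str, dropping non-ASCII characters."""
    try:
        # Try strict UTF-8 first; a lone 0x8c raises UnicodeDecodeError here.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: input that is not UTF-8 is cp1252.
        # errors="replace" covers the few bytes cp1252 leaves undefined.
        text = raw.decode(fallback, errors="replace")
    # Drop everything outside ASCII, then return a text string.
    return text.encode("ascii", errors="ignore").decode("ascii")


print(to_ascii(b"DD-XBS 2 1/2x 17 LCL\x8c 3-pack"))  # DD-XBS 2 1/2x 17 LCL 3-pack
```

Trying UTF-8 first keeps correctly encoded input intact, and the fallback is used only when strict decoding fails.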

Also, in your case you can filter the characters with ord():

my_str = 'DD-XBS 2 1/2x 17 LCLξ 3-pack'
ascii_str = ''
for sign in my_str:
    # keep only characters in the 7-bit ASCII range
    if ord(sign) < 128:
        ascii_str += sign
print(ascii_str)  # DD-XBS 2 1/2x 17 LCL 3-pack


But possibly the best solution is just to convert your source to UTF-8.
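If the data comes from a file, converting it could look roughly like the sketch below. It assumes the file really is cp1252, which is a guess, not something confirmed in the question. The function name convert_to_utf8 and its parameters are hypothetical. io.open is used because it accepts an encoding argument in both Python 2.7 and Python 3.

```python
import io


def convert_to_utf8(src_path, dst_path, src_encoding="cp1252"):
    """Re-encode a text file from src_encoding (assumed cp1252) to UTF-8."""
    # newline="" leaves the original line endings untouched.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Writing to a separate destination path keeps the original file available in case the guessed source encoding turns out to be wrong.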


2 Comments

Just one question. How did you identify that its encoding is 'cp1252'? Is there any method available for this?
From the error report and the appearance of the 'Œ' char, I guessed that the actual encoding is one of the most common Western European encodings. Then I checked how the '\x8c' byte looks after decoding in cp1252, latin_1 and mac_roman, using print('\x8c'.decode('cp1252')) and so on. Only the cp1252-decoded result matched. So, to detect the right encoding when we know how a char looks and what its byte value is, I suggest making a list of possible encodings and printing the decoded byte in a loop over that list. Also, if we somehow know the Unicode code point of the char, we can skip the print and break out of the loop after a match.
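The loop described in this comment could be sketched as follows, in Python 3 syntax. The function name guess_encodings is hypothetical, and the candidate list is just the three encodings mentioned above. Instead of stopping at the first match, this version collects every candidate that matches.

```python
def guess_encodings(raw, expected, candidates=("cp1252", "latin_1", "mac_roman")):
    """Return the candidate encodings under which raw decodes to expected."""
    matches = []
    for name in candidates:
        try:
            decoded = raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec cannot decode the bytes at all
        if decoded == expected:
            matches.append(name)
    return matches


# The byte from the error message and the code point of 'Œ'.
print(guess_encodings(b"\x8c", u"\u0152"))  # ['cp1252']
```

Several encodings can map the same byte to the same character, so a single match is a hint, not proof.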
