BLUF: Why is the decode() method on a bytes object failing to decode ç?

I am receiving a UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position ..... After tracking down the offending character, I found it is ç. The error occurs when I read the response from the server:

    import http.client

    conn = http.client.HTTPConnection(host='something.com')
    conn.request('GET', url='/some/json')
    resp = conn.getresponse()
    content = resp.read().decode()  # throws UnicodeDecodeError

I am unable to get the content. If I just do content = resp.read(), it succeeds, and I can write the bytes to a file using wb mode, but wherever the ç occurs, it appears as 0xE7 in the file. Even if I open the file in Notepad++ and set the encoding to UTF-8, the character only shows as its hex value.

Why am I not able to decode this UTF-8 character from an HTTPResponse? Am I not correctly writing it to file either?

  • Have you considered using requests? Commented Nov 6, 2017 at 18:08
  • @kichik No need. requests is just a high level API for making the same type of requests. It relies on http.client to make the socket connections anyhow. The example I have shown is somewhat false, as I am really making HTTPS connections and requests does not support SSL. Commented Nov 6, 2017 at 18:11
  • @kichik Further, the real question is why does decode() not work on a valid UTF-8 character? Commented Nov 6, 2017 at 18:12
  • The server doesn't seem to be sending you actual UTF-8. I was hoping requests would do better at detecting that. The actual UTF-8 representation of ç is b'\xc3\xa7'. The server is sending you CP1252. Commented Nov 6, 2017 at 18:18
  • What does resp.getheaders() return? Commented Nov 6, 2017 at 18:26
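
The byte-level claim in the comments above can be checked directly. The following is a minimal sketch, not taken from the thread; the variable names are illustrative only. It encodes ç both ways and shows that a lone 0xE7 byte is rejected by the UTF-8 codec but accepted by CP1252.

```python
# The same character encoded two ways.
utf8_bytes = "ç".encode("utf-8")      # two bytes in UTF-8
cp1252_bytes = "ç".encode("cp1252")   # one byte in CP1252

print(utf8_bytes)    # b'\xc3\xa7'
print(cp1252_bytes)  # b'\xe7'

# In UTF-8, 0xE7 is the lead byte of a three-byte sequence, so on its own
# (or followed by ordinary ASCII) it cannot be decoded.
try:
    cp1252_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding with the codec the bytes were actually written in works.
decoded = cp1252_bytes.decode("cp1252")
print(utf8_failed, decoded)
```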

1 Answer

When you have issues with encoding/decoding, you should take a look at the UTF-8 Encoding Debugging Chart.

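If you look up the byte 0xE7 in the Windows-1252 column of the chart, you find that the expected character is ç, which indicates that the server is sending CP1252 rather than UTF-8. Decoding with resp.read().decode('cp1252') instead of the default UTF-8 should therefore work.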
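
As a rough sketch of how this could be handled in code: rather than hard-coding the codec, one option is to take the charset from the response's Content-Type header and fall back to CP1252 only when none is declared. The helper name decode_body, the sample header value, and the sample JSON body are assumptions made for illustration; the thread does not show what the server actually sends.

```python
from email.message import Message


def decode_body(raw, headers, fallback="cp1252"):
    """Decode raw bytes using the charset declared in the headers.

    `headers` is any email.message.Message-like object. The fallback of
    cp1252 is an assumption based on the diagnosis above.
    """
    charset = headers.get_content_charset() or fallback
    return raw.decode(charset)


# Stand-in for a real response so the sketch runs without a network call.
# The header value and body are hypothetical.
headers = Message()
headers["Content-Type"] = "application/json; charset=windows-1252"
raw = b'{"name": "Fran\xe7ois"}'

text = decode_body(raw, headers)
print(text)

# Re-encoding the decoded text yields real UTF-8 (ç becomes 0xC3 0xA7).
# Writing the raw bytes in 'wb' mode would instead keep the single 0xE7
# byte, which is why an editor set to UTF-8 shows it as a hex escape.
utf8_bytes = text.encode("utf-8")
```

With a real connection, resp.headers from http.client is a Message subclass, so decode_body(resp.read(), resp.headers) should work the same way. To save the result as UTF-8, open the file in text mode with open(path, 'w', encoding='utf-8') and write the decoded text.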
