Trouble decoding utf-16 string

Question

I'm using Python 3.3. I've been trying to decode a bytes string that looks like this:

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed:\xf9w\xdaH\xd2?\xcf\xbc....

and it keeps going. However, whenever I try to decode this string using str.decode('utf-16'), I get an error saying:

'utf16' codec can't decode bytes in position 54-55: illegal UTF-16 surrogate

I'm not exactly sure how to decode this string.

So that means it isn't really UTF-16. Where did you get the string? Might it be UCS-2?
– RemcoGerlich, Commented Apr 27, 2016 at 18:11
Does the result look OK if you only decode up to position 53? This may help to decide whether your assumption of UTF-16 is correct.
– mkiever, Commented Apr 27, 2016 at 18:16
I got it from Twisted. In twisted/web/proxy.py, in the handleResponsePart(self, buffer) function, I injected print(buffer). So the encoded string you're looking at is supposed to be HTML that I receive from Twisted proxies.
– Cristian, Commented Apr 27, 2016 at 18:17
The actual string is huge; what I pasted is only a small part of the full string.
– Cristian, Commented Apr 27, 2016 at 18:20
Try explicitly decoding as 'UTF-16BE' and 'UTF-16LE'; endianness might be the issue.
– martineau, Commented Apr 27, 2016 at 18:35

unutbu · Accepted Answer · 2016-04-28 00:24:11Z

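Gzipped data begins with the bytes \x1f\x8b\x08 (the two-byte gzip magic number followed by the code for the deflate method), so my guess is that your data is gzipped rather than UTF-16 text. Try gunzipping the data before decoding.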

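    import gzip
    import io

    # Truncated sample from the question. Reading it raises an exception
    # because `buf` is incomplete; it may work if you supply the complete buf.
    buf = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed:\xf9w\xdaH\xd2?\xcf\xbc'

    # Wrap the bytes in a file-like object so GzipFile can decompress them.
    with gzip.GzipFile(fileobj=io.BytesIO(buf)) as f:
        content = f.read()

    # Decode only after decompressing.
    print(content.decode('utf-16'))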
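
Since the sample above is truncated and cannot be run to completion, here is a minimal self-contained sketch of the same idea using data that is compressed locally. The helper names (`looks_gzipped`, `decode_body`, `decode_chunks`) and the sample HTML are hypothetical, and the default of UTF-8 is an assumption; the real encoding would normally come from the response's Content-Type header. Because `handleResponsePart` receives the body in pieces, the sketch also shows incremental decompression with `zlib.decompressobj`.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(data):
    """Return True if `data` starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def decode_body(data, encoding="utf-8"):
    """Gunzip a complete body if needed, then decode it to str.

    `encoding` is an assumption here, not something taken from the question.
    """
    if looks_gzipped(data):
        data = gzip.decompress(data)
    return data.decode(encoding)


def decode_chunks(chunks, encoding="utf-8"):
    """Decompress a gzip stream that arrives in several pieces."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    raw = b"".join(decompressor.decompress(chunk) for chunk in chunks)
    raw += decompressor.flush()
    return raw.decode(encoding)


# Hypothetical sample data standing in for the proxied HTML.
html = "<html><body>héllo</body></html>"
compressed = gzip.compress(html.encode("utf-8"))
parts = [compressed[:10], compressed[10:25], compressed[25:]]

print(decode_body(compressed))
print(decode_chunks(parts))
```

Decoding has to happen on the decompressed bytes as a whole (or through an incremental decoder), since a multi-byte character may be split across two chunks.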
