Python 3 Decoding Strings

Question

I understand that this is likely a repeat question, but I'm having trouble finding a solution.

In short, I have a string I'd like to decode:

raw = "\x94my quote\x94"
string = decode(raw)

Expected value of string:

'"my quote"'

One last note: I'm working with Python 3, so raw is a Unicode str and is thus already decoded. Given that, what exactly do I need to do to "decode" the "\x94" characters?

If you already have a Unicode string, your website scraping used the wrong encoding to decode the data to Unicode. Ideally, fix the code reading the website instead of the result; otherwise, encode with the mis-applied encoding to undo the problem, then decode with the correct one.
– Mark Tolonen, Commented Jun 1, 2017 at 15:42
I'm just using urllib.request.urlopen, and there doesn't appear to be an option to change how the request is decoded. As pointed out in my selected answer, the solution to my immediate problem was to encode in "latin-1" and then decode in "windows-1252". Is this a reasonable approach, or is there a way to address the problem at its root?
– rmorshea, Commented Jun 1, 2017 at 16:53
It's a reasonable approach, but without seeing a reproducible example of your code reading the website, it's difficult to address the problem at its root :)
– Mark Tolonen, Commented Jun 1, 2017 at 17:16
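
On the root cause discussed in these comments: urlopen(...).read() returns bytes, so the decoding step is one the caller controls. The following is a minimal sketch, not the asker's actual code (which was never posted). The names decode_body, fetch_text, and LATIN1_LABELS are hypothetical, and treating a declared Latin-1 charset as Windows-1252 is an assumption modeled on how browsers handle that label.

```python
from typing import Optional
from urllib.request import urlopen

# Hypothetical set of labels we choose to treat as Windows-1252 (an assumption).
LATIN1_LABELS = {"latin-1", "latin1", "iso-8859-1", "iso8859-1"}


def decode_body(body: bytes, declared: Optional[str] = None) -> str:
    """Decode raw response bytes, preferring Windows-1252 over Latin-1."""
    charset = (declared or "utf-8").lower()
    if charset in LATIN1_LABELS:
        charset = "windows-1252"
    # errors="replace" avoids an exception on the few bytes cp1252 leaves undefined.
    return body.decode(charset, errors="replace")


def fetch_text(url: str) -> str:
    """Read the page as bytes and decode it explicitly."""
    with urlopen(url) as response:
        body = response.read()  # bytes, not str
        declared = response.headers.get_content_charset()  # None if not declared
    return decode_body(body, declared)
```

With this shape, the "\x94" characters never reach the str at all, because byte 0x94 is decoded straight to the right curly quote.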

CodeMonkey · Accepted Answer · 2017-06-01 06:27:34Z

6

string = "\x22my quote\x22"
print(string)

You don't need to decode anything, because a Python 3 str is already Unicode, but you do need the correct escape sequence for the double quote ("), which is \x22 rather than \x94.

If, however, the data comes from a different character set (it looks like Windows-1252 here, where byte 0x94 is the right curly quote ”), you need to decode the byte string from that character set:

str(b"\x94my quote\x94", "windows-1252")

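If your string isn't a byte string, you have to encode it first. I found the latin-1 encoding to work, since it maps each code point from U+0000 to U+00FF back to the single byte with the same value:

string = "\x94my quote\x94"
# Recover the original bytes with latin-1, then decode them as Windows-1252.
str(string.encode("latin-1"), "windows-1252")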
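
The round trip above can be wrapped in a small helper. This is only a sketch: the function name fix_cp1252_mojibake is hypothetical, and it assumes the text was originally Windows-1252 bytes that were mistakenly decoded as Latin-1. Text that does not fit that assumption is returned unchanged.

```python
def fix_cp1252_mojibake(text: str) -> str:
    """Undo a Latin-1 mis-decoding of Windows-1252 bytes, if possible."""
    try:
        # latin-1 turns each code point <= U+00FF back into the byte it came from.
        original_bytes = text.encode("latin-1")
        return original_bytes.decode("windows-1252")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1 (so it was not
        # mis-decoded this way) or a byte is undefined in Windows-1252.
        return text


fixed = fix_cp1252_mojibake("\x94my quote\x94")
print(ascii(fixed))  # '\u201dmy quote\u201d'
```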

edited Jun 1, 2017 at 6:27

answered Jun 1, 2017 at 5:40

CodeMonkey


4 Comments

rmorshea Over a year ago

Hmmm, well "\x94" is not an input of my choosing, but rather comes from a website I'm parsing, and while print may send the decoded string to stdout, I need to capture it as a variable.

CodeMonkey Over a year ago

It is captured as a variable. If I just enter string in the Python interpreter, it outputs '"my quote"'.

CodeMonkey Over a year ago

@rmorshea I amended my answer to include decoding the string from a different character set.

rmorshea Over a year ago

What if I'm not given the string as bytes? Am I forced to encode it somehow, and then decode it? My best guess is "\x94my quote\x94".encode("utf-8").decode('windows-1252'), but this is wrong: I get Â”my quoteÂ”
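
To see why the UTF-8 attempt in the last comment produces the stray Â, it helps to compare the bytes each encoding generates. This is an illustrative sketch only, and the variable names are arbitrary.

```python
raw = "\x94my quote\x94"

# UTF-8 encodes U+0094 as two bytes (C2 94), so an extra byte appears.
utf8_bytes = raw.encode("utf-8")
print(utf8_bytes)  # b'\xc2\x94my quote\xc2\x94'

# Latin-1 encodes U+0094 as the single byte 94, restoring the original data.
latin1_bytes = raw.encode("latin-1")
print(latin1_bytes)  # b'\x94my quote\x94'

# In Windows-1252, byte C2 is 'Â' and byte 94 is the right curly quote.
wrong = utf8_bytes.decode("windows-1252")    # 'Â”my quoteÂ”'
right = latin1_bytes.decode("windows-1252")  # '”my quote”'
```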

Yuval Pruss · Answer · 2017-06-01 05:42:19Z

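I don't know if this is what you mean, but this works:

some_binary = b"\x94my quote\x94"
# A bare .decode() assumes UTF-8 and fails on byte 0x94, so name the encoding.
result = some_binary.decode("windows-1252")

That gives you the result. If you don't know which encoding to choose, you can use chardet.detect from the third-party chardet package:

import chardet  # third-party: pip install chardet
chardet.detect(some_binary)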
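
If installing chardet is not an option, a different and much cruder technique is to try a short list of candidate encodings in order. This is a standard-library sketch, not a substitute for real detection: the function name decode_with_fallback and the default candidate order are assumptions chosen for this example.

```python
def decode_with_fallback(data, candidates=("utf-8", "windows-1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


text, used = decode_with_fallback(b"\x94my quote\x94")
print(used)  # windows-1252
```

UTF-8 goes first because it rejects most non-UTF-8 input, while latin-1 goes last because it accepts every byte and therefore never fails.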

Matthew Plemmons · Answer · 2017-06-01 05:47:51Z

2

Did you try it like this? I think you need to call decode as a method of the bytes class and pass utf-8 as the argument. Add b in front of the string too.

string = b"\x94my quote\x94"
# errors='ignore' silently drops the invalid 0x94 bytes, so the quotes are lost.
decoded_str = string.decode('utf-8', 'ignore')
print(decoded_str)
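
For comparison, here is a short sketch of how the standard error handlers treat the same bytes. The variable names are arbitrary, and the last line assumes the data really is Windows-1252, as suggested elsewhere in this thread.

```python
data = b"\x94my quote\x94"

# Default (strict) handling: byte 0x94 is not valid UTF-8, so decoding fails.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' drops the offending bytes; 'replace' substitutes U+FFFD for each.
ignored = data.decode("utf-8", errors="ignore")    # 'my quote'
replaced = data.decode("utf-8", errors="replace")  # '\ufffdmy quote\ufffd'

# Decoding with the encoding the bytes were written in keeps the quotes.
correct = data.decode("windows-1252")              # '\u201dmy quote\u201d'
```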

edited Jun 1, 2017 at 5:47

answered Jun 1, 2017 at 5:32

Matthew Plemmons


2 Comments

CIsForCookies Over a year ago

If you only "think" it works, you should verify your solution before posting.

Matthew Plemmons Over a year ago

My fault, corrected it. And you're right, SO is addicting but when my responses start getting that sloppy, it's time for bed. (:
