6

I understand that this is likely a repeat question, but I'm having trouble finding a solution.

In short I have a string I'd like to decode:

raw = "\x94my quote\x94"
string = decode(raw)

expected from string

'"my quote"' 

One last note: I'm working with Python 3, so raw is a Unicode str and has already been decoded. Given that, what exactly do I need to do to "decode" the "\x94" characters?

3
  • If you already have a Unicode string, your website scraping used the wrong encoding to decode the data to Unicode. Ideally, fix the code reading the website instead of the result; otherwise, encode with the mis-applied encoding to undo the problem, then decode with the correct one. Commented Jun 1, 2017 at 15:42
  • I'm just using urllib.request.urlopen, and there doesn't appear to be an option to change how the request is decoded. As pointed out in my selected answer, the solution to my immediate problem was to encode in "latin-1" and then decode in "windows-1252". Is this a reasonable approach, or is there a way to address the problem at its root? Commented Jun 1, 2017 at 16:53
  • It's a reasonable approach, but without seeing a reproducible example of your code reading the website, it's difficult to address the problem at its root :) Commented Jun 1, 2017 at 17:16
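The comments above suggest fixing the decode step where the page is read rather than repairing the string afterwards. Here is a minimal sketch of that idea, assuming the server labels the page as ISO-8859-1 while actually sending Windows-1252 bytes (an assumption; the question does not show the response headers). The names `decode_body` and `fetch_text` are hypothetical helpers, not part of urllib.

```python
from urllib.request import urlopen

# Labels that, in this sketch, are assumed to really mean Windows-1252.
LATIN1_LABELS = {"iso-8859-1", "iso8859-1", "latin-1", "latin1"}


def decode_body(body: bytes, declared_charset=None) -> str:
    """Decode raw response bytes, treating Latin-1 labels as Windows-1252."""
    charset = (declared_charset or "utf-8").strip().lower().replace("_", "-")
    if charset in LATIN1_LABELS:
        charset = "windows-1252"
    return body.decode(charset)


def fetch_text(url: str) -> str:
    """Read the raw bytes and decode them ourselves (hypothetical helper)."""
    with urlopen(url) as response:
        body = response.read()  # bytes, not yet decoded
        declared = response.headers.get_content_charset()
    return decode_body(body, declared)


# Offline demonstration with the bytes from the question.
text = decode_body(b"\x94my quote\x94", "ISO-8859-1")
print(text)
```

Because `urlopen(...).read()` returns bytes, the choice of encoding stays in your hands; the mis-decoding usually happens in whatever code turns those bytes into a str.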

3 Answers

6
string = "\x22my quote\x22"
print(string)

You don't need to decode anything: Python 3 string literals are already Unicode. You do, however, need the correct escape for the double quote character ", which is \x22.

If, however, the data comes from a different character set (here it appears to be Windows-1252), you need to decode the byte string using that character set:

str(b"\x94my quote\x94", "windows-1252") 

If your string isn't a byte string, you have to encode it first. I found that the latin-1 encoding works, because it maps each code point below 256 back to the same byte value:

string = "\x94my quote\x94"
str(string.encode("latin-1"), "windows-1252")
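The encode-then-decode round trip above can be wrapped in a small function. This is a sketch under the assumption that the text was wrongly decoded as Latin-1 when the bytes were really Windows-1252; `fix_cp1252_mojibake` is a hypothetical name.

```python
def fix_cp1252_mojibake(text: str) -> str:
    """Re-interpret text decoded as Latin-1 that was really Windows-1252."""
    # Latin-1 maps code points 0-255 to identical byte values, so this
    # step recovers the original bytes exactly.
    raw_bytes = text.encode("latin-1")
    # Decode those bytes with the encoding that was actually intended.
    return raw_bytes.decode("windows-1252")


fixed = fix_cp1252_mojibake("\x94my quote\x94")
print(fixed)
print(ascii(fixed))  # shows the code points, useful when quotes look alike
```

Note that byte 0x94 in Windows-1252 is the right curly quote (U+201D), not the straight ASCII quote. The encode step raises UnicodeEncodeError if the text contains characters above U+00FF, which would suggest the text was not mis-decoded in this particular way.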

4 Comments

Hmmm, well "\x94" is not an input of my choosing, but rather from a website I'm parsing, and while print may send the decode string to stdout, I need to capture it as a variable.
It is captured as a variable. If I just write str in Python it will output '"myquote"'.
@rmorshea I amended my answer to include decoding the string from a different character set.
What if I'm not given the string as a binary? Am I forced to encode it somehow, and then decode it? my best guess is "\x94my quote\x94".encode("utf-8").decode('windows-1252') but this is wrong. I get ”my quote”
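To see why encoding with UTF-8 does not undo the problem while Latin-1 does, it may help to compare the bytes each one produces. The snippet below is only an illustration using the string from the question.

```python
raw = "\x94my quote\x94"

# UTF-8 turns U+0094 into two bytes (0xC2 0x94), so an extra byte appears.
as_utf8 = raw.encode("utf-8")
# Latin-1 turns U+0094 into the single byte 0x94, restoring the original data.
as_latin1 = raw.encode("latin-1")

wrong = as_utf8.decode("windows-1252")
right = as_latin1.decode("windows-1252")

print(as_utf8, ascii(wrong))
print(as_latin1, ascii(right))
```

The stray 0xC2 byte decodes to an extra character in front of each quote, which is why the UTF-8 attempt does not give the expected string.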
4

I don't know if this is what you mean, but this works:

some_binary = b"\x94my quote\x94"
# A bare .decode() assumes UTF-8 and raises UnicodeDecodeError on byte 0x94.
result = some_binary.decode("windows-1252")

That gives you the result. If you don't know which encoding to choose, you can use chardet.detect (from the third-party chardet package) to guess it:

import chardet
chardet.detect(some_binary)
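If installing chardet is not an option, a simpler alternative is to try a short list of candidate encodings in order. This is not what chardet does (it guesses statistically); it is just a standard-library sketch, and both `decode_with_fallbacks` and the candidate list are assumptions made for illustration.

```python
def decode_with_fallbacks(data: bytes, encodings=("utf-8", "windows-1252")):
    """Return (encoding, text) for the first candidate that decodes cleanly."""
    for name in encodings:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


encoding, text = decode_with_fallbacks(b"\x94my quote\x94")
print(encoding, ascii(text))
```

UTF-8 goes first because it rejects most non-UTF-8 input, whereas single-byte encodings accept nearly any byte sequence and so would mask the real encoding if tried earlier.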

Comments

2

Did you try it like this? I think you need to call decode as a method of the bytes class and pass utf-8 as the argument. Add b in front of the string literal too.

string = b"\x94my quote\x94"
decoded_str = string.decode('utf-8', 'ignore')
print(decoded_str)
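One caveat with the 'ignore' error handler: it silently drops bytes that are not valid UTF-8, so the quote characters vanish instead of being decoded. A small comparison, assuming the data is really Windows-1252 as the other answers suggest:

```python
data = b"\x94my quote\x94"

# 'ignore' discards the invalid 0x94 bytes, so the quotes are lost.
ignored = data.decode("utf-8", "ignore")
# Decoding with the actual encoding keeps them as curly quotes.
proper = data.decode("windows-1252")

print(repr(ignored))
print(ascii(proper))
```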

2 Comments

If you only think so, you should verify your solution.
My fault, corrected it. And you're right, SO is addicting but when my responses start getting that sloppy, it's time for bed. (:
