
I have the encoded URL

http://blahblah.com/s%E2%80%8Btart/DEE-G%E2%80%8B6F-W4A-2N1%E2%80%8B5 

for

http://blahblah.com/start/DEE-G6F-W4A-2N15 

What kind of encoding is this, and how do I convert it in Python?

  • I do not understand what is going on in this question, it seems that %E2%80%8B is just randomly inserted into your URL. - How did this happen? What have you tried to do to convert it? How did you get from A->B or from B->A ? Commented Mar 18, 2013 at 12:46
  • This happens when copying an email in IE and pasting it in Chrome or FF. :-/ Commented Mar 18, 2013 at 12:47
  • Similar question and the same problem (%E2%80%8B) here: stackoverflow.com/questions/6315422/encoding-issue-asp-net Commented Mar 18, 2013 at 12:47

1 Answer

Edit: (due to conversation with @interjay):

%E2%80%8B represents a ZERO WIDTH SPACE. Those probably shouldn't be there. You could remove them with str.replace:

In [135]: 'http://blahblah.com/s%E2%80%8Btart/DEE-G%E2%80%8B6F-W4A-2N1%E2%80%8B5'.replace('%E2%80%8B', '')
Out[135]: 'http://blahblah.com/start/DEE-G6F-W4A-2N15'
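If the URL can arrive either still percent-encoded or already decoded, a slightly more defensive version of the same idea might look like the sketch below (Python 3). The function name `strip_zero_width_space` is made up for illustration, and it assumes the only unwanted character is U+200B:

```python
import re

ZWSP = "\u200b"            # ZERO WIDTH SPACE as a literal character
ZWSP_QUOTED = "%E2%80%8B"  # its percent-encoded UTF-8 form


def strip_zero_width_space(url: str) -> str:
    """Remove zero-width spaces, whether percent-encoded or literal.

    Hypothetical helper, not part of any library.
    """
    # Percent-escapes are case-insensitive, so also match %e2%80%8b.
    cleaned = re.sub(re.escape(ZWSP_QUOTED), "", url, flags=re.IGNORECASE)
    # Drop any zero-width spaces that were already decoded.
    return cleaned.replace(ZWSP, "")


url = "http://blahblah.com/s%E2%80%8Btart/DEE-G%E2%80%8B6F-W4A-2N1%E2%80%8B5"
print(strip_zero_width_space(url))
```

The plain str.replace above is enough when the input always looks exactly like the example; the regex only adds tolerance for lowercase escapes.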

In general, quoted URLs can be unquoted using urllib.unquote (this is the Python 2 name; in Python 3 the function is urllib.parse.unquote):

In [6]: import urllib
In [7]: print(urllib.unquote('http://blahblah.com/s%E2%80%8Btart/DEE-G%E2%80%8B6F-W4A-2N1%E2%80%8B5'))
http://blahblah.com/s​tart/DEE-G​6F-W4A-2N1​5

Here is how you can tell that %E2%80%8B represents a ZERO WIDTH SPACE:

In [18]: x = urllib.unquote('%E2%80%8B')
In [19]: y = x.decode('utf-8')
In [20]: import unicodedata as UD
In [21]: [UD.name(c) for c in y]
Out[21]: ['ZERO WIDTH SPACE']

Note that the unquoted URL still includes the ZERO WIDTH SPACEs:

In [4]: urllib.unquote('http://blahblah.com/s%E2%80%8Btart/DEE-G%E2%80%8B6F-W4A-2N1%E2%80%8B5')
Out[4]: 'http://blahblah.com/s\xe2\x80\x8btart/DEE-G\xe2\x80\x8b6F-W4A-2N1\xe2\x80\x8b5'
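The sessions above are Python 2. For anyone on Python 3, a rough equivalent of the same checks could look like this. It is only a sketch: the variable names are arbitrary, and it relies on urllib.parse.unquote decoding the escapes as UTF-8, which is its default.

```python
import unicodedata
from urllib.parse import unquote

quoted = "http://blahblah.com/s%E2%80%8Btart/DEE-G%E2%80%8B6F-W4A-2N1%E2%80%8B5"

# unquote() returns a str, so no separate .decode('utf-8') step is needed.
unquoted = unquote(quoted)

# ascii() escapes non-ASCII characters, making the invisible ones visible.
print(ascii(unquoted))

# Look up the Unicode name of whatever %E2%80%8B decodes to.
names = [unicodedata.name(c) for c in unquote("%E2%80%8B")]
print(names)
```

Printing `unquoted` directly would look like a clean URL, which is why the check goes through ascii() instead.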

It seems like an odd thing to put in a URL...

6 Comments

This will leave the zero-width space in the string (although you can't see it when using print), which is probably not the right solution, as the URL is almost certainly not supposed to have a zero-width space in the middle of a word.
Given the URL, this is how it is unquoted in Python. Whether the given URL is correct is not the OP's question and not one we can answer since the URL is obviously made-up.
Part of answering a question is figuring out what the OP actually needs, as they may not exactly know themselves. In this case, unquoting the URL is obviously not it.
What do you think is the actual problem then?
Going by the OP's comment, the problem is probably with his webmail client or browser adding the zero-width space. If it can't be fixed in the source, the zero-width space probably needs to be removed rather than unquoted.