What kind of URL encoding is this? [duplicate]

Question

I have the encoded URL

http://blahblah.com/s%E2%80%8Btart/DEE-G%E2%80%8B6F-W4A-2N1%E2%80%8B5

for

http://blahblah.com/start/DEE-G6F-W4A-2N15

What kind of encoding is this and how do I convert it in Python?

I do not understand what is going on in this question; it seems that %E2%80%8B is just randomly inserted into your URL. How did this happen? What have you tried to do to convert it? How did you get from A->B or from B->A?
– Inbar Rose, Commented Mar 18, 2013 at 12:46
This happens when copying an email in IE and pasting it in Chrome or FF. :-/
– Sri, Commented Mar 18, 2013 at 12:47
Similar question and problem (%E2%80%8B) here: stackoverflow.com/questions/6315422/encoding-issue-asp-net
– Daniel Magnusson, Commented Mar 18, 2013 at 12:47

unutbu · Accepted Answer · 2013-03-18 18:28:26Z

Edit: (due to conversation with @interjay):

%E2%80%8B represents a ZERO WIDTH SPACE. Those probably shouldn't be there. You could remove them with str.replace:

    In [135]: 'http://blahblah.com/s%E2%80%8Btart/DEE-G%E2%80%8B6F-W4A-2N1%E2%80%8B5'.replace('%E2%80%8B', '')
    Out[135]: 'http://blahblah.com/start/DEE-G6F-W4A-2N15'

In general, quoted URLs can be unquoted using urllib.unquote (this is the Python 2 name; in Python 3 the function is urllib.parse.unquote):

    In [6]: import urllib
    In [7]: print(urllib.unquote('http://blahblah.com/s%E2%80%8Btart/DEE-G%E2%80%8B6F-W4A-2N1%E2%80%8B5'))
    http://blahblah.com/start/DEE-G6F-W4A-2N15

Here is how you can tell that %E2%80%8B represents a ZERO WIDTH SPACE:

    In [18]: x = urllib.unquote('%E2%80%8B')
    In [19]: y = x.decode('utf-8')
    In [20]: import unicodedata as UD
    In [21]: [UD.name(c) for c in y]
    Out[21]: ['ZERO WIDTH SPACE']
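The session above is Python 2. As a rough Python 3 equivalent (a sketch, not part of the original answer), the same check could look like this; the variable names `chars`, `names` and `codepoints` are only illustrative:

```python
import unicodedata
from urllib.parse import unquote

# In Python 3, unquote() returns str and decodes the percent-escapes
# as UTF-8 by default, so no separate .decode('utf-8') step is needed.
chars = unquote('%E2%80%8B')

# Look up the Unicode name and code point of each decoded character.
names = [unicodedata.name(c) for c in chars]
codepoints = ['U+{:04X}'.format(ord(c)) for c in chars]

print(names)       # ['ZERO WIDTH SPACE']
print(codepoints)  # ['U+200B']
```

The three bytes E2 80 8B are the UTF-8 encoding of the single code point U+200B, which is why one character comes back.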

Note that the unquoted URL includes ZERO WIDTH SPACEs:

    In [4]: urllib.unquote('http://blahblah.com/s%E2%80%8Btart/DEE-G%E2%80%8B6F-W4A-2N1%E2%80%8B5')
    Out[4]: 'http://blahblah.com/s\xe2\x80\x8btart/DEE-G\xe2\x80\x8b6F-W4A-2N1\xe2\x80\x8b5'
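In Python 3 the result is a str rather than a byte string, so the hidden characters show up as \u200b instead of \xe2\x80\x8b. A minimal sketch, assuming Python 3 and using the example URL from the question (the names `url`, `decoded` and `expected` are just for illustration):

```python
from urllib.parse import unquote

url = 'http://blahblah.com/s%E2%80%8Btart/DEE-G%E2%80%8B6F-W4A-2N1%E2%80%8B5'
expected = 'http://blahblah.com/start/DEE-G6F-W4A-2N15'

decoded = unquote(url)

# print(decoded) would look identical to the clean URL, because a
# zero-width space is invisible. ascii() escapes it so it can be seen.
print(ascii(decoded))

# The decoded URL is three characters longer than the clean one.
print(len(decoded) - len(expected))  # 3
print(decoded == expected)           # False
```

Comparing lengths or using ascii()/repr() is a quick way to spot this kind of invisible character when two strings look the same on screen.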

It seems like an odd thing to put in a URL...

This will leave the zero-width space in the string (although you can't see it when using print), which is probably not the right solution, as the URL is almost certainly not supposed to have a zero-width space in the middle of a word.
Given the URL, this is how it is unquoted in Python. Whether the given URL is correct is not the OP's question and not one we can answer since the URL is obviously made-up.
Part of answering a question is figuring out what the OP actually needs, as they may not exactly know themselves. In this case, unquoting the URL is obviously not it.
Going by the OP's comment, the problem is probably with his webmail client or browser adding the zero-width space. If it can't be fixed in the source, the zero-width space probably needs to be removed rather than unquoted.
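If the zero-width spaces cannot be fixed at the source, one way to remove them is sketched below. This is an illustration under stated assumptions, not code from the answer: `strip_zero_width_spaces` is a hypothetical helper, and it assumes the URL may contain the character either percent-encoded (in upper or lower case hex) or as a raw U+200B. It deliberately does not unquote anything else, so other escapes such as %20 are left alone.

```python
import re

# Percent-encoded UTF-8 bytes of U+200B; hex digits may be in either case.
_ENCODED_ZWSP = re.compile(r'%E2%80%8B', re.IGNORECASE)


def strip_zero_width_spaces(url):
    """Remove ZERO WIDTH SPACE from a URL, encoded or raw (hypothetical helper)."""
    # First drop the percent-encoded form, e.g. '%E2%80%8B' or '%e2%80%8b'.
    url = _ENCODED_ZWSP.sub('', url)
    # Then drop any literal U+200B characters that were pasted in directly.
    return url.replace('\u200b', '')


dirty = 'http://blahblah.com/s%E2%80%8Btart/DEE-G%E2%80%8B6F-W4A-2N1%E2%80%8B5'
print(strip_zero_width_spaces(dirty))
# http://blahblah.com/start/DEE-G6F-W4A-2N15
```

Other invisible characters (for example U+200C or U+FEFF) are not handled here; whether those should also be stripped depends on where the URLs come from.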
