I have the encoded URL
http://blahblah.com/s%E2%80%8Btart/DEE-G%E2%80%8B6F-W4A-2N1%E2%80%8B5 for
http://blahblah.com/start/DEE-G6F-W4A-2N15. What kind of encoding is this, and how do I convert it in Python?
Edit: (due to conversation with @interjay):
%E2%80%8B represents a ZERO WIDTH SPACE. Those probably shouldn't be there. You could remove them with str.replace:
    In [135]: 'http://blahblah.com/s%E2%80%8Btart/DEE-G%E2%80%8B6F-W4A-2N1%E2%80%8B5'.replace('%E2%80%8B', '')
    Out[135]: 'http://blahblah.com/start/DEE-G6F-W4A-2N15'

In general, quoted URLs can be unquoted using urllib.unquote:
    In [6]: import urllib
    In [7]: print(urllib.unquote('http://blahblah.com/s%E2%80%8Btart/DEE-G%E2%80%8B6F-W4A-2N1%E2%80%8B5'))
    http://blahblah.com/start/DEE-G6F-W4A-2N15

Here is how you can tell that %E2%80%8B represents a ZERO WIDTH SPACE:
    In [18]: x = urllib.unquote('%E2%80%8B')
    In [19]: y = x.decode('utf-8')
    In [20]: import unicodedata as UD
    In [21]: [UD.name(c) for c in y]
    Out[21]: ['ZERO WIDTH SPACE']

Note that the unquoted URL still includes the ZERO WIDTH SPACEs:
    In [4]: urllib.unquote('http://blahblah.com/s%E2%80%8Btart/DEE-G%E2%80%8B6F-W4A-2N1%E2%80%8B5')
    Out[4]: 'http://blahblah.com/s\xe2\x80\x8btart/DEE-G\xe2\x80\x8b6F-W4A-2N1\xe2\x80\x8b5'

It seems like an odd thing to put in a URL...
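The sessions above use Python 2. In Python 3 the same function lives in urllib.parse and returns a str that is already decoded as UTF-8, so the separate decode step is not needed. A minimal sketch, assuming you want both to unquote the URL and to drop the zero-width spaces (clean_url is a hypothetical helper name, not a library function):

```python
import unicodedata
from urllib.parse import unquote

ZERO_WIDTH_SPACE = "\u200b"  # U+200B, percent-encoded as %E2%80%8B in UTF-8


def clean_url(url):
    """Percent-decode url, then remove any zero-width spaces (hypothetical helper)."""
    return unquote(url).replace(ZERO_WIDTH_SPACE, "")


encoded = "http://blahblah.com/s%E2%80%8Btart/DEE-G%E2%80%8B6F-W4A-2N1%E2%80%8B5"

# unquote alone keeps the invisible characters in the result
decoded = unquote(encoded)
print(decoded.count(ZERO_WIDTH_SPACE))

# Confirm what the escape sequence decodes to
print([unicodedata.name(c) for c in unquote("%E2%80%8B")])

# Decode and strip in one step
print(clean_url(encoded))
```

This only removes U+200B. If the URL might contain other invisible characters, they would need to be handled separately.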
print), which is probably not the right solution, as the URL is almost certainly not supposed to have a zero-width space in the middle of a word.
%E2%80%8B is just randomly inserted into your URL.
- How did this happen? What have you tried to do to convert it? How did you get from A->B or from B->A?