Encoding characters to utf-8 hex in Python 3

Question

I have a web crawler that gets a lot of these errors:

UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 27: ordinal not in range(128)

To mitigate these errors I have implemented a function that encodes the offending characters like this:

    def properEncode(url):
        url = url.replace("ø", "%C3%B8")
        url = url.replace("å", "%C3%A5")
        url = url.replace("æ", "%C3%A6")
        url = url.replace("é", "%C3%A9")
        url = url.replace("Ø", "%C3%98")
        url = url.replace("Å", "%C3%85")
        url = url.replace("Æ", "%C3%86")
        url = url.replace("í", "%C3%AD")
        return url

These are based on this table: http://www.utf8-chartable.de/

It seems that what I am doing is converting the characters to UTF-8 hex. Is there a Python function that does this automatically?

Martijn Pieters · Accepted Answer · 2016-12-25 23:10:45Z

You are URL encoding them. You can do so trivially with the urllib.parse.quote() function:

    >>> from urllib.parse import quote
    >>> quote("ø")
    '%C3%B8'

or put into a function to only fix the URL path of a given URL (as this encoding doesn't apply to the host portion, for example):

    from urllib.parse import quote, urlparse

    def properEncode(url):
        parts = urlparse(url)
        path = quote(parts.path)
        return parts._replace(path=path).geturl()

This limits the encoding to just the path portion of the URL. If you need to encode the query string as well, handle the query portion of the URL separately and use the quote_plus() function there, because query parameters encode a space as a plus sign instead of %20.
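A minimal sketch of how the query portion could be handled alongside the path, assuming the incoming URL is not already percent-encoded (quote() would otherwise escape the existing % signs a second time). The function name encode_path_and_query and the example URL are hypothetical, chosen only for illustration. urlencode() applies quote_plus() to each key and value by default.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlparse


def encode_path_and_query(url):
    # Hypothetical helper: assumes `url` has not been percent-encoded yet.
    parts = urlparse(url)
    # Path: non-ASCII characters become %XX escapes of their UTF-8 bytes.
    path = quote(parts.path)
    # Query: split into (key, value) pairs, then re-encode them.
    # urlencode() uses quote_plus() by default, so spaces become "+".
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode(pairs)
    return parts._replace(path=path, query=query).geturl()


encoded = encode_path_and_query("http://example.com/blåbær?q=rød grød")
print(encoded)
```

The host portion is left untouched here, in line with the answer's remark that this encoding does not apply to it.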
