0

I have a web crawler that get a lot of these errors:

UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 27: ordinal not in range(128) 

To mitigate these errors I have implemented a function that encode them like this:

def properEncode(url): url = url.replace("ø", "%C3%B8") url = url.replace("å", "%C3%A5") url = url.replace("æ", "%C3%A6") url = url.replace("é", "%c3%a9") url = url.replace("Ø", "%C3%98") url = url.replace("Å", "%C3%A5") url = url.replace("Æ", "%C3%85") url = url.replace("í", "%C3%AD") return url 

These are based on this table: http://www.utf8-chartable.de/

The conversion I do seems to be to convert them to utf-8 hex? Is there a python function to do this automatically?

1 Answer 1

1

You are URL encoding them. You can do so trivially with the urllib.parse.quote() function:

>>> from urllib.parse import quote >>> quote("ø") '%C3%B8' 

or put into a function to only fix the URL path of a given URL (as this encoding doesn't apply to the host portion, for example):

from urllib.parse import quote, urlparse def properEncode(url): parts = urlparse(url) path = quote(parts.path) return parts._replace(path=path).geturl() 

This limits the encoding to just the path portion of the URL. If you need to encode the query string, use the quote_plus function as query parameters replace spaces with a plus instead of %20 (and handle the query portion of the URL).

Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.