Python: difficulty converting ascii to unicode

Question

My goal: get the page source from a url and count all instances of a keyword within that page source

How I am doing it: getting the pagesource via urllib2, looping through each char of the page source and comparing it to the keyword

My problem: my keyword is encoded in utf-8 while the page source is in ascii... I am running into errors whenever I try conversions.

getting the page source:

import urllib2 response = urllib2.urlopen(myUrl) return response.read()

comparing page source and keyword:

pageSource[i] == keyWord[j]

I need to convert one of these strings to the other's encoding. Intuitively I felt that ascii (the page source) to utf-8 (the key word) would be the best and easiest, so:

 pageSource = unicode(pageSource) UnicodeDecodeError: 'ascii' codec can't decode byte __ in position __: ordinal not in range(128)

Are you sure your page source is ASCII? ASCII is now a subset of UTF-8. I.e. A in ASCII is 0x41, which is the same as UTF-8 — Alastair McCormack
– Alastair McCormack, Commented Jun 2, 2015 at 21:37

Martijn Pieters · Accepted Answer · 2015-06-02 21:39:18Z

When trying to work with text, don't leave your data as byte strings. Decode to Unicode early, encode back to bytes as late as possible.

Decode your downloaded network data:

import urllib2 response = urllib2.urlopen(myUrl) # Latin-1 is the default for HTTP text/ responses, adjust as needed codec = response.info().getparam('charset', 'latin1') return response.read().decode(codec)

and do the same for your keyWord data. If it is encoded as UTF-8, decode it as such, or use Unicode string literals.

You may want to read up on Python and Unicode:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO

Alastair McCormack · Accepted Answer · 2015-06-02 21:48:14Z

I'll assume your remote "source page" contains more than just ASCII otherwise your comparison will already work as is (ASCII is now a subset of UTF-8. I.e. A in ASCII is 0x41, which is the same as UTF-8).

You may find Python Requests library easier as it will automatically decode remote content to Unicode strings based on the server's headers (Unicode strings are encoding neutral so can be compared without worrying about encoding).

resp = requests.get("http://www.example.com/utf8page.html") resp.text >> u'My unicode data €'

You will then need to decode your reference data:

keyWord[j] = "€".decode("UTF-8") keyWord[j] >> u'€'

If you're embedding non-ASCII in your source code, you need to define the encoding you're using. For example, at the top of your source code/script:

# coding=UTF-8

this library is much better than urllib, thanks for suggesting it! for me, this solution worked

Collectives™ on Stack Overflow

Python: difficulty converting ascii to unicode

2 Answers 2

Comments

1 Comment

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

1 Comment

Related