Python: Encoding Error - content of web page

Question

I'm trying to get the content of a web page, parse it, and then save it in a MySQL database.

I have already done this for a web page encoded in UTF-8.

But when I tried it with an ISO-8859-9 encoded web page, I got an error.

My code to get the content of the page:

    def getcontent(url):
        opener = urllib2.build_opener()
        opener.addheaders = [('User-agent', 'Magic Browser')]
        opener.addheaders = [('Accept-Charset', 'utf-8')]
        #print chardet.detect(response).get('encoding)
        response = opener.open(url).read()
        opener.close()
        return response

    url = "http://www.meb.gov.tr/duyurular/index.asp?ID=4"
    contentofpage = getcontent(url)
    print contentofpage
    print chardet.detect(contentofpage)
    print contentofpage.encode("utf-8")

output of content of page: ... E�itim Teknolojileri Genel M�d�rl�� ...

    {'confidence': 0.7789909202570836, 'encoding': 'ISO-8859-2'}
    Traceback (most recent call last):
      File "meb.py", line 18, in <module>
        print contentofpage.encode("utf-8")
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 458: ordinal not in range(128)

The page is actually Turkish, and its encoding is ISO-8859-9.

When I print it with the default encoding, all I see is �� in place of some characters. How can I get or convert the content of the page to UTF-8 or Turkish (ISO-8859-9)?

Also, when I use unicode(contentofpage), I get:

    Traceback (most recent call last):
      File "meb.py", line 20, in <module>
        print unicode(contentofpage)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 458: ordinal not in range(128)

Any help?

sberry · Accepted Answer · 2013-01-06 09:06:36Z

3

I think you want to decode, not encode, since it is already encoded.

print contentofpage.decode("iso-8859-9")

yields a sample like:

Eğitim Teknolojileri Genel Müdürlüğü
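To make the decode-then-encode order concrete, here is a minimal sketch. It is written in Python 3 syntax, where the bytes/text distinction is explicit, while the question uses Python 2. The `raw` bytes are an illustrative stand-in for the downloaded page, not the real response, and `to_utf8` is a hypothetical helper name. In Python 2, calling `.encode("utf-8")` on a byte string first triggers an implicit ASCII decode, which is most likely where the `UnicodeDecodeError` in the question comes from.

```python
# Illustrative ISO-8859-9 bytes standing in for the page content
# (an assumption for this sketch, not the actual server response).
raw = b"E\xf0itim Teknolojileri Genel M\xfcd\xfcrl\xfc\xf0\xfc"


def to_utf8(data, source_encoding="iso-8859-9"):
    """Hypothetical helper: decode bytes from source_encoding, re-encode as UTF-8."""
    text = data.decode(source_encoding)  # bytes -> text
    return text.encode("utf-8")          # text -> UTF-8 bytes


# Step 1: decode the already-encoded bytes into text.
text = raw.decode("iso-8859-9")

# Step 2: only now encode the text into the target encoding.
utf8_bytes = to_utf8(raw)

print(text)
```

The UTF-8 bytes are what you would typically hand to a database connection or file that expects UTF-8; the decoded text is what you would parse.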


2 Comments

MatandDie Over a year ago

print contentofpage.decode("iso-8859-9") gives: UnicodeEncodeError: 'ascii' codec can't encode character u'\xee' in position 458: ordinal not in range(128)

Mark Tolonen Over a year ago

Make sure you are decoding directly after getting the content. contentofpage = getcontent(url), then print contentofpage.decode('iso-8859-9').
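One possible reading of the `UnicodeEncodeError` in the first comment is that the decode succeeded, but the output side (for example a terminal or pipe that Python treats as ASCII) could not represent the decoded characters. The sketch below, again in Python 3 syntax and with illustrative sample bytes, shows the same failure against an ASCII target and one way around it: writing through a stream with an explicit encoding. The in-memory `buffer` stands in for a real file or stdout and is an assumption of this example.

```python
import io

# Illustrative ISO-8859-9 bytes (assumed sample, not the real page).
raw = b"E\xf0itim"
text = raw.decode("iso-8859-9")  # decode directly after getting the content

# An ASCII-only target cannot represent the Turkish character,
# which mirrors the error reported in the comment.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Writing through a stream with an explicit encoding avoids relying
# on whatever default the environment picks.
buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding="utf-8")
writer.write(text)
writer.flush()
written = buffer.getvalue()
```

In the Python 2 code from the question, the rough equivalent would be to print `contentofpage.decode("iso-8859-9").encode("utf-8")`, assuming the terminal displays UTF-8.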
