Problems with encoding while parsing html document with lxml

Question

I am trying to get clean text from some web pages. I have read a lot of tutorials and finally ended up with python lxml + beautifulsoup + requests modules . The reason for using lxml for such a task is that it cleans html files better than beautiful soup do.

I ended up with test script like this:

from bs4 import UnicodeDammit import re import requests import lxml import lxml.html from time import sleep urls = [ "http://mathprofi.ru/zadachi_po_kombinatorike_primery_reshenij.html", "http://ru.onlinemschool.com/math/assistance/statistician/", "http://mathprofi.ru/zadachi_po_kombinatorike_primery_reshenij.html", "http://universarium.org/courses/info/332", "http://compsciclub.ru/course/wordscombinatorics", "http://ru.onlinemschool.com/math/assistance/statistician/", "http://lectoriy.mipt.ru/course/Maths-Combinatorics-AMR-Lects/", "http://www.youtube.com/watch?v=SLPrGWQBX0I" ] def check(url): print "That is url {}".format(url) r = requests.get(url) ud = UnicodeDammit(r.content, is_html=True) content = ud.unicode_markup.encode(ud.original_encoding, "ignore") root = lxml.html.fromstring(content) lxml.html.etree.strip_elements(root, lxml.etree.Comment, "script", "style") text = lxml.html.tostring(root, method="text", encoding=unicode) text = re.sub('\s+', ' ', text) print "Text type is {}!".format(type(text)) print text[:200] sleep(1) if __name__ == '__main__': for url in urls: check(url)

Inetrmediate de- and reencoding to the original encoding is needed because the html page could contain some characters that are encoded differently from the most others. Such an occasion breaks further lxml tostring method.

However my code is not working properly with all tests. Sometimes (especially with last two urls) it outputs the mess:

... That is url http://ru.onlinemschool.com/math/assistance/statistician/ Text type is <type 'unicode'>! Онлайн решение задач по математике. Комбинаторика. Теория вероятности. Close Авторизация на сайте Введите логин: Введите пароль: Запомнить меня Регистрация Изучение математики онлайн.Изучайте математ That is url http://lectoriy.mipt.ru/course/Maths-Combinatorics-AMR-Lects/ Text type is <type 'unicode'>! ÐÐ°ÑÐµÐ¼Ð°ÑÐ¸ÐºÐ°. ÐÑÐ½Ð¾Ð²Ñ ÐºÐ¾Ð¼Ð±Ð¸Ð½Ð°ÑÐ¾ÑÐ¸ÐºÐ¸ Ð¸ ÑÐµÐ¾ÑÐ¸Ð¸ ÑÐ¸ÑÐµÐ» / ÐÐ¸Ð´ÐµÐ¾Ð»ÐµÐºÑÐ¸Ð¸ Ð¤Ð¸Ð·ÑÐµÑÐ°: ÐÐµÐºÑÐ¾ÑÐ¸Ð¹ ÐÐ¤Ð¢Ð - Ð²Ð¸Ð´ÐµÐ¾Ð»ÐµÐºÑÐ¸Ð¸ Ð¿Ð¾ ÑÐ¸Ð·Ð¸ÐºÐµ, That is url http://www.youtube.com/watch?v=SLPrGWQBX0I Text type is <type 'unicode'>! ÐÑÐ½Ð¾Ð²Ð½ÑÐµ ÑÐ¾ÑÐ¼ÑÐ»Ñ ÐºÐ¾Ð¼Ð±Ð¸Ð½Ð°ÑÐ¾ÑÐ¸ÐºÐ¸ - bezbotvy - YouTube ÐÑÐ¾Ð¿ÑÑÑÐ¸ÑÑ RU ÐÐ¾Ð±Ð°Ð²Ð¸ÑÑ Ð²Ð¸Ð´ÐµÐ¾ÐÐ¾Ð¹ÑÐ¸ÐÐ¾Ð¸ÑÐº ÐÐ°Ð³ÑÑÐ·ÐºÐ°... ÐÑÐ±ÐµÑÐ¸ÑÐµ ÑÐ·ÑÐº.

This mess is somehow connected with encoding ISO-8859-1, but i cannot find out how. For each of two last urls i get:

In [319]: r = requests.get(urls[-1]) In [320]: chardet.detect(r.content) Out[320]: {'confidence': 0.99, 'encoding': 'utf-8'} In [321]: UnicodeDammit(r.content, is_html=True).original_encoding Out[321]: 'utf-8' In [322]: r = requests.get(urls[-2]) In [323]: chardet.detect(r.content) Out[323]: {'confidence': 0.99, 'encoding': 'utf-8'} In [324]: UnicodeDammit(r.content, is_html=True).original_encoding Out[324]: u'utf-8'

So i guess lxml makes internal decoding based on the wrong assumptions of an input string. I think that it don't even try to make guess about input string encoding. It seems that in the core of lxml happens something like this :

In [339]: print unicode_string.encode('utf-8').decode("ISO-8859-1", "ignore") ÑÑÑÐ¾ÐºÐ°

How could i resolve my issue and clean all urls from html tags ? Maybe i should use another python modules or do it another way ? Please, give me your suggestions.

I don't know what UnicodeDammit is but... r.content should be unicode and should allow you to process the correct content. you can retrieve r.text instead. this will certainly look messy but from here you can experiment with coding, as you are in ascii now — Marek
– Marek, Commented Apr 16, 2015 at 17:12
related stackoverflow.com/questions/2307795/…, stackoverflow.com/questions/1495627/… — Nikos M.
– Nikos M., Commented Apr 16, 2015 at 17:14
UnicodeDammit allows to guess charset of a web page according to http headers, meta tag and contents. r.content is not unicode, it is simple sequence of bytes from server. r.text converts r.content to unicode assuming it is in r.encoding. However r.encoding is not always right and i could not use r.text method, because i get an error from lxml: UnicodeDecodeError: 'utf8' codec can't decode byte 0x.. in position ..: invalid continuation byte — xolodec
– xolodec, Commented Apr 16, 2015 at 17:19

xolodec · Accepted Answer · 2015-04-17 10:03:39Z

I finally figured it out. The solution is not to use

root = lxml.html.fromstring(content)

but configure an explicit Parser object which can be told to use particular encoding enc:

htmlparser = etree.HTMLParser(encoding=enc) root = etree.HTML(content, parser=htmlparser)

I found, additionally, that even UnicodeDammit make obvious mistakes when deciding about encoding of a page. So i have added another if block:

if (declared_enc and enc != declared_enc):

Here is the snippet of a result:

from lxml import html from lxml.html import etree import requests from bs4 import UnicodeDammit import chardet try: self.log.debug("Try to get content from page {}".format(url)) r = requests.get(url) except requests.exceptions.RequestException as e: self.log.warn("Unable to get page content of the url: {url}. " "The reason: {exc!r}".format(url=url, exc=e)) raise ParsingError(e.message) ud = UnicodeDammit(r.content, is_html=True) enc = ud.original_encoding.lower() declared_enc = ud.declared_html_encoding if declared_enc: declared_enc = declared_enc.lower() # possible misregocnition of an encoding if (declared_enc and enc != declared_enc): detect_dict = chardet.detect(r.content) det_conf = detect_dict["confidence"] det_enc = detect_dict["encoding"].lower() if enc == det_enc and det_conf < THRESHOLD_OF_CHARDETECT: enc = declared_enc # if page contains any characters that differ from the main # encodin we will ignore them content = r.content.decode(enc, "ignore").encode(enc) htmlparser = etree.HTMLParser(encoding=enc) root = etree.HTML(content, parser=htmlparser) etree.strip_elements(root, html.etree.Comment, "script", "style") text = html.tostring(root, method="text", encoding=unicode)

Collectives™ on Stack Overflow

Problems with encoding while parsing html document with lxml

1 Answer 1

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Linked

Related