How to deal with utf-8 encoded String and BeautifulSoup?

Question

How can I replace HTML entities in unicode strings with the proper Unicode characters? I want to get from

u'&quot;HAUS Kleider&quot; - &Uuml;ber das Bekleiden und Entkleiden, das Verh&Yuml;llen und Veredeln'

to

u'"HAUS-Kleider" - Über das Bekleiden und Entkleiden, das Verhüllen und Veredeln'

edit
Actually the entities are wrong. It seems like BeautifulSoup f...ed it up.

So the question is: How to deal with utf-8 encoded String and BeautifulSoup?

from BeautifulSoup import BeautifulSoup

f = open('path_to_file','r')
lines = [i for i in f.readlines()]
soup = BeautifulSoup(''.join(lines))
allArticles = []
for row in rows:
    l = []
    for r in row.findAll('td'):
        l += [r.string]  # here things seem to go wrong
    allArticles += [l]

Ü comes out as &Yuml; instead of Ü, but actually I don't want the encoding to be changed at all.

>>> soup.originalEncoding
'utf-8'

but I can't generate a proper unicode string from it.

Possible duplicate of Decode HTML entities in Python string?
– Wooble, Commented Oct 29, 2010 at 18:02
Things seem to go wrong? BeautifulSoup f'ed it up? The entities are wrong? Please try to give more precise details to make this question answerable. BeautifulSoup tends to handle UTF-8 pretty well.
– Josh Lee, Commented Oct 29, 2010 at 18:20

towi · Accepted Answer · 2010-10-29 18:15:33Z

I think what you need are ICU transliterators. I think there is a way to transliterate HTML entities into Unicode.

Try the transliterator ID Hex/XML-Any; that should do what you want. On the Demo page you can choose "Insert Sample: Compound", enter Hex/XML-Any into the "Compound 1" box, add some input data in the box and press "transform". Does this help?

There is a Python ICU binding, but it's not well maintained, I think.

BlueTrance · Accepted Answer · 2010-10-29 18:24:01Z

htmlentitydefs.entitydefs["quot"] returns '"'
That's a dictionary that translates entities to their actual character.
You should be able to continue easily from that point.
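
To make the dictionary approach concrete, here is a minimal sketch. It assumes Python 3, where the htmlentitydefs module is named html.entities; the helper name decode_entities is made up for this illustration. It looks up named entities in name2codepoint and also handles numeric references such as &#252; and &#xfc;.

```python
import re
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs

# Matches &name;, &#123; and &#x7b; style references.
_ENTITY_RE = re.compile(r"&(#x[0-9a-fA-F]+|#[0-9]+|\w+);")


def decode_entities(text):
    """Replace HTML entities in text with the characters they stand for.

    Hypothetical helper for illustration; unknown entity names are left
    untouched.
    """
    def _replace(match):
        name = match.group(1)
        if name.startswith("#x"):
            return chr(int(name[2:], 16))   # hexadecimal reference
        if name.startswith("#"):
            return chr(int(name[1:]))       # decimal reference
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            return match.group(0)           # not a known entity, keep as-is
        return chr(codepoint)

    return _ENTITY_RE.sub(_replace, text)


if __name__ == "__main__":
    print(decode_entities("&quot;HAUS Kleider&quot; - &Uuml;ber das Bekleiden"))
```

Note that a wrong entity decodes to the wrong character: &Yuml; becomes Ÿ, not ü, so this only helps once the entities in the input are correct. On Python 3.4 and later, the standard library function html.unescape does the same job without a hand-written regular expression.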

That would work if BeautifulSoup gave me the right entities at all. See my edit.

vikingosegundo · Accepted Answer · 2010-10-29 19:24:22Z

OK, the problem was silly, I have to confess. I was working on an old version of rows in the interactive interpreter. I don't know what was wrong with its contents, but this is the correct code:

from BeautifulSoup import BeautifulSoup

f = open('path_to_file','r')
lines = [i for i in f.readlines()]
soup = BeautifulSoup(''.join(lines))
rows = soup.findAll('tr')
allArticles = []
for row in rows:
    l = []
    for r in row.findAll('td'):
        l += [r.string]
    allArticles += [l]

shame on me!
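
For readers who want to check the same row and cell extraction without BeautifulSoup, here is a rough sketch that uses only the Python 3 standard library. It is a stand-in, not the code from this answer: it swaps in html.parser.HTMLParser, and the names TableCellCollector and extract_rows are invented for the example. It assumes the input is UTF-8 bytes and decodes them explicitly before parsing, so the cell text comes out as proper unicode strings.

```python
from html.parser import HTMLParser


class TableCellCollector(HTMLParser):
    """Collect the text of every <td>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &quot; into characters.
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag == "td" and self.rows:
            self.rows[-1].append("")      # start a new cell in the current row
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # append text to the current cell


def extract_rows(raw_bytes, encoding="utf-8"):
    """Decode raw_bytes and return a list of rows, each a list of cell strings."""
    parser = TableCellCollector()
    parser.feed(raw_bytes.decode(encoding))
    parser.close()
    return parser.rows


if __name__ == "__main__":
    sample = "<table><tr><td>&quot;HAUS&quot;</td><td>Über</td></tr></table>"
    print(extract_rows(sample.encode("utf-8")))
```

The point of the sketch is the order of operations: decode the bytes with the known encoding first, then parse, then collect cells row by row.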
