Python: How to use BeautifulSoup to deal with encoding issues?

Question

This is my first time using BeautifulSoup.

Basically, I use BeautifulSoup to extract data. I am trying to build a CSV table from a table on a web page, and an example row of my table looks like this:

[<td>1</td>, <td> Chief executives and senior officials</td>, <td>£120,830</td>,<td>-3.8</td>]

Now, the problem is that when I use .text.encode('utf8') on each cell, the output becomes:

('1', ' Chief executives and senior officials', '\xc2\xa3120,830', '-3.8')

The figure £120,830 becomes \xc2\xa3120,830, and I have no idea what kind of encoding this is. Is there a way to get the proper output £120,830 rather than this crazy encoding?

Alternatively, is there a way to make the encoded \xc2\xa3120,830 look like £120,830 in my CSV? Does anyone know how to deal with this kind of problem?

Another alternative is to remove the <td> tags and keep the content, but how can I do that in Python? Is there an efficient way of getting rid of these tags? Any help will be appreciated. Thanks.

tripleee · Accepted Answer · 2014-06-26 16:44:36Z

That is how £ comes out when you encode it as UTF-8. If that's not what you want, why are you encoding it?

In more detail, UTF-8 encodes U+00A3 as the byte sequence 0xC2 0xA3 (two bytes) which Python displays in a string as '\xc2\xa3'.
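To make the byte sequence concrete, here is a minimal sketch of the round trip. It assumes Python 3, where the encoded value is a bytes object and prints with a b prefix; the question's output without the prefix suggests Python 2. The variable names are only illustrative.

```python
# The pound sign is the single code point U+00A3.
pound = "\u00a3"

# Encoding to UTF-8 yields the two bytes 0xC2 0xA3.
encoded = pound.encode("utf-8")
print(encoded)       # b'\xc2\xa3'
print(len(encoded))  # 2

# Decoding the same bytes gives the original character back.
decoded = encoded.decode("utf-8")
print(decoded == pound)  # True

# The salary figure from the question, encoded the same way.
price = "\u00a3120,830"
price_bytes = price.encode("utf-8")
print(price_bytes)   # b'\xc2\xa3120,830'
```

So '\xc2\xa3' is not damaged data, only the escaped display of the two bytes that stand for £.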

If you do want this in a file and you want the file to be UTF-8 encoded, nothing is wrong, except maybe what you are using to look at the file.
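As a rough sketch of writing such a row to a UTF-8 CSV file, assuming Python 3: keep the cell text as ordinary strings and let open() do the encoding, rather than calling .encode() on each cell. The file name and temporary directory are hypothetical, and the row is hard-coded here; in practice it would come from the text of each parsed <td> cell.

```python
import csv
import os
import tempfile

# Example row from the question, already extracted as plain text.
row = ["1", " Chief executives and senior officials", "\u00a3120,830", "-3.8"]

# Hypothetical output location, used only for this sketch.
csv_path = os.path.join(tempfile.mkdtemp(), "salaries.csv")

# Pass text to the csv module and let open() handle the UTF-8 encoding.
with open(csv_path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerow(row)

# On disk the pound sign is stored as the two bytes 0xC2 0xA3.
with open(csv_path, "rb") as f:
    raw = f.read()

# Reading the file back as UTF-8 restores the original text.
with open(csv_path, encoding="utf-8", newline="") as f:
    read_back = next(csv.reader(f))

print(read_back == row)  # True
```

If a viewer shows Â£ instead of £, it is most likely reading the file as Latin-1 or Windows-1252 rather than UTF-8. Some spreadsheet programs detect UTF-8 more reliably when the file starts with a byte order mark, which encoding="utf-8-sig" writes; whether that helps depends on the program.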
