This is my first time using BeautifulSoup.

I am using BeautifulSoup to extract data from a web table and build a CSV file from it. An example row of my table looks like this:

[<td>1</td>, <td> Chief executives and senior officials</td>, <td>£120,830</td>,<td>-3.8</td>] 

The problem is that when I call .text.encode('utf8') on each cell, the output becomes:

('1', ' Chief executives and senior officials', '\xc2\xa3120,830', '-3.8') 

The figure £120,830 becomes \xc2\xa3120,830, and I have no idea what kind of encoding this is. Is there a way to get the proper output £120,830 instead of the escaped bytes?

Alternatively, is there a way to make \xc2\xa3120,830 appear as £120,830 in my CSV? Does anyone know how to deal with this kind of problem?

Another option would be to remove the <td> tags and keep only the content, but how can I do that in Python? Is there an efficient way of getting rid of these tags? Any help will be appreciated. Thanks.

1 Answer

That is how £ comes out when you encode it as UTF-8. If that's not what you want, why are you encoding it?

In more detail, UTF-8 encodes U+00A3 as the byte sequence 0xC2 0xA3 (two bytes) which Python displays in a string as '\xc2\xa3'.
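
To see the round trip concretely, here is a minimal sketch. It assumes Python 3, where text (str) and encoded bytes are separate types; the question's output looks like Python 2, where the encoded result is shown as a plain string. The variable names are only illustrative.

```python
# U+00A3 is the pound sign; the value mirrors the cell from the question.
text = "\u00a3120,830"

# Encoding turns the text into bytes: the pound sign becomes 0xC2 0xA3.
encoded = text.encode("utf-8")

# Decoding the same bytes as UTF-8 gives the original text back.
decoded = encoded.decode("utf-8")

print(repr(encoded))  # b'\xc2\xa3120,830'
print(decoded)        # £120,830
```

The escaped form is only how Python shows the bytes when it prints their repr; the data itself is intact.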

If you do want this in a file and you want the file to be UTF-8 encoded, nothing is wrong, except maybe what you are using to look at the file.
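
As a rough sketch of the file case, again assuming Python 3: keep the cell values as text and let the file object do the encoding, rather than calling .encode() on each cell. The row below is hard-coded to stand in for what get_text() on each <td> would return (which is also how the tags get dropped), and the file name salaries.csv and the temporary directory are hypothetical.

```python
import csv
import os
import tempfile

# Stand-in for [td.get_text() for td in row_cells]; values taken from the question.
row = ["1", " Chief executives and senior officials", "\u00a3120,830", "-3.8"]
cleaned = [cell.strip() for cell in row]  # drop stray leading/trailing spaces

# Hypothetical output location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "salaries.csv")

# The file is opened as UTF-8 text, so the csv module receives str values
# and the encoding happens once, when the data is written to disk.
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerow(cleaned)

# On disk the pound sign is the two bytes 0xC2 0xA3 ...
with open(path, "rb") as f:
    raw = f.read()

# ... and reading the file back as UTF-8 restores the original character.
with open(path, encoding="utf-8", newline="") as f:
    read_back = next(csv.reader(f))

print(read_back[2])
```

If the file still looks wrong in a spreadsheet program, the viewer is probably guessing a different encoding. Writing with encoding="utf-8-sig", which adds a byte order mark, may help some programs detect UTF-8.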
