Python BeautifulSoup encoding

Question

I have code that reads an HTML file and modifies some text using Beautiful Soup. It works fine, but when I read the output, this part of my HTML file has been changed automatically:

Original: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

Changed automatically: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

I don't want any of the file contents to change automatically. Can someone help me with this?

Here is my code:

import re
import sys
from BeautifulSoup import BeautifulSoup

f = open(sys.argv[1],"rw")
data = f.read()
soup = BeautifulSoup(data)
comma = re.compile(',')
for t in soup.findAll(text=comma):
    t.replaceWith(t.replace(',', '&sbquo'))
print soup

BeautifulSoup takes some liberties with HTML, such as automatically correcting things it perceives to be problems. It might be worth using lxml instead; it has a very similar feature set and should leave everything unchanged.
– Klohkwherk, Commented Sep 12, 2011 at 19:53

rocksportrocker · Accepted Answer · 2011-09-12 18:55:43Z

Try

print soup.__str__("ISO-8859-1")
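
The call above is for BeautifulSoup 3, where `__str__` accepts an output encoding. As a rough sketch of the same idea in the newer Beautiful Soup 4 package (`bs4`), assuming it is installed and that the made-up sample markup below stands in for the real file, `encode()` can be asked for ISO-8859-1 output:

```python
# Sketch for Beautiful Soup 4 (bs4); the question itself uses BeautifulSoup 3.
from bs4 import BeautifulSoup

# Hypothetical sample input, stored as ISO-8859-1 bytes like the asker's file.
raw = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />'
    '</head><body><p>caf\xe9, au lait</p></body></html>'
).encode("iso-8859-1")

# Tell the parser how the bytes are encoded instead of letting it guess.
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")

# Converting to text uses the default output encoding (UTF-8), so the
# charset declared in the <meta> tag is rewritten to utf-8.
default_text = str(soup)

# encode() returns bytes in the requested encoding and writes that
# encoding name into the <meta> tag instead.
output = soup.encode("iso-8859-1")
```

Writing `output` to a file opened in binary mode should then keep both the bytes and the declared charset as ISO-8859-1.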

1 Comment

Divya Over a year ago

Thank you for that. It works. I have one more question: soup.findAll(text=comma) finds all text, including comments in the HTML page (<!-- text -->). How can I get the text excluding commented-out text? Please help me with this; I am stuck on this issue.
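
The thread does not answer this follow-up. One possible approach, sketched here for Beautiful Soup 4 (`bs4`) rather than the BeautifulSoup 3 used in the question, is to drop any match that is a `Comment` node. The sample markup and variable names are made up for illustration:

```python
# Sketch for Beautiful Soup 4 (bs4); sample markup is hypothetical.
import re

from bs4 import BeautifulSoup, Comment

html = "<p>one, two</p><!-- hidden, comment --><p>three</p>"
soup = BeautifulSoup(html, "html.parser")

comma = re.compile(",")

# find_all(string=...) matches every text node, HTML comments included.
matches = soup.find_all(string=comma)

# Comment is a subclass of the ordinary text node class, so an
# isinstance check is enough to keep only the regular text.
visible = [t for t in matches if not isinstance(t, Comment)]
```

Here `matches` holds both the paragraph text and the comment, while `visible` holds only the paragraph text.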
