
I have some code that reads an HTML file and modifies some of its text using Beautiful Soup. It works fine, but when I read the output, this part of my HTML file has been changed automatically:

Original: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

Modified by itself: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

I don't want any of the file contents to change automatically. Can someone help me with this?

Here is my code:

import re
import sys
from BeautifulSoup import BeautifulSoup

# Read the HTML file named on the command line
f = open(sys.argv[1], "r")
data = f.read()
soup = BeautifulSoup(data)

# Replace every comma found in a text node with the &sbquo; entity
comma = re.compile(',')
for t in soup.findAll(text=comma):
    t.replaceWith(t.replace(',', '&sbquo;'))

print soup
  • BeautifulSoup takes some liberties with HTML, such as automatically correcting things it perceives to be a problem. It might be worth using lxml; it has a very similar feature set and should leave everything unchanged. Commented Sep 12, 2011 at 19:53

1 Answer


Try passing the original encoding when rendering the soup:

print soup.__str__("ISO-8859-1") 
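
Some background on why the declaration changed: Beautiful Soup holds the parsed document as Unicode and re-encodes it on output, using UTF-8 unless told otherwise, and it rewrites the charset declaration to match the output encoding. The sketch below does not use Beautiful Soup at all; it is a standard-library Python 3 illustration, with a made-up sample string, of why the declared charset and the actual bytes need to agree.

```python
# Python 3, standard library only; the sample text is hypothetical.
text = "café, thé"

latin1_bytes = text.encode("iso-8859-1")
utf8_bytes = text.encode("utf-8")

# The same characters are stored as different bytes in each encoding.
print(len(latin1_bytes), len(utf8_bytes))  # 9 11

# If the file holds UTF-8 bytes but still declares iso-8859-1,
# a reader that trusts the declaration sees garbled accents.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)  # cafÃ©, thÃ©
```

Rendering the soup in ISO-8859-1, as suggested above, keeps both the bytes and the declaration in the original encoding.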

1 Comment

Thank you for that. It works. I have one more question: soup.findAll(text=comma) finds all text, including comments in the HTML page (<!-- text -->). How can I get the text while excluding commented-out text? Please help; I am stuck on this issue.
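
One possible direction for the follow-up, offered as a sketch rather than a definitive answer: in Beautiful Soup, comment nodes are instances of the Comment class, so an isinstance check inside the loop is one way to skip them. The example below shows the same idea with only the Python 3 standard library (html.parser instead of Beautiful Soup, which is a swapped-in technique). The class name CommaRewriter, the function replace_commas, and the sample markup are all hypothetical. It re-emits the markup as it goes, so the charset declaration is left untouched and only commas in ordinary text nodes are replaced. Known limits of this sketch: end tags are re-emitted in lowercase, and marked sections such as CDATA are not handled.

```python
from html.parser import HTMLParser


class CommaRewriter(HTMLParser):
    """Re-emit HTML, replacing commas in text nodes only (hypothetical helper)."""

    RAW_TEXT_TAGS = ("script", "style")

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; separate from text
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_raw_text = False

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag as it appeared in the source
        self.out.append(self.get_starttag_text())
        if tag in self.RAW_TEXT_TAGS:
            self.in_raw_text = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)
        if tag in self.RAW_TEXT_TAGS:
            self.in_raw_text = False

    def handle_data(self, data):
        # Only ordinary text is rewritten; script and style bodies are kept
        if not self.in_raw_text:
            data = data.replace(",", "&sbquo;")
        self.out.append(data)

    def handle_comment(self, data):
        # Comments pass through unchanged, commas included
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def replace_commas(html):
    parser = CommaRewriter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = (
    '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />'
    "<!-- keep, this -->"
    "<p>one, two</p>"
)
result = replace_commas(sample)
print(result)
```

When reading a real file with this approach, opening it with encoding="iso-8859-1" and writing it back with the same encoding would keep the bytes consistent with the declaration.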
