
Running this code:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(open("my.html"))
    print(soup.prettify())

Produces this error:

    Traceback (most recent call last):
      File "soup.py", line 5, in <module>
        print(soup.prettify())
      File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u25ba' in position 9001: character maps to <undefined>

I then tried:

    print(soup.encode('UTF-8').prettify())

But this fails because soup.encode('UTF-8') returns a bytes object, which has no prettify method:

    Traceback (most recent call last):
      File "soup.py", line 11, in <module>
        print(soup.encode('UTF-8').prettify())
    AttributeError: 'bytes' object has no attribute 'prettify'

Not sure how to go about solving this. Any input would be greatly appreciated.

  • try to decode the string from bytes first: bytes.decode(my.html) Commented Feb 15, 2013 at 6:22
  • I was unable to make this work with beautiful soup (AttributeError: 'str' object has no attribute...) Commented Feb 15, 2013 at 16:32

1 Answer


Your (Windows) console is using the cp437 encoding, and the soup contains a character ('\u25ba' in your traceback) that cp437 cannot represent. By default Python raises an exception in this situation, but you can change the error handling:

    import sys, io
    from bs4 import BeautifulSoup

    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, 'cp437', 'backslashreplace')
    soup = BeautifulSoup(open("my.html"))
    print(soup.prettify())
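
To see what the 'backslashreplace' handler does without touching the real console, here is a minimal sketch using only the standard library. The sample string and the variable names are made up for illustration; an in-memory io.BytesIO stands in for sys.stdout.buffer.

```python
import io

# Hypothetical sample text containing the character from the traceback.
text = "play \u25ba"

# The default error handler ('strict') raises, as in the question.
try:
    text.encode("cp437")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'backslashreplace' substitutes a Python escape sequence instead of raising.
escaped = text.encode("cp437", errors="backslashreplace")

# 'replace' substitutes a question mark, which loses the original code point.
replaced = text.encode("cp437", errors="replace")

# The same handler applied through a TextIOWrapper, with an in-memory
# buffer standing in for sys.stdout.buffer.
buffer = io.BytesIO()
wrapper = io.TextIOWrapper(buffer, encoding="cp437", errors="backslashreplace")
wrapper.write(text)
wrapper.flush()
wrapped = buffer.getvalue()

print(strict_failed, escaped, replaced, wrapped)
```

With 'backslashreplace' the unsupported character still shows up in the output as an escape, so you can tell which code point was there, whereas 'replace' collapses every unsupported character to the same placeholder.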

Alternatively, write the soup to a file and read it with an editor that supports the encoding:

    # On Windows, utf-8-sig will allow the file to be read by Notepad.
    with open('out.txt', 'w', encoding='utf-8-sig') as f:
        f.write(soup.prettify())
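
As a rough illustration of what 'utf-8-sig' adds, the sketch below writes a short string to a temporary file and inspects the result. The sample text and the temporary path are assumptions for the demo, not part of the original script.

```python
import codecs
import os
import tempfile

# Hypothetical sample text standing in for soup.prettify().
text = "play \u25ba\n"

# Write to a throwaway location rather than the working directory.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# newline="" disables newline translation so the bytes are the same on every OS.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    f.write(text)

# Reading the raw bytes shows the UTF-8 byte order mark at the start of the file.
with open(path, "rb") as f:
    raw = f.read()

# Reading with 'utf-8-sig' strips the mark again and returns the original text.
with open(path, encoding="utf-8-sig", newline="") as f:
    round_tripped = f.read()

print(raw[:3] == codecs.BOM_UTF8, round_tripped == text)
```

The leading byte order mark is what lets Notepad recognise the file as UTF-8. If you read the file back with plain 'utf-8' instead, the mark appears as a '\ufeff' character at the start of the string.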