I'm working through the 100 Days of Code boot camp on Udemy and am currently on the web scraping lesson that uses BeautifulSoup. I have not been able to finish the lesson because I keep getting an error (a UnicodeDecodeError) that I don't understand and don't know how to fix, even though the code is very simple. Here is my Python code:

    from bs4 import BeautifulSoup

    with open("website.html") as file:
        html_doc = file.read()

    soup = BeautifulSoup(html_doc, 'html.parser')
    print(soup.title.name)

Here is the error

    Traceback (most recent call last):
      File "C:\Users\xarss\Desktop\100 days of python\Webdev_projects\Websrapingproyect\main.py", line 12, in <module>
        html_doc = file.read()
      File "C:\Users\xarss\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 281: character maps to <undefined>

I have already tried reinstalling the Beautiful Soup package, but the problem persists. I also tried other HTML files and got the same error. This is the HTML file I am using:

    <!DOCTYPE html>
    <html>
    <head>
      <meta charset="utf-8">
      <title>Angela's Personal Site</title>
    </head>
    <body>
      <h1 id="name">Angela Yu</h1>
      <p><em>Founder of <strong><a href="https://www.appbrewery.co/">The App Brewery</a></strong>.</em></p>
      <p>I am an iOS and Web Developer. I ❤️ coffee and motorcycles.</p>
      <hr>
      <h3 class="heading">Books and Teaching</h3>
      <ul>
        <li>The Complete iOS App Development Bootcamp</li>
        <li>The Complete Web Development Bootcamp</li>
        <li>100 Days of Code - The Complete Python Bootcamp</li>
      </ul>
      <hr>
      <h3 class="heading">Other Pages</h3>
      <a href="https://angelabauer.github.io/cv/hobbies.html">My Hobbies</a>
      <a href="https://angelabauer.github.io/cv/contact-me.html">Contact Me</a>
    </body>
    </html>
2
  • You are missing an indentation for the with block Commented Jun 5, 2023 at 12:07
  • Yes, that was a mistake I made when posting the question; my original code is indented. Thanks Commented Jun 5, 2023 at 12:44

2 Answers 2

2

This error occurs because the file is not encoded in cp1252, which is the default encoding that Python's open() uses on some Windows systems.
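As a rough illustration of where a byte like 0x9d can come from: the sketch below assumes the heart character in the page is the culprit (an assumption, since the traceback only gives a byte position). Its UTF-8 encoding contains 0x9d, and cp1252 has no character assigned to that byte. The helper name `decodes_as` is made up for this example.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data can be decoded with the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# U+2764 (heavy black heart) becomes three bytes in UTF-8; the middle one is 0x9d.
heart_bytes = "\u2764".encode("utf-8")
print(heart_bytes)  # b'\xe2\x9d\xa4'

print(decodes_as(heart_bytes, "utf-8"))   # True
print(decodes_as(heart_bytes, "cp1252"))  # False: 0x9d is undefined in cp1252
```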

You need to find out which encoding the file actually uses and then specify it when opening the file.
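If the file does not declare its encoding, one possible approach is to read the raw bytes and try a short list of candidates. This is only a sketch: `first_working_encoding` is a hypothetical helper, and the candidate list is an assumption. A successful decode does not prove the guess is right, so strict encodings such as UTF-8 should be tried first.

```python
def first_working_encoding(raw: bytes, candidates=("utf-8", "cp1252")):
    """Return the first candidate that decodes raw without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Stand-in for the bytes of the HTML file; with a real file, read them with
# open("website.html", "rb") instead.
sample = "I \u2764 coffee".encode("utf-8")
print(first_working_encoding(sample))  # utf-8
```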

In this case, the meta tag in the head of your HTML file declares that the file is encoded in UTF-8:

<meta charset="utf-8"> 

Here is the updated code:

    from bs4 import BeautifulSoup

    with open("website.html", encoding="utf-8") as file:
        soup = BeautifulSoup(file, 'html.parser')

    print(soup.title.string)

Hope this helps, don't forget to accept an answer if your issue is solved.


1 Comment

cp1252 is the default on some Windows systems, but generally not on anything else.
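To check which default applies on a particular machine, a quick sketch using the standard library's `locale.getpreferredencoding` may help. The printed value depends on the platform and on how Python is configured, so no output is shown.

```python
import locale

# Locale-preferred encoding; text-mode open() without encoding= typically
# falls back to this value.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```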
-1

This is a common error when opening a file whose encoding is unknown.

One of the methods below may work:

    with open("website.html", errors="ignore") as file:   # skip bytes that cannot be decoded
    with open("website.html", errors="replace") as file:  # substitute a replacement character
    with open("website.html", "rb") as file:              # read raw bytes without decoding

1 Comment

Discarding input you don't know how to cope with is an extremely bad idea. You should figure out the correct encoding of the data, and then use the encoding= keyword argument to the open() call.
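To make the concern in this comment concrete, the sketch below decodes the same UTF-8 bytes three ways. The sample string is a stand-in for the page text, not taken from the actual file on disk. With the wrong codec, `errors="ignore"` and `errors="replace"` both run without raising, but the text comes out corrupted.

```python
raw = "I \u2764 coffee".encode("utf-8")

# Wrong codec, undecodable byte dropped: the heart turns into two stray characters.
ignored = raw.decode("cp1252", errors="ignore")

# Wrong codec, undecodable byte replaced with U+FFFD: still corrupted.
replaced = raw.decode("cp1252", errors="replace")

# Correct codec: the original text is recovered intact.
correct = raw.decode("utf-8")

print(ascii(ignored))   # 'I \xe2\xa4 coffee'
print(ascii(replaced))  # 'I \xe2\ufffd\xa4 coffee'
print(ascii(correct))   # 'I \u2764 coffee'
```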
