UnicodeDecodeError: 'charmap' when using BeautifulSoup [duplicate]

Question

I'm working through the 100 Days of Code bootcamp on Udemy. I am currently on the web scraping lesson using BeautifulSoup, but I have not been able to complete it because I keep getting a UnicodeDecodeError. I do not know why it happens or how to fix it, as the code is very simple. Here is my Python code:

from bs4 import BeautifulSoup

with open("website.html") as file:
    html_doc = file.read()

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.name)

Here is the error:

Traceback (most recent call last):
  File "C:\Users\xarss\Desktop\100 days of python\Webdev_projects\Websrapingproyect\main.py", line 12, in <module>
    html_doc = file.read()
  File "C:\Users\xarss\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 281: character maps to <undefined>

I already tried reinstalling the Beautiful Soup package, but I still get the same error. I also tried other HTML files, and the problem persists. Here is the HTML file:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Angela's Personal Site</title>
</head>
<body>
    <h1 id="name">Angela Yu</h1>
    <p><em>Founder of <strong><a href="https://www.appbrewery.co/">The App Brewery</a></strong>.</em></p>
    <p>I am an iOS and Web Developer. I ❤️ coffee and motorcycles.</p>
    <hr>
    <h3 class="heading">Books and Teaching</h3>
    <ul>
        <li>The Complete iOS App Development Bootcamp</li>
        <li>The Complete Web Development Bootcamp</li>
        <li>100 Days of Code - The Complete Python Bootcamp</li>
    </ul>
    <hr>
    <h3 class="heading">Other Pages</h3>
    <a href="https://angelabauer.github.io/cv/hobbies.html">My Hobbies</a>
    <a href="https://angelabauer.github.io/cv/contact-me.html">Contact Me</a>
</body>
</html>

Yes, that was an error when posting the question; my original code is indented. Thanks.
– Xareni Galindo, commented Jun 5, 2023 at 12:44

diogeek · Accepted Answer · 2024-06-06 12:18:22Z

This error occurs because the file is not encoded in cp1252, which is often the default encoding that Python's open() uses on Windows systems.
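A minimal sketch of the failure, assuming the offending byte comes from the ❤ character in the page (its UTF-8 encoding contains 0x9d, a byte that Python's cp1252 codec leaves undefined):

```python
# The heart character from the HTML, encoded the way a UTF-8 file stores it.
utf8_bytes = "❤".encode("utf-8")  # three bytes: e2 9d a4

# Decoding those bytes as cp1252 fails, because cp1252 has no character at 0x9d.
try:
    utf8_bytes.decode("cp1252")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(utf8_bytes.hex(" "))
print(message)

# Decoding with the encoding the bytes were actually written in works.
decoded = utf8_bytes.decode("utf-8")
print(decoded)
```

The same decoding step happens inside file.read() when the file is opened in text mode without an explicit encoding.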

You need to find out which encoding the file actually uses, then specify it when opening the file.
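When nothing in the file states its encoding, one rough approach is to read the raw bytes and try a few strict decodes. The sketch below is only an illustration: guess_encoding and its candidate list are hypothetical names, not part of Python or Beautiful Soup, and a successful decode does not prove the guess is right.

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252")):
    """Return the first candidate that decodes `raw` without errors, else None.

    Hypothetical helper for illustration. Strict multi-byte encodings such as
    UTF-8 should come first, because single-byte encodings accept almost any
    input and would mask the real answer.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Stand-in for: raw = open("website.html", "rb").read()
sample = "I ❤ coffee".encode("utf-8")
print(guess_encoding(sample))
```

The name returned can then be passed as the encoding= argument to open().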

In this case, the file declares its encoding as UTF-8 in the <meta> tag of its <head>:

<meta charset="utf-8">

Here is the updated code:

from bs4 import BeautifulSoup

with open("website.html", encoding="utf-8") as file:
    soup = BeautifulSoup(file, 'html.parser')

print(soup.title.string)

Hope this helps, don't forget to accept an answer if your issue is solved.

cp1252 is the default on some Windows systems, but generally not on anything else.
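To see which default applies on a given machine, the standard library can report it. This is a small sketch; the printed value depends on the operating system, locale settings, and Python version, so no particular output is assumed here.

```python
import locale
import sys

# Encoding that open() falls back to in text mode when encoding= is omitted
# (this is how Python 3.9, the version in the traceback, picks its default).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# 1 if Python's UTF-8 mode is on (for example via `python -X utf8`),
# in which case the default becomes UTF-8 regardless of the locale.
print(sys.flags.utf8_mode)
```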

Talha Tayyab · Accepted Answer · 2023-06-05 12:58:04Z

This is a common error when opening a file without knowing its encoding.

One of the following approaches may work:

with open("website.html", errors="ignore") as file:
with open("website.html", errors='replace') as file:
with open("website.html", 'rb') as file:

Discarding input you don't know how to cope with is an extremely bad idea. You should figure out the correct encoding of the data, and then use the encoding= keyword argument to the open() call.
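A short sketch of why the error handlers are risky, using a sentence from the page as sample data and assuming the file is UTF-8 while the default encoding is cp1252:

```python
# Bytes as they would sit in a UTF-8 encoded file.
raw = "I ❤ coffee".encode("utf-8")

# errors="ignore" silently drops the undecodable byte and mis-decodes the rest.
ignored = raw.decode("cp1252", errors="ignore")

# errors="replace" swaps the undecodable byte for U+FFFD, still garbling the text.
replaced = raw.decode("cp1252", errors="replace")

# Naming the right encoding recovers the original text.
correct = raw.decode("utf-8")

print(repr(ignored))
print(repr(replaced))
print(repr(correct))
```

Neither handler raises an error, but both return corrupted text, so the problem only shows up later in the parsed output.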
