
I am trying to crawl a page, but I get a UnicodeDecodeError. Here is my code:

def soup_def(link):
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    usock = urllib2.urlopen(req)
    encoding = usock.headers.getparam('charset')
    page = usock.read().decode(encoding)
    usock.close()
    soup = BeautifulSoup(page)
    return soup

soup = soup_def("http://www.geekbuying.com/item/Ainol-Novo-10-Hero-II-Quad-Core--Tablet-PC-10-1-inch-IPS-1280-800-1GB-RAM-16GB-ROM-Android-4-1--HDMI-313618.html")

And the error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 284: invalid start byte 

I have seen that a few other users hit the same error, but I cannot figure out a solution.
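Would something like this be the right approach? (A sketch, not tested: it falls back to UTF-8 when the charset header is missing and replaces undecodable bytes instead of raising.)

import urllib2
from bs4 import BeautifulSoup

def soup_def_safe(link):
    # Same as soup_def, but with a fallback encoding and lenient decoding
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    usock = urllib2.urlopen(req)
    # getparam('charset') returns None when the header carries no charset
    encoding = usock.headers.getparam('charset') or 'utf-8'
    raw = usock.read()
    usock.close()
    # 'replace' swaps undecodable bytes for U+FFFD instead of raising
    page = raw.decode(encoding, 'replace')
    return BeautifulSoup(page)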

  • For what it's worth: this code works for me (after importing BeautifulSoup and urllib2, that is). Commented Nov 13, 2013 at 14:47
  • For me it works about 2 times in 10. If I run it again and again, it sometimes works; all the other times it doesn't. I don't know why. Commented Nov 13, 2013 at 14:49
  • I am doing XML parsing. The same error happens when I try BeautifulSoup(open(file_path), "xml") in Eclipse. The exact same code works in IPython Notebook! Both use Anaconda Python 3.6. Commented Apr 6, 2017 at 17:51

2 Answers


Another possibility is that you are trying to parse a hidden file, which is very common on Macs.

Add a simple if statement so that you only create BeautifulSoup objects from files that are actually HTML:

for root, dirs, files in os.walk(folderPath, topdown=True):
    for fileName in files:
        if fileName.endswith(".html"):
            soup = BeautifulSoup(open(os.path.join(root, fileName)).read(), 'lxml')
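If you want to be extra careful, you could also skip dotfiles explicitly, since a hidden file can still carry an .html extension. A sketch (folderPath here is a hypothetical top-level directory, substitute your own):

import os
from bs4 import BeautifulSoup

folderPath = 'pages'  # hypothetical directory, substitute your own

for root, dirs, files in os.walk(folderPath, topdown=True):
    for fileName in files:
        # Skip hidden files such as .DS_Store as well as non-HTML files
        if fileName.startswith('.') or not fileName.endswith('.html'):
            continue
        with open(os.path.join(root, fileName)) as f:
            soup = BeautifulSoup(f.read(), 'lxml')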



This is what I got from Wikipedia about the byte 0xff, which shows up in the UTF-16 byte order mark:

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If the 16-bit units are represented in big-endian byte order, this BOM character will appear in the sequence of bytes as 0xFE followed by 0xFF. This sequence appears as the ISO-8859-1 characters þÿ in a text display that expects the text to be ISO-8859-1. If the 16-bit units use little-endian order, the sequence of bytes will have 0xFF followed by 0xFE. This sequence appears as the ISO-8859-1 characters ÿþ in a text display that expects the text to be ISO-8859-1. Programs expecting UTF-8 may show these or error indicators, depending on how they handle UTF-8 encoding errors. In all cases they will probably display the rest of the file as garbage (a UTF-16 text containing ASCII only will be fairly readable).

So I have two thoughts here:

(1) It could be that the content should be treated as UTF-16 instead of UTF-8.
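If you want to check that, you could look for a UTF-16 byte order mark in the raw bytes before decoding. A minimal sketch that would slot into your soup_def (the helper name is mine; note your traceback reports 0xff at position 284 rather than at the start, so this may not apply to your page):

def detect_bom_encoding(raw):
    # The first two bytes reveal UTF-16 endianness, per the Wikipedia excerpt above
    if raw.startswith('\xff\xfe'):
        return 'utf-16-le'
    if raw.startswith('\xfe\xff'):
        return 'utf-16-be'
    return None

raw = usock.read()
encoding = detect_bom_encoding(raw) or usock.headers.getparam('charset') or 'utf-8'
page = raw.decode(encoding, 'replace')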

(2) The error happens because you are trying to print the whole soup to the screen, and then it comes down to whether your IDE (Eclipse/PyCharm) is smart enough to display those Unicode characters.

If I were you, I would move on without printing the whole soup and collect only the piece you want. See whether you have a problem reaching that step. If there is no problem there, then why worry that you cannot print the whole soup to the screen?
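For example, to pull out just the page title instead of the whole document (a sketch; it assumes the title tag holds a simple string, and you would substitute whatever element you actually need):

# Collect only the piece you want instead of printing the whole soup
title_tag = soup.find('title')
if title_tag is not None:
    print title_tag.string.encode('utf-8')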

If you really want to print the soup to screen, try:

print soup.prettify(encoding='utf-16') 


But I am not trying to print it; I just save it to a variable, "soup". Maybe you are right about the UTF-16, but I cannot test that since I cannot save it to a variable in the first place.
