Encoding Error with Beautiful Soup: Character Maps to Undefined (Python)

Question

I've written a script that is supposed to retrieve HTML pages from a site and update their contents. The following function looks for a certain file on my system, then attempts to open and edit it:

    def update_sn(files_to_update, sn, table, title):
        paths = files_to_update['files']
        print('updating the sn')
        try:
            sn_htm = [s for s in paths if re.search('^((?!(Default|Notes|Latest_Addings)).)*htm$', s)][0]
            notes_htm = [s for s in paths if re.search('_Notes\.htm$', s)][0]
        except Exception:
            print('no sns were found')
            pass
        new_path_name = new_path(sn_htm, files_to_update['predecessor'], files_to_update['original'])
        new_sn_number = sn
        htm_text = open(sn_htm, 'rb').read().decode('cp1252')
        content = re.findall(r'(<table>.*?<\/table>.*)(?:<\/html>)', htm_text, re.I | re.S)
        minus_content = htm_text.replace(content[0], '')
        table_soup = BeautifulSoup(table, 'html.parser')
        new_soup = BeautifulSoup(minus_content, 'html.parser')
        head_title = new_soup.title.string.replace_with(new_sn_number)
        new_soup.link.insert_after(table_soup.div.next)
        with open(new_path_name, "w+") as file:
            result = str(new_soup)
            try:
                file.write(result)
            except Exception:
                print('Met exception. Changing encoding to cp1252')
                try:
                    file.write(result('cp1252'))
                except Exception:
                    print('cp1252 did\'nt work. Changing encoding to utf-8')
                    file.write(result.encode('utf8'))
                    try:
                        print('utf8 did\'nt work. Changing encoding to utf-16')
                        file.write(result.encode('utf16'))
                    except Exception:
                        pass

This works in the majority of cases, but sometimes it fails to write, at which point the exception kicks in and I try every feasible encoding without success:

    updating the sn
    Met exception. Changing encoding to cp1252
    cp1252 did'nt work. Changing encoding to utf-8
    Traceback (most recent call last):
      File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 145, in update_sn
        file.write(result)
      File "C:\Users\Joseph\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 4006-4007: character maps to <undefined>

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
        file.write(result('cp1252'))
    TypeError: 'str' object is not callable

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "scraper.py", line 79, in <module>
        get_latest(entries[0], int(num), entries[1])
      File "scraper.py", line 56, in get_latest
        update_files.update_sn(files_to_update, data['number'], data['table'], data['title'])
      File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 152, in update_sn
        file.write(result.encode('utf8'))
    TypeError: write() argument must be str, not bytes

Can anyone give me any pointers on how to better handle html data that might have inconsistent encoding?

Try checking the type of every argument you're passing to the function.
– Piyush S. Wanare, Commented May 29, 2018 at 6:18

t.m.adam · Accepted Answer · 2018-05-29 13:18:18Z

In your code you open the file in text mode, but then you attempt to write bytes (str.encode returns bytes) and so Python throws an exception:

TypeError: write() argument must be str, not bytes

If you want to write bytes, you should open the file in binary mode.
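
As a minimal sketch of that difference, the snippet below writes the same encoded content to a file opened in text mode and then in binary mode. The file name demo.bin and the sample string are made up for illustration; only the standard library is used.

```python
import os
import tempfile

text = 'Somé téxt'               # hypothetical sample content
payload = text.encode('utf-8')   # str.encode returns bytes

# Hypothetical file name inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'demo.bin')

# Text mode: write() only accepts str, so the bytes object is rejected.
text_mode_error = None
with open(path, 'w') as file:
    try:
        file.write(payload)
    except TypeError as exc:
        text_mode_error = str(exc)

# Binary mode: write() accepts bytes and stores them unchanged.
with open(path, 'wb') as file:
    file.write(payload)

# Read the bytes back and decode them to confirm nothing was lost.
with open(path, 'rb') as file:
    round_trip = file.read().decode('utf-8')

print(text_mode_error)
print(round_trip)
```

The text-mode attempt fails with the same kind of TypeError shown in the question's traceback, while the binary-mode write round-trips the content.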

BeautifulSoup detects the document's encoding (if the input is bytes) and converts it to a string automatically. We can access the detected encoding with .original_encoding, and use it to encode the content when writing to a file. For example,

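    from bs4 import BeautifulSoup

    # Pass bytes so that BeautifulSoup performs the encoding detection itself
    soup = BeautifulSoup(b'<tag>ascii characters</tag>', 'html.parser')
    data = soup.tag.text
    # original_encoding is None when the input was already a str
    encoding = soup.original_encoding or 'utf-8'
    print(encoding)  # ascii
    with open('my.file', 'wb+') as file:
        file.write(data.encode(encoding))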

In order for this to work you should pass your html as bytes to BeautifulSoup, so don't decode the response content.

If BeautifulSoup fails to detect the correct encoding for some reason, then you could try a list of possible encodings, like you have done in your code.

    data = 'Somé téxt'
    encodings = ['ascii', 'utf-8', 'cp1252']
    with open('my.file', 'wb+') as file:
        for encoding in encodings:
            try:
                file.write(data.encode(encoding))
                break
            except UnicodeEncodeError:
                print(encoding + ' failed.')

Alternatively, you could open the file in text mode and pass the encoding to open (instead of encoding the content yourself), but note that the built-in open in Python 2 has no encoding parameter.
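
A rough sketch of the text-mode alternative, assuming Python 3. The file name page.htm and the sample markup are hypothetical. The sample contains U+2192, a character with no cp1252 mapping, to mirror the kind of character that can trigger "character maps to <undefined>" when the file is opened without an explicit encoding on a system whose default is cp1252.

```python
import os
import tempfile

# Hypothetical content; U+2192 (rightwards arrow) cannot be encoded as cp1252
html = '<p>Somé téxt \u2192 done</p>'

# Hypothetical file name inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'page.htm')

# With an explicit encoding, the write no longer depends on the platform default.
with open(path, 'w', encoding='utf-8') as file:
    file.write(html)  # pass str; open() does the encoding

with open(path, 'r', encoding='utf-8') as file:
    restored = file.read()

# For comparison: encoding the same string as cp1252 fails.
try:
    html.encode('cp1252')
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

print(restored == html)
print(cp1252_ok)
```

Whichever encoding is chosen here should also be the one declared in the HTML itself, otherwise browsers may misread the saved page.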

Toto Lele · Accepted Answer · 2018-05-29 06:30:00Z

Just out of curiosity, is the line file.write(result('cp1252')) a typo? It seems to be missing the .encode method call.

    Traceback (most recent call last):
      File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
        file.write(result('cp1252'))
    TypeError: 'str' object is not callable

Does it work if you change that line to file.write(result.encode('cp1252'))?

I once had a similar encoding problem when writing to a file, and worked out my own solution with help from the following thread:

Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence.

My problem was solved by changing the parser from html.parser to html5lib. I traced the root cause to a malformed HTML tag, which the html5lib parser handled. For your reference, this is the documentation for each parser provided by BeautifulSoup.

Hope this helps

Yes, that was a typo. I corrected it in my code, but I still get the same error.
