1

I'm using BeautifulSoup to parse a bunch of web pages which I downloaded locally using WGet.

I'm reading in the file like this:

file = open(file_name, 'r', encoding='utf-8').read()
soup = BeautifulSoup(file, 'html5lib')

I'm using this soup object to get text, which I am then writing to a .json file like this:

f.write('"text": "' + str(text.encode('utf-8')) ) 

However, when I open the .json file I see strings like this:

and\xe2\x80\x94in spite of

He hadn\xe2\x80\x99t shaved in a few days at least

and Michael can go.\xe2\x80\x9d\xc2\xa0 Her voice

I get that these weird characters are not UTF-8 so python doesn't know what to do with them. But I don't know how to fix this.

Thanks for any help.

EDIT: I'm using python3

Also, if I remove the part where I encode the text before I write it, I get the following error: UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 264: ordinal not in range(128)

4
  • Are you opening the file as UTF-8 encoded? Commented Aug 12, 2017 at 14:48
  • It looks like you're using Python 3. You should always mention the Python version with Unicode questions, since Python 2 & 3 have big differences in that area. But anyway, those hex sequences like \xe2\x80\x94 are actually valid UTF-8 multibyte sequences. When properly decoded, they become and—in spite of He hadn’t shaved in a few days at least and Michael can go.”  Her voice. I used this code to perform that transformation: s.encode('latin1').decode(). But I don't know BeautifulSoup, so I can't tell you the proper way to fix this. Commented Aug 12, 2017 at 14:57
  • Suggested reading: joelonsoftware.com/2003/10/08/… Commented Aug 12, 2017 at 16:14
  • Also: nedbatchelder.com/text/unipain.html Commented Aug 12, 2017 at 16:14
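
The latin1 round trip mentioned in the comments above can be sketched roughly as follows. This is only an illustration, assuming the damaged text is already a Python str in which each \xNN is a single character; the names garbled and repaired are made up for the example.

```python
# Hypothetical sample: a str whose characters are really UTF-8 bytes
# that were misread one byte per character.
garbled = 'He hadn\xe2\x80\x99t shaved in a few days at least'

# latin1 maps code points 0-255 back to the identical byte values,
# so encode() recovers the original bytes; decoding those bytes
# as UTF-8 then restores the intended text.
repaired = garbled.encode('latin1').decode('utf-8')

# ascii() keeps the output printable even on an ASCII-only console.
print(ascii(repaired))
```

This kind of repair is a workaround for text that is already damaged. It does not replace writing the file correctly in the first place.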

2 Answers

3

With str(text.encode('utf-8')) you get:

>>> text = 'He hadn’t shaved in a few days'
>>> text.encode('utf8')
b'He hadn\xe2\x80\x99t shaved in a few days'
>>> str(text.encode('utf8'))
"b'He hadn\\xe2\\x80\\x99t shaved in a few days'"
>>> print(str(text.encode('utf8')))
b'He hadn\xe2\x80\x99t shaved in a few days'

So you are getting exactly what you unintentionally wrote to the file.

Instead of manually building the JSON, use the json module. Given UTF-8-encoded input of:

<html>
<p>He hadn’t shaved in a few days</p>
</html>

Then:

from bs4 import BeautifulSoup
import json

# Good practice:
# Decode text data to Unicode when read into a program.
# Process text as Unicode in the program.
# Encode text when leaving the program, such as:
#   Writing to database.
#   Sending over a network socket.
#   Writing to a file.

# Read the content as Unicode text.
with open('test.html','r',encoding='utf8') as file:
    content = file.read()

# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(content, 'html.parser')
text = soup.find('p').text  # Unicode string!

# Build the dictionary to be written in JSON format.
# Leave as Unicode!
items = {'text':text}

# Output as UTF-8-encoded data.
#
# ensure_ascii=False makes the non-ASCII characters in the file readable,
# but it works without it.  The file will just have Unicode escapes.
#
with open('out.json','w',encoding='utf8') as out:
    json.dump(items,out,ensure_ascii=False)

# Read and decode the data back from the file and turn it back into
# a dictionary.
with open('out.json','r',encoding='utf8') as file:
    data = json.load(file)

print(data)

Output (Python dict):

{'text': 'He hadn’t shaved in a few days'} 

Content of file when ensure_ascii=False:

{"text": "He hadn’t shaved in a few days"} 

Content of file when ensure_ascii=True (the default):

{"text": "He hadn\u2019t shaved in a few days"} 
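
To check which setting produces which form, a small sketch using json.dumps on an in-memory dictionary may help. The variable names here are illustrative only, and nothing is written to disk.

```python
import json

# Same sample sentence as above, with the apostrophe written as an escape
# so this snippet itself stays pure ASCII.
items = {'text': 'He hadn\u2019t shaved in a few days'}

escaped = json.dumps(items)                       # ensure_ascii defaults to True
readable = json.dumps(items, ensure_ascii=False)  # keeps the real character

# The default output contains only ASCII characters (a \u2019 escape);
# the ensure_ascii=False output contains the actual U+2019 character.
print(all(ord(c) < 128 for c in escaped))
print('\u2019' in readable)

# Either form decodes back to the same dictionary.
print(json.loads(escaped) == json.loads(readable) == items)
```

Both forms are valid JSON, so the choice mostly affects how readable the file is in a text editor.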

2 Comments

I tried that, but this gives me the following error: json.dump(items,f,ensure_ascii=False) File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/__init__.py", line 179, in dump fp.write(chunk) UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 258: ordinal not in range(128)
Never mind. I was able to fix that by using codecs.open(file_name,'w', encoding="utf-8") to open the file that I was writing to.
0

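Simplify your write: f.write('"text": "' + text) (or f.write('"text": "' + soup.prettify())). The text is already a Unicode str, so calling encode() on it and then str() on the result writes the repr of a bytes object rather than the text itself.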

Use version 4.6.0: https://pypi.python.org/pypi/beautifulsoup4/

Use python3 -- you will find the str diagnostics more helpful than in python2, they offer better guidance about when to encode or decode.

4 Comments

I assume the OP is already using Python 3, since open(file_name, 'r', encoding='utf-8') doesn't work in Python 2; at least, the standard open built-in function doesn't support an encoding keyword arg in Python 2 (although there are other opens that do).
If I prettify the soup, it turns into a string. I didn't show this in the question, but the text is fetched from HTML tags, which is why I need the actual soup object. Additionally, I tried removing the encoding when I wrote the text, but it created an error, which I just edited into the original question.
You didn't show us how you open'd f. It sounds like your open chose a (default) ascii codec instead of utf8 codec.
Mark Tolonen's code is very nice. Perhaps the best part is the comment block. Do be sure to follow the "Good practice" advice. You can view type(text) if you're ever unsure what sort of object you have at the moment. Also call encode or decode and view the type of that result.
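
The points in these comments about checking types and about the codec chosen by open can be illustrated with a short sketch. It assumes the asker's output file was opened with an ASCII default, which the error message suggests but the question does not show; encoding='ascii' is forced here to imitate that, and the file path and variable names are hypothetical.

```python
import os
import tempfile

text = 'He hadn\u2019t shaved'
encoded = text.encode('utf-8')

# Inspecting types shows which side of the encode/decode boundary you are on.
print(type(text).__name__)                     # str
print(type(encoded).__name__)                  # bytes
print(type(encoded.decode('utf-8')).__name__)  # str

path = os.path.join(tempfile.mkdtemp(), 'out.txt')

# Force the ascii codec to imitate a platform whose default encoding is ASCII.
try:
    with open(path, 'w', encoding='ascii') as f:
        f.write(text)
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Naming the encoding explicitly avoids depending on the platform default.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

with open(path, 'r', encoding='utf-8') as f:
    round_trip = f.read()

print(ascii_failed, round_trip == text)
```

In Python 3 the built-in open accepts the encoding argument directly, so it can be used in the same way as the codecs.open call mentioned earlier.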
