json encoded as UTF-8 characters. How do I process as json in Python Requests

Question

I am scraping a website that is rendering a JavaScript/JSON Object that looks like this:

{ "company": "\r\n \x3cdiv class=\"page-heading\"\x3e\x3ch1\x3eSEARCH RESULTS 1 - 40 OF 200\x3c/h1\x3e\x3c/div\x3e\r\n\r\n \x3cdiv class=\"right-content-list\"\x3e\r\n\r\n \x3cdiv class=\"top-buttons-adm-lft\"\x3e\r\n

I am attempting to process this as a JSON Object (which is what this looks like) using Python's Requests library.

I have used the following methods to encode/process the text:

    unicodedata.normalize("NFKD", get_city_json.text).encode('utf-8', 'ignore')
    unicodedata.normalize("NFKD", get_city_json.text).encode('ascii', 'ignore')
    unicode(get_city_json.text)

However, even after repeated attempts, the text still comes out with the same encoded characters (for example \x3c). The Content-Type returned by the web app is "text/javascript; charset=utf-8".

I want to be able to process it as a regular JSON/JavaScript Object for parsing and reading.

Help would be greatly appreciated!

Martin Konecny · Accepted Answer · 2014-05-29 03:38:03Z

That isn't UTF-8. It is HTML-encoded text.

You can decode it using the following:

Python 2

    import HTMLParser
    html_parser = HTMLParser.HTMLParser()
    unescaped = html_parser.unescape(json_value)
    print unescaped

Python 3

    import html
    # html.unescape() (Python 3.4+) replaces HTMLParser.unescape(), which was removed in Python 3.9
    unescaped = html.unescape(json_value)
    print(unescaped)

If you unescape your string with these you should get

<div class="page-heading"><h1>SEARCH RESULTS 1 - 40 OF 200</h1></div> <div class="right-content-list"> <div class="top-buttons-adm-lft">
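One caveat: the sample in the question appears to use JavaScript-style `\x3c` / `\x3e` escapes rather than HTML entities such as `&lt;`, and HTML unescaping leaves those untouched. Strict JSON also rejects `\xNN`, so `json.loads` would fail on the raw body. Below is a minimal Python 3 sketch that rewrites each `\xNN` as `\u00NN` before parsing. It assumes the response body looks like the sample in the question; `raw_text` is a hypothetical stand-in for `get_city_json.text`, and `js_hex_to_json_escapes` is a helper name made up for illustration.

```python
import json
import re

# Hypothetical stand-in for get_city_json.text (the raw response body).
# It contains JavaScript-style \xNN escapes, which strict JSON does not allow.
raw_text = r'{ "company": "\r\n \x3cdiv class=\"page-heading\"\x3e\x3ch1\x3eSEARCH RESULTS 1 - 40 OF 200\x3c/h1\x3e\x3c/div\x3e" }'

# Match one backslash escape at a time, so that an escaped backslash
# followed by "x3c" (written \\x3c) is not mistaken for a hex escape.
_ESCAPE = re.compile(r'\\(?:x([0-9a-fA-F]{2})|.)', re.DOTALL)


def js_hex_to_json_escapes(text):
    """Rewrite \\xNN escapes as \\u00NN so that json.loads accepts them."""
    def fix(match):
        hex_digits = match.group(1)
        if hex_digits is None:
            # Any other escape (\", \\, \r, \n, ...) is left unchanged.
            return match.group(0)
        return '\\u00' + hex_digits
    return _ESCAPE.sub(fix, text)


data = json.loads(js_hex_to_json_escapes(raw_text))
print(data["company"])
```

With Requests, the same helper would be applied to `get_city_json.text` before calling `json.loads`; `get_city_json.json()` would likely raise on the unconverted body for the same reason.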

edited May 29, 2014 at 3:38

answered May 29, 2014 at 3:31

Martin Konecny

5 Comments

Abhay Bhargav Over a year ago

This isn't working for me. I am getting an error like this: UnicodeEncodeError: 'ascii' codec can't encode character '\xbb' in position 33759: ordinal not in range(128)

Martin Konecny Over a year ago

This isn't related to the code I gave you. You need to remove the normalize/encode functions you posted in your question.

Abhay Bhargav Over a year ago

No, I have removed those functions. I am attempting to process this directly. The error was thrown when attempting to print unescaped

Martin Konecny Over a year ago

I see. Some machines with older Python versions have this problem. There are workarounds here for the print problem: stackoverflow.com/questions/3224268/python-unicode-encode-error

Abhay Bhargav Over a year ago

I am able to print it now, but the HTMLParser doesn't seem to be unescaping anything. I get the same representation as before.
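For the UnicodeEncodeError on print reported in the comments above, one possible workaround (a sketch, not necessarily what the linked question recommends) is to replace any character the output encoding cannot represent with a backslash escape before printing. `to_printable` is a hypothetical helper name, and the ASCII default only mirrors the codec named in the error message.

```python
import sys


def to_printable(text, encoding="ascii"):
    """Return text with characters that `encoding` cannot represent
    replaced by backslash escapes such as \\xbb."""
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# Example: U+00BB is the character named in the error message above.
sample = "SEARCH RESULTS \xbb"
print(to_printable(sample, sys.stdout.encoding or "ascii"))
```

On a terminal whose encoding can represent the character, the text is printed unchanged; only unrepresentable characters are escaped.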
