I am scraping a website that is rendering a JavaScript/JSON Object that looks like this:

{ "company": "\r\n \x3cdiv class=\"page-heading\"\x3e\x3ch1\x3eSEARCH RESULTS 1 - 40 OF 200\x3c/h1\x3e\x3c/div\x3e\r\n\r\n \x3cdiv class=\"right-content-list\"\x3e\r\n\r\n \x3cdiv class=\"top-buttons-adm-lft\"\x3e\r\n 

I am attempting to process this as a JSON Object (which is what this looks like) using Python's Requests library.

I have used the following methods to encode/process the text:

unicodedata.normalize("NFKD", get_city_json.text).encode('utf-8', 'ignore')
unicodedata.normalize("NFKD", get_city_json.text).encode('ascii', 'ignore')
unicode(get_city_json.text)

However, even after repeated attempts, the text still comes back with the same escaped characters (such as \x3c), which I assume come from the UTF-8 encoding. The Content-Type returned by the web app is "text/javascript; charset=utf-8".

I want to be able to process it as a regular JSON/JavaScript Object for parsing and reading.

Help would be greatly appreciated!

1 Answer

That isn't UTF-8. It is HTML-encoded text.

You can decode it using the following:

Python 2

import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(json_value)
print unescaped

Python 3

import html
# html.unescape (Python 3.4+) replaces HTMLParser.unescape, which was removed in Python 3.9
unescaped = html.unescape(json_value)
print(unescaped)

If you unescape your string with these you should get

<div class="page-heading"><h1>SEARCH RESULTS 1 - 40 OF 200</h1></div> <div class="right-content-list"> <div class="top-buttons-adm-lft"> 
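As a quick check of what HTML unescaping covers, here is a minimal Python 3 sketch. The `entity_encoded` and `hex_escaped` strings are hypothetical samples modelled on the markup in the question, not the site's actual response. It suggests that entity references such as `&lt;` are decoded, while JavaScript-style `\x3c` escapes pass through unchanged.

```python
import html

# Hypothetical entity-encoded version of the heading markup from the question.
entity_encoded = (
    '&lt;div class=&quot;page-heading&quot;&gt;'
    '&lt;h1&gt;SEARCH RESULTS 1 - 40 OF 200&lt;/h1&gt;&lt;/div&gt;'
)
unescaped = html.unescape(entity_encoded)
print(unescaped)

# A raw string keeps the backslashes literal, as they appear in the response body.
# These are JavaScript hex escapes, not HTML entities, so unescape leaves them alone.
hex_escaped = r'\x3ch1\x3e'
still_escaped = html.unescape(hex_escaped)
print(still_escaped == hex_escaped)
```

If the second comparison holds for the real payload as well, the text needs a different decoding step than HTML unescaping.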

5 Comments

This isn't working for me. I am getting an error like this: UnicodeEncodeError: 'ascii' codec can't encode character '\xbb' in position 33759: ordinal not in range(128)
This isn't related to the code I gave you. You need to remove the normalize/encode functions you posted in your question.
No, I have removed those functions. I am attempting to process this directly. The error was thrown when attempting to print unescaped.
I see. Some machines with older Python versions have this problem. There are workarounds here for the print problem: stackoverflow.com/questions/3224268/python-unicode-encode-error
I am able to print it now, but the HTMLParser doesn't seem to be unescaping anything. The output looks the same as before.
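One possible explanation for the unchanged output: the `\x3c` and `\x3e` sequences in the question look like JavaScript hex escapes rather than HTML entities, and `\x` is not a valid escape in strict JSON. The following is a minimal sketch, assuming the response body is otherwise valid JSON. `raw_text` is a hypothetical, shortened stand-in for `get_city_json.text`, and `js_hex_to_json` is an illustrative helper name, not part of any library.

```python
import json
import re

# Hypothetical, shortened payload modelled on the sample in the question.
# The raw string keeps the backslashes literal, as they would arrive over the wire.
raw_text = (
    r'{ "company": "\r\n \x3cdiv class=\"page-heading\"\x3e'
    r'\x3ch1\x3eSEARCH RESULTS 1 - 40 OF 200\x3c/h1\x3e\x3c/div\x3e" }'
)

def js_hex_to_json(text):
    # Rewrite each JavaScript hex escape (backslash, x, two hex digits)
    # as the equivalent JSON unicode escape (backslash, u, 00, same digits).
    # Assumption: the payload contains no escaped backslash directly before "x".
    return re.sub(r'\\x([0-9A-Fa-f]{2})', r'\\u00\1', text)

data = json.loads(js_hex_to_json(raw_text))
company_html = data["company"]
print(company_html.strip())
```

After this step `company_html` is an ordinary string of markup, which could then be handed to an HTML parser. Whether the real response parses this way depends on the rest of the payload, which is truncated in the question.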
