I have a Python program which crawls data from a site and returns JSON. The crawled site declares charset = ISO-8859-1 in its meta tag. Here is the source code:
    import requests

    url = 'https://www.example.com'
    source_code = requests.get(url)
    plain_text = source_code.text

After that I extract the information with Beautiful Soup and then build the JSON. The problem is that some symbols, e.g. the € symbol, are displayed as \u0080 or \x80 (in Python), so I can't use or decode them in PHP. So I tried plain_text.decode('ISO-8859-1') and plain_text.decode('cp1252') so that I could encode the result as UTF-8 afterwards, but every time I get the error: 'ascii' codec can't encode character u'\xf6' in position 8496: ordinal not in range(128).
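A minimal sketch of what is probably going on, assuming the server really sends Windows-1252 bytes while the text is decoded as ISO-8859-1: the byte 0x80 maps to the control character U+0080 in ISO-8859-1 but to the euro sign in cp1252. The variable names are only illustrative. (In Python 2, calling .decode() on an already decoded unicode string first encodes it implicitly with the ascii codec, which would explain the 'ascii' codec error.)

```python
# The single byte a Windows-1252 page would contain for the euro sign.
raw = b"\x80"

# ISO-8859-1 maps 0x80 to the C1 control character U+0080.
as_latin1 = raw.decode("iso-8859-1")

# cp1252 maps the same byte to the euro sign U+20AC.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```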
EDIT
The new code, after @ChrisKoston's suggestion to use .content instead of .text:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.example.com'
    source_code = requests.get(url)
    plain_text = source_code.content
    the_sourcecode = plain_text.decode('cp1252').encode('UTF-8')
    soup = BeautifulSoup(the_sourcecode, 'html.parser')

Encoding and decoding now work, but the character problem remains.
EDIT2
The solution is to decode the raw bytes with .content.decode('cp1252') and pass the resulting text to Beautiful Soup:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.example.com'
    source_code = requests.get(url)
    plain_text = source_code.content.decode('cp1252')
    soup = BeautifulSoup(plain_text, 'html.parser')

Special thanks to Tomalak for the solution.
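For the JSON step, here is a rough sketch of how correctly decoded text ends up in the output. It assumes the standard json module is used; page_bytes_to_json and the "price" key are hypothetical names, and the sample bytes stand in for source_code.content so the snippet runs without a network request.

```python
import json


def page_bytes_to_json(raw_bytes, source_encoding="cp1252"):
    """Decode raw page bytes and serialize a small record as JSON."""
    text = raw_bytes.decode(source_encoding)
    # With the default ensure_ascii=True, non-ASCII characters are
    # written as \uXXXX escapes, which any JSON parser can read back.
    return json.dumps({"price": text})


# Stand-in for source_code.content: "5 €" encoded as Windows-1252.
sample = "5 \u20ac".encode("cp1252")
payload = page_bytes_to_json(sample)
print(payload)
```

With the euro sign decoded as U+20AC instead of U+0080, the escape in the JSON is \u20ac, which PHP's json_decode should turn back into €.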
plain_text.decode('cp1252').encode('utf-8') does not change the value of plain_text.
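A small illustration of that point, using a made-up byte string rather than the real page: decode() and encode() return new objects, so the result has to be assigned to a name to have any effect.

```python
raw = b"\x80"  # illustrative stand-in for the downloaded bytes

# The converted value is computed and then thrown away.
raw.decode("cp1252").encode("utf-8")
unchanged = raw

# Assigning the result keeps the converted bytes.
converted = raw.decode("cp1252").encode("utf-8")

print(unchanged)
print(converted)
```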