I have a Python program that crawls data from a site and returns JSON. The crawled site declares charset=ISO-8859-1 in its meta tag. Here is the source code:

    url = 'https://www.example.com'
    source_code = requests.get(url)
    plain_text = source_code.text

After that I extract the information with Beautiful Soup and build the JSON. The problem is that some symbols, e.g. the € symbol, are displayed as \u0080 or \x80 (in Python), so I can't use or decode them in PHP. So I tried plain_text.decode('ISO-8859-1') and plain_text.decode('cp1252') so that I could encode the result as UTF-8 afterwards, but every time I get the error: 'ascii' codec can't encode character u'\xf6' in position 8496: ordinal not in range(128).

EDIT

The new code after @ChrisKoston's suggestion to use .content instead of .text:

    url = 'https://www.example.com'
    source_code = requests.get(url)
    plain_text = source_code.content
    the_sourcecode = plain_text.decode('cp1252').encode('UTF-8')
    soup = BeautifulSoup(the_sourcecode, 'html.parser')

Encoding and decoding now work, but the character problem remains.

EDIT2

The solution is to use .content.decode('cp1252'):

    url = 'https://www.example.com'
    source_code = requests.get(url)
    plain_text = source_code.content.decode('cp1252')
    soup = BeautifulSoup(plain_text, 'html.parser')

Special thanks to Tomalak for the solution

  • Try using source_code.content instead of .text Commented Nov 17, 2016 at 16:49
  • @ChrisKoston Thank you! Now I am able to decode and encode the plain_text but sadly it does not solve the character issue. I posted the new code above. Commented Nov 17, 2016 at 17:06
  • Hint: plain_text.decode('cp1252').encode('utf-8') does not change the value of plain_text. Commented Nov 17, 2016 at 17:08
  • @Tomalak Yeah, you are right. I edited the source code again, but still no change. Commented Nov 17, 2016 at 17:13

1 Answer

You must actually store the result of decode() somewhere because it does not modify the original variable.
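A minimal sketch of that point, using a made-up byte string rather than a real HTTP response (the sample value `raw` is an assumption for illustration):

```python
# Hypothetical sample: "5 €" as a cp1252-encoded page would send it.
raw = b"5 \x80"

# Calling decode() and discarding the result leaves `raw` untouched.
raw.decode("cp1252")

# Binding the return value to a name is what keeps the decoded string.
text = raw.decode("cp1252")

print(type(raw).__name__, type(text).__name__)
```

Here `raw` is still a bytes object after the first call; only `text` holds the decoded string.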

Another thing:

  • decode() turns a sequence of bytes into a string.
  • encode() does the opposite: it turns a string into a sequence of bytes.
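The two directions can be sketched with the single byte from the question. This assumes the page is really served as cp1252 even though it declares ISO-8859-1, which would explain why \x80 shows up:

```python
euro_bytes = b"\x80"  # the Euro sign as a single cp1252 byte

# decode(): bytes -> string. The chosen codec decides which character you get.
as_cp1252 = euro_bytes.decode("cp1252")      # U+20AC, the actual Euro sign
as_latin1 = euro_bytes.decode("iso-8859-1")  # U+0080, an invisible control character

# encode(): string -> bytes, here as UTF-8 for consumers that expect it.
utf8_bytes = as_cp1252.encode("utf-8")

print(repr(as_cp1252), repr(as_latin1), utf8_bytes)
```

Decoding as ISO-8859-1 never fails, because every byte maps to some code point, so a wrong guess produces \x80 silently instead of raising an error.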

BeautifulSoup is happy with strings; you don't need to use encode() at all.

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.example.com'
    response = requests.get(url)
    html = response.content.decode('cp1252')
    soup = BeautifulSoup(html, 'html.parser')

Hint: For working with HTML you might want to look at pyquery instead of BeautifulSoup.

4 Comments

Thank you for your quick help. I edited the source code, but the character is still \x80 when I run the program.
\x80 is the cp1252 character code for the Euro symbol. Don't look at the IDLE console, it displays characters this way when it wants to. Write the string to a file and look again.
This worked for the title now! Thanks a lot for that. The description is still not working. I'll post the code in the question.
Now everything is working. I had to replace .text with .content there, too. Thank you so much for your help!
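The advice in the comments to write the string to a file, instead of trusting how a console echoes it, could look roughly like this. The `record` dictionary and the temporary file name are placeholders, not part of the original program:

```python
import json
import os
import tempfile

record = {"price": "5 \u20ac"}  # hypothetical scraped value containing a Euro sign

# By default json.dumps emits ASCII only and escapes the Euro sign as \u20ac.
escaped = json.dumps(record)
# With ensure_ascii=False the character itself is kept in the output string.
literal = json.dumps(record, ensure_ascii=False)

# Write the JSON as UTF-8 and read the raw bytes back to see what is on disk.
path = os.path.join(tempfile.mkdtemp(), "out.json")
with open(path, "w", encoding="utf-8") as fh:
    fh.write(literal)

with open(path, "rb") as fh:
    on_disk = fh.read()

print(escaped)
print(on_disk)
```

Both forms are valid JSON and parse back to the same value, so an escaped \u20ac in the output is not by itself a sign of a broken encoding. A \u0080, on the other hand, suggests the bytes were decoded with the wrong codec earlier.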
