
I'm running into an encoding issue with BeautifulSoup. I'm trying to parse Open Graph titles, but non-ASCII characters are dropped from the result.

    from bs4 import BeautifulSoup
    doc = BeautifulSoup(html, "lxml")
    doc.html.head.findAll('meta', attrs={'property': 'og:title'})

For http://mattilintulahti.net/mediablogi/2013/02/11/19-asiaa-joita-et-tieda-mediayhtiosta-nimeltaan-red-bull/ it prints the following for the content attribute:

19 asiaa joita et tied mediayhtist nimeltn Red Bull 

The correct title is:

19 asiaa joita et tiedä mediayhtiöstä nimeltään Red Bull 

Any advice on how to get UTF-8 to work properly?

  • What operating system? Works for me on Linux. Commented Feb 14, 2013 at 23:34
  • Quick nit: find_all(..) is preferable to findAll(..) for PEP 8 reasons. Commented Feb 15, 2013 at 0:10

1 Answer

I'm not able to reproduce the problem:

    import urllib2  # Python 2 standard library
    import bs4 as bs

    url = 'http://mattilintulahti.net/mediablogi/2013/02/11/19-asiaa-joita-et-tieda-mediayhtiosta-nimeltaan-red-bull/'
    # Pass the raw bytes to BeautifulSoup and let it detect the encoding
    html = urllib2.urlopen(url).read()
    doc = bs.BeautifulSoup(html, 'lxml')
    for meta in doc.html.head.findAll('meta', attrs={'property': 'og:title'}):
        print(meta.attrs['content'])

yields

19 asiaa joita et tiedä mediayhtiöstä nimeltään Red Bull 

If this doesn't help, please show your code.


1 Comment

You're correct, this actually works. While investigating this, I had accidentally copy-pasted one line too many, which messed things up: html = unicode(html, errors='ignore')
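
For context on why that stray line drops characters: in Python 2, unicode(html, errors='ignore') decodes with the default ASCII codec when no encoding is given, so every non-ASCII byte is silently discarded. Below is a minimal Python 3 sketch of the same effect, assuming the page is served as UTF-8. The meta tag is a hand-written stand-in for illustration, not the real page source.

```python
# Hypothetical stand-in for the downloaded page, encoded as UTF-8 bytes.
title = "19 asiaa joita et tiedä mediayhtiöstä nimeltään Red Bull"
html_bytes = ('<meta property="og:title" content="%s" />' % title).encode("utf-8")

# Equivalent of the accidental Python 2 line unicode(html, errors='ignore'):
# decoding as ASCII and ignoring errors throws away every non-ASCII byte.
lossy = html_bytes.decode("ascii", errors="ignore")

# Decoding with the encoding the bytes were actually written in keeps them.
correct = html_bytes.decode("utf-8")

print(lossy)
print(correct)
```

The first line printed should show the same damaged title as in the question, the second the intact one. Passing the undecoded bytes straight to BeautifulSoup, as the answer does, avoids the problem because the parser works out the encoding itself.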
