Html wrongly encoded fetched by Nokogiri

Question

I use Nokogiri to parse an HTML page. I need both the text and the image tags in the page, so I use inner_html instead of the content method. The value returned by content is encoded correctly, but the value returned by inner_html is wrongly encoded. Note that the page is in Chinese and does not use UTF-8 encoding.

Here is my code:

    # encoding: utf-8
    require 'rubygems'
    require 'nokogiri'
    require 'open-uri'
    require 'iconv'

    doc = Nokogiri::HTML.parse(open("http://www.sfzt.org/advise/view.asp?id=536"), nil, 'gb18030')

    doc.css('td.font_info').each do |link|
      # output, correct but not what I expect: 目前市面上影响比
      puts link.content

      # output, wrong and not what I expect: <img ....></img>Ŀǰ??????Ӱ??Ƚϴ?Ľ????
      # I expect: <img ....></img>目前市面上影响比
      puts link.inner_html
    end

What version of Ruby are you using? What version of Nokogiri? What is your expectation? When I run your code above under Ruby 1.9, I get a UTF-8 encoded string that starts with "目前市面上影响比较大的讲述《论语".
– Phrogz, Commented Jan 6, 2012 at 18:13
@Phrogz I use Ruby 1.9.2. If I use link.content, the output is correct (as you mentioned above). But besides plain text, I also want to get the HTML tags, like img, from the page. In that case the output is not UTF-8 encoded; it looks something like Ŀǰ??????Ӱ??Ƚϴ?Ľ??????????
– Frankel, Commented Jan 7, 2012 at 1:03
Please update your question to show exactly how to reproduce and verify the problem, and what you expected or desire instead.
– Phrogz, Commented Jan 8, 2012 at 17:33

kakutani · Accepted Answer · 2012-01-10 01:48:56Z

This is covered in the 'Encoding' section of the README: http://nokogiri.org/

Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return XML (like to_xml, to_html and inner_html) will return a string encoded like the source document.

So if you want the inner_html string as UTF-8, you have to convert it yourself:

puts link.inner_html.encode('utf-8') # for 1.9.x
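
To see why the conversion is needed, here is a minimal pure-Ruby sketch that leaves Nokogiri out entirely. It assumes the garbled output in the question comes from GB18030 bytes being displayed as if they were UTF-8, which matches the leading "Ŀǰ" shown there. The variable names are made up for illustration.

```ruby
# encoding: utf-8
# Pure-Ruby illustration of the encoding mismatch; Nokogiri is not involved.

utf8 = "目前"                   # UTF-8 string literal (first two characters of the page text)
gb   = utf8.encode("GB18030")   # same characters, transcoded to GB18030 bytes

# Relabel the GB18030 bytes as UTF-8 WITHOUT converting them.
# This is effectively what a UTF-8 terminal does when it prints them.
garbled = gb.dup.force_encoding("UTF-8")

# Transcode properly, as inner_html.encode('utf-8') does.
fixed = gb.encode("UTF-8")

puts gb.encoding     # GB18030
puts garbled         # Ŀǰ
puts fixed           # 目前
puts fixed == utf8   # true
```

The point is the difference between force_encoding, which only changes the label on the bytes, and encode, which rewrites the bytes for the target encoding.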

Hishalv · Accepted Answer · 2012-01-06 21:00:44Z

I think content strips out the tags cleanly, whereas inner_html does not strip them well, or at all.

"I think you can end up with some pretty weird states if you change the inner_html (which contain tags) while you are traversing. In other words, if you are traversing a node tree, you shouldn’t do anything that could add or remove nodes."

Try this:

    doc.css('td.font_info').each do |link|
      puts link.content
      some_stuff = link.inner_html
      link.children = Nokogiri::HTML.fragment(some_stuff, 'utf-8')
    end

@Hishalv Thanks. I tried your code, but the output is still wrongly encoded. I wonder if I need to convert the encoding manually.
