I am using Nokogiri to parse an HTML page. I need both the text and the image tags in the page, so I use inner_html instead of the content method. However, the value returned by content is encoded correctly, while the value returned by inner_html is not. Note that the page is in Chinese and does not use UTF-8 encoding.

Here is my code:

# encoding: utf-8
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'iconv'

doc = Nokogiri::HTML.parse(open("http://www.sfzt.org/advise/view.asp?id=536"), nil, 'gb18030')
doc.css('td.font_info').each do |link|
  # Output is correctly encoded, but not what I want (tags are stripped): 目前市面上影响比
  puts link.content
  # Output is wrongly encoded: <img ....></img>Ŀǰ??????Ӱ??Ƚϴ?Ľ????
  # What I expect: <img ....></img>目前市面上影响比
  puts link.inner_html
end
  • What version of Ruby are you using? What version of Nokogiri? What is your expectation? When I run your above code under Ruby 1.9 I get a UTF-8 encoded string that starts with "目前市面上影响比较大的讲述《论语". Commented Jan 6, 2012 at 18:13
  • @Phrogz I use Ruby 1.9.2. If I use link.content, the output is correct (as you mentioned above). But besides plain text, I also want to get the HTML tags, such as img, from the page. In that case the output is not UTF-8 encoded; it looks something like Ŀǰ??????Ӱ??Ƚϴ?Ľ?????????? Commented Jan 7, 2012 at 1:03
  • Please update your question showing exactly how to reproduce and verify the problem, and what you expected or desire instead. Commented Jan 8, 2012 at 17:33

2 Answers

This is covered in the 'Encoding' section of the README: http://nokogiri.org/

Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return XML (like to_xml, to_html and inner_html) will return a string encoded like the source document.

So you need to convert the inner_html string manually if you want it as a UTF-8 string:

puts link.inner_html.encode('utf-8') # for 1.9.x 
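
To see what that conversion does without fetching the page, here is a minimal sketch that uses only Ruby's built-in String methods. It assumes inner_html hands back the markup as GB18030 bytes, as the README passage describes; the variable names and the sample markup are made up for illustration and are not part of Nokogiri.

```ruby
# encoding: utf-8

# Hypothetical sample markup, standing in for what the page contains.
utf8_html = '<img src="a.gif">目前市面上影响比'

# Simulate a string encoded like the source document (GB18030 bytes).
gb_html = utf8_html.encode('GB18030')

# When the string is correctly tagged as GB18030, encode transcodes it.
converted = gb_html.encode('UTF-8')

# If the bytes arrive tagged as binary instead, re-tag them first:
# force_encoding only changes the label, encode changes the bytes.
raw_bytes = gb_html.dup.force_encoding('BINARY')
recovered = raw_bytes.force_encoding('GB18030').encode('UTF-8')

puts converted
puts recovered
```

The second half is only needed when the string's encoding label does not match its bytes; checking link.inner_html.encoding first should show which case applies.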

I think content strips out tags well, but the inner_html method on nodes does not do this very well, or at all.

"I think you can end up with some pretty weird states if you change the inner_html (which contain tags) while you are traversing. In other words, if you are traversing a node tree, you shouldn’t do anything that could add or remove nodes."

Try this:

doc.css('td.font_info').each do |link|
  puts link.content
  some_stuff = link.inner_html
  link.children = Nokogiri::HTML.fragment(some_stuff, 'utf-8')
end

2 Comments

You may want to clarify how this addresses the question.
@Hishalv Thanks. I tried your code, but the output is still wrongly encoded. I wonder if I need to do some encoding conversion manually.
