I use Nokogiri to parse an HTML page. I need both the text content and the image tags in the page, so I use the inner_html method instead of content. However, the value returned by content is encoded correctly, while the value returned by inner_html is not. Note that the page is in Chinese and does not use UTF-8 encoding.
Here is my code:
# encoding: utf-8
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'iconv'

doc = Nokogiri::HTML.parse(open("http://www.sfzt.org/advise/view.asp?id=536"), nil, 'gb18030')

doc.css('td.font_info').each do |link|
  # Output is correctly encoded, but it is not what I expect (no tags): 目前市面上影响比
  puts link.content

  # Output is wrongly encoded and not what I expect: <img ....></img>Ŀǰ??????Ӱ??Ƚϴ?Ľ????
  # I expect: <img ....></img>目前市面上影响比
  puts link.inner_html
end
The output of link.content is correct, as you mentioned above. But besides the plain text, I also want to get the HTML tags, such as img, from the page, and this time the result is not UTF-8 encoded. It outputs something like Ŀǰ??????Ӱ??Ƚϴ?Ľ??????????
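One possible explanation, offered as an assumption rather than something confirmed here, is that inner_html serializes the node in the document's own encoding (gb18030), so the bytes look garbled when printed to a UTF-8 terminal. The minimal sketch below uses only Ruby's standard String methods to simulate that situation and transcode the bytes to UTF-8. The helper name `to_utf8` and the sample markup are hypothetical and are not part of Nokogiri.

```ruby
# encoding: utf-8

# Hypothetical helper (not a Nokogiri method): label the raw bytes with the
# page's encoding, then transcode them to UTF-8.
def to_utf8(html, source_encoding = 'GB18030')
  html.dup.force_encoding(source_encoding).encode('UTF-8')
end

# Simulate what inner_html seems to return: GB18030 bytes that carry no
# useful encoding label. The markup is made up for illustration.
sample   = '<img src="a.png">目前市面上影响比'
gb_bytes = sample.encode('GB18030').force_encoding('BINARY')

fixed = to_utf8(gb_bytes)
puts fixed

# In the original loop this would be used as (untested against the real page):
#   puts to_utf8(link.inner_html)
# Nokogiri's serialization methods also take an :encoding option, so
#   puts link.inner_html(:encoding => 'UTF-8')
# may work as well, depending on the Nokogiri version.
```

If the string returned by inner_html already carries the correct GB18030 label, the force_encoding step changes nothing and only the encode call matters.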