
I’m using Rails 4.2.7. I’m currently using the following logic to parse a doc with Nokogiri:

content.xpath("//pre[@class='text-results']").xpath('text()').to_s 

In my HTML document, this content appears within my “text-results” block:

<pre class="text-results"><html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns="http://www.w3.org/TR/REC-html40"> <head> <meta name=Title content="&lt;p&gt;&lt;a href=http://mychiptime"> <meta name=Keywords content=""> <meta http-equiv=Content-Type content="text/html; charset=macintosh”>… 

I include this section because my parsing dies with the following error:

Error during processing: unknown encoding name - macintosh
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node.rb:627:in `find'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node.rb:627:in `serialize'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node.rb:786:in `to_format'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node.rb:642:in `to_html'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node.rb:512:in `to_s'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:187:in `block in each'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:186:in `upto'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:186:in `each'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:218:in `map'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:218:in `to_s'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `pre_process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_import_service.rb:77:in `process_my_object_data'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_my_object_finder_service.rb:82:in `process_my_object_link'
/Users/davea/Documents/workspace/myproject/app/services/abstract_my_object_finder_service.rb:29:in `block in process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_my_object_finder_service.rb:28:in `each'
/Users/davea/Documents/workspace/myproject/app/services/abstract_my_object_finder_service.rb:28:in `process_data'
/Users/davea/Documents/workspace/myproject/app/services/run_crawlers_service.rb:18:in `block in run_all_crawlers'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/activerecord-4.2.7.1/lib/active_record/relation/delegation.rb:46:in `each'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/activerecord-4.2.7.1/lib/active_record/relation/delegation.rb:46:in `each'
/Users/davea/Documents/workspace/myproject/app/services/run_crawlers_service.rb:5:in `run_all_crawlers'

Is there any way to make Nokogiri ignore this unknown encoding? I’m trying to get the content inside the <pre> tag as text, so I don’t need it parsed further.

I'm on Mac OS X El Capitan. Per the comment, here are my locale settings:

davea$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
  • set your locale to utf-8, see: perlgeek.de/en/article/set-up-a-clean-utf8-environment Commented Oct 11, 2016 at 13:58
  • I'm on Mac, but my locale settings are already utf-8. I included them as an edit to my question. Commented Oct 11, 2016 at 14:12
  • do export LC_ALL="en_US.UTF-8" before you run rails s and see if you have same problem Commented Oct 11, 2016 at 15:47
  • See stackoverflow.com/a/20521428/1020958 Commented Oct 11, 2016 at 23:55
  • Hi @bjhaid, are you saying I should change my source encoding or my external encoding? Commented Oct 13, 2016 at 13:45

2 Answers


Your HTML is invalid. You have a <pre> tag outside the <body>, so Nokogiri has to apply fixups, which usually produce questionable results.

This is what Nokogiri has to say about the document:

doc.errors # => [#<Nokogiri::XML::SyntaxError: htmlParseStartTag: misplaced <html> tag>, #<Nokogiri::XML::SyntaxError: htmlParseStartTag: misplaced <head> tag>, #<Nokogiri::XML::SyntaxError: AttValue: " expected>, #<Nokogiri::XML::SyntaxError: Couldn't find end of Start Tag meta>]
doc.to_html # => "<pre class=\"text-results\">\n\n\n<meta name=\"Title\" content=\"&lt;p&gt;&lt;a href=http://mychiptime\">\n<meta name=\"Keywords\" content=\"\">\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=macintosh”&gt;\n&lt;/head&gt;\n\"></pre>"

Looking at only the line in question, it's also confusing Nokogiri:

doc = Nokogiri::HTML::DocumentFragment.parse('<meta http-equiv=Content-Type content="text/html; charset=macintosh”>')
doc.errors # => [#<Nokogiri::XML::SyntaxError: AttValue: " expected>, #<Nokogiri::XML::SyntaxError: Couldn't find end of Start Tag meta>]
doc.to_html # => "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=macintosh”&gt;\">"

Notice that Nokogiri doesn't recognize a closing curly-quote as a terminator for the string content="text/html; charset=macintosh”.

You can't fix this within Nokogiri. You'll need to supply the appropriate structure and do a search and replace to convert the curly quotes before parsing the document. Hopefully the document doesn't contain curly quotes in the text inside the <body>, or you'll be altering that text too, which might be a problem for your use.
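The search and replace described above could be sketched in plain Ruby as follows. This is only an illustration: `straighten_quotes` is a hypothetical helper name, not part of Nokogiri, and it assumes the HTML is already available as a UTF-8 string.

```ruby
# Hypothetical helper (not part of Nokogiri): replace curly quotes with
# their straight ASCII equivalents so attribute values terminate properly.
# Caution: this also rewrites curly quotes that appear in body text.
def straighten_quotes(html)
  html.tr('“”‘’', %q(""''))
end

broken = '<meta http-equiv=Content-Type content="text/html; charset=macintosh”>'
fixed  = straighten_quotes(broken)
puts fixed
```

The cleaned string can then be handed to Nokogiri::HTML::DocumentFragment.parse in place of the original.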

The fact you have curly-quotes in places they shouldn't exist is curious. If your editor is converting from straight quotes to curly quotes then you need to immediately turn off that feature as it'll cause real havoc with coding. Good text editors for coding won't even offer the use of curly quotes because of the problems they cause.


Regarding the comment that "Nokogiri is complaining about the "macintosh" sequence as far as I can tell":

require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse('<meta http-equiv=Content-Type content="text/html; charset=macintosh">')
doc.at('meta')['content'] # => "text/html; charset=macintosh"

If the HTML is clean it doesn't care.


2 Comments

Unfortunately I don't control the content of this HTML. It is being served to me from an external web site. Don't worry about the curly quotes, that's just a transposition error. Nokogiri is complaining about the "macintosh" sequence as far as I can tell.
Nope, it's not worried about "macintosh" at all if the HTML is clean. See the added information.
0

See Nokogiri, open-uri, and Unicode Characters

When Nokogiri parses a document, it uses the encoding that the document specifies (unless you explicitly tell it what encoding to use).

"macintosh" is not a default Ruby encoding (see Encoding.list for a list of all encodings Ruby knows).

You can force Nokogiri to use an explicit encoding by passing it as an argument to parse.

# encoding is guessed from the document
doc = Nokogiri::HTML.parse(File.open('test.html'))
doc.xpath("//pre[@class='text-results']").xpath('text()').to_s
# ArgumentError: unknown encoding name - macintosh

# force Nokogiri to parse the document as 'utf-8'
doc = Nokogiri::HTML.parse(File.open('test.html'), nil, 'utf-8')
doc.xpath("//pre[@class='text-results']").xpath('text()').to_s
# => "\n\n\n"

The caveat is that Nokogiri really will parse the content as 'utf-8', meaning if any special characters are encoded using some other encoding (like macintosh), they may become garbled.
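If the bytes really are in the declared charset, one way to avoid that garbling is to transcode them to UTF-8 in Ruby before parsing. The following is a minimal sketch under stated assumptions: the raw document is available as a string of bytes, `CHARSET_ALIASES` and `to_utf8` are hypothetical names introduced here, and the alias relies on Ruby calling the Mac OS Roman encoding "macRoman".

```ruby
# Hypothetical alias table: maps charset labels Ruby does not know
# to the names Ruby uses for the same encoding.
CHARSET_ALIASES = { 'macintosh' => 'macRoman' }.freeze

# Hypothetical helper: reinterpret the raw bytes in the declared charset,
# then transcode them to UTF-8. Raises ArgumentError for unknown charsets.
def to_utf8(raw, charset)
  name = CHARSET_ALIASES.fetch(charset.downcase, charset)
  raw.dup.force_encoding(Encoding.find(name)).encode('UTF-8')
end

raw  = "caf\x8E".b                 # byte 0x8E is "é" in Mac Roman
text = to_utf8(raw, 'macintosh')
puts text
```

The resulting string could then be passed to Nokogiri::HTML.parse(text, nil, 'utf-8'), so that the forced encoding matches the actual bytes.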

2 Comments

The IANA/MIME charset name "macintosh" is called "macRoman" in Ruby. Nokogiri should theoretically handle this the same way it handles other MIME charsets, but you can provide an additional alias yourself: Encoding.find('macRoman').duplicate('macintosh')
@RJHunter's suggestion worked for me, but the method is now apparently "replicate" and not "duplicate"
