Ruby 1.9 - Invalid multibyte character (utf-8)

Question

I have a ruby file with only these two lines:

# encoding: utf-8
puts "—"

When I run it with ruby test_enc.rb it fails with:

test_enc.rb:2: invalid multibyte char (UTF-8)
test_enc.rb:2: unterminated string meets end of file

I don't know how to properly specify the character code of — (emdash), but vim tells me it is 151, Hex 97, Octal 227. It fails the same way with other characters like ã as well, so I doubt it is related specifically to that character. I am running on Windows XP and the version of ruby I'm using is:

ruby 1.9.1p430 (2010-08-16 revision 28998) [i386-mingw32]

I feel like there is something very obvious I am missing here. Any ideas?

EDIT: Learned a valuable lesson about assumptions today - specifically assuming your editor IS using UTF-8 without actually checking it. Oops!

Thanks for the quick and accurate replies all!

EDIT AGAIN: The 'setting up vim properly for utf-8' grew too big and wasn't really relevant to this question, so it is now a separate question.

Are you sure it's not coding: utf-8 (rather than encoding)? – Amokrane Chentir, Commented Mar 29, 2011 at 16:52
Both do the same thing. You can actually put asdfgibberishcoding: utf-8 and it works just the same. – Nick Knowlson, Commented Mar 29, 2011 at 16:54
What does 'puts __ENCODING__' say? (That is, ENCODING with two underscores on each side.) – Amokrane Chentir, Commented Mar 29, 2011 at 16:57

Jon Skeet · Accepted Answer · 2011-03-29 16:55:16Z

Given that Ruby is explicitly calling your attention to UTF-8, I strongly suspect that you haven't actually written out a UTF-8 file to start with. Make sure that Vim (or whatever text editor you're using to create the file) is really set to write out UTF-8.

Note that in UTF-8, any non-ASCII character will be represented by multiple bytes, not a single byte as you've described from the Vim diagnostics. I'd recommend using a binary file editor (or dump, or whatever) to really show what's in the text file though. Something that doesn't already have some preconceived notion of the encoding - something that isn't even trying to think of it as a text file.
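Ruby itself can serve as the "dump" here. The following is a minimal sketch, not something from the original answer: inspect_bytes is a hypothetical helper name, and the two sample strings only simulate how the script's second line might look on disk under each encoding.

```ruby
# Hypothetical helper: report the raw bytes of some data and whether they
# form valid UTF-8. The input is treated as bytes, never as text.
def inspect_bytes(raw)
  hex = raw.bytes.map { |b| format("%02X", b) }.join(" ")
  valid = raw.dup.force_encoding("UTF-8").valid_encoding?
  [hex, valid]
end

# Simulated file contents (assumptions for illustration only).
single_byte_saved = "puts \"\x97\"".b          # em dash stored as one byte, 0x97
utf8_saved        = "puts \"\xE2\x80\x94\"".b  # em dash stored as three UTF-8 bytes

[single_byte_saved, utf8_saved].each do |raw|
  hex, valid = inspect_bytes(raw)
  puts "#{hex}  valid UTF-8? #{valid}"
end
```

For a real file, something like inspect_bytes(File.binread("test_enc.rb")) should work the same way, since File.binread returns the bytes without any encoding conversion.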

Notepad lets you write out a file in UTF-8, so you might want to try that just to see what happens. (I don't have Ruby installed myself, otherwise I'd try it for you.)

I just had the same thought - what is vim actually saving the file as? When I checked I saw its encoding was set to latin1. I was wondering why those numbers didn't match up to what I saw in here.
Setting the encoding to ISO-8859-1 (to match what my editor is actually using) appears to fix it. I still see ù when I print it out, but I'm pretty sure that's just a windows terminal issue.
@Nick: Rather than change the encoding in the file, why not change what your editor uses? Then you won't be limited to just Latin-1, which is a pretty small range of characters. I'm sure Vim must support other encodings...
I think you're right, and that will help me long term as well. I'm remembering now a previous time I had problems with encoding and I'm pretty sure the cause was this same darn thing. For any other vim users that see this, put set encoding=utf-8 in your .vimrc and you'll be set.
Also, thanks very much for your help, and wow you answer questions quickly.
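
The leftover ù is consistent with the terminal guess above. Here is a rough sketch under two assumptions that the thread does not confirm: that the file held the single byte 0x97 (the value Vim reported), and that the Windows console used an OEM code page such as IBM850 (also known as CP850). It decodes that one byte under three different encodings:

```ruby
# The single byte Vim reported for the em dash (decimal 151, hex 97).
byte = "\x97".b

# Decode the same byte under three encodings and convert each to UTF-8.
# The choice of encodings is an assumption made for illustration.
decoded = {
  "Windows-1252" => byte.dup.force_encoding("Windows-1252").encode("UTF-8"),
  "ISO-8859-1"   => byte.dup.force_encoding("ISO-8859-1").encode("UTF-8"),
  "IBM850"       => byte.dup.force_encoding("IBM850").encode("UTF-8")
}

decoded.each do |name, text|
  # Print the Unicode code point rather than the glyph, so the result
  # does not depend on what the terminal can display.
  puts format("%-12s -> U+%04X", name, text.ord)
end
```

If those assumptions hold, the file was really Windows-1252 (where 0x97 is the em dash), and a console decoding the same byte as IBM850 would show ù instead.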

Bruno Rohée · Accepted Answer · 2011-03-29 17:05:15Z


Your file is in latin1. Ruby is right.

emdash would be encoded on two bytes not one in UTF-8.

3 Comments

Nick Knowlson Over a year ago

Thanks, your comment is spot on. :)

Jörg W Mittag Over a year ago

Three, actually: 0xE2 0x80 0x94.

Bruno Rohée Over a year ago

@Jörg: that's what I get for not checking ;)
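
The byte counts discussed in these comments can be checked directly in Ruby. This is a small sketch; utf8_hex is a hypothetical helper name, and the characters are written as Unicode escapes so the result does not depend on how this snippet itself is saved:

```ruby
# Return the UTF-8 bytes of a string as uppercase hex pairs.
def utf8_hex(text)
  text.encode("UTF-8").bytes.map { |b| format("%02X", b) }
end

em_dash = "\u2014"  # the em dash from the question
a_tilde = "\u00E3"  # the a-with-tilde character also mentioned there

puts "em dash: #{utf8_hex(em_dash).join(' ')} (#{em_dash.bytesize} bytes)"
puts "a-tilde: #{utf8_hex(a_tilde).join(' ')} (#{a_tilde.bytesize} bytes)"
```

Both characters take more than one byte in UTF-8, which is why a file containing a lone 0x97 byte cannot be valid UTF-8.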
