UTF-8 Encoding and decoding issue

Question

I'm having a problem converting text to and from UTF-8. Here I have a byte array:

byte[] c = new byte[] { 1, 2, 200 };

I'm converting it to a UTF-8 string and back to a byte array:

Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(c));

As I understand it, I should get back an array of 3 bytes, right? But here's what I'm getting:

byte[5] { 1, 2, 239, 191, 189 }

What's the reason for this? I understand that 239, 191, 189 is the UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER, from the Unicode Specials block.

Also this is part of a bigger problem.

Why are you doing that? { 1, 2, 3, 200 } is not text encoded as UTF-8. If you're aiming to propagate arbitrary binary data, use base64. It sounds like you're not "converting text" - you're converting binary data.
– Jon Skeet, Commented May 15, 2017 at 9:33
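
The base64 suggestion in the comment above could look roughly like this. It is a minimal sketch, assuming the goal is only to carry the three bytes from the question through a string and back; the class name Base64RoundTrip is made up for illustration.

```csharp
using System;

class Base64RoundTrip
{
    static void Main()
    {
        // The byte array from the question: arbitrary binary data, not UTF-8 text.
        byte[] original = { 1, 2, 200 };

        // Base64 maps any byte sequence to plain ASCII text, so no byte is lost or replaced.
        string text = Convert.ToBase64String(original);
        byte[] restored = Convert.FromBase64String(text);

        Console.WriteLine(text);                        // AQLI
        Console.WriteLine(string.Join(", ", restored)); // 1, 2, 200
    }
}
```

Unlike the UTF-8 round trip, this returns exactly the three original bytes, because base64 never has to interpret the bytes as characters.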

Mikhail Lobanov · Accepted Answer · 2017-05-15 09:35:48Z

Not all byte sequences are valid UTF-8. Your array (1, 2, 200) is one of the invalid ones, so the decoder substitutes the replacement character U+FFFD for the bad byte. Encoding that character back to UTF-8 produces the three bytes 239, 191, 189.

MSDN says about Encoding.UTF8:

It returns a UTF8Encoding object that provides a Unicode byte order mark (BOM). To instantiate a UTF8 encoding that doesn't provide a BOM, call any overload of the UTF8Encoding constructor.

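1) There is no BOM (https://en.wikipedia.org/wiki/Byte_order_mark) in your example, so the BOM is not what causes the extra bytes.

2) 200 (0xC8, binary 11001000) is a leading byte: the 110xxxxx pattern starts a two-byte sequence, so it must be followed by one continuation byte of the form 10xxxxxx. Here it is the last byte in the array, so the sequence is incomplete and therefore invalid.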
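
To see the invalid byte reported instead of silently replaced, one option is a strict decoder. The following is a minimal sketch, assuming the UTF8Encoding(bool, bool) constructor with throwOnInvalidBytes set to true; the class name StrictUtf8Check and the sample continuation byte 128 are chosen only for illustration.

```csharp
using System;
using System.Text;

class StrictUtf8Check
{
    static void Main()
    {
        byte[] c = { 1, 2, 200 };

        // 200 in binary: the leading 110 marks the first byte of a two-byte sequence.
        Console.WriteLine(Convert.ToString(c[2], 2)); // 11001000

        // No BOM, and throw on invalid bytes instead of inserting U+FFFD.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(c);
        }
        catch (DecoderFallbackException ex)
        {
            Console.WriteLine("Invalid UTF-8: " + ex.Message);
        }

        // With a continuation byte (10xxxxxx, here 128) after 200, the input is valid
        // UTF-8 and the round trip from the question preserves every byte.
        byte[] valid = { 1, 2, 200, 128 };
        byte[] roundTrip = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(valid));
        Console.WriteLine(string.Join(", ", roundTrip)); // 1, 2, 200, 128
    }
}
```

Encoding.UTF8 uses replacement fallback by default, which is why the original code returned 239, 191, 189 rather than throwing.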
