I'm having a problem converting data to and from UTF-8. I have a byte array:

byte[] c = new byte[] { 1, 2, 200 }; 

I convert it to a string using UTF-8 and then back to a byte array:

Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(c)); 

As I understand it, I should get back an array of 3 bytes, right? But here's what I'm getting:

byte[5] { 1, 2, 239, 191, 189 } 

What's the reason for this? I understand that 239, 191, 189 is the UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER, from the Unicode Specials block.

Also this is part of a bigger problem.

    Why are you doing that? { 1, 2, 200 } is not text encoded as UTF-8. If you're aiming to propagate arbitrary binary data, use base64. It sounds like you're not "converting text" - you're converting binary data. Commented May 15, 2017 at 9:33
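
The base64 suggestion in the comment above can be sketched as follows. This is a minimal illustration, assuming C# 9 or later so that top-level statements compile; the variable names are made up for the example.

```csharp
using System;

// Arbitrary binary data, not UTF-8 text.
byte[] data = { 1, 2, 200 };

// Base64 maps any byte sequence to ASCII text and back without loss.
string text = Convert.ToBase64String(data);      // "AQLI"
byte[] restored = Convert.FromBase64String(text);

Console.WriteLine(text);
Console.WriteLine(BitConverter.ToString(restored));
```

Unlike the UTF-8 round trip, this returns the same 3 bytes that went in, because base64 has no notion of an invalid input byte.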

1 Answer

Not all sequences of bytes are valid UTF-8. Your array (1, 2, 200) is not valid UTF-8, which is why the decoder substitutes the replacement character (U+FFFD) for the byte it cannot decode.
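
One way to see that the input is rejected, rather than silently altered, is to decode it with a strict UTF8Encoding instance. The sketch below assumes C# 9 or later (top-level statements); the variable names are illustrative only.

```csharp
using System;
using System.Text;

byte[] c = { 1, 2, 200 };

// Encoding.UTF8 does not throw on bad input; it substitutes U+FFFD.
string lenient = Encoding.UTF8.GetString(c);
bool replaced = lenient.IndexOf('\uFFFD') >= 0;

// A strict instance (no BOM, throwOnInvalidBytes = true) throws instead.
var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                              throwOnInvalidBytes: true);
bool rejected;
try
{
    strict.GetString(c);
    rejected = false;
}
catch (ArgumentException) // DecoderFallbackException derives from ArgumentException
{
    rejected = true;
}

Console.WriteLine($"replaced={replaced}, rejected={rejected}");
```

Using the strict instance is a reasonable way to detect invalid input early instead of discovering replacement characters later.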

MSDN says about Encoding.UTF8:

It returns a UTF8Encoding object that provides a Unicode byte order mark (BOM). To instantiate a UTF8 encoding that doesn't provide a BOM, call any overload of the UTF8Encoding constructor.

1) There is no BOM (https://en.wikipedia.org/wiki/Byte_order_mark) in your example, so the BOM is not the cause.

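2) 200 (0xC8, binary 11001000) is the leading byte of a two-byte sequence. It must be followed by one continuation byte of the form 10xxxxxx, but your array ends there, so the decoder replaces it with U+FFFD.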
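
To illustrate point 2, here is a small sketch that classifies a byte by its high bits and then shows that the round trip preserves the data once 200 is followed by a continuation byte. Classify is a hypothetical helper written for this example, not a framework API, and it only checks the bit pattern (it does not reject overlong or out-of-range lead bytes). The code assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Text;

// Hypothetical helper: classify one byte by the UTF-8 bit patterns.
static string Classify(byte b)
{
    if ((b & 0x80) == 0x00) return "single-byte (ASCII)";     // 0xxxxxxx
    if ((b & 0xC0) == 0x80) return "continuation";            // 10xxxxxx
    if ((b & 0xE0) == 0xC0) return "lead of 2-byte sequence"; // 110xxxxx
    if ((b & 0xF0) == 0xE0) return "lead of 3-byte sequence"; // 1110xxxx
    if ((b & 0xF8) == 0xF0) return "lead of 4-byte sequence"; // 11110xxx
    return "invalid";
}

string kind = Classify(200); // 200 = 0xC8 = 11001000

// With the one continuation byte that 200 expects (128 = 10000000),
// the input is valid UTF-8 and the round trip keeps every byte.
byte[] valid = { 1, 2, 200, 128 };
byte[] roundTripped = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(valid));

Console.WriteLine(kind);
Console.WriteLine(BitConverter.ToString(roundTripped));
```

This only demonstrates why the original array fails; appending bytes is not a fix for binary data, which should not be decoded as text in the first place.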
