
I am currently writing a C++ program that handles both Latin letters and Korean characters.

However, I learned that the size of char in C++ is only 1 byte. This means that in order to handle foreign characters or Unicode, it needs to use two (or more) chars for one character.

    string s = string("a가b나c다");
    cout << s.length();

prints 9

But my question is: how does the C++ runtime distinguish between the two different types of characters?

For example, if I make a char array of size 9, how does it know whether it holds 9 ASCII characters or 4 Unicode characters + 1 ASCII character?

And then I figured out this:

    char c;
    int a;
    const char* cp = "가나다라마바사아";   // a string literal is const
    for (int i = 0; cp[i] != '\0'; i++) { // stop at the terminator instead of a fixed 20
        c = a = cp[i];
        cout << "\n c val : " << c;
        cout << "\n a val : " << a;
    }

ONLY prints out negative values for a:

    c val :
    a val : -80
    c val :
    a val : -95
    c val :
    a val : -77
    c val :
    a val : -86
    c val :
    a val : -76
    c val :
    a val : -39

(each c prints as an unprintable byte, so it appears blank)

From which I can infer that for non-ASCII characters it only uses negative values? But isn't this quite a waste?

My question in summary: does C++ distinguish ASCII chars and Unicode chars only by checking whether they are negative?


Answer in summary: the parser decides whether to treat 1-4 chars as one glyph by looking at the first few bits of each char, so to some extent my assumption was valid.

  • ASCII is a subset of Unicode; there is no "distinguish". Commented Nov 16, 2017 at 3:39

1 Answer


how does the C++ runtime distinguish between the two different types of characters?

It doesn't. The compiler decided how to encode your string at compile time. In this case, the byte values you printed below (0xB0 0xA1 is 가 in EUC-KR) and the 2-byte syllables suggest it chose EUC-KR (CP949); many other compilers would choose UTF-8. Either way, the principle is the same: it's a multi-byte encoding.

How does it know whether it holds 9 ASCII characters or 4 Unicode characters + 1 ASCII character?

Again, it doesn't. Your string contains 9 char values (excluding any termination character). The number of actual "characters" (or "glyphs") that represents can only be determined by parsing the string. If you know it's UTF-8, you parse it accordingly.

From which I can infer that for non-ASCII characters it only uses negative values? But isn't this quite a waste?

No. Well, sort of. If you're interested, read a primer on Unicode (specifically UTF-8). You could read the actual standard, but it's enormous. Wikipedia should be sufficient for a better understanding.

You'll see that every byte of a multi-byte sequence has the high bit set. This makes it possible to parse multi-byte values correctly. It's not really that wasteful, because the standard is arranged such that wider encodings are generally reserved for less common characters.

The reason it output negatives is that plain char is a signed type on your platform. If you cast to unsigned char, you'll see the values are simply greater than 127. When you read more about how UTF-8 is encoded, you'll understand why.

My question in summary: does C++ distinguish ASCII chars and Unicode chars only by checking whether they are negative?

My answer in summary: No. "Negative" is a numeric system. You are probably accustomed to 2's-complement. Encode, or encode not: there is no "negative".


2 Comments

What I meant by 'negative' is asking whether it only uses half of the available values. So you're saying that when 2 chars add up to one character, the two chars are always greater than 127?
I have edited to provide some context, and a link so you can read about UTF-8. There is an explanation in there about the standard requiring backwards-compatibility with ASCII.
