
Is there any difference between these two string storage formats?


3 Answers


std::wstring is a container of wchar_t. The size of wchar_t is not specified—Windows compilers tend to use a 16-bit type, Unix compilers a 32-bit type.

UTF-16 is a way of encoding sequences of Unicode code points in sequences of 16-bit integers.

Using Visual Studio, if you use wide character literals (e.g. L"Hello World") that contain no characters outside of the BMP (the Basic Multilingual Plane, code points U+0000 through U+FFFF), you'll end up with UTF-16, but mostly the two concepts are unrelated. If you use characters outside the BMP, std::wstring will not translate surrogate pairs into Unicode code points for you, even if wchar_t is 16 bits.
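As a minimal sketch of that last point (assuming a 16-bit wchar_t as with Visual Studio, and a 32-bit wchar_t on a typical Linux compiler for comparison):

    #include <iostream>
    #include <string>

    int main() {
        // U+1F600 (GRINNING FACE) lies outside the BMP.
        std::wstring s = L"\U0001F600";

        // With a 16-bit wchar_t (e.g. Visual Studio) this prints 2: the literal
        // is encoded as a UTF-16 surrogate pair and the wstring stores the two
        // raw 16-bit units. With a 32-bit wchar_t (typical on Linux) it prints 1.
        std::cout << s.size() << '\n';

        // Either way, wstring never reassembles surrogate pairs into code
        // points for you; s[0] and s[1] are just the units that were inserted.
        return 0;
    }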


4 Comments

Do you mean that std::wstring is the same as UTF-16 only for non-BMP Unicode characters when used on the Windows operating system?
No. std::wstring is just a container of integers. The encoding of the container depends entirely on the data you insert into the container.
+1: For people unfamiliar with UTF it may be wise to define BMP.
Your last paragraph is the answer to my question. Thank you.

UTF-16 is a specific Unicode encoding. std::wstring is a string implementation that uses wchar_t as its underlying type for storing each character. (In contrast, regular std::string uses char).

The encoding used with wchar_t does not necessarily have to be UTF-16—it could also be UTF-32 for example.
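As a small illustration of that distinction (not from the original answer): wchar_t's width, and therefore wstring's natural encoding, is implementation-defined, whereas the char16_t/char32_t string types added in C++11 have literals with a guaranteed encoding.

    #include <iostream>
    #include <string>

    int main() {
        // wchar_t's width is implementation-defined: usually 2 bytes on
        // Windows and 4 bytes on Linux, so wstring's natural encoding varies.
        std::cout << sizeof(wchar_t) << '\n';

        // These C++11 types have literals with a guaranteed encoding.
        std::u16string utf16 = u"\U0001F600";  // UTF-16: surrogate pair, size() == 2
        std::u32string utf32 = U"\U0001F600";  // UTF-32: one code point, size() == 1
        std::cout << utf16.size() << ' ' << utf32.size() << '\n';
        return 0;
    }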

1 Comment

It could also be UCS-2 or S-JIS or Big 5 or ... well, anything.

UTF-16 is a way of representing text as a sequence of 16-bit elements, but an actual character may consist of more than one element.

std::wstring is just a collection of these elements, and is a class primarily concerned with their storage.

The elements of a wstring are of type wchar_t, which is at least 16 bits wide but could be 32 bits.

4 Comments

Can you please explain in more detail, e.g. by giving an example? For instance, the character 'A' is stored in a std::wstring as 0x0041. How is it stored in UTF-16 format?
16-byte ?? woah that's a hardcore character encoding
@Inverse: That's why everyone should just use ASCII, there wouldn't be so much grief on memory use ;)
For those who may not understand the humor in the above comments, UTF-16 is a 16-bit Unicode encoding. Also, in UTF-16, a character that needs more than one 16-bit element is encoded using a surrogate pair.
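To sketch the example asked for in the first comment (assuming a 16-bit wchar_t, as on Windows): for a BMP character such as 'A' the wstring element and the UTF-16 code unit are the same value; only characters outside the BMP need more than one element.

    #include <cstdint>
    #include <iostream>
    #include <string>

    int main() {
        // 'A' is U+0041: its UTF-16 encoding is the single unit 0x0041,
        // which is exactly the value stored in the wstring.
        std::wstring a = L"A";
        std::cout << std::hex << static_cast<std::uint32_t>(a[0]) << '\n';  // 41

        // U+10400 is outside the BMP: UTF-16 encodes it as the surrogate
        // pair 0xD801 0xDC00, so a 16-bit-wchar_t wstring holds two elements.
        std::wstring d = L"\U00010400";
        for (wchar_t c : d)
            std::cout << std::hex << static_cast<std::uint32_t>(c) << '\n';
        return 0;
    }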
