Reading a UTF-16 formatted file bytewise to wstring

Question

I'm reading a UTF-16 formatted file bytewise with fread and want to store the result in a std::wstring. So far I'm able to read the file with:

    const char* path = "Some_Path_To_a_UTF-16_File";
    char buffer[buffersize];
    FILE* handle = fopen(path, "rb");
    fread(buffer, 1, 100, handle);

After this I have (some of) the bytes of the file stored in buffer (including BOM).

Now to my actual question: I want to store the data I've just read in a std::wstring. I don't understand how to get each pair of bytes that represents a UTF-16 character into a wstring.

I can't use any external libraries! Thanks in advance for your help!

Do not tag C++ questions C! And highlight code snippets in your text.
– too honest for this site, Commented Sep 24, 2015 at 10:55
Combining two bytes into a 16-bit word is easily searchable. So is stuffing the resulting buffer into a std::wstring. Do you also need to decode the UTF-16 stream? That's easily searchable too.
– Captain Obvlious, Commented Sep 24, 2015 at 11:02
@Olaf Thank you! I'm completely new to posting here! I will try to keep your remark in mind for my next post!
– Parker_Halo, Commented Sep 24, 2015 at 11:36

flodin · Accepted Answer · 2015-09-24 11:06:23Z

Whenever you store data in a file (such as a text file) you need to "serialize" it to a sequence of bytes, and when you read it back you need to deserialize it into your data representation.

UTF-16 files follow a specific binary format: they start with a byte order mark, which is followed by pairs of bytes that must be combined into wchar_t values.

I would suggest you start by reading the data in byte pairs (e.g. with fgetc) and combining each pair into a wchar_t according to the byte order, e.g. for big-endian data wchar_t c = b1; c = c << 8 | b2; and then push_back the result onto the wstring.
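
The byte-pair approach described above could be sketched roughly as follows. This is only an illustration, not the answerer's code: the helper names utf16_bytes_to_wstring and read_utf16_file are made up for this example, and the fallback to big-endian when no byte order mark is present is an assumption. It stores one UTF-16 code unit per wchar_t and does not combine surrogate pairs, which matters on platforms where wchar_t is wider than 16 bits.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: combine raw UTF-16 bytes into a std::wstring.
// A leading byte order mark selects the byte order and is skipped;
// without one, big-endian is assumed (an assumption of this sketch).
std::wstring utf16_bytes_to_wstring(const unsigned char* data, std::size_t size)
{
    std::wstring result;
    bool big_endian = true;
    std::size_t i = 0;

    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) {        // BOM FE FF: big-endian
            big_endian = true;
            i = 2;
        } else if (data[0] == 0xFF && data[1] == 0xFE) { // BOM FF FE: little-endian
            big_endian = false;
            i = 2;
        }
    }

    // Combine each complete byte pair into one 16-bit code unit.
    // A trailing odd byte, if any, is ignored.
    for (; i + 1 < size; i += 2) {
        unsigned int hi = big_endian ? data[i] : data[i + 1];
        unsigned int lo = big_endian ? data[i + 1] : data[i];
        result.push_back(static_cast<wchar_t>((hi << 8) | lo));
    }
    return result;
}

// Hypothetical helper: read a whole file in binary mode and decode it.
// Returns an empty string if the file cannot be opened.
std::wstring read_utf16_file(const char* path)
{
    std::vector<unsigned char> bytes;
    if (FILE* handle = std::fopen(path, "rb")) {
        unsigned char chunk[4096];
        std::size_t n;
        while ((n = std::fread(chunk, 1, sizeof chunk, handle)) > 0) {
            bytes.insert(bytes.end(), chunk, chunk + n);
        }
        std::fclose(handle);
    }
    return utf16_bytes_to_wstring(bytes.data(), bytes.size());
}
```

Using unsigned char for the buffer avoids sign extension when a byte value above 0x7F is shifted and combined, which is an easy mistake to make with a plain char buffer.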

This worked perfectly so far! I now have another question regarding byte order on multiple platforms: As I see now the byte order of wchar_t in Windows seems to be Big-endian. Will it be BE in Linux/iOS too or could it be different there?
Typically the default format depends on the CPU architecture, not the OS. It is better to use a byte-order mark in the file to indicate how it was saved, but if none is available then you just have to make a guess. The most reliable guess would probably be based on what makes most sense when you look at the file contents, but to make it less complicated you could assume the same endianness as the CPU you're running on.
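
As a rough sketch of the fallback described in this comment, one could prefer the byte order mark and only assume the CPU's own byte order when the file has none. The names ByteOrder, host_is_little_endian and guess_byte_order are hypothetical and chosen for this example only.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

enum class ByteOrder { Big, Little };

// Hypothetical helper: true if this CPU stores the least significant byte first.
bool host_is_little_endian()
{
    const std::uint16_t probe = 0x0102;
    unsigned char bytes[sizeof probe];
    std::memcpy(bytes, &probe, sizeof probe);
    return bytes[0] == 0x02;
}

// Hypothetical helper: use the BOM if the data starts with one,
// otherwise guess the byte order of the machine we are running on.
ByteOrder guess_byte_order(const unsigned char* data, std::size_t size)
{
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) return ByteOrder::Big;
        if (data[0] == 0xFF && data[1] == 0xFE) return ByteOrder::Little;
    }
    return host_is_little_endian() ? ByteOrder::Little : ByteOrder::Big;
}
```

The host-order guess is only a guess: a file without a BOM that was written on a machine with the other byte order would still be decoded incorrectly.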
