I'm reading a UTF-16 encoded file bytewise with fread and want to store the result in a std::wstring. So far I'm able to read the file with:

    const char* path = "Some_Path_To_a_UTF-16_File";
    char buffer[buffersize];
    FILE* handle = fopen(path, "rb");
    fread(buffer, 1, 100, handle);

After this I have (some of) the bytes of the file stored in buffer (including BOM).

Now to my actual question: I want to store the data I've just read in a std::wstring. I don't understand how to get each pair of bytes that represents a UTF-16 character into a wstring.

I can't use any external libraries. Thanks in advance for your help!

  • Do not tag C++ questions C! And highlight code snippets in your text. Commented Sep 24, 2015 at 10:55
  • Combining two bytes into a 16 bit word is easily searchable. So is stuffing the resulting buffer into a std::wstring. Do you also need to decode the UTF-16 stream? That's easily searchable too. Commented Sep 24, 2015 at 11:02
  • @Olaf Thank you! I'm completely new to posting here. I will keep your remark in mind for my next post. Commented Sep 24, 2015 at 11:36

1 Answer

Whenever you store data in a file (such as a text file) you need to "serialize" it to a sequence of bytes, and when you read it back you need to deserialize it into your in-memory data representation.

UTF-16 files follow a specific binary format: they usually start with a byte order mark (BOM), followed by pairs of bytes that must each be combined into a single wchar_t value.

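I would suggest you start by reading the data in byte pairs (e.g. with fgetc) and combining each pair into a wchar_t according to the byte order. For big-endian data, where the first byte b1 is the high byte, that is wchar_t c = b1; c = (c << 8) | b2; (swap b1 and b2 for little-endian data). Then push_back c onto the wstring.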
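To make the byte-pair approach concrete, here is a minimal sketch using only the standard library. The function names decode_utf16 and read_utf16_file are hypothetical, not part of any library. It assumes big-endian data when no BOM is present, and it copies 16-bit code units one-to-one, so surrogate pairs are not merged into single code points.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: decode raw UTF-16 bytes into a std::wstring.
// Assumption: big-endian when there is no BOM. Code units are copied
// one-to-one, so surrogate pairs are NOT combined.
std::wstring decode_utf16(const unsigned char* data, std::size_t size) {
    bool big_endian = true;  // assumed default without a BOM
    std::size_t pos = 0;
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) {         // BOM: big-endian
            pos = 2;
        } else if (data[0] == 0xFF && data[1] == 0xFE) {  // BOM: little-endian
            big_endian = false;
            pos = 2;
        }
    }

    std::wstring result;
    result.reserve((size - pos) / 2);
    // Combine each complete byte pair; a trailing odd byte is ignored.
    for (; pos + 1 < size; pos += 2) {
        const unsigned int b1 = data[pos];
        const unsigned int b2 = data[pos + 1];
        const unsigned int unit = big_endian ? (b1 << 8) | b2
                                             : (b2 << 8) | b1;
        result.push_back(static_cast<wchar_t>(unit));
    }
    return result;
}

// Hypothetical helper: read a whole file with fopen/fread, then decode it.
// Returns an empty string if the file cannot be opened.
std::wstring read_utf16_file(const char* path) {
    std::vector<unsigned char> bytes;
    if (std::FILE* handle = std::fopen(path, "rb")) {
        unsigned char buffer[4096];
        std::size_t n;
        while ((n = std::fread(buffer, 1, sizeof buffer, handle)) > 0) {
            bytes.insert(bytes.end(), buffer, buffer + n);
        }
        std::fclose(handle);
    }
    return decode_utf16(bytes.data(), bytes.size());
}
```

Calling read_utf16_file("Some_Path_To_a_UTF-16_File") would then return the file contents as a std::wstring without the BOM. Decoding from an unsigned char buffer avoids sign-extension problems that can appear when plain char values above 0x7F are shifted.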

2 Comments

This worked perfectly so far! I now have another question regarding byte order on multiple platforms: as far as I can see, the byte order of wchar_t on Windows seems to be big-endian. Will it be big-endian on Linux/iOS too, or could it be different there?
Typically the default format depends on the CPU architecture, not the OS. It is better to use a byte-order mark in the file to indicate how it was saved, but if none is available then you just have to make a guess. The most reliable guess would probably be based on what makes most sense when you look at the file contents, but to make it less complicated you could assume the same endianness as the CPU you're running on.
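As a rough sketch of the fallback described in that last comment: prefer the BOM when the file has one, otherwise assume the byte order of the CPU the program runs on. The names ByteOrder, host_is_little_endian and guess_utf16_order are hypothetical, and the host-order fallback is only a guess, as the comment says.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

enum class ByteOrder { Little, Big };

// Hypothetical helper: true if this CPU stores the least significant
// byte of a multi-byte integer first.
bool host_is_little_endian() {
    const std::uint16_t probe = 0x0102;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);  // inspect the first byte in memory
    return first == 0x02;
}

// Hypothetical helper: use the BOM if present, otherwise fall back to
// the host byte order (an assumption, not a guarantee about the file).
ByteOrder guess_utf16_order(const unsigned char* data, std::size_t size) {
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) return ByteOrder::Big;
        if (data[0] == 0xFF && data[1] == 0xFE) return ByteOrder::Little;
    }
    return host_is_little_endian() ? ByteOrder::Little : ByteOrder::Big;
}
```

Because the bytes are combined explicitly with shifts, the decoding logic itself gives the same result on any CPU; the host byte order only matters here as the default guess for files without a BOM.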
