Explanations about the Ifstream get() method behaviour when reading UTF-8 encoded text (C++)

Question

I am currently learning how to work with UTF-XX encoded files and text.

I have this simple example:

std::ifstream ifs;
ifs.open("data/text.txt");
do {
    char c;
    ifs.get(c);
    printf("%x\n", c);
} while (!ifs.eof());

Where the file text.txt contains the following strings:

yabloko
яблоко

The result looks like this:

79 61 62 6c 6f 6b 6f a ffffffd1 ffffff8f ffffffd0 ffffffb1 ffffffd0 ffffffbb ffffffd0 ffffffbe ffffffd0 ffffffba ffffffd0 ffffffbe

I understand why the Cyrillic word produces twice as many lines (the file is UTF-8 encoded, and each of these characters takes 2 bytes). My question is about what get() and printf() are doing. More precisely, why is my character c printed as an int, with the first 3 bytes set to FF? When I look at the documentation for the get() method I see:

int get();
istream& get (char& c);

I tried both options. The first one returns an int, while the second takes a char, which confuses me. Why would these functions extract anything other than a single byte (char) at a time from a file, and why is the value for the Cyrillic characters printed as, for example, ffffffd1 instead of d1?

Not related to your issue, but your do/while loop does not check that get() succeeded before calling printf(), so c will be garbage when EOF is reached. You should use a while loop instead: char c; while (ifs.get(c)) { printf("%hhx\n", c); }
– Remy Lebeau, commented Mar 8, 2017 at 22:48
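
As a minimal sketch of the loop suggested in this comment: the helper name dump_hex is hypothetical, and it collects the output in a string instead of calling printf() directly so that the result is easy to inspect. The cast to unsigned char is an addition that keeps each value in the range 0 to 255 regardless of whether char is signed.

```cpp
#include <cstdio>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns one lowercase hex value per line
// for every byte that could actually be read from the stream.
std::string dump_hex(std::istream& in) {
    std::string out;
    char c;
    // get(c) returns the stream, which tests false once a read fails,
    // so the body never runs with an unread c.
    while (in.get(c)) {
        char buf[8];
        // Convert to unsigned char first so bytes above 127 are not sign-extended.
        std::snprintf(buf, sizeof buf, "%x\n",
                      static_cast<unsigned int>(static_cast<unsigned char>(c)));
        out += buf;
    }
    return out;
}
```

With the file from the question, this could be called as dump_hex(ifs) on a std::ifstream opened on "data/text.txt"; any std::istream, such as a std::istringstream, works the same way.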

Maxim Egorushkin · Accepted Answer · 2017-03-06 18:12:07Z

More precisely, why is my character c printed as an int?

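Because a char is promoted to int when it is passed as a variadic (...) argument to printf. On your platform char is signed, so every byte value above 127 is a negative char and is promoted to a negative int. For example, 0xd1 is -47 as a signed char; sign extension to a 32-bit int gives 0xffffffd1, which %x then prints as an unsigned int.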

You may like to use the %hhx format specifier to print a char.
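
The promotion described above can be sketched as follows. Both helper names are hypothetical, and the ffffffd1 result assumes a 32-bit int, as the output in the question suggests. The second helper shows an alternative to %hhx: converting to unsigned char before the value is widened.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: widens the byte as a signed char, which is what
// happens implicitly on platforms where plain char is signed.
std::string hex_sign_extended(char c) {
    int widened = static_cast<signed char>(c);  // byte 0xd1 becomes -47
    char buf[16];
    // %x expects an unsigned int; -47 converts to 0xffffffd1 with a 32-bit int.
    std::snprintf(buf, sizeof buf, "%x", static_cast<unsigned int>(widened));
    return buf;
}

// Hypothetical helper: converts to unsigned char first, so the value
// stays in the range 0..255 and no sign extension takes place.
std::string hex_byte(char c) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "%x",
                  static_cast<unsigned int>(static_cast<unsigned char>(c)));
    return buf;
}
```

ASCII bytes such as 'y' (0x79) come out the same from both helpers, because they are non-negative either way; only bytes above 127, such as the UTF-8 bytes of the Cyrillic letters, differ.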

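int istream::get() returns an int rather than a char so that the read character can be distinguished from EOF. Each byte is returned as an unsigned char value converted to int (0 to 255), while Traits::eof() is normally int(-1), a value that no byte read from the file can produce.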
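
A minimal sketch of reading with the int-returning overload, assuming the default char traits; the helper name read_bytes is hypothetical. Every byte arrives as a value from 0 to 255, so even a 0xff byte cannot be mistaken for the end-of-file marker.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects every byte of the stream using int get().
std::vector<int> read_bytes(std::istream& in) {
    using traits = std::istream::traits_type;
    std::vector<int> bytes;
    // get() yields the byte as a non-negative int, or traits::eof()
    // when nothing more could be read.
    for (int ch = in.get(); ch != traits::eof(); ch = in.get()) {
        bytes.push_back(ch);
    }
    return bytes;
}
```

If get() returned a plain signed char instead, the byte 0xff would compare equal to -1 and the loop would stop early, which is the ambiguity the int return type avoids.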

Nice, thank you (I am tempted to accept your answer), but do you also know why the get() method returns an int rather than a char?
