I am currently learning how to work with UTF-XX encoded files and text.
I have this simple example:
    std::ifstream ifs;
    ifs.open("data/text.txt");
    do {
        char c;
        ifs.get(c);
        printf("%x\n", c);
    } while (!ifs.eof());

where the file text.txt contains the following strings:
    yabloko
    яблоко

The result looks like this:
    79
    61
    62
    6c
    6f
    6b
    6f
    a
    ffffffd1
    ffffff8f
    ffffffd0
    ffffffb1
    ffffffd0
    ffffffbb
    ffffffd0
    ffffffbe
    ffffffd0
    ffffffba
    ffffffd0
    ffffffbe

I do understand why I have twice the number of lines for the Cyrillic word (because it's UTF-8 encoded and each character in this case uses 2 bytes). My question is about what get() and printf() are doing. More precisely, why is my character c printed out as an int, with the first 3 bytes set to FF? When I look at the doc for the get() method I see:
    int get();
    istream& get (char& c);

I tried both options. I see the first one returns an int, while the second takes a char, and I am really confused. Why would these functions extract anything other than a single byte (char) at a time from a file, and why is the value for the Cyrillic characters printed as, for example, ffffffd1 instead of d1?
The do/while loop does not check that get() succeeds before calling printf(), so c will be garbage when EOF is reached. You should use a while loop instead:

    char c;
    while (ifs.get(c)) {
        printf("%hhx\n", c);
    }
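As for the ffffffd1 in the question: a char passed to printf() is promoted to int first, and on platforms where plain char is signed, the byte 0xD1 is a negative value that gets sign-extended (ffffffd1 with a 32-bit int). The int-returning get() exists so that EOF can be reported as a value distinct from every possible byte. Here is a minimal sketch of both points. The helper names hex_promoted, hex_byte and dump are made up for illustration, and the signed-char behaviour is an assumption about your platform, not something the standard guarantees.

```cpp
#include <cstdio>
#include <istream>
#include <sstream>
#include <string>

// Mimics printf("%x", c): the char is promoted to int before formatting.
// Where plain char is signed, 0xD1 is negative and sign-extends
// (ffffffd1 with a 32-bit int); where char is unsigned, this yields d1.
std::string hex_promoted(char c) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "%x",
                  static_cast<unsigned int>(static_cast<int>(c)));
    return buf;
}

// Converts to unsigned char first, so the value is always in 0..255
// and no sign extension can happen.
std::string hex_byte(char c) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "%x",
                  static_cast<unsigned int>(static_cast<unsigned char>(c)));
    return buf;
}

// Reads with the int-returning get(). It yields each byte as a
// non-negative int, or traits::eof() once the stream is exhausted,
// which is why the return type has to be wider than char.
std::string dump(std::istream& in) {
    std::string out;
    for (int ch = in.get(); ch != std::char_traits<char>::eof(); ch = in.get()) {
        char buf[16];
        std::snprintf(buf, sizeof buf, "%x", static_cast<unsigned int>(ch));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling dump() on an std::ifstream opened on the file from the question should give two hex digits per byte for the Cyrillic word. hex_byte() is the cast to apply if you keep using get(char&).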