Below is a simplified example of my problem. I have some external byte data which appears to be a string with cp1252 encoded degree symbol 0xb0. When it is stored in my program as an std::string it is correctly represented as 0xffffffb0. However, when that string is then written to a file, the resulting file is only one byte long with just 0xb0. How do I write the string to the file? How does the concept of UTF-8 come into this?

```cpp
#include <iostream>
#include <fstream>

typedef struct {
    char n[40];
} mystruct;

static void dump(const std::string& name)
{
    std::cout << "It is '" << name << "'" << std::endl;
    const char *p = name.data();
    for (size_t i=0; i<name.size(); i++)
    {
        printf("0x%02x ", p[i]);
    }
    std::cout << std::endl;
}

int main()
{
    const unsigned char raw_bytes[] = { 0xb0, 0x00};
    mystruct foo;
    foo = *(mystruct *)raw_bytes;
    std::string name = std::string(foo.n);
    dump(name);
    std::ofstream my_out("/tmp/out.bin", std::ios::out | std::ios::binary);
    my_out << name;
    my_out.close();
    return 0;
}
```

Running the above program produces the following on STDOUT

```
It is '�'
0xffffffb0
```
  • What do you expect if the actual byte is 0xb0? Forget ANSI strings and use Unicode inside the app anyway; for serialization, use UTF-8. Commented Aug 2, 2019 at 16:23
  • std::string name = std::string(foo.n); -- This does not construct a string containing two characters. Commented Aug 2, 2019 at 16:27
  • BTW, in C++ you don't need the typedef for struct. Commented Aug 2, 2019 at 16:34
  • 1. *(mystruct *)raw_bytes is not legal, anything could happen. 2. 0xffffffb0 is 0xb0 char value cast to int. It has nothing to do with ASCII, cp1252 or anything of this nature. Commented Aug 2, 2019 at 16:35
  • @MichaelChourdakis - can you please clarify how I would go about converting external data which contains non-ascii byte 0xb0 into a unicode string? Commented Aug 2, 2019 at 17:35

1 Answer


First of all, this is a must read:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Once you are done with that, you need to understand the type of p[i].

It is char, which in C and C++ is a small integer type whose signedness is implementation-defined. On most desktop platforms char is signed, so char can be negative!

Since you have cp1252 characters, some bytes (0x80 and above, such as 0xb0) are outside the range of ASCII. With a signed char, these bytes are seen as negative values!

When such a value is converted to int, which is what happens to a char passed to printf(), the sign bit is replicated (sign extension), so printing it with %x shows 0xffffff<actual byte value>.

To handle that in C, first you should cast to unsigned char:

```c
printf("0x%02x ", (unsigned char)p[i]);
```

then the default conversion will fill in the missing bits with zeros and printf() will give you a proper value.
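The difference between the two calls can be checked in isolation. The following is a minimal sketch, not code from the question: the helper names `hex_naive` and `hex_fixed` are hypothetical, `signed char` is used explicitly so the result does not depend on the platform's choice for plain `char`, and a 32-bit `int` is assumed.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: mimics the question's printf call. The signed value is
// widened to int first, so a negative byte is sign-extended.
// Assumes int is 32 bits wide.
std::string hex_naive(signed char ch) {
    char buf[16];
    const int promoted = ch;  // sign extension happens here
    std::snprintf(buf, sizeof buf, "0x%02x", static_cast<unsigned int>(promoted));
    return buf;
}

// Hypothetical helper: casts to unsigned char first, so the widening
// conversion fills the upper bits with zeros.
std::string hex_fixed(char ch) {
    char buf[16];
    const unsigned char byte = static_cast<unsigned char>(ch);
    std::snprintf(buf, sizeof buf, "0x%02x", static_cast<unsigned int>(byte));
    return buf;
}
```

With these helpers, `hex_naive(-80)` reproduces the `0xffffffb0` seen in the question, while `hex_fixed('\xb0')` gives `0xb0`.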

In C++ this is a bit nastier, since the stream operators treat char and unsigned char as characters rather than numbers. So to print them in hex, you need something like this:

```cpp
int charToInt(char ch)
{
    return static_cast<int>(static_cast<unsigned char>(ch));
}

std::cout << std::hex << charToInt(s[i]);
```

Note that a direct conversion from char to unsigned int will not fix the problem, since the compiler will silently perform a conversion to int first, and the negative value is sign-extended before it becomes unsigned.

See here: https://wandbox.org/permlink/sRmh8hZd78Oar7nF
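As a rough illustration of both points, the sketch below contrasts a direct cast to unsigned int with the two-step cast, and wraps the stream approach in a function. The names `direct_to_unsigned`, `via_unsigned_char`, and `hex_dump` are hypothetical, and a 32-bit unsigned int is assumed.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: direct cast. A negative value wraps around modulo 2^32
// (assuming 32-bit unsigned int), so the upper bits are all ones.
unsigned int direct_to_unsigned(signed char ch) {
    return static_cast<unsigned int>(ch);
}

// Hypothetical helper: going through unsigned char first keeps only the byte.
unsigned int via_unsigned_char(signed char ch) {
    return static_cast<unsigned int>(static_cast<unsigned char>(ch));
}

// Hypothetical helper: hex dump of every byte in a string using streams only.
std::string hex_dump(const std::string& s) {
    std::ostringstream out;
    out << std::hex;
    for (char ch : s) {
        // The cast to int makes the stream print a number, not a character.
        out << "0x" << static_cast<int>(static_cast<unsigned char>(ch)) << ' ';
    }
    return out.str();
}
```

For the byte from the question, `direct_to_unsigned(-80)` yields `0xffffffb0`, whereas `via_unsigned_char(-80)` yields `0xb0`.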

UTF-8 has nothing to do with this issue.
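This also explains the one-byte file: the string holds the single byte 0xb0, and writing it to a stream copies that byte unchanged, with no encoding step involved. Here is a small sketch of that behavior, using a std::ostringstream as a stand-in for the std::ofstream in the question; `bytes_written` is a hypothetical helper name.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: writes the string to a stream the same way the
// question does (operator<<) and returns exactly what the stream received.
std::string bytes_written(const std::string& s) {
    std::ostringstream out;  // stands in for the std::ofstream in the question
    out << s;                // copies the bytes as-is, no re-encoding
    return out.str();
}
```

Passing a string that contains only the byte 0xb0 returns a one-byte result holding 0xb0, which matches the file the question describes.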

Off-topic: please, when you write pure C++ code, do not use C. It is pointless and makes code harder to maintain, and it is not faster. So:

  • do not use char* or char[] to store strings. Just use std::string.
  • do not use printf(), use std::cout (or the fmt library, if you like format strings; its design was later standardized as std::format in C++20).
  • do not use alloc(), malloc(), free() - in modern C++, use std::make_unique() and std::make_shared().

2 Comments

Thank you for answering my question as it was phrased. I guess I should have rephrased the question. I have read Joel's article before and re-read it just now. What I ultimately want to achieve is to take external data and convert it from what appears to be cp1252 encoding into UTF-8. There must be a generic C++ way to do that, but I cannot find it.
I found the solution that works for me. In my original code, instead of std::string name = std::string(foo.n); I am now doing std::string name = boost::locale::conv::to_utf<char>(foo.n, "Latin1");
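For readers without Boost, the same idea can be sketched by hand for the "Latin1" case used above. This is an illustrative assumption, not the Boost implementation: `latin1_to_utf8` is a hypothetical helper that treats each input byte as the Unicode code point of the same value. That is correct for ISO-8859-1 but not for the cp1252-specific characters in the range 0x80 to 0x9F, which would need a lookup table.

```cpp
#include <string>

// Hypothetical helper: converts ISO-8859-1 (Latin-1) bytes to UTF-8.
// Assumption: input is Latin-1, so byte value == Unicode code point.
// cp1252 differs from Latin-1 in 0x80..0x9F; those bytes are NOT handled
// specially here.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two
    for (char ch : in) {
        const unsigned char byte = static_cast<unsigned char>(ch);
        if (byte < 0x80) {
            // ASCII range is encoded as a single identical byte.
            out.push_back(static_cast<char>(byte));
        } else {
            // Code points 0x80..0xFF need two bytes: 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            out.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return out;
}
```

The degree symbol 0xb0 from the question becomes the two bytes 0xc2 0xb0, so the output file would be two bytes long rather than one.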
