std::string character encoding

Question

std::string arrWords[10];
std::vector<std::string> hElemanlar;

......

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]).c_str());

......

What I am doing is this: every element of arrWords is a std::string. I take the n-th element of arrWords and push its characters, one at a time, into hElemanlar.

Assuming arrWords[0] is "test", then:

this->hElemanlar.push_back("t");
this->hElemanlar.push_back("e");
this->hElemanlar.push_back("s");
this->hElemanlar.push_back("t");

My problem is that, although I have no encoding problems with arrWords, some UTF-8 characters are not printed or handled correctly in hElemanlar. How can I fix it?

We cannot help when your problem statement is just "some utf-8 characters are not printed or treated well".
– Lightness Races in Orbit, Commented Dec 23, 2015 at 10:27
I'm sure there is no problem for "test". Can you show some string that does have a problem?
– Bo Persson, Commented Dec 23, 2015 at 10:31
@LightnessRacesinOrbit well the problem is that some utf-8 characters are not printed or treated well.
– gokturk, Commented Dec 23, 2015 at 10:31

Martin Bonner supports Monica · Accepted Answer · 2015-12-23 10:56:45Z

If you know that arrWords[i] contains UTF-8 encoded text, then you probably need to split the strings into complete Unicode characters.
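
To see why, here is a minimal sketch of what the per-char split in the question does. The helper name splitBytes is hypothetical, and the input is assumed to be UTF-8. The Turkish letter "ç" (U+00E7) is encoded as the two bytes 0xC3 0xA7, so a word such as "çay" yields four one-byte strings, and the first two are not valid UTF-8 on their own.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits a string into one-byte pieces, as the code in the question does.
// Each element of the result holds exactly one char (one byte), which is
// only a complete character when that byte is plain ASCII.
std::vector<std::string> splitBytes(const std::string& word)
{
    std::vector<std::string> pieces;
    for (std::size_t j = 0; j < word.size(); ++j)
    {
        pieces.push_back(std::string(1, word[j]));
    }
    return pieces;
}
```

Calling splitBytes("\xC3\xA7" "ay") returns four elements rather than three, which would explain why non-ASCII characters look broken when the elements are printed one by one.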

As an aside, rather than saying:

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]).c_str());

(which constructs a temporary std::string, obtains its C-string representation, constructs another temporary string from that, and pushes it onto the vector), say:

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]));

Anyway. This will need to become something like:

std::string str(1, this->arrWords[sayKelime][j]);
if (static_cast<unsigned char>(str[0]) >= 0xC0)
{
    // Lead byte of a multi-byte sequence: append every continuation byte (bit pattern 10xxxxxx).
    while ((static_cast<unsigned char>(this->arrWords[sayKelime][j + 1]) & 0xC0) == 0x80)
    {
        str.push_back(this->arrWords[sayKelime][++j]);
    }
}
this->hElemanlar.push_back(str);

Note that the loop above is safe: if j is the index of the last char in the string, [j+1] returns the nul terminator, which ends the loop (indexing a std::string at size() is well-defined since C++11). You will still need to consider how incrementing j interacts with the rest of your code.
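
As a self-contained sketch of the same idea, the function below splits a whole string into one std::string per code point. The names isContinuationByte and splitCodePoints are hypothetical, and the input is assumed to be well-formed UTF-8; nothing here validates it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
bool isContinuationByte(char c)
{
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Splits UTF-8 text into one std::string per code point.
// Assumption: the input is well-formed UTF-8; no validation is attempted.
std::vector<std::string> splitCodePoints(const std::string& word)
{
    std::vector<std::string> out;
    std::size_t j = 0;
    while (j < word.size())
    {
        // Take the first byte, then every continuation byte that follows it.
        std::string piece(1, word[j++]);
        while (j < word.size() && isContinuationByte(word[j]))
        {
            piece.push_back(word[j++]);
        }
        out.push_back(piece);
    }
    return out;
}
```

With this version, splitCodePoints("\xC3\xA7" "ay") gives three elements, and the first one holds both bytes of "ç". Keeping the index handling inside one function also avoids the question of how the extra increments of j interact with an outer loop.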

You then need to decide whether hElemanlar should hold individual Unicode code points (which is what this code produces), or a character together with all the combining characters that follow it. In the latter case, you would have to extend the code above to:

Parse the next code-point
Decide whether it is a combining character
Push the UTF-8 sequence on the string if so.
Repeat (you can have multiple combining characters on a character).
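
The steps above could be sketched roughly as follows. The names decodeAt, isCombining and splitWithCombining are hypothetical, the code needs C++11 for char32_t, and it assumes well-formed UTF-8. As a deliberate simplification, isCombining only recognises the Combining Diacritical Marks block (U+0300 to U+036F); real text needs the Unicode character database, for example through a library such as ICU.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decodes the code point starting at s[pos] and stores its byte length in len.
// Assumption: well-formed UTF-8; malformed input is not detected.
char32_t decodeAt(const std::string& s, std::size_t pos, std::size_t& len)
{
    unsigned char lead = static_cast<unsigned char>(s[pos]);
    len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    // Keep only the payload bits of the lead byte (5, 4 or 3 of them).
    char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
    for (std::size_t k = 1; k < len && pos + k < s.size(); ++k)
    {
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos + k]) & 0x3F);
    }
    return cp;
}

// Simplifying assumption: only U+0300..U+036F counts as combining.
bool isCombining(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Splits UTF-8 text so that each element is a base character followed by
// any combining marks (as defined by isCombining) that come after it.
std::vector<std::string> splitWithCombining(const std::string& word)
{
    std::vector<std::string> out;
    std::size_t j = 0;
    while (j < word.size())
    {
        std::size_t len = 0;
        char32_t cp = decodeAt(word, j, len);
        std::string piece = word.substr(j, len);
        j += len;
        if (isCombining(cp) && !out.empty())
        {
            out.back() += piece;   // attach the mark to the previous character
        }
        else
        {
            out.push_back(piece);  // start a new character
        }
    }
    return out;
}
```

For example, "e" followed by U+0301 (bytes 0xCC 0x81) ends up in a single element, and several marks in a row are all appended to the same base character.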
