std::string arrWords[10];
std::vector<std::string> hElemanlar;

......

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]).c_str()); 

......

What I am doing is this: every element of arrWords is a std::string. I take the n-th element of arrWords and push its characters, one at a time, into hElemanlar.

Assuming arrWords[0] is "test", then:

this->hElemanlar.push_back("t");
this->hElemanlar.push_back("e");
this->hElemanlar.push_back("s");
this->hElemanlar.push_back("t");

My problem is that, although I have no encoding problems with arrWords, some UTF-8 characters are not printed or handled correctly in hElemanlar. How can I fix it?

  • We cannot help when your problem statement is just "some utf-8 characters are not printed or treated well" Commented Dec 23, 2015 at 10:27
  • I'm sure there is no problem for "test". Can you show some string that does have a problem? Commented Dec 23, 2015 at 10:31
  • @LightnessRacesinOrbit well the problem is that some utf-8 characters are not printed or treated well. Commented Dec 23, 2015 at 10:31
  • Repeating the same statement does not add value either. Commented Dec 23, 2015 at 10:31
  • @BoPersson like "ğ,ş,ı,ö,ç,ü". Commented Dec 23, 2015 at 10:32

1 Answer

If you know that arrWords[i] contains UTF-8 encoded text, then you probably need to split the strings into complete Unicode characters.

As an aside, rather than saying:

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]).c_str()); 

(which constructs a temporary std::string, obtains the C-string representation of it, constructs another temporary string from that, and pushes that onto the vector), say:

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]));

Anyway. This will need to become something like:

std::string str(1, this->arrWords[sayKelime][j]);
if (static_cast<unsigned char>(str[0]) >= 0xC0) {
    // Lead byte of a multi-byte sequence: append its continuation bytes (10xxxxxx).
    while ((static_cast<unsigned char>(this->arrWords[sayKelime][j + 1]) & 0xC0) == 0x80) {
        str.push_back(this->arrWords[sayKelime][++j]);
    }
}
this->hElemanlar.push_back(str);

Note that the above loop is safe: if j is the index of the last char in the string, [j+1] returns the nul-terminator (guaranteed for std::string since C++11), which ends the loop. You will need to consider how incrementing j interacts with the rest of your code, though.
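If the interaction with j gets awkward, it may be easier to split the whole word in one go. A minimal sketch, assuming the input is valid UTF-8; splitUtf8CodePoints is a hypothetical helper name, not something from the question:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the question): returns one std::string per
// Unicode code point. Assumes `text` is valid UTF-8; no validation is done.
std::vector<std::string> splitUtf8CodePoints(const std::string& text) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t start = i++;  // first byte (ASCII or lead byte)
        // Continuation bytes have the bit pattern 10xxxxxx.
        while (i < text.size() &&
               (static_cast<unsigned char>(text[i]) & 0xC0) == 0x80) {
            ++i;
        }
        out.push_back(text.substr(start, i - start));
    }
    return out;
}
```

With something like this, the loop over j could be replaced by hElemanlar = splitUtf8CodePoints(arrWords[sayKelime]); (or by appending the result, if hElemanlar already holds elements).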

You then need to decide whether hElemanlar should hold individual Unicode code points (which this does), or a character together with all the combining characters that follow it. In the latter case, you would have to extend the code above to:

  • Parse the next code-point
  • Decide whether it is a combining character
  • Push the UTF-8 sequence on the string if so.
  • Repeat (you can have multiple combining characters on a character).
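The steps above could be sketched roughly as follows. This assumes valid UTF-8, and isCombiningMark only checks a handful of Unicode combining-mark blocks, so it is a deliberate simplification rather than a complete test; a real implementation would consult the Unicode character database (for example through a library such as ICU). All function names here are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes the code point starting at text[i] and stores its byte length in len.
// Assumes valid UTF-8; a truncated sequence is simply clamped to the string end.
static std::uint32_t decodeAt(const std::string& text, std::size_t i, std::size_t& len) {
    unsigned char b = static_cast<unsigned char>(text[i]);
    std::uint32_t cp;
    if (b < 0x80)      { len = 1; cp = b; }
    else if (b < 0xE0) { len = 2; cp = b & 0x1F; }
    else if (b < 0xF0) { len = 3; cp = b & 0x0F; }
    else               { len = 4; cp = b & 0x07; }
    if (i + len > text.size()) len = text.size() - i;
    for (std::size_t k = 1; k < len; ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(text[i + k]) & 0x3F);
    return cp;
}

// Simplified assumption: only these combining-mark blocks are recognised.
static bool isCombiningMark(std::uint32_t cp) {
    return (cp >= 0x0300 && cp <= 0x036F) || (cp >= 0x1AB0 && cp <= 0x1AFF) ||
           (cp >= 0x1DC0 && cp <= 0x1DFF) || (cp >= 0x20D0 && cp <= 0x20FF) ||
           (cp >= 0xFE20 && cp <= 0xFE2F);
}

// Splits text so that each element is a base character plus any combining
// marks that immediately follow it.
std::vector<std::string> splitWithCombining(const std::string& text) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t len = 0;
        std::uint32_t cp = decodeAt(text, i, len);
        if (isCombiningMark(cp) && !out.empty()) {
            out.back().append(text, i, len);      // attach to previous character
        } else {
            out.push_back(text.substr(i, len));   // start a new character
        }
        i += len;
    }
    return out;
}
```

Precomposed letters such as "ğ" or "ş" are single code points and come out as one element either way; the grouping only matters when the text stores a base letter followed by a separate combining mark.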

1 Comment

Unfortunately it is crashing.
