I am writing a lexical analyzer that parses a given string in C++. I have a string
line = R"(if n = 4 # comment return 34; if n≤3 retur N1 FI)"; All I need to do is output all words, numbers and tokens in a vector.
My program works with regular tokens, words and numbers; but I cannot figure out how to parse Unicode characters. The only Unicode characters my program needs to save in a vector are ≤ and ≠.
So far, my code basically takes the string line by line, reads the first word, number or token, chops it off, and recursively continues to eat tokens until the string is empty. I am unable to compare line[0] with ≠ (of course), and I am also not clear on how much of the string I need to chop off in order to get rid of the Unicode character. In the case of "!=" I simply remove line[0] and line[1].
One approach is to convert the input to std::wstring via a function such as widen; wide characters will then be much easier to process. Be aware, though, that std::wstring can be the worst of both worlds: on platforms where wchar_t is 16 bits you still have issues with multibyte characters, with the funny addition of byte orders, etc. That said, characters outside the BMP are not difficult to handle in UTF-16, because the values used for the individual code units in a surrogate pair fall within well-defined ranges.