
I am writing a lexical analyzer in C++ that parses a given string. I have a string

line = R"(if n = 4 # comment return 34; if n≤3 retur N1 FI)"; 

All I need to do is output all words, numbers and tokens in a vector.

My program works with regular tokens, words and numbers, but I cannot figure out how to parse Unicode characters. The only Unicode characters my program needs to save in a vector are ≤ and ≠.

So far my code basically takes the string line by line, reads the first word, number or token, chops it off, and recursively continues to eat tokens until the string is empty. I am unable to compare line[0] with ≤ (of course), and I am also not clear on how much of the string I need to chop off in order to get rid of the Unicode character. In the case of "!=" I simply remove line[0] and line[1].
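To illustrate (a simplified sketch, not my exact code; line and tokens are the string and the vector mentioned above), the two-character case is handled roughly like this:

// Simplified sketch of the ASCII two-character case ("!="):
// the token is exactly two bytes, so removing line[0] and line[1] is enough.
if (line.size() >= 2 && line[0] == '!' && line[1] == '=') {
    tokens.push_back("!=");
    line.erase(0, 2);   // but how many characters do I erase for "≤" or "≠"?
}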

5 Comments
  • Are your strings encoded as UTF-8? If so, see this post to see how to convert them to std::wstring via the function called widen. These will then be much easier to process. Commented Oct 10, 2020 at 21:46
  • @PaulSanders I disagree, see utf8everywhere.org. The only architecture where UTF-8 might be problematic is MS-Windows. But Windows uses UTF-16 for std::wstring, which is the worst of both worlds: you still have multibyte issues, with the funny addition of byte order, etc. Commented Oct 10, 2020 at 22:06
  • @PaulSanders converting to UTF-16 doesn't solve the problem, because you still need to handle characters outside the BMP. UTF-16 is not a fixed-width encoding where you only ever need to read 2 bytes. Commented Oct 11, 2020 at 1:14
  • @phuclv I didn't say I was converting to UTF-16. The referenced code converts to std::wstring. That said, characters outside the BMP are not difficult to handle in UTF-16 because the values used for the individual code units in a surrogate pair fall within well-defined ranges. Commented Oct 11, 2020 at 1:23
  • 1
    @PaulSanders then handling UTF-8 is even easier because the ranges are also well defined Commented Oct 11, 2020 at 1:24

2 Answers


If your input file is UTF-8, just treat your Unicode characters ≤, ≠, etc. as strings. So you just have to use the same logic to recognize "≤" as you would for "<=". The length of a Unicode char is then given by strlen("≤").
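For example (a minimal sketch; matchToken and the tokens vector are made-up names, not taken from your code):

#include <cstring>
#include <string>
#include <vector>

// If the bytes of tok appear at position pos in line, store the token and
// return how many bytes to consume; "<=" is 2 bytes, "≤" is 3 bytes in UTF-8.
// Assumes pos <= line.size().
std::size_t matchToken(const std::string& line, std::size_t pos,
                       const char* tok, std::vector<std::string>& tokens)
{
    const std::size_t len = std::strlen(tok);
    if (line.compare(pos, len, tok) == 0) {   // plain byte-by-byte prefix match
        tokens.push_back(tok);
        return len;                           // chop off exactly this many bytes
    }
    return 0;                                 // no match at this position
}

// usage: if (std::size_t n = matchToken(line, i, "≤", tokens)) i += n;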


2 Comments

You don't know what the next character is yet, so how can you know that it's ≤ in order to call strlen("≤")? And if you already knew it, then strlen is unnecessary because you already know the length. To recognize "≤" you need to know its length before reading and recognizing it.
@phuclv, the poster specifically said he knew how to recognize "!=" and remove it from the byte stream. Recognizing != is done by matching the beginning of the input stream, byte-by-byte, with the zero-terminated string "!=". Recognizing ≤ or ≠ is done by matching the beginning of the input stream, byte-by-byte, with the zero-terminated strings "≤" or "≠". The poster also hinted that he didn't understand how much to remove from the input stream. With != he knows to remove 2 bytes; with "≠" he should remove strlen("≠") bytes.

All Unicode encodings are variable-length except UTF-32. Therefore the next character isn't necessarily a single char, and you must read it as a string. Since you're using a char* or std::string, the encoding is likely UTF-8, and the next character can be returned as a std::string.

The encoding of UTF-8 is very simple and you can read about it everywhere. In short, the first byte of a sequence indicates how long that sequence is, and you can get the next character like this:

#include <string>

std::string getNextChar(const std::string& str, size_t index)
{
    const unsigned char lead = str[index];  // the lead byte encodes the sequence length
    if ((lead & 0x80) == 0)                 // 1-byte sequence (plain ASCII)
        return std::string(1, str[index]);
    else if ((lead & 0xE0) == 0xC0)         // 2-byte sequence
        return std::string(&str[index], 2);
    else if ((lead & 0xF0) == 0xE0)         // 3-byte sequence ("≤" and "≠" are here)
        return std::string(&str[index], 3);
    else if ((lead & 0xF8) == 0xF0)         // 4-byte sequence
        return std::string(&str[index], 4);
    throw "Invalid codepoint!";
}

It's a very simple decoder and doesn't handle invalid codepoints or broken data streams yet. If you need better handling you'll have to use a proper UTF-8 library.
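For example, a loop on top of it could walk the whole line one character at a time (a sketch; line and tokens are assumed from the question):

// Walk the line and collect only the Unicode operators the question cares about.
std::size_t i = 0;
while (i < line.size()) {
    std::string ch = getNextChar(line, i);
    if (ch == "≤" || ch == "≠")
        tokens.push_back(ch);   // the whole multi-byte character goes into the vector
    i += ch.size();             // advance by the character's byte length
}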

1 Comment

This is much more complicated than it has to be. It is virtually never necessary to split a UTF-8 string into separate codepoints to do string matching. Just have the Unicode characters you are looking for as UTF-8 strings, and look for those byte sequences in the input stream. There is no need to do codepoint-by-codepoint comparisons when byte-by-byte comparisons work perfectly fine.
