This should do what you asked. It checks both for unpaired high/low surrogates and for undefined code points (where "defined" depends on the Unicode tables present in the version of .NET you are using and on the version of the operating system).
    using System.Globalization;

    static bool IsLegalUnicode(string str)
    {
        for (int i = 0; i < str.Length; i++)
        {
            var uc = char.GetUnicodeCategory(str, i);

            if (uc == UnicodeCategory.Surrogate)
            {
                // Unpaired surrogate, like "😵"[0] + "A" or "😵"[1] + "A"
                return false;
            }
            else if (uc == UnicodeCategory.OtherNotAssigned)
            {
                // Unassigned code point, like \uFF00
                // (or \U00030000, which was unassigned until Unicode 13.0)
                return false;
            }

            // Correct high-low surrogate pair: we must skip the low surrogate
            // (the pair is known to be correct, because otherwise the
            // category would have been UnicodeCategory.Surrogate)
            if (char.IsHighSurrogate(str, i))
            {
                i++;
            }
        }

        return true;
    }
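The function above collapses both failure modes into a single `false`. If you need to know which check failed and where, a variant along the following lines may help. This is only a sketch: the `UnicodeProblem` enum, the `UnicodeCheck` class and the `FindFirstProblem` method are hypothetical names introduced for illustration, and the result for "unassigned" still depends on the Unicode tables of your .NET version.

```csharp
using System;
using System.Globalization;

// Hypothetical names, introduced only for this illustration.
enum UnicodeProblem { None, UnpairedSurrogate, UnassignedCodePoint }

static class UnicodeCheck
{
    // Returns the first problem found in str and the UTF-16 index where it
    // occurs, or UnicodeProblem.None with index -1 if nothing was found.
    public static UnicodeProblem FindFirstProblem(string str, out int index)
    {
        for (int i = 0; i < str.Length; i++)
        {
            var uc = char.GetUnicodeCategory(str, i);

            if (uc == UnicodeCategory.Surrogate)
            {
                // A surrogate that is not part of a valid high-low pair
                index = i;
                return UnicodeProblem.UnpairedSurrogate;
            }

            if (uc == UnicodeCategory.OtherNotAssigned)
            {
                // Well-formed, but not assigned in the Unicode tables in use
                index = i;
                return UnicodeProblem.UnassignedCodePoint;
            }

            // Valid pair: skip its low surrogate
            if (char.IsHighSurrogate(str, i))
            {
                i++;
            }
        }

        index = -1;
        return UnicodeProblem.None;
    }
}
```

For example, `UnicodeCheck.FindFirstProblem("A" + "\uD83D" + "B", out int pos)` reports an unpaired surrogate at index 1, while `"\uFF00"` is reported as an unassigned code point at index 0.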
Note that Unicode keeps expanding. UTF-8 can encode every Unicode code point, including the ones that are not assigned at this time.
Some examples:
    var test1 = IsLegalUnicode("abcdeàèéìòù"); // true
    var test2 = IsLegalUnicode("⭐ White Medium Star"); // true, Unicode 5.1
    var test3 = IsLegalUnicode("😁 Beaming Face With Smiling Eyes"); // true, Unicode 6.0
    var test4 = IsLegalUnicode("🙂 Slightly Smiling Face"); // true, Unicode 7.0
    var test5 = IsLegalUnicode("🤗 Hugging Face"); // true, Unicode 8.0
    var test6 = IsLegalUnicode("🤣 Rolling on the Floor Laughing"); // false, Unicode 9.0 (2016)
    var test7 = IsLegalUnicode("🤩 Star-Struck"); // false, Unicode 10.0 (2017)

    var test8 = IsLegalUnicode("\uFF00"); // false, undefined BMP code point

    var test9 = IsLegalUnicode("😀"[0] + "X"); // false, unpaired high surrogate
    var test10 = IsLegalUnicode("😀"[1] + "X"); // false, unpaired low surrogate
Note that you can encode in UTF-8 even well-formed but "unknown" Unicode code points (unknown to the Unicode tables in use), like the 🤩 Star-Struck.
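To illustrate the difference between "encodable in UTF-8" and "assigned", here is a minimal sketch using a strict `UTF8Encoding` (constructed with `throwOnInvalidBytes: true`, so it throws instead of substituting U+FFFD). The `Utf8Demo` class and its `CanEncodeAsUtf8` and `ToHex` methods are hypothetical helpers written for this example.

```csharp
using System;
using System.Text;

// Hypothetical helper class, introduced only for this illustration.
static class Utf8Demo
{
    // Strict UTF-8: no BOM, and errors throw instead of being replaced.
    static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // True if str is well-formed UTF-16 and can therefore be encoded as UTF-8.
    // Whether its code points are assigned plays no role here.
    public static bool CanEncodeAsUtf8(string str)
    {
        try
        {
            StrictUtf8.GetBytes(str);
            return true;
        }
        catch (EncoderFallbackException)
        {
            // Thrown for unpaired surrogates
            return false;
        }
    }

    // UTF-8 bytes of str as hyphen-separated hex, e.g. "41-42".
    public static string ToHex(string str)
    {
        return BitConverter.ToString(StrictUtf8.GetBytes(str));
    }
}
```

With this sketch, `Utf8Demo.ToHex("\U0001F929")` (the Star-Struck emoji) gives `F0-9F-A4-A9` and `Utf8Demo.CanEncodeAsUtf8("\uFF00")` is `true`, even when the Unicode tables in use do not know either code point. An unpaired surrogate such as `"\uD83D" + "X"` cannot be encoded, so `CanEncodeAsUtf8` returns `false` for it.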
Results taken with .NET 4.7.2 under Windows 10.
`a` is UTF-16, not UTF-8. What do you mean by "invalid"? Unpaired high/low surrogates? Unassigned Unicode code points?

`something` is just a random value and I want to check whether that value is one of the valid UTF-8 codes or not.

… '\uD83D'+'\uDE35' (so it is 2x char together), but alone both '\uD83D' and '\uDE35' (which are called high and low surrogates) are illegal. '\uFFF0' is at this time undefined in the Unicode standard (there is no character defined for that code point). We don't know if in a year it will still be undefined. Two different kinds of "illegal".