How to check for invalid UTF-8 characters?

Question

There are many supported hexadecimal (UTF-8) entities, with decimal values from 0 to 10175. Is there a fast way to check whether a value held in a variable is one of these supported hexadecimal (UTF-8) entities?

e.g.

    var something = "some string value";
    char[] validCharacter = new[] { All 10175 UTF-8 Hexadecimal characters };
    if (validCharacter.Contains(something))
    {
        // do something
    }

How can I do this check the fastest way possible?

Unclear what you are asking. a is UTF-16, not UTF-8. What do you mean by "invalid"? Unpaired high/low surrogates? Unassigned Unicode codepoints?
– xanatos, Commented Jun 8, 2018 at 12:59
@xanatos check the question now, something is just a random value and I want to check whether that value is one of the valid UTF-8 codes or not.
– CD DelRio, Commented Jun 8, 2018 at 13:03
You are repeating the same words, but your words don't have a unique meaning. 😵 == '\uD83D'+'\uDE35' (so it is two chars together), but alone both '\uD83D' and '\uDE35' (called high and low surrogates) are illegal. '\uFFF0' is at this time undefined in the Unicode standard (there is no character defined for that codepoint). We don't know if in a year it will still be undefined. Those are two different kinds of "illegal".
– xanatos, Commented Jun 8, 2018 at 13:07
Problem 1 (unpaired surrogates) can be mechanically detected (it is based on the value of a char). Problem 2 (which characters are defined in Unicode) requires big tables of Unicode characters. The ones in .NET are old and don't contain newer emojis (and other rare scripts).
– xanatos, Commented Jun 8, 2018 at 13:08
Do you want to know whether some integer value represents a valid Unicode codepoint, or whether some byte could be used in a UTF-8 encoding, or (maybe) whether the current font will show something useful on-screen?
– Hans Keﬆing, Commented Jun 8, 2018 at 13:08

xanatos · Accepted Answer · 2018-06-08 13:33:27Z

This should return what you asked. It checks both for the absence of unpaired high/low surrogates and for the absence of undefined codepoints (where "defined" depends on the Unicode tables present in the version of .NET you are using and on the version of the operating system).

    // Requires: using System.Globalization;
    static bool IsLegalUnicode(string str)
    {
        for (int i = 0; i < str.Length; i++)
        {
            var uc = char.GetUnicodeCategory(str, i);

            if (uc == UnicodeCategory.Surrogate)
            {
                // Unpaired surrogate, like "😵"[0] + "A" or "😵"[1] + "A"
                return false;
            }
            else if (uc == UnicodeCategory.OtherNotAssigned)
            {
                // \uFF00 or \U00030000
                return false;
            }

            // Correct high-low surrogate, we must skip the low surrogate
            // (it is correct because otherwise it would have been a
            // UnicodeCategory.Surrogate)
            if (char.IsHighSurrogate(str, i))
            {
                i++;
            }
        }

        return true;
    }

Note that Unicode is continuously expanding. UTF-8 can encode all Unicode codepoints, even the ones that are not assigned at this time.

Some examples:

    var test1 = IsLegalUnicode("abcdeàèéìòù"); // true
    var test2 = IsLegalUnicode("⭐ White Medium Star"); // true, Unicode 5.1
    var test3 = IsLegalUnicode("😁 Beaming Face With Smiling Eyes"); // true, Unicode 6.0
    var test4 = IsLegalUnicode("🙂 Slightly Smiling Face"); // true, Unicode 7.0
    var test5 = IsLegalUnicode("🤗 Hugging Face"); // true, Unicode 8.0
    var test6 = IsLegalUnicode("🤣 Rolling on the Floor Laughing"); // false, Unicode 9.0 (2016)
    var test7 = IsLegalUnicode("🤩 Star-Struck"); // false, Unicode 10.0 (2017)
    var test8 = IsLegalUnicode("\uFF00"); // false, undefined codepoint in the BMP
    var test9 = IsLegalUnicode("😀"[0] + "X"); // false, unpaired high surrogate
    var test10 = IsLegalUnicode("😀"[1] + "X"); // false, unpaired low surrogate

Note that you can encode in UTF-8 even well-formed "unknown" Unicode codepoints, like the 🤩 Star-Struck.

Results taken with .NET 4.7.2 under Windows 10.
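If only the first kind of "illegal" matters (unpaired surrogates), the check does not need the Unicode tables at all, so its result should not change with the .NET or Windows version. A minimal sketch of that idea follows; the name HasUnpairedSurrogate is made up for illustration and is not part of .NET. Like IsLegalUnicode above, the method is shown bare and would normally sit inside a class.

```csharp
using System;

// Hypothetical helper (not a .NET API): returns true if the string
// contains a high or low surrogate that is not part of a valid pair.
static bool HasUnpairedSurrogate(string str)
{
    for (int i = 0; i < str.Length; i++)
    {
        if (char.IsHighSurrogate(str[i]))
        {
            // A high surrogate must be followed immediately by a low surrogate.
            if (i + 1 >= str.Length || !char.IsLowSurrogate(str[i + 1]))
            {
                return true;
            }

            i++; // skip the low half of the valid pair
        }
        else if (char.IsLowSurrogate(str[i]))
        {
            // A low surrogate that was not consumed by a preceding high surrogate.
            return true;
        }
    }

    return false;
}
```

With this, a string holding a newer emoji such as 🤩 is accepted as long as both halves of the pair are present, regardless of whether the runtime's tables know the codepoint.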

What I actually want is this: if the value passed to your IsLegalUnicode method contains more than one character, it should automatically be false. If the value is a single character, it should first be checked whether it is a digit [0 to 9], an alphabetical character [a to z] or punctuation [.,;: etc.], and only if it is none of these should the real check run. I hope that makes it a bit clearer: normal alphabetical characters, digits and punctuation are excluded from the check.
@xanatos +1. I didn't know that char.GetUnicodeCategory works differently than myString[i] and takes both parts in case of a surrogate pair.

ispiro · Accepted Answer · 2021-09-09 20:25:12Z

UTF8Encoding.GetString(byteArray) will throw an ArgumentException if error detection is enabled.

Source: https://msdn.microsoft.com/en-us/library/kzb9f993(v=vs.110).aspx

But if you're testing something that is already a string, then as far as I know it will almost always be valid UTF-8 (see below). As far as I know, all C# strings are encoded in UTF-16, which is an encoding for all Unicode characters. UTF-8 is just a different encoding for the same set, i.e. for all of the Unicode characters.

(This might exclude some Unicode characters which are new etc. But those will also not be in UTF-16, so that won't matter here.)

As someone has commented, there might be "halves" of UTF-16 characters that would be valid strings but won't be valid UTF-8 values. To verify, you can call GetBytes() on a UTF8Encoding with error detection enabled, which throws on such a half; the default Encoding.UTF8 silently replaces it with U+FFFD instead. But those will probably be quite rare.

EDIT

Enabling error detection: use the UTF8Encoding(Boolean, Boolean) constructor and pass true for the second argument (throwOnInvalidBytes).
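As a rough sketch of how error detection could be used in both directions: the helper names IsValidUtf8 and IsEncodableAsUtf8 below are invented for illustration, while UTF8Encoding, GetString and GetByteCount are the real .NET members. The decoder and encoder exceptions both derive from ArgumentException, so one catch type covers both. The methods are shown bare and would normally sit inside a class.

```csharp
using System;
using System.Text;

// Hypothetical helper: true if the byte array is well-formed UTF-8.
static bool IsValidUtf8(byte[] bytes)
{
    // Second argument (throwOnInvalidBytes) turns on error detection.
    var strict = new UTF8Encoding(false, true);
    try
    {
        strict.GetString(bytes);
        return true;
    }
    catch (ArgumentException) // DecoderFallbackException derives from this
    {
        return false;
    }
}

// Hypothetical helper: true if the string can be encoded as UTF-8,
// i.e. it contains no unpaired surrogate "halves".
static bool IsEncodableAsUtf8(string str)
{
    var strict = new UTF8Encoding(false, true);
    try
    {
        strict.GetByteCount(str);
        return true;
    }
    catch (ArgumentException) // EncoderFallbackException derives from this
    {
        return false;
    }
}
```

Using exceptions for control flow is not the fastest option if invalid input is common, but it keeps the check tied to the framework's own definition of well-formed UTF-8.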

@xanatos Thanks. I actually think it's a little strange that C# allows those. Though I can understand the reasoning behind it...
You can, however, losslessly convert all C# strings to WTF-8.
@dan04 What about a half of a surrogate pair? I think it's a 'valid' string, though it's not really a Unicode character.
