0

The following C# method counts string characters considering combining characters (Grapheme Clusters). Here it is:

public static class StringExtensions { public static SqlInt32 GetStrLength(this string input) { if (string.IsNullOrEmpty(input)) return 0; return StringInfo.ParseCombiningCharacters(input).Length; } } 

Now, I create a CLR from it to use inside SQL Server. Here is its code:

public static class UserDefinedFunctions { [SqlFunction(IsDeterministic = true, IsPrecise = true)] public static SqlInt32 GetStrLength(SqlString input) { if (input.IsNull) return 0; return StringInfo.ParseCombiningCharacters(input.Value).Length; } } 

The C# version works well, but in SQL Server, it doesn't count properly. What's the problem?

Here are a few examples where the SQLCLR function cannot count correctly:

SQLCLR version (Wrong) non-SQLCLR version (Correct)
'πŸ‘©πŸ»' -> 2 'πŸ‘©πŸ»' -> 1
'πŸ‘¨πŸ»β€β€οΈβ€πŸ’‹β€πŸ‘©πŸΌ' -> 9 'πŸ‘¨πŸ»β€β€οΈβ€πŸ’‹β€πŸ‘©πŸΌ' -> 1

Here is the SQL code I have run to get the length:

SELECT dbo.GetStringLength(body) FROM notes; 

And the following is the SQL code used to register the SQLCLR:

EXEC sp_configure 'show advanced options' , 1; RECONFIGURE; EXEC sp_configure 'clr enable' ,1; RECONFIGURE; EXEC sp_configure 'clr strict security', 0; RECONFIGURE; CREATE ASSEMBLY StringUtils FROM 'C:\GraphemeClusters.dll' WITH PERMISSION_SET = SAFE; CREATE FUNCTION dbo.GetStringLength(@input NVARCHAR(MAX)) RETURNS INT AS EXTERNAL NAME StringUtils.UserDefinedFunctions.GetStrLength; 
20
  • 2
    Do you have some string examples and the output you get from sql server + c#? Feel free to update your question with that. Commented Nov 5, 2024 at 10:06
  • 3
    Also, can you show the SQL used to a) register the CLR function and b) run the examples that you add to the question, as siggemannen asks. Commented Nov 5, 2024 at 10:09
  • 1
    Can you show us T-SQL code with your inputs and expected outputs? One thing I'd suggest is that you're probably not using an *_SC (Supplementary Characters) collation so nchar and nvarchar values will be using UCS-2 encoding instead of UTF-16 like C# expects. (They're not the same thing.) Commented Nov 5, 2024 at 10:36
  • 1
    @AlwaysLearning Hi there. The collation of the database shouldn't matter as the _SC and _140_ collations only really affect the behavior of built-in function, and only in relation to supplementary characters (I only mention this because combining characters can be either BMP or supplementary, so built-in functions should work as expected with combining characters so long as they are in the BMP). Commented Nov 5, 2024 at 14:36
  • 1
    MJDevelops: I just tested your method as well as StringInfo.LengthInTextElements, in both SQL Server 2017 and 2022, using the following test string, DECLARE @Input NVARCHAR(50) = NCHAR(0x0303) + NCHAR(0x0303) + N'o' + NCHAR(0x0303) + NCHAR(0x0302) + NCHAR(0x0303) + NCHAR(0x0302);, and the expected value of 3 was returned in all cases. So, again, please update the question with: 1) the version of SQL Server, 2) your test queries, and 3) the results. Commented Nov 5, 2024 at 17:40

1 Answer 1

2

The collation of the database shouldn't matter as the _SC and _140_ collations only really affect the behavior of built-in functions, and only in relation to supplementary characters.

I tested your GetStrLength method as well as StringInfo.LengthInTextElements, in both SQL Server 2017 and 2022, using the following test string:

DECLARE @Input NVARCHAR(50) = -- first two count individually as they do not have a base character NCHAR(0x0303) + NCHAR(0x0303) + -- next character counts as it is a base character N'o' + -- next four do not count as they all attach to the base character NCHAR(0x0303) + NCHAR(0x0302) + NCHAR(0x0303) + NCHAR(0x0302); 

and the expected value of 3 was returned in all cases ("0x0303" and "0x0302" are both combining characters).


The problem is that:

  1. You are testing emoji "sequences", not combining characters,

    and:

  2. SQL Server is tied to .NET Framework, and the highest version, 4.8, does not handle emoji "sequences". Newer versions of .NET, however, do.

Hence, while you thought you were testing the non-SQLCLR version against .NET Framework 4.8 with combining characters, you were actually testing against a more recent version of .NET with emoji "sequences".

I executed the following in both LINQPad versions 5 and 8 (the test emoji sequence provided by the OP, minus the skin tone modifiers of U+1F3FB, can be found in the Unicode Emoji documentation as a "Multi-Person Gender" example):

string smooch = "\U0001F469\U0001F3FB\u200D\u2764\uFE0F\u200D\U0001F48B\u200D\U0001F468\U0001F3FB"; System.Console.WriteLine(smooch); System.Console.WriteLine( System.Globalization.StringInfo.ParseCombiningCharacters(smooch) .Length); 

LINQPad 5, which works with .NET Framework 4.7 and 4.8, returns:

πŸ‘©πŸ»β€πŸ’‹πŸ‘¨πŸ» 9 

The value of 9 comes from the Unicode code points that make up this emoji sequence, minus the emoji presentation selector, which does not count:

  1. U+1F469 (woman)
  2. U+1F3FB (light skin tone)
  3. U+200D (zero width joiner / ZWJ)
  4. U+2764 (heavy black heart)
  • U+FE0F (emoji presentation selector)
  1. U+200D (zero width joiner / ZWJ)
  2. U+1F48B (kiss mark)
  3. U+200D (zero width joiner / ZWJ)
  4. U+1F468 (man)
  5. U+1F3FB (light skin tone)

Β 
LINQPad 8, which works with .NET 5 through 9, returns (I tested against .NET 6.0.35 and 8.0.10):

πŸ‘©πŸ»β€β€οΈβ€πŸ’‹β€πŸ‘¨πŸ» 1 
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks. so it supports combining characters and surrogate-pairs but not emoji sequences. and there is no way to fix this in SQL Server. true?
@MJDevelops I'm looking into a possible work-around for this... no, natively we are stuck with .NET Framework in SQLCLR, but we might be able to do a first-pass over the data to convert emoji sequences to a single character and then do the counting.. I will update my answer with what I find and let you know...

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.