UTF-8 support gives you a new set of options. Potential space savings (without row or page compression) are one consideration, but the choice of type and encoding should probably be made primarily on the basis of actual requirements for comparison, sorting, data import, and export.

You may need to change more than you think, since e.g. an nchar(1) type provides two bytes of storage. That is enough to store any character in the BMP (code points 000000 to 00FFFF). Some of the characters in that range are encoded with just 1 byte in UTF-8, while others require 2 or even 3 bytes (see this comparison chart for more details). Therefore, ensuring coverage of the same set of characters in UTF-8 would require char(3).
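Outside SQL Server, these byte counts are easy to verify; a quick sketch (Python is used here only to inspect the encodings, not to model SQL Server behaviour):

```python
# UTF-16 (as used by nchar/nvarchar) stores every BMP character in 2 bytes,
# while UTF-8 needs 1, 2, or 3 bytes depending on the code point.
for cp in (0x41, 911, 8364):  # 'A', NCHAR(911) = 'Ώ', NCHAR(8364) = '€'
    ch = chr(cp)
    utf16 = len(ch.encode('utf-16-le'))
    utf8 = len(ch.encode('utf-8'))
    print(f"U+{cp:04X} {ch!r}: UTF-16 = {utf16} bytes, UTF-8 = {utf8} bytes")
```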

For example:

DECLARE @T AS table
(
    n integer PRIMARY KEY,
    UTF16 nchar(1) COLLATE Latin1_General_CI_AS,
    UTF8 char(1) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);
INSERT @T (n, UTF16, UTF8)
SELECT 911, NCHAR(911), NCHAR(911);

gives the familiar error:

Msg 8152, Level 16, State 30, Line xxx
String or binary data would be truncated.

Or if trace flag 460 is active:

Msg 2628, Level 16, State 1, Line xxx
String or binary data would be truncated in table '@T', column 'UTF8'. Truncated value: ' '.

Expanding the UTF8 column to char(2) or varchar(2) resolves the error for NCHAR(911):

DECLARE @T AS table
(
    n integer PRIMARY KEY,
    UTF16 nchar(1) COLLATE Latin1_General_CI_AS,
    UTF8 varchar(2) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);
INSERT @T (n, UTF16, UTF8)
SELECT 911, NCHAR(911), NCHAR(911);

However, if the value were e.g. NCHAR(8364) (the euro sign), you would need to expand the column further, to char(3) or varchar(3).
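The width needed for a given set of characters is just the maximum UTF-8 byte length among them, which can be sketched as follows (utf8_char_width is a hypothetical helper for illustration, not a SQL Server function):

```python
def utf8_char_width(chars):
    """Smallest char(n)/varchar(n) byte width that can hold any
    single character from `chars` under a UTF-8 collation."""
    return max(len(c.encode('utf-8')) for c in chars)

print(utf8_char_width('A'))          # ASCII fits in 1 byte
print(utf8_char_width(chr(911)))     # NCHAR(911), Greek: 2 bytes
print(utf8_char_width(chr(8364)))    # NCHAR(8364), euro sign: 3 bytes
```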

Note also that the UTF-8 collations all support supplementary characters, so they will not work with replication.

Aside from anything else, UTF-8 support is only in preview at this time, so not available for production use.
