First a special problem: Unicode 0 is the terminator character for strings in C/C++. Modified UTF-8 (the variant Java uses) deals with this by also giving a multi-byte encoding to what officially should be the single byte 0, so decoding poses no problem. You might consider this, for instance by simply requiring modified UTF-16 on input, as you check the terminator (`*str`).
(Optional) To cope with modified UTF-16, generating modified UTF-8:

    if (codepoint <= 0x007F && codepoint != 0) {
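For illustration, a minimal sketch of how that condition could sit in the branch chain, so that U+0000 falls through to the two-byte case and comes out as the overlong pair `0xC0 0x80`. The function name and the `std::string` sink are mine, purely illustrative, not taken from your code:

    #include <string>

    // Sketch only: the adjusted first branch skips 0, so U+0000 falls
    // through to the two-byte case and is written as 0xC0 0x80.
    void append_modified_utf8(unsigned codepoint, std::string& out)
    {
        if (codepoint <= 0x007F && codepoint != 0) {        // plain ASCII, but not NUL
            out += static_cast<char>(codepoint);
        } else if (codepoint <= 0x07FF) {                    // U+0000 lands here
            out += static_cast<char>(((codepoint >> 6) & 0x1F) | 0xC0);
            out += static_cast<char>((codepoint & 0x3F) | 0x80);
        }
        // ... three- and four-byte branches continue as in the original code
    }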
Now the actual review:

You are now giving the result to `cout` using an extra NUL byte as terminator; that terminator should be added outside the loop. Better, there should be a byte output stream: an output-stream parameter could be appended to byte by byte, without intermediate arrays, and only at the end might a NUL byte be needed. If `str` occupies N UTF-16 bytes, then the result will need at most 2N UTF-8 bytes. _(N UTF-16 bytes ~ at most N/2 code points ~ at most 2N UTF-8 bytes, counting N/2 four-byte sequences as the worst case.)_
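As a sketch of what such a byte-oriented interface might look like (the signature is my assumption, not your code): the caller passes an output stream, each byte is written as soon as it is produced, and no per-character array is needed:

    #include <cstdint>
    #include <ostream>

    // Sketch: write the UTF-8 bytes for one code point straight to a stream.
    // No intermediate buffer; the caller decides whether a trailing NUL is wanted.
    void put_utf8(std::uint32_t codepoint, std::ostream& out)
    {
        if (codepoint <= 0x7F) {
            out.put(static_cast<char>(codepoint));
        } else if (codepoint <= 0x7FF) {
            out.put(static_cast<char>(((codepoint >> 6) & 0x1F) | 0xC0));
            out.put(static_cast<char>((codepoint & 0x3F) | 0x80));
        } else if (codepoint <= 0xFFFF) {
            out.put(static_cast<char>(((codepoint >> 12) & 0x0F) | 0xE0));
            out.put(static_cast<char>(((codepoint >> 6) & 0x3F) | 0x80));
            out.put(static_cast<char>((codepoint & 0x3F) | 0x80));
        } else {
            out.put(static_cast<char>(((codepoint >> 18) & 0x07) | 0xF0));
            out.put(static_cast<char>(((codepoint >> 12) & 0x3F) | 0x80));
            out.put(static_cast<char>(((codepoint >> 6) & 0x3F) | 0x80));
            out.put(static_cast<char>((codepoint & 0x3F) | 0x80));
        }
    }

The single trailing NUL, if one is wanted at all, is then written once after the conversion loop; for modified UTF-8 the first branch would additionally get the `!= 0` check from above.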
Creating arrays is superfluous; immediately return the single bytes. (This would be the case for delivering the result somewhat like `cout << (((codepoint >> 6) & 0x1F) | 0xC0)`.)

You nicely validate the input for illegal UTF-16 chars above the maximum. In Java one would throw an exception; you just discard the char.
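If the Java-style behaviour is wanted, the check could report the bad input instead of dropping it silently; one possible shape (the function name and exception type are mine, purely illustrative):

    #include <cstdint>
    #include <stdexcept>

    // Sketch: reject code points above the Unicode maximum (0x10FFFF)
    // instead of silently discarding them.
    std::uint32_t checked_codepoint(std::uint32_t codepoint)
    {
        if (codepoint > 0x10FFFF) {
            throw std::invalid_argument("invalid UTF-16 input: code point out of range");
        }
        return codepoint;
    }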
(A matter of taste) Maybe consider an API with a string length as an input parameter instead of relying on a NUL terminator. If the area of application is file-based, that would be even more natural.
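One possible shape for such an interface, combining the explicit length with the stream-style output suggested above (all names are illustrative, not taken from your code):

    #include <cstddef>
    #include <cstdint>
    #include <ostream>

    // Sketch: explicit length instead of NUL scanning, so embedded U+0000 is
    // unproblematic and converting a buffer read from a file becomes natural.
    void utf16_to_utf8(const std::uint16_t* str, std::size_t length, std::ostream& out)
    {
        for (std::size_t i = 0; i < length; ++i) {
            std::uint32_t codepoint = str[i];
            // combine a surrogate pair into a single code point
            if (codepoint >= 0xD800 && codepoint <= 0xDBFF && i + 1 < length) {
                std::uint32_t low = str[i + 1];
                if (low >= 0xDC00 && low <= 0xDFFF) {
                    codepoint = 0x10000 + ((codepoint - 0xD800) << 10) + (low - 0xDC00);
                    ++i;
                }
            }
            put_utf8(codepoint, out);   // byte-wise output as sketched earlier
        }
    }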