First a special problem: Unicode 0 is the terminator character for strings in C/C++. Modified UTF-8 (the variant Java uses) deals with this by also giving a multi-byte encoding to what officially should be the single byte 0, so decoding poses no problem. You might consider this, for instance by simply requiring modified UTF-16 on input, as you check the terminator (`*str`).
(Optional) To cope with modified UTF-16, generating modified UTF-8:

    if (codepoint <= 0x007F && codepoint != 0) {
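For illustration, a minimal sketch of how that condition could sit in the branch chain, so that U+0000 falls through to the two-byte case and comes out as the overlong pair `0xC0 0x80`. The function name and the `std::string` sink are mine, purely illustrative, not taken from your code:

    #include <string>

    // Sketch only: the adjusted first branch skips 0, so U+0000 falls
    // through to the two-byte case and is written as 0xC0 0x80.
    void append_modified_utf8(unsigned codepoint, std::string& out)
    {
        if (codepoint <= 0x007F && codepoint != 0) {        // plain ASCII, but not NUL
            out += static_cast<char>(codepoint);
        } else if (codepoint <= 0x07FF) {                    // U+0000 lands here
            out += static_cast<char>(((codepoint >> 6) & 0x1F) | 0xC0);
            out += static_cast<char>((codepoint & 0x3F) | 0x80);
        }
        // ... three- and four-byte branches continue as in the original code
    }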
Now the actual review:

You are now giving the result to `cout` using an extra NUL byte as terminator; that terminator should be added outside the loop. Better, there should be a byte output stream: an output-stream parameter could be appended to byte by byte, without intermediate arrays, and only at the end might a NUL byte be needed. If `str` occupies N UTF-16 bytes, then the result will need at most 2N UTF-8 bytes. _(N UTF-16 bytes ~ at most N/2 code points ~ at most 2N UTF-8 bytes, counting N/2 four-byte sequences as the worst case.)_
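As a sketch of what such a byte-oriented interface might look like (the signature is my assumption, not your code): the caller passes an output stream, each byte is written as soon as it is produced, and no per-character array is needed:

    #include <cstdint>
    #include <ostream>

    // Sketch: write the UTF-8 bytes for one code point straight to a stream.
    // No intermediate buffer; the caller decides whether a trailing NUL is wanted.
    void put_utf8(std::uint32_t codepoint, std::ostream& out)
    {
        if (codepoint <= 0x7F) {
            out.put(static_cast<char>(codepoint));
        } else if (codepoint <= 0x7FF) {
            out.put(static_cast<char>(((codepoint >> 6) & 0x1F) | 0xC0));
            out.put(static_cast<char>((codepoint & 0x3F) | 0x80));
        } else if (codepoint <= 0xFFFF) {
            out.put(static_cast<char>(((codepoint >> 12) & 0x0F) | 0xE0));
            out.put(static_cast<char>(((codepoint >> 6) & 0x3F) | 0x80));
            out.put(static_cast<char>((codepoint & 0x3F) | 0x80));
        } else {
            out.put(static_cast<char>(((codepoint >> 18) & 0x07) | 0xF0));
            out.put(static_cast<char>(((codepoint >> 12) & 0x3F) | 0x80));
            out.put(static_cast<char>(((codepoint >> 6) & 0x3F) | 0x80));
            out.put(static_cast<char>((codepoint & 0x3F) | 0x80));
        }
    }

The single trailing NUL, if one is wanted at all, is then written once after the conversion loop; for modified UTF-8 the first branch would additionally get the `!= 0` check from above.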
Creating arrays is superfluous; immediately return the single bytes. (This would be the case for delivering the result somewhat like `cout << (((codepoint >> 6) & 0x1F) | 0xC0)`.)

You nicely validate the input for illegal UTF-16 chars above the maximum. In Java one would throw an exception; you just discard the char.
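If the Java-style behaviour is wanted, the check could report the bad input instead of dropping it silently; one possible shape (the function name and exception type are mine, purely illustrative):

    #include <cstdint>
    #include <stdexcept>

    // Sketch: reject code points above the Unicode maximum (0x10FFFF)
    // instead of silently discarding them.
    std::uint32_t checked_codepoint(std::uint32_t codepoint)
    {
        if (codepoint > 0x10FFFF) {
            throw std::invalid_argument("invalid UTF-16 input: code point out of range");
        }
        return codepoint;
    }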
(A matter of taste) Maybe consider an API with a string length as an input parameter instead of relying on a NUL terminator. If the area of application is file-based, that would be even more natural.
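One possible shape for such an interface, combining the explicit length with the stream-style output suggested above (all names are illustrative, not taken from your code):

    #include <cstddef>
    #include <cstdint>
    #include <ostream>

    // Sketch: explicit length instead of NUL scanning, so embedded U+0000 is
    // unproblematic and converting a buffer read from a file becomes natural.
    void utf16_to_utf8(const std::uint16_t* str, std::size_t length, std::ostream& out)
    {
        for (std::size_t i = 0; i < length; ++i) {
            std::uint32_t codepoint = str[i];
            // combine a surrogate pair into a single code point
            if (codepoint >= 0xD800 && codepoint <= 0xDBFF && i + 1 < length) {
                std::uint32_t low = str[i + 1];
                if (low >= 0xDC00 && low <= 0xDFFF) {
                    codepoint = 0x10000 + ((codepoint - 0xD800) << 10) + (low - 0xDC00);
                    ++i;
                }
            }
            put_utf8(codepoint, out);   // byte-wise output as sketched earlier
        }
    }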