Difference between UTF-8 and UTF-16?
UTF-8 is a sequence of 8-bit bytes, while UTF-16 is a sequence of 16-bit units (hereafter referred to as words).
In UTF-8, code points with values 0x00 to 0x7F are encoded directly as single bytes, code points 0x80 to 0x7FF as two bytes, code points 0x800 to 0xFFFF as three bytes, and code points 0x10000 to 0x10FFFF as four bytes.
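To make those ranges concrete, here is a minimal sketch of the UTF-8 encoding rules for a single code point (it assumes the input is a valid code point and skips error handling for the surrogate range). The function name `encode_utf8` is just for illustration; in practice you would use your language's built-in codec, as the last two lines show.

```python
def encode_utf8(cp: int) -> bytes:
    """Illustrative UTF-8 encoder for one code point (no validation)."""
    if cp <= 0x7F:                      # 1 byte:  0xxxxxxx
        return bytes([cp])
    elif cp <= 0x7FF:                   # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    elif cp <= 0xFFFF:                  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    else:                               # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

print(encode_utf8(0x20AC).hex())        # 'e282ac' for U+20AC (EURO SIGN)
print("\u20ac".encode("utf-8").hex())   # same result from the built-in codec
```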
In UTF-16, code points 0x0000 to 0xFFFF (note: values 0xD800 to 0xDFFF are not valid Unicode code points) are encoded directly as single words. Code points with values 0x10000 to 0x10FFFF are encoded as two words. These two-word sequences are known as surrogate pairs.
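And here is a corresponding sketch of how a code point above 0xFFFF is split into a surrogate pair: subtract 0x10000, then place the top 10 bits in a high (lead) surrogate and the bottom 10 bits in a low (trail) surrogate. Again, `encode_utf16_units` is an illustrative name, and the input is assumed to be a valid code point.

```python
def encode_utf16_units(cp: int) -> list[int]:
    """Illustrative UTF-16 encoder returning a list of 16-bit units."""
    if cp <= 0xFFFF:
        return [cp]                     # BMP code point: one 16-bit unit
    v = cp - 0x10000                    # 20 bits remain after the offset
    high = 0xD800 | (v >> 10)           # high surrogate carries the top 10 bits
    low = 0xDC00 | (v & 0x3FF)          # low surrogate carries the bottom 10 bits
    return [high, low]

print([hex(u) for u in encode_utf16_units(0x1F600)])   # ['0xd83d', '0xde00']
print("\U0001F600".encode("utf-16-be").hex())          # 'd83dde00' from the built-in codec
```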
Why do we need these?
Because history is messy. Different companies and organisations have different priorities and ideas, and once a format decision is made, it tends to stick around.
Back in 1989 the ISO had proposed a Universal Character Set (UCS) as a draft of ISO 10646, but the major software vendors did not like it, seeing it as over-complicated. They devised their own system called Unicode, a fixed-width 16-bit encoding. The software companies convinced a sufficient number of national standards bodies to vote down the draft of ISO 10646, and ISO was pushed into unification with Unicode.
This original 16-bit Unicode was adopted as the native internal format by a number of major software products. Two of the most notable were Java (released in 1996) and Windows NT (released in 1993). A string in Java or NT is, at its most fundamental, a sequence of 16-bit values.
There was a need to encode Unicode in byte-orientated "extended ASCII" environments. The ISO had proposed a standard, "UTF-1", for this, but it was unpopular: it was slow to implement because it involved modulo operations, and the encoded data had some undesirable properties.
X/Open circulated a proposal for a new standard for encoding Unicode/UCS values in extended ASCII environments. This was altered slightly by the Plan 9 developers to become what we now know as UTF-8.
Eventually, the software vendors had to concede that 16 bits was not enough. In particular, China was pressing heavily for support for historic Chinese characters that were too numerous to encode in 16 bits.
The end result was Unicode 2.0, which expanded the code space to just over 20 bits (code points up to 0x10FFFF) and introduced UTF-16. At the same time, Unicode 2.0 also elevated UTF-8 to be a formal part of the standard. Finally, it introduced UTF-32, a new fixed-width encoding.
In practice, due to compatibility and efficiency considerations, relatively few systems adopted UTF-32. Those systems that had adopted the original 16-bit Unicode (e.g. Windows, Java) moved to UTF-16, while those that had remained byte-orientated (e.g. Unix, the Internet) continued their gradual move from legacy 8-bit encodings to UTF-8.