
What is encoding in XML? The encoding normally used is UTF-8. How is it different from other encodings, and what is the purpose of using it?

  • Refer to the W3C recommendation on encoding. Commented Apr 14, 2011 at 9:58
  • @Nishant: that's not really a good introduction to the topic of character encodings in general. And I think that's what the question is really about. Commented Apr 14, 2011 at 9:59
  • I just added the XML spec since the OP said the 'normal encoding used is UTF-8'. It isn't supposed to be an answer. Commented Apr 14, 2011 at 10:02
  • @Joachim: yes, it's not very clear in the W3C spec. Can you suggest any other link? Commented Apr 14, 2011 at 14:15
  • You mean apart from the links I gave you in the answer below? Commented Apr 14, 2011 at 14:17

4 Answers


A character encoding specifies how characters are mapped onto bytes. Since XML documents are stored and transferred as byte streams, an encoding is needed to represent the Unicode characters that make up an XML document.
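
As a quick illustration (a minimal Python sketch; the sample string is my own, not from the answer), the same characters map to different byte sequences under different encodings:

    text = "café"
    print(text.encode("utf-8"))      # b'caf\xc3\xa9'  (é takes two bytes)
    print(text.encode("latin-1"))    # b'caf\xe9'      (é is the single byte 0xE9)
    print(text.encode("utf-16-le"))  # b'c\x00a\x00f\x00\xe9\x00' (two bytes per character)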

UTF-8 is chosen as the default, because it has several advantages:

  • it is compatible with ASCII, in that all valid ASCII-encoded text is also valid UTF-8-encoded text (but not necessarily the other way around!)
  • it uses only 1 byte per character for "common" letters (those that also exist in ASCII)
  • it can represent all existing Unicode characters
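
A minimal Python sketch of the three points above (the sample strings are invented examples):

    # ASCII compatibility: ASCII text encodes to identical bytes in UTF-8.
    assert "hello".encode("ascii") == "hello".encode("utf-8")

    # Variable length: common letters stay at one byte, yet every Unicode
    # character is representable.
    for ch in ["A", "é", "€", "𝄞"]:
        print(ch, "->", len(ch.encode("utf-8")), "byte(s)")
    # A -> 1, é -> 2, € -> 3, 𝄞 -> 4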

Character encodings are a more general topic than just XML. UTF-8 is not restricted to being used in XML only.

What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text is an article that gives a good overview of the topic.



When computers were first created, they mostly worked only with characters found in the English language, which led to the 7-bit US-ASCII standard.

However, there are a lot of different written languages in the world, and ways had to be found to be able to use them in computers.

The first way, which works fine as long as you restrict yourself to a single language, is to use a culture-specific encoding, such as ISO-8859-1, which can represent Latin-European characters in 8 bits, or GB2312 for Chinese characters.

The second way is a bit more complicated, but theoretically allows every character in the world to be represented: the Unicode standard, in which every character from every language has a specific code. However, given the high number of existing characters (109,000 in Unicode 5), Unicode characters would normally need a three-byte representation (one byte for the Unicode plane, and two bytes for the character code within the plane).

In order to maximize compatibility with existing code (some of which still uses plain ASCII text), the UTF-8 encoding was devised as a way to store Unicode characters using only the minimal amount of space, as described in Joachim Sauer's answer.
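
To see that compatibility concretely, here is a small Python sketch (the byte string is an invented example): data written by ASCII-only software decodes identically as UTF-8, so legacy data keeps working:

    legacy = b"plain ASCII text"  # bytes written by ASCII-only software
    assert legacy.decode("ascii") == legacy.decode("utf-8")
    # Every valid ASCII byte stream is also a valid UTF-8 byte stream.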

So it's common to see files encoded with a specific charset such as ISO-8859-1 when the file is meant to be edited or read only by software (and people) that understand just those languages, and with UTF-8 when there is a need to be highly interoperable and culture-independent. The current tendency is for UTF-8 to replace the other charsets, even though this requires work from software developers, since UTF-8 strings are more complicated to handle than fixed-width charset strings.
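
A short Python sketch of that interoperability problem (the sample string is arbitrary): the same bytes read with the wrong charset either fail outright or silently turn into mojibake:

    utf8_bytes = "déjà vu".encode("utf-8")      # b'd\xc3\xa9j\xc3\xa0 vu'
    print(utf8_bytes.decode("latin-1"))         # 'dÃ©jÃ  vu' -- garbled, but no error
    latin1_bytes = "déjà vu".encode("latin-1")  # b'd\xe9j\xe0 vu'
    # latin1_bytes.decode("utf-8") raises UnicodeDecodeError, because
    # 0xE9 followed by 'j' is not a valid UTF-8 sequence.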

2 Comments

Unicode is also needed for fancier kinds of punctuation and symbols, like ¢£€ and “curly quotes” and such. It need not be substantially harder to work with Unicode if a programming language starts with Unicode as its base character set; then you don’t have to worry about variable-width encodings — or shouldn’t.
Having Unicode support (UTF-8 / UTF-16 / UTF-whatever) does not mean you "don't have to worry about variable-width encodings"; many people wrongly assume UTF-16 is fixed-width. Unicode, in whichever encoding, is primarily about being able to properly represent the full space of Unicode code points. See utf8everywhere.org/#conclusions ("UTF-16 is the worst of both worlds, being both variable length and too wide") and utf8everywhere.org/#faq.why.care

XML documents can contain non-ASCII characters, like Norwegian æ ø å or French ê è é. So, to avoid errors, declare the encoding, or save the XML file as Unicode.
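
For instance (a minimal sketch using Python's standard-library xml.etree; the document content is made up), declaring the encoding in the XML prolog lets the parser interpret the non-ASCII bytes correctly:

    import xml.etree.ElementTree as ET

    # The declaration tells the parser how to decode the bytes that follow.
    doc = '<?xml version="1.0" encoding="ISO-8859-1"?><note>æ ø å ê è é</note>'
    root = ET.fromstring(doc.encode("ISO-8859-1"))  # parse the raw bytes
    print(root.text)  # æ ø å ê è é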

XML Encoding Rules



When data is stored or transferred, it is only bytes, and those bytes need some interpretation. Users with non-English locales used to have problems with characters that only appeared in their locale; those characters were frequently displayed incorrectly.

Because XML carries information about how to interpret its bytes, the characters can be displayed correctly.
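
A small Python sketch of that point (the element name is invented): the same bytes parse correctly when the declaration matches, and fail when the parser has to guess:

    import xml.etree.ElementTree as ET

    body = "<name>Ægir</name>".encode("latin-1")  # Æ is the single byte 0xC6
    declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body
    print(ET.fromstring(declared).text)  # Ægir -- bytes interpreted as declared
    # ET.fromstring(body) would raise a ParseError: without a declaration the
    # parser assumes UTF-8, and the lone byte 0xC6 is not valid UTF-8.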

2 Comments

Note that English itself also used to have troubles: ASCII and EBCDIC, for example, use entirely different encodings even for "normal" English characters. Encoding is not just for "the rest of the world" ;-)
@Joachim: Very much agreed. The hyper-conservative and reactionary notion that ASCII was good enough for our grandparents so it should be good enough for us is ridiculously short-sighted — and bogus. But terribly common.
