What is encoding in XML? The encoding normally used is UTF-8. How is it different from other encodings? What is the purpose of using it?
- Refer to the W3C recommendation on encoding. – Nishant, Apr 14, 2011 at 9:58
- @Nishant: that's not really a good introduction to the topic of character encodings in general. And I think that's what the question is really about. – Joachim Sauer, Apr 14, 2011 at 9:59
- I just added the XML specs because the OP pointed out that the 'normal encoding used is UTF-8'. It isn't supposed to be an answer. – Nishant, Apr 14, 2011 at 10:02
- @Joachim: yes, it's not very clear in the W3C recommendation. Can you suggest any other link? – trilawney, Apr 14, 2011 at 14:15
- You mean apart from the links I gave you in the answer below? – Joachim Sauer, Apr 14, 2011 at 14:17
4 Answers
A character encoding specifies how characters are mapped onto bytes. Since XML documents are stored and transferred as byte streams, an encoding is needed to represent the Unicode characters that make up an XML document.
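As a rough illustration of that mapping, here is a minimal Python sketch. The sample character and the three encodings are arbitrary choices for the example, not anything XML prescribes. It shows the same character turning into different bytes depending on the encoding:

```python
# One character, three encodings: the character stays the same,
# only the bytes used to store it differ.
char = "é"  # U+00E9, LATIN SMALL LETTER E WITH ACUTE

encoded = {
    name: char.encode(name)
    for name in ("utf-8", "iso-8859-1", "utf-16-be")
}

for name, raw in encoded.items():
    # Print each byte sequence as space-separated hex digits.
    print(f"{name:>10}: {raw.hex(' ')}")

# Decoding with the matching encoding recovers the original character.
assert all(raw.decode(name) == char for name, raw in encoded.items())
```

A reader of those bytes has to know which encoding was used; the bytes alone do not say.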
UTF-8 is the default because it has several advantages:
- it is compatible with ASCII in that all valid ASCII-encoded text is also valid UTF-8 (but not necessarily the other way around!)
- it uses only 1 byte per character for "common" letters (those that also exist in ASCII), and 2 to 4 bytes for all other characters
- it can represent all existing Unicode characters
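The three properties above can be checked with a short Python sketch. The helper name `utf8_length` and the sample strings are made up for this illustration:

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))

# 1. ASCII text is byte-for-byte identical when encoded as UTF-8 ...
ascii_text = "<note>hello</note>"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# ... but UTF-8 text is not necessarily valid ASCII.
try:
    "é".encode("utf-8").decode("ascii")
    utf8_is_ascii = True
except UnicodeDecodeError:
    utf8_is_ascii = False

# 2. ASCII characters take 1 byte; other characters take 2 to 4 bytes.
lengths = {ch: utf8_length(ch) for ch in ("a", "é", "€", "\U0001F600")}

# 3. A character outside the first Unicode plane still round-trips.
round_trip = "\U0001F600".encode("utf-8").decode("utf-8") == "\U0001F600"

print(same_bytes, utf8_is_ascii, lengths, round_trip)
```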
Character encodings are a more general topic than just XML. UTF-8 is not restricted to being used in XML only.
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text is an article that gives a good overview of the topic.
When computers were first created, they mostly worked only with characters found in the English language, leading to the 7-bit US-ASCII standard.
However, there are a lot of different written languages in the world, and ways had to be found to be able to use them in computers.
The first way works fine if you restrict yourself to a certain language: use a culture-specific encoding, such as ISO-8859-1, which represents the characters of Western European languages in 8 bits, or GB2312 for Chinese characters.
The second way is a bit more complicated, but in theory it can represent every character in the world: the Unicode standard, in which every character from every language is assigned a specific code (a code point). Because the number of characters is so large (roughly 109,000 as of Unicode 6.0), a code point does not fit in one or two bytes. Code points range from U+0000 to U+10FFFF, so each one can be thought of as a three-byte value: one byte for the Unicode plane and two bytes for the character code within that plane.
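To make the plane-plus-code split concrete, here is a small Python sketch. The function name `plane_and_offset` is invented for this example; `ord` is the built-in that returns a character's code point:

```python
def plane_and_offset(ch: str) -> tuple[int, int]:
    """Split a character's code point into (plane, 16-bit code within the plane)."""
    code_point = ord(ch)
    return code_point >> 16, code_point & 0xFFFF

for ch in ("A", "€", "\U0001F600"):
    plane, offset = plane_and_offset(ch)
    print(f"U+{ord(ch):04X}: plane {plane}, code 0x{offset:04X}")
```

Most characters in everyday use live in plane 0; the third sample character is in plane 1.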
To maximize compatibility with existing code (some of which still uses ASCII text), the UTF-8 encoding was devised as a way to store Unicode characters using only as many bytes as each character needs, as described in Joachim Sauer's answer.
So, it's common to see files encoded with specific charsets such as ISO-8859-1 if the file is meant to be edited or read only by software (and people) understanding only these languages, and UTF-8 when there's a need to be highly interoperable and culture-independent. The current tendency is for UTF-8 to replace other charsets, even though this requires work from software developers, because UTF-8 is a variable-width encoding and its strings are more complicated to handle than fixed-width charset strings.
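One way to see why variable-width strings need more care is the following Python sketch. The sample word and the helper `safe_decode` are assumptions made for the illustration. It compares slicing the same text by bytes in ISO-8859-1 and in UTF-8:

```python
from typing import Optional

text = "naïve"
latin1 = text.encode("iso-8859-1")  # fixed width: 1 byte per character
utf8 = text.encode("utf-8")         # variable width: "ï" takes 2 bytes

# Fixed width: byte number i is always character number i.
third_char = latin1[2:3].decode("iso-8859-1")

def safe_decode(raw: bytes) -> Optional[str]:
    """Decode UTF-8, returning None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

# Variable width: cutting after 3 bytes splits "ï" in the middle.
cut_mid_char = safe_decode(utf8[:3])
cut_on_boundary = safe_decode(utf8[:4])

print(len(text), len(latin1), len(utf8))
print(third_char, cut_mid_char, cut_on_boundary)
```

Code that counts, truncates, or indexes text by bytes therefore has to respect character boundaries when the data is UTF-8.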
2 Comments
- ¢£€ and “curly quotes” and such. It need not be substantially harder to work with Unicode if a programming language starts with Unicode as its base character set; then you don’t have to worry about variable-width encodings, or shouldn’t.
XML documents can contain non-ASCII characters, such as the Norwegian æ, ø, å or the French ê, è, é. So, to avoid errors, you set the encoding in the XML declaration or save the XML file in a Unicode encoding such as UTF-8.
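As a hedged sketch of what setting the encoding looks like in practice, the Python snippet below builds the same small document twice, once stored as ISO-8859-1 and once as UTF-8, and parses both with the standard library's `xml.etree.ElementTree`. The element name `navn` and the sample text are made up for the example:

```python
import xml.etree.ElementTree as ET

text = "æøå"

# Same document, stored as ISO-8859-1 bytes with a matching declaration.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    f"<navn>{text}</navn>"
).encode("iso-8859-1")

# Same document, stored as UTF-8 bytes with a matching declaration.
utf8_doc = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    f"<navn>{text}</navn>"
).encode("utf-8")

# The parser reads the declaration and decodes the bytes accordingly.
latin1_root = ET.fromstring(latin1_doc)
utf8_root = ET.fromstring(utf8_doc)

print(latin1_doc != utf8_doc)               # the stored bytes differ
print(latin1_root.text == utf8_root.text)   # the parsed characters match
```

The declaration has to match the encoding the file was actually saved in; the parser trusts it when turning bytes back into characters.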
When data is stored or transferred, it is only bytes. Those bytes need some interpretation. Users with non-English locales used to have problems with characters that appeared only in their locale: those characters were frequently displayed incorrectly.
Because an XML document carries information about how to interpret its bytes, its characters can be displayed correctly.
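The kind of wrong display described above can be reproduced with a few lines of Python. This is only a sketch, with the sample text chosen arbitrarily: the bytes are written as UTF-8 and then read back under a different assumption:

```python
original = "æøå"
stored = original.encode("utf-8")   # what actually ends up on disk

# A reader that guesses the wrong encoding sees garbage ...
wrong = stored.decode("iso-8859-1")

# ... while a reader that is told the right encoding sees the original text.
right = stored.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

Each two-byte UTF-8 sequence is shown as two unrelated ISO-8859-1 characters, which is the typical symptom of a missing or wrong encoding declaration.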