4

Which are the valid xml encoding strings? For instance, what is the way of specifying UTF-8:

  • encoding="utf-8"
  • encoding="utf8"
  • etc

Or Windows 1251:

  • encoding="windows-1251"
  • encoding="windows1251"
  • encoding="cp-1251"
  • etc.

I am writing a character decoder as well as an XML parser, so I need to be able to set the encoding of my StreamReader based on the value of the encoding attribute.

Any ideas where I could find a list of the official encoding strings?

The best I could find is this, but it seems to be IE specific.

Thanks!

1
  • I'd be very interested to know why you are writing your own XML parser. Any reason you don't use an existing parser? Commented Oct 19, 2010 at 10:01

4 Answers

10

If all fails, read the spec :-).

4.3.3 Character Encoding in Entities

Each external parsed entity in an XML document may use a different encoding for its characters.

[...]

In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" SHOULD be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-n" (where n is the part number) SHOULD be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" SHOULD be used for the various encoded forms of JIS X-0208-1997.

It is RECOMMENDED that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings SHOULD use names starting with an "x-" prefix.

Source: http://www.w3.org/TR/REC-xml/

So UTF-8 is written as encoding="UTF-8".

For other character sets not listed above, use the names given in the IANA character set list.

Case of the letters in the character set name is not significant: "However, no distinction is made between use of upper and lower case letters." (IANA character set list). So you could also write encoding="uTf-8" if you feel like it ;-).
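For a parser that has to accept whatever name shows up in the declaration, the rule above could be sketched roughly like this. This is only an illustration, assuming Python and its standard `codecs` registry as the decoder backend (a `StreamReader`-based setup would need its own lookup). The helper names `is_valid_enc_name` and `resolve_encoding` are made up for this example; the syntax check follows the `EncName` production of the XML 1.0 spec.

```python
import codecs
import re

# EncName production from XML 1.0: [A-Za-z] ([A-Za-z0-9._] | '-')*
_ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_valid_enc_name(name):
    """Return True if `name` is syntactically allowed in an encoding declaration."""
    return _ENC_NAME.fullmatch(name) is not None


def resolve_encoding(name):
    """Map a declared encoding name to a Python codec name, or None.

    Assumption: Python's codec registry stands in for the IANA list here.
    It knows many, but not all, registered names and aliases.
    """
    if not is_valid_enc_name(name):
        return None
    try:
        # codecs.lookup() ignores letter case, matching the IANA rule.
        return codecs.lookup(name).name
    except LookupError:
        return None
```

With this, `resolve_encoding("uTf-8")` and `resolve_encoding("UTF-8")` resolve to the same codec, while a name the registry does not know yields `None`, so the caller can report an unsupported encoding instead of guessing.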

BTW: Are you really, really certain you want to write your own XML parser? This sounds suspiciously like reinventing the wheel.

4 Comments

+1 for 'read the spec', -1 for 'if all fails' (it should be the first port of call when writing a parser, not the last) and +1 again for 'reinventing the wheel' ;)
@David Dorward Thanks :-). To be honest, in general I would not recommend the spec as first port of call to a beginner; many specs can be rather daunting. But the spec is the place to go if you can't find the answer in a tutorial (or if you want to be certain what is right). Anyway, you probably noted the smiley next to "if all fails".
The smiley is next to read the spec :) Seriously though, the question suggests the goal is to write a general parser, so it needs to cover everything that it might be parsing, and that really really needs the spec as it lays out the requirements in technical terms. I'd be very surprised if anybody wrote documentation that provided enough information to write a parser that was aimed at beginners.
As sleske said, it all goes to the IANA list: iana.org/assignments/character-sets Thanks a lot! I've been stupid not to find this in the spec. Yes, I need my own parser for some embarrassing reasons. Thanks, again!
2
<?xml version="1.0" encoding="utf-8"?> 

should be fine for utf-8.

0

Use the command locale -a to list the locales (and their encodings) available on the system: http://dwbitechguru.blogspot.ca/2014/07/check-foreign-characters-support-on.html

Option A: To add encoding using the below tags:

You can edit the encoding attribute in the DTD using XMLSpy.

Related links: http://dwbitechguru.blogspot.ca/2014/07/issue-xml-reader-error.html

1 Comment

Put a few more spaces before your XML to get it to format properly.
0

You can explicitly declare the character encoding in the XML declaration at the beginning of an XML document using the encoding attribute:

<?xml version="1.0" encoding="UTF-8"?>
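Going the other way, a parser has to read that attribute back out of the raw bytes before it can pick a decoder. A minimal sketch in Python, under these assumptions: the first bytes of the document are in an ASCII-compatible encoding, any byte order mark has already been stripped, and `declared_encoding` is a hypothetical helper name. It falls back to UTF-8 when no encoding is declared.

```python
import re

# Start of an XML declaration, capturing the encoding value (quoted with " or ').
# Assumption: the declaration is readable as ASCII bytes, so UTF-16 input
# must be handled before this step.
_DECL = re.compile(
    rb'''<\?xml\s[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']'''
)


def declared_encoding(head, default="UTF-8"):
    """Return the encoding named in the XML declaration at the start of `head`.

    `head` is the first chunk of the document as bytes. If there is no
    declaration, or it has no encoding attribute, `default` is returned.
    """
    match = _DECL.match(head)
    if match is None:
        return default
    return match.group(1).decode("ascii")
```

The returned string is exactly what the document declared, so it still needs to be mapped case-insensitively onto whatever decoder names your platform uses.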

Which are the valid xml encoding strings?

Valid XML encoding strings are names that identify a character encoding recognized by the XML processor. The XML 1.0 specification requires all XML processors to support UTF-8 and UTF-16. Other encoding names should match the charset names registered with IANA:

https://www.iana.org/assignments/character-sets/character-sets.xml
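Since UTF-8 and UTF-16 are the two encodings every processor must handle, a parser usually checks for a byte order mark before it even looks at the declaration. The following is a rough Python sketch of that check, not a full implementation of the spec's autodetection rules; `sniff_bom` is a hypothetical helper name.

```python
import codecs

# Byte order marks for the two mandatory encodings.
# Limitation of this sketch: a UTF-32 little-endian BOM also starts with
# the UTF-16 little-endian BOM and would be misreported as UTF-16.
_BOMS = (
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16"),
    (codecs.BOM_UTF16_LE, "UTF-16"),
)


def sniff_bom(head):
    """Return (encoding name, BOM length) for the bytes in `head`.

    Returns (None, 0) when no known byte order mark is present.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0
```

For example, `sniff_bom(b"\xef\xbb\xbf<a/>")` reports UTF-8 with a three-byte mark. If no mark is found, the encoding declaration (or the UTF-8 default) decides.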

what is the way of specifying UTF-8?

encoding="UTF-8"

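Encoding names are case-insensitive, so "UTF-8" and "utf-8" are both valid, but the IANA registered name is UTF-8, so prefer that. Note that "utf8" (without the hyphen) is a different string and is not a registered IANA name.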

where I could find a list of the official encoding string?

https://www.iana.org/assignments/character-sets/character-sets.xml

For further reference:

What is encoding in XML?

Meaning of - <?xml version="1.0" encoding="utf-8"?>
