Monthly Archive for September, 2005

Why UTF-8 is the best encoding for XML

Recently we had a short discussion on what character encoding should be used in XML: UTF-16 or UTF-8?

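One thing has to be mentioned first because it is often mixed up: UTF-8 and UTF-16 are both Unicode. The difference lies in how the characters are serialized to bytes. UTF-8 uses one to four bytes per character: ASCII characters take one byte, while an umlaut such as ä (U+00E4) takes two (0xC3 0xA4). UTF-16 uses two or four bytes per character, so ä is stored as the single 16-bit unit 0x00E4. In both encodings umlauts can be written literally, e.g. <node>äöü</node>; numeric character references such as &#xE4; are optional in either case.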
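To make the byte-level difference concrete, here is a minimal Python sketch that encodes the same three umlauts both ways. The variable names are illustrative only, and the big-endian variant of UTF-16 is chosen purely so that no byte order mark appears in the output.

```python
# The same three Unicode characters, serialized two different ways.
text = "äöü"

utf8_bytes = text.encode("utf-8")       # two bytes per umlaut
utf16_bytes = text.encode("utf-16-be")  # one 16-bit unit per umlaut, no BOM

print(utf8_bytes.hex(" "))   # c3 a4 c3 b6 c3 bc
print(utf16_bytes.hex(" "))  # 00 e4 00 f6 00 fc

# Both byte sequences decode back to the identical string.
same_text = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-be")
print(same_text)  # True
```

The byte sequences differ, but the characters they represent are the same, which is why the choice of encoding is a serialization detail rather than a choice between "Unicode" and "not Unicode".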

Every XML parser has to support UTF-8 as well as UTF-16. This has nothing to do with whether your source code is compiled as Unicode or ANSI; converting correctly is the responsibility of the parser.

Excerpt from the W3C XML specification:

Each external parsed entity in an XML document MAY use a different encoding for its characters. All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings. The terms “UTF-8” and “UTF-16” in this specification do not apply to character encodings with any other labels, even if the encodings or labels are very similar to UTF-8 or UTF-16.
Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
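The byte order mark check described in the excerpt can be sketched in a few lines of Python. This is only an illustration, not how any particular parser does it: `sniff_bom` is a hypothetical helper name, it looks at the BOM alone, and it ignores both UTF-32 and the encoding declaration that a real XML processor would also consult.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading byte order mark, or None."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    return None  # no BOM: the caller must fall back on other information


# Python's generic "utf-16" codec writes a BOM in the platform's byte order.
print(sniff_bom("<node/>".encode("utf-16")))
# A plain UTF-8 document without a BOM yields None.
print(sniff_bom("<node/>".encode("utf-8")))
```

Because the BOM is optional for UTF-8, a `None` result here does not mean the document is invalid; it only means the signature alone was not enough to decide.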

I recommend writing XML documents strictly in UTF-8. The advantages are described in the following article: Encode your XML documents in UTF-8.

The advantages of UTF-8 in short:

  • It offers broad tool support, including the best compatibility with legacy ASCII systems.
  • It’s straightforward and efficient to process.
  • It’s resistant to corruption.
  • It’s platform neutral.

And one rule should always hold: never use encodings other than UTF-8 or UTF-16, because XML parsers are not required to support them.
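As a rough illustration of following this advice in practice, the sketch below writes a small document as UTF-8 using Python's standard `xml.etree.ElementTree` module and reads it back. The element name and content are taken from the example earlier in this post; everything else is an assumption made for the demonstration.

```python
import io
import xml.etree.ElementTree as ET

# Build a tiny document containing non-ASCII characters.
root = ET.Element("node")
root.text = "äöü"

# Serialize explicitly as UTF-8, including the XML declaration.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
xml_bytes = buffer.getvalue()

# Parsing the bytes recovers the original characters.
parsed = ET.fromstring(xml_bytes)
print(parsed.text)
```

Stating the encoding explicitly when writing avoids depending on a library's default, which differs between tools and platforms.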