What Is UTF-8 And Why Is It Important?

Unicode is a character set supported across many commonly used software applications and operating systems. For example, many popular web browser, e-mail, and word processing applications support Unicode. Operating systems that support Unicode include Solaris Operating Environment, Linux, Microsoft Windows 2000, and Apple's Mac OS X. Applications that support Unicode are often capable of displaying multiple languages and scripts within the same document. In a multilingual office or business setting, Unicode's importance as a universal character set cannot be overlooked.

Unicode is the only practical character set option for applications that support multilingual documents. However, applications do have several options for how they encode Unicode. An encoding is the mapping of Unicode code points to a stream of storable code units or octets. The most common encodings include the following:

  • UTF-8
  • UTF-16
  • UTF-32

Each encoding has advantages and drawbacks. However, one encoding in particular, UTF-8, has gained widespread acceptance. This article describes what UTF-8 is and why it is important.

Table 1 defines some terms that are used in this document.

Table 1 Common Definitions

Term                  Definition
Character Set         A repertoire of characters that have been collected together for some purpose.
Coded Character Set   An ordered character set in which each character has an assigned integer value.
Code Point            The integer value of a character within a coded character set.
Character Encoding    A mapping of code points to a series of bytes.
Code Unit             The smallest unit of storage an encoding uses: a single octet or byte in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32.
Charset               A set of characters that has been encoded using a character encoding. Often used as a synonym for character encoding.

What is it?

Unicode 3.1 code points exist in the range U+0000 - U+10FFFF. Although each code point can be stored and manipulated as a 32-bit integer, a 32-bit wide character encoding is unlikely to win immediate acceptance everywhere. This is especially true for Western European and other non-Asian nations, whose legacy character sets need as little as one byte per character.

UTF-8 is a multibyte encoding in which each character is encoded in as few as one byte and as many as four bytes. Most Western European languages require fewer than two bytes per character. For example, characters from Latin-based scripts require only 1.1 bytes on average. Greek, Arabic, Hebrew, and Russian require an average of 1.7 bytes. Finally, Japanese, Korean, and Chinese typically require three bytes per character. [1]
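The per-character byte counts above can be observed directly with the JDK's built-in UTF-8 encoder. The following is only a small sketch: the class name Utf8Lengths, the helper utf8Length, and the sample code points are chosen here for illustration and are not part of the original article.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Lengths {
    /** Returns how many bytes the JDK's UTF-8 encoder produces for one code point. */
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Illustrative samples: Latin A, e with acute, Greek alpha,
        // a CJK ideograph, and a code point outside the BMP.
        int[] samples = {0x41, 0xE9, 0x3B1, 0x5B57, 0x20021};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
        // The five samples need 1, 2, 2, 3, and 4 bytes respectively.
    }
}
```

The averages quoted in the text (1.1 or 1.7 bytes) come from mixing characters of different lengths in running text; any single character always takes a whole number of bytes.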

The encoding algorithm is straightforward. Table 2 below shows the byte values that a well-formed UTF-8 sequence uses for each range of code points.

Table 2 UTF-8 Bit Encoding of a Unicode Code Point

Character Range       1st Byte   2nd Byte   3rd Byte   4th Byte
U+0000..U+007F        00..7F
U+0080..U+07FF        C2..DF     80..BF
U+0800..U+0FFF        E0         A0..BF     80..BF
U+1000..U+CFFF        E1..EC     80..BF     80..BF
U+D000..U+D7FF        ED         80..9F     80..BF
U+D800..U+DFFF        ill-formed
U+E000..U+FFFF        EE..EF     80..BF     80..BF
U+10000..U+3FFFF      F0         90..BF     80..BF     80..BF
U+40000..U+FFFFF      F1..F3     80..BF     80..BF     80..BF
U+100000..U+10FFFF    F4         80..8F     80..BF     80..BF

As the above table shows, characters in the range U+0000 - U+007F can be encoded as a single byte. This means that the ASCII charset can be represented unchanged with a single byte of storage space. The next range, U+0080 - U+07FF, covers Latin letters with diacritics along with alphabetic scripts such as Greek, Cyrillic, Hebrew, and Arabic. This range requires two bytes of encoded storage. The notable scripts in the range U+0800 - U+FFFF are Chinese, Korean, and Japanese. These scripts require three bytes of storage for each character. Finally, the range beyond the Basic Multilingual Plane (U+10000 - U+10FFFF) contains the characters that UTF-16 represents as surrogate pairs. Most of the new characters in this range are Chinese ideographs. The newly defined characters in this range require four bytes in the UTF-8 encoding.
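As a worked example of the three-byte case, the sketch below splits the ideograph U+5B57 into the bit groups 1110xxxx 10xxxxxx 10xxxxxx. The class and method names are hypothetical, and the method assumes its argument lies in U+0800 - U+FFFF and is not a surrogate; it does no validation.

```java
final class ThreeByteExample {
    /**
     * Splits a code point in U+0800..U+FFFF into its three UTF-8 bytes.
     * Assumption: the caller passes a non-surrogate value in that range.
     */
    static int[] encodeThreeByte(int cp) {
        int b1 = 0xE0 | (cp >> 12);          // 1110xxxx: top 4 bits
        int b2 = 0x80 | ((cp >> 6) & 0x3F);  // 10xxxxxx: middle 6 bits
        int b3 = 0x80 | (cp & 0x3F);         // 10xxxxxx: low 6 bits
        return new int[] {b1, b2, b3};
    }

    public static void main(String[] args) {
        // U+5B57 is 0101 1011 0101 0111 in binary:
        //   0101   -> 1110 0101 = E5
        //   101101 -> 10 101101 = AD
        //   010111 -> 10 010111 = 97
        int[] b = encodeThreeByte(0x5B57);
        System.out.println(String.format("%02X %02X %02X", b[0], b[1], b[2]));
        // prints: E5 AD 97
    }
}
```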

Algorithms for producing a UTF-8 encoded character can be very simple. The following Java code shows how you can easily create your own UTF-8 encoder [2]:

/**
 * Converts an array of Unicode scalar values (code points) into
 * UTF-8. This algorithm works under the assumption that all
 * surrogate pairs have already been converted into scalar code
 * point values within the argument.
 *
 * @param ch an array of Unicode scalar values (code points)
 * @return a byte[] containing the UTF-8 encoded characters
 */
public static byte[] encode(int[] ch) {
    // determine how many bytes are needed for the complete conversion
    int bytesNeeded = 0;
    for (int i = 0; i < ch.length; i++) {
        if (ch[i] < 0x80) {
            ++bytesNeeded;
        }
        else if (ch[i] < 0x0800) {
            bytesNeeded += 2;
        }
        else if (ch[i] < 0x10000) {
            bytesNeeded += 3;
        }
        else {
            bytesNeeded += 4;
        }
    }
    // allocate a byte[] of the necessary size
    byte[] utf8 = new byte[bytesNeeded];
    // do the conversion from character code points to utf-8
    for (int i = 0, bytes = 0; i < ch.length; i++) {
        if (ch[i] < 0x80) {
            // one byte: 0xxxxxxx
            utf8[bytes++] = (byte)ch[i];
        }
        else if (ch[i] < 0x0800) {
            // two bytes: 110xxxxx 10xxxxxx
            utf8[bytes++] = (byte)(ch[i] >> 6 | 0xC0);
            utf8[bytes++] = (byte)(ch[i] & 0x3F | 0x80);
        }
        else if (ch[i] < 0x10000) {
            // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            utf8[bytes++] = (byte)(ch[i] >> 12 | 0xE0);
            utf8[bytes++] = (byte)(ch[i] >> 6 & 0x3F | 0x80);
            utf8[bytes++] = (byte)(ch[i] & 0x3F | 0x80);
        }
        else {
            // four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            utf8[bytes++] = (byte)(ch[i] >> 18 | 0xF0);
            utf8[bytes++] = (byte)(ch[i] >> 12 & 0x3F | 0x80);
            utf8[bytes++] = (byte)(ch[i] >> 6 & 0x3F | 0x80);
            utf8[bytes++] = (byte)(ch[i] & 0x3F | 0x80);
        }
    }
    return utf8;
}
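One possible way to sanity-check a hand-written encoder like the one above is to compare its output with the bytes produced by the JDK's own UTF-8 encoder. The sketch below only produces those reference bytes; the class name Utf8Reference and the helpers jdkEncode and toHex are hypothetical names introduced for illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Reference {
    /** Encodes an array of code points with the JDK's built-in UTF-8 encoder. */
    static byte[] jdkEncode(int[] codePoints) {
        String s = new String(codePoints, 0, codePoints.length);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Formats bytes as space-separated, two-digit uppercase hex values. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One sample from each of the one-, two-, three-, and four-byte ranges.
        int[] sample = {0x41, 0xE9, 0x5B57, 0x10400};
        System.out.println(toHex(jdkEncode(sample)));
        // prints: 41 C3 A9 E5 AD 97 F0 90 90 80
    }
}
```

Comparing the two results with java.util.Arrays.equals for a handful of code points from each range is a cheap regression test. Note that the hand-written encoder does not reject surrogate values or values above U+10FFFF, so such inputs are outside what this comparison covers.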

Why is it Important?

UTF-8 is an important encoding for the following reasons:

  • ASCII compatible
  • easily supported
  • compact and efficient for most scripts
  • easily processed, unlike other multibyte encodings

At the recent Unicode Conference in Hong Kong, one company said that their move to Unicode was simplified by the adoption of UTF-8. Rather than changing their products' code to support 16-bit or 32-bit wide Unicode characters, they chose UTF-8. What was their reason? They said that their system had lots of hard-coded comparisons to find specific ASCII characters in text. Instead of modifying their code everywhere, they simply changed their character encoding to UTF-8, which is compatible with ASCII. In other words, single-byte ASCII characters retain their encoded value in UTF-8. For example, code that checks for a backslash ('\') can continue checking for the byte value 0x5C instead of changing the code to check for 0x005C. Modifying hundreds of lines of text processing code scattered throughout thousands of lines of miscellaneous code can be time-consuming and error-prone. Sometimes selecting the UTF-8 encoding can provide the easiest and most cost-effective way to get a basic level of Unicode support in a legacy application.
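The kind of hard-coded byte comparison described above can be sketched as follows. The class AsciiSearch, its methods, and the sample string are hypothetical; the point is only that a byte-level search for 0x5C still works on UTF-8 text, because every byte of a multibyte UTF-8 sequence is 0x80 or greater and so cannot be mistaken for an ASCII character.

```java
import java.nio.charset.StandardCharsets;

final class AsciiSearch {
    /** Legacy-style scan: index of the first byte equal to target, or -1. */
    static int indexOfByte(byte[] text, byte target) {
        for (int i = 0; i < text.length; i++) {
            if (text[i] == target) {
                return i;
            }
        }
        return -1;
    }

    /** Encodes s as UTF-8 and looks for the backslash byte 0x5C. */
    static int backslashIndexInUtf8(String s) {
        return indexOfByte(s.getBytes(StandardCharsets.UTF_8), (byte) 0x5C);
    }

    public static void main(String[] args) {
        // "café\menü": the e-acute takes two bytes, so the backslash is byte 5.
        System.out.println(backslashIndexInUtf8("caf\u00E9\\men\u00FC"));
        // prints: 5
    }
}
```

The returned value is a byte offset, not a character index, which is usually what legacy byte-oriented code expects.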

Most applications have basic text-handling algorithms. Many of those algorithms make flawed assumptions about a character's storage requirements. For example, many programmers assume that a character requires only a single byte of storage. Another common assumption, especially for C programmers, is that a text string never contains the value 0x00. If this value does appear, it typically marks the end of the text string. Encodings like UTF-16 and UTF-32 store characters as 16- or 32-bit values. When a string of 16- or 32-bit values is processed as a series of byte values, the value 0x00 often appears, especially in Latin-based scripts. This complicates and confuses existing text processing algorithms, leading to miscalculated string lengths, oddly concatenated strings, and search failures. On the other hand, because UTF-8's basic code unit is a byte, legacy algorithms can typically run with only minor adjustments, if any.
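The 0x00 problem is easy to reproduce. The sketch below, with hypothetical names, counts zero bytes in the same text encoded as big-endian UTF-16 (without a byte order mark) and as UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class ZeroBytes {
    /** Counts how many bytes in data are equal to 0x00. */
    static int countZeroBytes(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if (b == 0) {
                count++;
            }
        }
        return count;
    }

    static int zeroBytesInUtf16(String s) {
        return countZeroBytes(s.getBytes(StandardCharsets.UTF_16BE));
    }

    static int zeroBytesInUtf8(String s) {
        return countZeroBytes(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // "Hello" in UTF-16BE is 00 48 00 65 00 6C 00 6C 00 6F.
        System.out.println(zeroBytesInUtf16("Hello")); // prints: 5
        // In UTF-8 it is 48 65 6C 6C 6F, with no zero bytes.
        System.out.println(zeroBytesInUtf8("Hello"));  // prints: 0
    }
}
```

A C-style routine that stops at the first 0x00 would treat the UTF-16 form as an empty string, while the UTF-8 form passes through unchanged.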

One complaint often aimed at Unicode is that it requires so much more space than legacy encodings for Latin-based scripts. In other words, UTF-16 and UTF-32 require 16 or 32 bits of storage for most characters instead of the single byte required by the ISO-8859 series of encodings. However, UTF-8 stores the ASCII subset of all these charsets in a single byte per character. The ASCII subset is by far the most used set of characters for Western European and American languages. As mentioned earlier, most Western European languages can be written with 1.1 bytes per character on average. This is almost as efficient as ASCII, but it allows for up to four bytes per character for rare characters and obscure scripts when necessary.
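To make the space comparison concrete, the sketch below measures one short German phrase in several encodings. The class name, method names, and sample phrase are illustrative assumptions. The UTF-32 size is computed as four bytes per code point rather than by calling a UTF-32 charset, since java.nio.charset.StandardCharsets does not define one.

```java
import java.nio.charset.StandardCharsets;

final class EncodedSizes {
    static int latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Bytes(String s) {
        // UTF_16BE writes no byte order mark, so only the text is counted.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    static int utf32Bytes(String s) {
        // UTF-32 uses exactly four bytes per code point.
        return s.codePointCount(0, s.length()) * 4;
    }

    public static void main(String[] args) {
        String text = "Unicode f\u00FCr alle"; // 16 characters, one with a diacritic
        System.out.println("ISO-8859-1: " + latin1Bytes(text)); // 16
        System.out.println("UTF-8:      " + utf8Bytes(text));   // 17
        System.out.println("UTF-16:     " + utf16Bytes(text));  // 32
        System.out.println("UTF-32:     " + utf32Bytes(text));  // 64
    }
}
```

For this particular phrase UTF-8 needs 17 bytes for 16 characters, or about 1.06 bytes per character, close to the 1.1 average quoted above.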

Although many new development projects standardize quickly on Unicode, older projects often used legacy character sets that supported a small set of related languages. Experienced internationalization and localization engineers remember updating text processing algorithms to handle both "single-byte" and "multibyte" character sets. Do you remember updating your code to check "lead" bytes and possibly "trail" bytes during processing? Remember how difficult it was to find the beginning of a character if your index into the text was an arbitrary location? The problem was that trail bytes could also be lead bytes in some encodings. The Shift-JIS encoding, for example, was difficult to process backwards for this reason.

When Unicode became available as a fixed-width 16-bit encoding, many were excited to throw out multibyte encodings. Understandably, you may be hesitant to adopt a multibyte Unicode encoding after all the troubles you may have had with multibyte Asian character sets. However, UTF-8 is different, and it doesn't have all of the same problems as those legacy encodings. For example, it is much easier to find the start of a character from any arbitrary point in a text string. So-called "trail" bytes of a UTF-8 character sequence always have the bit pattern 10xxxxxx, so it is easy to find one's way back to the beginning of a character. A character pointer is at most three bytes away from the character's beginning. Even with most Asian ideographs, character boundaries are at most just a couple of bytes away. Figure 1 shows several characters and their encoding in UTF-8. Notice the hexadecimal byte sequence E5, AD, 97. If asked to find the character's beginning from the location marked 1, we could proceed as follows to find the character boundary at location 2 in the figure:

  1. Does the current byte start with the bit pattern 10xxxxxx?
  2. If yes, move left and go to step #1.
  3. If no, the current byte is the start of the character. Finished.

Figure 1 Finding Character Boundaries is Relatively Simple

Unlike some legacy character encodings, UTF-8 is fairly easy to parse and manipulate. The bit patterns of the encoding allow you to quickly determine whether your character index points to a character's beginning or somewhere else. Moving backward or forward within a string is easy.
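The boundary search listed earlier, along with a matching forward step, might be sketched like this. The class and method names are hypothetical, and the code assumes the byte array holds well-formed UTF-8; it does not validate its input.

```java
final class Utf8Boundaries {
    /** True if b has the trail-byte bit pattern 10xxxxxx. */
    static boolean isTrailByte(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /** Moves left from index until the byte is not a trail byte. */
    static int startOfCharacter(byte[] utf8, int index) {
        int i = index;
        while (i > 0 && isTrailByte(utf8[i])) {
            i--;
        }
        return i;
    }

    /** Returns the index of the character that follows the one starting at start. */
    static int nextCharacter(byte[] utf8, int start) {
        int i = start + 1;
        while (i < utf8.length && isTrailByte(utf8[i])) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // 'A', then the three-byte sequence E5 AD 97, then 'B'.
        byte[] text = {0x41, (byte) 0xE5, (byte) 0xAD, (byte) 0x97, 0x42};
        System.out.println(startOfCharacter(text, 3)); // prints: 1
        System.out.println(nextCharacter(text, 1));    // prints: 4
    }
}
```

Because a trail byte can never be confused with a lead byte or an ASCII byte, neither loop needs any state beyond the current index.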

Summary

UTF-8 is a compact, efficient Unicode encoding. It is a multibyte encoding that distributes a Unicode code point's bit pattern across one, two, three, or even four bytes.

UTF-8 encodes ASCII in only one byte. That means that languages that use Latin-based scripts can be represented with only 1.1 bytes per character on average. Other languages may require more bytes per character. Only the Asian scripts have significant encoding overhead in UTF-8 compared to UTF-16.

UTF-8 is useful for legacy systems that want Unicode support because developers don't have to drastically modify text processing code. Code that assumes single-byte code units typically doesn't fail completely when given UTF-8 text instead of ASCII or even Latin-1.

Finally, unlike some legacy encodings, UTF-8 is easy to parse. So-called lead and trail bytes are easily distinguished. Moving forwards or backwards in a text string is easier in UTF-8 than in many other multibyte encodings.


[1] Forms of Unicode, Mark Davis, September 1999, http://www.ibm.com/developerworks/unicode/library/utfencodingforms/index.html
[2] This code has not been optimized for size or speed.

© 2001 John O'Conner. John O'Conner is a staff engineer specializing in Java internationalization.
