Unicode is a character set supported across many commonly used software applications and operating systems. For example, many popular web browsers, e-mail clients, and word processing applications support Unicode. Operating systems that support Unicode include the Solaris Operating Environment, Linux, Microsoft Windows 2000, and Apple's Mac OS X. Applications that support Unicode are often capable of displaying multiple languages and scripts within the same document. In a multilingual office or business setting, Unicode's importance as a universal character set cannot be overstated.
Unicode is the only practical character set option for applications that support multilingual documents. However, applications have several options for how they encode Unicode. An encoding is a mapping of Unicode code points to a stream of storable code units, or octets. The most common encodings are UTF-8, UTF-16, and UTF-32.
Table 1 defines some terms that are used in this document.
Table 1 Common Definitions
| Term | Definition |
| --- | --- |
| Character Set | A repertoire of characters that have been collected together for some purpose. |
| Coded Character Set | An ordered character set in which each character has an assigned integer value. |
| Code Point | The integer value of a character within a coded character set. |
| Character Encoding | A mapping of code points to a series of bytes. |
| Code Unit | A single octet or byte of an encoded character. |
| Charset | A set of characters that has been encoded using a character encoding. Often used as a synonym for character encoding. |
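To make the relationship between code points and code units concrete, the short sketch below prints the byte sequences that the three common Unicode encodings produce for the same character. This is an illustration only; it assumes a modern Java runtime whose java.nio.charset API supports the UTF-32BE charset, and the class name is hypothetical.

import java.nio.charset.Charset;

public class EncodingDemo {
    public static void main(String[] args) {
        String s = "\u00E9";   // U+00E9, LATIN SMALL LETTER E WITH ACUTE
        for (String name : new String[] { "UTF-8", "UTF-16BE", "UTF-32BE" }) {
            StringBuilder hex = new StringBuilder();
            for (byte b : s.getBytes(Charset.forName(name))) {
                hex.append(String.format("%02X ", b));
            }
            System.out.println(name + ": " + hex.toString().trim());
        }
    }
}

Running this prints "UTF-8: C3 A9", "UTF-16BE: 00 E9", and "UTF-32BE: 00 00 00 E9": one code point, three different streams of code units.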
UTF-8 is a multibyte encoding in which each character can be encoded in as little as one byte and as many as four bytes. Most Western European languages require less than two bytes per character. For example, characters from Latin-based scripts require only 1.1 bytes on average. Greek, Arabic, Hebrew, and Russian require an average of 1.7 bytes. Finally, Japanese, Korean, and Chinese typically require three bytes per character. [1]
The encoding algorithm is straightforward. Table 2 below shows how bits from a Unicode code point are arranged in the encoding for different character ranges.
Table 2 UTF-8 Bit Encoding of a Unicode Code Point
| Character Range | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
| --- | --- | --- | --- | --- |
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
| U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
| U+D000..U+D7FF | ED | 80..9F | 80..BF | |
| U+D800..U+DFFF | ill-formed (surrogate code points) | | | |
| U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
As the table above shows, characters in the range U+0000..U+007F can be encoded as a single byte. This means that the ASCII charset can be represented unchanged, with a single byte of storage per character. The next range, U+0080..U+07FF, contains the remaining characters for most of the world's scripts, including characters with diacritics. This range requires two bytes of encoded storage. The notable scripts in the range U+0800..U+FFFF are Chinese, Korean, and Japanese. These scripts require three bytes of storage for each character. Finally, the non-BMP range contains characters that are represented as surrogate pairs in UTF-16. Most of the new characters in this range are Chinese ideographs, and they require four bytes each in the UTF-8 encoding.
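As a worked example, consider the ideograph 字 (U+5B57), which falls in the three-byte range. Its 16 bits, 0101 1011 0101 0111, are split 4-6-6 across the three bytes:

1st byte: 1110 0101 = E5   (1110 prefix + high four bits, 0101)
2nd byte: 10 101101 = AD   (10 prefix + middle six bits, 101101)
3rd byte: 10 010111 = 97   (10 prefix + low six bits, 010111)

The UTF-8 byte sequence for U+5B57 is therefore E5 AD 97.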
Algorithms for producing a UTF-8 encoded character can be very simple. The following Java code shows how you can easily create your own UTF-8 encoder [2]:
/**
 * Converts an array of Unicode scalar values (code points) into
 * UTF-8. This algorithm works under the assumption that all
 * surrogate pairs have already been converted into scalar code
 * point values within the argument.
 *
 * @param ch an array of Unicode scalar values (code points)
 * @return a byte[] containing the UTF-8 encoded characters
 */
public static byte[] encode(int[] ch) {
    // determine how many bytes are needed for the complete conversion
    int bytesNeeded = 0;
    for (int i = 0; i < ch.length; i++) {
        if (ch[i] < 0x80) {
            ++bytesNeeded;
        }
        else if (ch[i] < 0x0800) {
            bytesNeeded += 2;
        }
        else if (ch[i] < 0x10000) {
            bytesNeeded += 3;
        }
        else {
            bytesNeeded += 4;
        }
    }
    // allocate a byte[] of the necessary size
    byte[] utf8 = new byte[bytesNeeded];
    // do the conversion from character code points to utf-8
    for (int i = 0, bytes = 0; i < ch.length; i++) {
        if (ch[i] < 0x80) {
            // one byte: 0xxxxxxx
            utf8[bytes++] = (byte) ch[i];
        }
        else if (ch[i] < 0x0800) {
            // two bytes: 110xxxxx 10xxxxxx
            utf8[bytes++] = (byte) (ch[i] >> 6 | 0xC0);
            utf8[bytes++] = (byte) (ch[i] & 0x3F | 0x80);
        }
        else if (ch[i] < 0x10000) {
            // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            utf8[bytes++] = (byte) (ch[i] >> 12 | 0xE0);
            utf8[bytes++] = (byte) (ch[i] >> 6 & 0x3F | 0x80);
            utf8[bytes++] = (byte) (ch[i] & 0x3F | 0x80);
        }
        else {
            // four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            utf8[bytes++] = (byte) (ch[i] >> 18 | 0xF0);
            utf8[bytes++] = (byte) (ch[i] >> 12 & 0x3F | 0x80);
            utf8[bytes++] = (byte) (ch[i] >> 6 & 0x3F | 0x80);
            utf8[bytes++] = (byte) (ch[i] & 0x3F | 0x80);
        }
    }
    return utf8;
}
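As a quick check, here is a hypothetical call that exercises all four sequence lengths:

int[] codePoints = { 0x41, 0xE9, 0x5B57, 0x1D11E };
byte[] utf8 = encode(codePoints);
// utf8 now contains 41, C3 A9, E5 AD 97, F0 9D 84 9E:
// one, two, three, and four bytes respectively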
Most applications have basic text-handling algorithms. Many of those algorithms make flawed assumptions about a character's storage requirements. For example, many programmers assume that a character requires only a single byte of storage. Another common assumption, especially for C programmers, is that a text string never contains the value 0x00. If this value does appear, it typically marks the end of the text string. Encodings like UTF-16 and UTF-32 store characters as 16- or 32-bit values. When a string of 16- or 32-bit values is processed as a series of byte values, the value 0x00 often appears, especially in Latin-based scripts. This complicates and confuses existing text processing algorithms, leading to miscalculated string lengths, oddly concatenated strings, and search failures. On the other hand, because UTF-8's basic code unit is a byte, legacy algorithms can typically run with only minor adjustments, if any.
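The following sketch, which assumes a modern Java runtime, shows why: the UTF-16 encoding of plain ASCII text is full of 0x00 bytes, while its UTF-8 encoding is byte-for-byte identical to ASCII. The class name is hypothetical.

import java.nio.charset.StandardCharsets;

public class ZeroBytes {
    public static void main(String[] args) {
        byte[] utf16 = "Abc".getBytes(StandardCharsets.UTF_16BE);
        // utf16 = 00 41 00 62 00 63: a byte-oriented strlen() would stop at the first 00
        byte[] utf8 = "Abc".getBytes(StandardCharsets.UTF_8);
        // utf8 = 41 62 63: identical to the ASCII bytes, with no embedded zeros
        System.out.println(utf16.length + " vs. " + utf8.length);   // prints "6 vs. 3"
    }
}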
One complaint often aimed at Unicode is that it requires much more space than legacy encodings for Latin-based scripts. In other words, UTF-16 and UTF-32 require 16 or 32 bits of storage, respectively, for most characters, instead of the single byte required by the ISO-8859 series of encodings. However, UTF-8 stores the ASCII subset of all these charsets in a single byte, and the ASCII subset is by far the most heavily used set of characters for Western European and American languages. As mentioned earlier, most Western European languages can be written with 1.1 bytes per character on average. This is almost as efficient as ASCII, yet UTF-8 still allows up to four bytes per character for rare characters and obscure scripts when necessary.
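A quick, hypothetical measurement makes the point. For a mostly ASCII Latin-script string, UTF-8 stays close to one byte per character, while UTF-16 doubles the storage (this fragment assumes java.nio.charset.StandardCharsets is imported):

String latin = "caf\u00E9";   // "café": three ASCII characters plus U+00E9
int utf8Len  = latin.getBytes(StandardCharsets.UTF_8).length;      // 5 bytes (1.25 per character)
int utf16Len = latin.getBytes(StandardCharsets.UTF_16BE).length;   // 8 bytes (2 per character)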
Although many new development projects standardize quickly on Unicode, older projects often used legacy character sets that supported a small set of related languages. Experienced internationalization and localization engineers remember updating text processing algorithms to handle both "single-byte" and "multibyte" character sets. Do you remember updating your code to check "lead" bytes and possibly "trail" bytes during processing? Remember how difficult it was to find the beginning of a character if your index into the text was an arbitrary location? The problem was that trail bytes could also be lead bytes in some encodings. The Shift-JIS encoding, for example, was difficult to process backwards for this reason.
When Unicode became available as a fixed-width 16-bit encoding, many were excited to throw out multibyte encodings. Understandably, you may be hesitant to adopt a multibyte Unicode encoding after all the troubles you may have had with multibyte Asian character sets. However, UTF-8 is different, and it doesn't have all of the same problems as those legacy encodings. For example, it is much easier to find the start of a character from any arbitrary point in a text string. So-called "trail" bytes of a UTF-8 character sequence always have the bit pattern 10xxxxxx, so it is easy to find one's way back to the beginning of a character. A character pointer is at most three bytes away from the character's beginning. Even with most Asian ideographs, character boundaries are at most just a couple of bytes away. Figure 1 shows several characters and their encoding in UTF-8. Notice the hexadecimal byte sequence E5, AD, 97. If asked to find the character's beginning from the location marked 1, we simply move backward one byte at a time, skipping each byte that matches the trail-byte pattern 10xxxxxx; the first byte that does not match it, E5, marks the character boundary at location 2 in the figure.
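A minimal sketch of that backward scan, using a hypothetical helper name, might look like this:

/**
 * Returns the index of the first byte of the UTF-8 character
 * that contains the byte at index i. Assumes well-formed UTF-8.
 */
public static int charStart(byte[] utf8, int i) {
    // trail bytes always match the bit pattern 10xxxxxx (0x80..0xBF)
    while ((utf8[i] & 0xC0) == 0x80) {
        --i;
    }
    return i;
}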
Unlike some legacy character encodings, UTF-8 is fairly easy to parse and manipulate. The bit patterns of the encoding allow you to quickly determine whether your character index points to a character's beginning or somewhere else. Moving backward or forward within a string is easy.
UTF-8 encodes ASCII in only one byte. That means that languages that use Latin-based scripts can be represented with only 1.1 bytes per character on average. Other languages may require more bytes per character. Only the Asian scripts have significant encoding overhead in UTF-8 compared to UTF-16.
UTF-8 is useful for legacy systems that want Unicode support because developers don't have to drastically modify text processing code. Code that assumes single-byte code units typically doesn't fail completely when provided UTF-8 text instead of ASCII or even Latin-1.
Finally, unlike some legacy encodings, UTF-8 is easy to parse. So-called lead and trail bytes are easily distinguished. Moving forwards or backwards in a text string is easier in UTF-8 than many other multibyte encodings.
© 2001 John O'Conner. John O'Conner is a staff engineer specializing in Java internationalization.