UnicodeStandard-12.0
Chapter 2 General Structure
-
2-5 Encoding Forms
Computers handle numbers not simply as abstract mathematical objects, but as combinations of fixed-size units like bytes and 32-bit words. A character encoding model must take this fact into account when determining how to associate numbers with the characters.
Actual implementations in computer systems represent integers in specific code units of particular size—usually 8-bit (= byte), 16-bit, or 32-bit. In the Unicode character encoding model, precisely defined encoding forms specify how each integer (code point) for a Unicode character is to be expressed as a sequence of one or more code units. The Unicode Standard provides three distinct encoding forms for Unicode characters, using 8-bit, 16-bit, and 32-bit units. These are named UTF-8, UTF-16, and UTF-32, respectively. The “UTF” is a carryover from earlier terminology meaning Unicode (or UCS) Transformation Format. Each of these three encoding forms is an equally legitimate mechanism for representing Unicode characters; each has advantages in different environments.
All three encoding forms can be used to represent the full range of encoded characters in the Unicode Standard; they are thus fully interoperable for implementations that may choose different encoding forms for various reasons. Each of the three Unicode encoding forms can be efficiently transformed into either of the other two without any loss of data.
Non-overlap. Each of the Unicode encoding forms is designed with the principle of non-overlap in mind. Figure 2-9 presents an example of an encoding where overlap is permitted. In this encoding (Windows code page 932), characters are formed from either one or two code bytes. Whether a sequence is one or two bytes in length depends on the first byte, so that the values for lead bytes (of a two-byte sequence) and single bytes are disjoint. However, single-byte values and trail-byte values can overlap. That means that when someone searches for the character “D”, for example, he or she might find it either (mistakenly) as the trail byte of a two-byte sequence or as a single, independent byte. To find out which alternative is correct, a program must look backward through text.
The situation is made more complex by the fact that lead and trail bytes can also overlap, as shown in the second part of Figure 2-9. This means that the backward scan has to repeat until it hits the start of the text or hits a sequence that could not exist as a pair, as shown in Figure 2-10. This is not only inefficient, but also extremely error-prone: corruption of one byte can cause entire lines of text to be corrupted.
The Unicode encoding forms avoid this problem, because none of the ranges of values for the lead, trail, or single code units in any of those encoding forms overlap.
Non-overlap makes all of the Unicode encoding forms well behaved for searching and comparison. When searching for a particular character, there will never be a mismatch against some code unit sequence that represents just part of another character. The fact that all Unicode encoding forms observe this principle of non-overlap distinguishes them from many legacy East Asian multibyte character encodings, for which overlap of code unit sequences may be a significant problem for implementations.
Another aspect of non-overlap in the Unicode encoding forms is that all Unicode characters have determinate boundaries when expressed in any of the encoding forms. That is, the edges of code unit sequences representing a character are easily determined by local examination of code units; there is never any need to scan back indefinitely in Unicode text to correctly determine a character boundary. This property of the encoding forms has sometimes been referred to as self-synchronization. This property has another very important implication: corruption of a single code unit corrupts only a single character; none of the surrounding characters are affected.
For example, when randomly accessing a string, a program can find the boundary of a character with limited backup. In UTF-16, if a pointer points to a trailing surrogate, a single backup is required. In UTF-8, if a pointer points to a byte starting with 10xxxxxx (in binary), one to three backups are required to find the beginning of the character.
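The limited-backup logic described above can be sketched as follows. This is an illustrative sketch, not library code; the function names are invented here, and the UTF-16 variant takes its code units as a list of integers.

```python
def utf8_char_start(data: bytes, i: int) -> int:
    """Back up from index i to the start of the UTF-8 sequence containing it.
    Trail bytes have the form 10xxxxxx, so at most three backups are needed."""
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

def utf16_char_start(units: list[int], i: int) -> int:
    """Back up from index i to the start of the UTF-16 sequence containing it.
    A trailing (low) surrogate lies in DC00..DFFF, so one backup suffices."""
    if i > 0 and 0xDC00 <= units[i] <= 0xDFFF:
        return i - 1
    return i

s = "a\U00010000b"               # U+10000 needs four UTF-8 bytes
data = s.encode("utf-8")         # b'a\xf0\x90\x80\x80b'
print(utf8_char_start(data, 3))  # index 3 is a trail byte; the sequence starts at 1
```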
Conformance. The Unicode Consortium fully endorses the use of any of the three Unicode encoding forms as a conformant way of implementing the Unicode Standard. It is important not to fall into the trap of trying to distinguish “UTF-8 versus Unicode,” for example. UTF-8, UTF-16, and UTF-32 are all equally valid and conformant ways of implementing the encoded characters of the Unicode Standard.
Examples. Figure 2-11 shows the three Unicode encoding forms, including how they are related to Unicode code points.
In Figure 2-11, the UTF-32 line shows that each example character can be expressed with one 32-bit code unit. Those code units have the same values as the code point for the character. For UTF-16, most characters can be expressed with one 16-bit code unit, whose value is the same as the code point for the character, but characters with high code point values require a pair of 16-bit surrogate code units instead. In UTF-8, a character may be expressed with one, two, three, or four bytes, and the relationship between those byte values and the code point value is more complex.
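These relationships can be observed with any codec library. The sketch below uses Python’s built-in codecs; the example characters are chosen here for illustration and are not necessarily those shown in Figure 2-11.

```python
import struct

# Example characters: one ASCII, one BMP, one supplementary code point.
for ch in ("A", "\u03A9", "\U00010384"):
    cp = ord(ch)
    utf32 = [cp]                                  # UTF-32 code unit == code point
    b16 = ch.encode("utf-16-be")                  # big-endian, no BOM
    utf16 = [u for (u,) in struct.iter_unpack(">H", b16)]
    utf8 = list(ch.encode("utf-8"))
    print(f"U+{cp:04X}  UTF-32 {[hex(u) for u in utf32]}  "
          f"UTF-16 {[hex(u) for u in utf16]}  UTF-8 {[hex(b) for b in utf8]}")
```

For the supplementary character U+10384, the UTF-16 line shows the surrogate pair D800 DF84 rather than the code point value itself.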
UTF-8, UTF-16, and UTF-32 are further described in the subsections that follow. See each subsection for a general overview of how each encoding form is structured and the general benefits or drawbacks of each encoding form for particular purposes. For the detailed formal definition of the encoding forms and conformance requirements, see Section 3.9, Unicode Encoding Forms.
UTF-32
UTF-32 is the simplest Unicode encoding form. Each Unicode code point is represented directly by a single 32-bit code unit. Because of this, UTF-32 has a one-to-one relationship between encoded character and code unit; it is a fixed-width character encoding form. This makes UTF-32 an ideal form for APIs that pass single character values.
As for all of the Unicode encoding forms, UTF-32 is restricted to representation of code points in the range 0..10FFFF₁₆—that is, the Unicode codespace. This guarantees interoperability with the UTF-16 and UTF-8 encoding forms.
Fixed Width. The value of each UTF-32 code unit corresponds exactly to the Unicode code point value. This situation differs significantly from that for UTF-16 and especially UTF-8, where the code unit values often change unrecognizably from the code point value. For example, U+10000 is represented as <00010000> in UTF-32 and as <D800 DC00> in UTF-16.
Preferred Usage. UTF-32 may be a preferred encoding form where memory or disk storage space for characters is not a particular concern, but where fixed-width, single code unit access to characters is desired. UTF-32 is also a preferred encoding form for processing characters on most Unix platforms.
UTF-16
In the UTF-16 encoding form, code points in the range U+0000..U+FFFF are represented as a single 16-bit code unit; code points in the supplementary planes, in the range U+10000..U+10FFFF, are represented as pairs of 16-bit code units. These pairs of special code units are known as surrogate pairs. The values of the code units used for surrogate pairs are completely disjunct from the code units used for the single code unit representations, thus maintaining non-overlap for all code point representations in UTF-16. For the formal definition of surrogates, see Section 3.8, Surrogates.
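The correspondence between a supplementary code point and its surrogate pair is formally defined in Section 3.9; a minimal sketch of the arithmetic:

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Map a supplementary code point (U+10000..U+10FFFF) to its
    UTF-16 high and low surrogate code units."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                       # 20-bit value
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogate_pair(hi: int, lo: int) -> int:
    """Inverse mapping: recombine a high and a low surrogate."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

print([hex(u) for u in to_surrogate_pair(0x10000)])   # ['0xd800', '0xdc00']
```

Note how the surrogate ranges D800..DBFF and DC00..DFFF are disjoint from each other and from all single-unit values, which is what preserves non-overlap.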
Optimized for BMP. UTF-16 optimizes the representation of characters in the Basic Multilingual Plane (BMP)—that is, the range U+0000..U+FFFF. For that range, which contains the vast majority of common-use characters for all modern scripts of the world, each character requires only one 16-bit code unit, thus requiring just half the memory or storage of the UTF-32 encoding form. For the BMP, UTF-16 can effectively be treated as if it were a fixed-width encoding form.
Supplementary Characters and Surrogates. For supplementary characters, UTF-16 requires two 16-bit code units. The distinction between characters represented with one versus two 16-bit code units means that formally UTF-16 is a variable-width encoding form. That fact can create implementation difficulties if it is not carefully taken into account; UTF-16 is somewhat more complicated to handle than UTF-32.
Preferred Usage. UTF-16 may be a preferred encoding form in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact, and all the common, heavily used characters fit into a single 16-bit code unit.
Origin. UTF-16 is the historical descendant of the earliest form of Unicode, which was originally designed to use a fixed-width, 16-bit encoding form exclusively. The surrogates were added to provide an encoding form for the supplementary characters at code points past U+FFFF. The design of the surrogates made them a simple and efficient extension mechanism that works well with older Unicode implementations and that avoids many of the problems of other variable-width character encodings. See Section 5.4, Handling Surrogate Pairs in UTF-16, for more information about surrogates and their processing.
Collation. For the purpose of sorting text, binary order for data represented in the UTF-16 encoding form is not the same as code point order. This means that a slightly different comparison implementation is needed for code point order. For more information, see Section 5.17, Binary Order.
UTF-8
To meet the requirements of byte-oriented, ASCII-based systems, a third encoding form is specified by the Unicode Standard: UTF-8. This variable-width encoding form preserves ASCII transparency by making use of 8-bit code units.
Byte-Oriented. Much existing software and practice in information technology have long depended on character data being represented as a sequence of bytes. Furthermore, many of the protocols depend not only on ASCII values being invariant, but must make use of or avoid special byte values that may have associated control functions. The easiest way to adapt Unicode implementations to such a situation is to make use of an encoding form that is already defined in terms of 8-bit code units and that represents all Unicode characters while not disturbing or reusing any ASCII or C0 control code value. That is the function of UTF-8.
Variable Width. UTF-8 is a variable-width encoding form, using 8-bit code units, in which the high bits of each code unit indicate the part of the code unit sequence to which each byte belongs. A range of 8-bit code unit values is reserved for the first, or leading, element of a UTF-8 code unit sequence, and a completely disjunct range of 8-bit code unit values is reserved for the subsequent, or trailing, elements of such sequences; this convention preserves non-overlap for UTF-8. Table 3-6 on page 126 shows how the bits in a Unicode code point are distributed among the bytes in the UTF-8 encoding form. See Section 3.9, Unicode Encoding Forms, for the full, formal definition of UTF-8.
ASCII Transparency. The UTF-8 encoding form maintains transparency for all of the ASCII code points (0x00..0x7F). That means Unicode code points U+0000..U+007F are converted to single bytes 0x00..0x7F in UTF-8 and are thus indistinguishable from ASCII itself. Furthermore, the values 0x00..0x7F do not appear in any byte for the representation of any other Unicode code point, so that there can be no ambiguity. Beyond the ASCII range of Unicode, many of the non-ideographic scripts are represented by two bytes per code point in UTF-8; all non-surrogate code points between U+0800 and U+FFFF are represented by three bytes; and supplementary code points above U+FFFF require four bytes.
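The byte-count ranges just described can be summarized in a short function. This is an illustrative sketch (the function name is invented here), checked against a built-in encoder:

```python
def utf8_length(cp: int) -> int:
    """Number of bytes in the UTF-8 representation of a code point,
    following the ranges described above (surrogates excluded)."""
    if cp <= 0x7F:
        return 1          # ASCII-transparent range
    if cp <= 0x7FF:
        return 2          # covers most non-ideographic scripts
    if cp <= 0xFFFF:
        return 3          # rest of the BMP
    return 4              # supplementary code points

# Spot-check against Python's own UTF-8 encoder.
for cp in (0x41, 0x3A9, 0x4E00, 0x10384):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```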
Preferred Usage. UTF-8 is typically the preferred encoding form for HTML and similar protocols, particularly for the Internet. The ASCII transparency helps migration. UTF-8 also has the advantage that it is already inherently byte-serialized, as for most existing 8-bit character sets; strings of UTF-8 work easily with C or other programming languages, and many existing APIs that work for typical Asian multibyte character sets adapt to UTF-8 as well with little or no change required.
Self-synchronizing. In environments where 8-bit character processing is required for one reason or another, UTF-8 has the following attractive features as compared to other multibyte encodings:
• The first byte of a UTF-8 code unit sequence indicates the number of bytes to follow in a multibyte sequence. This allows for very efficient forward parsing.
• It is efficient to find the start of a character when beginning from an arbitrary location in a byte stream of UTF-8. Programs need to search at most four bytes backward, and usually much less. It is a simple task to recognize an initial byte, because initial bytes are constrained to a fixed range of values.
• As with the other encoding forms, there is no overlap of byte values.
Comparison of the Advantages of UTF-32, UTF-16, and UTF-8
On the face of it, UTF-32 would seem to be the obvious choice of Unicode encoding forms for an internal processing code because it is a fixed-width encoding form. It can be conformantly bound to the C and C++ wchar_t, which means that such programming languages may offer built-in support and ready-made string APIs that programmers can take advantage of. However, UTF-16 has many countervailing advantages that may lead implementers to choose it instead as an internal processing code.
While all three encoding forms need at most 4 bytes (or 32 bits) of data for each character, in practice UTF-32 in almost all cases for real data sets occupies twice the storage that UTF-16 requires. Therefore, a common strategy is to have internal string storage use UTF-16 or UTF-8 but to use UTF-32 when manipulating individual characters.
UTF-32 Versus UTF-16. On average, more than 99% of all UTF-16 data is expressed using single code units. This includes nearly all of the typical characters that software needs to handle with special operations on text—for example, format control characters. As a consequence, most text scanning operations do not need to unpack UTF-16 surrogate pairs at all, but rather can safely treat them as an opaque part of a character string.
For many operations, UTF-16 is as easy to handle as UTF-32, and the performance of UTF-16 as a processing code tends to be quite good. UTF-16 is the internal processing code of choice for a majority of implementations supporting Unicode. Other than for Unix platforms, UTF-16 provides the right mix of compact size with the ability to handle the occasional character outside the BMP.
UTF-32 has somewhat of an advantage when it comes to simplicity of software coding design and maintenance. Because the character handling is fixed width, UTF-32 processing does not require maintaining branches in the software to test and process the double code unit elements required for supplementary characters by UTF-16. Conversely, 32-bit indices into large tables are not particularly memory efficient. To avoid the large memory penalties of such indices, Unicode tables are often handled as multistage tables (see “Multistage Tables” in Section 5.1, Data Structures for Character Conversion). In such cases, the 32-bit code point values are sliced into smaller ranges to permit segmented access to the tables. This is true even in typical UTF-32 implementations.
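The multistage-table idea can be sketched as a two-stage lookup: high bits of the code point select a block, low bits select a position within it, and unassigned blocks share a single default block. The 8-bit split and the class name below are illustrative choices, not anything prescribed by the standard.

```python
BLOCK_BITS = 8                      # illustrative split: 256 entries per block
BLOCK_SIZE = 1 << BLOCK_BITS

class TwoStageTable:
    """Map code points 0..10FFFF to property values without a flat
    1,114,112-entry array; sparse blocks cost almost nothing."""
    def __init__(self, default=0):
        self.default_block = [default] * BLOCK_SIZE
        self.stage1 = [self.default_block] * (0x110000 >> BLOCK_BITS)

    def set(self, cp, value):
        hi, lo = cp >> BLOCK_BITS, cp & (BLOCK_SIZE - 1)
        if self.stage1[hi] is self.default_block:
            self.stage1[hi] = list(self.default_block)  # copy-on-write
        self.stage1[hi][lo] = value

    def get(self, cp):
        return self.stage1[cp >> BLOCK_BITS][cp & (BLOCK_SIZE - 1)]

t = TwoStageTable()
t.set(0x10384, 7)
print(t.get(0x10384), t.get(0x41))   # 7 0
```

Production implementations typically compile such tables into flat arrays and may add a third stage, but the slicing of the code point into index ranges is the same idea.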
The performance of UTF-32 as a processing code may actually be worse than the performance of UTF-16 for the same data, because the additional memory overhead means that cache limits will be exceeded more often and memory paging will occur more frequently. For systems with processor designs that impose penalties for 16-bit aligned access but have very large memories, this effect may be less noticeable.
Characters Versus Code Points. In any event, Unicode code points do not necessarily match user expectations for “characters.” For example, the following are not represented by a single code point: a combining character sequence such as a base letter plus a combining accent; a conjoining jamo sequence for Korean; or the Devanagari conjunct “ksha.”
UTF-8. UTF-8 is reasonably compact in terms of the number of bytes used. It is really only at a significant size disadvantage when used for East Asian implementations such as Chinese, Japanese, and Korean, which use Han ideographs or Hangul syllables requiring three-byte code unit sequences in UTF-8. UTF-8 is also significantly less efficient in terms of processing than the other encoding forms.
Binary Sorting. A binary sort of UTF-8 strings gives the same ordering as a binary sort of Unicode code points. This is obviously the same order as for a binary sort of UTF-32 strings.
All three encoding forms give the same results for binary string comparisons or string sorting when dealing only with BMP characters (in the range U+0000..U+FFFF). However, when dealing with supplementary characters (in the range U+10000..U+10FFFF), UTF-16 binary order does not match Unicode code point order. This can lead to complications when trying to interoperate with binary sorted lists—for example, between UTF-16 systems and UTF-8 or UTF-32 systems. However, for data that is sorted according to the conventions of a specific language or locale rather than using binary order, data will be ordered the same, regardless of the encoding form.
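The divergence is easy to demonstrate. In the sketch below, U+FF01 precedes U+10000 in code point order (and therefore in UTF-8 and UTF-32 binary order), yet its UTF-16 representation compares higher, because surrogate code units D800..DFFF lie below FF01:

```python
a, b = "\uFF01", "\U00010000"    # a BMP character and a supplementary character

# Code point order, matched by UTF-8 and UTF-32 binary order:
assert a.encode("utf-8") < b.encode("utf-8")
assert a.encode("utf-32-be") < b.encode("utf-32-be")

# UTF-16 binary order disagrees: <D800 DC00> compares lower than <FF01>.
assert a.encode("utf-16-be") > b.encode("utf-16-be")
print("UTF-16 binary order differs from code point order for this pair")
```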
-
2-6 Encoding Schemes
The discussion of Unicode encoding forms in the previous section was concerned with the machine representation of Unicode code units. Each code unit is represented in a computer simply as a numeric data type; just as for other numeric types, the exact way the bits are laid out internally is irrelevant to most processing. However, interchange of textual data, particularly between computers of different architectural types, requires consideration of the exact ordering of the bits and bytes involved in numeric representation. Integral data, including character data, is serialized for open interchange into well-defined sequences of bytes. This process of byte serialization allows all applications to correctly interpret exchanged data and to accurately reconstruct numeric values (and thereby character values) from it. In the Unicode Standard, the specifications of the distinct types of byte serializations to be used with Unicode data are known as Unicode encoding schemes.
Byte Order. Modern computer architectures differ in ordering in terms of whether the most significant byte or the least significant byte of a large numeric data type comes first in internal representation. These sequences are known as “big-endian” and “little-endian” orders, respectively. For the Unicode 16- and 32-bit encoding forms (UTF-16 and UTF-32), the specification of a byte serialization must take into account the big-endian or little-endian architecture of the system on which the data is represented, so that when the data is byte serialized for interchange it will be well defined.
A character encoding scheme consists of a specified character encoding form plus a specification of how the code units are serialized into bytes. The Unicode Standard also specifies the use of an initial byte order mark (BOM) to explicitly differentiate big-endian or little-endian data in some of the Unicode encoding schemes. (See the “Byte Order Mark” subsection in Section 23.8, Specials.)
When a higher-level protocol supplies mechanisms for handling the endianness of integral data types, it is not necessary to use Unicode encoding schemes or the byte order mark. In those cases Unicode text is simply a sequence of integral data types.
For UTF-8, the encoding scheme consists merely of the UTF-8 code units (= bytes) in sequence. Hence, there is no issue of big- versus little-endian byte order for data represented in UTF-8. However, for 16-bit and 32-bit encoding forms, byte serialization must break up the code units into two or four bytes, respectively, and the order of those bytes must be clearly defined. Because of this, and because of the rules for the use of the byte order mark, the three encoding forms of the Unicode Standard result in a total of seven Unicode encoding schemes, as shown in Table 2-4.
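As a sketch of the distinction between schemes, Python’s codecs expose the BE and LE serializations directly, while the unmarked "utf-16" codec prepends a BOM in the platform’s native byte order (little-endian on most machines, so the exact output is platform-dependent):

```python
s = "A"                                  # U+0041, one 16-bit code unit

print(s.encode("utf-16-be").hex())       # '0041' - UTF-16BE scheme, no BOM
print(s.encode("utf-16-le").hex())       # '4100' - UTF-16LE scheme, no BOM
print(s.encode("utf-16").hex())          # BOM (fffe or feff) + native-order bytes
```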
The endian order entry for UTF-8 in Table 2-4 is marked N/A because UTF-8 code units are 8 bits in size, and the usual machine issues of endian order for larger code units do not apply. The serialized order of the bytes must not depart from the order defined by the UTF-8 encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 23.8, Specials, for more information.
Encoding Scheme Versus Encoding Form. Note that some of the Unicode encoding schemes have the same labels as the three Unicode encoding forms. This could cause confusion, so it is important to keep the context clear when using these terms: character encoding forms refer to integral data units in memory or in APIs, and byte order is irrelevant; character encoding schemes refer to byte-serialized data, as for streaming I/O or in file storage, and byte order must be specified or determinable.
The Internet Assigned Numbers Authority (IANA) maintains a registry of charset names used on the Internet. Those charset names are very close in meaning to the Unicode character encoding model’s concept of character encoding schemes, and all of the Unicode character encoding schemes are, in fact, registered as charsets. While the two concepts are quite close and the names used are identical, some important differences may arise in terms of the requirements for each, particularly when it comes to handling of the byte order mark. Exercise due caution when equating the two.
Examples. Figure 2-12 illustrates the Unicode character encoding schemes, showing how each is derived from one of the encoding forms by serialization of bytes.
In Figure 2-12, the code units used to express each example character have been serialized into sequences of bytes. This figure should be compared with Figure 2-11, which shows the same characters before serialization into sequences of bytes. The “BE” lines show serialization in big-endian order, whereas the “LE” lines show the bytes reversed into little-endian order. For UTF-8, the code unit is just an 8-bit byte, so that there is no distinction between big-endian and little-endian order. UTF-32 and UTF-16 encoding schemes using the byte order mark are not shown in Figure 2-12, to keep the basic picture regarding serialization of bytes clearer.
For the detailed formal definition of the Unicode encoding schemes and conformance requirements, see Section 3.10, Unicode Encoding Schemes. For further general discussion about character encoding forms and character encoding schemes, both for the Unicode Standard and as applied to other character encoding standards, see Unicode Technical Report #17, “Unicode Character Encoding Model.” For information about charsets and character conversion, see Unicode Technical Standard #22, “Character Mapping Markup Language (CharMapML).”
-
2-7 Unicode Strings
A Unicode string data type is simply an ordered sequence of code units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units, a Unicode 16-bit string is an ordered sequence of 16-bit code units, and a Unicode 32-bit string is an ordered sequence of 32-bit code units.
Depending on the programming environment, a Unicode string may or may not be required to be in the corresponding Unicode encoding form. For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily well-formed UTF-16 sequences. In normal processing, it can be far more efficient to allow such strings to contain code unit sequences that are not well-formed UTF-16—that is, isolated surrogates. Because strings are such a fundamental component of every program, checking for isolated surrogates in every operation that modifies strings can create significant overhead, especially because supplementary characters are extremely rare as a percentage of overall text in programs worldwide.
It is straightforward to design basic string manipulation libraries that handle isolated surrogates in a consistent and straightforward manner. They cannot ever be interpreted as abstract characters, but they can be internally handled the same way as noncharacters where they occur. Typically they occur only ephemerally, such as in dealing with keyboard events. While an ideal protocol would allow keyboard events to contain complete strings, many allow only a single UTF-16 code unit per event. As a sequence of events is transmitted to the application, a string that is being built up by the application in response to those events may contain isolated surrogates at any particular point in time.
Whenever such strings are specified to be in a particular Unicode encoding form—even one with the same code unit size—the string must not violate the requirements of that encoding form. For example, isolated surrogates in a Unicode 16-bit string are not allowed when that string is specified to be well-formed UTF-16. A number of techniques are available for dealing with an isolated surrogate, such as omitting it, converting it into U+FFFD replacement character to produce well-formed UTF-16, or simply halting the processing of the string with an error. (See Section 3.9, Unicode Encoding Forms.)
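One of the repair strategies mentioned above, converting each isolated surrogate to U+FFFD, can be sketched over a sequence of 16-bit code units. The function name is invented here for illustration; complete surrogate pairs are passed through untouched.

```python
def to_well_formed(units):
    """Replace isolated surrogates in a list of 16-bit code units with
    U+FFFD so the result is a well-formed UTF-16 code unit sequence."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            out += [u, units[i + 1]]     # a complete surrogate pair
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            out.append(0xFFFD)           # isolated surrogate: replace
            i += 1
        else:
            out.append(u)
            i += 1
    return out

print([hex(u) for u in to_well_formed([0x41, 0xD800, 0x42])])
# ['0x41', '0xfffd', '0x42']
```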
-
2-8 Unicode Allocation
For convenience, the encoded characters of the Unicode Standard are grouped by linguistic and functional categories, such as script or writing system. For practical reasons, there are occasional departures from this general principle, as when punctuation associated with the ASCII standard is kept together with other ASCII characters in the range U+0020..U+007E rather than being grouped with other sets of general punctuation characters. By and large, however, the code charts are arranged so that related characters can be found near each other in the charts.
Grouping encoded characters by script or other functional categories offers the additional benefit of supporting various space-saving techniques in actual implementations, as for building tables or fonts. For more information on writing systems, see Section 6.1, Writing Systems.
Planes
The Unicode codespace consists of the single range of numeric values from 0 to 10FFFF₁₆, but in practice it has proven convenient to think of the codespace as divided up into planes of characters—each plane consisting of 64K code points. Because of these numeric conventions, the Basic Multilingual Plane is occasionally referred to as Plane 0. The last four hexadecimal digits in each code point indicate a character’s position inside a plane. The remaining digits indicate the plane. For example, U+23456 CJK UNIFIED IDEOGRAPH-23456 is found at location 3456₁₆ in Plane 2.
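This plane-and-position arithmetic amounts to a simple bit split; a minimal sketch (the function name is illustrative):

```python
def plane_and_offset(cp: int) -> tuple[int, int]:
    """Split a code point into its plane number (high digits) and its
    16-bit position within that plane (last four hex digits)."""
    return cp >> 16, cp & 0xFFFF

plane, pos = plane_and_offset(0x23456)
print(plane, hex(pos))    # 2 0x3456  (U+23456 is at 3456 in Plane 2)
```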
Basic Multilingual Plane. The Basic Multilingual Plane (BMP, or Plane 0) contains the common-use characters for all the modern scripts of the world as well as many historical and rare characters. By far the majority of all Unicode characters for almost all textual data can be found in the BMP.
Supplementary Multilingual Plane. The Supplementary Multilingual Plane (SMP, or Plane 1) is dedicated to the encoding of characters for scripts or symbols which either could not be fit into the BMP or see very infrequent usage. This includes many historic scripts, a number of lesser-used contemporary scripts, special-purpose invented scripts, notational systems or large pictographic symbol sets, and occasionally historic extensions of scripts whose core sets are encoded on the BMP.
Examples include Gothic (historic), Shavian (special-purpose invented), Musical Symbols (notational system), Domino Tiles (pictographic), and Ancient Greek Numbers (historic extension for Greek). A number of scripts, whether of historic or contemporary use, do not yet have their characters encoded in the Unicode Standard. The majority of scripts currently identified for encoding will eventually be allocated in the SMP. As a result, some areas of the SMP will experience common, frequent usage.
Supplementary Ideographic Plane. The Supplementary Ideographic Plane (SIP, or Plane 2) is intended as an additional allocation area for those CJK characters that could not be fit in the blocks set aside for more common CJK characters in the BMP. While there are a small number of common-use CJK characters in the SIP (for example, for Cantonese usage), the vast majority of Plane 2 characters are extremely rare or of historical interest only.
Supplementary Special-purpose Plane. The Supplementary Special-purpose Plane (SSP, or Plane 14) is the spillover allocation area for format control characters that do not fit into the small allocation areas for format control characters in the BMP.
Private Use Planes. The two Private Use Planes (Planes 15 and 16) are allocated, in their entirety, for private use. Those two planes contain a total of 131,068 characters to supplement the 6,400 private-use characters located in the BMP.
Allocation Areas and Blocks
Allocation Areas. The Unicode Standard does not have any normatively defined concept of areas or zones for the BMP (or other planes), but it is often handy to refer to the allocation areas of the BMP by the general types of the characters they include. These areas are merely a rough organizational device and do not restrict the types of characters that may end up being allocated in them. The description and ranges of areas may change from version to version of the standard as more new scripts, symbols, and other characters are encoded in previously reserved ranges.
Blocks. The various allocation areas are, in turn, divided up into character blocks (see D10b in Section 3.4, Characters and Encoding), which are normatively defined, and which are used to structure the actual code charts. For a complete listing of the normative blocks in the Unicode Standard, see Blocks.txt in the Unicode Character Database.
The normative status of blocks should not, however, be taken as indicating that they define significant sets of characters. For the most part, the blocks serve only as ranges to divide up the code charts and do not necessarily imply anything else about the types of characters found in the block. Block identity cannot be taken as a reliable guide to the source, use, or properties of characters, for example, and it cannot be reliably used alone to process characters. In particular:
• Blocks are simply named ranges, and many contain reserved code points.
• Characters used in a single writing system may be found in several different blocks. For example, characters used for letters for Latin-based writing systems are found in at least 14 different blocks: Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B, Latin Extended-C, Latin Extended-D, Latin Extended-E, IPA Extensions, Phonetic Extensions, Phonetic Extensions Supplement, Latin Extended Additional, Spacing Modifier Letters, Combining Diacritical Marks, and Combining Diacritical Marks Supplement.
• Characters in a block may be used with different writing systems. For example, the danda character is encoded in the Devanagari block but is used with numerous other scripts; Arabic combining marks in the Arabic block are used with the Syriac script; and so on.
• Block definitions are not at all exclusive. For instance, many mathematical operator characters are not encoded in the Mathematical Operators block—and are not even in any block containing “Mathematical” in its name; many currency symbols are not found in the Currency Symbols block, and so on.
For reliable specification of the properties of characters, one should instead turn to the detailed, character-by-character property assignments available in the Unicode Character Database. See also Chapter 4, Character Properties. For further discussion of the relationship between the blocks in the Unicode Standard and significant property assignments and sets of characters, see Unicode Standard Annex #24, “Unicode Script Property,” and Unicode Technical Standard #18, “Unicode Regular Expressions.”
Allocation Order. The allocation order of various scripts and other groups of characters reflects the historical evolution of the Unicode Standard. While there is a certain geographic sense to the ordering of the allocation areas for the scripts, this is only a very loose correlation.
Roadmap for Future Allocation. The unassigned ranges in the Unicode codespace will be filled with future script or symbol encodings on a space-available basis. The relevant character encoding committees follow an organized roadmap to help them decide where to encode new scripts and other characters within the available space. Until the characters are actually standardized, however, there are no absolute guarantees where future allocations will occur. In general, implementations should not make assumptions about where future scripts or other sets of symbols may be encoded based solely on the identity of neighboring blocks of characters already encoded.
See Section B.3, Other Unicode Online Resources, for information about the roadmap and about the pipeline of approved characters in process for future publication.
Assignment of Code Points
Code points in the Unicode Standard are assigned using the following guidelines:
• Where there is a single accepted standard for a script, the Unicode Standard generally follows it for the relative order of characters within that script.
• The first 256 codes follow precisely the arrangement of ISO/IEC 8859-1 (Latin-1), of which 7-bit ASCII (ISO/IEC 646 IRV) accounts for the first 128 code positions.
• Characters with common characteristics are located together contiguously. For example, the primary Arabic character block was modeled after ISO/IEC 8859-6. The Arabic script characters used in Persian, Urdu, and other languages, but not included in ISO/IEC 8859-6, are allocated after the primary Arabic character block. Right-to-left scripts are grouped together.
• In most cases, scripts with fewer than 128 characters are allocated so as not to cross 128-code-point boundaries (that is, they fit in ranges nn00..nn7F or nn80..nnFF). For supplementary characters, an additional constraint not to cross 1,024-code-point boundaries is applied (that is, scripts fit in ranges nn000..nn3FF, nn400..nn7FF, nn800..nnBFF, or nnC00..nnFFF). Such constraints enable better optimizations for tasks such as building tables for access to character properties.
• Codes that represent letters, punctuation, symbols, and diacritics that are generally shared by multiple languages or scripts are grouped together in several locations.
• The Unicode Standard does not correlate character code allocation with language-dependent collation or case. For more information on collation order, see Unicode Technical Standard #10, “Unicode Collation Algorithm.”
• Unified CJK ideographs are laid out in multiple blocks, each of which is arranged according to the Han ideograph arrangement defined in Section 18.1, Han. This ordering is roughly based on a radical-stroke count order.
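As an illustration of the table-building optimization mentioned in the guidelines above, the alignment of scripts to 128-code-point boundaries means that many 128-entry runs of property values are identical and can share storage in a two-stage lookup table. The following Python sketch (the function names are illustrative, not part of any standard) builds such a table for the General_Category property over the BMP using the standard `unicodedata` module:

```python
import unicodedata

# Illustrative sketch: a two-stage lookup table over the BMP. Because scripts
# rarely cross 128-code-point boundaries, many 128-entry runs of property
# values are identical and can share storage.
BLOCK_SIZE = 128

def build_two_stage_table(prop, limit=0x10000):
    """Compress per-code-point property values into (index, blocks)."""
    index, blocks, seen = [], [], {}
    for start in range(0, limit, BLOCK_SIZE):
        run = tuple(prop(chr(cp)) for cp in range(start, start + BLOCK_SIZE))
        if run not in seen:
            seen[run] = len(blocks)
            blocks.append(run)
        index.append(seen[run])
    return index, blocks

def lookup(index, blocks, cp):
    return blocks[index[cp // BLOCK_SIZE]][cp % BLOCK_SIZE]

index, blocks = build_two_stage_table(unicodedata.category)
print(len(index), "index entries,", len(blocks), "distinct blocks")
```

The number of distinct blocks comes out far smaller than the number of index entries; the same idea generalizes to 1,024-entry blocks for supplementary characters.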
-
2-9 Details of Allocation
This section provides a more detailed summary of the way characters are allocated in the Unicode Standard. Figure 2-13 gives an overall picture of the allocation areas of the Unicode Standard, with an emphasis on the identities of the planes. The following subsections discuss the allocation details for specific planes.
Plane 0 (BMP)
Figure 2-14 shows the Basic Multilingual Plane (BMP) in an expanded format to illustrate the allocation substructure of that plane in more detail. This section describes each allocation area, in the order of their location on the BMP.
ASCII and Latin-1 Compatibility Area. For compatibility with the ASCII and ISO 8859-1 (Latin-1) standards, this area contains the same repertoire and ordering as Latin-1. Accordingly, it contains the basic Latin alphabet, European digits, and then the same collection of miscellaneous punctuation, symbols, and additional Latin letters as are found in Latin-1.
General Scripts Area. The General Scripts Area contains a large number of modern-use scripts of the world, including Latin, Greek, Cyrillic, Arabic, and so on. Most of the characters encoded in this area are graphic characters. A subrange of the General Scripts Area is set aside for right-to-left scripts, including Hebrew, Arabic, Thaana, and N’Ko.
Punctuation and Symbols Area. This area is devoted mostly to all kinds of symbols, including many characters for use in mathematical notation. It also contains general punctuation, as well as most of the important format control characters.
Supplementary General Scripts Area. This area contains scripts or extensions to scripts that did not fit in the General Scripts Area itself. It contains the Glagolitic, Coptic, and Tifinagh scripts, plus extensions for the Latin, Cyrillic, Georgian, and Ethiopic scripts.
CJK Miscellaneous Area. The CJK Miscellaneous Area contains some East Asian scripts, such as Hiragana and Katakana for Japanese, punctuation typically used with East Asian scripts, lists of CJK radical symbols, and a large number of East Asian compatibility characters.
CJKV Ideographs Area. This area contains almost all the unified Han ideographs in the BMP. It is subdivided into a block for the Unified Repertoire and Ordering (the initial block of 20,902 unified Han ideographs plus a small number of later additions) and another block containing Extension A (an additional 6,582 unified Han ideographs).
General Scripts Area (Asia and Africa). This area contains numerous blocks for additional scripts of Asia and Africa, such as Yi, Cham, Vai, and Bamum. It also contains more spillover blocks with additional characters for the Latin, Devanagari, Myanmar, and Hangul scripts.
Hangul Area. This area consists of one large block containing 11,172 precomposed Hangul syllables, and one small block with additional, historic Hangul jamo extensions.
Surrogates Area. The Surrogates Area contains only surrogate code points and no encoded characters. See Section 23.6, Surrogates Area, for more details.

Private Use Area. The Private Use Area in the BMP contains 6,400 private-use characters.
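Although the Surrogates Area contains no encoded characters, the arithmetic that UTF-16 applies to these code points is simple to sketch. The helper names below are illustrative:

```python
# Sketch of the UTF-16 surrogate mechanism: a supplementary code point
# (U+10000..U+10FFFF) is expressed as a high surrogate (D800..DBFF)
# followed by a low surrogate (DC00..DFFF).
def to_surrogate_pair(cp):
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)  # top/bottom 10 bits

def from_surrogate_pair(high, low):
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print([hex(u) for u in to_surrogate_pair(0x10400)])  # ['0xd801', '0xdc00']
```

The round trip is exact, which is why the 2,048 code points of the Surrogates Area must never carry encoded characters of their own.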
Compatibility and Specials Area. This area contains many compatibility variants of characters from widely used corporate and national standards that have other representations in the Unicode Standard. For example, it contains Arabic presentation forms, whereas the basic characters for the Arabic script are located in the General Scripts Area. The Compatibility and Specials Area also contains twelve CJK unified ideographs, a few important format control characters, the basic variation selectors, and other special characters. See Section 23.8, Specials, for more details.
Plane 1 (SMP)
Figure 2-15 shows Plane 1, the Supplementary Multilingual Plane (SMP), in expanded format to illustrate the allocation substructure of that plane in more detail.
General Scripts Areas. These areas contain a large number of historic scripts, as well as a few regional scripts which have some current use. The first of these areas also contains a small number of symbols and numbers associated with ancient scripts.
General Scripts Areas (RTL). There are two subranges in the SMP which are set aside for historic right-to-left scripts, such as Phoenician, Kharoshthi, and Avestan. The second of these also defaults to Bidi_Class = R and is reserved for the encoding of other historic right-to-left scripts or symbols.
Cuneiform and Hieroglyphic Area. This area contains three large, ancient scripts: Sumero-Akkadian Cuneiform, Egyptian Hieroglyphs, and Anatolian Hieroglyphs. Other large hieroglyphic and pictographic scripts will be allocated in this area in the future.
Ideographic Scripts Area. This area is set aside for large, historic siniform (but non-Han) logosyllabic scripts such as Tangut, Jurchen, and Khitan, and other East Asian logosyllabic scripts such as Naxi. As of Unicode 12.0, this area contains a large set of Tangut ideographs and components, the Nüshu script, and a large set of hentaigana (historic, variant form kana) characters.
Symbols Areas. The first of these SMP Symbols Areas contains sets of symbols for notational systems, such as musical symbols, shorthands, and mathematical alphanumeric symbols. The second contains various game symbols, and large sets of miscellaneous symbols and pictographs, mostly used in compatibility mapping of East Asian character sets. Notable among these are emoji and emoticons.
Plane 2 (SIP)
Plane 2, the Supplementary Ideographic Plane (SIP), consists primarily of one big area, starting from the first code point in the plane, that is dedicated to encoding additional unified CJK characters. A much smaller area, toward the end of the plane, is dedicated to additional CJK compatibility ideographic characters—which are basically just duplicated character encodings required for round-trip conversion to various existing legacy East Asian character sets. The CJK compatibility ideographic characters in Plane 2 are currently all dedicated to round-trip conversion for the CNS standard and are intended to supplement the CJK compatibility ideographic characters in the BMP, a smaller number of characters dedicated to round-trip conversion for various Korean, Chinese, and Japanese standards.
Other Planes
The first 4,096 code positions on Plane 14 form an area set aside for special characters that have the Default_Ignorable_Code_Point property. A small number of tag characters, plus some supplementary variation selection characters, have been allocated there. All remaining code positions on Plane 14 are reserved for future allocation of other special-purpose characters.

Plane 15 and Plane 16 are allocated, in their entirety, for private use. Those two planes contain a total of 131,068 characters, to supplement the 6,400 private-use characters located in the BMP.

All other planes are reserved; there are no characters assigned in them. The last two code positions of all planes are permanently set aside as noncharacters. (See Section 2.13, Special Characters.)
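The plane-level structure just described can be sketched as a small classifier. This is illustrative only: it covers just the areas named above and deliberately ignores other noncharacters (such as U+FDD0..U+FDEF inside the BMP) and everything below Plane 14:

```python
# Rough sketch of the allocation structure of Planes 14-16, as described
# above: tag and variation-selector area on Plane 14, private-use Planes 15
# and 16, and the two noncharacters at the end of every plane.
def describe(cp):
    plane = cp >> 16
    if (cp & 0xFFFF) >= 0xFFFE:
        return "noncharacter"          # last two code positions of every plane
    if plane in (15, 16):
        return "private use"
    if plane == 14 and cp < 0xE1000:   # first 4,096 positions of Plane 14
        return "special-purpose (tags, variation selectors)"
    return "other"

print(describe(0xE0100))    # a supplementary variation selector
print(describe(0x10FFFE))   # last-but-one position of Plane 16
```

Note that the noncharacter test must come first, since positions such as U+FFFFE also lie inside an otherwise private-use plane.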
-
2-10 Writing Direction
Individual writing systems have different conventions for arranging characters into lines on a page or screen. Such conventions are referred to as a script’s directionality. For example, in the Latin script, characters are arranged horizontally from left to right to form lines, and lines are arranged from top to bottom, as shown in the first example of Figure 2-16.
Bidirectional. In most Semitic scripts such as Hebrew and Arabic, characters are arranged from right to left into lines, although digits run the other way, making the scripts inherently bidirectional, as shown in the second example in Figure 2-16. In addition, left-to-right and right-to-left scripts are frequently used together. In all such cases, arranging characters into lines becomes more complex. The Unicode Standard defines an algorithm to determine the layout of a line, based on the inherent directionality of each character, and supplemented by a small set of directional controls. See Unicode Standard Annex #9, “Unicode Bidirectional Algorithm,” for more information.
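The inherent directionality that the algorithm starts from is a per-character property (Bidi_Class), which Python's standard `unicodedata` module exposes directly:

```python
import unicodedata

# Bidi_Class values for a few characters: Latin A is strongly left-to-right
# (L), Hebrew alef is strongly right-to-left (R), an Arabic-Indic digit is
# AN, and a European digit is EN -- the "digits run the other way" case.
for ch in ("A", "\u05D0", "\u0661", "5"):
    print("U+%04X" % ord(ch), unicodedata.bidirectional(ch))
```

These class values are the input to the Bidirectional Algorithm; the algorithm itself (UAX #9) is considerably more involved than a per-character lookup.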
Vertical. East Asian scripts are frequently written in vertical lines in which characters are arranged from top to bottom. Lines are arranged from right to left, as shown in the third example in Figure 2-16. Such scripts may also be written horizontally, from left to right. Most East Asian characters have the same shape and orientation when displayed horizontally or vertically, but many punctuation characters change their shape when displayed vertically. In a vertical context, letters and words from other scripts are generally rotated through 90-degree angles so that they, too, read from top to bottom. Unicode Technical Report #50, “Unicode Vertical Text Layout,” defines a character property which is useful in determining the correct orientation of characters when laid out vertically in text.
In contrast to the bidirectional case, the choice to lay out text either vertically or horizontally is treated as a formatting style. Therefore, the Unicode Standard does not provide directionality controls to specify that choice.

Mongolian is usually written from top to bottom, with lines arranged from left to right, as shown in the fourth example. When Mongolian is written horizontally, the characters are rotated.
Boustrophedon. Early Greek used a system called boustrophedon (literally, “ox-turning”). In boustrophedon writing, characters are arranged into horizontal lines, but the individual lines alternate between right to left and left to right, the way an ox goes back and forth when plowing a field, as shown in the fifth example. The letter images are mirrored in accordance with the direction of each individual line.
Other Historical Directionalities. Other script directionalities are found in historical writing systems. For example, some ancient Numidian texts are written from bottom to top, and Egyptian hieroglyphics can be written with varying directions for individual lines.

The historical directionalities are of interest almost exclusively to scholars intent on reproducing the exact visual content of ancient texts. The Unicode Standard does not provide direct support for them. Fixed texts can, however, be written in boustrophedon or in other directional conventions by using hard line breaks and directionality overrides or the equivalent markup.
-
2-11 Combining Characters
Combining Characters. Characters intended to be positioned relative to an associated base character are depicted in the character code charts above, below, or through a dotted circle. When rendered, the glyphs that depict these characters are intended to be positioned relative to the glyph depicting the preceding base character in some combination. The Unicode Standard distinguishes two types of combining characters: spacing and nonspacing. Nonspacing combining characters do not occupy a spacing position by themselves. Nevertheless, the combination of a base character and a nonspacing character may have a different advance width than the base character by itself. For example, an “î” may be slightly wider than a plain “i”. The spacing or nonspacing properties of a combining character are defined in the Unicode Character Database.
All combining characters can be applied to any base character and can, in principle, be used with any script. As with other characters, the allocation of a combining character to one block or another identifies only its primary usage; it is not intended to define or limit the range of characters to which it may be applied. In the Unicode Standard, all sequences of character codes are permitted.
This does not create an obligation on implementations to support all possible combinations equally well. Thus, while application of an Arabic annotation mark to a Han character or a Devanagari consonant is permitted, it is unlikely to be supported well in rendering or to make much sense.
Diacritics. Diacritics are the principal class of nonspacing combining characters used with the Latin, Greek, and Cyrillic scripts and their relatives. In the Unicode Standard, the term “diacritic” is defined very broadly to include accents as well as other nonspacing marks.
Symbol Diacritics. Some diacritical marks are applied primarily to symbols. These combining marks are allocated in the Combining Diacritical Marks for Symbols block, to distinguish them from diacritical marks applied primarily to letters.
Enclosing Combining Marks. Figure 2-17 shows examples of combining enclosing marks for symbols. The combination of an enclosing mark with a base character has the appearance of a symbol. As discussed in “Properties” later in this section, it is best to limit the use of combining enclosing marks to characters that represent symbols. This limitation minimizes the potential for surprises resulting from mismatched character properties.
A few symbol characters are intended primarily for use with enclosing combining marks. For example, U+2621 caution sign is a winding road symbol that can be used in combination with U+20E4 combining enclosing upward pointing triangle or U+20DF combining enclosing diamond. However, the enclosing combining marks can also be used in combination with arbitrary symbols, as illustrated by applying U+20E0 combining enclosing circle backslash to U+2615 hot beverage to create a “no drinks allowed” symbol. Furthermore, no formal restriction prevents enclosing combining marks from being used with non-symbols, as illustrated by applying U+20DD combining enclosing circle to U+062D arabic letter hah to represent a circled hah.
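The “no drinks allowed” sequence from the example above can be built and inspected with Python's standard `unicodedata` module; the enclosing mark carries General_Category Me, a point the “Properties” discussion below returns to:

```python
import unicodedata

# A base symbol followed by an enclosing combining mark (General_Category Me).
no_drinks = "\u2615\u20E0"   # HOT BEVERAGE + COMBINING ENCLOSING CIRCLE BACKSLASH

print(unicodedata.name(no_drinks[0]))       # HOT BEVERAGE
print(unicodedata.category(no_drinks[0]))   # So (symbol, other)
print(unicodedata.category(no_drinks[1]))   # Me (enclosing mark)
```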
Script-Specific Combining Characters. Some scripts, such as Hebrew, Arabic, and the scripts of India and Southeast Asia, have both spacing and nonspacing combining characters specific to those scripts. Many of these combining characters encode vowel letters. As such, they are not generally referred to as diacritics, but may have script-specific terminology such as harakat (Arabic) or matra (Devanagari). See Section 7.9, Combining Marks.
Sequence of Base Characters and Diacritics
In the Unicode Standard, all combining characters are to be used in sequence following the base characters to which they apply. The sequence of Unicode characters <“a”, combining diaeresis, “u”> unambiguously represents “äu” and not “aü”, as shown in Figure 2-18.
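This ordering rule can be checked directly with Python's standard `unicodedata` module, composing the sequence with NFC to see which base letter the diaeresis attaches to:

```python
import unicodedata

# <a, combining diaeresis, u> represents "äu", not "aü": each combining mark
# applies to the base character that precedes it.
print(unicodedata.normalize("NFC", "a\u0308u") == "\u00E4u")   # True: ä + u
print(unicodedata.normalize("NFC", "au\u0308") == "a\u00FC")   # True: a + ü
```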
Ordering. The ordering convention used by the Unicode Standard—placing combining marks after the base character to which they apply—is consistent with the logical order of combining characters in Semitic and Indic scripts, the great majority of which (logically or phonetically) follow the base characters with which they are associated. This convention also conforms to the way modern font technology handles the rendering of nonspacing graphical forms (glyphs), so that mapping from character memory representation order to font rendering order is simplified. It is different from the convention used in the bibliographic standard ISO 5426.
Indic Vowel Signs. Some Indic vowel signs are rendered to the left of a consonant letter or consonant cluster, even though their logical order in the Unicode encoding follows the consonant letter. In the charts, these vowels are depicted to the left of dotted circles (see Figure 2-19). The coding of these vowels in pronunciation order and not in visual order is consistent with the ISCII standard.
Properties. A sequence of a base character plus one or more combining characters generally has the same properties as the base character. For example, “A” followed by “ˆ” has the same properties as “Â”. For this reason, most Unicode algorithms ensure that such sequences behave the same way as the corresponding base character. However, when the combining character is an enclosing combining mark—in other words, when its General_Category value is Me—the resulting sequence has the appearance of a symbol. In Figure 2-20, enclosing the exclamation mark with U+20E4 combining enclosing upward pointing triangle produces a sequence that looks like U+26A0 warning sign.
Because the properties of U+0021 exclamation mark are those of a punctuation character, they are different from those of U+26A0 warning sign. For example, the two will behave differently for line breaking. To avoid unexpected results, it is best to limit the use of combining enclosing marks to characters that encode symbols. For that reason, the warning sign is separately encoded as a miscellaneous symbol in the Unicode Standard and does not have a decomposition.
Multiple Combining Characters
In some instances, more than one diacritical mark is applied to a single base character (see Figure 2-21). The Unicode Standard does not restrict the number of combining characters that may follow a base character. The following discussion summarizes the default treatment of multiple combining characters. (For further discussion, see Section 3.6, Combination.)
If the combining characters can interact typographically—for example, U+0304 combining macron and U+0308 combining diaeresis—then the order of graphic display is determined by the order of coded characters (see Table 2-5). By default, the diacritics or other combining characters are positioned from the base character’s glyph outward. Combining characters placed above a base character will be stacked vertically, starting with the first encountered in the logical store and continuing for as many marks above as are required by the character codes following the base character. For combining characters placed below a base character, the situation is reversed, with the combining characters starting from the base character and stacking downward.

When combining characters do not interact typographically, the relative ordering of contiguous combining marks cannot result in any visual distinction and thus is insignificant.
Another example of multiple combining characters above the base character can be found in Thai, where a consonant letter can have above it one of the vowels U+0E34 through U+0E37 and, above that, one of four tone marks U+0E48 through U+0E4B. The order of character codes that produces this graphic display is base consonant character + vowel character + tone mark character, as shown in Figure 2-21.
Many combining characters have specific typographical traditions that provide detailed rules for the expected rendering. These rules override the default stacking behavior. For example, certain combinations of combining marks are sometimes positioned horizontally rather than stacking or by ligature with an adjacent nonspacing mark (see Table 2-6). When positioned horizontally, the order of codes is reflected by positioning in the predominant direction of the script with which the codes are used. For example, in a left-to-right script, horizontal accents would be coded from left to right. In Table 2-6, the top example is correct and the bottom example is incorrect.

Such override behavior is associated with specific scripts or alphabets. For example, when used with the Greek script, the “breathing marks” U+0313 combining comma above (psili) and U+0314 combining reversed comma above (dasia) require that, when used together with a following acute or grave accent, they be rendered side-by-side rather than the accent marks being stacked above the breathing marks. The order of codes here is base character code + breathing mark code + accent mark code. This example demonstrates the script-dependent or writing-system-dependent nature of rendering combining diacritical marks.
For additional examples of script-specific departure from default stacking of sequences of combining marks, see the discussion about the positioning of multiple points and marks in Section 9.1, Hebrew, the discussion of nondefault placement of Arabic vowel marks accompanying Figure 9-5 in Section 9.2, Arabic, or the discussion of horizontal combination of titlo letters in Old Church Slavonic found in the subsection “Cyrillic Extended-A: U+2DE0–U+2DFF” in Section 7.4, Cyrillic.

For other types of nondefault stacking behavior, see the discussion about the positioning of combining parentheses in the subsection “Combining Diacritical Marks Extended: U+1AB0–U+1AFF” in Section 7.9, Combining Marks.

The Unicode Standard specifies default stacking behavior to offer guidance about which character codes are to be used in which order to represent the text, so that texts containing multiple combining marks can be interchanged reliably. The Unicode Standard does not aim to regulate or restrict typographical tradition.
Ligated Multiple Base Characters
When the glyphs representing two base characters merge to form a ligature, the combining characters must be rendered correctly in relation to the ligated glyph (see Figure 2-22). Internally, the software must distinguish between the nonspacing marks that apply to positions relative to the first part of the ligature glyph and those that apply to the second part. (For a discussion of general methods of positioning nonspacing marks, see Section 5.12, Strategies for Handling Nonspacing Marks.)
For more information, see “Application of Combining Marks” in Section 3.6, Combination.
Ligated base characters with multiple combining marks do not commonly occur in most scripts. However, in some scripts, such as Arabic, this situation occurs quite often when vowel marks are used. It arises because of the large number of ligatures in Arabic, where each element of a ligature is a consonant, which in turn can have a vowel mark attached to it. Ligatures can even occur with three or more characters merging; vowel marks may be attached to each part.
Exhibiting Nonspacing Marks in Isolation
Nonspacing combining marks used by the Unicode Standard may be exhibited in apparent isolation by applying them to U+00A0 no-break space. This convention might be employed, for example, when talking about the combining mark itself as a mark, rather than using it in its normal way in text (that is, applied as an accent to a base letter or in other combinations).

Prior to Version 4.1 of the Unicode Standard, the standard recommended the use of U+0020 space for display of isolated combining marks. This practice is no longer recommended because of potential conflicts with the handling of sequences of U+0020 space characters in such contexts as XML. For additional ways of displaying some diacritical marks, see “Spacing Clones of Diacritical Marks” in Section 7.9, Combining Marks.
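In string terms the convention is simply a two-code-point sequence:

```python
# Exhibiting a combining mark in apparent isolation by applying it to
# U+00A0 NO-BREAK SPACE, per the convention described above.
isolated_acute = "\u00A0\u0301"   # NO-BREAK SPACE + COMBINING ACUTE ACCENT
print(len(isolated_acute))        # 2: still a two-code-point sequence
```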
“Characters” and Grapheme Clusters
End users have various concepts about what constitutes a letter or “character” in the writing system for their language or languages. The precise scope of these end-user “characters” depends on the particular written language and the orthography it uses. In addition to the many instances of accented letters, they may extend to digraphs such as Slovak “ch”, trigraphs or longer combinations, and sequences using spacing letter modifiers, such as “kʷ”. Such concepts are often important for processes such as collation, for the definition of characters in regular expressions, and for counting “character” positions within text. In instances such as these, what the user thinks of as a character may affect how the collation or regular expression will be defined or how the “characters” will be counted. Words and other higher-level text elements generally do not split within elements that a user thinks of as a character, even when the Unicode representation of them may consist of a sequence of encoded characters.
The variety of these end-user-perceived characters is quite great—particularly for digraphs, ligatures, or syllabic units. Furthermore, it depends on the particular language and writing system that may be involved. Despite this variety, however, the core concept “characters that should be kept together” can be defined for the Unicode Standard in a language-independent way. This core concept is known as a grapheme cluster, and it consists of any combining character sequence that contains only nonspacing combining marks or any sequence of characters that constitutes a Hangul syllable (possibly followed by one or more nonspacing marks). An implementation operating on such a cluster would almost never want to break between its elements for rendering, editing, or other such text processes; the grapheme cluster is treated as a single unit. Unicode Standard Annex #29, “Unicode Text Segmentation,” provides a complete formal definition of a grapheme cluster and discusses its application in the context of editing and other text processes. Implementations also may tailor the definition of a grapheme cluster, so that under limited circumstances, particular to one written language or another, the grapheme cluster may more closely pertain to what end users think of as “characters” for that language.
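A deliberately minimal sketch of the simplest case just described—a base character keeping its trailing nonspacing marks—might look as follows. This is not UAX #29 segmentation: the full rules (Hangul syllables, emoji, tailoring) are omitted, and detecting marks via a nonzero combining class is itself an approximation:

```python
import unicodedata

def simple_clusters(text):
    """Group each base character with its trailing nonspacing marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch        # attach the mark to its base
        else:
            clusters.append(ch)
    return clusters

print(len(simple_clusters("a\u0308bc\u0301")))   # 3 clusters: ä, b, ć
```

Counting clusters rather than code points is what makes "character position" arithmetic match end-user expectations.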
-
2-12 Equivalent Sequences
In cases involving two or more sequences considered to be equivalent, the Unicode Standard does not prescribe one particular sequence as being the correct one; instead, each sequence is merely equivalent to the others. Figure 2-23 illustrates the two major forms of equivalent sequences formally defined by the Unicode Standard. In the first example, the sequences are canonically equivalent. Both sequences should display and be interpreted the same way. The second and third examples illustrate different compatibility sequences. Compatible-equivalent sequences may have format differences in display and may be interpreted differently in some contexts.
If an application or user attempts to distinguish between canonically equivalent sequences, as shown in the first example in Figure 2-23, there is no guarantee that other applications would recognize the same distinctions. To prevent the introduction of interoperability problems between applications, such distinctions must be avoided wherever possible. Making distinctions between compatibly equivalent sequences is less problematical. However, in restricted contexts, such as the use of identifiers, avoiding compatibly equivalent sequences reduces possible security issues. See Unicode Technical Report #36, “Unicode Security Considerations.”
Normalization
Where a unique representation is required, a normalized form of Unicode text can be used to eliminate unwanted distinctions. The Unicode Standard defines four normalization forms: Normalization Form D (NFD), Normalization Form KD (NFKD), Normalization Form C (NFC), and Normalization Form KC (NFKC). Roughly speaking, NFD and NFKD decompose characters where possible, while NFC and NFKC compose characters where possible. For more information, see Unicode Standard Annex #15, “Unicode Normalization Forms,” and Section 3.11, Normalization Forms.
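The four forms can be compared directly with Python's standard `unicodedata.normalize`. U+FB03 LATIN SMALL LIGATURE FFI has only a compatibility decomposition, so the canonical forms leave it alone while the K forms replace it with plain letters:

```python
import unicodedata

# The ligature survives NFD/NFC but becomes "ffi" under NFKD/NFKC.
for form in ("NFD", "NFC", "NFKD", "NFKC"):
    print(form, unicodedata.normalize(form, "\uFB03"))

# A canonical decomposable moves between its two representations:
assert unicodedata.normalize("NFD", "\u00E9") == "e\u0301"   # é -> e + acute
assert unicodedata.normalize("NFC", "e\u0301") == "\u00E9"   # e + acute -> é
```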
A key part of normalization is to provide a unique canonical order for visually nondistinct sequences of combining characters. Figure 2-24 shows the effect of canonical ordering for multiple combining marks applied to the same base character.
In the first row of Figure 2-24, the two sequences are visually nondistinct and, therefore, equivalent. The sequence on the right has been put into canonical order by reordering in ascending order of the Canonical_Combining_Class (ccc) values. The ccc values are shown below each character. The second row of Figure 2-24 shows an example where combining marks interact typographically—the two sequences have different stacking order, and the order of combining marks is significant. Because the two combining marks have been given the same combining class, their ordering is retained under canonical reordering. Thus the two sequences in the second row are not equivalent.
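The non-interacting case can be reproduced with the standard `unicodedata` module: a dot below (ccc 220) and a dot above (ccc 230) do not interact typographically, so normalization sorts them into ascending ccc order regardless of how they were typed:

```python
import unicodedata

seq = "a\u0323\u0307"       # a + COMBINING DOT BELOW + COMBINING DOT ABOVE
print([unicodedata.combining(c) for c in seq])   # [0, 220, 230]

# The visually nondistinct opposite order is canonically equivalent:
swapped = "a\u0307\u0323"
assert unicodedata.normalize("NFD", seq) == unicodedata.normalize("NFD", swapped)
```

Two marks with the same ccc value, by contrast, keep their relative order under normalization, which is exactly the second row of Figure 2-24.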
Decompositions
Precomposed characters are formally known as decomposables, because they have decompositions to one or more other characters. There are two types of decompositions:
• Canonical. The character and its decomposition should be treated as essentially equivalent.
• Compatibility. The decomposition may remove some information (typically formatting information) that is important to preserve in particular contexts.
Types of Decomposables. Conceptually, a decomposition implies reducing a character to an equivalent sequence of constituent parts, such as mapping an accented character to a base character followed by a combining accent. The vast majority of nontrivial decompositions are indeed a mapping from a character code to a character sequence. However, in a small number of exceptional cases, there is a mapping from one character to another character, such as the mapping from ohm to capital omega. Finally, there are the “trivial” decompositions, which are simply a mapping of a character to itself. They are really an indication that a character cannot be decomposed, but are defined so that all characters formally have a decomposition. The definition of decomposable is written to encompass only the nontrivial types of decompositions; therefore these characters are considered nondecomposable.
In summary, three types of characters are distinguished based on their decomposition behavior:
• Canonical decomposable. A character that is not identical to its canonical decomposition.
• Compatibility decomposable. A character whose compatibility decomposition is not identical to its canonical decomposition.
• Nondecomposable. A character that is identical to both its canonical decomposition and its compatibility decomposition. In other words, the character has trivial decompositions (decompositions to itself). Loosely speaking, these characters are said to have “no decomposition,” even though, for completeness, the algorithm that defines decomposition maps such characters to themselves.
Because of the way decompositions are defined, a character cannot have a nontrivial canonical decomposition while having a trivial compatibility decomposition. Characters with a trivial compatibility decomposition are therefore always nondecomposables.

Examples. Figure 2-25 illustrates these three types. Compatibility decompositions that are redundant because they are identical to the canonical decompositions are not shown.
The figure illustrates two important points:
• Decompositions may be to single characters or to sequences of characters. Decompositions to a single character, also known as singleton decompositions, are seen for the ohm sign and the halfwidth katakana ka in Figure 2-25. Because of examples like these, decomposable characters in Unicode do not always consist of obvious, separate parts; one can know their status only by examining the data tables for the standard.
• A very small number of characters are both canonical and compatibility decomposable. The example shown in Figure 2-25 is for the Greek hooked upsilon symbol with an acute accent. It has a canonical decomposition to one sequence and a compatibility decomposition to a different sequence.
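The three categories can be distinguished programmatically by comparing a character against its normalized forms; the sketch below uses Python's unicodedata module, and the category names follow the definitions above:

```python
import unicodedata

def decomposable_type(ch: str) -> str:
    nfd = unicodedata.normalize("NFD", ch)    # full canonical decomposition
    nfkd = unicodedata.normalize("NFKD", ch)  # full compatibility decomposition
    if nfkd != nfd:
        return "compatibility decomposable"
    if nfd != ch:
        return "canonical decomposable"
    return "nondecomposable"

assert decomposable_type("\u00c1") == "canonical decomposable"      # Á
assert decomposable_type("\u2126") == "canonical decomposable"      # ohm sign (singleton)
assert decomposable_type("\ufb01") == "compatibility decomposable"  # ﬁ ligature
assert decomposable_type("a") == "nondecomposable"
```

Note that a character that is both canonical and compatibility decomposable, such as the Greek hooked upsilon with acute, reports as "compatibility decomposable" here, since its compatibility decomposition differs from its canonical one.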
For more precise definitions of these terms, see Chapter 3, Conformance.
Non-decomposition of Certain Diacritics
Most characters that one thinks of as being a letter “plus accent” have formal decompositions in the Unicode Standard. For example, see the canonical decomposable U+00C1 latin capital letter a with acute shown in Figure 2-25. There are, however, exceptions involving certain types of diacritics and other marks.
Overlaid and Attached Diacritics. Based on the pattern for accented letters, implementers often also expect to encounter formal decompositions for characters which use various overlaid diacritics such as slashes and bars to form new Latin (or Cyrillic) letters. For example, one might expect a decomposition for U+00D8 latin capital letter o with stroke involving U+0338 combining long solidus overlay. However, such decompositions involving overlaid diacritics are not formally defined in the Unicode Standard.
For historical and implementation reasons, there are no decompositions for characters with overlaid diacritics such as slashes and bars, nor for most diacritic hooks, swashes, tails, and other similar modifications to the graphic form of a base character. In such cases, the generic identification of the overlaid element is not specific enough to identify which part of the base glyph is to be overlaid. The characters involved include prototypical overlaid diacritic letters such as U+0268 latin small letter i with stroke, but also characters with hooks and descenders, such as U+0188 latin small letter c with hook, U+049B cyrillic small letter ka with descender, and U+0499 cyrillic small letter ze with descender.
There are three exceptional attached diacritics which are regularly decomposed, namely U+0327 combining cedilla, U+0328 combining ogonek, and U+031B combining horn (which is used in Vietnamese letters).
Other Diacritics. There are other characters for which the name and glyph appear to imply the presence of a decomposable diacritic, but which have no decomposition defined in the Unicode Standard. A prominent example is the Pashto letter U+0681 arabic letter hah with hamza above. In these cases, as for the overlaid diacritics, the composed character and the sequence of base letter plus combining diacritic are not equivalent, although their renderings would be very similar. See the text on “Combining Hamza Above” in Section 9.2, Arabic for further complications related to this and similar characters.
Character Names and Decomposition. One cannot determine the decomposition status of a Latin letter from its Unicode name, despite the existence of phrases such as “...with acute” or “...with stroke”. The normative decomposition mappings listed in the Unicode Character Database are the only formal definition of decomposition status.
Simulated Decomposition in Processing. Because the Unicode characters with overlaid diacritics or similar modifications to their base form shapes have no formal decompositions, some kinds of text processing that would ordinarily use Normalization Form D (NFD) internally to separate base letters from accents may end up simulating decompositions instead. Effectively, this processing treats overlaid diacritics as if they were represented by a separately encoded combining mark. For example, a common operation in searching or matching is to sort (or match) while ignoring accents and diacritics on letters. This is easy to do with characters that formally decompose; the text is decomposed, and then the combining marks for the accents are ignored. However, for letters with overlaid diacritics, the effect of ignoring the diacritic has to be simulated instead with data tables that go beyond simple use of Unicode decomposition mappings.
Security Issue. The lack of formal decompositions for characters with overlaid diacritics means that there are increased opportunities for spoofing involving such characters. The display of a base letter plus a combining overlaid mark such as U+0335 combining short stroke overlay may look the same as the encoded base letter with bar diacritic, but the two sequences are not canonically equivalent and would not be folded together by Unicode normalization.
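The non-equivalence is easy to confirm. In the Python sketch below, the atomic U+00D8 and the sequence of O plus U+0338 remain distinct under every normalization form, even though a renderer may draw them identically:

```python
import unicodedata

atomic = "\u00d8"     # Ø, LATIN CAPITAL LETTER O WITH STROKE
sequence = "O\u0338"  # O followed by COMBINING LONG SOLIDUS OVERLAY

# The two may look alike when rendered, but no normalization form unifies them,
# because U+00D8 has no decomposition involving the overlay mark.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, atomic) != unicodedata.normalize(form, sequence)
```

This is precisely the spoofing concern described above: a security check that relies on normalization alone will not fold the two representations together.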
Implementations of writing systems which make use of letters with overlaid diacritics typically do not mix atomic representation (use of a precomposed letter with overlaid diacritic) with sequential representation (use of a sequence of base letter plus combining mark for the overlaid diacritic). Mixing these conventions is avoided precisely because the atomic representation and the sequential representation are not canonically equivalent. In most cases the atomic representation is the preferred choice, because of its convenience and more reliable display.
Security protocols for identifiers may disallow either the sequential representation or the atomic representation of a letter with an overlaid diacritic to try to minimize spoofing opportunities. However, when this is done, it is incumbent on the protocol designers first to verify whether the atomic or the sequential representation is in actual use. Disallowing the preferred convention, while instead forcing use of the unpreferred one for a particular writing system, can have the unintended consequence of increasing confusion about use, and may thereby reduce the usability of the protocol for its intended purpose.

For more information and data for handling these confusable sequences involving overlaid diacritics, see Unicode Technical Report #36, “Unicode Security Considerations.”
-
2-13 Special Characters
The Unicode Standard includes a small number of important characters with special behavior; some of them are introduced in this section. It is important that implementations treat these characters properly. For a list of these and similar characters, see Section 4.12, Characters with Unusual Properties; for more information about such characters, see Section 23.1, Control Codes; Section 23.2, Layout Controls; Section 23.7, Noncharacters; and Section 23.8, Specials.
Special Noncharacter Code Points
The Unicode Standard contains a number of code points that are intentionally not used to represent assigned characters. These code points are known as noncharacters. They are permanently reserved for internal use and are not used for open interchange of Unicode text. For more information on noncharacters, see Section 23.7, Noncharacters.
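The set of noncharacters is fixed: U+FDD0..U+FDEF, plus the final two code points of each of the 17 planes. A membership test is therefore a simple sketch:

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 permanently reserved noncharacter code points."""
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

assert is_noncharacter(0xFFFE) and is_noncharacter(0xFFFF)
assert is_noncharacter(0x10FFFF)    # last code point of plane 16
assert is_noncharacter(0xFDD0)
assert not is_noncharacter(0x0041)  # 'A'
```

The bit test `(cp & 0xFFFE) == 0xFFFE` matches any code point whose low 16 bits are FFFE or FFFF, which covers the plane-final pairs in all planes at once.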
Byte Order Mark (BOM)
The UTF-16 and UTF-32 encoding forms of Unicode plain text are sensitive to the byte ordering that is used when serializing text into a sequence of bytes, such as when writing data to a file or transferring data across a network. Some processors place the least significant byte in the initial position; others place the most significant byte in the initial position. Ideally, all implementations of the Unicode Standard would follow only one set of byte order rules, but this scheme would force one class of processors to swap the byte order on reading and writing plain text files, even when the file never leaves the system on which it was created.
To have an efficient way to indicate which byte order is used in a text, the Unicode Standard contains two code points, U+FEFF zero width no-break space (byte order mark) and U+FFFE (a noncharacter), which are the byte-ordered mirror images of each other. When a BOM is received with the opposite byte order, it will be recognized as a noncharacter and can therefore be used to detect the intended byte order of the text. The BOM is not a control character that selects the byte order of the text; rather, its function is to allow recipients to determine which byte ordering is used in a file.
Unicode Signature. An initial BOM may also serve as an implicit marker to identify a file as containing Unicode text. For UTF-16, the sequence FE₁₆ FF₁₆ (or its byte-reversed counterpart, FF₁₆ FE₁₆) is exceedingly rare at the outset of text files that use other character encodings. The corresponding UTF-8 BOM sequence, EF₁₆ BB₁₆ BF₁₆, is also exceedingly rare. In either case, it is therefore unlikely to be confused with real text data. The same is true for both single-byte and multibyte encodings.
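A signature sniffer built on these byte sequences can be sketched as follows in Python; the encoding labels and the choice to test the longer UTF-32 signatures first are implementation decisions, not part of the standard:

```python
def sniff_bom(data: bytes):
    """Return (encoding, bom_length) guessed from an initial BOM, or (None, 0)."""
    # Check 4-byte signatures first: the UTF-32 LE BOM FF FE 00 00
    # begins with the UTF-16 LE BOM FF FE.
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be", 4
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le", 4
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8", 3
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be", 2
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le", 2
    return None, 0

assert sniff_bom(b"\xef\xbb\xbfhello") == ("utf-8", 3)
assert sniff_bom(b"\xfe\xff\x00h\x00i") == ("utf-16-be", 2)
assert sniff_bom(b"plain ascii") == (None, 0)
```

The signature is only a heuristic: for example, the bytes FF FE 00 00 could equally be UTF-16 LE text whose first character is U+0000, which is why the BOM indicates rather than guarantees an encoding.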
Data streams (or files) that begin with the U+FEFF byte order mark are likely to contain Unicode characters. It is recommended that applications sending or receiving untyped data streams of coded characters use this signature. If other signaling methods are used, signatures should not be employed.
Conformance to the Unicode Standard does not require the use of the BOM as such a signature. See Section 23.8, Specials, for more information on the byte order mark and its use as an encoding signature.
Layout and Format Control Characters
The Unicode Standard defines several characters that are used to control joining behavior, bidirectional ordering control, and alternative formats for display. Their specific use in layout and formatting is described in Section 23.2, Layout Controls.
The Replacement Character
U+FFFD replacement character is the general substitute character in the Unicode Standard. It can be substituted for any “unknown” character in another encoding that cannot be mapped in terms of known Unicode characters (see Section 5.3, Unknown and Missing Characters, and Section 23.8, Specials).
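Many conversion APIs expose exactly this substitution behavior. For example, Python's bytes.decode substitutes U+FFFD for unmappable input when the "replace" error handler is selected:

```python
# A lone continuation byte is ill-formed UTF-8; with errors="replace",
# the decoder substitutes U+FFFD REPLACEMENT CHARACTER for it.
result = b"\x80abc".decode("utf-8", errors="replace")
assert result == "\ufffdabc"
```

The same substitution is what renderers conventionally show as the familiar � glyph when they encounter bytes they cannot interpret.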
Control Codes
In addition to the special characters defined in the Unicode Standard for a number of purposes, the standard incorporates the legacy control codes for compatibility with the ISO/IEC 2022 framework, ASCII, and the various protocols that make use of control codes. Rather than simply being defined as byte values, however, the legacy control codes are assigned to Unicode code points: U+0000..U+001F, U+007F..U+009F. Those code points for control codes must be represented consistently with the various Unicode encoding forms when they are used with other Unicode characters. For more information on control codes, see Section 23.1, Control Codes.
-
2-14 Conforming to the Unicode Standard
Conformance requirements are a set of unambiguous criteria to which a conformant implementation of a standard must adhere, so that it can interoperate with other conformant implementations. The universal scope of the Unicode Standard complicates the task of rigorously defining such conformance requirements for all aspects of the standard. Making conformance requirements overly confining runs the risk of unnecessarily restricting the breadth of text operations that can be implemented with the Unicode Standard or of limiting them to a one-size-fits-all lowest common denominator. In many cases, therefore, the conformance requirements deliberately cover only minimal requirements, falling far short of providing a complete description of the behavior of an implementation. Nevertheless, there are many core aspects of the standard for which a precise and exhaustive definition of conformant behavior is possible.
This section gives examples of both conformant and nonconformant implementation behavior, illustrating key aspects of the formal statement of conformance requirements found in Chapter 3, Conformance.
Characteristics of Conformant Implementations
An implementation that conforms to the Unicode Standard has the following characteristics:

It treats characters according to the specified Unicode encoding form.
• The byte sequence <20 20> is interpreted as U+2020 ‘†’ dagger in the UTF-16 encoding form.
• The same byte sequence <20 20> is interpreted as the sequence of two spaces, <U+0020, U+0020>, in the UTF-8 encoding form.
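This dependence on the encoding form is easy to demonstrate in Python:

```python
data = b"\x20\x20"

# The same two bytes decode to different character sequences
# depending on which encoding form is applied.
assert data.decode("utf-16-be") == "\u2020"  # one character: U+2020 DAGGER
assert data.decode("utf-8") == "  "          # two characters: U+0020 U+0020
```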
It interprets characters according to the identities, properties, and rules defined for them in this standard.
• U+2423 is ‘␣’ open box, not ‘ぃ’ hiragana small i (which is the meaning of the bytes 2423₁₆ in JIS).
• U+00F4 ‘ô’ is equivalent to U+006F ‘o’ followed by U+0302 combining circumflex accent, but not equivalent to U+0302 followed by U+006F.
• U+05D0 hebrew letter alef followed by U+05D1 hebrew letter bet is displayed in right-to-left order, with the alef to the right of the bet, not to its left.
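The equivalence point about U+00F4 can be checked directly with normalization; a Python sketch:

```python
import unicodedata

# 'o' followed by U+0302 COMBINING CIRCUMFLEX ACCENT composes to ô ...
assert unicodedata.normalize("NFC", "o\u0302") == "\u00f4"

# ... but the reversed order is a defective sequence and does not compose.
assert unicodedata.normalize("NFC", "\u0302o") != "\u00f4"
```

The order sensitivity follows from the definition of canonical equivalence: a combining mark applies to the base character that precedes it, so a mark with no preceding base cannot participate in composition.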
When an implementation supports the display of Arabic, Hebrew, or other right-to-left characters and displays those characters, they must be ordered according to the Bidirectional Algorithm described in Unicode Standard Annex #9, “Unicode Bidirectional Algorithm.”
When an implementation supports Arabic, Devanagari, or other scripts with complex shaping for their characters and displays those characters, at a minimum the characters are shaped according to the relevant block descriptions. (More sophisticated shaping can be used if available.)
Unacceptable Behavior
It is unacceptable for a conforming implementation:

To use unassigned codes.
• U+2073 is unassigned and not usable for ‘³’ (superscript 3) or any other character.
To corrupt unsupported characters.
• U+03A1 ‘Ρ’ greek capital letter rho should not be changed to U+00A1 (first byte dropped), U+0050 (mapped to Latin letter P), U+A103 (bytes reversed), or anything other than U+03A1.
To remove or alter uninterpreted code points in text that purports to be unmodified.
• U+2029 is paragraph separator and should not be dropped by applications that do not support it.
Acceptable Behavior
It is acceptable for a conforming implementation:

To support only a subset of the Unicode characters.
• An application might not provide mathematical symbols or the Thai script, for example.
To transform data knowingly.
• Uppercase conversion: ‘a’ transformed to ‘A’
• Romaji to kana: ‘kyo’ transformed to きょ
• Decomposition: U+247D ‘⑽’ decomposed to the sequence <U+0028, U+0031, U+0030, U+0029> “(10)”
To build higher-level protocols on the character set.
• Examples are defining a file format for compression of characters or for use with rich text.
To define private-use characters.
• Examples of characters that might be defined for private use include additional ideographic characters (gaiji) or existing corporate logo characters.
To not support the Bidirectional Algorithm or character shaping in implementations that do not support complex scripts, such as Arabic and Devanagari.
To not support the Bidirectional Algorithm or character shaping in implementations that do not display characters, as, for example, on servers or in programs that simply parse or transcode text, such as an XML parser.
Code conversion between other character encodings and the Unicode Standard will be considered conformant if the conversion is accurate in both directions.
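Both of these points, knowing transformation and accurate round-trip conversion, can be illustrated in Python; the round_trips helper is a hypothetical name introduced here for illustration:

```python
import unicodedata

# Knowing transformations are acceptable: case conversion and
# compatibility decomposition both change the text deliberately.
assert "a".upper() == "A"
assert unicodedata.normalize("NFKD", "\u247d") == "(10)"  # ⑽ PARENTHESIZED NUMBER TEN

def round_trips(data: bytes, encoding: str) -> bool:
    """Hypothetical helper: check that a conversion is accurate in both directions."""
    return data.decode(encoding).encode(encoding) == data

# Latin-1 maps every byte value to a code point and back without loss,
# so conversion through Unicode can be conformant for it.
assert round_trips(bytes(range(256)), "latin-1")
```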
Supported Subsets
The Unicode Standard does not require that an application be capable of interpreting and rendering all Unicode characters so as to be conformant. Many systems will have fonts for only some scripts, but not for others; sorting and other text-processing rules may be implemented for only a limited set of languages. As a result, an implementation is able to interpret a subset of characters.

The Unicode Standard provides no formalized method for identifying an implemented subset. Furthermore, such a subset is typically different for different aspects of an implementation. For example, an application may be able to read, write, and store any Unicode character and to sort one subset according to the rules of one or more languages (and the rest arbitrarily), but have access to fonts for only a single script. The same implementation may be able to render additional scripts as soon as additional fonts are installed in its environment. Therefore, the subset of interpretable characters is typically not a static concept.