Article
Supplementary Characters in the Java Platform







 

By Norbert Lindenberg and Masayoshi Okutsu, Sun Microsystems, Inc.

May 2004


Abstract

This article describes how supplementary characters are supported in the Java platform. Supplementary characters are characters in the Unicode standard whose code points are above U+FFFF and that therefore cannot be described as single 16-bit entities such as the char data type in the Java programming language. Such characters are generally rare, but some are used, for example, as part of Chinese and Japanese personal names, so support for them is commonly required for government applications in East Asian countries.

The Java platform is being enhanced to enable processing of supplementary characters with minimal impact on existing applications. New low-level APIs enable operations on individual characters where necessary. Most text-processing APIs, however, use character sequences, such as the String class or character arrays. These are now interpreted as UTF-16 sequences, and the implementations of these APIs have been changed to handle supplementary characters correctly. The enhancements are part of version 5.0 of the Java 2 Platform, Standard Edition (J2SE).

Besides explaining these enhancements in detail, this article also provides guidelines that help application developers determine and implement the changes necessary to enable use of the complete Unicode character set.

Background

Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that could hold any character. However, it turned out that the 65,536 characters possible in a 16-bit encoding are not sufficient to represent all characters that are or have been in use around the world. The Unicode standard was therefore extended to allow up to 1,112,064 characters. Characters beyond the original 16-bit limit are called supplementary characters. Version 2.0 of the Unicode standard was the first to include a design that enables supplementary characters, but it was only in version 3.1 that the first supplementary characters were assigned. Since version 5.0 of J2SE is required to support version 4.0 of the Unicode standard, it has to support supplementary characters.

Support for supplementary characters is also likely to become a common business requirement in East Asian markets. Government applications will require them in order to correctly represent names that include rare Chinese characters. Publishing applications may need them to represent the full set of historical and variant characters. The Chinese government requires support for GB18030, a character encoding that encodes the entire Unicode character set and therefore includes supplementary characters if Unicode version 3.1 or later is assumed. The Taiwanese standard CNS-11643 includes numerous characters that were added to Unicode 3.1 as supplementary characters. The Hong Kong government defined a collection of characters needed for Cantonese, some of which are supplementary characters in Unicode. Finally, some vendors in Japan are planning to use the large private use area in the supplementary character space for more than 50,000 kanji character variants in order to migrate from their proprietary systems to solutions based on the Java platform.

The Java platform therefore not only has to support supplementary characters; it also has to make it easy for applications to do the same. Since supplementary characters break a fundamental assumption of the Java programming language and might have required a fundamental change in the programming model, an expert group was convened under the Java Community Process to choose the right solution. The group is called the JSR-204 expert group, after the number of the Java Specification Request for Unicode Supplementary Character Support. Technically, the decisions of the expert group apply only to the J2SE platform, but since the Java 2 Platform, Enterprise Edition (J2EE) sits on top of the J2SE platform, it benefits directly, and we expect that the configurations of the Java 2 Platform, Micro Edition (J2ME) will adopt the same design approach.

But before we can look at the solution that the JSR-204 expert group came up with, we need to learn some terminology.

Code Points, Character Encoding Schemes, UTF-16: What's All This?

The introduction of supplementary characters unfortunately makes the character model quite a bit more complicated. Where in the past we could simply talk about "characters" and, in a Unicode-based environment such as the Java platform, assume that a character has 16 bits, we now need more terminology. We'll try to keep it relatively simple; for a full discussion with all the details, you can read Chapter 2 of the Unicode Standard or Unicode Technical Report 17, "Character Encoding Model". Unicode experts may skip all but the last definition in this section.

A character is an abstract minimal unit of text. It doesn't have a fixed shape (that would be a glyph), and it doesn't have a value. "A" is a character, and so is "€", the symbol for the common currency of Germany, France, and numerous other European countries.

A character set is a collection of characters. For example, the Han characters are the characters originally invented by the Chinese, which have been used to write Chinese, Japanese, Korean, and Vietnamese.

A coded character set is a character set in which each character has been assigned a unique number. At the core of the Unicode standard is a coded character set that assigns the letter "A" the number 0041₁₆ and the symbol "€" the number 20AC₁₆. The Unicode standard always uses hexadecimal numbers and writes them with the prefix "U+", so the number for "A" is written as "U+0041".

Code points are the numbers that can be used in a coded character set. A coded character set defines a range of valid code points, but doesn't necessarily assign characters to all of them. The valid code points for Unicode are U+0000 to U+10FFFF. Unicode 4.0 assigns characters to 96,382 of these more than one million code points.

Supplementary characters are characters with code points in the range U+10000 to U+10FFFF; that is, the characters that could not be represented in the original 16-bit design of Unicode. The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Thus, each Unicode character is either in the BMP or a supplementary character.

A character encoding scheme is a mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units. The most commonly used code units are bytes, but 16-bit or 32-bit integers can also be used for internal processing. UTF-32, UTF-16, and UTF-8 are character encoding schemes for the coded character set of the Unicode standard.

UTF-32 simply represents each Unicode code point as the 32-bit integer of the same value. It is clearly the most convenient representation for internal processing, but it uses significantly more memory than necessary if used as a general string representation.

UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF) and the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: the values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means that software can tell, for each individual code unit in a string, whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, in which the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.
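
As an illustration (ours, not from the original article), the following sketch derives the UTF-16 code units of U+10400 both by hand, following the definition above, and with the J2SE 5.0 Character.toChars method; the class name SurrogateDemo is an assumption.

public class SurrogateDemo {
    public static void main(String[] args) {
        int codePoint = 0x10400; // DESERET CAPITAL LETTER LONG I

        // Manual computation, per the UTF-16 definition:
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        System.out.printf("manual:  %04X %04X%n", (int) high, (int) low);

        // The same result from the platform API:
        char[] units = Character.toChars(codePoint);
        System.out.printf("toChars: %04X %04X%n", (int) units[0], (int) units[1]);
        // Both lines print: D801 DC00
    }
}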

UTF-8 uses sequences of one to four bytes to encode Unicode code points. U+0000 to U+007F are encoded in one byte, U+0080 to U+07FF in two bytes, U+0800 to U+FFFF in three bytes, and U+10000 to U+10FFFF in four bytes. UTF-8 is designed so that the byte values 0x00 to 0x7F always represent the code points U+0000 to U+007F (the Basic Latin block, which corresponds to the ASCII character set). These byte values never occur in the representation of other code points, a characteristic that makes UTF-8 convenient for software that assigns special meanings to certain ASCII characters.
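
To see these byte lengths in practice, here is a small sketch (ours) that encodes the four characters from the comparison table that follows, using String.getBytes("UTF-8"), which produces standard UTF-8:

import java.io.UnsupportedEncodingException;

public class Utf8Lengths {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // U+0041, U+00DF, U+6771, and U+10400 (as a surrogate pair)
        String[] samples = { "\u0041", "\u00DF", "\u6771", "\uD801\uDC00" };
        for (String s : samples) {
            byte[] utf8 = s.getBytes("UTF-8"); // standard UTF-8
            StringBuilder hex = new StringBuilder();
            for (byte b : utf8) {
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(hex); // 41 / C3 9F / E6 9D B1 / F0 90 90 80
        }
    }
}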

The following table compares the representations of a few characters:

Unicode code point    U+0041    U+00DF    U+6771       U+10400
Representative glyph  A         ß         東           𐐀
UTF-32 code units     00000041  000000DF  00006771     00010400
UTF-16 code units     0041      00DF      6771         D801 DC00
UTF-8 code units      41        C3 9F     E6 9D B1     F0 90 90 80

This article also uses the terms character sequence and char sequence in many places to cover all the containers of character sequences that the Java 2 Platform knows: char[], implementations of java.lang.CharSequence (such as the String class), and implementations of java.text.CharacterIterator.

This is a lot of terminology. What does all this have to do with supporting supplementary characters in the Java platform?

Design Approach for Supplementary Characters in the Java Platform

The main decision the JSR-204 expert group had to make was how to represent supplementary characters in Java APIs, both for individual characters and for character sequences in all forms. A number of approaches were considered and rejected:

- Redefine the primitive type char to have 32 bits, which would also make char sequences in all forms UTF-32 sequences.
- Introduce a new 32-bit primitive type for characters (for example, char32) in addition to the existing 16-bit type char. Char sequences in all forms would remain based on UTF-16.
- Introduce a new 32-bit primitive type for characters (for example, char32) in addition to the existing 16-bit type char. String and StringBuffer receive parallel APIs that interpret them either as UTF-16 sequences or as UTF-32 sequences; other char sequences continue to be based on UTF-16.
- Use int to represent supplementary code points. String and StringBuffer receive parallel APIs that interpret them either as UTF-16 sequences or as UTF-32 sequences; other char sequences continue to be based on UTF-16.
- Use surrogate char pairs to represent supplementary code points. Char sequences in all forms would be based on UTF-16.
- Introduce a class that encapsulates a character. String and StringBuffer receive new APIs that interpret them as sequences of such characters.
- Use a combination of a CharSequence instance and an index to represent code points.

Some of these approaches were ruled out early on. Redefining the primitive type char to have 32 bits, for example, might have been very attractive for a brand-new platform, but for J2SE it would have been incompatible with existing Java virtual machines¹, serialization, and other interfaces, not to mention that UTF-32-based strings use twice as much memory as UTF-16-based ones. Adding a new type char32 would have been easier, but would still have created problems for virtual machines and serialization. Also, language changes usually have longer lead times than API changes, so the first two approaches might have unacceptably delayed support for supplementary characters. To help determine the winner among the remaining approaches, the implementation team implemented supplementary character support in a substantial piece of code that does low-level character processing (the java.util.regex package) using four different approaches, and compared them in terms of ease of development and runtime performance.

In the end, the expert group decided on a tiered approach:

- Use the primitive type int to represent code points in low-level APIs, such as the static methods of the Character class.
- Interpret char sequences in all forms as UTF-16 sequences, and promote their use in higher-level APIs.
- Provide APIs that make it easy to convert between the various char- and code point-based representations.

This approach provides a conceptually simple and efficient representation of individual characters where needed, while leveraging existing APIs that can be retrofitted to support supplementary characters. It also promotes the use of character sequences over individual characters, which is generally better for internationalized software.

With this approach, a char represents a UTF-16 code unit, which is not always sufficient to represent a code point. You'll notice that the J2SE specifications now use the terms code point and UTF-16 code unit where the representation is relevant, and the generic term character where the representation is irrelevant to the discussion. APIs usually use the name codePoint for variables of type int that represent code points, while UTF-16 code units of course have type char.

We'll take a look at the actual changes in the J2SE platform in the next two sections: one covers the low-level APIs that work on individual code points, the other the higher-level interfaces that work on character sequences.

Supplementary Characters in the Open: Code Point-Based APIs

The low-level APIs that were added fall into two broad categories: methods that convert between various char- and code point-based representations, and methods that analyze or map code points.

The most basic conversion methods are Character.toCodePoint(char high, char low), which converts two UTF-16 code units to a code point, and Character.toChars(int codePoint), which converts the given code point to one or two UTF-16 code units wrapped into a char[]. However, since text most of the time comes in the form of a character sequence, there are also codePointAt and codePointBefore methods that extract a code point from the various character sequence representations: Character.codePointAt(char[] a, int index) and String.codePointBefore(int index) are two typical examples. For the most common cases of inserting code points into a character sequence, there are appendCodePoint(int codePoint) methods for the StringBuffer and StringBuilder classes and a String constructor that takes an int[] representing code points.
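
As a quick illustration, here is a sketch (ours) that pulls the methods just mentioned together; the class name and sample values are assumptions, everything else is the J2SE 5.0 API:

public class ConversionDemo {
    public static void main(String[] args) {
        // Pair two code units back into a code point:
        int cp = Character.toCodePoint('\uD801', '\uDC00');
        System.out.printf("%X%n", cp); // 10400

        // Split a code point into its one or two code units:
        char[] units = Character.toChars(0x10400);
        System.out.println(units.length); // 2

        // Extract a code point from a character sequence:
        String s = "a\uD801\uDC00b";
        System.out.printf("%X%n", s.codePointAt(1)); // 10400

        // Build sequences from code points:
        StringBuilder sb = new StringBuilder().appendCodePoint(0x10400);
        System.out.println(sb.length()); // 2 code units
        String t = new String(new int[] { 0x61, 0x10400 }, 0, 2); // "a" + U+10400
        System.out.println(t.length()); // 3 code units for 2 code points
    }
}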

A few methods that analyze code units and code points help in the conversion process: the isHighSurrogate and isLowSurrogate methods in the Character class identify the char values used to represent supplementary characters, and the charCount(int codePoint) method determines whether a code point needs to be converted to one or two chars.
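
A short example (ours) of these analysis methods at work:

public class SurrogateChecks {
    public static void main(String[] args) {
        System.out.println(Character.charCount(0x6771));  // 1: BMP character
        System.out.println(Character.charCount(0x10400)); // 2: supplementary

        String s = "\uD801\uDC00"; // U+10400 as a surrogate pair
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
    }
}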

Most code point-based methods, however, perform for the complete range of Unicode characters the functions that older char-based methods performed for BMP characters. Here are some typical examples; a short sketch exercising them follows the list:

- Character.isLetter(int codePoint) identifies letters according to the Unicode standard.
- Character.isJavaIdentifierStart(int codePoint) determines whether a code point can start an identifier according to the Java Language Specification.
- Character.UnicodeBlock.of(int codePoint) looks up the Unicode block that the code point belongs to.
- Character.toUpperCase(int codePoint) converts the given code point to its uppercase equivalent. While this method does support supplementary characters, it still cannot work around the fundamental issue that some case conversions cannot be done correctly character by character. The German character "ß", for example, should be converted to "SS", which requires the String.toUpperCase method.
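
The following sketch (ours) exercises these methods, together with the validity check discussed next, on the supplementary letter U+10400; the expected results assume the Unicode 4.0 character data that J2SE 5.0 ships with:

public class CodePointQueries {
    public static void main(String[] args) {
        int cp = 0x10400; // DESERET CAPITAL LETTER LONG I
        System.out.println(Character.isValidCodePoint(cp));      // true
        System.out.println(Character.isLetter(cp));              // true
        System.out.println(Character.isJavaIdentifierStart(cp)); // true
        System.out.println(Character.UnicodeBlock.of(cp));       // DESERET

        // Deseret is a bicameral script: U+10428 is the lowercase form.
        System.out.printf("%X%n", Character.toUpperCase(0x10428)); // 10400
    }
}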

Note that most methods that accept a code point do not check whether the given int value is in the range of valid Unicode code points (as mentioned above, only the range from 0x0 to 0x10FFFF is valid). In most cases the value is produced in a way that guarantees it is valid, and checking it repeatedly in these low-level APIs could adversely affect system performance. Where validity cannot be guaranteed, applications must use the Character.isValidCodePoint method to make sure the code point is valid. The behavior of most methods for invalid code points is intentionally unspecified and may vary between implementations.

The API contains a number of convenience methods that could be implemented using other, lower-level APIs, but that the expert group felt would be used often enough that adding them to the J2SE platform made sense. However, the expert group also rejected some proposed convenience methods, which gives us an opportunity to show how you can implement such methods yourself. For example, the expert group debated and rejected a new constructor for the String class that would create a String holding a single code point. Here's a simple way for an application to provide the functionality using the existing API:

/**
 * Creates a new String that contains just the given code point.
 */
String newString(int codePoint) {
    return new String(Character.toChars(codePoint));
}

You'll notice that in this simple implementation the toChars method always creates an intermediate array, which is used once and then immediately discarded. If the method shows up in your performance measurements, you may want to optimize for the most common case, where the code point is a BMP character:

/**
 * Creates a new String that contains just the given code point.
 * Version that optimizes for BMP characters.
 */
String newString(int codePoint) {
    if (Character.charCount(codePoint) == 1) {
        return String.valueOf((char) codePoint);
    } else {
        return new String(Character.toChars(codePoint));
    }
}

Or, if you need to create many such strings, you may want to write a bulk version that reuses the array used by the toChars method:

/**
 * Creates new Strings, each of which contains one of the given
 * code points.
 * Version that optimizes for BMP characters.
 */
String[] newStrings(int[] codePoints) {
    String[] result = new String[codePoints.length];
    char[] codeUnits = new char[2];
    for (int i = 0; i < codePoints.length; i++) {
        int count = Character.toChars(codePoints[i], codeUnits, 0);
        result[i] = new String(codeUnits, 0, count);
    }
    return result;
}

However, it may turn out that you actually want an entirely different solution. The new constructor String(int codePoint) was proposed as a code point-based alternative to String.valueOf(char). In many cases that method is used in the context of message generation, such as:

System.out.println("Character " + String.valueOf(ch) + " is invalid.");

The new formatting API, which supports supplementary characters, provides a much simpler alternative:

System.out.printf("Character %c is invalid.%n", codePoint);

Using this higher-level API is not only simpler; it also has additional advantages: it avoids the concatenation, which would make the message very hard to localize, and it reduces the number of strings that need to be moved into a resource bundle from two to one.

Supplementary Characters under the Hood: Functionality Enhancements

Most of the changes in the Java 2 Platform that enable the use of supplementary characters are not reflected in new API. The general expectation is that all interfaces that handle character sequences handle supplementary characters in a way that is appropriate for their functionality. This section highlights some of the enhancements made to meet this expectation.

Identifiers in the Java Programming Language

The Java Language Specification specifies that all Unicode letters and digits can be used in identifiers. Many supplementary characters are letters or digits, so the Java Language Specification was updated to refer to the new code point-based methods to define the legal characters in identifiers. The javac compiler and other tools that need to detect identifiers were changed to use these new methods.

Supplementary Character Support in Libraries

Numerous J2SE libraries have been enhanced to support supplementary characters through existing interfaces. Here are just a few examples:

- String case conversion has been updated to handle supplementary characters and to implement the special casing rules specified in the Unicode standard.
- The java.util.regex package has been updated so that both pattern strings and target strings can contain supplementary characters, which are handled as complete units.
- Collation in the java.text package now treats supplementary characters as complete units.
- The java.text.Bidi class has been updated to handle supplementary characters and other characters that are new in Unicode 4.0. Note that the supplementary characters in the Cypriot Syllabary block have right-to-left directionality.
- Font rendering and printing in the Java 2D API have been enhanced to correctly render and measure strings containing supplementary characters.
- The Swing text component implementation has been updated to handle text that contains supplementary characters.

Character Conversion

Only a small number of character encodings can represent supplementary characters. In the case of Unicode-based encodings such as UTF-8 and UTF-16LE, the character converters in previous releases of the J2RE already implemented the conversions in a way that handles supplementary characters correctly. For J2RE 5.0, the converters for the other encodings that can represent supplementary characters have been updated: GB18030, x-EUC-TW (which now implements all planes of CNS 11643), and Big5-HKSCS (which now implements HKSCS-2001).
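
For example, a round trip through the updated GB18030 converter might look like the sketch below (ours); supplementary characters use four-byte sequences in GB18030:

import java.io.UnsupportedEncodingException;

public class Gb18030RoundTrip {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // U+20000 is a supplementary CJK ideograph (Extension B).
        String original = new String(Character.toChars(0x20000));

        byte[] bytes = original.getBytes("GB18030"); // four-byte GB18030 form
        String decoded = new String(bytes, "GB18030");

        System.out.println(original.equals(decoded));      // true
        System.out.printf("%X%n", decoded.codePointAt(0)); // 20000
    }
}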

Representing Supplementary Characters in Source Files

In Java programming language source files, supplementary characters are easiest to use if the file's character encoding can represent them directly; UTF-8 is an excellent choice. Where the character encoding used cannot represent the characters directly, the Java programming language provides a Unicode escape syntax. This syntax has not been enhanced to express supplementary characters directly. Instead, a supplementary character is represented by two consecutive Unicode escapes, one for each of the two code units of its UTF-16 representation. For example, the character U+20000 is written as "\uD840\uDC00". You probably don't want to figure out these escape sequences yourself; it's best to write in an encoding that supports the supplementary characters you need and then use a tool such as native2ascii to convert to escape sequences.
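
A tiny sketch (ours) showing the escaped form in use:

public class EscapeDemo {
    public static void main(String[] args) {
        // U+20000 written as its two UTF-16 code units:
        String s = "\uD840\uDC00";
        System.out.println(s.length());              // 2 code units
        System.out.printf("%X%n", s.codePointAt(0)); // 20000
    }
}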

Properties files, unfortunately, are still limited to ISO 8859-1 as their encoding (unless your application uses the new XML format). This means you always have to use escape sequences for supplementary characters, and again you will probably want to write in a different encoding and then convert with a tool such as native2ascii.

Modified UTF-8

Modified UTF-8 is not new to the Java platform, but application developers need to be more aware of it when converting text that might contain supplementary characters to and from UTF-8. The main thing to remember is that some J2SE interfaces use an encoding that is similar to UTF-8 but incompatible with it. This encoding has in the past sometimes been called "Java modified UTF-8" or (incorrectly) just "UTF-8". For J2SE 5.0, the documentation is being updated to uniformly call it "modified UTF-8".

The incompatibility between modified UTF-8 and standard UTF-8 stems from two differences. First, modified UTF-8 represents the character U+0000 as the two-byte sequence 0xC0 0x80, whereas standard UTF-8 uses the single byte 0x00. Second, modified UTF-8 represents supplementary characters by separately encoding the two surrogate code units of their UTF-16 representation. Each surrogate code unit is represented by three bytes, for a total of six bytes; standard UTF-8, on the other hand, uses a single four-byte sequence for the complete character.
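
The size difference is easy to demonstrate. The sketch below (ours) encodes U+10400 once with String.getBytes("UTF-8"), which produces standard UTF-8, and once with DataOutputStream.writeUTF, which writes modified UTF-8 preceded by a two-byte length:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        String s = "\uD801\uDC00"; // U+10400 as a surrogate pair

        // Standard UTF-8: one four-byte sequence.
        byte[] standard = s.getBytes("UTF-8");
        System.out.println(standard.length); // 4 (F0 90 90 80)

        // Modified UTF-8, as written by DataOutput.writeUTF:
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        new DataOutputStream(buffer).writeUTF(s);
        byte[] modified = buffer.toByteArray();
        // writeUTF prepends a two-byte length, then encodes each
        // surrogate code unit in three bytes: 2 + 6 = 8 bytes total.
        System.out.println(modified.length); // 8
    }
}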

Modified UTF-8 is used by the Java virtual machine and the interfaces attached to it (such as the Java Native Interface, the various tool interfaces, and Java class files), in the java.io.DataInput and DataOutput interfaces and the classes implementing or using them, and for serialization. The Java Native Interface provides routines that convert to and from modified UTF-8. Standard UTF-8, on the other hand, is supported by the String class, the java.io.InputStreamReader and OutputStreamWriter classes, the java.nio.charset facilities, and many APIs layered on top of them.

Since modified UTF-8 is incompatible with standard UTF-8, it is critical not to use one where the other is needed. Modified UTF-8 can only be used with the Java interfaces described above. In all other cases, in particular for data streams that may come from or be interpreted by software that is not based on the Java platform, standard UTF-8 must be used. The Java Native Interface routines that convert to and from modified UTF-8 cannot be used when standard UTF-8 is required.

Supporting Supplementary Characters in Your Application

Now for the question that matters most to readers: what changes do you have to make to your application in order to support supplementary characters?

The answer depends on what kind of text processing is done within the application and which Java platform APIs it uses.

Applications that deal with text only in the form of char sequences in all forms (char[], implementations of java.lang.CharSequence, implementations of java.text.CharacterIterator), and that only use Java APIs accepting and returning such char sequences, will likely not need any changes. The implementations of the Java platform APIs should handle supplementary characters for you.

Applications that themselves interpret individual characters, pass individual characters to Java platform APIs, or call methods that return individual characters need to consider the valid values for those characters. In many cases it turns out that support for supplementary characters is not required. For example, if an application scans a char sequence for HTML tags, checking each char individually, it knows that the tags use only characters from the Basic Latin block. If the scanned text contains supplementary characters, those characters cannot be confused with the tag characters, because UTF-16 represents supplementary characters using code units whose values are never used for BMP characters.
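
A sketch (ours) of such a scanner, which remains correct in the presence of supplementary characters precisely because '>' can never match a surrogate code unit:

public class TagScanner {
    /**
     * Returns the index of the next '>' at or after start, or -1.
     * Scanning individual chars is safe here: supplementary
     * characters in the text cannot be mistaken for '>'.
     */
    public static int findTagEnd(CharSequence text, int start) {
        for (int i = start; i < text.length(); i++) {
            if (text.charAt(i) == '>') {
                return i;
            }
        }
        return -1;
    }
}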

Only where applications themselves interpret individual characters, pass individual characters to Java platform APIs, or call methods that return individual characters, and those characters can be supplementary characters, does the application have to change. Where parallel APIs that use char sequences are available, it is best to switch to such APIs. In the remaining cases, it is necessary to use the new API to convert between char- and code point-based representations and to call the code point-based APIs. Unless, of course, you are lucky and find that there are newer, more convenient APIs in J2SE 5.0 that let you support supplementary characters and simplify your code at the same time, as in the formatting example above.
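
As a sketch of such a change (ours, using a hypothetical letter-counting task): the first method below classifies individual chars and therefore never counts supplementary letters, because a surrogate code unit is not itself a letter; the second advances through the string code point by code point:

public class LetterCount {
    // char-based: supplementary letters are missed, because each
    // surrogate code unit is classified on its own.
    static int countLettersPerChar(String s) {
        int letters = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                letters++;
            }
        }
        return letters;
    }

    // code point-based: advances by one or two code units as needed
    // and classifies complete code points.
    static int countLetters(String s) {
        int letters = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                letters++;
            }
            i += Character.charCount(cp);
        }
        return letters;
    }
}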

You might wonder whether it is better to convert all text into a code point representation (say, an int[]) and process it in that form, or to stick with char sequences most of the time and convert to code points only when needed. The Java platform APIs in general certainly favor char sequences, and staying with them also saves memory.

Applications that need to convert to and from UTF-8 also have to consider carefully whether standard or modified UTF-8 is required, and use the proper Java platform facilities for each. The section "Modified UTF-8" provides the information needed to make the right choice.

Testing Your Application with Supplementary Characters

Whether the previous sections led you to revise your application or not, it is always a good idea to test that it behaves correctly. For applications without a graphical user interface, the information in "Representing Supplementary Characters in Source Files" helps in developing test cases. Here is some additional information on testing with graphical user interfaces.

For text input, the Java 2 SDK provides a code point input method that accepts strings of the form "\Uxxxxxx", where the uppercase "U" indicates that the escape sequence contains six hexadecimal digits and thus allows for supplementary characters. A lowercase "u" indicates the original form of the escape sequences, "\uxxxx". You can find this input method and its documentation in the directory demo/jfc/CodePointIM of the J2SDK.

For font rendering, you need a font that can render at least some supplementary characters. One such font is James Kass's Code2001 font, which provides glyphs for scripts such as Deseret and Old Italic. Thanks to a new feature in the Java 2D library, you can simply install the font into the J2RE's lib/fonts/fallback directory, and it is automatically added to all logical fonts used in 2D and XAWT rendering; you don't need to edit font configuration files.

And with that, you can call your application ready for supplementary characters!

Conclusion

Support for supplementary characters has been introduced into the Java platform with an approach that enables most applications to handle these characters without code changes. Applications that interpret individual characters can use the new code point-based API in the Character class and the various CharSequence implementations.

Acknowledgments

Supplementary character support in the Java platform was designed by the JSR-204 expert group within the Java Community Process. The specification leads are Masayoshi Okutsu and Brian Beck (Sun Microsystems); the other members of the expert group are Craig Cummings (Oracle), Mark Davis (IBM), Markus Eble (SAP AG), Jere Käpyaho (Nokia Corp.), Kazuhiro Kazama (NTT), Kenji Kazumura (Fujitsu Limited), Eiichi Kimura (NEC Corp.), Changshin Lee (Tmax Soft Inc.), and Toshiki Murata (Oki Electric Industry Co.). The reference implementation was done by the Java Internationalization team at Sun Microsystems, with contributions from the IBM Globalization Center of Competency, San José. The technology compatibility kit for the specification is the Java Compatibility Kit, implemented by the JCK team at Sun Microsystems.

References

Masayoshi Okutsu, Brian Beck (ed.): Unicode Supplementary Character Support, Proposed Final Draft. Sun Microsystems, 2004.

Java 2 Platform Standard Edition 5.0 API Specification. Sun Microsystems, 2004.

The Unicode Consortium: The Unicode Standard, Version 4.0. Addison-Wesley, 2003.

Ken Whistler, Mark Davis: Character Encoding Model. Unicode Technical Report #17. The Unicode Consortium, 2000.

James Kass: Code2001, a Plane 1 Unicode-Based Font.

About the Authors

Norbert Lindenberg is the technical lead for Java internationalization in Sun Microsystems' Java Web Services group. Before joining Sun, he worked on a variety of internationalization projects at General Magic and Apple Computer. He holds an M.S. degree in computer science from Universität Karlsruhe, Germany.

Masayoshi Okutsu is an internationalization engineer in Sun Microsystems' Java Web Services group and currently the specification lead for Java Specification Request 204, Unicode Supplementary Character Support. Before joining Sun Microsystems, he worked on a variety of internationalization projects at Digital Equipment Corporation. He holds a B.S. degree in electronic engineering from Yamagata University, Japan.


¹ The term "Java virtual machine" or "JVM", as used here, means a virtual machine for the Java platform.


