Article
Supplementary Characters in the Java Platform







 

By Norbert Lindenberg and Masayoshi Okutsu, Sun Microsystems, Inc.

May 2004


Abstract

This article describes how supplementary characters are supported in the Java platform. Supplementary characters are characters in the Unicode standard whose code points are above U+FFFF and that therefore cannot be described as single 16-bit entities such as the char data type in the Java programming language. Such characters are generally rare, but some are used, for example, as part of Chinese and Japanese personal names, so support for them is commonly required for government applications in East Asian countries.

The Java platform is being enhanced to enable processing of supplementary characters with minimal impact on existing applications. New low-level APIs enable operations on individual characters where necessary. Most text-processing APIs, however, use character sequences, such as the String class or character arrays. These are now interpreted as UTF-16 sequences, and the implementations of these APIs have been changed to handle supplementary characters correctly. The enhancements are part of version 5.0 of the Java 2 Platform, Standard Edition (J2SE).

Besides explaining these enhancements in detail, this article also provides guidelines that help application developers determine and implement the changes necessary to enable use of the complete Unicode character set.

Background

Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that could hold any character. However, it turned out that the 65,536 characters possible in a 16-bit encoding are not sufficient to represent all characters that are or have been in use around the world. The Unicode standard was therefore extended to allow up to 1,112,064 characters. Characters beyond the original 16-bit limit are called supplementary characters. Version 2.0 of the Unicode standard was the first to include a design that enables supplementary characters, but it was only in version 3.1 that the first supplementary characters were assigned. Since version 5.0 of J2SE is required to support version 4.0 of the Unicode standard, it has to support supplementary characters.

Support for supplementary characters is also likely to become a common business requirement in East Asian markets. Government applications will require them in order to correctly represent names that include rare Chinese characters. Publishing applications may need them to represent the full set of historical and variant characters. The Chinese government requires support for GB18030, a character encoding that encodes the entire Unicode character set and therefore includes supplementary characters if Unicode version 3.1 or later is assumed. The Taiwanese standard CNS-11643 includes numerous characters that were added to Unicode 3.1 as supplementary characters. The Hong Kong government defined a collection of characters needed for Cantonese, some of which are supplementary characters in Unicode. Finally, some vendors in Japan are planning to use the large private use area in the supplementary character space for more than 50,000 kanji character variants in order to migrate from their proprietary systems to solutions based on the Java platform.

The Java platform therefore not only has to support supplementary characters; it also has to make it easy for applications to do the same. Since supplementary characters break a fundamental assumption of the Java programming language and might have required a fundamental change in the programming model, an expert group was convened under the Java Community Process to choose the right solution. The group is called the JSR-204 expert group, after the number of the Java Specification Request for Unicode Supplementary Character Support. Technically, the decisions of the expert group apply only to the J2SE platform, but since the Java 2 Platform, Enterprise Edition (J2EE) sits on top of the J2SE platform, it benefits directly, and we expect that the configurations of the Java 2 Platform, Micro Edition (J2ME) will adopt the same design approach.

But before we can look at the solution that the JSR-204 expert group came up with, we need to learn some terminology.

Code Points, Character Encoding Schemes, UTF-16: What's All This?

The introduction of supplementary characters unfortunately makes the character model quite a bit more complicated. Where in the past we could simply talk about "characters" and, in a Unicode-based environment such as the Java platform, assume that a character has 16 bits, we now need more terminology. We'll try to keep it relatively simple; for a full discussion with all the details, you can read Chapter 2 of the Unicode Standard or Unicode Technical Report 17, "Character Encoding Model". Unicode experts may skip all but the last definition in this section.

A character is an abstract minimal unit of text. It doesn't have a fixed shape (that would be a glyph), and it doesn't have a value. "A" is a character, and so is "€", the symbol for the common currency of Germany, France, and numerous other European countries.

A character set is a collection of characters. For example, the Han characters are the characters originally invented by the Chinese, which have been used to write Chinese, Japanese, Korean, and Vietnamese.

A coded character set is a character set in which each character has been assigned a unique number. At the core of the Unicode standard is a coded character set that assigns the letter "A" the number 0041₁₆ and the symbol "€" the number 20AC₁₆. The Unicode standard always uses hexadecimal numbers and writes them with the prefix "U+", so the number for "A" is written as "U+0041".

Code points are the numbers that can be used in a coded character set. A coded character set defines a range of valid code points, but doesn't necessarily assign characters to all of them. The valid code points for Unicode are U+0000 to U+10FFFF. Unicode 4.0 assigns characters to 96,382 of these more than one million code points.

Supplementary characters are characters with code points in the range U+10000 to U+10FFFF; that is, the characters that could not be represented in the original 16-bit design of Unicode. The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Thus, each Unicode character is either in the BMP or a supplementary character.

A character encoding scheme is a mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units. The most commonly used code units are bytes, but 16-bit or 32-bit integers can also be used for internal processing. UTF-32, UTF-16, and UTF-8 are character encoding schemes for the coded character set of the Unicode standard.

UTF-32 simply represents each Unicode code point as the 32-bit integer of the same value. It is clearly the most convenient representation for internal processing, but it uses significantly more memory than necessary if used as a general string representation.

UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF) and the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: the values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means that software can tell, for each individual code unit in a string, whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, in which the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.
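
As an illustration (ours, not from the original article), the following sketch derives the UTF-16 code units of U+10400 both by hand, following the definition above, and with the J2SE 5.0 Character.toChars method; the class name SurrogateDemo is an assumption.

public class SurrogateDemo {
    public static void main(String[] args) {
        int codePoint = 0x10400; // DESERET CAPITAL LETTER LONG I

        // Manual computation, per the UTF-16 definition:
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        System.out.printf("manual:  %04X %04X%n", (int) high, (int) low);

        // The same result from the platform API:
        char[] units = Character.toChars(codePoint);
        System.out.printf("toChars: %04X %04X%n", (int) units[0], (int) units[1]);
        // Both lines print: D801 DC00
    }
}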

UTF-8 uses sequences of one to four bytes to encode Unicode code points. U+0000 to U+007F are encoded in one byte, U+0080 to U+07FF in two bytes, U+0800 to U+FFFF in three bytes, and U+10000 to U+10FFFF in four bytes. UTF-8 is designed so that the byte values 0x00 to 0x7F always represent the code points U+0000 to U+007F (the Basic Latin block, which corresponds to the ASCII character set). These byte values never occur in the representation of other code points, a characteristic that makes UTF-8 convenient for software that assigns special meanings to certain ASCII characters.
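
To see these byte lengths in practice, here is a small sketch (ours) that encodes the four characters from the comparison table that follows, using String.getBytes("UTF-8"), which produces standard UTF-8:

import java.io.UnsupportedEncodingException;

public class Utf8Lengths {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // U+0041, U+00DF, U+6771, and U+10400 (as a surrogate pair)
        String[] samples = { "\u0041", "\u00DF", "\u6771", "\uD801\uDC00" };
        for (String s : samples) {
            byte[] utf8 = s.getBytes("UTF-8"); // standard UTF-8
            StringBuilder hex = new StringBuilder();
            for (byte b : utf8) {
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(hex); // 41 / C3 9F / E6 9D B1 / F0 90 90 80
        }
    }
}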

The following table compares the representations of a few characters:

Unicode code point    U+0041    U+00DF    U+6771       U+10400
Representative glyph  A         ß         東           𐐀
UTF-32 code units     00000041  000000DF  00006771     00010400
UTF-16 code units     0041      00DF      6771         D801 DC00
UTF-8 code units      41        C3 9F     E6 9D B1     F0 90 90 80

This article also uses the terms character sequence and char sequence in many places to cover all the containers of character sequences that the Java 2 Platform knows: char[], implementations of java.lang.CharSequence (such as the String class), and implementations of java.text.CharacterIterator.

This is a lot of terminology. What does all this have to do with supporting supplementary characters in the Java platform?

Design Approach for Supplementary Characters in the Java Platform

The main decision the JSR-204 expert group had to make was how to represent supplementary characters in Java APIs, both for individual characters and for character sequences in all forms. A number of approaches were considered and rejected:

- Redefine the primitive type char to have 32 bits, which would also make char sequences in all forms UTF-32 sequences.
- Introduce a new 32-bit primitive type for characters (for example, char32) in addition to the existing 16-bit type char. Char sequences in all forms would remain based on UTF-16.
- Introduce a new 32-bit primitive type for characters (for example, char32) in addition to the existing 16-bit type char. String and StringBuffer receive parallel APIs that interpret them either as UTF-16 sequences or as UTF-32 sequences; other char sequences continue to be based on UTF-16.
- Use int to represent supplementary code points. String and StringBuffer receive parallel APIs that interpret them either as UTF-16 sequences or as UTF-32 sequences; other char sequences continue to be based on UTF-16.
- Use surrogate char pairs to represent supplementary code points. Char sequences in all forms would be based on UTF-16.
- Introduce a class that encapsulates a character. String and StringBuffer receive new APIs that interpret them as sequences of such characters.
- Use a combination of a CharSequence instance and an index to represent code points.

Some of these approaches were ruled out early on. Redefining the primitive type char to have 32 bits, for example, might have been very attractive for a brand-new platform, but for J2SE it would have been incompatible with existing Java virtual machines¹, serialization, and other interfaces, not to mention that UTF-32-based strings use twice as much memory as UTF-16-based ones. Adding a new type char32 would have been easier, but would still have created problems for virtual machines and serialization. Also, language changes usually have longer lead times than API changes, so the first two approaches might have unacceptably delayed support for supplementary characters. To help determine the winner among the remaining approaches, the implementation team implemented supplementary character support in a substantial piece of code that does low-level character processing (the java.util.regex package) using four different approaches, and compared them in terms of ease of development and runtime performance.

In the end, the expert group decided on a tiered approach:

- Use the primitive type int to represent code points in low-level APIs, such as the static methods of the Character class.
- Interpret char sequences in all forms as UTF-16 sequences, and promote their use in higher-level APIs.
- Provide APIs that make it easy to convert between the various char- and code point-based representations.

This approach provides a conceptually simple and efficient representation of individual characters where needed, while leveraging existing APIs that can be retrofitted to support supplementary characters. It also promotes the use of character sequences over individual characters, which is generally better for internationalized software.

With this approach, a char represents a UTF-16 code unit, which is not always sufficient to represent a code point. You'll notice that the J2SE specifications now use the terms code point and UTF-16 code unit where the representation is relevant, and the generic term character where the representation is irrelevant to the discussion. APIs usually use the name codePoint for variables of type int that represent code points, while UTF-16 code units of course have type char.

We'll take a look at the actual changes in the J2SE platform in the next two sections: one covers the low-level APIs that work on individual code points, the other the higher-level interfaces that work on character sequences.

Supplementary Characters in the Open: Code Point-Based APIs

The low-level APIs that were added fall into two broad categories: methods that convert between various char- and code point-based representations, and methods that analyze or map code points.

The most basic conversion methods are Character.toCodePoint(char high, char low), which converts two UTF-16 code units to a code point, and Character.toChars(int codePoint), which converts the given code point to one or two UTF-16 code units wrapped into a char[]. However, since text most of the time comes in the form of a character sequence, there are also codePointAt and codePointBefore methods that extract a code point from the various character sequence representations: Character.codePointAt(char[] a, int index) and String.codePointBefore(int index) are two typical examples. For the most common cases of inserting code points into a character sequence, there are appendCodePoint(int codePoint) methods for the StringBuffer and StringBuilder classes and a String constructor that takes an int[] representing code points.
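
As a quick illustration, here is a sketch (ours) that pulls the methods just mentioned together; the class name and sample values are assumptions, everything else is the J2SE 5.0 API:

public class ConversionDemo {
    public static void main(String[] args) {
        // Pair two code units back into a code point:
        int cp = Character.toCodePoint('\uD801', '\uDC00');
        System.out.printf("%X%n", cp); // 10400

        // Split a code point into its one or two code units:
        char[] units = Character.toChars(0x10400);
        System.out.println(units.length); // 2

        // Extract a code point from a character sequence:
        String s = "a\uD801\uDC00b";
        System.out.printf("%X%n", s.codePointAt(1)); // 10400

        // Build sequences from code points:
        StringBuilder sb = new StringBuilder().appendCodePoint(0x10400);
        System.out.println(sb.length()); // 2 code units
        String t = new String(new int[] { 0x61, 0x10400 }, 0, 2); // "a" + U+10400
        System.out.println(t.length()); // 3 code units for 2 code points
    }
}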

A few methods that analyze code units and code points help in the conversion process: the isHighSurrogate and isLowSurrogate methods in the Character class identify the char values used to represent supplementary characters, and the charCount(int codePoint) method determines whether a code point needs to be converted to one or two chars.
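
A short example (ours) of these analysis methods at work:

public class SurrogateChecks {
    public static void main(String[] args) {
        System.out.println(Character.charCount(0x6771));  // 1: BMP character
        System.out.println(Character.charCount(0x10400)); // 2: supplementary

        String s = "\uD801\uDC00"; // U+10400 as a surrogate pair
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
    }
}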

Most code point-based methods, however, perform for the complete range of Unicode characters the functions that older char-based methods performed for BMP characters. Here are some typical examples; a short sketch exercising them follows the list:

- Character.isLetter(int codePoint) identifies letters according to the Unicode standard.
- Character.isJavaIdentifierStart(int codePoint) determines whether a code point can start an identifier according to the Java Language Specification.
- Character.UnicodeBlock.of(int codePoint) looks up the Unicode block that the code point belongs to.
- Character.toUpperCase(int codePoint) converts the given code point to its uppercase equivalent. While this method does support supplementary characters, it still cannot work around the fundamental issue that some case conversions cannot be done correctly character by character. The German character "ß", for example, should be converted to "SS", which requires the String.toUpperCase method.
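
The following sketch (ours) exercises these methods, together with the validity check discussed next, on the supplementary letter U+10400; the expected results assume the Unicode 4.0 character data that J2SE 5.0 ships with:

public class CodePointQueries {
    public static void main(String[] args) {
        int cp = 0x10400; // DESERET CAPITAL LETTER LONG I
        System.out.println(Character.isValidCodePoint(cp));      // true
        System.out.println(Character.isLetter(cp));              // true
        System.out.println(Character.isJavaIdentifierStart(cp)); // true
        System.out.println(Character.UnicodeBlock.of(cp));       // DESERET

        // Deseret is a bicameral script: U+10428 is the lowercase form.
        System.out.printf("%X%n", Character.toUpperCase(0x10428)); // 10400
    }
}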

Note that most methods that accept a code point do not check whether the given int value is in the range of valid Unicode code points (as mentioned above, only the range from 0x0 to 0x10FFFF is valid). In most cases the value is produced in a way that guarantees it is valid, and checking it repeatedly in these low-level APIs could adversely affect system performance. Where validity cannot be guaranteed, applications must use the Character.isValidCodePoint method to make sure the code point is valid. The behavior of most methods for invalid code points is intentionally unspecified and may vary between implementations.

The API contains a number of convenience methods that could be implemented using other, lower-level APIs, but that the expert group felt would be used often enough that adding them to the J2SE platform made sense. However, the expert group also rejected some proposed convenience methods, which gives us an opportunity to show how you can implement such methods yourself. For example, the expert group debated and rejected a new constructor for the String class that would create a String holding a single code point. Here's a simple way for an application to provide the functionality using the existing API:

/**
 * Creates a new String that contains just the given code point.
 */
String newString(int codePoint) {
    return new String(Character.toChars(codePoint));
}

You'll notice that in this simple implementation the toChars method always creates an intermediate array, which is used once and then immediately discarded. If the method shows up in your performance measurements, you may want to optimize for the most common case, where the code point is a BMP character:

/**
 * Creates a new String that contains just the given code point.
 * Version that optimizes for BMP characters.
 */
String newString(int codePoint) {
    if (Character.charCount(codePoint) == 1) {
        return String.valueOf((char) codePoint);
    } else {
        return new String(Character.toChars(codePoint));
    }
}

Or, if you need to create many such strings, you may want to write a bulk version that reuses the array used by the toChars method:

/**
 * Creates new Strings, each of which contains one of the given
 * code points.
 * Version that optimizes for BMP characters.
 */
String[] newStrings(int[] codePoints) {
    String[] result = new String[codePoints.length];
    char[] codeUnits = new char[2];
    for (int i = 0; i < codePoints.length; i++) {
        int count = Character.toChars(codePoints[i], codeUnits, 0);
        result[i] = new String(codeUnits, 0, count);
    }
    return result;
}

However, it may turn out that you actually want an entirely different solution. The new constructor String(int codePoint) was proposed as a code point-based alternative to String.valueOf(char). In many cases that method is used in the context of message generation, such as:

System.out.println("Character " + String.valueOf(ch) + " is invalid.");

The new formatting API, which supports supplementary characters, provides a much simpler alternative:

System.out.printf("Character %c is invalid.%n", codePoint);

Using this higher-level API is not only simpler; it also has additional advantages: it avoids the concatenation, which would make the message very hard to localize, and it reduces the number of strings that need to be moved into a resource bundle from two to one.

Supplementary Characters under the Hood: Functionality Enhancements

Most of the changes in the Java 2 Platform that enable the use of supplementary characters are not reflected in new API. The general expectation is that all interfaces that handle character sequences handle supplementary characters in a way that is appropriate for their functionality. This section highlights some of the enhancements made to meet this expectation.

Identifiers in the Java Programming Language

The Java Language Specification specifies that all Unicode letters and digits can be used in identifiers. Many supplementary characters are letters or digits, so the Java Language Specification was updated to refer to the new code point-based methods to define the legal characters in identifiers. The javac compiler and other tools that need to detect identifiers were changed to use these new methods.

Supplementary Character Support in Libraries

Numerous J2SE libraries have been enhanced to support supplementary characters through existing interfaces. Here are just a few examples:

- String case conversion has been updated to handle supplementary characters and to implement the special casing rules specified in the Unicode standard.
- The java.util.regex package has been updated so that both pattern strings and target strings can contain supplementary characters, which are handled as complete units.
- Collation in the java.text package now treats supplementary characters as complete units.
- The java.text.Bidi class has been updated to handle supplementary characters and other characters that are new in Unicode 4.0. Note that the supplementary characters in the Cypriot Syllabary block have right-to-left directionality.
- Font rendering and printing in the Java 2D API have been enhanced to correctly render and measure strings containing supplementary characters.
- The Swing text component implementation has been updated to handle text that contains supplementary characters.

Character Conversion

Only a small number of character encodings can represent supplementary characters. In the case of Unicode-based encodings such as UTF-8 and UTF-16LE, the character converters in previous releases of the J2RE already implemented the conversions in a way that handles supplementary characters correctly. For J2RE 5.0, the converters for the other encodings that can represent supplementary characters have been updated: GB18030, x-EUC-TW (which now implements all planes of CNS 11643), and Big5-HKSCS (which now implements HKSCS-2001).
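
For example, a round trip through the updated GB18030 converter might look like the sketch below (ours); supplementary characters use four-byte sequences in GB18030:

import java.io.UnsupportedEncodingException;

public class Gb18030RoundTrip {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // U+20000 is a supplementary CJK ideograph (Extension B).
        String original = new String(Character.toChars(0x20000));

        byte[] bytes = original.getBytes("GB18030"); // four-byte GB18030 form
        String decoded = new String(bytes, "GB18030");

        System.out.println(original.equals(decoded));      // true
        System.out.printf("%X%n", decoded.codePointAt(0)); // 20000
    }
}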

Representing Supplementary Characters in Source Files

In Java programming language source files, supplementary characters are easiest to use if the file's character encoding can represent them directly; UTF-8 is an excellent choice. Where the character encoding used cannot represent the characters directly, the Java programming language provides a Unicode escape syntax. This syntax has not been enhanced to express supplementary characters directly. Instead, a supplementary character is represented by two consecutive Unicode escapes, one for each of the two code units of its UTF-16 representation. For example, the character U+20000 is written as "\uD840\uDC00". You probably don't want to figure out these escape sequences yourself; it's best to write in an encoding that supports the supplementary characters you need and then use a tool such as native2ascii to convert to escape sequences.
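
A tiny sketch (ours) showing the escaped form in use:

public class EscapeDemo {
    public static void main(String[] args) {
        // U+20000 written as its two UTF-16 code units:
        String s = "\uD840\uDC00";
        System.out.println(s.length());              // 2 code units
        System.out.printf("%X%n", s.codePointAt(0)); // 20000
    }
}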

Properties files, unfortunately, are still limited to ISO 8859-1 as their encoding (unless your application uses the new XML format). This means you always have to use escape sequences for supplementary characters, and again you will probably want to write in a different encoding and then convert with a tool such as native2ascii.

Modified UTF-8

Modified UTF-8 is not new to the Java platform, but application developers need to be more aware of it when converting text that might contain supplementary characters to and from UTF-8. The main thing to remember is that some J2SE interfaces use an encoding that is similar to UTF-8 but incompatible with it. This encoding has in the past sometimes been called "Java modified UTF-8" or (incorrectly) just "UTF-8". For J2SE 5.0, the documentation is being updated to uniformly call it "modified UTF-8".

The incompatibility between modified UTF-8 and standard UTF-8 stems from two differences. First, modified UTF-8 represents the character U+0000 as the two-byte sequence 0xC0 0x80, whereas standard UTF-8 uses the single byte 0x00. Second, modified UTF-8 represents supplementary characters by separately encoding the two surrogate code units of their UTF-16 representation. Each surrogate code unit is represented by three bytes, for a total of six bytes; standard UTF-8, on the other hand, uses a single four-byte sequence for the complete character.
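
The size difference is easy to demonstrate. The sketch below (ours) encodes U+10400 once with String.getBytes("UTF-8"), which produces standard UTF-8, and once with DataOutputStream.writeUTF, which writes modified UTF-8 preceded by a two-byte length:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        String s = "\uD801\uDC00"; // U+10400 as a surrogate pair

        // Standard UTF-8: one four-byte sequence.
        byte[] standard = s.getBytes("UTF-8");
        System.out.println(standard.length); // 4 (F0 90 90 80)

        // Modified UTF-8, as written by DataOutput.writeUTF:
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        new DataOutputStream(buffer).writeUTF(s);
        byte[] modified = buffer.toByteArray();
        // writeUTF prepends a two-byte length, then encodes each
        // surrogate code unit in three bytes: 2 + 6 = 8 bytes total.
        System.out.println(modified.length); // 8
    }
}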

Modified UTF-8 is used by the Java virtual machine and the interfaces attached to it (such as the Java Native Interface, the various tool interfaces, and Java class files), in the java.io.DataInput and DataOutput interfaces and the classes implementing or using them, and for serialization. The Java Native Interface provides routines that convert to and from modified UTF-8. Standard UTF-8, on the other hand, is supported by the String class, the java.io.InputStreamReader and OutputStreamWriter classes, the java.nio.charset facilities, and many APIs layered on top of them.

Since modified UTF-8 is incompatible with standard UTF-8, it is critical not to use one where the other is needed. Modified UTF-8 can only be used with the Java interfaces described above. In all other cases, in particular for data streams that may come from or be interpreted by software that is not based on the Java platform, standard UTF-8 must be used. The Java Native Interface routines that convert to and from modified UTF-8 cannot be used when standard UTF-8 is required.

Supporting Supplementary Characters in Your Application

Now for the question that matters most to readers: what changes do you have to make to your application in order to support supplementary characters?

The answer depends on what kind of text processing is done within the application and which Java platform APIs it uses.

Applications that deal with text only in the form of char sequences in all forms (char[], implementations of java.lang.CharSequence, implementations of java.text.CharacterIterator), and that only use Java APIs accepting and returning such char sequences, will likely not need any changes. The implementations of the Java platform APIs should handle supplementary characters for you.

Applications that themselves interpret individual characters, pass individual characters to Java platform APIs, or call methods that return individual characters need to consider the valid values for those characters. In many cases it turns out that support for supplementary characters is not required. For example, if an application scans a char sequence for HTML tags, checking each char individually, it knows that the tags use only characters from the Basic Latin block. If the scanned text contains supplementary characters, those characters cannot be confused with the tag characters, because UTF-16 represents supplementary characters using code units whose values are never used for BMP characters.
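
A sketch (ours) of such a scanner, which remains correct in the presence of supplementary characters precisely because '>' can never match a surrogate code unit:

public class TagScanner {
    /**
     * Returns the index of the next '>' at or after start, or -1.
     * Scanning individual chars is safe here: supplementary
     * characters in the text cannot be mistaken for '>'.
     */
    public static int findTagEnd(CharSequence text, int start) {
        for (int i = start; i < text.length(); i++) {
            if (text.charAt(i) == '>') {
                return i;
            }
        }
        return -1;
    }
}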

Only where applications themselves interpret individual characters, pass individual characters to Java platform APIs, or call methods that return individual characters, and those characters can be supplementary characters, does the application have to change. Where parallel APIs that use char sequences are available, it is best to switch to such APIs. In the remaining cases, it is necessary to use the new API to convert between char- and code point-based representations and to call the code point-based APIs. Unless, of course, you are lucky and find that there are newer, more convenient APIs in J2SE 5.0 that let you support supplementary characters and simplify your code at the same time, as in the formatting example above.
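
As a sketch of such a change (ours, using a hypothetical letter-counting task): the first method below classifies individual chars and therefore never counts supplementary letters, because a surrogate code unit is not itself a letter; the second advances through the string code point by code point:

public class LetterCount {
    // char-based: supplementary letters are missed, because each
    // surrogate code unit is classified on its own.
    static int countLettersPerChar(String s) {
        int letters = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                letters++;
            }
        }
        return letters;
    }

    // code point-based: advances by one or two code units as needed
    // and classifies complete code points.
    static int countLetters(String s) {
        int letters = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                letters++;
            }
            i += Character.charCount(cp);
        }
        return letters;
    }
}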

You might wonder whether it is better to convert all text into a code point representation (say, an int[]) and process it in that form, or to stick with char sequences most of the time and convert to code points only when needed. The Java platform APIs in general certainly favor char sequences, and staying with them also saves memory.

Applications that need to convert to and from UTF-8 also have to consider carefully whether standard or modified UTF-8 is required, and use the proper Java platform facilities for each. The section "Modified UTF-8" provides the information needed to make the right choice.

Testing Your Application with Supplementary Characters

Whether the previous sections led you to revise your application or not, it is always a good idea to test that it behaves correctly. For applications without a graphical user interface, the information in "Representing Supplementary Characters in Source Files" helps in developing test cases. Here is some additional information on testing with graphical user interfaces.

For text input, the Java 2 SDK provides a code point input method that accepts strings of the form "\Uxxxxxx", where the uppercase "U" indicates that the escape sequence contains six hexadecimal digits and thus allows for supplementary characters. A lowercase "u" indicates the original form of the escape sequences, "\uxxxx". You can find this input method and its documentation in the directory demo/jfc/CodePointIM of the J2SDK.

For font rendering, you need a font that can render at least some supplementary characters. One such font is James Kass's Code2001 font, which provides glyphs for scripts such as Deseret and Old Italic. Thanks to a new feature in the Java 2D library, you can simply install the font into the J2RE's lib/fonts/fallback directory, and it is automatically added to all logical fonts used in 2D and XAWT rendering; you don't need to edit font configuration files.

And with that, you can call your application ready for supplementary characters!

Conclusion

Support for supplementary characters has been introduced into the Java platform with an approach that enables most applications to handle these characters without code changes. Applications that interpret individual characters can use the new code point-based API in the Character class and the various CharSequence implementations.

Acknowledgments

Supplementary character support in the Java platform was designed by the JSR-204 expert group within the Java Community Process. The specification leads are Masayoshi Okutsu and Brian Beck (Sun Microsystems); the other members of the expert group are Craig Cummings (Oracle), Mark Davis (IBM), Markus Eble (SAP AG), Jere Käpyaho (Nokia Corp.), Kazuhiro Kazama (NTT), Kenji Kazumura (Fujitsu Limited), Eiichi Kimura (NEC Corp.), Changshin Lee (Tmax Soft Inc.), and Toshiki Murata (Oki Electric Industry Co.). The reference implementation was done by the Java Internationalization team at Sun Microsystems, with contributions from the IBM Globalization Center of Competency, San José. The technology compatibility kit for the specification is the Java Compatibility Kit, implemented by the JCK team at Sun Microsystems.

References

Masayoshi Okutsu, Brian Beck (ed.): Unicode Supplementary Character Support, Proposed Final Draft. Sun Microsystems, 2004.

Java 2 Platform Standard Edition 5.0 API Specification. Sun Microsystems, 2004.

The Unicode Consortium: The Unicode Standard, Version 4.0. Addison-Wesley, 2003.

Ken Whistler, Mark Davis: Character Encoding Model. Unicode Technical Report #17. The Unicode Consortium, 2000.

James Kass: Code2001, a Plane 1 Unicode-Based Font.

About the Authors

Norbert Lindenberg is the technical lead for Java internationalization in Sun Microsystems' Java Web Services group. Before joining Sun, he worked on a variety of internationalization projects at General Magic and Apple Computer. He holds an M.S. degree in computer science from Universität Karlsruhe, Germany.

Masayoshi Okutsu is an internationalization engineer in Sun Microsystems' Java Web Services group and currently the specification lead for Java Specification Request 204, Unicode Supplementary Character Support. Before joining Sun Microsystems, he worked on a variety of internationalization projects at Digital Equipment Corporation. He holds a B.S. degree in electronic engineering from Yamagata University, Japan.


¹ The term "Java virtual machine" or "JVM", as used here, means a virtual machine for the Java platform.


