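This article looks at the String class at the source-code level and examines how Java stores a String internally.
In the String class (JDK 8 and earlier), the characters of a string are held in a char[]: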
/** The value is used for character storage. */
private final char value[];
/** Cache the hash code for the string */
private int hash; // Default to 0
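To see what the value array actually contains, it helps to pick a representative constructor of String. The one below is a good candidate. Its key parameter is int[] codePoints; the other two parameters only specify the starting offset and the number of code points to take.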
public String(int[] codePoints, int offset, int count) {
    if (offset < 0) {
        throw new StringIndexOutOfBoundsException(offset);
    }
    if (count <= 0) {
        if (count < 0) {
            throw new StringIndexOutOfBoundsException(count);
        }
        if (offset <= codePoints.length) {
            this.value = "".value;
            return;
        }
    }
    // Note: offset or count might be near -1>>>1.
    if (offset > codePoints.length - count) {
        throw new StringIndexOutOfBoundsException(offset + count);
    }
    final int end = offset + count;
    // Pass 1: Compute precise size of char[]
    int n = count;
    for (int i = offset; i < end; i++) {
        int c = codePoints[i];
        if (Character.isBmpCodePoint(c))
            continue;
        else if (Character.isValidCodePoint(c))
            n++;
        else throw new IllegalArgumentException(Integer.toString(c));
    }
    // Pass 2: Allocate and fill in char[]
    final char[] v = new char[n];
    for (int i = offset, j = 0; i < end; i++, j++) {
        int c = codePoints[i];
        if (Character.isBmpCodePoint(c))
            v[j] = (char)c;
        else
            Character.toSurrogates(c, v, j++);
    }
    this.value = v;
}
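To understand this constructor, we first need to know what a code point is. A code point is the number (from U+0000 to U+10FFFF) that Unicode assigns to one complete character. Not every code point fits in 16 bits (a Java char is 16 bits wide), so the codePoints parameter is an int array, and the constructor has to convert its elements before storing them in char value[].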
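As a rough illustration of the point above, the following sketch passes one BMP code point and one supplementary code point to the constructor and prints the resulting char units. The class name CodePointDemo and the helper fromCodePoints are made up for this example; only the String constructor and the String/Character methods come from the JDK.

```java
final class CodePointDemo {
    // Hypothetical helper: wraps the String(int[], int, int) constructor.
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        // U+0041 ('A') fits in one char; U+1F600 (an emoji) does not.
        String s = fromCodePoints(0x41, 0x1F600);

        // Two code points went in, but three chars are stored.
        System.out.println("chars: " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));

        // Print each stored 16-bit unit in hexadecimal.
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("%04X ", (int) s.charAt(i));
        }
        System.out.println();
    }
}
```

The int[] parameter is what makes this possible: a value such as 0x1F600 cannot be passed as a single char.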
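Unicode:
Unicode (known in Chinese as 统一码, 万国码, or 单一码) is an industry standard in computing that covers a character set, encoding schemes, and more. It was created to overcome the limitations of traditional character encodings: it assigns a single, unique number to every character of every language, so that text can be converted and processed across languages and platforms. Work on it began in the late 1980s, and version 1.0 of the standard was published in 1991 (Wikipedia).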
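Put simply, Unicode is a character encoding intended for use worldwide that gives every character a unique number. That number is written as U+ followed by hexadecimal digits, for example U+4E2D.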
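BMP (Basic Multilingual Plane):
For our purposes it is enough to know that the BMP is a range of characters, U+0000 to U+FFFF. A character inside the BMP can be written with four hexadecimal digits (16 bits); a character outside the BMP needs more than four.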
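The BMP test can be sketched in a few lines. The class BmpCheck and the helper inBmp are hypothetical names used only for illustration; the sketch assumes that "inside the BMP" simply means no bit above bit 15 is set, and compares that with what Character.charCount reports.

```java
final class BmpCheck {
    // Hypothetical helper: a code point is in the BMP when it fits in 16 bits.
    static boolean inBmp(int codePoint) {
        return codePoint >>> 16 == 0;
    }

    public static void main(String[] args) {
        int[] samples = {0x0041, 0x4E2D, 0xFFFF, 0x10000, 0x1F600};
        for (int cp : samples) {
            // charCount reports how many chars are needed to store the code point.
            System.out.printf("U+%04X bmp=%b chars=%d%n",
                    cp, inBmp(cp), Character.charCount(cp));
        }
    }
}
```

For the sample values, the hand-written check agrees with Character.isBmpCodePoint, the method the constructor calls.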
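Converting codePoints to char[]:
The core of the conversion is shown below. The key step is checking whether codePoints[i] lies in the BMP. If it does, it fits in a single char; otherwise it must be stored as two chars, which is what Character.toSurrogates(c, v, j++) does.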
// Pass 2: Allocate and fill in char[]
final char[] v = new char[n];
for (int i = offset, j = 0; i < end; i++, j++) {
    int c = codePoints[i];
    if (Character.isBmpCodePoint(c))
        v[j] = (char)c;
    else
        Character.toSurrogates(c, v, j++);
}
this.value = v;
/*********************************************************/
Character.toSurrogates(c, v, j++);
static void toSurrogates(int codePoint, char[] dst, int index) {
    // We write elements "backwards" to guarantee all-or-nothing
    dst[index+1] = lowSurrogate(codePoint);
    dst[index] = highSurrogate(codePoint);
}
/*********************************************************/
Surrogate Pair. A representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit, and the second is a low-surrogate code unit. (See definition D75 in Section 3.8, Surrogates.)
High-Surrogate Code Unit. A 16-bit code unit in the range D800₁₆ to DBFF₁₆, used in UTF-16 as the leading code unit of a surrogate pair. Also known as a leading surrogate. (See definition D72 in Section 3.8, Surrogates.)
Low-Surrogate Code Unit. A 16-bit code unit in the range DC00₁₆ to DFFF₁₆, used in UTF-16 as the trailing code unit of a surrogate pair. Also known as a trailing surrogate. (See definition D74 in Section 3.8, Surrogates.)
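The arithmetic behind a surrogate pair can be sketched as follows. This is not the JDK's own code: the class SurrogateMath and its methods high and low are illustrative names, and the sketch uses the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves). It assumes the input is a valid supplementary code point.

```java
final class SurrogateMath {
    // Upper 10 bits of (codePoint - 0x10000), offset into the range D800..DBFF.
    static char high(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >>> 10));
    }

    // Lower 10 bits of (codePoint - 0x10000), offset into the range DC00..DFFF.
    static char low(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        // For U+1F600 this prints "D83D DE00".
        System.out.printf("%04X %04X%n", (int) high(cp), (int) low(cp));
        // Character.toCodePoint reverses the split.
        System.out.printf("U+%X%n", Character.toCodePoint(high(cp), low(cp)));
    }
}
```

The public methods Character.highSurrogate and Character.lowSurrogate return the same values for supplementary code points, so they can be used to cross-check the hand-written version.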
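Internally, String stores a Unicode string in a char array. Because a char is 16 bits, this amounts to UTF-16 encoding. One char does not always hold one character, however: a character outside the BMP takes two chars.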
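One practical consequence is that length() counts chars, not characters. A minimal sketch of walking a string by code point instead is shown below; the class CodePointIteration and the method countCodePoints are names invented for this example.

```java
final class CodePointIteration {
    // Counts code points by stepping one or two chars at a time.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // 1 inside the BMP, 2 outside it
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 written as a surrogate pair, 'b'
        String s = "a\uD83D\uDE00b";
        System.out.println("length(): " + s.length());
        System.out.println("code points: " + countCodePoints(s));
        // charAt(1) returns only half of the emoji.
        System.out.println("charAt(1) is a high surrogate: "
                + Character.isHighSurrogate(s.charAt(1)));
    }
}
```

The built-in s.codePointCount(0, s.length()) gives the same count; the loop is written out here only to show where the extra char is skipped.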