How String Stores Its Data Internally, and Unicode

This article reads the source of the String class to work out how Java stores a string internally.

Private fields of the String class

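In the source quoted here, String keeps its characters in a char[]. This is the char[]-based implementation used up through JDK 8; JDK 9 and later store a byte[] plus a coder flag instead.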

/** The value is used for character storage. */
private final char value[];
/** Cache the hash code for the string */
private int hash; // Default to 0

A constructor of the String class

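To see what the value array actually holds, it helps to look at a representative constructor. The one below takes int[] codePoints as its key parameter; the other two parameters, offset and count, only select which slice of that array to use.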

    public String(int[] codePoints, int offset, int count) {
        if (offset < 0) {
            throw new StringIndexOutOfBoundsException(offset);
        }
        if (count <= 0) {
            if (count < 0) {
                throw new StringIndexOutOfBoundsException(count);
            }
            if (offset <= codePoints.length) {
                this.value = "".value;
                return;
            }
        }
        // Note: offset or count might be near -1>>>1.
        if (offset > codePoints.length - count) {
            throw new StringIndexOutOfBoundsException(offset + count);
        }

        final int end = offset + count;

        // Pass 1: Compute precise size of char[]
        int n = count;
        for (int i = offset; i < end; i++) {
            int c = codePoints[i];
            if (Character.isBmpCodePoint(c))
                continue;
            else if (Character.isValidCodePoint(c))
                n++;
            else throw new IllegalArgumentException(Integer.toString(c));
        }

        // Pass 2: Allocate and fill in char[]
        final char[] v = new char[n];

        for (int i = offset, j = 0; i < end; i++, j++) {
            int c = codePoints[i];
            if (Character.isBmpCodePoint(c))
                v[j] = (char)c;
            else
                Character.toSurrogates(c, v, j++);
        }

        this.value = v;
    }
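
To make the constructor's effect concrete, here is a minimal sketch that calls it with one code point inside the BMP and one outside. The class name CodePointConstructorDemo and the helper fromCodePoints are hypothetical names chosen for illustration; only the String methods come from the JDK.

```java
// Hypothetical demo class; only the String methods come from the JDK.
final class CodePointConstructorDemo {

    // Builds a String from whole code points using String(int[], int, int).
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        // U+0041 ('A') is inside the BMP; U+1F600 lies outside it.
        String s = fromCodePoints(0x41, 0x1F600);
        System.out.println("chars: " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
    }
}
```

With these two inputs, length() reports three chars while codePointCount reports two code points, which matches the n++ in Pass 1 of the constructor.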

What the codePoints parameter is

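Understanding the constructor above requires knowing what a code point is. A code point is the number that Unicode assigns to a character, a value in the range U+0000 to U+10FFFF. Not every code point fits in 16 bits (a Java char is 16 bits wide), so the codePoints parameter is an int array, and the constructor has to convert each entry before storing it in char value[].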
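
A small sketch of why a plain cast is not enough: casting an int to char keeps only the low 16 bits, so a code point above U+FFFF would be silently corrupted. The class CharWidthDemo and its two helpers are hypothetical names used only for this illustration.

```java
// Hypothetical demo class showing why code points are passed as int.
final class CharWidthDemo {

    // Casting to char keeps only the low 16 bits of the int.
    static char truncate(int codePoint) {
        return (char) codePoint;
    }

    // Number of char slots the code point occupies in a char[] (1 or 2).
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int inBmp = 0x4E2D;     // fits in one char
        int outside = 0x1F600;  // does not fit in one char
        System.out.printf("%X -> %X%n", inBmp, (int) truncate(inBmp));
        System.out.printf("%X -> %X%n", outside, (int) truncate(outside));
        System.out.println(charsNeeded(inBmp) + " vs " + charsNeeded(outside));
    }
}
```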

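Unicode:
Unicode (known in Chinese as 统一码, 万国码, or 单一码) is an industry standard in computing that covers a character set, encoding schemes, and more. It was created to overcome the limits of traditional character encodings: it gives every character in every language a single, unique binary code, so that text can be converted and processed across languages and platforms. Development began in 1990 and the standard was formally published in 1994 (Wikipedia).
Put simply, Unicode is a character encoding meant to work worldwide that assigns each character a unique number. That number is written as U+ followed by a group of hexadecimal digits.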
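
As an illustration of the U+ notation, the following sketch converts between an int code point and its written form, padding to at least four hexadecimal digits as is conventional. CodePointNotation, toNotation, and parse are hypothetical names, not JDK APIs.

```java
// Hypothetical helper for the U+XXXX notation; not part of the JDK.
final class CodePointNotation {

    // Formats a code point as "U+" plus at least four uppercase hex digits.
    static String toNotation(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException(Integer.toString(codePoint));
        }
        return String.format("U+%04X", codePoint);
    }

    // Parses a string such as "U+4E2D" back into its numeric value.
    static int parse(String notation) {
        if (!notation.startsWith("U+")) {
            throw new IllegalArgumentException(notation);
        }
        return Integer.parseInt(notation.substring(2), 16);
    }

    public static void main(String[] args) {
        System.out.println(toNotation(0x41));
        System.out.println(toNotation(0x1F600));
        System.out.println(parse("U+4E2D"));
    }
}
```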

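BMP (Basic Multilingual Plane):
For this article it is enough to know that the BMP is a range of code points, U+0000 to U+FFFF. A code point inside the BMP can be written with four hexadecimal digits (16 bits); a code point outside the BMP needs more than four hexadecimal digits.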
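
One way to picture the BMP boundary: Unicode divides its code space into planes of 65,536 code points each, and the plane number is simply the code point shifted right by 16 bits. The sketch below assumes a hypothetical helper planeOf; Character.isBmpCodePoint and Character.isValidCodePoint are the real JDK methods used by the constructor.

```java
// Hypothetical demo class; planeOf is an illustrative helper, not a JDK method.
final class PlaneDemo {

    // Returns the plane number (0 to 16); plane 0 is the BMP.
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException(Integer.toString(codePoint));
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xFFFF, 0x10000, 0x1F600, 0x10FFFF };
        for (int cp : samples) {
            System.out.printf("U+%04X plane=%d bmp=%b%n",
                    cp, planeOf(cp), Character.isBmpCodePoint(cp));
        }
    }
}
```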

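Converting codePoints to char[]:
The core of the conversion is shown below. The key step is testing whether codePoints[i] lies in the BMP. If it does, it is stored as a single char; otherwise Character.toSurrogates(c, v, j++) stores it as two chars.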

        // Pass 2: Allocate and fill in char[]
        final char[] v = new char[n];

        for (int i = offset, j = 0; i < end; i++, j++) {
            int c = codePoints[i];
            if (Character.isBmpCodePoint(c))
                v[j] = (char)c;
            else
                Character.toSurrogates(c, v, j++);
        }

        this.value = v;

The call Character.toSurrogates(c, v, j++) writes both chars of the pair. Because the pair takes two slots in v, the call increments j once more on top of the j++ in the loop header. The method itself, in java.lang.Character, is:

    static void toSurrogates(int codePoint, char[] dst, int index) {
        // We write elements "backwards" to guarantee all-or-nothing
        dst[index+1] = lowSurrogate(codePoint);
        dst[index] = highSurrogate(codePoint);
    }
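
The arithmetic behind the two surrogate values can be sketched by hand using the standard UTF-16 rule: subtract 0x10000, then put the top 10 bits into the high surrogate and the bottom 10 bits into the low surrogate. SurrogateEncodeDemo and encode are hypothetical names; the sketch is an illustration of the rule, not a copy of the JDK implementation.

```java
// Hypothetical demo class: splits a supplementary code point into a surrogate pair.
final class SurrogateEncodeDemo {

    // Returns { high surrogate, low surrogate } for a code point above U+FFFF.
    static char[] encode(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException(Integer.toString(codePoint));
        }
        int offset = codePoint - 0x10000;               // a 20-bit value
        char high = (char) (0xD800 + (offset >>> 10));  // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1F600);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
        // Compare against the public JDK methods.
        System.out.println(pair[0] == Character.highSurrogate(0x1F600));
        System.out.println(pair[1] == Character.lowSurrogate(0x1F600));
    }
}
```

For U+1F600 the offset is 0xF600, which gives the pair D83D DE00.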

Surrogate Pair. A representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit, and the second is a low-surrogate code unit. (See definition D75 in Section 3.8, Surrogates.)

High-Surrogate Code Unit. A 16-bit code unit in the range D800₁₆ to DBFF₁₆, used in UTF-16 as the leading code unit of a surrogate pair. Also known as a leading surrogate. (See definition D72 in Section 3.8, Surrogates.)

Low-Surrogate Code Unit. A 16-bit code unit in the range DC00₁₆ to DFFF₁₆, used in UTF-16 as the trailing code unit of a surrogate pair. Also known as a trailing surrogate. (See definition D74 in Section 3.8, Surrogates.)
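
Going the other way, the ranges in these definitions are enough to recognize a surrogate pair and recombine it into one code point. The sketch below uses the standard UTF-16 rule; SurrogateDecodeDemo and decode are hypothetical names, while Character.isHighSurrogate, Character.isLowSurrogate, and Character.toCodePoint are existing JDK methods.

```java
// Hypothetical demo class: recombines a surrogate pair into one code point.
final class SurrogateDecodeDemo {

    // Rebuilds the code point from a high (leading) and low (trailing) surrogate.
    static int decode(char high, char low) {
        if (!Character.isHighSurrogate(high) || !Character.isLowSurrogate(low)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        int cp = decode('\uD83D', '\uDE00');
        System.out.printf("U+%04X%n", cp);
        // The JDK offers the same conversion directly.
        System.out.println(cp == Character.toCodePoint('\uD83D', '\uDE00'));
    }
}
```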

Summary

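String stores Unicode text in a char array. Because a char is 16 bits wide, the array is effectively UTF-16 encoded. One char does not always hold one character, though: a character inside the BMP takes one char, and a character outside the BMP takes two chars, a surrogate pair.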
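
A practical consequence is that code walking a string character by character should step by code point rather than by char index. The following is a minimal sketch of that loop; CodePointIterationDemo and codePointsOf are hypothetical names, and the string literal uses escapes so the example does not depend on source file encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: walks a String one code point at a time.
final class CodePointIterationDemo {

    // Collects every code point of s, treating a surrogate pair as one entry.
    static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return result;
    }

    public static void main(String[] args) {
        // 'A', then U+1F600 as a surrogate pair, then U+4E2D.
        String s = "A\uD83D\uDE00\u4E2D";
        System.out.println("chars: " + s.length());
        for (int cp : codePointsOf(s)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```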
