Java Strings and the String Constant Pool

The examples here were run on JDK 1.8.

1. A basic introduction to String

Start with the documentation comment in the String class source file:
The String class represents character strings. All string literals in Java programs, such as "abc", are implemented as instances of this class.

Strings are constant; their values cannot be changed after they are created. String buffers support mutable strings. Because String objects are immutable they can be shared. For example:

String str = "abc";

is equivalent to:

char data[] = {'a', 'b', 'c'};
String str = new String(data);

Here are some more examples of how strings can be used:

System.out.println("abc");
String cde = "cde";
System.out.println("abc"+cde);
String c = "abc".substring(2, 3);
String d = cde.substring(1, 2);

The class String includes methods for examining individual characters of the sequence, for comparing strings, for searching strings, for extracting substrings, and for creating a copy of a string with all characters translated to uppercase or to lowercase. Case mapping is based on the Unicode Standard version specified by the java.lang.Character class.

The Java language provides special support for the string concatenation operator (+), and for conversion of other objects to strings. String concatenation is implemented through the StringBuilder(or StringBuffer) class and its append method.

String conversions are implemented through the method toString, defined by Object and inherited by all classes in Java. For additional information on string concatenation and conversion, see Gosling, Joy, and Steele, The Java Language Specification.

Unless otherwise noted, passing a null argument to a constructor or method in this class will cause a NullPointerException to be thrown.

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information).

Index values refer to char code units, so a supplementary character uses two positions in a String. The String class provides methods for dealing with Unicode code points (i.e., characters), in addition to those for dealing with Unicode code units (i.e., char values).

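Four key points follow from the documentation comment:

  1. Every string literal in a Java program is an instance of this class.
  2. Because String objects are immutable, they can be shared.
  3. String concatenation (+) is implemented through the StringBuilder (or StringBuffer) class and its append method.
  4. Strings are encoded as UTF-16; supplementary characters outside the Unicode Basic Multilingual Plane are represented by a surrogate pair (two chars).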
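
The fourth point is easy to check by hand. The sketch below is only an illustration: the class name SurrogatePairDemo and the helper fromCodePoint are made up for this example, and it uses nothing beyond the public String and Character APIs.

```java
// Sketch: a supplementary character occupies two chars in a String.
class SurrogatePairDemo {
    // Hypothetical helper: builds a string holding exactly one code point,
    // using the String(int[] codePoints, int offset, int count) constructor.
    static String fromCodePoint(int codePoint) {
        return new String(new int[] {codePoint}, 0, 1);
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x4E2D);   // U+4E2D lies inside the Basic Multilingual Plane
        String supp = fromCodePoint(0x1F600); // U+1F600 lies outside the BMP

        System.out.println(bmp.length());                              // 1
        System.out.println(supp.length());                             // 2
        System.out.println(supp.codePointCount(0, supp.length()));     // 1
        System.out.println(Character.isHighSurrogate(supp.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(supp.charAt(1)));  // true
    }
}
```

length() counts char code units, while codePointCount counts characters, which is why the two disagree for the supplementary character.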

The key fields and constructors of the String class:

public final class String
    implements java.io.Serializable, Comparable<String>, CharSequence {
    /** The value is used for character storage. */
    private final char value[];
    /** Cache the hash code for the string */
    private int hash; // Default to 0
    // Note: using this constructor is unnecessary, because strings are immutable.
    public String() {
        this.value = "".value;
    }

    public String(String original) {
        this.value = original.value;
        this.hash = original.hash;
    }
    // The caller may modify the char array later, and String must stay immutable, so the array is copied
    public String(char value[]) {
        this.value = Arrays.copyOf(value, value.length);
    }
    public String(char value[], int offset, int count) {
        if (offset < 0) {
            throw new StringIndexOutOfBoundsException(offset);
        }
        if (count <= 0) {
            if (count < 0) {
                throw new StringIndexOutOfBoundsException(count);
            }
            if (offset <= value.length) {
                this.value = "".value;
                return;
            }
        }
        // Note: offset or count might be near -1>>>1. (The source is careful here: this form of the check avoids integer overflow.)
        if (offset > value.length - count) {
            throw new StringIndexOutOfBoundsException(offset + count);
        }
        this.value = Arrays.copyOfRange(value, offset, offset+count);
    }
    
    // Converts an int array of code points to a string; later changes to the array do not affect the string
    public String(int[] codePoints, int offset, int count) {
        if (offset < 0) {
            throw new StringIndexOutOfBoundsException(offset);
        }
        if (count <= 0) {
            if (count < 0) {
                throw new StringIndexOutOfBoundsException(count);
            }
            if (offset <= codePoints.length) {
                this.value = "".value;
                return;
            }
        }
        // Note: offset or count might be near -1>>>1.
        if (offset > codePoints.length - count) {
            throw new StringIndexOutOfBoundsException(offset + count);
        }

        final int end = offset + count;

        // Pass 1: Compute precise size of char[]
        int n = count;
        for (int i = offset; i < end; i++) {
            int c = codePoints[i];
            if (Character.isBmpCodePoint(c)) // BMP code point: one char (2 bytes)
                continue;
            else if (Character.isValidCodePoint(c)) // supplementary code point: two chars (4 bytes)
                n++;
            else throw new IllegalArgumentException(Integer.toString(c));
        }

        // Pass 2: Allocate and fill in char[]
        final char[] v = new char[n];

        for (int i = offset, j = 0; i < end; i++, j++) {
            int c = codePoints[i];
            if (Character.isBmpCodePoint(c))
                v[j] = (char)c;
            else
                Character.toSurrogates(c, v, j++);
        }

        this.value = v;
    }
    private static void checkBounds(byte[] bytes, int offset, int length) {
        if (length < 0)
            throw new StringIndexOutOfBoundsException(length);
        if (offset < 0)
            throw new StringIndexOutOfBoundsException(offset);
        if (offset > bytes.length - length)
            throw new StringIndexOutOfBoundsException(offset + length);
    }
    public String(byte bytes[], int offset, int length, String charsetName)
            throws UnsupportedEncodingException {
        if (charsetName == null)
            throw new NullPointerException("charsetName");
        checkBounds(bytes, offset, length);
        this.value = StringCoding.decode(charsetName, bytes, offset, length);
    }
    public String(byte bytes[], int offset, int length, Charset charset) {
        if (charset == null)
            throw new NullPointerException("charset");
        checkBounds(bytes, offset, length);
        this.value =  StringCoding.decode(charset, bytes, offset, length);
    }
    public String(byte bytes[], String charsetName)
            throws UnsupportedEncodingException {
        this(bytes, 0, bytes.length, charsetName);
    }
    public String(byte bytes[], Charset charset) {
        this(bytes, 0, bytes.length, charset);
    }
    public String(byte bytes[], int offset, int length) {
        checkBounds(bytes, offset, length);
        this.value = StringCoding.decode(bytes, offset, length);
    }
    public String(byte bytes[]) {
        this(bytes, 0, bytes.length);
    }
    
    public String(StringBuffer buffer) {
        synchronized(buffer) {
            this.value = Arrays.copyOf(buffer.getValue(), buffer.length());
        }
    }
    
    public String(StringBuilder builder) {
        this.value = Arrays.copyOf(builder.getValue(), builder.length());
    }
    
    /*
    * Package private constructor which shares value array for speed.
    * this constructor is always expected to be called with share==true.
    * a separate constructor is needed because we already have a public
    * String(char[]) constructor that makes a copy of the given char[].
    */
    String(char[] value, boolean share) {
        // assert share : "unshared not supported";
        this.value = value;
    }
}

In Java 8, a string's characters are stored in a char array held in a final field.
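
The copying done by String(char[]) can be observed from the outside. The following is a small sketch, not part of the JDK source; the class name DefensiveCopyDemo and the helper snapshotThenMutate are invented for the illustration.

```java
// Sketch: String(char[]) copies its argument, so later changes to the
// array do not leak into the string.
class DefensiveCopyDemo {
    // Hypothetical helper: builds a String from the array, then overwrites
    // the first element of the array and returns the String.
    static String snapshotThenMutate(char[] data) {
        String s = new String(data); // copies data into a private array
        data[0] = 'X';               // changes only the caller's array
        return s;
    }

    public static void main(String[] args) {
        char[] data = {'a', 'b', 'c'};
        String s = snapshotThenMutate(data);
        System.out.println(s);                // abc
        System.out.println(new String(data)); // Xbc

        // toCharArray hands out a copy as well, so writing to it is harmless.
        char[] out = s.toCharArray();
        out[0] = 'Y';
        System.out.println(s);                // abc
    }
}
```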

2. String immutability

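String immutability means that once a String object has been created, String offers no method that modifies its contents. The design serves two purposes: it lets String objects be shared, and it makes them safe to use as hash table keys.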

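String is not immutable merely because its char array is declared final. The final modifier only guarantees that the array reference cannot be reassigned; the contents of the array it points to can still be changed. What actually makes String immutable is that the source exposes no public method that changes the char array, and every String method that would change the string returns a new String object instead.
For example, string concatenation: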

    public String concat(String str) {
        int otherLen = str.length();
        if (otherLen == 0) {
            return this;
        }
        int len = value.length;
        char buf[] = Arrays.copyOf(value, len + otherLen);
        str.getChars(buf, len);
        return new String(buf, true);
    }

For example, replacing a character:

    public String replace(char oldChar, char newChar) {
        if (oldChar != newChar) {
            int len = value.length;
            int i = -1;
            char[] val = value; /* avoid getfield opcode */

            while (++i < len) {
                if (val[i] == oldChar) {
                    break;
                }
            }
            if (i < len) {
                char buf[] = new char[len];
                for (int j = 0; j < i; j++) {
                    buf[j] = val[j];
                }
                while (i < len) {
                    char c = val[i];
                    buf[i] = (c == oldChar) ? newChar : c;
                    i++;
                }
                return new String(buf, true);
            }
        }
        return this;
    }

For example, trimming whitespace from both ends:

    public String trim() {
        int len = value.length;
        int st = 0;
        char[] val = value;    /* avoid getfield opcode */

        while ((st < len) && (val[st] <= ' ')) {
            st++;
        }
        while ((st < len) && (val[len - 1] <= ' ')) {
            len--;
        }
        // substring also returns a new String object unless it spans the whole string
        return ((st > 0) || (len < value.length)) ? substring(st, len) : this;
    }
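
The three methods above share one pattern: return this when nothing changes, otherwise return a new object and leave the receiver alone. A minimal sketch of that behaviour, using only the documented String API (the class name NewObjectDemo is made up):

```java
// Sketch: "modifying" methods either return the same object (no change
// needed) or a new String; the original is never altered.
class NewObjectDemo {
    public static void main(String[] args) {
        String s = "abc";

        // Nothing to change: the same object comes back.
        System.out.println(s.concat("") == s);         // true
        System.out.println(s.replace('x', 'y') == s);  // true
        System.out.println(s.trim() == s);             // true

        // Something changes: a new object comes back.
        String t = s.concat("d");
        String u = s.replace('a', 'z');
        String v = " abc ".trim();
        System.out.println(t);           // abcd
        System.out.println(u);           // zbc
        System.out.println(v.equals(s)); // true: same contents
        System.out.println(v == s);      // false: different object

        System.out.println(s);           // abc, still untouched
    }
}
```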

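So is there really no way to modify a Java string? Not quite: Java's reflection mechanism can change the value of a String object. (The example relies on the JDK 1.8 layout, in which value is a char[].)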

package test;

import java.lang.reflect.Field;

public class Test {
    public static void main(String[] args) {
        String a = "1+1=3";
        try {
            Field valueField = String.class.getDeclaredField("value");
            valueField.setAccessible(true); // value is private, so make it accessible
            char[] value = (char[])valueField.get(a);
            value[value.length-1] = '2'; // was '3'; change it to '2'
        } catch (NoSuchFieldException | SecurityException | 
                IllegalArgumentException | IllegalAccessException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        System.out.println(a); // prints: 1+1=2
    }
}

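Although a string's value can be changed through reflection, doing so in real code is not recommended. The change is visible through every reference that shares the object, including every use of the same literal, and a hash code cached before the change no longer matches the new contents.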

    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }

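As the source shows, the hash is computed on the first call to hashCode and cached in the hash field; as long as the cached value is non-zero, later calls return it instead of recomputing.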
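
To make the recurrence in hashCode concrete, here is a small sketch that recomputes the hash by hand and compares it with the library result. The class HashDemo and the method manualHash are illustrative names, not JDK code.

```java
// Sketch: recompute String.hashCode() with the same 31 * h + c recurrence.
class HashDemo {
    // Hypothetical helper mirroring the loop in String.hashCode().
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // int overflow wraps around, as in the JDK
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "1+1=3";
        System.out.println(manualHash(s) == s.hashCode()); // true
        System.out.println("".hashCode());                 // 0
    }
}
```

Because 0 doubles as the "not computed yet" marker, a non-empty string whose hash happens to be 0 is rehashed on every call in this implementation.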

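3. Compile-time optimization of string concatenation

When string constants are concatenated directly, as in "hello"+"world", the Java compiler folds the expression into "helloworld" at compile time. Indirect concatenation (involving a String reference) is not folded.
So: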

String s1 = "ab";
String s2 = "a" + "b";
System.out.println(s1 == s2); // true

Here s1 and s2 both refer to the "ab" in the string constant pool.
And:

String s1 = "ab";
String a = "a";
String s2 = a + "b";
System.out.println(s1 == s2); // false
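
The false result follows from how the concatenation is carried out at run time. On JDK 1.8, javac compiles a + "b" into roughly the StringBuilder chain sketched below (newer JDKs use a different mechanism, but the result is still a new object). The class name ConcatDemo and the method concatAtRuntime are made up for this illustration.

```java
// Sketch: what `a + "b"` amounts to when `a` is not a compile-time constant.
class ConcatDemo {
    // Hypothetical helper spelling out the StringBuilder calls by hand.
    static String concatAtRuntime(String a) {
        return new StringBuilder().append(a).append("b").toString();
    }

    public static void main(String[] args) {
        String s1 = "ab";                  // reference to the pooled literal
        String s2 = concatAtRuntime("a");  // new String object built at run time

        System.out.println(s1.equals(s2));     // true: same contents
        System.out.println(s1 == s2);          // false: different objects
        System.out.println(s1 == s2.intern()); // true: intern returns the pooled "ab"
    }
}
```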

However, if the String variable is declared final, the compiler replaces every use of that variable with its actual value at compile time.
So:

String s1 = "ab";
final String a = "a";
String s2 = a + "b"; // equivalent to "a" + "b"
System.out.println(s1 == s2); // true

A final variable is folded at compile time only if it is initialized with a constant expression in its declaration.
For example:

final String a;
a = "a";

Here the variable is not initialized in its declaration, so it is not folded at compile time.
Another example:

final String a = null == null ? "a":null;

Here the initializer is not a constant expression, so the variable is not folded at compile time.
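
The two non-folded cases can be put side by side with the folded one and compared against the pooled literal. This is only a sketch; the class and method names are invented for the comparison.

```java
// Sketch: which final variables count as compile-time constants.
class FinalConstantDemo {
    // final, but assigned after the declaration: not a constant variable.
    static boolean declaredThenAssigned() {
        String s1 = "ab";
        final String a;
        a = "a";
        String s2 = a + "b"; // concatenated at run time
        return s1 == s2;
    }

    // final, but the initializer contains null, so it is not a constant expression.
    static boolean nonConstantInitializer() {
        String s1 = "ab";
        final String a = null == null ? "a" : null;
        String s2 = a + "b"; // concatenated at run time
        return s1 == s2;
    }

    // final and initialized with a constant expression: folded by the compiler.
    static boolean constantInitializer() {
        String s1 = "ab";
        final String a = "a";
        String s2 = a + "b"; // compiled as the literal "ab"
        return s1 == s2;
    }

    public static void main(String[] args) {
        System.out.println(declaredThenAssigned());   // false
        System.out.println(nonConstantInitializer()); // false
        System.out.println(constantInitializer());    // true
    }
}
```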

4. The string constant pool

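The string constant pool exists so that strings can be shared, which saves memory. Some wrapper classes for primitive types, such as Integer, also use a caching technique, but those caches are implemented directly in Java source, whereas the string constant pool is implemented inside the JVM in native code (C++ in HotSpot). Underneath, the pool is a hash table; in JDK 1.8 it can be thought of as a HashMap that does not resize automatically.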
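
For comparison, the Integer cache mentioned above can be observed through Integer.valueOf. This is a minimal sketch (the class name IntegerCacheDemo is made up); only the range -128 to 127 is guaranteed to be cached, so no result is claimed for the comparison outside that range.

```java
// Sketch: the Integer cache, a pool implemented in Java source.
class IntegerCacheDemo {
    public static void main(String[] args) {
        Integer a = Integer.valueOf(127);
        Integer b = Integer.valueOf(127);
        System.out.println(a == b);      // true: values in -128..127 are always cached

        Integer c = Integer.valueOf(128);
        Integer d = Integer.valueOf(128);
        System.out.println(c.equals(d)); // true: equal values
        // Outside the guaranteed range the identity result depends on JVM
        // settings; with defaults the two objects are usually distinct.
        System.out.println(c == d);
    }
}
```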

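4.1 The intern() method

A string can enter the string constant pool in two ways:

  1. String literals, after compile-time optimization, are added to the pool automatically at run time.
  2. At run time, the intern method adds a string to the pool.

The intern method in the String class: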

    /**
     * Returns a canonical representation for the string object.
     *
     * A pool of strings, initially empty, is maintained privately by the
     * class {@code String}.
     *
     * When the intern method is invoked, if the pool already contains a
     * string equal to this {@code String} object as determined by
     * the {@link #equals(Object)} method, then the string from the pool is
     * returned. Otherwise, this {@code String} object is added to the
     * pool and a reference to this {@code String} object is returned.
     *
     * It follows that for any two strings {@code s} and {@code t},
     * {@code s.intern() == t.intern()} is {@code true}
     * if and only if {@code s.equals(t)} is {@code true}.
     *
     * All literal strings and string-valued constant expressions are
     * interned. String literals are defined in section 3.10.5 of the
     * The Java™ Language Specification.
     *
     * @return  a string that has the same contents as this string, but is
     *          guaranteed to be from a pool of unique strings.
     */
    public native String intern();

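As shown, intern is a native method. According to the comment, if the pool already contains a string equal to this one, that pooled string is returned; otherwise this string is added to the pool and a reference to it is returned.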
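
The contract stated in the comment, s.intern() == t.intern() exactly when s.equals(t), can be checked directly. A small sketch follows; the class InternContractDemo and the helper fresh are illustrative names.

```java
// Sketch: equal strings intern to the same object, unequal ones do not.
class InternContractDemo {
    // Hypothetical helper: returns a distinct heap object with the given contents.
    static String fresh(String text) {
        return new String(text.toCharArray());
    }

    public static void main(String[] args) {
        String s = fresh("pool-demo");
        String t = fresh("pool-demo");

        System.out.println(s == t);                   // false: two separate objects
        System.out.println(s.equals(t));              // true: same contents
        System.out.println(s.intern() == t.intern()); // true: one canonical object

        String other = fresh("something-else");
        System.out.println(s.intern() == other.intern()); // false: contents differ
    }
}
```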

A piece of code involving intern:

String s1 = new String("1");
s1.intern();
String s2 = "1";
System.out.println(s1 == s2); // false

String s3 = new String("1") + new String("1");
s3.intern();
String s4 = "11";
System.out.println(s3 == s4); // true

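On JDK 1.8 this code prints false true. Run on JDK 1.6, it prints false false.
The reason is a change in how the JVM implements the string constant pool. In JDK 1.6 the pool lived in the permanent generation rather than the heap, and intern added a String object to the pool (a copy of the string, created in the permanent generation). In JDK 1.8 the pool lives in the heap, and intern no longer adds a String object; it records a reference to the String object already on the heap.
For the details, see the Meituan technical article 深入解析String#intern (an in-depth analysis of String#intern).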
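
The true result for s3 == s4 depends on "11" not yet being in the pool when s3.intern() runs. If the literal is evaluated first, the pool already holds a different object and the comparison turns false. The sketch below shows that variant; the class and method names are made up for the illustration.

```java
// Sketch: the outcome of intern depends on what the pool already contains.
class InternOrderDemo {
    // The literal is evaluated before intern is called.
    static boolean literalFirst() {
        String s4 = "11";                              // pooled first
        String s3 = new String("1") + new String("1"); // new heap object with contents "11"
        s3.intern();                                   // pool already has "11"; s3 is not added
        return s3 == s4;
    }

    // Whatever the order, the value returned by intern is the pooled object.
    static boolean internResultMatchesLiteral() {
        String s3 = new String("1") + new String("1");
        return s3.intern() == "11";
    }

    public static void main(String[] args) {
        System.out.println(literalFirst());               // false
        System.out.println(internResultMatchesLiteral()); // true
    }
}
```

Comparing the return value of intern, rather than the receiver, gives an answer that does not depend on evaluation order.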

For the first route, "after compile-time optimization" means that String s = "a"+"a" compiles to the equivalent of String s = "aa". Only "aa" is placed in the pool at run time; "a" is not.
This can be verified with intern:

String s = "a"+"a";
String s1 = new String(s);
System.out.println(s1 == s1.intern()); // false
String s2 = s.substring(1);
System.out.println(s2 == s2.intern()); // true

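The output is false true. The false shows that "aa" was already in the string constant pool before s1.intern() ran, so s1 was not added; s1.intern() returned the pooled string's reference, which is naturally not s1. The true shows that "a" was not in the pool before s2.intern() ran (assuming nothing else in the JVM had interned "a" yet); the call added s2, the reference to "a", to the pool and returned s2, so the comparison is between the same reference and is naturally true.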

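4.2 Where the string constant pool lives

In JDK 1.8, the string constant pool is located in the heap.
The following code verifies this.
Before running it, set the VM argument -Xmx2m, which caps the heap at 2 MB, so that the result appears quickly.
The steps are: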

  1. Right-click => Run As => Run Configurations
  2. Switch to the second tab, (x)= Arguments, and enter -Xmx2m under VM arguments

The settings dialog looks like this:

[Figure: image.png]

Verification code:

package test;

import java.util.ArrayList;
import java.util.List;

public class Test {
    public static void main(String[] args) {
        List<String> list = new ArrayList<String>(); // hold references so the strings cannot be garbage collected
        int i=0;
        while(true){
            list.add(String.valueOf(i++).intern()); // keep adding strings to the string constant pool
        }
    }
}

Result:

[Figure: image.png]

The result is a heap space overflow, which indicates that the string constant pool is in the heap.

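Appendix: the final keyword

  1. final + field (or local variable)
    If the variable has a primitive type, it becomes a constant whose value cannot be changed; if it has a reference type, the reference cannot be reassigned.
  2. final + method
    The method becomes a "final method" and cannot be overridden by subclasses.
  3. final + class
    The class becomes a "final class" and cannot be extended. String and System, for example, are final classes in Java.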
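
The first item is the one that matters for String: a final reference fixes which object is referred to, not what that object contains. A minimal sketch (the class FinalDemo and the method mutateThroughFinalReference are illustrative names):

```java
// Sketch: final freezes the reference, not the contents of the array.
class FinalDemo {
    // Hypothetical helper: writes through a final array reference.
    static char[] mutateThroughFinalReference() {
        final char[] data = {'a', 'b', 'c'};
        data[0] = 'X';          // allowed: changes the contents of the array
        // data = new char[3];  // would not compile: the reference itself is final
        return data;
    }

    public static void main(String[] args) {
        System.out.println(new String(mutateThroughFinalReference())); // Xbc
    }
}
```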
