Reading the String Source Code

wiki

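Understanding Java String and intern in depth through decompilation
The Road to Mastery, Basics: Java fundamentals, String topics
I finally figured out everything about String.
A deep dive into String#intern
Please stop using "How many String instances does String s = new String("xyz"); create?" as an interview question
When does the "literal" in Java's new String("literal") enter the string constant pool?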

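Learning goals

- How String is represented in memory;
- How to use intern() and how its implementation differs between JDK versions;
- Basic usage of String;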

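Understanding String from a memory perspective

String str = new String("uranus"); // one String instance on the heap, plus the "uranus" instance in the string pool

String str1 = "uranus"; // the string pool holds the String instance

String str2 = new String("uranus").intern(); // an instance is created on the heap; intern() returns the pooled instance

String str3 = "uranus" + "leon"; // folded at compile time: the pool holds "uranusleon", but not "uranus" or "leon"

String str1 = "uranus";
String str4 = str1 + "leon"; // "uranusleon" is created on the heap at run time and is not in the pool; the pool holds "uranus" (and also "leon" when javac compiles + into StringBuilder.append calls)

String str5 = new String("uranus") + new String("leon"); // "uranus" and "leon" have instances on the heap and in the pool; "uranusleon" exists only on the heap, not in the pool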

new String("uranusleon")

package StringTest;

public class StringBasic {
    public static void main(String[] args)
    {
        String s1 = "uranusleon";
        String s2 = new String("uranusleon");
        String s3 = new String("uranusleon").intern();

        System.out.println(s1 == s2); // false
        System.out.println(s1 == s3); // true
    }
}

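How many objects does new String() create?

If the literal is already in the string pool, it is simply referenced, so only one object (the new heap instance) is created. If the literal is not yet in the pool, it is created first and then referenced, so two objects are created.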

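When string literals enter the string pool

When does the "literal" in new String("literal") enter the string constant pool?
In HotSpot VM, when a class is loaded its string literals enter that class's runtime constant pool; they do not yet enter the global string pool (StringTable).
The ldc instruction triggers lazy resolution:
- It looks up the entry for the given index in the current class's runtime constant pool (in HotSpot VM, ConstantPool + ConstantPoolCache).
- If the entry has not been resolved yet, it resolves it and returns the resolved value.
- For a String constant, if resolution finds that StringTable already holds a reference to a java.lang.String with matching content, that reference is returned directly.
- If StringTable holds no reference to a String instance with matching content, a String object with that content is created on the Java heap, its reference is recorded in StringTable, and that reference is returned.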
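The resolution steps above can be modelled with a small toy class. This is only a sketch to make the control flow concrete, not HotSpot code: LazyLdcModel, its fields and its methods are hypothetical names, and a HashMap stands in for StringTable.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of lazy String-constant resolution (hypothetical, not HotSpot code).
class LazyLdcModel {
    // Stand-in for the global StringTable: content -> canonical String reference.
    private final Map<String, String> stringTable = new HashMap<>();
    // Stand-in for one class's runtime constant pool: literal text per index.
    private final String[] literalTexts;
    // Resolved entries; null means "not resolved yet".
    private final String[] resolved;

    LazyLdcModel(String... literalTexts) {
        this.literalTexts = literalTexts.clone();
        this.resolved = new String[literalTexts.length];
    }

    // Models what ldc does for the String constant at the given index.
    String ldc(int index) {
        if (resolved[index] != null) {
            return resolved[index];            // already resolved
        }
        String text = literalTexts[index];
        String ref = stringTable.get(text);    // matching content already recorded?
        if (ref == null) {
            ref = new String(text);            // create the heap object
            stringTable.put(text, ref);        // record its reference
        }
        resolved[index] = ref;
        return ref;
    }

    int stringTableSize() {
        return stringTable.size();
    }

    public static void main(String[] args) {
        LazyLdcModel model = new LazyLdcModel("uranus", "leon", "uranus");
        System.out.println(model.stringTableSize());       // 0: loading alone records nothing
        System.out.println(model.ldc(0) == model.ldc(2));  // true: same content, same reference
        System.out.println(model.stringTableSize());       // 1: "leon" was never resolved
    }
}
```

The point of the model is that an entry only reaches the table when ldc actually runs for it, which is why "leon" is still missing at the end.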

Experiment code

package StringTest;

public class StringBasic {
    public static void main(String[] args) throws Exception {
        String string = "uranusleon";
    }
}

Compiled bytecode

public static void main(java.lang.String[]) throws java.lang.Exception;
    descriptor: ([Ljava/lang/String;)V
    flags: ACC_PUBLIC, ACC_STATIC
    Code:
      stack=1, locals=2, args_size=1
         0: ldc           #2                  // String uranusleon
         2: astore_1
         3: return
      LineNumberTable:
        line 5: 0
        line 6: 3
      LocalVariableTable:
        Start  Length  Slot  Name   Signature
            0       4     0  args   [Ljava/lang/String;
            3       1     1 string   Ljava/lang/String;
    Exceptions:
      throws java.lang.Exception

The ldc instruction creates an instance for "uranusleon" on the heap and records a reference to it in the string pool.

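Understanding String.intern()

What intern() does;
When to use intern();
A deep dive into String#intern

What intern() does

When intern() is called on a String instance, Java checks whether the string pool already contains a string with the same sequence of Unicode characters. If it does, a reference to that pooled string is returned; if not, the string is added to the pool and a reference to it is returned.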

public static void main(String[] args) {
    String s = new String("1");
    s.intern();
    String s2 = "1";
    System.out.println(s == s2);

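    String s3 = new String("1") + new String("1"); // creates "11" on the heap; the string pool does not contain "11"
    s3.intern(); // JDK 6 creates a new object in the string pool; JDK 7 stores a reference to the heap object in the pool
    String s4 = "11"; // in JDK 6 s4 refers to the pooled object; in JDK 7 s4 refers to the heap object
    System.out.println(s3 == s4); // JDK 6: s3 != s4; JDK 7: s3 == s4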
}

Output

JDK 6: false false
JDK 7: false true

public static void main(String[] args) {
    String s = new String("1");
    String s2 = "1";
    s.intern();
    System.out.println(s == s2);

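    String s3 = new String("1") + new String("1"); // creates "11" on the heap; the string pool does not contain "11"
    String s4 = "11"; // both JDK 6 and JDK 7 create the "11" object for the string pool here
    s3.intern(); // "11" is already in the pool, so nothing is created and no heap reference is stored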
    System.out.println(s3 == s4);
}

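Output

JDK 6: false false
JDK 7: false false

Reasons

JDK 7 moved the string pool from the PermGen area to the Java heap;
When String.intern() runs and no equal string is in the pool yet, JDK 7 stores a reference to the existing heap object in the string pool instead of creating a new object;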
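The JDK 7+ behaviour can be checked without depending on which literals happen to be in the pool. The sketch below builds a string at run time from a random UUID, on the assumption that such content has never been interned before; InternIdentityDemo and freshString are illustrative names only.

```java
import java.util.UUID;

class InternIdentityDemo {
    // Builds a string at run time whose content is, by assumption, not in the pool yet.
    static String freshString() {
        return "uranus-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        String s = freshString();
        // On JDK 7 or later the pool records a reference to s itself.
        System.out.println(s.intern() == s);        // true on JDK 7+

        String copy = new String(s);                // equal content, different object
        System.out.println(copy.intern() == s);     // true: the pool entry is still s
        System.out.println(copy.intern() == copy);  // false
    }
}
```

On JDK 6 the first comparison would be expected to print false, because intern() copied the string into PermGen.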

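When to use intern()

For strings that are likely to be used frequently and whose values cannot be determined at compile time, only at run time, intern() can be used to add them to the string pool.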

import java.util.Random;

public class InternBenchmark {
    static final int MAX = 1000 * 10000;
    static final String[] arr = new String[MAX];

    public static void main(String[] args) throws Exception {
        // Ten distinct values stand in for data that repeats at run time.
        Integer[] DB_DATA = new Integer[10];
        Random random = new Random(10 * 10000);
        for (int i = 0; i < DB_DATA.length; i++) {
            DB_DATA[i] = random.nextInt();
        }
        long t = System.currentTimeMillis();
        for (int i = 0; i < MAX; i++) {
            // Without intern(): every element of arr is a separate String object.
            //arr[i] = new String(String.valueOf(DB_DATA[i % DB_DATA.length]));
            // With intern(): arr ends up referencing only ten pooled strings.
            arr[i] = new String(String.valueOf(DB_DATA[i % DB_DATA.length])).intern();
        }

        System.out.println((System.currentTimeMillis() - t) + "ms");
        System.gc();
    }
}

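Key points for understanding intern()

After String str1 = new String("uranus") + new String("leon") executes and before intern() is called, does the string pool hold a reference to "uranusleon"?

The class's constant pool contains no "uranusleon" literal, so the string pool holds no reference to "uranusleon".
What is the difference between String str1 = new String("uranus") + new String("leon") and String str1 = new String("uranusleon"), and which strings are in the string pool after each is executed on its own?
- After String str1 = new String("uranus") + new String("leon"), the string pool holds references to "uranus" and "leon" but not to "uranusleon".
- After String str1 = new String("uranusleon"), the string pool holds a reference to "uranusleon".
Whether the string pool already contains the corresponding string at the moment intern() is called;

As the discussion above shows, whether the string pool holds a reference to a string mainly depends on whether the literal appears in the constant pool of the class file.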

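How String + is implemented

String + is implemented with StringBuilder (this is what javac generates up to JDK 8; since JDK 9 javac emits an invokedynamic call to StringConcatFactory instead)
String str3 = str1 + str2

If at least one operand of the concatenation is a variable, the concatenation is performed at run time with StringBuilder.append. The resulting value is not known at compile time, so no entry for it is created in the string pool.
String str5 = "uranus" + "leon"

If both operands are literals, the compiler folds the expression into the concatenated string, and "uranusleon" is placed in the string pool.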
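The difference between compile-time folding and run-time concatenation can be observed with reference comparisons. The following is a small sketch; the class and method names are illustrative, and the final-variable case is an addition that follows from the Java Language Specification's rules for constant expressions rather than from the notes above.

```java
class ConcatFoldingDemo {
    static boolean literalsFolded() {
        String folded = "uranus" + "leon";   // constant expression, folded by javac
        return folded == "uranusleon";
    }

    static boolean variableConcatPooled() {
        String part = "uranus";              // not final, so not a constant
        String joined = part + "leon";       // built at run time as a new object
        return joined == "uranusleon";
    }

    static boolean finalVariableFolded() {
        final String part = "uranus";        // final with a constant initializer
        String joined = part + "leon";       // treated as a constant expression
        return joined == "uranusleon";
    }

    public static void main(String[] args) {
        System.out.println(literalsFolded());        // true
        System.out.println(variableConcatPooled());  // false
        System.out.println(finalVariableFolded());   // true
    }
}
```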

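String source code

Constructors

String is backed by a character array, char[] (JDK 8 and earlier; since JDK 9 the backing store is a byte[])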

/** The value is used for character storage. */
    private final char value[];

Constructing a String from a character array

public String(char value[]) {
    this.value = Arrays.copyOf(value, value.length);
}

public String(char value[], int offset, int count) {
        if (offset < 0) {
            throw new StringIndexOutOfBoundsException(offset);
        }
        if (count <= 0) {
            if (count < 0) {
                throw new StringIndexOutOfBoundsException(count);
            }
            if (offset <= value.length) {
                this.value = "".value;
                return;
            }
        }
        // Note: offset or count might be near -1>>>1.
        if (offset > value.length - count) {
            throw new StringIndexOutOfBoundsException(offset + count);
        }
        this.value = Arrays.copyOfRange(value, offset, offset+count);
    }

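Arrays.copyOf and Arrays.copyOfRange copy the contents of the character array into value[], so later changes to the caller's array do not affect the String
A String can also be initialized from just part of a character array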
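A short sketch of what the defensive copy means in practice: the caller's array is modified after both constructors have run, and neither String changes. CharArrayCopyDemo and build are illustrative names.

```java
class CharArrayCopyDemo {
    // Returns {whole, part} after the source array has been modified.
    static String[] build() {
        char[] chars = {'u', 'r', 'a', 'n', 'u', 's'};
        String whole = new String(chars);        // copies all six chars
        String part = new String(chars, 1, 3);   // offset 1, count 3
        chars[0] = 'X';                          // mutate the caller's array afterwards
        chars[1] = 'Y';
        return new String[] {whole, part};
    }

    public static void main(String[] args) {
        String[] result = build();
        System.out.println(result[0]); // uranus
        System.out.println(result[1]); // ran
    }
}
```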

String(char[] value, boolean share) {
        // assert share : "unshared not supported";
        this.value = value;
    }
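The share parameter exists only to distinguish this constructor from String(char[] value);

A String built by this constructor shares a single array with the char[] passed in;

Advantages

Performance: the array reference is assigned directly, so no array contents are copied;

Sharing the array saves memory;

The constructor is package-private (it has no access modifier), so only classes in java.lang can call it. Both the original and the new string keep the value array as a private field that cannot be reached from outside, so sharing is safe for both strings.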

Constructing a String from another String

public String(String original) {
        this.value = original.value;
        this.hash = original.hash;
    }

The value and hash of the original are assigned directly to the new String

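Constructing a String from a byte array

Bytes are the serialized form used for network transfer and storage, so byte[] and String often need to be converted into each other during transfer and storage.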

String(byte bytes[]);
String(byte bytes[], Charset charset);
String(byte bytes[], int offset, int length);
String(byte bytes[], int offset, int length, Charset charset);
String(byte bytes[], int offset, int length, String charsetName);
String(byte bytes[], String charsetName);

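If no charset is given when calling a constructor, the platform default charset (Charset.defaultCharset()) is used; in the JDK 8 implementation, ISO-8859-1 is only the fallback when the default charset is not supported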
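Because the result depends on the charset, it is usually safer to pass one explicitly. The sketch below encodes one CJK character as UTF-8 and decodes the same bytes with two different charsets; the class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;

class BytesCharsetDemo {
    static final String TEXT = "\u4e2d"; // one CJK character, written as an escape

    // Encode and decode with the same charset: the text survives.
    static String roundTripUtf8() {
        byte[] bytes = TEXT.getBytes(StandardCharsets.UTF_8); // 3 bytes: E4 B8 AD
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes as ISO-8859-1: every byte becomes one char.
    static String decodedAsLatin1() {
        byte[] bytes = TEXT.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(roundTripUtf8().equals(TEXT)); // true
        System.out.println(decodedAsLatin1().length());   // 3
    }
}
```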

Constructing a String from StringBuilder and StringBuffer

public String(StringBuffer buffer) {
        synchronized(buffer) {
            this.value = Arrays.copyOf(buffer.getValue(), buffer.length());
        }
    }

public String(StringBuilder builder) {
        this.value = Arrays.copyOf(builder.getValue(), builder.length());
    }

These are rarely used; the toString() method of StringBuilder and StringBuffer is generally used instead.

Comparison methods

public boolean equals(Object anObject);
public boolean contentEquals(CharSequence cs);
public boolean contentEquals(StringBuffer sb);
public boolean equalsIgnoreCase(String anotherString);
public int compareTo(String anotherString);
public int compareToIgnoreCase(String str);
public boolean regionMatches(boolean ignoreCase, int toffset,
            String other, int ooffset, int len); //Tests if two string regions are equal.
public boolean regionMatches(int toffset, String other, int ooffset,int len);

equals() method

public boolean equals(Object anObject) {
        if (this == anObject) {
            return true;
        }
        if (anObject instanceof String) {
            String anotherString = (String)anObject;
            int n = value.length;
            if (n == anotherString.value.length) { // value is private; how can it be accessed directly?
                char v1[] = value;
                char v2[] = anotherString.value;
                int i = 0;
                while (n-- != 0) {
                    if (v1[i] != v2[i])
                        return false;
                    i++;
                }
                return true;
            }
        }
        return false;
    }

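private applies at class level: code inside a class can access the private fields of other instances of the same class;
How the code improves efficiency

Two strings are equal when the references are the same, or when the references differ but the contents are the same

Strategy: run the cheap checks (reference equality, object type, length) first and the expensive one (comparing the character arrays) last.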

contentEquals() method

public boolean contentEquals(CharSequence cs) {
        // Argument is a StringBuffer, StringBuilder
        if (cs instanceof AbstractStringBuilder) {
            if (cs instanceof StringBuffer) {
                synchronized(cs) {
                   return nonSyncContentEquals((AbstractStringBuilder)cs);
                }
            } else {
                return nonSyncContentEquals((AbstractStringBuilder)cs);
            }
        }
        // Argument is a String
        if (cs instanceof String) {
            return equals(cs);
        }
        // Argument is a generic CharSequence
        char v1[] = value;
        int n = v1.length;
        if (n != cs.length()) {
            return false;
        }
        for (int i = 0; i < n; i++) {
            if (v1[i] != cs.charAt(i)) {
                return false;
            }
        }
        return true;
    }

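public boolean contentEquals(StringBuffer sb); simply calls contentEquals(CharSequence cs);
AbstractStringBuilder and String both implement the CharSequence interface; the method checks whether the argument is an AbstractStringBuilder or a String and takes a different path for each;

Core comparison code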

for (int i = 0; i < n; i++) {
            if (v1[i] != v2[i]) {
                return false;
            }
        }
Each character of the two character arrays is compared in turn
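The practical difference between equals() and contentEquals() shows up when the argument is not a String. A minimal sketch, with illustrative class and method names:

```java
class ContentEqualsDemo {
    // equals() requires the argument to be a String.
    static boolean viaEquals(String s, Object other) {
        return s.equals(other);
    }

    // contentEquals() accepts any CharSequence and compares characters only.
    static boolean viaContentEquals(String s, CharSequence other) {
        return s.contentEquals(other);
    }

    public static void main(String[] args) {
        StringBuilder builder = new StringBuilder("uranus");
        StringBuffer buffer = new StringBuffer("uranus");
        System.out.println(viaEquals("uranus", builder));         // false: not a String
        System.out.println(viaContentEquals("uranus", builder));  // true
        System.out.println(viaContentEquals("uranus", buffer));   // true
        System.out.println(viaContentEquals("uranus", "leon"));   // false
    }
}
```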

equalsIgnoreCase(String anotherString)

public boolean equalsIgnoreCase(String anotherString) {
        return (this == anotherString) ? true
                : (anotherString != null)
                && (anotherString.value.length == value.length)
                && regionMatches(true, 0, anotherString, 0, value.length);
    }

compareTo() and compareToIgnoreCase() methods

The core of both is a character-by-character comparison of the character arrays;
compareToIgnoreCase() uses String's nested class CaseInsensitiveComparator;
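The return values follow directly from the character-by-character comparison: compareTo() returns the difference of the first pair of differing chars, or the difference in length when one string is a prefix of the other. A small sketch with illustrative names; for compareToIgnoreCase() only the sign is relied on here.

```java
class CompareToDemo {
    static int compare(String a, String b) {
        return a.compareTo(b);
    }

    static int compareIgnoringCase(String a, String b) {
        return a.compareToIgnoreCase(b);
    }

    public static void main(String[] args) {
        System.out.println(compare("abc", "abd"));    // -1: 'c' - 'd'
        System.out.println(compare("abc", "abcde"));  // -2: length 3 - length 5
        System.out.println(compare("B", "a"));        // -31: 'B' (66) - 'a' (97)
        System.out.println(compareIgnoringCase("ABC", "abc"));   // 0
        System.out.println(compareIgnoringCase("B", "a") > 0);   // true: "b" sorts after "a"
    }
}
```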

hashCode() method

public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }

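Formula used: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], where s[i] is the i-th character and n is the length of the string
Why 31 was chosen as the multiplier
- Why String's hashCode method uses 31 as its multiplier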
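The formula can be checked by recomputing the hash by hand with the same loop as the source above. This is a small sketch with illustrative names; the "Aa"/"BB" pair is added only to show that distinct strings can share a hash value.

```java
class HashCodeDemo {
    // Recomputes the polynomial hash: h = 31 * h + s[i] for each character.
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "uranusleon";
        System.out.println(polynomialHash(s) == s.hashCode()); // true
        // 'A'*31 + 'a' = 65*31 + 97 = 2112 and 'B'*31 + 'B' = 66*31 + 66 = 2112
        System.out.println("Aa".hashCode());  // 2112
        System.out.println("BB".hashCode());  // 2112
    }
}
```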

substring() method

public String substring(int beginIndex, int endIndex);
public String substring(int beginIndex);

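Both call the public String(char value[], int offset, int count) constructor to create a new String instance, unless the requested range covers the whole string, in which case this is returned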

replace methods

public String replace(char oldChar, char newChar);
public String replace(CharSequence target, CharSequence replacement);
public String replaceAll(String regex, String replacement);
public String replaceFirst(String regex, String replacement);

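1) replace takes char or CharSequence arguments, so it supports replacing both characters and strings;
2) replaceAll and replaceFirst take a regex, so they replace by regular expression; for example, replaceAll("\\d", "*") replaces every digit in a string with an asterisk;
replace and replaceAll both replace every occurrence of the given character or string in the source string. To replace only the first occurrence, use replaceFirst(), which is also regex-based but, unlike replaceAll(), replaces only the first match;
If the pattern passed to replaceAll() contains no regex metacharacters, it has the same effect as replace() on strings, and replaceFirst() then behaves like a plain-string replacement of the first occurrence.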
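The difference between literal and regex replacement is easiest to see with a pattern that contains a metacharacter such as ".". A minimal sketch, with illustrative class and method names:

```java
class ReplaceDemo {
    static String literal(String s, String target, String replacement) {
        return s.replace(target, replacement);       // target is plain text
    }

    static String regexAll(String s, String regex, String replacement) {
        return s.replaceAll(regex, replacement);     // regex, every match
    }

    static String regexFirst(String s, String regex, String replacement) {
        return s.replaceFirst(regex, replacement);   // regex, first match only
    }

    public static void main(String[] args) {
        System.out.println(literal("a.b.c", ".", "-"));        // a-b-c
        System.out.println(regexAll("a.b.c", ".", "-"));       // -----  ("." matches any char)
        System.out.println(regexAll("a.b.c", "\\.", "-"));     // a-b-c
        System.out.println(regexAll("a1b22", "\\d", "*"));     // a*b**
        System.out.println(regexFirst("a1b22", "\\d", "*"));   // a*b22
    }
}
```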

codePointAt methods

public int codePointAt(int index);
public int codePointBefore(int index);
public int codePointCount(int beginIndex, int endIndex);
public int offsetByCodePoints(int index, int codePointOffset);

wiki
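- Reversing strings in Java and related character encoding issues
- Class Character
- Character encoding notes: ASCII, Unicode and UTF-8
Code Point: A code point or code position is any of the numerical values that make up the code space.
A String represents Unicode characters in UTF-16. Most symbols need only one char (two bytes), but some need two chars (four bytes). Such a pair is called a surrogate pair: the first char is the high surrogate and the second is the low surrogate.
High surrogates range from \uD800 to \uDBFF and low surrogates from \uDC00 to \uDFFF. In the Unicode code table, \uD800-\uDFFF is reserved for surrogate pairs and does not represent actual symbols.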

Unicode range D800–DFFF is used for surrogate pairs in UTF-16 (used by Windows) and CESU-8 transformation formats, allowing these encodings to represent the supplementary plane code points, whose values are too large to fit in 16 bits. A pair of 16-bit code points — the first from the high surrogate area (D800–DBFF), and the second from the low surrogate area (DC00–DFFF) — are combined to form a 32-bit code point from the supplementary planes. Unicode and ISO/IEC 10646 do not assign actual characters to any of the code points in the D800–DFFF range — these code points only have meaning when used in surrogate pairs. Hence an individual code point from a surrogate pair does not represent a character, is invalid unless used in a surrogate pair, and is
unconditionally invalid in UTF-32 and UTF-8 (if strict conformance to the standard is applied).

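public int codePointAt(int index); returns the code point of the character at the given index. If the char at that index is a high surrogate and the char at the next index is a low surrogate, it returns the code point of the surrogate pair.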

public static void main(String[] args)
{
    int uni = 0x1F691;
    String str = new String(Character.toChars(uni));
    System.out.println(str.codePointAt(0)); // prints 128657
}

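The symbol's Unicode code point is U+1F691, which is 128657 in decimal; in a String it takes two chars (a high surrogate and a low surrogate);
codePointAt() reads the code point of the surrogate pair formed by the two chars.

public int codePointCount(int beginIndex, int endIndex); counts the code points in the String's char[] from beginIndex to endIndex-1 (a surrogate pair counts as one, and an unpaired surrogate also counts as one);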

public static void main(String[] args)
{
    int uni = 0x1F691;
    String str = new String(Character.toChars(uni));
    System.out.println(str.codePointCount(0, 2)); // prints 1
    System.out.println(str.length()); // prints 2
}

String.length() returns the length of the character array; U+1F691 needs two chars, so str.length() = 2;
codePointCount returns the number of code points; the string contains a single symbol, so str.codePointCount(0,2) = 1;

public int codePointBefore(int index): if the char at index-2 is a high surrogate and the char at index-1 is a low surrogate, it returns the code point of the surrogate pair formed by index-2 and index-1; otherwise it returns only the code point at index-1.

public static void main(String[] args)
{
    int uni = 0x1F691;
    String str = new String(Character.toChars(uni)) + "unicode";
    System.out.println(str.codePointBefore(2)); // prints 128657
}

value[0] and value[1] form a surrogate pair whose code point is 128657

public int offsetByCodePoints(int index, int codePointOffset) returns the index within the String that is offset from the given index by codePointOffset code points

public static void main(String[] args)
{
    int uni = 0x1F691;
    String str = "uni" + new String(Character.toChars(uni)) + "code";
    System.out.println(str.offsetByCodePoints(0,4)); // prints 5
}

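The first four code points are u, n, i and U+1F691. The last one takes two chars but is a single code point, so moving 4 code points moves 5 chars and the result is 5.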
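The same methods can be combined to walk a string one code point at a time instead of one char at a time. The sketch below is one way to do it, using Character.charCount to decide how far to advance; CodePointWalkDemo and codePointsOf are illustrative names.

```java
class CodePointWalkDemo {
    // Collects the code points of s, advancing 1 or 2 chars per step.
    static int[] codePointsOf(String s) {
        int[] result = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result[n++] = cp;
            i += Character.charCount(cp); // 2 for a surrogate pair, otherwise 1
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "uni" + new String(Character.toChars(0x1F691)) + "code";
        int[] cps = codePointsOf(str);
        System.out.println(str.length());        // 9 chars
        System.out.println(cps.length);          // 8 code points
        System.out.println(cps[3] == 0x1F691);   // true

        // StringBuilder.reverse() keeps surrogate pairs in the correct order.
        String reversed = new StringBuilder(str).reverse().toString();
        System.out.println(reversed.codePointAt(4) == 0x1F691); // true
    }
}
```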

concat() method

public String concat(String str) {
        int otherLen = str.length();
        if (otherLen == 0) {
            return this;
        }
        int len = value.length;
        char buf[] = Arrays.copyOf(value, len + otherLen);
        str.getChars(buf, len);
        return new String(buf, true);
    }

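First a new character array buf[] is created and all characters of both strings are placed into it;
Then a new string is created whose value[] points directly at buf[];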
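A short sketch of the two paths through concat(): the early return of this for an empty argument, and the new string for a non-empty one. The class and method names are illustrative.

```java
class ConcatDemo {
    // True when concat() hands back the receiver itself.
    static boolean returnsSameInstance(String s, String suffix) {
        return s.concat(suffix) == s;
    }

    static String join(String a, String b) {
        return a.concat(b);
    }

    public static void main(String[] args) {
        String s = "uranus";
        System.out.println(returnsSameInstance(s, ""));      // true: otherLen == 0 returns this
        System.out.println(returnsSameInstance(s, "leon"));  // false: a new String is built
        System.out.println(join(s, "leon"));                 // uranusleon
        System.out.println(s);                               // uranus (the receiver is unchanged)
    }
}
```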

indexOf()

public int indexOf(int ch);
public int indexOf(int ch, int fromIndex);
private int indexOfSupplementary(int ch, int fromIndex);//indexOf(int ch, int fromIndex)调用
public int indexOf(String str);
public int indexOf(String str, int fromIndex);
static int indexOf(char[] source, int sourceOffset, int sourceCount,
            char[] target, int targetOffset, int targetCount,
            int fromIndex);

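indexOf() finds the position of the first occurrence of a character (or substring) in the string;
Let k be the result of indexOf(int ch) or indexOf(int ch, int fromIndex). If ch > 0xFFFF, then codePointAt(k) == ch; otherwise charAt(k) == ch.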
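Both cases of the rule above can be exercised with the string used in the code point examples. This is a minimal sketch; IndexOfDemo and sample are illustrative names.

```java
class IndexOfDemo {
    // "uni" + U+1F691 + "code": the supplementary character occupies chars 3 and 4.
    static String sample() {
        return "uni" + new String(Character.toChars(0x1F691)) + "code";
    }

    public static void main(String[] args) {
        String str = sample();
        int k = str.indexOf(0x1F691);                         // ch > 0xFFFF
        System.out.println(k);                                // 3
        System.out.println(str.codePointAt(k) == 0x1F691);    // true

        int c = str.indexOf('c');                             // ch <= 0xFFFF
        System.out.println(c);                                // 5
        System.out.println(str.charAt(c) == 'c');             // true

        System.out.println(str.indexOf("code"));              // 5
        System.out.println(str.indexOf('u', 1));              // -1: the only 'u' is at index 0
    }
}
```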

matches() method

wiki
- How to use String.matches()
- Regex doesn't work in String.matches()

String.matches() tests whether the entire string matches the regular expression; it does not match part of the string
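The whole-string behaviour is a common source of surprises, so a short comparison with a partial search may help. The sketch uses java.util.regex.Matcher.find() for the partial case; the class and method names are illustrative.

```java
import java.util.regex.Pattern;

class MatchesDemo {
    // Whole-string match, as String.matches() does.
    static boolean matchesWhole(String s, String regex) {
        return s.matches(regex);
    }

    // Partial match: true if the regex matches anywhere in s.
    static boolean containsMatch(String s, String regex) {
        return Pattern.compile(regex).matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("abc123", "\\d+"));        // false: "abc" is not digits
        System.out.println(matchesWhole("123", "\\d+"));           // true
        System.out.println(matchesWhole("abc123", "[a-z]+\\d+"));  // true: covers the whole string
        System.out.println(containsMatch("abc123", "\\d+"));       // true: "123" is found inside
    }
}
```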
