java.lang.String内部结构的变化

原文:http://java-performance.info/changes-to-string-java-1-7-0_06/

作者:Mikhail Vorontsov

IMPORTANT: Java 7 update 17 still has no changes in the subject of this article.
一直到jdk7u17,本文所说的仍然适用。
Sharing an underlying char[]
共用底层的char[]
An original String implementation has 4 non static field: 
char[] value with string characters, 
int offset and int count with an index of the first character to use from value and a number of characters to use and 
int hash with a cached value of a String hash code. 
As you can see, in a very large number of cases a String will have offset = 0 and count = value.length. 
The only exception to this rule were the strings created via String.substring calls and all API calls using this method internally (like Pattern.split).
String.substring created a String, which shared an internal char[] value with an original String, which allowed you:
To save some memory by sharing character data
To run String.substring in a constant time ( O(1) )
At the same time such feature was a source of a possible memory leak: 
if you extract a tiny substring from an original huge string and discard that original string, 
you will still have a live reference to the underlying huge char[] value taken from an original String. 
The only way to avoid it was to call a new String( String ) constructor on such string – 
it made a copy of a required section of underlying char[], thus unlinking your shorter string from its longer “parent”.
From Java 1.7.0_06 (as well as in early versions of Java 8) offset and count fields were removed from a String. 
This means that you can’t share a part of an underlying char[] value anymore. 
Now you can forget about a memory leak described above and never ever use new String(String) constructor anymore. 
As a drawback, you now have to remember that String.substring has now a linear complexity instead of a constant one.
原来String的实现有4个非static的字段:
char[]字符数组
int offset(数组里面第一个有效字符的偏移量)+int count(有效字符的个数)
int hash:缓存了hash值
实际上,很大一部分的String的offset是0,count是字符数组的长度,唯一例外的是,调用String.substring或者其他内部调用了这个方法的api生成出来的字符串(比如:Pattern.split)。
String.substring会创建一个String,新创建出来的String会和之前的String底层会共享同一个char[],这就允许:
(1)通过共享字符数组来节省内存
(2)String.substring的时间复杂度是O(1)
但是,这也带来了潜在的内存泄露的风险:
如果你从一个非常巨大的字符串里面截取一个很小的子串,就算把原始的大串丢弃掉,你仍然持有原始字符串的char[]数组的活的引用。
唯一避免这个问题的方式是在截取出来的字符串上调用new String( String ),这会对底层char[]的有效部分做拷贝,就把子串和原始的长串分离了。
从1.7.0_06(一直到早期的jdk8),String内部都移除了offset和count,这就意味着,你无法再像以前和一样在多个字符串之间共享char[]数组。
现在你可以忘掉刚才描述的内存泄露这件事了,也不用再使用new String(String)这个构造函数了。
但是,很遗憾,现在String.substring的时间复杂度变成线性的而不再是常量了。


Changes to hashing logic
hash算法逻辑的改变
There is another change introduced to String class in the same update: a new hashing algorithm. 
Oracle suggests that a new algorithm gives a better distribution of hash codes, 
which should improve performance of several hash-based collections: HashMap, Hashtable, HashSet, LinkedHashMap, LinkedHashSet, WeakHashMap and ConcurrentHashMap.
Unlike changes from the first part of this article, these changes are experimental and turned off by default.
String里面还有另一个改变:一个新的hash算法。
oracle说一个新的hash算法会提供更好的hashcode分布,这会改善许多基于hashcode的集合的性能,比如:HashMap, Hashtable, HashSet, LinkedHashMap, LinkedHashSet, WeakHashMap and ConcurrentHashMap。
不像上面说的那个改变,这个改变还处在试验阶段,默认是关闭的。
As you may guess, these changes are only for String keys. If you want to turn them on, 
you’ll have to set a jdk.map.althashing.threshold system property to a non-negative value (it is equal to -1 by default). 
This value will be a collection size threshold, after which a new hashing method will be used. 
A small remark here: hashing method will be changed on rehashing only (when there is no more free space). 
So, if a collection was rehashed last time at size = 160 and jdk.map.althashing.threshold = 200, 
then a method will only be changed when your collection will grow to size of 320 (approximately).
正如你猜测的那样,这些改变只是对String当键值的有影响。如果你想启用这个特性,你需要设置jdk.map.althashing.threshold这个系统属性的值为一个非负数(默认是-1)
这个值代表了一个集合大小的threshold,超过这个值,就会使用新的hash算法。
需要注意的一点,只有当re-hash的时候,新的hash算法才会起作用。
因此,如果集合上次rehash的时候的大小是160,jdk.map.althashing.threshold的值是200,那么,只有当集合的大小大概在320的时候,才会使用新的hash算法。


String now has a hash32() method, which result is cached in int hash32 field. 
The biggest difference of this method is that the result of hash32() on the same string may be different on various JVM runs 
(actually, it will be different in most cases, because it uses a single System.currentTimeMillis() and two System.nanoTime calls for seed initialization). 
As a result, iteration order on some of your collections will be different each time you run your program.
现在的String有一个hash32()方法,它的值被缓存在hash32这个字段里面(jdk8不是这样的)
这个方法最大的不同是对同一个字符串,在不同的jvm上的hash32()的结果可能会不一样
(实际上大多数情况下都不一样,因为它使用了一个System.currentTimeMillis()和两个System.nanoTime来初始化种子)
结果,每次执行程序的时候,对同一个集合的迭代顺序都是不一样的。


Actually, I was a little surprised by this method. Why do we need it if an original hashCode method works very good? 
I decided to try a test program from hashCode method performance tuning article in order to find out 
how many duplicate hash codes we will have with a hash32 method.
实际上我对这个方法感到有点吃惊。如果之前的hashcode工作的好好的,为啥要改呢?
我决定做一个实验,来找出使用了hash32()以后,会有多少重复的hash值。


String.hash32() method is not public, so I had to take a look at a HashMap implementation in order to find out how to call this method. 
The answer is sun.misc.Hashing.stringHash32(String).
String.hash32()不是public的,所以我不得不查看了下HashMap的内部实现,来找到怎么调用这个方法,答案就是sun.misc.Hashing.stringHash32(String)。


The same test on 1 million distinct keys has shown 304 duplicate hash values, compared to no duplicates while using String.hashCode. 
I think, we need to wait for further improvements or use case descriptions from Oracle.


同样的测试100万个不同的键值,hash32()有304个重复的hashcode,但是String.hashCode却没有一个重复的。
我认为,我们得等oracle做进一步的优化才可以使用。


New hashing may severely affect highly multithreaded code
Oracle has made a bug in the implementation of hashing in the following classes: HashMap, Hashtable, HashSet, LinkedHashMap, LinkedHashSet and WeakHashMap. 
Only ConcurrentHashMap is not affected. The problem is that all non-concurrent classes now have the following field:
新的hash算法可能会严重影响多线程的代码。
oracle对HashMap, Hashtable, HashSet, LinkedHashMap, LinkedHashSet and WeakHashMap这几个类的hash算法的实现有bug,只有ConcurrentHashMap没有bug。
原因在于所有的这几个非同步的类都有一个字段:
/**
 * A randomizing value associated with this instance that is applied to
 * hash code of keys to make hash collisions harder to find.
 * 和本实例相关联的一个随机数,用来计算key的hash值,它能减少hash碰撞。
 */
transient final int hashSeed = sun.misc.Hashing.randomHashSeed(this);


This means that for every created map/set instance sun.misc.Hashing.randomHashSeed method will be called. randomHashSeed, 
in turn, calls java.util.Random.nextInt method. Random class is well known for its multithreaded unfriendliness: 
it contains private final AtomicLong seed field. Atomics behave well under low to medium contention, but work extremely bad under heavy contention.
这意味着,每一个map/set的实例,都会调用sun.misc.Hashing.randomHashSeed()方法。randomHashSeed会调用java.util.Random.nextInt().
Random这个类是多线程不友好的,因为它含有private final AtomicLong的字段作为种子。在竞争少的情况下Atomics工作的很好,但是,竞争大的情况下性能极差。


As a result, many highly loaded web applications processing HTTP/JSON/XML requests may be affected by this bug, 
because nearly all known parsers use one of the affected classes for “name-value” representation. 
All these format parsers may create nested maps, which further increases the number of maps created per second.
结果,大多数处理HTTP/JSON/XML这样的请求的web应用都会被这个bug所影响。
因为,几乎所有能叫得出口的解析器都会使用受影响的这几个类来表示name-value。
所有这些解析器内部可能都会创建嵌套的map,这会进一步增加每秒钟创建的map的数量。


How to solve this problem?
怎么解决这个问题呢:
1. ConcurrentHashMap way: call randomHashSeed method only when jdk.map.althashing.threshold system property was defined. 
Unfortunately, it is available only for JDK core developers.
1.向ConcurrentHashMap那样:只有当jdk.map.althashing.threshold系统属性定义的时候才调用randomHashSeed()方法,很不幸,只有jdk的核心开发者才能这么干。
/**
 * A randomizing value associated with this instance that is applied to
 * hash code of keys to make hash collisions harder to find.
 */
private transient final int hashSeed = randomHashSeed(this);
 
private static int randomHashSeed(ConcurrentHashMap instance) {
    if (sun.misc.VM.isBooted() && Holder.ALTERNATIVE_HASHING) {
        return sun.misc.Hashing.randomHashSeed(instance);
    }
 
    return 0;
}
2. Hacker way: fix sun.misc.Hashing class. Highly not recommended. If you still wish to go ahead, here is an idea: java.util.Random class is not final. 
You can extend it and override its nextInt method, returning something thread-local (a constant, for example). 
Then you will have to update sun.misc.Hashing.Holder.SEED_MAKER field – set it to your extended Random class instance. 
Don’t worry that this field is private, static and final – reflection can help you:
2.手动打补丁:修改sun.misc.Hashing这个类。非常的不推荐。如果你想这么干,可以这样:java.util.Random类不是final的,你可以继承这个类,覆盖里面的nextInt()方法,
返回thread-local的值(甚至可以是常量)。然后修改sun.misc.Hashing.Holder.SEED_MAKER这个字段,让它成为你继承的Random类的实例。
别担心这个字段是private static final的,用反射来搞定:
public class Hashing {
    private static class Holder {
        static final java.util.Random SEED_MAKER;
3. Buddhist way – do not upgrade to Java 7u6 and higher. Check new Java 7 update sources for this bug fix. 
Unfortunately, nothing has changed even in Java 7u17…
3.干脆不要升级到jdk7u6和更高版本,检查java的更新版本看看是否修复了这个bug。
Summary
总结
From Java 1.7.0_06 String.substring always creates a new underlying char[] value for every String it creates. 
This means that this method now has a linear complexity compared to previous constant complexity. 
The advantage of this change is a slightly smaller memory footprint of a String (4 bytes less than before) 
and a guarantee to avoid memory leaks caused by String.substring.
Starting from the same Java update, String class got a second hashing method called hash32. 
This method is currently not public and could be accessed without reflection only via sun.misc.Hashing.stringHash32(String) call. 
This method is used by 7 JDK hash-based collections if their size will exceed jdk.map.althashing.threshold system property.
This is an experimental function and currently I don’t recommend using it in your code.
Starting from Java 1.7.0_06 all standard JDK non-concurrent maps and sets are affected by a performance bug caused by new hashing implementation. 
This bug affects only multithreaded applications creating heaps of maps per second. See this article for more details.
从jdk1.7.0_06开始,String.substring()总会创建一个新的字符串,包括底层的char[]也是新的。
也就是说,这个方法现在的时间复杂度是线性的,而之前是常量。
这个而改变的好处是每一个string占的内存字节数减少了4个字节,而且还避免了因为调用String.substring导致的内存泄露。
从这个版本开始,String类有了第二个hash方法,叫做hash32()
这个方法暂时还不是公开的,如果不用反射,只能sun.misc.Hashing.stringHash32(String)才能调用。这个方法用在jdk7的使用hash值的集合类上,如果集合的大小超过了jdk.map.althashing.threshold这个系统变量的值。
这个方法现在还处于试验阶段,当前不推荐使用。
从Java 1.7.0_06开始,因为新的hash算法实现的bug,所有标准的jdk的非同步的map和set的性能都会受影响。这个bug只会影响每秒创建大量map的多线程的应用。
现在的jdk8中有两个成员:
private final char value[];
private int hash;
hashcode()算法和jdk6中是一样的:
jdk8:
public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;


            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }
jdk6:
public int hashCode() {
int h = hash;
        int len = count;
if (h == 0 && len > 0) {
   int off = offset;
   char val[] = value;


            for (int i = 0; i < len; i++) {
                h = 31*h + val[off++];
            }
            hash = h;
        }
        return h;
    }
参考这个:http://www.iteye.com/topic/626801,String.substring()导致内存泄露,写得非常好!

你可能感兴趣的:(java)