String.intern in Java 6, 7 and 8 – string pooling
This article describes how the String.intern method was implemented in Java 6 and what changes were made to it in Java 7 and Java 8.
First of all, I want to thank Yannis Bres for inspiring me to write this article.
This is an updated version of the article that adds a description of the -XX:StringTableSize=N JVM parameter. It is followed by the article "String.intern in Java 6, 7 and 8 – multithreaded access", which describes the performance characteristics of multithreaded access to String.intern().
String pooling
String pooling (sometimes also called string canonicalisation) is the process of replacing several String objects that have equal values but different identities with a single shared String object. You can achieve this by keeping your own Map (with soft or weak references, depending on your requirements) and using the map values as the canonical values. Alternatively, you can use the String.intern() method.
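As a minimal sketch of the two approaches just mentioned, the class below canonicalises strings first through a hand-written map and then through String.intern(). The class name StringPoolingDemo and the method canonical are hypothetical, and the map holds strong references for brevity (no soft or weak references), so treat it as an illustration rather than a production pool.

```java
import java.util.HashMap;
import java.util.Map;

class StringPoolingDemo {
    // Hypothetical hand-written pool: maps each value to its canonical instance.
    // Strong references only, so nothing is ever released from this map.
    private static final Map<String, String> POOL = new HashMap<String, String>();

    static String canonical(String s) {
        String existing = POOL.get(s);
        if (existing != null) {
            return existing;   // an equal string was seen before: reuse it
        }
        POOL.put(s, s);        // first occurrence becomes the canonical instance
        return s;
    }

    public static void main(String[] args) {
        // new String(...) forces two distinct objects with equal contents.
        String a = new String("pooled value");
        String b = new String("pooled value");

        System.out.println(a == b);                        // false: different identity
        System.out.println(a.equals(b));                   // true: equal value
        System.out.println(canonical(a) == canonical(b));  // true: one shared object
        System.out.println(a.intern() == b.intern());      // true: one shared object
    }
}
```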
In the days of Java 6, many coding standards forbade String.intern() because of the high risk of an OutOfMemoryError if pooling went out of control. The Oracle Java 7 implementation of string pooling was changed considerably. You can find the details in http://bugs.sun.com/view_bug.do?bug_id=6962931 and http://bugs.sun.com/view_bug.do?bug_id=6962930.
String.intern() in Java 6
In those good old days all interned strings were stored in the PermGen, the fixed-size part of the heap mainly used for storing loaded classes and the string pool. Besides explicitly interned strings, the PermGen string pool also contained all literal strings used earlier in your program (the important word here is "used": if a class or method was never loaded or called, the constants defined in it were not loaded).
The biggest issue with the string pool in Java 6 was its location: the PermGen. PermGen has a fixed size and cannot be expanded at runtime. You can set its size using the -XX:MaxPermSize=96m option. As far as I know, the default PermGen size varies between 32M and 96M depending on the platform. You can increase it, but the size will still be fixed. This limitation required very careful use of String.intern: you had better not intern any uncontrolled user input with this method. That is why string pooling in the days of Java 6 was mostly implemented with manually managed maps.
String.intern() in Java 7
Oracle engineers made an extremely important change to the string pooling logic in Java 7: the string pool was relocated to the heap. This means that you are no longer limited by a separate fixed-size memory area. All strings are now located in the heap, like most other ordinary objects, which allows you to manage only the heap size while tuning your application. Technically, this alone could be a sufficient reason to reconsider using String.intern() in your Java 7 programs. But there are other reasons.
String pool values are garbage collected
Yes, all strings in the JVM string pool are eligible for garbage collection if there are no references to them from your program roots. This applies to all the Java versions discussed here. It means that if your interned string went out of scope and there are no other references to it, it will be garbage collected from the JVM string pool.
Being eligible for garbage collection and residing in the heap, the JVM string pool seems to be the right place for all your strings, doesn't it? In theory that is true: unused strings will be garbage collected from the pool, and used strings will save you memory when you get an equal string from the input. Seems to be a perfect memory-saving strategy? Nearly so. You must know how the string pool is implemented before making any decisions.
JVM string pool implementation in Java 6, 7 and 8
The string pool is implemented as a fixed-capacity hash map in which each bucket contains a list of the strings that hash to that bucket.
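The structure described above can be pictured with a toy model. This is only a sketch under stated assumptions, not the HotSpot implementation: the class FixedBucketPool and its methods are hypothetical, it uses String.hashCode() to pick a bucket, and it never resizes, so chains grow as more strings are added.

```java
import java.util.ArrayList;
import java.util.List;

class FixedBucketPool {
    // The number of buckets is chosen once and never changes (no rehashing).
    private final List<List<String>> buckets;
    private int size;

    FixedBucketPool(int bucketCount) {
        buckets = new ArrayList<List<String>>(bucketCount);
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<String>());
        }
    }

    /** Returns the pooled instance equal to s, adding s if none exists yet. */
    String intern(String s) {
        List<String> chain = buckets.get(indexFor(s));
        for (String candidate : chain) {   // linear scan of one bucket
            if (candidate.equals(s)) {
                return candidate;
            }
        }
        chain.add(s);
        size++;
        return s;
    }

    int size() {
        return size;
    }

    double averageChainLength() {
        return (double) size / buckets.size();
    }

    private int indexFor(String s) {
        // Assumption of this toy model: bucket index derived from String.hashCode().
        return (s.hashCode() & 0x7fffffff) % buckets.size();
    }

    public static void main(String[] args) {
        FixedBucketPool pool = new FixedBucketPool(1009);
        for (int i = 0; i < 100000; i++) {
            pool.intern(Integer.toString(i));
        }
        System.out.println("strings = " + pool.size());
        System.out.println("average chain length = " + pool.averageChainLength());
    }
}
```

Because the bucket count is fixed, every lookup has to scan a chain whose length grows with the number of pooled strings, which is why the table size matters so much in the measurements that follow.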
The default pool size is 1009 (it is present in the source code of the above mentioned bug report). It was a constant in the early versions of Java 6 and became configurable between Java 6u30 and Java 6u41. It is configurable in Java 7 from the beginning (at least it is configurable in Java 7u02). You need to specify -XX:StringTableSize=N, where N is the string pool map size. Make sure it is a prime number for better performance.
This parameter will not help you much in Java 6, because you are still limited by the fixed PermGen size. The rest of this discussion excludes Java 6.
In Java 7, on the other hand, you are limited only by a much larger heap size. This means that you can set the string pool size to a rather high value in advance (the value depends on your application requirements). As a rule, one starts worrying about memory consumption when the in-memory data set grows to at least several hundred megabytes. In this situation, allocating 8-16 MB for a string pool with one million entries seems to be a reasonable trade-off (do not use 1,000,000 as the -XX:StringTableSize value, because it is not prime; use 1,000,003 instead).
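Picking a prime table size can be automated. The helper below is a small sketch (the class StringTableSizeHelper and its methods are hypothetical names) that rounds an estimated number of distinct strings up to the next prime and prints the matching JVM option.

```java
class StringTableSizeHelper {
    /** Trial division; fast enough for table sizes in the int range. */
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        if (n % 2 == 0) {
            return n == 2;
        }
        for (long d = 3; d * d <= n; d += 2) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    /** Smallest prime that is greater than or equal to n. */
    static int nextPrime(int n) {
        int candidate = Math.max(n, 2);
        while (!isPrime(candidate)) {
            candidate++;
        }
        return candidate;
    }

    public static void main(String[] args) {
        // Estimated number of distinct interned strings; defaults to one million.
        int expected = args.length > 0 ? Integer.parseInt(args[0]) : 1000000;
        System.out.println("-XX:StringTableSize=" + nextPrime(expected));
    }
}
```

Run without arguments, it prints -XX:StringTableSize=1000003, the value recommended above.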
You may expect a uniform distribution of interned strings across the buckets; see my experiments in the hashCode method performance tuning article.
You must set a higher -XX:StringTableSize value (compared to the default 1009) if you intend to use String.intern() actively; otherwise the performance of this method will soon degrade to O(pool size), that is, linear in the number of strings already in the pool.
I have not noticed any dependency between string length and the time needed to intern a string for strings shorter than 100 characters (I feel that duplicates of even 50-character strings are rather unlikely in real-world data, so 100 characters seems a reasonable test limit to me).
Here is an extract from the test application log with the default pool size. The second number is the time to intern 10,000 strings after the given number of strings (the first number) had already been interned. The interned strings were Integer.toString( i ) for i between 0 and 999,999:
0; time = 0.0 sec
50000; time = 0.03 sec
100000; time = 0.073 sec
150000; time = 0.13 sec
200000; time = 0.196 sec
250000; time = 0.279 sec
300000; time = 0.376 sec
350000; time = 0.471 sec
400000; time = 0.574 sec
450000; time = 0.666 sec
500000; time = 0.755 sec
550000; time = 0.854 sec
600000; time = 0.916 sec
650000; time = 1.006 sec
700000; time = 1.095 sec
750000; time = 1.273 sec
800000; time = 1.248 sec
850000; time = 1.446 sec
900000; time = 1.585 sec
950000; time = 1.635 sec
1000000; time = 1.913 sec
These test results were obtained on a Core-series laptop CPU. As you can see, the times grow linearly, and I was able to intern only approximately 5,000 strings per second once the JVM string pool contained one million strings. That is unacceptably slow for most applications that have to handle a large amount of data in memory.
Now the same test results with the -XX:StringTableSize=100003 option:
50000; time = 0.017 sec
100000; time = 0.009 sec
150000; time = 0.01 sec
200000; time = 0.009 sec
250000; time = 0.007 sec
300000; time = 0.008 sec
350000; time = 0.009 sec
400000; time = 0.009 sec
450000; time = 0.01 sec
500000; time = 0.013 sec
550000; time = 0.011 sec
600000; time = 0.012 sec
650000; time = 0.015 sec
700000; time = 0.015 sec
750000; time = 0.01 sec
800000; time = 0.01 sec
850000; time = 0.011 sec
900000; time = 0.011 sec
950000; time = 0.012 sec
1000000; time = 0.012 sec
As you can see, in this situation it takes nearly constant time to insert strings into the pool (there are no more than 10 strings per bucket on average). Here are the results with the same settings, but now we insert up to 10 million strings into the pool (which means 100 strings per bucket on average):
2000000; time = 0.024 sec
3000000; time = 0.028 sec
4000000; time = 0.053 sec
5000000; time = 0.051 sec
6000000; time = 0.034 sec
7000000; time = 0.041 sec
8000000; time = 0.089 sec
9000000; time = 0.111 sec
10000000; time = 0.123 sec
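The bucket occupancy figures quoted in this section are simple arithmetic: the number of pooled strings divided by the number of buckets. The sketch below (BucketLoad is a hypothetical name) computes that ratio for the table sizes used in these tests, assuming the uniform distribution mentioned earlier.

```java
class BucketLoad {
    /** Average number of strings per bucket, assuming a uniform distribution. */
    static double averageChainLength(long pooledStrings, long buckets) {
        return (double) pooledStrings / buckets;
    }

    public static void main(String[] args) {
        // { number of pooled strings, -XX:StringTableSize value }
        long[][] cases = {
            { 1000000L, 1009L },      // default table, one million strings
            { 1000000L, 100003L },    // 100003 buckets, one million strings
            { 10000000L, 100003L },   // 100003 buckets, ten million strings
            { 10000000L, 1000003L },  // 1000003 buckets, ten million strings
        };
        for (long[] c : cases) {
            System.out.println(c[0] + " strings / " + c[1] + " buckets = "
                    + averageChainLength(c[0], c[1]) + " per bucket");
        }
    }
}
```

The ratios come out at roughly 991, 10, 100 and 10 strings per bucket, which matches the shape of the timings: slow and growing with the default table, nearly flat at about 10 per bucket, and visibly slower again at about 100 per bucket.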
Now let's increase the pool size to one million buckets (1,000,003 to be precise):
1000000; time = 0.005 sec
2000000; time = 0.005 sec
3000000; time = 0.005 sec
4000000; time = 0.004 sec
5000000; time = 0.004 sec
6000000; time = 0.009 sec
7000000; time = 0.01 sec
8000000; time = 0.009 sec
9000000; time = 0.009 sec
10000000; time = 0.009 sec
As you can see, the times are flat and do not look much different from the "zero to one million" table for the ten times smaller string pool. Even my slow laptop can add one million new strings per second to the JVM string pool, provided that the pool size is high enough.
Shall we still use manual string pools?
Now we need to compare this JVM string pool with a WeakHashMap, which is often used to emulate a string pool. The following method is commonly used as a replacement for String.intern:
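    // Requires java.util.WeakHashMap and java.lang.ref.WeakReference.
    private static final WeakHashMap<String, WeakReference<String>> s_manualCache =
        new WeakHashMap<String, WeakReference<String>>();

    private static String manualIntern( final String str )
    {
        final WeakReference<String> cached = s_manualCache.get( str );
        if ( cached != null )
        {
            final String value = cached.get();
            if ( value != null )
                return value;
        }
        // not pooled yet (or already collected): this instance becomes the canonical one
        s_manualCache.put( str, new WeakReference<String>( str ) );
        return str;
    }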
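One practical caveat: WeakHashMap is not synchronized, so a manual pool like this needs external locking if several threads share it. The variant below is a hedged sketch of the simplest way to do that (the class name SynchronizedManualPool is hypothetical, and it was not part of the measurements in this article); it trades some throughput for safety by guarding the whole lookup with one lock.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

class SynchronizedManualPool {
    private final Map<String, WeakReference<String>> cache =
            new WeakHashMap<String, WeakReference<String>>();

    /** Same logic as a WeakHashMap-based manual pool, guarded by the instance lock. */
    synchronized String intern(String str) {
        WeakReference<String> cached = cache.get(str);
        if (cached != null) {
            String value = cached.get();
            if (value != null) {
                return value;   // an equal string is already pooled and still alive
            }
        }
        // Not pooled yet (or already collected): str becomes the canonical instance.
        cache.put(str, new WeakReference<String>(str));
        return str;
    }

    public static void main(String[] args) {
        SynchronizedManualPool pool = new SynchronizedManualPool();
        String a = new String("shared");
        String b = new String("shared");
        // a stays strongly reachable here, so its pool entry cannot be collected.
        System.out.println(pool.intern(a) == pool.intern(b)); // true
    }
}
```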
This is the output for the same test using this manual pool:
0; manual time = 0.001 sec
50000; manual time = 0.03 sec
100000; manual time = 0.034 sec
150000; manual time = 0.008 sec
200000; manual time = 0.019 sec
250000; manual time = 0.011 sec
300000; manual time = 0.011 sec
350000; manual time = 0.008 sec
400000; manual time = 0.027 sec
450000; manual time = 0.008 sec
500000; manual time = 0.009 sec
550000; manual time = 0.008 sec
600000; manual time = 0.008 sec
650000; manual time = 0.008 sec
700000; manual time = 0.008 sec
750000; manual time = 0.011 sec
800000; manual time = 0.007 sec
850000; manual time = 0.008 sec
900000; manual time = 0.008 sec
950000; manual time = 0.008 sec
1000000; manual time = 0.008 sec
The manually written pool provided comparable performance when the JVM had sufficient memory. Unfortunately, for my test case of very short strings (interning String.valueOf( N ) for 0 < N < 1,000,000,000), it allowed me to keep only ~2.5M such strings with -Xmx1280M. The JVM string pool (size=1,000,003), on the other hand, provided the same flat performance characteristics until the JVM ran out of memory with 12.72M strings in the pool (5 times more). I think this is a valuable hint to get rid of manual string pooling in your programs.
String.intern() in Java 8
I have tried to run the same tests on the current early access build (b102) of Java 8. Java 8 still accepts the -XX:StringTableSize parameter and provides performance comparable to Java 7. The only important difference is that the default pool size was increased in Java 8 to something around 25-50K:
50000; time = 0.019 sec
100000; time = 0.009 sec
150000; time = 0.009 sec
200000; time = 0.009 sec
250000; time = 0.009 sec
300000; time = 0.009 sec
350000; time = 0.011 sec
400000; time = 0.012 sec
450000; time = 0.01 sec
500000; time = 0.013 sec
550000; time = 0.013 sec
600000; time = 0.014 sec
650000; time = 0.018 sec
700000; time = 0.015 sec
750000; time = 0.029 sec
800000; time = 0.018 sec
850000; time = 0.02 sec
900000; time = 0.017 sec
950000; time = 0.018 sec
1000000; time = 0.021 sec
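Rather than inferring the default table size from timings, you can ask the running JVM for it. The sketch below assumes a HotSpot-based JVM (Java 7 or later) that exposes com.sun.management.HotSpotDiagnosticMXBean; the class name StringTableSizeProbe is hypothetical, and the method returns null on JVMs where the bean or the option is not available. Running java -XX:+PrintFlagsFinal -version and looking for StringTableSize is an alternative on HotSpot.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

class StringTableSizeProbe {
    /** Returns the effective StringTableSize as text, or null if it cannot be read. */
    static String stringTableSize() {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return null;
            }
            return bean.getVMOption("StringTableSize").getValue();
        } catch (RuntimeException e) {
            // Bean not supported or option unknown on this JVM.
            return null;
        }
    }

    public static void main(String[] args) {
        String value = stringTableSize();
        System.out.println(value == null
                ? "StringTableSize is not available on this JVM"
                : "StringTableSize = " + value);
    }
}
```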
Test code
The test code for this article is rather simple: a method creates and interns new strings in a loop. We also measure the time it took to intern the latest 10,000 strings. It is worth running this program with the -verbose:gc JVM parameter to see when garbage collections happen and what they collect. You may also want to specify the maximum heap size using the -Xmx parameter.
There are 2 tests. testStringPoolGarbageCollection will show you that the JVM string pool is actually garbage collected: check the garbage collection log messages as well as the time it took to intern the strings on the second pass. This test will fail with the default Java 6 PermGen size, so either increase the PermGen size, adjust the test method argument, or use Java 7.
The second test will show you how many interned strings can be stored in memory. Run it on Java 6 with 2 different memory settings, for example -Xmx128M and -Xmx1280M (10 times more). Most likely you will see that the heap size does not affect the number of strings you can put in the pool. In Java 7, on the other hand, you will be able to fill the whole heap with your strings.
    import java.lang.ref.WeakReference;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.WeakHashMap;

    /**
     * Testing String.intern.
     *
     * Run this class at least with -verbose:gc JVM parameter.
     */
    public class InternTest {
        public static void main( String[] args ) {
            testStringPoolGarbageCollection();
            testLongLoop();
        }

        /**
         * Use this method to see where interned strings are stored
         * and how many of them you can fit for the given heap size.
         */
        private static void testLongLoop()
        {
            test( 1000 * 1000 * 1000 );
            //uncomment the following line to see the hand-written cache performance
            //testManual( 1000 * 1000 * 1000 );
        }

        /**
         * Use this method to check that unused interned strings are garbage collected.
         */
        private static void testStringPoolGarbageCollection()
        {
            //first method call - use it as a reference
            test( 1000 * 1000 );
            //we are going to clean the cache here.
            System.gc();
            //check the memory consumption and how long it takes to intern strings
            //in the second method call.
            test( 1000 * 1000 );
        }

        private static void test( final int cnt )
        {
            //the list keeps strong references, so the strings stay alive during the test
            final List<String> lst = new ArrayList<String>();
            long start = System.currentTimeMillis();
            for ( int i = 0; i < cnt; ++i )
            {
                final String str = "Very long test string, which tells you about something " +
                    "very-very important, definitely deserving to be interned #" + i;
                //uncomment the following line to test dependency from string length
                // final String str = Integer.toString( i );
                lst.add( str.intern() );
                if ( i % 10000 == 0 )
                {
                    System.out.println( i + "; time = " + ( System.currentTimeMillis() - start ) / 1000.0 + " sec" );
                    start = System.currentTimeMillis();
                }
            }
            System.out.println( "Total length = " + lst.size() );
        }

        private static final WeakHashMap<String, WeakReference<String>> s_manualCache =
            new WeakHashMap<String, WeakReference<String>>();

        private static String manualIntern( final String str )
        {
            final WeakReference<String> cached = s_manualCache.get( str );
            if ( cached != null )
            {
                final String value = cached.get();
                if ( value != null )
                    return value;
            }
            s_manualCache.put( str, new WeakReference<String>( str ) );
            return str;
        }

        private static void testManual( final int cnt )
        {
            final List<String> lst = new ArrayList<String>();
            long start = System.currentTimeMillis();
            for ( int i = 0; i < cnt; ++i )
            {
                final String str = "Very long test string, which tells you about something " +
                    "very-very important, definitely deserving to be interned #" + i;
                lst.add( manualIntern( str ) );
                if ( i % 10000 == 0 )
                {
                    System.out.println( i + "; manual time = " + ( System.currentTimeMillis() - start ) / 1000.0 + " sec" );
                    start = System.currentTimeMillis();
                }
            }
            System.out.println( "Total length = " + lst.size() );
        }
    }
Summary
Stay away from the String.intern() method on Java 6 because of the fixed-size memory area (PermGen) used for JVM string pool storage.
Java 7 and 8 implement the string pool in the heap memory. This means that string pooling in Java 7 and 8 is limited only by the whole application memory.
Use the -XX:StringTableSize JVM parameter in Java 7 and 8 to set the string pool map size. The size is fixed, because the pool is implemented as a hash map with lists in the buckets. Estimate the number of distinct strings in your application (which you intend to intern) and set the pool size to a prime number close to this value. This will allow String.intern to run in constant time while requiring rather little memory per interned string (an explicitly used Java WeakHashMap will consume 4-5 times more memory for the same task).
The default value of the -XX:StringTableSize parameter is 1009 in Java 7 and around 25-50K in Java 8.