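It is well known that virtual memory lets a machine with only 1 GB of physical memory run workloads that need 4 GB in total. This works by mapping virtual memory onto physical memory: when physical memory runs short, the OS swaps pages out to swap space (on disk), which frees the physical pages for reallocation, and swaps them back in when the data is needed again. Note, however, that swapping pages between memory and disk is very expensive: it costs CPU time and stalls on disk I/O. In production, swap should therefore be kept small, and a high swap-usage ratio should be investigated and resolved.
Problem with the JVM configuration (Java 8): Java 8 removed the permanent generation, so -XX:PermSize=128m and -XX:MaxPermSize=256m have no effect.
;JVM configuration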
CUSTOM_JVM = -Xmx5g -Xms5g -Xmn2g -server -XX:PermSize=128m -XX:MaxPermSize=256m -XX:+PrintCommandLineFlags -XX:+UseConcMarkSweepGC -XX:CMSFullGCsBeforeCompaction=0 -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=68 -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails
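jmap reports MetaspaceSize = 21807104 (20.796875MB) and MaxMetaspaceSize = 17592186044415 MB. Could it be that metaspace, never reaching its maximum limit, grows without bound and never triggers a full GC, so that new young-generation objects are allocated before the maximum NewSize is reached, which in turn causes swapping between physical memory and swap?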
[xxx@xxxx bin]$ ./jmap -heap 16907
Attaching to process ID 16907, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.45-b02
using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 5368709120 (5120.0MB)
NewSize = 2147483648 (2048.0MB)
MaxNewSize = 2147483648 (2048.0MB)
OldSize = 3221225472 (3072.0MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 0 (0.0MB)
Heap Usage:
New Generation (Eden + 1 Survivor Space):
capacity = 1932787712 (1843.25MB)
used = 1650646368 (1574.1790466308594MB)
free = 282141344 (269.0709533691406MB)
85.402362491841% used
Eden Space:
capacity = 1718091776 (1638.5MB)
used = 1642435424 (1566.3484802246094MB)
free = 75656352 (72.15151977539062MB)
95.59648948578635% used
From Space:
capacity = 214695936 (204.75MB)
used = 8210944 (7.83056640625MB)
free = 206484992 (196.91943359375MB)
3.8244524572649574% used
To Space:
capacity = 214695936 (204.75MB)
used = 0 (0.0MB)
free = 214695936 (204.75MB)
0.0% used
concurrent mark-sweep generation:
capacity = 3221225472 (3072.0MB)
used = 5149533810650707576 (4.910978136683186E12MB)
free = 12681207910804 MB
1.598625695534891E11% used
44196 interned Strings occupying 4956280 bytes.
Metaspace usage analysis: the most recent GC cause was Allocation Failure, while neither the young nor the old generation is heavily occupied.
[xxx@xxx bin]$ ./jstat -gccause 16907
S0 S1 E O M CCS YGC YGCT FGC FGCT GCT LGCC GCC
0.00 4.13 66.98 44.01 96.62 92.52 9015 177.057 8 4.612 181.668 Allocation Failure No GC
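Current GC capacities: the young generation is 2 GB and the old generation is 3 GB. Metaspace has a maximum capacity (MCMX) of about 1.1 GB and a current capacity (MC) of 148 MB. I first took this to mean that metaspace had reached 1 GB at some point, but MCMX is the reserved upper bound rather than a high-water mark, so this alone is not conclusive.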
[xxx@xxxx bin]$ ./jstat -gccapacity 16907
NGCMN NGCMX NGC S0C S1C EC OGCMN OGCMX OGC OC MCMN MCMX MC CCSMN CCSMX CCSC YGC FGC
2097152.0 2097152.0 2097152.0 209664.0 209664.0 1677824.0 3145728.0 3145728.0 3145728.0 3145728.0 0.0 1181696.0 148556.0 0.0 1048576.0 15580.0 9031 8
Memory usage from top: the java process has only 6.5 GB resident, yet its virtual memory size has reached 13.0 GB.
Tasks: 136 total, 1 running, 135 sleeping, 0 stopped, 0 zombie
Cpu(s): 7.0%us, 3.7%sy, 0.0%ni, 88.6%id, 0.0%wa, 0.0%hi, 0.3%si, 0.4%st
Mem: 8059416k total, 7747344k used, 312072k free, 20048k buffers
Swap: 2096440k total, 1993920k used, 102520k free, 368492k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16907 sankuai 20 0 13.0g 6.5g 6880 S 41.6 84.4 1965:06 java
5063 sankuai 20 0 1916m 24m 2884 S 6.3 0.3 171:42.18 cplugin
676 sankuai 20 0 839m 6568 1600 S 0.3 0.1 288:01.72 log_agent
9317 root 20 0 1350m 9348 2940 S 0.3 0.1 24:25.22 falcon-agent
1 root 20 0 39952 300 132 S 0.0 0.0 0:01.92 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:11.90 migration/0
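Why does over-allocating virtual memory cause swapping rather than GC?
First, why does swapping happen at all?
From the hardware point of view, Linux memory consists of two parts: physical memory and swap (on disk). Physical memory is the main region Linux works in. When physical memory runs short, Linux moves some temporarily unused data into swap on disk to free up memory; when data in swap is needed again, it must first be swapped back into memory.
From the operating system's point of view, apart from the region used to boot the system, the address space is divided into two main parts: kernel memory (kernel space) [shared across processes] and user memory (user space) [private to each process].
Virtual memory: each process gets its own virtual address space, and physical memory is allocated only when a virtual page is actually used. Thanks to virtual memory plus swap, every user process sees a virtual address space of the same size, and that space can be larger than physical memory. When a virtual page has to be mapped to physical memory and no free physical page exists, the OS swaps out pages that this or another process does not need right now and hands the freed physical pages to the program. Note that such swapping hurts machine performance badly.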
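Next, why does allocating more virtual memory than physical memory cause swapping rather than GC?
First, let's look at how a Linux process, and a JVM process (which is just a Linux process), uses virtual memory:
Linux can only run executable binary code, and the JVM is itself a Linux process written in C/C++. It therefore uses virtual memory like any other Linux process, dividing user memory into the following parts: the code segment (the process's code), the data segment (the process's global and static data), the heap (space the program requests and releases dynamically at run time, via malloc/free or new/delete), the stack (function arguments, local variables, return addresses, and so on), and the unused region.
At startup the JVM places its own global and static variables in the data segment (this has nothing to do with Java threads yet) and its own code in the code segment. It also reserves a large block of virtual memory for the threads running inside the JVM, which holds what we usually call the JVM memory model: the young and old generations [together, the Java heap] and the permanent generation (the code area of the Java program; in Java 8 it is replaced by metaspace, which lives in native memory outside the Java heap). The stack region of the JVM process is used mainly for the stacks of Java threads.
Now, why do an oversized heap (virtual memory) and an oversized metaspace cause swapping? Start with the JVM settings: [heap size] -Xmx5g -Xms5g, [metaspace size] -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=512m. In Java 8 these sizes only describe virtual memory and are not fully mapped to physical memory up front (presumably a portion is committed proportionally), but they do set the capacity at which the JVM starts garbage collection. If that capacity has not been reached but physical memory is short, the OS swaps inactive pages out and gives the freed memory to the JVM. The same applies to metaspace.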
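Next, a test. Environment: a Mac Pro with 16 GB of physical memory, JVM flags -Xmx100g -Xms100g -Xmn50g -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=512m
The JVM reserves 100 GB of virtual memory at startup, and actual memory use reaches 40 GB, which is only possible through swap. An oversized virtual memory setting therefore leads to frequent swapping rather than GC. The lowest point in the memory-usage chart is where I manually triggered a GC; after it only 3 GB was in use.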
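The gap between what the JVM is allowed to use and what it currently holds can also be observed from inside the process. The following is a minimal sketch, not part of the original test; the class name `HeapFootprint` and the helper `toMb` are made up for illustration. It prints the heap limit (`-Xmx`), the heap the JVM has committed so far, and the part in use. Whether a committed page sits in RAM or in swap is decided by the OS and is not visible through this API.

```java
import java.util.Locale;

final class HeapFootprint {
    /** Formats a byte count as megabytes with one decimal place. */
    static String toMb(long bytes) {
        return String.format(Locale.ROOT, "%.1f MB", bytes / (1024.0 * 1024.0));
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();          // upper bound the heap may grow to (-Xmx)
        long committed = rt.totalMemory();  // heap the JVM currently holds
        long used = committed - rt.freeMemory();

        System.out.println("max heap       = " + toMb(max));
        System.out.println("committed heap = " + toMb(committed));
        System.out.println("used heap      = " + toMb(used));
    }
}
```

Running it with a large `-Xmx` and a small `-Xms` makes the difference between the first two numbers obvious.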
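Now test with -XX:MetaspaceSize=50G -XX:MaxMetaspaceSize=100G
The result is not what we expected: MetaspaceSize is not the initial size of class metadata memory.
MetaspaceSize: about 20.8 MB by default (on x86 with the C2 compiler enabled). It mainly controls the initial threshold, which is also the minimum threshold, at which a metaspace GC is triggered; the actual trigger threshold keeps changing. The value compared against it is the sum of the committed memory of Klass Metaspace and NoKlass Metaspace. MaxMetaspaceSize: effectively unlimited by default, but I still recommend setting it, because without a limit metaspace can grow without bound (usually because of a leak) until the OS kills the process. This flag caps the committed memory of metaspace (both Klass Metaspace and NoKlass Metaspace): committed memory is guaranteed not to exceed it, and hitting the cap triggers a GC. Note the difference from MaxPermSize: MaxMetaspaceSize does not allocate a block of that size at JVM startup, whereas MaxPermSize does.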
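The used, committed and maximum sizes discussed above can be read at run time through the standard `java.lang.management` API. This is a small sketch under the assumption of a HotSpot JVM, where the non-heap pools are typically named "Metaspace" and "Compressed Class Space"; other JVMs may expose different pool names. The class name `NonHeapPools` is made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

final class NonHeapPools {
    /** One line per non-heap pool: name, used, committed, max (max=-1 means no limit is defined). */
    static List<String> describeNonHeapPools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.NON_HEAP) {
                continue; // skip eden, survivor and old generation pools
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            lines.add(pool.getName()
                    + ": used=" + usage.getUsed()
                    + " committed=" + usage.getCommitted()
                    + " max=" + usage.getMax());
        }
        return lines;
    }

    public static void main(String[] args) {
        describeNonHeapPools().forEach(System.out::println);
    }
}
```

Without `-XX:MaxMetaspaceSize`, the metaspace pool is expected to report `max=-1`, which matches the "effectively unlimited" default described above.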
-XX:MetaspaceSize=64M -XX:MaxMetaspaceSize=128M
With these settings, metaspace demand eventually exceeded the MaxMetaspaceSize=128M limit.
The following error appeared and the Java program died:
[2017-05-25 07:55:46,102] Artifact waimai_m_poi_task:war: Error during artifact deployment. See server log for details.
[2017-05-25 07:55:46,103] Artifact waimai_m_poi_task:war: com.intellij.javaee.oss.admin.jmx.JmxAdminException: java.util.concurrent.TimeoutException
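The most recent GC causes during allocation show frequent full GCs, triggered by metadata (metaspace) allocation.
Conclusion: MaxMetaspaceSize needs a sensible value so that the JVM uses memory reasonably, but in theory the swapping is not caused by metaspace, which should not take up much space. Setting MaxMetaspaceSize=512M did not fix the swapping on the machine either.
The production machines add -XX:+DisableExplicitGC by default. With this flag, a call to System.gc() from user code is ignored and does not trigger a GC: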
JVM_ENTRY_NO_ENV(void, JVM_GC(void))
JVMWrapper("JVM_GC");
if (!DisableExplicitGC) {
Universe::heap()->collect(GCCause::_java_lang_system_gc);
}
JVM_END
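Also, calling System.gc() does not mean the JVM performs a full GC immediately; what actually happens is up to the JVM. The purpose of DisableExplicitGC is to stop user code from prompting the JVM to collect, on the view that the JVM should manage collection itself and collect when the young or old generation runs out of space, which improves throughput.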
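Since the flag changes what `System.gc()` does, it helps to know whether a given process was started with it. A minimal sketch follows, assuming the flag was passed on the command line so that it shows up in `RuntimeMXBean.getInputArguments()`; the class and method names are made up for illustration. When the flag appears more than once, the last occurrence is taken to win.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class ExplicitGcCheck {
    /** Returns true if the given JVM arguments leave DisableExplicitGC switched on. */
    static boolean explicitGcDisabled(List<String> jvmArgs) {
        boolean disabled = false; // HotSpot default: explicit GC is allowed
        for (String arg : jvmArgs) {
            if (arg.equals("-XX:+DisableExplicitGC")) {
                disabled = true;
            } else if (arg.equals("-XX:-DisableExplicitGC")) {
                disabled = false;
            }
        }
        return disabled;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("System.gc() ignored: " + explicitGcDisabled(jvmArgs));
    }
}
```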
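Here is why DisableExplicitGC can lead to swapping. We did not set -XX:MaxDirectMemorySize, so direct memory is bounded only by the default limit (roughly the maximum heap size) and can grow very large. Suppose the machine has 16 GB of physical memory, direct memory uses 7 GB and the Java heap uses 10 GB: if the program still runs, it must be using swap on disk to make up the difference.
JVM flags: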
-XX:CMSFullGCsBeforeCompaction=0 -XX:CMSInitiatingOccupancyFraction=68 -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:InitialHeapSize=12884901888 -XX:MaxDirectMemorySize=8589934592 -XX:MaxHeapSize=12884901888 -XX:MaxMetaspaceSize=536870912 -XX:MaxNewSize=6442450944 -XX:MaxTenuringThreshold=6 -XX:MetaspaceSize=268435456 -XX:NewSize=6442450944 -XX:OldPLABSize=16 -XX:+PrintCommandLineFlags -XX:+PrintGCDetails -XX:+UseCMSCompactAtFullCollection -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
That is: max heap 12 GB, metaspace 512 MB, direct memory 8 GB.
The code is below. Direct memory 7 GB + heap 10 GB = 17 GB > 16 GB (the machine's physical memory).
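import java.nio.ByteBuffer;
import org.junit.Test;

/**
 * Created by wenyi on 17/5/24.
 * Email:[email protected]
 */
public class TestDirectByteBuffer {
    @Test
    public void testSwap() throws Exception {
        // Allocate 7 direct buffers of 1 GB each and keep them reachable.
        int directLen = 7;
        ByteBuffer[] byteBuffers = new ByteBuffer[directLen];
        for (int i = 0; i < directLen; i++) {
            byteBuffers[i] = ByteBuffer.allocateDirect(1 * 1024 * 1024 * 1024);
        }
        System.out.println("direct memory alloc success");
        // Allocate 20 heap arrays of 512 MB each (10 GB in total).
        int newLen = 20;
        byte[][] bytes = new byte[newLen][];
        for (int i = 0; i < newLen; i++) {
            bytes[i] = new byte[512 * 1024 * 1024];
        }
        System.out.println("new memory alloc success");
    }
}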
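Output log: the program ran normally. Note that a full GC can release direct memory that is no longer referenced, but in this program the direct buffers never go out of scope, so they are not released. So when the JVM heap has not reached its GC threshold and the heap takes up too much physical memory, swapping occurs.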
direct memory alloc success
[GC (Allocation Failure) [ParNew: 5020585K->526125K(5662336K), 12.2738603 secs] 5020585K->4720431K(11953792K), 12.2739024 secs] [Times: user=62.86 sys=10.30, real=12.28 secs]
[GC (CMS Initial Mark) [1 CMS-initial-mark: 4194306K(6291456K)] 5244719K(11953792K), 0.0021633 secs] [Times: user=0.00 sys=0.01, real=0.00 secs]
[CMS-concurrent-mark-start]
[CMS-concurrent-mark: 0.002/0.002 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
[CMS-concurrent-preclean-start]
[CMS-concurrent-preclean: 0.009/0.009 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
[CMS-concurrent-abortable-preclean-start]
[CMS-concurrent-abortable-preclean: 0.014/0.116 secs] [Times: user=0.10 sys=0.02, real=0.12 secs]
[GC (CMS Final Remark) [YG occupancy: 2246958 K (5662336 K)][Rescan (parallel) , 0.0026644 secs][weak refs processing, 0.0000468 secs][class unloading, 0.0002846 secs][scrub symbol table, 0.0003501 secs][scrub string table, 0.0001422 secs][1 CMS-remark: 4194306K(6291456K)] 6441264K(11953792K), 0.0035461 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
[CMS-concurrent-sweep-start]
[CMS-concurrent-sweep: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[CMS-concurrent-reset-start]
[CMS-concurrent-reset: 0.072/0.072 secs] [Times: user=0.05 sys=0.07, real=0.07 secs]
[GC (Allocation Failure) [ParNew: 5392686K->5392686K(5662336K), 0.0000164 secs][CMS: 4194306K->5767168K(6291456K), 10.0724375 secs] 9586992K->9438305K(11953792K), [Metaspace: 2908K->2908K(1056768K)], 10.0725013 secs] [Times: user=3.41 sys=5.44, real=10.07 secs]
[GC (CMS Initial Mark) [1 CMS-initial-mark: 5767168K(6291456K)] 9962593K(11953792K), 0.0009442 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[CMS-concurrent-mark-start]
[CMS-concurrent-mark: 0.003/0.003 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
[CMS-concurrent-preclean-start]
[CMS-concurrent-preclean: 0.017/0.017 secs] [Times: user=0.03 sys=0.00, real=0.02 secs]
[CMS-concurrent-abortable-preclean-start]
[CMS-concurrent-abortable-preclean: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[GC (CMS Final Remark) [YG occupancy: 4914249 K (5662336 K)][Rescan (parallel) Disconnected from the target VM, address: '127.0.0.1:50771', transport: 'socket'
, 0.0025774 secs][weak refs processing, 0.0000531 secs][class unloading, 0.0003481 secs][scrub symbol table, 0.0003880 secs][scrub string table, 0.0001472 secs][1 CMS-remark: 5767168K(6291456K)] 10681417K(11953792K), 0.0035720 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
[CMS-concurrent-sweep-start]
new memory alloc success[CMS-concurrent-sweep: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[CMS-concurrent-reset-start]
Heap
par new generation total 5662336K, used 4964581K [0x00000004c0000000, 0x0000000640000000, 0x0000000640000000)
eden space 5033216K, 98% used [0x00000004c0000000, 0x00000005ef0397a0, 0x00000005f3340000)
from space 629120K, 0% used [0x00000006199a0000, 0x00000006199a0000, 0x0000000640000000)
to space 629120K, 0% used [0x00000005f3340000, 0x00000005f3340000, 0x00000006199a0000)
concurrent mark-sweep generation total 6291456K, used 5767168K [0x0000000640000000, 0x00000007c0000000, 0x00000007c0000000)
Metaspace used 2914K, capacity 4494K, committed 4864K, reserved 1056768K
class space used 311K, capacity 386K, committed 512K, reserved 1048576K
Process finished with exit code 0
To further confirm this conclusion, I set the heap larger than real memory, with the Java program unchanged.
JVM flags: -Xmx32g -Xms32g -Xmn16g -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=512m -XX:+UseConcMarkSweepGC -XX:CMSFullGCsBeforeCompaction=0 -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=98 -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+DisableExplicitGC -XX:MaxDirectMemorySize=8G
The output log is below. This time there was not even a young GC.
direct memory alloc success
new memory alloc success
Disconnected from the target VM, address: '127.0.0.1:50813', transport: 'socket'
Heap
par new generation total 15099520K, used 11559506K [0x0000000113e00000, 0x0000000513e00000, 0x0000000513e00000)
eden space 13421824K, 86% used [0x0000000113e00000, 0x00000003d5694b88, 0x0000000447140000)
from space 1677696K, 0% used [0x0000000447140000, 0x0000000447140000, 0x00000004ad7a0000)
to space 1677696K, 0% used [0x00000004ad7a0000, 0x00000004ad7a0000, 0x0000000513e00000)
concurrent mark-sweep generation total 16777216K, used 0K [0x0000000513e00000, 0x0000000913e00000, 0x0000000913e00000)
Metaspace used 2904K, capacity 4108K, committed 4352K, reserved 8192K
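How do we optimize? We discuss two flags.
First, set -XX:MaxDirectMemorySize to limit direct memory, and keep DisableExplicitGC for JVM efficiency. The possible problem with this setup is java.lang.OutOfMemoryError: Direct buffer memory
Consider the following test program, run with these JVM flags: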
-XX:CMSFullGCsBeforeCompaction=0 -XX:CMSInitiatingOccupancyFraction=98 -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:InitialHeapSize=268435456 -XX:MaxDirectMemorySize=536870912 -XX:MaxHeapSize=4294967296 -XX:MaxNewSize=697933824 -XX:MaxTenuringThreshold=6 -XX:OldPLABSize=16 -XX:+PrintCommandLineFlags -XX:+PrintGCDetails -XX:+UseCMSCompactAtFullCollection -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
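import java.nio.ByteBuffer;
import org.junit.Test;

public class TestDirectByteBuffer {
    @Test
    public void testSwap() throws Exception {
        int directLen = 7;
        ByteBuffer[] byteBuffers = new ByteBuffer[directLen];
        for (int i = 0; i < directLen; i++) {
            byteBuffers[i] = ByteBuffer.allocateDirect(1 * 1024 * 1024 * 1024);
        }
        System.out.println("direct memory alloc success");
        int newLen = 20;
        byte[][] bytes = new byte[newLen][];
        for (int i = 0; i < newLen; i++) {
            bytes[i] = new byte[512 * 1024 * 1024];
        }
        System.out.println("new memory alloc success");
    }

    // Allocates directLen direct buffers of 256 MB each and keeps them all reachable.
    private void allocDirectMemory(int directLen) {
        ByteBuffer[] directBuffer = new ByteBuffer[directLen];
        for (int i = 0; i < directLen; i++) {
            directBuffer[i] = ByteBuffer.allocateDirect(256 * 1024 * 1024);
        }
        System.out.println(directLen + "*256M byte buffer success");
    }

    // 2 * 256 MB = 512 MB, exactly the configured MaxDirectMemorySize.
    @Test
    public void test512M() {
        allocDirectMemory(2);
    }

    // 4 * 256 MB = 1 GB, above the configured MaxDirectMemorySize.
    @Test
    public void test1G() {
        allocDirectMemory(4);
    }
}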
Run the 512M and 1G tests separately; 512 MB is the configured maximum direct memory.
Result of the 512M run: [success]
-XX:CMSFullGCsBeforeCompaction=0 -XX:CMSInitiatingOccupancyFraction=98 -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:InitialHeapSize=268435456 -XX:MaxDirectMemorySize=536870912 -XX:MaxHeapSize=4294967296 -XX:MaxNewSize=697933824 -XX:MaxTenuringThreshold=6 -XX:OldPLABSize=16 -XX:+PrintCommandLineFlags -XX:+PrintGCDetails -XX:+UseCMSCompactAtFullCollection -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC 2*256M byte buffer success
Heap
par new generation total 78656K, used 27993K [0x00000006c0000000, 0x00000006c5550000, 0x00000006e9990000)
eden space 69952K, 40% used [0x00000006c0000000, 0x00000006c1b566e8, 0x00000006c4450000)
from space 8704K, 0% used [0x00000006c4450000, 0x00000006c4450000, 0x00000006c4cd0000)
to space 8704K, 0% used [0x00000006c4cd0000, 0x00000006c4cd0000, 0x00000006c5550000)
concurrent mark-sweep generation total 174784K, used 0K [0x00000006e9990000, 0x00000006f4440000, 0x00000007c0000000)
Metaspace used 5049K, capacity 5264K, committed 5504K, reserved 1056768K
class space used 584K, capacity 627K, committed 640K, reserved 1048576K
Result of the 1G run: [failure]
-XX:CMSFullGCsBeforeCompaction=0 -XX:CMSInitiatingOccupancyFraction=98 -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:InitialHeapSize=268435456 -XX:MaxDirectMemorySize=536870912 -XX:MaxHeapSize=4294967296 -XX:MaxNewSize=697933824 -XX:MaxTenuringThreshold=6 -XX:OldPLABSize=16 -XX:+PrintCommandLineFlags -XX:+PrintGCDetails -XX:+UseCMSCompactAtFullCollection -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC Java HotSpot(TM) 64-Bit Server VM warning: UseCMSCompactAtFullCollection is deprecated and will likely be removed in a future release.
Java HotSpot(TM) 64-Bit Server VM warning: CMSFullGCsBeforeCompaction is deprecated and will likely be removed in a future release.
objc[999]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_77.jdk/Contents/Home/bin/java and /Library/Java/JavaVirtualMachines/jdk1.8.0_77.jdk/Contents/Home/jre/lib/libinstrument.dylib. One of the two will be used. Which one is undefined.
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:693)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at com.sankuai.meituan.waimai.wdc.TestDirectByteBuffer.allocDirectMemory(TestDirectByteBuffer.java:33)
at com.sankuai.meituan.waimai.wdc.TestDirectByteBuffer.test1G(TestDirectByteBuffer.java:45)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:237)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Heap
par new generation total 78656K, used 27993K [0x00000006c0000000, 0x00000006c5550000, 0x00000006e9990000)
eden space 69952K, 40% used [0x00000006c0000000, 0x00000006c1b56718, 0x00000006c4450000)
from space 8704K, 0% used [0x00000006c4450000, 0x00000006c4450000, 0x00000006c4cd0000)
to space 8704K, 0% used [0x00000006c4cd0000, 0x00000006c4cd0000, 0x00000006c5550000)
concurrent mark-sweep generation total 174784K, used 0K [0x00000006e9990000, 0x00000006f4440000, 0x00000007c0000000)
Metaspace used 5103K, capacity 5264K, committed 5504K, reserved 1056768K
class space used 591K, capacity 627K, committed 640K, reserved 1048576K
So if we do not set DisableExplicitGC and let explicit GC calls run to release direct memory, does that solve the problem introduced by -XX:MaxDirectMemorySize?
Without further ado, here is another program:
JVM flags: -XX:+UseConcMarkSweepGC -XX:CMSFullGCsBeforeCompaction=0 -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=98 -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+DisableExplicitGC -XX:MaxDirectMemorySize=512M
@Test
public void testReleaseDirectMemory(){
int i=0;
while (true){
i++;
ByteBuffer.allocateDirect(256 * 1024 * 1024);
System.out.println("alloc "+i+"*256M DirectMemory");
}
}
Output: an exception is thrown once usage exceeds 512 MB.
alloc 1*256M DirectMemory
alloc 2*256M DirectMemory
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:693)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at com.sankuai.meituan.waimai.wdc.TestDirectByteBuffer.testReleaseDirectMemory(TestDirectByteBuffer.java:53)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:237)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Now remove -XX:+DisableExplicitGC and try again: when usage exceeds 512 MB, a full GC is invoked to clean up direct memory, so the direct memory space can be reused over and over.
alloc 1*256M DirectMemory
alloc 2*256M DirectMemory
[Full GC (System.gc()) [CMS: 0K->2523K(174784K), 0.0355019 secs] 28010K->2523K(253440K), [Metaspace: 5038K->5038K(1056768K)], 0.0356513 secs] [Times: user=0.02 sys=0.01, real=0.04 secs]
alloc 3*256M DirectMemory
alloc 4*256M DirectMemory
[Full GC (System.gc()) [CMS: 2523K->1496K(174784K), 0.0181570 secs] 5324K->1496K(253504K), [Metaspace: 5045K->5045K(1056768K)], 0.0182564 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
alloc 5*256M DirectMemory
alloc 6*256M DirectMemory
[Full GC (System.gc()) [CMS: 1496K->1483K(174784K), 0.0098698 secs] 4296K->1483K(253504K), [Metaspace: 5046K->5046K(1056768K)], 0.0099183 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
alloc 7*256M DirectMemory
alloc 8*256M DirectMemory
[Full GC (System.gc()) [CMS: 1483K->1483K(174784K), 0.0074244 secs] 2883K->1483K(253504K), [Metaspace: 5046K->5046K(1056768K)], 0.0074780 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
alloc 9*256M DirectMemory
alloc 10*256M DirectMemory
[Full GC (System.gc()) [CMS: 1483K->1483K(174784K), 0.0089791 secs] 2883K->1483K(253504K), [Metaspace: 5046K->5046K(1056768K)], 0.0090303 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
alloc 11*256M DirectMemory
alloc 12*256M DirectMemory
[Full GC (System.gc()) [CMS: 1483K->1483K(174784K), 0.0113600 secs] 2883K->1483K(253504K), [Metaspace: 5046K->5046K(1056768K)], 0.0114121 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
alloc 13*256M DirectMemory
alloc 14*256M DirectMemory
[Full GC (System.gc()) [CMS: 1483K->1483K(174784K), 0.0074218 secs] 2883K->1483K(253504K), [Metaspace: 5046K->5046K(1056768K)], 0.0074676 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
alloc 15*256M DirectMemory
alloc 16*256M DirectMemory
Let's see why a GC is triggered:
public static ByteBuffer allocateDirect(int capacity) {
return new DirectByteBuffer(capacity);
}
// Before allocating direct memory, the constructor calls Bits.reserveMemory(size, cap), which may call System.gc() to try to free memory, until …
DirectByteBuffer(int cap) { // package-private
super(-1, 0, cap, cap);
boolean pa = VM.isDirectMemoryPageAligned();
int ps = Bits.pageSize();
long size = Math.max(1L, (long)cap + (pa ? ps : 0));
Bits.reserveMemory(size, cap);
long base = 0;
try {
base = unsafe.allocateMemory(size);
} catch (OutOfMemoryError x) {
Bits.unreserveMemory(size, cap);
throw x;
}
unsafe.setMemory(base, size, (byte) 0);
if (pa && (base % ps != 0)) {
// Round up to page boundary
address = base + ps - (base & (ps - 1));
} else {
address = base;
}
cleaner = Cleaner.create(this, new Deallocator(base, size, cap));
att = null;
}
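// reserveMemory first checks via tryReserveMemory whether enough direct memory is available; if so it returns without calling System.gc(), otherwise it calls System.gc().
// The while loop that follows retries up to 9 times, checking whether the collector has reclaimed direct memory; if it has, the new memory is allocated, otherwise OutOfMemoryError("Direct buffer memory") is thrown.
// In my own experiments, normal runs always succeeded; failures only occurred in debug mode, presumably because the JVM did not get enough time to finish collecting.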
static void reserveMemory(long size, int cap) {
if (!memoryLimitSet && VM.isBooted()) {
maxMemory = VM.maxDirectMemory();
memoryLimitSet = true;
}
// optimist!
if (tryReserveMemory(size, cap)) {
return;
}
final JavaLangRefAccess jlra = SharedSecrets.getJavaLangRefAccess();
// retry while helping enqueue pending Reference objects
// which includes executing pending Cleaner(s) which includes
// Cleaner(s) that free direct buffer memory
while (jlra.tryHandlePendingReference()) {
if (tryReserveMemory(size, cap)) {
return;
}
}
// trigger VM's Reference processing
System.gc();
// a retry loop with exponential back-off delays
// (this gives VM some time to do it's job)
boolean interrupted = false;
try {
long sleepTime = 1;
int sleeps = 0;
while (true) {
if (tryReserveMemory(size, cap)) {
return;
}
if (sleeps >= MAX_SLEEPS) {
break;
}
if (!jlra.tryHandlePendingReference()) {
try {
Thread.sleep(sleepTime);
sleepTime <<= 1;
sleeps++;
} catch (InterruptedException e) {
interrupted = true;
}
}
}
// no luck
throw new OutOfMemoryError("Direct buffer memory");
} finally {
if (interrupted) {
// don't swallow interrupts
Thread.currentThread().interrupt();
}
}
}
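The retry loop above gives the collector only a bounded amount of time. The following sketch computes that bound, assuming `MAX_SLEEPS` is 9 (the value the notes above mention for this JDK 8 source) and that the sleep starts at 1 ms and doubles each time, as in the listing. The class name `ReserveBackoff` is made up for illustration.

```java
final class ReserveBackoff {
    /** Total sleep time of a loop that starts at 1 ms and doubles, performed maxSleeps times. */
    static long totalBackoffMillis(int maxSleeps) {
        long sleepTime = 1;
        long total = 0;
        for (int sleeps = 0; sleeps < maxSleeps; sleeps++) {
            total += sleepTime;
            sleepTime <<= 1; // same doubling as in Bits.reserveMemory
        }
        return total;
    }

    public static void main(String[] args) {
        // 1 + 2 + 4 + ... + 256 = 511 ms for nine sleeps
        System.out.println("max wait = " + totalBackoffMillis(9) + " ms");
    }
}
```

About half a second is the most the allocating thread waits for the Cleaner to release memory before it gives up, which is consistent with failures showing up only when the JVM is slowed down, for example under a debugger.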
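So on the production machine we removed -XX:+DisableExplicitGC and set -XX:MaxDirectMemorySize=512M. The machine still swapped abnormally, and no Direct buffer memory error appeared, which also rules out a direct memory leak from Unsafe allocations as the cause of the swapping.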
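Direct buffer usage can also be watched directly instead of waiting for an OutOfMemoryError. This is a minimal sketch using the standard `BufferPoolMXBean`; the pool name "direct" is what HotSpot uses, and the class name `DirectMemoryProbe` is made up for illustration. Note that this pool only accounts for memory reserved through `ByteBuffer.allocateDirect`, not for raw `Unsafe.allocateMemory` calls made elsewhere.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

final class DirectMemoryProbe {
    /** Bytes currently held by direct ByteBuffers, or -1 if no "direct" pool is exposed. */
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20); // 1 MB, kept reachable below
        long after = directBytesUsed();
        System.out.println("direct memory before = " + before + " bytes");
        System.out.println("direct memory after  = " + after + " bytes");
        System.out.println("buffer capacity      = " + buffer.capacity() + " bytes");
    }
}
```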
First, swap usage: the JVM is using about 1.62 GB (the swap area is at most 2 GB).
For how to get the swap usage of each process, see: http://blog.csdn.net/zgs_shmily/article/details/51192308
[xxx@xxxx ~]$ ./check_swap_used.sh
PID Swap Proc_Name
3330 928KB app1
20056 1.14MB app2
2775 5.77MB app3
2724 6.50MB app4
20071 7.61MB app5
27035 10.96MB app6
1652 19.95MB app7
3082 1.62GB /usr/local/java8/bin/java -server -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -Djava.net.preferIPv6Addresses=false xxxxxx
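A per-process swap figure like the one above can also be read on Linux from the `VmSwap` field of `/proc/<pid>/status`. The sketch below is not the referenced script, whose contents are not shown here; it is a stand-alone illustration with made-up names (`ProcSwap`, `parseVmSwapKb`) that parses that field for the current process.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

final class ProcSwap {
    /** Returns the VmSwap value in kB from the lines of a /proc/<pid>/status file, or -1 if absent. */
    static long parseVmSwapKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmSwap:")) {
                // Typical form: "VmSwap:     1698693 kB"
                String[] parts = line.substring("VmSwap:".length()).trim().split("\\s+");
                return Long.parseLong(parts[0]);
            }
        }
        return -1L;
    }

    public static void main(String[] args) throws IOException {
        Path status = Paths.get("/proc/self/status");
        if (!Files.isReadable(status)) {
            System.out.println("/proc is not available on this system");
            return;
        }
        System.out.println("VmSwap = " + parseVmSwapKb(Files.readAllLines(status)) + " kB");
    }
}
```

To inspect another process, replace `self` with its PID, subject to the usual file permissions.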
Next, swap usage for the whole machine: sar -S
Swap usage over the recent period: at 5:40 PM, swap used was about 1798980 KB = 1.72 GB.
04:10:01 PM 970024 1126416 53.73 116092 10.31
04:20:01 PM 970988 1125452 53.68 117096 10.40
04:30:01 PM 971248 1125192 53.67 117504 10.44
04:40:01 PM 977120 1119320 53.39 118496 10.59
04:50:01 PM 978288 1118152 53.34 118392 10.59
05:00:01 PM 979056 1117384 53.30 118612 10.62
05:10:01 PM 979388 1117052 53.28 118940 10.65
05:20:01 PM 741080 1355360 64.65 495932 36.59
05:30:02 PM 741804 1354636 64.62 496300 36.64
05:30:02 PM kbswpfree kbswpused %swpused kbswpcad %swpcad
05:40:01 PM 297460 1798980 85.81 13300 0.74
05:50:01 PM 304900 1791540 85.46 18432 1.03
06:00:02 PM 305560 1790880 85.42 20156 1.13
06:10:01 PM 306412 1790028 85.38 22360 1.25
06:20:01 PM 306788 1789652 85.37 23404 1.31
Average: 679484 1416956 67.59 97168 6.86
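The `%swpused` column can be reproduced from the `kbswpfree` and `kbswpused` columns, which makes it easy to check a sample against an alert threshold. A small sketch with a made-up class name, using the 05:40:01 PM row as input:

```java
import java.util.Locale;

final class SwapUsage {
    /** Percentage of swap in use, computed from the free and used amounts in kB. */
    static double percentUsed(long kbFree, long kbUsed) {
        return 100.0 * kbUsed / (kbFree + kbUsed);
    }

    /** Converts kB to GB using 1 GB = 1024 * 1024 kB. */
    static double kbToGb(long kb) {
        return kb / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long kbFree = 297460L;  // kbswpfree at 05:40:01 PM
        long kbUsed = 1798980L; // kbswpused at 05:40:01 PM
        System.out.printf(Locale.ROOT, "swap used = %.2f GB (%.2f%%)%n",
                kbToGb(kbUsed), percentUsed(kbFree, kbUsed));
    }
}
```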
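Next, the number of JVM threads: at 17:40 there is a peak of 650 threads, which accounts for roughly 650 MB of swap (at about 1 MB of stack per thread).
Looking at swap alerts and thread counts over the last three days, every swap anomaly coincides with a surge in thread count, so the swap anomalies are clearly related to request volume.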
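The thread-count estimate can be written down as a rough calculation. The sketch below assumes a stack size of 1 MB per thread, which is the usual HotSpot default on 64-bit Linux when `-Xss` is not set; the real resident size of a stack is normally smaller, so treat the result as an upper bound. The class name is made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

final class ThreadStackEstimate {
    /** Upper bound on stack memory: number of threads times the reserved stack size per thread. */
    static long estimateStackBytes(int threadCount, long stackBytesPerThread) {
        return (long) threadCount * stackBytesPerThread;
    }

    public static void main(String[] args) {
        long assumedStackBytes = 1L << 20; // assumption: 1 MB per thread
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        int live = threads.getThreadCount();
        int peak = threads.getPeakThreadCount();
        System.out.println("live threads = " + live + ", peak = " + peak);
        System.out.println("stack bound at peak = "
                + estimateStackBytes(peak, assumedStackBytes) / (1L << 20) + " MB");
        System.out.println("stack bound for 650 threads = "
                + estimateStackBytes(650, assumedStackBytes) / (1L << 20) + " MB");
    }
}
```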
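Now the machine's overall physical memory usage from top:
The JVM's memory usage is: 5 GB (heap) + 0.5 GB (off-heap) + 0.7 GB (thread stacks) + 0.5 GB (metaspace) + other memory + Linux kernel memory. Add the physical memory taken by other processes and swap is bound to be used. Swapping costs CPU, so performance drops under high QPS.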
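The budget above can be checked with simple arithmetic. In this sketch the four JVM figures come from the paragraph above; the physical memory figure of 8059416 kB is taken from the top output shown earlier and is assumed to apply to this 8 GB machine as well. The "other" and kernel parts are unknown and are left out, so the real headroom is smaller than what is printed.

```java
import java.util.Locale;

final class MemoryBudget {
    /** Sums a list of sizes given in GB. */
    static double sumGb(double... partsGb) {
        double total = 0.0;
        for (double part : partsGb) {
            total += part;
        }
        return total;
    }

    /** Converts kB to GB using 1 GB = 1024 * 1024 kB. */
    static double kbToGb(long kb) {
        return kb / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        double jvmGb = sumGb(5.0, 0.5, 0.7, 0.5); // heap, off-heap, stacks, metaspace
        double physicalGb = kbToGb(8059416L);     // assumed: "Mem: 8059416k total" from top
        System.out.printf(Locale.ROOT, "JVM (known parts) = %.1f GB%n", jvmGb);
        System.out.printf(Locale.ROOT, "physical memory   = %.2f GB%n", physicalGb);
        System.out.printf(Locale.ROOT, "left for kernel and other processes = %.2f GB%n",
                physicalGb - jvmGb);
    }
}
```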
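That leaves two ways to fix the problem: 1) check whether threads are being misused; 2) request more memory for the machine.
Swap anomaly on another machine, for comparison:
Alert:
[P2][Fault]
Host: xxxxx
Metric: all(#2) mem.swapused.percent > 80
Current value: 80.68726
Owners: xxxx,xxxx
Alert template: WAIMAI_BASE
Alert count: 1st
Time: 2017-06-05 14:25:00
Duration: just now
[Call chain][dashboard][ACK]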
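jvm.thread.count (thread count):
Database connection count:
Machine load:
So we can finally conclude that swapping correlates with the JVM thread count and the machine's QPS. Next we need to look at how threads are used on the machine.
Option 1: check whether threads are misused in a way that consumes memory
Thread state distribution: jstack 3082 | grep 'java.lang.Thread.State' | awk '{print $2,$3,$4,$5}' | sort | uniq -c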
[xxx@xxxx ~]$ jstack 3082 | grep 'java.lang.Thread.State' | awk '{print $2,$3,$4,$5}' | sort | uniq -c
1 BLOCKED (on object monitor)
119 RUNNABLE
25 TIMED_WAITING (on object monitor)
106 TIMED_WAITING (parking)
26 TIMED_WAITING (sleeping)
15 WAITING (on object monitor)
207 WAITING (parking)
[xxx@xxxx ~]$
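A similar state histogram can be produced from inside the JVM without jstack. This is a minimal sketch with a made-up class name; it only sees Java threads of the current process and does not distinguish "parking" from "on object monitor" the way the jstack output does.

```java
import java.util.EnumMap;
import java.util.Map;

final class ThreadStateHistogram {
    /** Counts the given threads by their current Thread.State. */
    static Map<Thread.State, Integer> countByState(Iterable<Thread> threads) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread thread : threads) {
            counts.merge(thread.getState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // getAllStackTraces() returns a snapshot of all live Java threads.
        Map<Thread.State, Integer> counts = countByState(Thread.getAllStackTraces().keySet());
        counts.forEach((state, count) -> System.out.println(count + " " + state));
    }
}
```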
Count of threads by name prefix:
[xxx@xxxx ~]$ jstack 3082 | grep 'tid.*nid' | awk -F '"' '{print $2}' | cut -d - -f 1 | cut -d# -f 1 | sort | uniq -c | sort -k1 -nr
80 Thread
53 DynamicAgentCluster
44 Pigeon
24 com.sankuai.meituan.waimai.wdc.service.IRelationService
20 jetty
18 MySQL Statement Cancellation Timer
18 jedis
16 New I/O worker
12 pool
10 Curator
10 com.sankuai.meituan.waimai.wdc.service.IWmTagService
10 com.sankuai.meituan.waimai.wdc.service.IWdcPoiQueryService
10 com.sankuai.meituan.waimai.wdc.service.IWdcMergeGroupService
10 com.sankuai.meituan.waimai.wdc.service.ISubversiveService
10 com.sankuai.meituan.waimai.wdc.service.ISimilarPoiRecommendService
10 com.sankuai.meituan.waimai.wdc.service.IPublicSeaPoiReportAuditService
10 com.sankuai.meituan.waimai.wdc.service.IPoiSegmentService
10 com.sankuai.meituan.waimai.wdc.service.IOuterRelationService
10 com.sankuai.meituan.waimai.wdc.service.IGeoCodeService
10 com.sankuai.meituan.waimai.wdc.service.IBrandService
10 com.sankuai.meituan.waimai.wdc.service.IBrandRelationService
10 com.sankuai.meituan.waimai.wdc.service.IAutoAuditService
8 main
8 elasticsearch[Big Wheel][transport_client_worker][T
8 api
5 XMDFileAppender
5 tair
4 Gang worker
4 Druid
3 cat
3 AsyncAppender
2 Timer
2 TAsyncClientManager
2 metrics
2 avatar
1 VM Thread
1 VM Periodic Task Thread
1 TraceCollector
1 threadDeathWatcher
1 Tair
1 Surrogate Locker Thread (Concurrent GC)
1 Squirrel
1 squirrel
1 Signal Dispatcher
1 Service Thread
1 Reference Handler
1 pollingConfigurationSource
1 org.eclipse.jetty.util.RolloverFileOutputStream
1 nioEventLoopGroup
1 New I/O server boss
1 New I/O boss
1 mtthrift
1 mtrace
1 MnsCacheManager
1 mafka
1 lion
1 JMonitor Http Agent Sender for app[]
1 jmonitor
1 HashSessionScavenger
1 Hashed wheel timer
1 Finalizer
1 FalconCollect
1 elasticsearch[Big Wheel][transport_client_timer][T
1 elasticsearch[Big Wheel][transport_client_boss][T
1 elasticsearch[Big Wheel][[timer]]
1 elasticsearch[Big Wheel][scheduler][T
1 elasticsearch[Big Wheel][generic][T
1 DestroyJavaVM
1 ConfigCacheManager
1 Concurrent Mark
1 commons
1 C2 CompilerThread1
1 C2 CompilerThread0
1 C1 CompilerThread2
1 Attach Listener
1 AsyncLoggerConfig
1 AsyncLogger
1 Abandoned connection cleanup thread
1 1429208949@qtp
1 1171978040@qtp
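The grouping done by the `cut` pipeline above, which keeps the part of each thread name before the first `-` or `#`, can be sketched in Java as follows. The class and method names are made up for illustration, and unlike `cut` the sketch also trims trailing whitespace from the prefix.

```java
import java.util.Map;
import java.util.TreeMap;

final class ThreadNameGroups {
    /** Returns the part of a thread name before the first '-' or '#', trimmed. */
    static String prefix(String name) {
        int cut = name.length();
        int dash = name.indexOf('-');
        if (dash >= 0) {
            cut = Math.min(cut, dash);
        }
        int hash = name.indexOf('#');
        if (hash >= 0) {
            cut = Math.min(cut, hash);
        }
        return name.substring(0, cut).trim();
    }

    /** Counts thread names by prefix, sorted by prefix. */
    static Map<String, Integer> countByPrefix(Iterable<String> names) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : names) {
            counts.merge(prefix(name), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            counts.merge(prefix(thread.getName()), 1, Integer::sum);
        }
        counts.forEach((name, count) -> System.out.println(count + " " + name));
    }
}
```

A large count under one prefix, such as a pool created per client instance, is the kind of pattern this grouping is meant to surface.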
Threads in a waiting state:
[xxx@xxxx ~]$ jstack 3082 | grep -B1 'java.lang.Thread.State.*WAITING' | grep 'tid.*nid' | awk -F '"' '{print $2}' | cut -d - -f 1 | cut -d# -f 1 | sort | uniq -c | sort -k1 -nr
53 DynamicAgentCluster // a scheduleAtFixedRate thread that Thrift uses to fetch the server list for a given appkey from OCTO; every ThriftClient has one such thread
44 Pigeon
37 com.sankuai.meituan.waimai.wdc.service.IRelationService
18 MySQL Statement Cancellation Timer
16 jetty
15 Thread
15 jedis
12 pool
10 Curator
10 com.sankuai.meituan.waimai.wdc.service.IWmTagService
10 com.sankuai.meituan.waimai.wdc.service.IWdcPoiQueryService
10 com.sankuai.meituan.waimai.wdc.service.IWdcMergeGroupService
10 com.sankuai.meituan.waimai.wdc.service.ISubversiveService
10 com.sankuai.meituan.waimai.wdc.service.ISimilarPoiRecommendService
10 com.sankuai.meituan.waimai.wdc.service.IPublicSeaPoiReportAuditService
10 com.sankuai.meituan.waimai.wdc.service.IPoiSegmentService
10 com.sankuai.meituan.waimai.wdc.service.IOuterRelationService
10 com.sankuai.meituan.waimai.wdc.service.IGeoCodeService
10 com.sankuai.meituan.waimai.wdc.service.IBrandService
10 com.sankuai.meituan.waimai.wdc.service.IBrandRelationService
10 com.sankuai.meituan.waimai.wdc.service.IAutoAuditService
8 api
5 XMDFileAppender
4 main
4 Druid
3 cat
3 AsyncAppender
2 Timer
2 metrics
2 avatar
1 TraceCollector
1 threadDeathWatcher
1 Tair
1 Squirrel
1 squirrel
1 Reference Handler
1 pollingConfigurationSource
1 org.eclipse.jetty.util.RolloverFileOutputStream
1 mtthrift
1 mtrace
1 MnsCacheManager
1 mafka
1 lion
1 JMonitor Http Agent Sender for app[]
1 jmonitor
1 HashSessionScavenger
1 Hashed wheel timer
1 Finalizer
1 FalconCollect
1 elasticsearch[Big Wheel][transport_client_timer][T
1 elasticsearch[Big Wheel][[timer]]
1 elasticsearch[Big Wheel][scheduler][T
1 elasticsearch[Big Wheel][generic][T
1 ConfigCacheManager
1 commons
1 AsyncLoggerConfig
1 AsyncLogger
1 Abandoned connection cleanup thread
1 1429208949@qtp
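Setting aside the normal RUNNABLE threads, we need to examine the waiting threads for unreasonable usage:
Option 2: request more physical memory
In the end we expanded the machine from 8 GB to 16 GB of memory, which resolved the swap anomaly:
8 GB machine usage (10 minutes after the machine was restarted):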