警惕使用jvm参数CMSRefProcTaskProxy

阅读更多
    昨天中午的时候, 团队的兄弟找我看一个现象: 原先因为堆外内存使用过多会crash掉的java应用, 设置了最大堆外内存量(MaxDirectMemorySize)后jvm不会crash, 但出现了机器的两颗CPU全部被占满, 而且java程序没有响应的情况.
    我用jstat -gc/-gcutil/-gccause查了一下当时gc的情况, 发现出现过CMS GC, 最后一次导致GC的原因是CMS final remark, 没有什么异常. 新生代和旧生代占用量都比较少, survior的from与to区域都正常. 这就比较诡异了, 如果因为堆外内存超出了MaxDirectMemorySize设置的值, 那会抛出OOM, 但这个没有抛出.
    检查了DisableExplicitGC参数,是否关闭了显式GC, 结果没有关闭. 这就更说不通了.
    于是我转向调查CPU使用率为什么这么高. 用top查了一下CPU有几个jvm的线程(top运行后, 用shift + h开启线程观察)占着CPU, 线程的ID分别是: 38024, 38025, 38026, 38027.
然后采用pstack查看这几个线程究竟在干什么. pstack了好几次, 每次这些线程的stack都差不多, 如下:

引用
Thread 220 (Thread 0x40b4c940 (LWP 38024)):
#0  0x00007f1444751b60 in SpinPause@plt () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#1  0x00007f14447fbf09 in ParallelTaskTerminator::offer_termination() () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#2  0x00007f1444c7fa87 in CMSRefProcTaskProxy::do_work_steal(int, CMSParDrainMarkingStackClosure*, CMSParKeepAliveClosure*, int*) () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#3  0x00007f1444c776e5 in CMSRefProcTaskProxy::work(int) () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#4  0x00007f1444b40e2a in GangWorker::loop() () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#5  0x00007f1444af8e98 in GangWorker::run() () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#6  0x00007f1444af8278 in java_start(Thread*) () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#7  0x000000375560677d in start_thread () from /lib64/libpthread.so.0
#8  0x0000003754ed49ad in clone () from /lib64/libc.so.6
Thread 219 (Thread 0x40708940 (LWP 38025)):
#0  0x0000003754ebb5a7 in sched_yield () from /lib64/libc.so.6
#1  0x00007f1444a267e9 in ParallelTaskTerminator::yield() () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#2  0x00007f14447fbfc9 in ParallelTaskTerminator::offer_termination() () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#3  0x00007f1444c7fa87 in CMSRefProcTaskProxy::do_work_steal(int, CMSParDrainMarkingStackClosure*, CMSParKeepAliveClosure*, int*) () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#4  0x00007f1444c776e5 in CMSRefProcTaskProxy::work(int) () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#5  0x00007f1444b40e2a in GangWorker::loop() () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#6  0x00007f1444af8e98 in GangWorker::run() () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#7  0x00007f1444af8278 in java_start(Thread*) () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#8  0x000000375560677d in start_thread () from /lib64/libpthread.so.0
#9  0x0000003754ed49ad in clone () from /lib64/libc.so.6
Thread 218 (Thread 0x40a3c940 (LWP 38026)):
#0  0x00007f1444751b60 in SpinPause@plt () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#1  0x00007f14447fbf09 in ParallelTaskTerminator::offer_termination() () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#2  0x00007f1444c7fa87 in CMSRefProcTaskProxy::do_work_steal(int, CMSParDrainMarkingStackClosure*, CMSParKeepAliveClosure*, int*) () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#3  0x00007f1444c776e5 in CMSRefProcTaskProxy::work(int)() from /opt/taobao/install
/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#4  0x00007f1444b40e2a in GangWorker::loop() () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#5  0x00007f1444af8e98 in GangWorker::run() () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#6  0x00007f1444af8278 in java_start(Thread*) () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#7  0x000000375560677d in start_thread () from /lib64/libpthread.so.0
#8  0x0000003754ed49ad in clone () from /lib64/libc.so.6
Thread 217 (Thread 0x40277940 (LWP 38027)):
#0  0x00007f1444871b29 in SpinPause () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#1  0x00007f14447fbf09 in ParallelTaskTerminator::offer_termination() () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#2  0x00007f1444c7fa87 in CMSRefProcTaskProxy::do_work_steal(int, CMSParDrainMarkingStackClosure*, CMSParKeepAliveClosure*, int*) () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#3  0x00007f1444c776e5 in CMSRefProcTaskProxy::work(int) () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#4  0x00007f1444b40e2a in GangWorker::loop() () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#5  0x00007f1444af8e98 in GangWorker::run() () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#6  0x00007f1444af8278 in java_start(Thread*) () from /opt/taobao/install/jdk-1.6.0_26/jre/lib/amd64/server/libjvm.so
#7  0x000000375560677d in start_thread () from /lib64/libpthread.so.0
#8  0x0000003754ed49ad in clone () from /lib64/libc.so.6


一眼就可以看出这些都是CMS GC的线程. 他们都停留在CMSRefProcTaskProxy::work下, GC的stack看上去还正常.奇怪的是为什么这么占CPU, 看样子一直在自旋(spin)或yield, 貌似在等待啥状态完成. 自旋锁是占CPU的利器.
要了一份gc log后,  看看gc log是否有啥线索. 发现一个非常诡异的地方:
引用
2012-10-30T11:44:56.808+0800: 2603.044: [CMS-concurrent-mark: 0.314/0.374 secs] [Times: user=0.45 sys=0.02, real=0.37 secs]
2012-10-30T11:44:56.808+0800: 2603.044: [CMS-concurrent-preclean-start]
2012-10-30T11:44:56.811+0800: 2603.047: [CMS-concurrent-preclean: 0.003/0.003 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2012-10-30T11:44:56.811+0800: 2603.047: [GC[YG occupancy: 13211 K (917504 K)]2603.047: [Rescan (parallel) , 0.0113650 secs]2603.059: [weak refs processing


    GC log没有remark, 更没有sweep阶段, 停留在了weak refs processing. 这一阶段正在回收弱引用, 属于remark阶段的一部分, 所以会暂停java应用(stop the world)的. 日志与pstack正好吻合, 正在做处理引用 (CMSRefProcTaskProxy).但处理引用, 为什么会使jvm hang住, 并且不停地自旋等待某状态完成呢?
    我用google查了一下, OTN上有两个人碰到了相同的问题: 链接一 链接二

    从上面的两个链接上看, 没有人能解释这个现象. 但有一个线索, 可以解决这个问题:

引用
Thank you for the advice. Unfortunatly, we are tied to old UTF behaviour which has been changed since 6u11. But, we have turned off ParallelRefProcEnabled and it looks like it helps. So, we are staing with it.


把ParallelRefProcEnabled参数给关闭就可以解决这个问题. 这是并行处理引用的参数,可以使用多GC线程处理引用. 我用
java -XX:+PrintFlagsFinal | grep ParallelRefProcEnabled

查看了一下, 它的默认值是false, 也就是默认关闭了. 向应用方要了一份JVM参数后发现, 线上这个参数是开着的.

    开启这个参数为什么会使CMSRefProcTaskProxy一直在自旋, 从而停止Java应用, 并用占用所有的CPU资源呢? 再次请教google大神, 用”ParallelRefProcEnabled hang”发现了这个 jvm的bug, 上面写着

引用
The CMSRefProcTaskProxy object needs its terminator object to be initialized to the correct number of threads. Otherwise, you get a hang or crash.

    reference processing线程个数与ParallelGCThreads参数来一样, 刚好此应用将ParallelGCThreads设为了4, 所以对应了pstack看到4个线程在处理引用的情况.  再次咨询了JVM团队,理解那句话的意思. 以上的terminator object是用来同步和管理gc线程的对象. 它会记录到目前为止已经完成的线程数_complete_threads , 当一个gc线程干完活后,他会把数_complete_threads+1,当terminator object确定已经完成的线程数_complete_threads==预先设置的所有的gc线程数_n_thread,所有的gc线程就会退出,否则其他的gc线程就会等待. 悲剧的是_n_thread在构造时为0, 后面一直没有被重设过. 所以只需要开启ParallelRefProcEnabled就会出问题. 现在能解释通了, 并且从刚才的pstack我们还可以发现停留在ParallelTaskTerminator::offer_termination()方法, 是这表示当前的gc线程没事干, 一直等待GC Terminator通知它结束, 所以它一直处于自旋锁的状态, 所以CPU才会这么高.
简单地关闭掉了ParallelRefProcEnabled之后, 以上这个现象就不会出现了. 这个bug在JDK7中已经解决, 根据应用团队的反馈, 线上此应用的机器有部分是JDK6u26,还有JDK6u30都出现过相同的现象. 官方说明JDK6u32已经fix掉这个bug, patch中显示_n_thread已经被正确地设为ygc的线程数,所以直到6u32的版本才能放心使用这个参数.

你可能感兴趣的:(jvm,java,cms)