Symptom: after the Storm cluster had been under load testing for an extended period, the load test was stopped, yet cluster CPU usage and JVM memory remained stubbornly high. After a while an out-of-memory error occurred and the workers crashed and restarted.
Conclusion: a code-quality defect caused a JVM memory leak. The old generation of the heap grew steadily until it crossed the Full GC threshold, after which the JVM ran Full GC over and over, yet each Full GC reclaimed very little heap. As the Storm topologies kept running, the old generation only grew larger, so the JVM stayed stuck in near-continuous Full GC, with each cycle taking roughly 3-4 seconds and consuming heavy CPU. Eventually the heap was exhausted, an out-of-memory error was thrown, and the worker restarted.
Analysis steps:
(1) On the machine running the Storm topologies, check the per-process resource usage for the toctest user with top -u toctest:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
339059 toctest 20 0 6722m 2.8g 10m S 250.3 36.4 744:08.49 java
536055 toctest 20 0 6744m 2.8g 10m S 260.6 36.7 527:18.76 java
147776 toctest 20 0 2208m 130m 9640 S 1.3 1.7 14:31.10 java
147576 toctest 20 0 3451m 165m 9748 S 0.3 2.1 28:18.00 java
376539 toctest 20 0 15128 1260 896 R 0.3 0.0 0:00.01 top
337578 toctest 20 0 105m 1784 1360 S 0.0 0.0 0:00.03 bash
375977 toctest 20 0 98.5m 564 496 S 0.0 0.0 0:00.00 sleep
376073 toctest 20 0 98.5m 564 496 S 0.0 0.0 0:00.00 sleep
549145 toctest 20 0 105m 2280 620 S 0.0 0.0 3:17.13 sh
549402 toctest 20 0 105m 2292 636 S 0.0 0.0 3:28.05 sh
(2) From the jps output below, PIDs 339059 and 536055 are the two Storm topology worker processes, i.e. the heavy CPU consumers seen above:
[toctest@SHB-L0064266 apache-storm-0.9.6]$ jps
147776 supervisor
536055 worker
147576 nimbus
339059 worker
373921 Jps
(3) Examine the per-thread CPU usage inside 339059 and 536055; here we drill into 339059. Threads 339174, 339157, 339434, and 339678 stand out, each consuming a large share of the CPU:
[toctest@SHB-L0064266 apache-storm-0.9.6]$ top -Hp 339059
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
339174 toctest 20 0 7074m 2.8g 10m S 45.2 36.9 8:06.44 java
339157 toctest 20 0 7074m 2.8g 10m S 46.9 36.9 8:07.80 java
339434 toctest 20 0 7074m 2.8g 10m S 40.2 36.9 3:02.11 java
339678 toctest 20 0 7074m 2.8g 10m S 48.2 36.9 7:07.79 java
339258 toctest 20 0 7074m 2.8g 10m R 3.6 36.9 2:47.58 java
339348 toctest 20 0 7074m 2.8g 10m S 3.6 36.9 2:54.00 java
339537 toctest 20 0 7074m 2.8g 10m R 3.6 36.9 2:54.93 java
339539 toctest 20 0 7074m 2.8g 10m S 3.6 36.9 2:54.21 java
339587 toctest 20 0 7074m 2.8g 10m S 3.6 36.9 2:57.73 java
339589 toctest 20 0 7074m 2.8g 10m S 3.6 36.9 3:13.32 java
339316 toctest 20 0 7074m 2.8g 10m S 3.3 36.9 3:05.23 java
339476 toctest 20 0 7074m 2.8g 10m S 3.3 36.9 2:54.30 java
339637 toctest 20 0 7074m 2.8g 10m S 3.3 36.9 3:03.73 java
339673 toctest 20 0 7074m 2.8g 10m S 3.3 36.9 3:04.98 java
339180 toctest 20 0 7074m 2.8g 10m S 2.9 36.9 4:17.60 java
339262 toctest 20 0 7074m 2.8g 10m S 2.9 36.9 3:04.34 java
339671 toctest 20 0 7074m 2.8g 10m S 2.9 36.9 2:55.69 java
339396 toctest 20 0 7074m 2.8g 10m S 2.3 36.9 3:29.77 java
339571 toctest 20 0 7074m 2.8g 10m S 2.3 36.9 2:57.56 java
339635 toctest 20 0 7074m 2.8g 10m S 2.3 36.9 2:58.96 java
339170 toctest 20 0 7074m 2.8g 10m S 2.0 36.9 11:32.28 java
339268 toctest 20 0 7074m 2.8g 10m S 2.0 36.9 3:32.54 java
339508 toctest 20 0 7074m 2.8g 10m S 2.0 36.9 3:01.55 java
339555 toctest 20 0 7074m 2.8g 10m R 2.0 36.9 3:12.25 java
339569 toctest 20 0 7074m 2.8g 10m S 2.0 36.9 3:01.06 java
339310 toctest 20 0 7074m 2.8g 10m S 1.6 36.9 3:04.24 java
339619 toctest 20 0 7074m 2.8g 10m S 1.6 36.9 3:12.63 java
339067 toctest 20 0 7074m 2.8g 10m S 1.3 36.9 1:13.31 java
339330 toctest 20 0 7074m 2.8g 10m S 1.3 36.9 0:31.87 java
......
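As a cross-check, per-thread CPU time can also be sampled from inside the JVM with the standard ThreadMXBean API, for example from a hypothetical utility thread embedded in the worker. Note that this view only covers Java-level threads; VM-internal threads such as GC and JIT workers never show up in it, a detail that becomes important in step (5).

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical cross-check helper: dump CPU time per Java-level thread.
// VM-internal threads (GC, JIT compiler) are not visible through this API.
public class ThreadCpuDump {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);
            if (info != null) {
                // getThreadCpuTime returns nanoseconds, or -1 if disabled
                System.out.printf("%-40s cpu=%d ms%n",
                        info.getThreadName(), mx.getThreadCpuTime(id) / 1000000);
            }
        }
    }
}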
(4) Dump the thread stacks of process 339059:
[toctest@SHB-L0064266 apache-storm-0.9.6]$ jstack 339059
..........
"Service Thread" daemon prio=10 tid=0x00002b180809a800 nid=0x52c8f runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread1" daemon prio=10 tid=0x00002b1808098000 nid=0x52c8e waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread0" daemon prio=10 tid=0x00002b1808095000 nid=0x52c8c waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x00002b1808092800 nid=0x52c8b runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x00002b1808072000 nid=0x52c84 in Object.wait() [0x00002b1829ca1000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
- locked <0x000000077ffe96f0> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)
"Reference Handler" daemon prio=10 tid=0x00002b1808070000 nid=0x52c83 in Object.wait() [0x00002b1829ba0000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:503)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
- locked <0x000000077ffe8fe0> (a java.lang.ref.Reference$Lock)
"VM Thread" prio=10 tid=0x00002b180806b800 nid=0x52c81 runnable
"GC task thread#0 (ParallelGC)" prio=10 tid=0x00002b1808021800 nid=0x52ce6 runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x00002b1808023000 nid=0x52cd5 runnable
"GC task thread#2 (ParallelGC)" prio=10 tid=0x00002b1808025000 nid=0x52dea runnable
"GC task thread#3 (ParallelGC)" prio=10 tid=0x00002b1808027000 nid=0x52dee runnable
"VM Periodic Task Thread" prio=10 tid=0x00002b18080b5000 nid=0x52c90 waiting on condition
JNI global references: 544
(5) Convert the hot thread IDs 339174, 339157, 339434, and 339678 to hexadecimal: 0x52ce6, 0x52cd5, 0x52dea, and 0x52dee. Searching the dump above for those nid values shows that the heavy CPU consumers are the JVM's own ParallelGC task threads, not application threads. For GC to burn this much CPU, the likely explanation is that the JVM is running Full GC almost continuously.
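The conversion is easy to double-check programmatically, since jstack prints each thread's OS thread ID as the hexadecimal nid field. A throwaway sketch (class name illustrative):

public class NidHex {
    public static void main(String[] args) {
        // Thread IDs reported by top -Hp are decimal; jstack's nid field is hex.
        int[] hotThreads = {339174, 339157, 339434, 339678};
        for (int tid : hotThreads) {
            // prints e.g. "339174 -> nid=0x52ce6"
            System.out.println(tid + " -> nid=0x" + Integer.toHexString(tid));
        }
    }
}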
(6) Verify locally whether the topology code leaks memory.
6.1) Launch the topology in local mode from Eclipse with the JVM arguments below (a sketch for attaching equivalent flags to cluster workers follows the list):
-Xms128m
-Xmx128m
-Djava.rmi.server.hostname=127.0.0.7
-Dcom.sun.management.jmxremote.port=9801
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-Xloggc:D:\Users\ZHOUSHANBIN326\Desktop\gc-log\gc.log
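A side note before moving on: in a cluster run (as opposed to local mode), roughly equivalent GC-logging flags can be attached to the worker JVMs through Storm's topology.worker.childopts setting. A minimal sketch, with the log path purely illustrative:

import backtype.storm.Config;

// Sketch: attach GC-logging flags to cluster worker JVMs via
// topology.worker.childopts (the log path below is illustrative).
public class WorkerGcFlags {
    public static Config withGcLogging() {
        Config conf = new Config();
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
                "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC"
                + " -Xloggc:/tmp/gc-worker.log");
        return conf;
    }
}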
6.2) Attach VisualVM (download it from http://visualvm.github.io/ if needed) via the JMX port configured above and analyze the heap: if old-generation usage keeps climbing even after Full GCs, the leak is confirmed.
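For reference, the failure mode can be reproduced end-to-end under the same local-mode setup with a deliberately broken topology. The sketch below is hypothetical (all class names are invented; Storm 0.9.x backtype.storm API): the bolt pins every payload in a static list, so with -Xmx128m the old generation fills within minutes and VisualVM shows the same pattern as the conclusion above, namely Full GCs that reclaim almost nothing.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class LeakTopology {

    // Spout: emits a 1 KB payload per call, as fast as Storm schedules it.
    public static class PayloadSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            collector.emit(new Values(new byte[1024]));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("payload"));
        }
    }

    // Bolt with the classic defect: every payload is pinned in a static list
    // that is never cleared, so nothing ever becomes eligible for collection.
    public static class LeakyBolt extends BaseRichBolt {
        private static final List<byte[]> CACHE = new ArrayList<byte[]>();
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        }
        public void execute(Tuple tuple) {
            CACHE.add((byte[]) tuple.getValue(0));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("payload-spout", new PayloadSpout());
        builder.setBolt("leaky-bolt", new LeakyBolt()).shuffleGrouping("payload-spout");
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("leak-test", new Config(), builder.createTopology());
        Utils.sleep(10 * 60 * 1000); // run long enough to watch the heap climb in VisualVM
        cluster.shutdown();
    }
}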
(7) As for the deeper reasons why JVM memory leaks occur, plenty of material is available online, so we won't expand on it here.