...
2018-03-01 17:32:16,243 WARN org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 16100ms
GC pool 'ParNew' had collection(s): count=1 time=0ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=16378ms
-XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Djava.net.preferIPv4Stack=true
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
HBASE_OFFHEAPSIZE=5G
HBASE_OPTS=-XX:MaxDirectMemorySize=18G
hbase.bucketcache.ioengine = offheap
hbase.bucketcache.size = 4G
从当前配置中可以看出,HBase的RegionServer java heap size为8G,使用的垃圾回收机制为(old) CMS + (new) ParNew收集器.
HBase RegionServer 的 Java 配置选项:
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
主要新增最后两个,打印GC日志和时间戳.结果如下:
6.441: [GC (CMS Initial Mark) [1 CMS-initial-mark: 0K(6855104K)] 662500K(8235264K), 0.1886994 secs] [Times: user=0.39 sys=0.09, real=0.19 secs]
6.615: [CMS-concurrent-mark-start]
6.617: [CMS-concurrent-mark: 0.002/0.002 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
6.618: [CMS-concurrent-preclean-start]
6.649: [CMS-concurrent-preclean: 0.032/0.032 secs] [Times: user=0.06 sys=0.00, real=0.04 secs]
6.650: [CMS-concurrent-abortable-preclean-start]
CMS: abort preclean due to time 11.757: [CMS-concurrent-abortable-preclean: 1.883/5.108 secs] [Times: user=5.99 sys=0.21, real=5.10 secs]
11.758: [GC (CMS Final Remark) [YG occupancy: 981477 K (1380160 K)]11.758: [Rescan (parallel) , 0.5508233 secs]12.309: [weak refs processing, 0.0000426 secs]12.309: [class unloading, 0.0197898 secs]12.329: [scrub symbol table, 0.0108038 secs]12.340: [scrub string table, 0.0009889 secs][1 CMS-remark: 0K(6855104K)] 981477K(8235264K), 0.5841155 secs] [Times: user=8.36 sys=1.42, real=0.58 secs]
12.343: [CMS-concurrent-sweep-start]
12.343: [CMS-concurrent-sweep: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
12.343: [CMS-concurrent-reset-start]
12.407: [CMS-concurrent-reset: 0.064/0.064 secs] [Times: user=0.02 sys=0.05, real=0.06 secs]
75.744: [GC (Allocation Failure) 75.744: [ParNew: 1226816K->38190K(1380160K), 0.0605172 secs] 1226816K->38190K(8235264K), 0.0606329 secs] [Times: user=0.30 sys=0.17, real=0.06 secs]
119.831: [GC (Allocation Failure) 119.832: [ParNew: 1265006K->54410K(1380160K), 0.0673169 secs] 1265006K->54410K(8235264K), 0.0675545 secs] [Times: user=0.69 sys=0.19, real=0.07 secs]
131.722: [GC (Allocation Failure) 131.723: [ParNew: 1281226K->128075K(1380160K), 0.0756202 secs] 1281226K->128075K(8235264K), 0.0764645 secs] [Times: user=0.81 sys=0.33, real=0.08 secs]
138.073: [GC (Allocation Failure) 138.073: [ParNew: 1354891K->153344K(1380160K), 0.3575232 secs] 1354891K->185778K(8235264K), 0.3578412 secs] [Times: user=1.52 sys=0.25, real=0.36 secs]
177.291: [GC (Allocation Failure) 177.291: [ParNew: 1380160K->153344K(1380160K), 0.3148596 secs] 1412594K->352587K(8235264K), 0.3154605 secs] [Times: user=1.53 sys=0.47, real=0.31 secs]
194.529: [GC (Allocation Failure) 194.529: [ParNew: 1380160K->153344K(1380160K), 0.2991742 secs] 1579403K->528500K(8235264K), 0.2998621 secs] [Times: user=3.39 sys=0.43, real=0.30 secs]
... x13
267.073: [GC (Allocation Failure) 267.073: [ParNew: 1380160K->153344K(1380160K), 0.6624354 secs] 5809968K->5070515K(8235264K), 0.6626810 secs] [Times: user=8.93 sys=0.50, real=0.66 secs]
267.737: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4917171K(6855104K)] 5070681K(8235264K), 0.0235663 secs] [Times: user=0.30 sys=0.00, real=0.03 secs]
267.761: [CMS-concurrent-mark-start]
268.241: [CMS-concurrent-mark: 0.480/0.480 secs] [Times: user=3.62 sys=0.20, real=0.48 secs]
268.241: [CMS-concurrent-preclean-start]
268.311: [CMS-concurrent-preclean: 0.068/0.070 secs] [Times: user=0.26 sys=0.01, real=0.07 secs]
268.311: [CMS-concurrent-abortable-preclean-start]
268.861: [GC (Allocation Failure) 268.861: [ParNew: 1380160K->153344K(1380160K), 0.7025517 secs] 6297331K->5566823K(8235264K), 0.7027702 secs] [Times: user=9.85 sys=0.48, real=0.70 secs]
270.261: [CMS-concurrent-abortable-preclean: 1.241/1.950 secs] [Times: user=13.50 sys=0.65, real=1.95 secs]
270.262: [GC (CMS Final Remark) [YG occupancy: 972096 K (1380160 K)]270.262: [Rescan (parallel) , 0.1256659 secs]270.388: [weak refs processing, 0.0049507 secs]270.393: [class unloading, 0.0205177 secs]270.414: [scrub symbol table, 0.0079385 secs]270.422: [scrub string table, 0.0010636 secs][1 CMS-remark: 5413479K(6855104K)] 6385575K(8235264K), 0.1651917 secs] [Times: user=2.26 sys=0.00, real=0.16 secs]
270.428: [CMS-concurrent-sweep-start]
270.810: [GC (Allocation Failure) 270.810: [ParNew: 1380160K->153342K(1380160K), 0.6945854 secs] 6655115K->5934466K(8235264K), 0.6948417 secs] [Times: user=9.72 sys=0.29, real=0.69 secs]
272.801: [GC (Allocation Failure) 272.801: [ParNew: 1380158K->153344K(1380160K), 0.7545580 secs] 6699606K->5990059K(8235264K), 0.7548343 secs] [Times: user=10.43 sys=0.09, real=0.75 secs]
... x5
285.505: [GC (Allocation Failure) 285.505: [ParNew: 1380160K->153344K(1380160K), 1.1717376 secs] 6402712K->5752751K(8235264K), 1.1720205 secs] [Times: user=10.44 sys=0.56, real=1.17 secs]
287.525: [CMS-concurrent-sweep: 10.434/17.097 secs] [Times: user=110.38 sys=5.66, real=17.10 secs]
287.525: [CMS-concurrent-reset-start]
287.559: [CMS-concurrent-reset: 0.034/0.034 secs] [Times: user=0.07 sys=0.00, real=0.04 secs]
288.097: [GC (Allocation Failure) 288.097: [ParNew: 1380160K->153342K(1380160K), 1.4499834 secs] 6974837K->6325971K(8235264K), 1.4502986 secs] [Times: user=7.49 sys=0.47, real=1.45 secs]
267.737: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4917171K(6855104K)] 5070681K(8235264K), 0.0235663 secs] [Times: user=0.30 sys=0.00, real=0.03 secs]
==stop-the-wrold==,开始使用CMS收集器进行老年代的回收.初始标记阶段,这个阶段标记由根可以直接到达的对象,标记期间所有的线程会暂停. 老年代容量为6855104K, CMS GC在old区空间占用达到4917171K时触发.
2. concurrent mark start
267.761: [CMS-concurrent-mark-start]
并发标记开始阶段,第一阶段中被暂停的线程重新开始运行,由前一个阶段标记的对象触发,所有可到达的对象都在本阶段中标记
3. concurrent mark end
268.241: [CMS-concurrent-mark: 0.480/0.480 secs] [Times: user=3.62 sys=0.20, real=0.48 secs]
并发标记阶段结束,占用0.480秒的cpu时间
4. concurrent preclean start
268.241: [CMS-concurrent-preclean-start]
开始预清理阶段
预清理也是一个并发执行阶段. 本阶段中查找前一阶段执行过程中,从新生代晋升或新分配或被更新的对象. 通过并发的重新扫描这些对象,预处理阶段可以减少下一个stop-the-world重新标记阶段的工作量.
5. concurrent preclean end
268.311: [CMS-concurrent-preclean: 0.068/0.070 secs] [Times: user=0.26 sys=0.01, real=0.07 secs]
预处理阶段耗时 0.068秒,0.070秒 墙钟时间
6. concurrent abortable preclean start
268.311: [CMS-concurrent-abortable-preclean-start]
268.861: [GC (Allocation Failure) 268.861: [ParNew: 1380160K->153344K(1380160K), 0.7025517 secs] 6297331K->5566823K(8235264K), 0.7027702 secs] [Times: user=9.85 sys=0.48, real=0.70 secs]
==stop-the-world==,新生代使用ParNew并行收集器. 新生代容量1380160K, GC回收后占用从1380160K到153344K, 耗时0.7025517秒. 整个堆的占用从6297331K降低到5566823K,耗时0.7027702秒,堆的总大小为8235264K.
注意,这里的Allocation Failure是正常现象,表明Eden区已满,无法继续分配新对象,需要进行一次Minor GC, 但是这里把正常现象描述成了failure,带有一定的误导性.
8. concurrent abortable preclean end
270.261: [CMS-concurrent-abortable-preclean: 1.241/1.950 secs] [Times: user=13.50 sys=0.65, real=1.95 secs]
270.262: [GC (CMS Final Remark) [YG occupancy: 972096 K (1380160 K)]270.262: [Rescan (parallel) , 0.1256659 secs]270.388: [weak refs processing, 0.0049507 secs]270.393: [class unloading, 0.0205177 secs]270.414: [scrub symbol table, 0.0079385 secs]270.422: [scrub string table, 0.0010636 secs][1 CMS-remark: 5413479K(6855104K)] 6385575K(8235264K), 0.1651917 secs] [Times: user=2.26 sys=0.00, real=0.16 secs]
==stop-the-world==阶段,从根对象及被其引用的对象开始,重新扫描CMS堆中残留的更新过的对象. 这里重新扫描耗时0.1256659秒, 处理弱引用对象耗时 0.0049507秒,本阶段耗时 0.1651917秒.
10. concurrent sweep start
270.428: [CMS-concurrent-sweep-start]
开始并发清理阶段,在清理阶段,应用线程还在运行
11. concurrent sweep end
287.525: [CMS-concurrent-sweep: 10.434/17.097 secs] [Times: user=110.38 sys=5.66, real=17.10 secs]
并发清理阶段耗时17.097秒
12. concurrent reset start
287.525: [CMS-concurrent-reset-start]
开始并发重置
13. concurrent reset end
287.559: [CMS-concurrent-reset: 0.034/0.034 secs] [Times: user=0.07 sys=0.00, real=0.04 secs]
并发重置结束,在本阶段,重新初始化CMS内部数据结构,以备下一轮GC的使用.
修改后配置如下:
1. HBase RegionServer 的 Java 堆栈大小: 16G
2. HBase RegionServer 的 Java 配置选项
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=80 -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
主要将Heap size内存增大1倍, 并且
1) -XX:CMSInitiatingOccupancyFraction=80: Old区触发比率调整为80%
2) -XX:+UseCMSCompactAtFullCollection: CMS是不会移动内存的, 因此, 非常容易产生碎片,导致内存不够用,从而频繁触发Full GC, 因此,内存的压缩这个时候就会被启用, 增加这个参数是个好习惯,可能会影响性能,但是可以消除碎片
3) -XX:CMSFullGCsBeforeCompaction=8 运行多少次Full GC之后,进行一次碎片整理,调整该参数能够优化性能
并不是很明显,依然频繁的出现GC Allocation Failure, 导致频繁的 ParNew Young GC. 如上日志选择的不是很明显,其实主要是过多的Young GC导致JVM暂停, CMS old GC只有在initial Mark和final remark时,才会stop-the-world,所以,可以考虑减少Young GC频率
所以,通过-Xmn设置新生代区域大小为3G,较少Young GC频率
一定程度上缓解了GC超时的问题,因为同时删除了HBase中大量数据,所以,待观察. 如果依然存在问题,考虑使用G1垃圾回收器.
参考:
1. http://ifeve.com/jvm-cms-log/
2. http://blog.jobbole.com/109170/