In the early hours of the morning the Active NameNode suddenly went down. The key log entries are as follows:
2019-06-07 01:32:50,055 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12008 ms (timeout=20000 ms) for a response for startLogSegment(1656036462). Succeeded so far: [10.220.48.13:8485]
2019-06-07 01:32:50,081 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-jrsy-zk02.pekdc1.jdfin.local/10.220.48.26:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-06-07 01:32:51,055 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13009 ms (timeout=20000 ms) for a response for startLogSegment(1656036462). Succeeded so far: [10.220.48.13:8485]
2019-06-07 01:32:51,081 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-jrsy-zk02.pekdc1.jdfin.local/10.220.48.26:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-06-07 01:32:52,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14010 ms (timeout=20000 ms) for a response for startLogSegment(1656036462). Succeeded so far: [10.220.48.13:8485]
2019-06-07 01:32:52,082 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-jrsy-zk02.pekdc1.jdfin.local/10.220.48.26:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-06-07 01:32:53,058 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15011 ms (timeout=20000 ms) for a response for startLogSegment(1656036462). Succeeded so far: [10.220.48.13:8485]
2019-06-07 01:32:53,082 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-jrsy-zk02.pekdc1.jdfin.local/10.220.48.26:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-06-07 01:32:54,058 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16012 ms (timeout=20000 ms) for a response for startLogSegment(1656036462). Succeeded so far: [10.220.48.13:8485]
2019-06-07 01:32:54,083 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-jrsy-zk02.pekdc1.jdfin.local/10.220.48.26:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-06-07 01:32:55,059 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17013 ms (timeout=20000 ms) for a response for startLogSegment(1656036462). Succeeded so far: [10.220.48.13:8485]
2019-06-07 01:32:55,083 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-jrsy-zk02.pekdc1.jdfin.local/10.220.48.26:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-06-07 01:32:56,060 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18014 ms (timeout=20000 ms) for a response for startLogSegment(1656036462). Succeeded so far: [10.220.48.13:8485]
2019-06-07 01:32:56,084 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-jrsy-zk02.pekdc1.jdfin.local/10.220.48.26:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-06-07 01:32:57,062 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19015 ms (timeout=20000 ms) for a response for startLogSegment(1656036462). Succeeded so far: [10.220.48.13:8485]
2019-06-07 01:32:57,084 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-jrsy-zk02.pekdc1.jdfin.local/10.220.48.26:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-06-07 01:32:58,046 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 1656036462 failed for required journal (JournalAndStream(mgr=QJM to [10.220.48.13:8485, 10.220.48.26:8485, 10.220.48.27:8485], stream=null))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.startLogSegment(QuorumJournalManager.java:408)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$3.apply(JournalSet.java:222)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:219)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:1288)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1257)
at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1395)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:5818)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1122)
at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:142)
at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12025)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2226)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2222)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1758)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2222)
2019-06-07 01:32:58,049 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2019-06-07 01:32:58,093 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at TX-220-48-33.h.chinabank.com.cn/10.220.48.33
************************************************************/
Checking the monitoring revealed that, of the 3 JournalNodes, one JournalNode process had already died 7 hours earlier. The cause was an out-of-memory failure, as can be seen from the JVM crash report the process left in the /tmp directory:
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 3145728 bytes for committing reserved memory.
# Possible reasons:
# The system is out of physical RAM or swap space
# In 32 bit mode, the process size limit was hit
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full
# Use 64 bit Java on a 64 bit OS
# Decrease Java heap size (-Xmx/-Xms)
# Decrease number of Java threads
# Decrease Java thread stack sizes (-Xss)
# Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
# Out of Memory Error (os_linux.cpp:2640), pid=7916, tid=0x00007f86780d3700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 1.8.0_181-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
Looking at the GC statistics of the JournalNode processes that had not died showed the following (a sketch of the sampling command follows the table):
S0 S1 E O M CCS YGC YGCT FGC FGCT GCT
46.88 0.00 61.72 93.59 97.35 93.57 11230009 26118.504 127717 3519.742 29638.246
34.38 0.00 22.83 93.90 97.35 93.57 11230011 26118.508 127717 3519.742 29638.250
0.00 34.38 79.36 96.91 97.35 93.57 11230012 26118.510 127717 3519.742 29638.252
0.00 31.25 21.89 97.22 97.35 93.57 11230014 26118.514 127717 3519.742 29638.256
31.25 0.00 49.14 97.37 97.35 93.57 11230015 26118.516 127717 3519.742 29638.259
0.00 31.25 0.00 98.03 97.35 93.57 11230018 26118.523 127718 3519.742 29638.265
0.00 34.38 34.07 47.77 97.35 93.57 11230020 26118.527 127718 3519.768 29638.295
0.00 34.38 58.11 48.10 97.35 93.57 11230022 26118.539 127718 3519.768 29638.307
0.00 37.50 53.27 48.50 97.35 93.57 11230024 26118.543 127718 3519.768 29638.311
0.00 34.38 16.37 48.78 97.35 93.57 11230026 26118.547 127718 3519.768 29638.315
21.88 0.00 55.71 48.87 97.35 93.57 11230027 26118.549 127718 3519.768 29638.317
31.25 0.00 63.76 49.20 97.35 93.57 11230029 26118.554 127718 3519.768 29638.322
43.75 0.00 79.80 49.58 97.35 93.57 11230031 26118.558 127718 3519.768 29638.326
0.00 68.75 9.98 50.15 97.35 93.57 11230034 26118.564 127718 3519.768 29638.332
0.00 50.00 14.22 50.52 97.35 93.57 11230036 26118.568 127718 3519.768 29638.336
68.75 0.00 65.19 50.70 97.35 93.57 11230037 26118.570 127718 3519.768 29638.338
56.25 0.00 20.29 50.98 97.35 93.57 11230039 26118.574 127718 3519.768 29638.342
0.00 68.75 85.59 53.96 97.35 93.57 11230040 26118.576 127718 3519.768 29638.344
0.00 34.38 49.55 54.55 97.35 93.57 11230042 26118.581 127718 3519.768 29638.349
0.00 37.50 31.61 54.84 97.35 93.57 11230044 26118.585 127718 3519.768 29638.353
0.00 28.12 23.99 58.07 97.35 93.57 11230046 26118.589 127718 3519.768 29638.357
34.38 0.00 75.00 58.33 97.35 93.57 11230047 26118.591 127718 3519.768 29638.359
34.38 0.00 15.54 58.62 97.35 93.57 11230049 26118.596 127718 3519.768 29638.363
0.00 28.12 43.44 58.77 97.35 93.57 11230050 26118.598 127718 3519.768 29638.366
31.25 0.00 72.77 58.88 97.35 93.57 11230051 26118.600 127718 3519.768 29638.368
28.12 0.00 15.47 59.17 97.35 93.57 11230053 26118.604 127718 3519.768 29638.372
0.00 28.12 35.98 59.28 97.35 93.57 11230054 26118.607 127718 3519.768 29638.375
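The columns above match what jstat -gcutil prints on JDK 8, so the data was most likely collected with something along these lines (the PID placeholder and the 1-second interval are assumptions, not the values actually used):

# hypothetical sampling command; replace <jn_pid> with the JournalNode process id
jstat -gcutil <jn_pid> 1000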
Reading the GC statistics, this JournalNode was collecting almost non-stop: more than 11.2 million young GCs and over 127,000 full GCs, adding up to roughly 8.2 hours of accumulated GC time, with the old generation climbing past 90% before being collected. Under that kind of pressure it is quite plausible that, at times, a single full GC pause (or a burst of back-to-back collections) stalled the JN for more than 20 seconds, i.e. longer than the default value of dfs.qjournal.start-segment.timeout.ms. There are 3 JNs in total and one had already been dead for 7 hours, so the NameNode needed both surviving JNs to respond in order to reach the write quorum of 2 out of 3; neither of them could afford any hiccup. Unluckily, the NameNode of one of the seven nameservices (the one that crashed) hit exactly that case: in the log above, only 10.220.48.13:8485 acknowledged startLogSegment, 10.220.48.26:8485 (the dead JN) kept refusing connections, and the third JN never answered within the 20-second window. Starting the new edit log segment therefore failed on a required journal, and the NameNode exited.
Resolution:
Inspecting the JournalNode's startup parameters showed that the process had been given only 1 GB of heap. The JN heap was therefore enlarged and the collector was switched to G1, to cut down both the frequency and the duration of full GCs.
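A minimal sketch of what that change could look like, assuming a Hadoop 2.x layout where bin/hdfs picks up HADOOP_JOURNALNODE_OPTS from hadoop-env.sh (the 4 GB heap is an illustrative value, not necessarily the exact figure applied in this incident):

# hadoop-env.sh -- illustrative JournalNode JVM options
export HADOOP_JOURNALNODE_OPTS="-Xms4g -Xmx4g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 ${HADOOP_JOURNALNODE_OPTS}"

The new options only take effect after the JournalNode processes are restarted.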
In addition, the value of dfs.qjournal.start-segment.timeout.ms (configured in hdfs-site.xml, default 20 seconds, i.e. 20000 ms) can be increased so that an occasional long JN pause no longer takes the NameNode down.
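For reference, the override would live in hdfs-site.xml roughly as follows (the 90000 ms value is purely illustrative; choose a value that fits your own JN pause behaviour):

<!-- illustrative only: raise the start-segment timeout above the default 20000 ms -->
<property>
  <name>dfs.qjournal.start-segment.timeout.ms</name>
  <value>90000</value>
</property>

Note that raising this timeout only masks long JournalNode pauses; the heap and GC changes above address the actual cause.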