A quick write-up of an incident I handled over the weekend.

Symptoms:
Starting at around 06:01, the Hadoop cluster misbehaved and all HDFS operations failed.
Dozens of jobs reported the following error:
```
FAILED: RuntimeException org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
Operation category WRITE is not supported in state standby
```

or:

```
FAILED: RuntimeException org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
Operation category READ is not supported in state standby
```
Root cause analysis:
1) The NameNode log showed the failure came from writing to the JournalNodes, so a network problem was first suspected. With 5 JournalNodes, the quorum (majority) rule means writes to at least 3 of them would have to fail before the NameNode gives up. Although some of the JournalNodes and the NameNode sit under different switches, monitoring showed no network anomalies, so the network was ruled out.
2) A closer read of the NameNode log, together with the source code, showed that the NameNode went through a very long full GC (96s) at exactly that time, which pushed the JournalNode writes past the timeout (20s by default), and the NameNode process exited. The JournalNode logs confirm this: no further requests arrived from the NameNode afterwards (because it had already died).
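The "Detected pause in JVM or host machine" warning in the NameNode log comes from Hadoop's JvmPauseMonitor, whose idea is simply: sleep for a fixed interval, measure how much longer the sleep actually took, and report the overshoot as a pause. A minimal sketch of that idea (the class, constants, and threshold here are illustrative, not Hadoop's actual implementation):

```java
// Simplified sketch of the JvmPauseMonitor idea: request a short fixed sleep
// and treat any large overshoot as a JVM/host pause (e.g. a stop-the-world GC).
public class PauseSketch {
    static final long SLEEP_MS = 500;   // monitoring interval (illustrative)
    static final long WARN_MS = 10000;  // warn threshold (illustrative)

    // The pause is the extra time elapsed beyond the requested sleep.
    static long extraSleep(long requestedMs, long actualMs) {
        return Math.max(0, actualMs - requestedMs);
    }

    static boolean isWorrisome(long extraMs) {
        return extraMs > WARN_MS;
    }

    public static void main(String[] args) {
        // Plugging in this incident's numbers: the monitor expected ~500ms of
        // sleep but ~96645ms elapsed, hence "pause of approximately 96145ms".
        long extra = extraSleep(SLEEP_MS, 96645);
        System.out.println("pause of approximately " + extra
                + "ms, worrisome=" + isWorrisome(extra));
    }
}
```

A pause this size dwarfs the 20s quorum write timeout, which is exactly why the edit-log flush below fails.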
GC log:
```
2014-05-25T05:59:53.578+0800: 2478553.316: [Full GC [PSYoungGen: 2784K->0K(4179328K)] [PSOldGen: 28566944K->9208608K(28573696K)] 28569728K->9208608K(32753024K) [PSPermGen: 41405K->41405K(524288K)], 95.6706740 secs] [Times: user=95.62 sys=0.00, real=95.67 secs]
Heap
 PSYoungGen      total 4179328K, used 87037K [0x00007f70e5000000, 0x00007f71e5000000, 0x00007f71e5000000)
  eden space 4164800K, 2% used [0x00007f70e5000000,0x00007f70ea4ff590,0x00007f71e3330000)
  from space 14528K, 0% used [0x00007f71e3330000,0x00007f71e3330000,0x00007f71e4160000)
  to   space 14336K, 0% used [0x00007f71e4200000,0x00007f71e4200000,0x00007f71e5000000)
 PSOldGen        total 28573696K, used 9208608K [0x00007f6a15000000, 0x00007f70e5000000, 0x00007f70e5000000)
  object space 28573696K, 32% used [0x00007f6a15000000,0x00007f6c470c8078,0x00007f70e5000000)
 PSPermGen       total 524288K, used 41466K [0x00007f69f5000000, 0x00007f6a15000000, 0x00007f6a15000000)
  object space 524288K, 7% used [0x00007f69f5000000,0x00007f69f787e8f8,0x00007f6a15000000)
```
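The GC log makes the mismatch concrete: the full GC's real time (95.67s) alone is nearly five times the 20s JournalNode write timeout. A small sketch that pulls the `real=` seconds out of such a line for comparison (the regex is a minimal assumption about this particular log format, not a general GC-log parser):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcPause {
    // Matches the trailing "real=95.67 secs" field of a -verbose:gc line.
    static final Pattern REAL = Pattern.compile("real=([0-9.]+)\\s*secs");

    static double realSeconds(String gcLine) {
        Matcher m = REAL.matcher(gcLine);
        if (!m.find()) {
            throw new IllegalArgumentException("no real= field in: " + gcLine);
        }
        return Double.parseDouble(m.group(1));
    }

    public static void main(String[] args) {
        String line = "[Times: user=95.62 sys=0.00, real=95.67 secs]";
        double pauseSec = realSeconds(line);
        double timeoutSec = 20.0;  // default dfs.qjournal.write-txns.timeout.ms / 1000
        System.out.println("GC pause " + pauseSec + "s vs QJM timeout "
                + timeoutSec + "s: exceeds=" + (pauseSec > timeoutSec));
    }
}
```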
JournalNode log:
```
2014-05-25 05:58:45,432 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /home/vipshop/hard_disk/nn_shared_disk/bipcluster/current/edits_inprogress_0000000001665741687 -> /home/vipshop/hard_disk/nn_shared_disk/bipcluster/current/edits_0000000001665741687-0000000001665777062
2014-05-25 06:44:54,299 ERROR org.apache.hadoop.hdfs.qjournal.server.JournalNode: RECEIVED SIGNAL 15: SIGTERM
2014-05-25 06:44:54,302 INFO org.apache.hadoop.hdfs.qjournal.server.JournalNode: SHUTDOWN_MSG:
```
NameNode error log:
```
2014-05-25 06:01:29,258 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 96275 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2014-05-25 06:01:29,259 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 96145ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=95670ms
GC pool 'PS Scavenge' had collection(s): count=1 time=592ms
2014-05-25 06:01:29,259 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96281ms to send a batch of 2 edits (358 bytes) to remote journal xxx:8485
2014-05-25 06:01:29,259 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96283ms to send a batch of 2 edits (358 bytes) to remote journal xxx:8485
2014-05-25 06:01:29,259 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 20019 Total time for transactions(ms): 436 Number of transactions batched in Syncs: 9441 Number of syncs: 5160 SyncTimes(ms): 41623 9305
2014-05-25 06:01:29,259 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96275ms to send a batch of 13 edits (1638 bytes) to remote journal xxx:8485
2014-05-25 06:01:29,259 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [xxx:8485, xxx:8485, xxx:8485, xxx:8485, xxx:8485], stream=QuorumOutputStream starting at txid 1665777063))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:490)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:350)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:53)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:486)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:581)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:1879)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1845)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:439)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:207)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44942)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1752)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1748)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1746)
2014-05-25 06:01:29,259 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96276ms to send a batch of 13 edits (1638 bytes) to remote journal 10.201.202.21:8485
2014-05-25 06:01:29,263 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /tmp/hive-hdfs/hive_2014-05-25_05-59-16_296_5866222407565920996/_task_tmp.-ext-10002/_tmp.001379_0. BP-180494678-xxx-1366627257763 blk_-3717787322078480343_437088977{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[xxx:50010|RBW], ReplicaUnderConstruction[xxx:50010|RBW], ReplicaUnderConstruction[xxx:50010|RBW]]}
2014-05-25 06:01:29,263 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault: Not able to place enough replicas, still in need of 2 to reach 3 For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
2014-05-25 06:01:29,263 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 1665777063
......
2014-05-25 06:01:29,268 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96285ms to send a batch of 13 edits (1638 bytes) to remote journal xxx:8485
......
2014-05-25 06:01:29,324 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2014-05-25 06:01:29,364 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
```
When the JournalNode write timed out, the terminate method of the ExitUtil class was invoked, killing the current process.
In the JournalSet class:
```java
for (JournalAndStream jas : journals) {
  try {
    closure.apply(jas);
  } catch (Throwable t) {
    if (jas.isRequired()) {
      final String msg = "Error: " + status + " failed for required journal ("
          + jas + ")";
      LOG.fatal(msg, t);
      // If we fail on *any* of the required journals, then we must not
      // continue on any of the other journals. Abort them to ensure that
      // retry behavior doesn't allow them to keep going in any way.
      abortAllJournals();
      // the current policy is to shutdown the NN on errors to shared edits
      // dir. There are many code paths to shared edits failures - syncs,
      // roll of edits etc. All of them go through this common function
      // where the isRequired() check is made. Applying exit policy here
      // to catch all code paths.
      terminate(1, msg);
    } else {
      LOG.error("Error: " + status + " failed for (journal " + jas + ")", t);
      badJAS.add(jas);
    }
  }
}
```
ExitUtil's terminate method in turn calls System.exit:
```java
/**
 * Terminate the current process. Note that terminate is the *only* method
 * that should be used to terminate the daemon processes.
 * @param status exit code
 * @param msg message used to create the ExitException
 * @throws ExitException if System.exit is disabled for test purposes
 */
public static void terminate(int status, String msg) throws ExitException {
  LOG.info("Exiting with status " + status);
  if (systemExitDisabled) {
    ExitException ee = new ExitException(status, msg);
    LOG.fatal("Terminate called", ee);
    if (null == firstExitException) {
      firstExitException = ee;
    }
    throw ee;
  }
  System.exit(status);
}
```
Immediate fix:
Restarted the cluster; service was back to normal around 06:50.
Follow-up measures:
1) Increase the JournalNode write timeout:
```
dfs.qjournal.write-txns.timeout.ms
```
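In hdfs-site.xml this would look roughly as follows. The property name is the real HDFS setting, but the value here is purely illustrative: to survive an incident like this one, it would need to exceed your worst-case observed GC pause (which was ~96s here), at the cost of slower failure detection.

```xml
<!-- hdfs-site.xml: raise the QJM write timeout so a long stop-the-world GC
     does not immediately make the NameNode abort. 60000 (60s) is an
     illustrative value, not a recommendation. -->
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value>
</property>
```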
2) Tune the NameNode's JVM parameters so that full GC is triggered earlier, keeping each full GC shorter.
3) By default the NameNode's full GCs use the parallel (throughput) collector, which is fully stop-the-world; switch it to CMS instead. Adjusted NameNode startup flags:
```
-XX:+UseCompressedOops
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
-XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0
-XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75
-XX:SoftRefLRUPolicyMSPerMB=0
```
One more gotcha: Flume does not recover from this on its own and has to be restarted, otherwise it keeps throwing java.lang.InterruptedException: sleep interrupted.
This article was reposted from 菜菜光's 51CTO blog. Original link: http://blog.51cto.com/caiguangguang/1418895; please contact the original author for reprint permission.