The process was painful, and the conclusion at the end is unsettling.
The previous post's analysis established at least two conclusions:
1. If the active NN's writes to the JNs fail as a whole, the active NN deliberately calls terminate() and the process exits;
2. Regarding the JN-related configuration item dfs.namenode.shared.edits.dir: any JN that appears in it is, from the NN's point of view, always "required".
This follow-up analysis explains what "the active NN's writes to the JNs fail as a whole" actually means. Working in the direction opposite to the previous post, it traces how the problem arises and checks whether the code is consistent with QJM's quorum mechanism (the answer is, of course, yes).
As before, we start from the active NN's FATAL log.
2015-11-16 07:36:50,478 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(673)) - Number of transactions: 11830 Total time for transactions(ms): 394 Number of transactions batched in Syncs: 7342 Number of syncs: 350 SyncTimes(ms): 735 30792 26555
2015-11-16 07:36:50,481 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(364)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.146.66:8485, 192.168.146.67:8485, 192.168.146.68:8485], stream=QuorumOutputStream starting at txid 4776804880))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:499)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:359)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:495)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:623)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3001)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:647)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
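Reading the stack bottom-up: a client's complete() RPC enters FSEditLog.logSync(), which flushes through JournalSet$JournalSetOutputStream.flush() and mapJournalsAndReportErrors() into QuorumOutputStream.flushAndSync(), and the timeout finally surfaces in AsyncLoggerSet.waitForWriteQuorum(). The rest of this post walks down exactly this path.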
Supplementary related logs:
2015-11-16 07:36:26,770 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 4670ms to send a batch of 78 edits (12198 bytes) to remote journal 192.168.146.67:8485
2015-11-16 07:36:50,383 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 21267ms to send a batch of 39 edits (7095 bytes) to remote journal 192.168.146.67:8485
2015-11-16 07:36:29,116 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 2345ms to send a batch of 39 edits (7095 bytes) to remote journal 192.168.146.68:8485
2015-11-16 07:36:29,115 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 2344ms to send a batch of 39 edits (7095 bytes) to remote journal 192.168.146.66:8485
2015-11-16 07:36:50,459 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 23689 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [192.168.146.67:8485]
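The timestamps are telling. The waitFor WARN at 07:36:50,459 reports 23689 ms of waiting, and 07:36:50,459 minus 23689 ms is 07:36:26,770 — exactly the timestamp of the first slow-batch WARN above. So the wait for this batch began right as an earlier batch (itself already 4670 ms slow) completed, and within the 20000 ms timeout only 192.168.146.67:8485 ever acknowledged it.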
In JournalSet.java, JournalSet.mapJournalsAndReportErrors():
/**
* Apply the given operation across all of the journal managers, disabling
* any for which the closure throws an IOException.
* @param closure {@link JournalClosure} object encapsulating the operation.
* @param status message used for logging errors (e.g. "opening journal")
* @throws IOException If the operation fails on all the journals.
*/
private void mapJournalsAndReportErrors(JournalClosure closure, String status)
throws IOException {
List<JournalAndStream> badJAS = Lists.newLinkedList();
for (JournalAndStream jas : journals) {
try {
closure.apply(jas); // Note: if JournalClosure.apply() throws an Exception or Error, it ultimately leads to terminate()
} catch (Throwable t) {
if (jas.isRequired()) {
final String msg = "Error: " + status + " failed for required journal ("
+ jas + ")";
LOG.fatal(msg, t);
// If we fail on *any* of the required journals, then we must not
// continue on any of the other journals. Abort them to ensure that
// retry behavior doesn't allow them to keep going in any way.
abortAllJournals();
// the current policy is to shutdown the NN on errors to shared edits
// dir. There are many code paths to shared edits failures - syncs,
// roll of edits etc. All of them go through this common function
// where the isRequired() check is made. Applying exit policy here
// to catch all code paths.
terminate(1, msg);
} else {
LOG.error("Error: " + status + " failed for (journal " + jas + ")", t);
badJAS.add(jas);
}
}
}
disableAndReportErrorOnJournals(badJAS);
if (!NameNodeResourcePolicy.areResourcesAvailable(journals,
minimumRedundantJournals)) {
String message = status + " failed for too many journals";
LOG.error("Error: " + message);
throw new IOException(message);
}
}
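The required/optional split above ties directly back to conclusion 2. As a standalone illustration, here is a minimal self-contained sketch of the policy mapJournalsAndReportErrors() enforces — my own code with made-up journal names, not Hadoop source: a failure on any required journal (and the QJM journal from dfs.namenode.shared.edits.dir is always required) aborts the process, while a failed optional journal is merely disabled.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch only (not Hadoop code): required journals are fatal on failure,
// optional ones are disabled and the NN keeps going.
public class RequiredJournalPolicy {
  static class Journal {
    final String name; final boolean required; final boolean failing;
    Journal(String name, boolean required, boolean failing) {
      this.name = name; this.required = required; this.failing = failing;
    }
    void flush() throws IOException {
      if (failing) throw new IOException("flush failed for " + name);
    }
  }

  public static void main(String[] args) {
    List<Journal> journals = new ArrayList<>();
    // The QJM journal from dfs.namenode.shared.edits.dir: required.
    journals.add(new Journal("QJM to [jn1:8485, jn2:8485, jn3:8485]", true, true));
    // A hypothetical optional local journal.
    journals.add(new Journal("file:///data/nn/edits", false, false));
    for (Journal jas : journals) {
      try {
        jas.flush();
      } catch (IOException e) {
        if (jas.required) {
          System.err.println("FATAL Error: flush failed for required journal (" + jas.name + ")");
          System.exit(1);                      // terminate(1, msg) in the real code
        } else {
          System.err.println("ERROR: disabling journal " + jas.name);
        }
      }
    }
  }
}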
JournalSet.JournalSetOutputStream.flush():
@Override
public void flush() throws IOException {
mapJournalsAndReportErrors(new JournalClosure() {
@Override
public void apply(JournalAndStream jas) throws IOException {
if (jas.isActive()) {
jas.getCurrentStream().flush(); // Note: in this apply() implementation, EditLogOutputStream.flush() is the only call that can throw an IOException
}
}
}, "flush");
}
JournalSet.JournalAndStream.getCurrentStream():
private EditLogOutputStream stream;
EditLogOutputStream getCurrentStream() {
return stream;
}
In EditLogOutputStream.java:
/**
* Flush and sync all data that is ready to be flush
* {@link #setReadyToFlush()} into underlying persistent store.
* @param durable if true, the edits should be made truly durable before
* returning
* @throws IOException
*/
abstract protected void flushAndSync(boolean durable) throws IOException;
/**
* Flush data to persistent store.
* Collect sync metrics.
*/
public void flush() throws IOException {
flush(true);
}
public void flush(boolean durable) throws IOException {
numSync++;
long start = monotonicNow();
flushAndSync(durable); // Note: within flush(), flushAndSync() is the only call that can throw an IOException
long end = monotonicNow();
totalTimeSync += (end - start);
}
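Incidentally, the numSync and totalTimeSync counters maintained here appear to be what feed the "Number of syncs: 350 SyncTimes(ms): 735 30792 26555" statistics in the INFO line at the top of this post.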
In QuorumOutputStream.java:
/**
* EditLogOutputStream implementation that writes to a quorum of
* remote journals.
*/
class QuorumOutputStream extends EditLogOutputStream {
private final int writeTimeoutMs;
@Override
protected void flushAndSync(boolean durable) throws IOException {
int numReadyBytes = buf.countReadyBytes();
if (numReadyBytes > 0) {
int numReadyTxns = buf.countReadyTxns();
long firstTxToFlush = buf.getFirstReadyTxId();
assert numReadyTxns > 0;
// Copy from our double-buffer into a new byte array. This is for
// two reasons:
// 1) The IPC code has no way of specifying to send only a slice of
// a larger array.
// 2) because the calls to the underlying nodes are asynchronous, we
// need a defensive copy to avoid accidentally mutating the buffer
// before it is sent.
DataOutputBuffer bufToSend = new DataOutputBuffer(numReadyBytes);
buf.flushTo(bufToSend);
assert bufToSend.getLength() == numReadyBytes;
byte[] data = bufToSend.getData();
assert data.length == bufToSend.getLength();
// Note: AsyncLoggerSet.sendEdits() itself never throws an exception
QuorumCall<AsyncLogger, Void> qcall = loggers.sendEdits(
segmentTxId, firstTxToFlush,
numReadyTxns, data);
// Note: AsyncLoggerSet.waitForWriteQuorum() can throw an IOException
// To confirm: what is the value of writeTimeoutMs, and where does it come from?
loggers.waitForWriteQuorum(qcall, writeTimeoutMs, "sendEdits");
// Since we successfully wrote this batch, let the loggers know. Any future
// RPCs will thus let the loggers know of the most recent transaction, even
// if a logger has fallen behind.
loggers.setCommittedTxId(firstTxToFlush + numReadyTxns - 1);
}
}
}
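This matches the stack trace: sendEdits() merely fans the batch out and returns a QuorumCall, so the IOException recorded in the FATAL log can only have come out of the waitForWriteQuorum() call — the frame the stack attributes to QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107).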
In AsyncLoggerSet.java:
public QuorumCall<AsyncLogger, Void> sendEdits(
long segmentTxId, long firstTxnId, int numTxns, byte[] data) {
Map<AsyncLogger, ListenableFuture<Void>> calls = Maps.newHashMap();
for (AsyncLogger logger : loggers) {
ListenableFuture<Void> future =
logger.sendEdits(segmentTxId, firstTxnId, numTxns, data);
calls.put(logger, future);
}
return QuorumCall.create(calls);
}
/**
* Wait for a quorum of loggers to respond to the given call. If a quorum
* can't be achieved, throws a QuorumException.
* @param q the quorum call
* @param timeoutMs the number of millis to wait
* @param operationName textual description of the operation, for logging
* @return a map of successful results
* @throws QuorumException if a quorum doesn't respond with success
* @throws IOException if the thread is interrupted or times out
*/
<V> Map<AsyncLogger, V> waitForWriteQuorum(QuorumCall<AsyncLogger, V> q,
int timeoutMs, String operationName) throws IOException {
// Note: this is the crux of the quorum mechanism; see the implementation of AsyncLoggerSet.getMajoritySize() below
int majority = getMajoritySize();
try {
// Note: QuorumCall.waitFor() may write a WARN log
q.waitFor(
loggers.size(), // either all respond
majority, // or we get a majority successes
majority, // or we get a majority failures,
timeoutMs, operationName);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new IOException("Interrupted waiting " + timeoutMs + "ms for a " +
"quorum of nodes to respond.");
} catch (TimeoutException e) {
// Note: the message of this exception matches the exception in the FATAL log
throw new IOException("Timed out waiting " + timeoutMs + "ms for a " +
"quorum of nodes to respond.");
}
if (q.countSuccesses() < majority) {
q.rethrowException("Got too many exceptions to achieve quorum size " +
getMajorityString());
}
return q.getResults();
}
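To make the shape of this mechanism concrete outside of Hadoop, here is a minimal self-contained sketch — my own illustration, not Hadoop code (QuorumCall actually aggregates Guava ListenableFutures, as shown above): fan a write out to every logger asynchronously, then block until a majority acknowledges or the write timeout expires.

import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of quorum-waiting: succeed once a majority of simulated
// JournalNodes acks, or fail with the same timeout message as the FATAL log.
public class QuorumWriteSketch {
  public static void main(String[] args) throws Exception {
    final int loggers = 3;
    final int majority = loggers / 2 + 1;        // same formula as getMajoritySize()
    final ExecutorService pool = Executors.newFixedThreadPool(loggers);
    final CountDownLatch acks = new CountDownLatch(majority);
    final AtomicInteger successes = new AtomicInteger();
    try {
      for (int i = 0; i < loggers; i++) {
        final long latencyMs = i * 100L;         // pretend each JN is a bit slower
        pool.submit(() -> {
          try {
            Thread.sleep(latencyMs);             // stands in for the RPC to one JN
            successes.incrementAndGet();
            acks.countDown();
          } catch (InterruptedException ignored) {
          }
        });
      }
      // The equivalent of waitForWriteQuorum(qcall, writeTimeoutMs, "sendEdits"):
      if (!acks.await(20000, TimeUnit.MILLISECONDS)) {
        throw new IOException("Timed out waiting 20000ms for a quorum of nodes to respond.");
      }
      System.out.println("quorum reached; successes so far: " + successes.get());
    } finally {
      pool.shutdownNow();
    }
  }
}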
/**
* @return the number of nodes which are required to obtain a quorum.
*/
int getMajoritySize() {
// Note: to confirm — where do these loggers ultimately come from?
return loggers.size() / 2 + 1;
}
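For the three JournalNodes in this cluster ([192.168.146.66:8485, 192.168.146.67:8485, 192.168.146.68:8485]), getMajoritySize() returns 3 / 2 + 1 = 2: every batch of edits must be acknowledged by at least two JNs within the timeout.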
In QuorumCall.java:
/**
* Wait for the quorum to achieve a certain number of responses.
*
* Note that, even after this returns, more responses may arrive,
* causing the return value of other methods in this class to change.
*
* @param minResponses return as soon as this many responses have been
* received, regardless of whether they are successes or exceptions
* @param minSuccesses return as soon as this many successful (non-exception)
* responses have been received
* @param maxExceptions return as soon as this many exception responses
* have been received. Pass 0 to return immediately if any exception is
* received.
* @param millis the number of milliseconds to wait for
* @throws InterruptedException if the thread is interrupted while waiting
* @throws TimeoutException if the specified timeout elapses before
* achieving the desired conditions
*/
public synchronized void waitFor(
int minResponses, int minSuccesses, int maxExceptions,
int millis, String operationName)
throws InterruptedException, TimeoutException {
long st = Time.monotonicNow();
long nextLogTime = st + (long)(millis * WAIT_PROGRESS_INFO_THRESHOLD);
long et = st + millis;
while (true) {
checkAssertionErrors();
if (minResponses > 0 && countResponses() >= minResponses) return;
if (minSuccesses > 0 && countSuccesses() >= minSuccesses) return;
if (maxExceptions >= 0 && countExceptions() > maxExceptions) return;
long now = Time.monotonicNow();
if (now > nextLogTime) {
long waited = now - st;
// Note: the content of this msg variable matches the WARN log above.
// Reaching this point also means none of the three conditions above held,
// so none of the early returns fired — in other words, the majority that
// QJM's quorum mechanism requires has not been reached.
String msg = String.format(
"Waited %s ms (timeout=%s ms) for a response for %s",
waited, millis, operationName);
if (!successes.isEmpty()) {
msg += ". Succeeded so far: [" + Joiner.on(",").join(successes.keySet()) + "]";
}
if (!exceptions.isEmpty()) {
msg += ". Exceptions so far: [" + getExceptionMapString() + "]";
}
if (successes.isEmpty() && exceptions.isEmpty()) {
msg += ". No responses yet.";
}
if (waited > millis * WAIT_PROGRESS_WARN_THRESHOLD) {
QuorumJournalManager.LOG.warn(msg);
} else {
QuorumJournalManager.LOG.info(msg);
}
nextLogTime = now + WAIT_PROGRESS_INTERVAL_MILLIS;
}
long rem = et - now;
if (rem <= 0) {
// Note: the TimeoutException thrown here is what ultimately terminates the active NN process
throw new TimeoutException();
}
rem = Math.min(rem, nextLogTime - now);
rem = Math.max(rem, 1);
wait(rem);
}
}
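Plugging the incident's numbers in: minResponses = loggers.size() = 3, and minSuccesses = maxExceptions = majority = 2. The WARN at 07:36:50,459 shows exactly one success (192.168.146.67:8485) and lists no exceptions, so fewer than 3 responses had arrived, fewer than 2 had succeeded, and the exception count never crossed its threshold. None of the three early returns could fire, and once the 20000 ms budget was exhausted the TimeoutException above was thrown.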
In QuorumJournalManager.java:
/**
* A JournalManager that writes to a set of remote JournalNodes,
* requiring a quorum of nodes to ack each write.
*/
@InterfaceAudience.Private
public class QuorumJournalManager implements JournalManager {
private final int writeTxnsTimeoutMs;
// Note: this assignment, in the QuorumJournalManager constructor, is where the timeout value above is determined
this.writeTxnsTimeoutMs = conf.getInt(
DFSConfigKeys.DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_KEY,
DFSConfigKeys.DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_DEFAULT);
}
In DFSConfigKeys.java:
public static final String DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_KEY = "dfs.qjournal.write-txns.timeout.ms";
public static final int DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_DEFAULT = 20000;
At this point, combined with the conclusions above, the picture is clear:
When the active NN writes the edit log to the JNs, the write is not acknowledged by a quorum within the timeout, so an exception (the TimeoutException above, surfacing as the IOException in the FATAL log) is thrown.
The deeper cause is probably the network, or JVM GC pauses; the GC logs are no longer available, so this cannot be confirmed.
On the network side, it is not that the network was "bad" in absolute terms; rather, network conditions failed to meet the level demanded by the configuration item above (dfs.qjournal.write-txns.timeout.ms).
So the most direct remedy is to increase this configuration value, which currently sits at its default of 20000 ms.
To be safe, it is best to also increase the other timeout-related configuration items that QuorumJournalManager.java reads.
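A minimal sketch of that change — illustrative only: the value 60000 below is an arbitrary example, not a tested recommendation, and in a real cluster the keys belong in hdfs-site.xml on both NameNodes, followed by a restart:

import org.apache.hadoop.conf.Configuration;

// Sketch: raising the QJM write timeout programmatically. 60000 is an
// arbitrary example value; tune it to the actual network. The other
// dfs.qjournal.*.timeout.ms keys read in QuorumJournalManager's
// constructor can be raised the same way.
public class RaiseQjmTimeout {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInt("dfs.qjournal.write-txns.timeout.ms", 60000);   // default 20000
    System.out.println(conf.getInt("dfs.qjournal.write-txns.timeout.ms", 20000));
  }
}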
At the same time, be aware that in extreme cases both HA NameNodes can exit for this very reason, leaving the entire HDFS cluster unavailable.