On the HDFS NameNode Crash Caused by a Datacenter Switch Failure (Continued)

The process was painful, and the conclusion at the end is unsettling.


The analysis in the previous post established at least two conclusions:
1. If, taken as a whole, the active NN fails to write to the JNs, the active NN deliberately calls terminate() and the process exits;
2. For the JN-related configuration key dfs.namenode.shared.edits.dir, every JN that appears in it is treated as "required" by the NN.
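For concreteness, here is what that key looks like for the three JNs that appear in the logs below. This is a minimal hdfs-site.xml sketch; the journal ID "mycluster" is illustrative, not taken from the incident:

<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://192.168.146.66:8485;192.168.146.67:8485;192.168.146.68:8485/mycluster</value>
</property>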


This follow-up explains what "taken as a whole, the active NN fails to write to the JNs" actually means. It works in the direction opposite to the previous post: analyzing how the problem was caused, and checking whether the code is consistent with QJM's quorum mechanism (the answer is, of course, yes).


Once again, we start from the active NN's FATAL log.


2015-11-16 07:36:50,478 INFO  namenode.FSEditLog (FSEditLog.java:printStatistics(673)) - Number of transactions: 11830 Total time for transactions(ms): 394 Number of transactions batched in Syncs: 7342 Number of syncs: 350 SyncTimes(ms): 735 30792 26555
2015-11-16 07:36:50,481 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(364)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.146.66:8485, 192.168.146.67:8485, 192.168.146.68:8485], stream=QuorumOutputStream starting at txid 4776804880))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:499)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:359)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:495)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:623)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3001)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:647)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
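The stack trace already sketches the path to follow: a client RPC (NameNodeRpcServer.complete / FSNamesystem.completeFile) triggers FSEditLog.logSync(), which flushes through JournalSet.JournalSetOutputStream.flush() into JournalSet.mapJournalsAndReportErrors(), where the FATAL is logged. The rest of this post walks down that chain into QJM.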


Additional related log lines:
2015-11-16 07:36:26,770 WARN  client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 4670ms to send a batch of 78 edits (12198 bytes) to remote journal 192.168.146.67:8485
2015-11-16 07:36:50,383 WARN  client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 21267ms to send a batch of 39 edits (7095 bytes) to remote journal 192.168.146.67:8485
2015-11-16 07:36:29,116 WARN  client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 2345ms to send a batch of 39 edits (7095 bytes) to remote journal 192.168.146.68:8485
2015-11-16 07:36:29,115 WARN  client.QuorumJournalManager (IPCLoggerChannel.java:call(378)) - Took 2344ms to send a batch of 39 edits (7095 bytes) to remote journal 192.168.146.66:8485
2015-11-16 07:36:50,459 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 23689 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [192.168.146.67:8485]
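Reading these together with the FATAL log: sends to the JNs that normally complete within a few seconds were stretching past 20 s, and when the failing quorum call hit its deadline, only 192.168.146.67 had acknowledged that batch. One ack out of three is below the required majority of two (see getMajoritySize() below), so waitFor() threw TimeoutException, and the FATAL above was logged 22 ms later. Also note the thread reports having waited 23689 ms against a 20000 ms deadline; that overshoot hints the NN JVM itself may have stalled during this window (consistent with the GC suspicion in the conclusions), though this is speculation.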




In JournalSet.java, JournalSet.mapJournalsAndReportErrors():
  /**
   * Apply the given operation across all of the journal managers, disabling
   * any for which the closure throws an IOException.
   * @param closure {@link JournalClosure} object encapsulating the operation.
   * @param status message used for logging errors (e.g. "opening journal")
   * @throws IOException If the operation fails on all the journals.
   */
  private void mapJournalsAndReportErrors(JournalClosure closure, String status)
      throws IOException {


    List<JournalAndStream> badJAS = Lists.newLinkedList();
    for (JournalAndStream jas : journals) {
      try {
        closure.apply(jas); // Note: JournalClosure.apply() throwing an Exception or Error is what ultimately leads to terminate()
      } catch (Throwable t) {
        if (jas.isRequired()) {
          final String msg = "Error: " + status + " failed for required journal ("
            + jas + ")";
          LOG.fatal(msg, t);
          // If we fail on *any* of the required journals, then we must not
          // continue on any of the other journals. Abort them to ensure that
          // retry behavior doesn't allow them to keep going in any way.
          abortAllJournals();
          // the current policy is to shutdown the NN on errors to shared edits
          // dir. There are many code paths to shared edits failures - syncs,
          // roll of edits etc. All of them go through this common function
          // where the isRequired() check is made. Applying exit policy here
          // to catch all code paths.
          terminate(1, msg);
        } else {
          LOG.error("Error: " + status + " failed for (journal " + jas + ")", t);
          badJAS.add(jas);
        }
      }
    }
    disableAndReportErrorOnJournals(badJAS);
    if (!NameNodeResourcePolicy.areResourcesAvailable(journals,
        minimumRedundantJournals)) {
      String message = status + " failed for too many journals";
      LOG.error("Error: " + message);
      throw new IOException(message);
    }
  }
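(Recall conclusion 2 above: isRequired() is true for this journal precisely because its JNs are listed in dfs.namenode.shared.edits.dir. That is why a failure on this path goes to terminate() instead of merely adding the journal to badJAS and disabling it.)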


JournalSet.JournalSetOutputStream.flush():
    @Override
    public void flush() throws IOException {
      mapJournalsAndReportErrors(new JournalClosure() {
        @Override
        public void apply(JournalAndStream jas) throws IOException {
          if (jas.isActive()) {
            jas.getCurrentStream().flush(); // Note: within this apply(), EditLogOutputStream.flush() is the only call that can throw IOException
          }
        }
      }, "flush");
    }


JournalSet.JournalAndStream.getCurrentStream():
    private EditLogOutputStream stream;
    EditLogOutputStream getCurrentStream() {
      return stream;
    }


In EditLogOutputStream.java:
  /**
   * Flush and sync all data that is ready to be flush
   * {@link #setReadyToFlush()} into underlying persistent store.
   * @param durable if true, the edits should be made truly durable before
   * returning
   * @throws IOException
   */
  abstract protected void flushAndSync(boolean durable) throws IOException;


  /**
   * Flush data to persistent store.
   * Collect sync metrics.
   */
  public void flush() throws IOException {
    flush(true);
  }


  public void flush(boolean durable) throws IOException {
    numSync++;
    long start = monotonicNow();
    flushAndSync(durable); // Note: within flush(), flushAndSync() is the only call that can throw IOException
    long end = monotonicNow();
    totalTimeSync += (end - start);
  }


In QuorumOutputStream.java:
/**
 * EditLogOutputStream implementation that writes to a quorum of
 * remote journals.
 */
class QuorumOutputStream extends EditLogOutputStream {


  private final int writeTimeoutMs;


  @Override
  protected void flushAndSync(boolean durable) throws IOException {
    int numReadyBytes = buf.countReadyBytes();
    if (numReadyBytes > 0) {
      int numReadyTxns = buf.countReadyTxns();
      long firstTxToFlush = buf.getFirstReadyTxId();


      assert numReadyTxns > 0;


      // Copy from our double-buffer into a new byte array. This is for
      // two reasons:
      // 1) The IPC code has no way of specifying to send only a slice of
      //    a larger array.
      // 2) because the calls to the underlying nodes are asynchronous, we
      //    need a defensive copy to avoid accidentally mutating the buffer
      //    before it is sent.
      DataOutputBuffer bufToSend = new DataOutputBuffer(numReadyBytes);
      buf.flushTo(bufToSend);
      assert bufToSend.getLength() == numReadyBytes;
      byte[] data = bufToSend.getData();
      assert data.length == bufToSend.getLength();


      // Note: AsyncLoggerSet.sendEdits() itself does not throw an exception
      QuorumCall<AsyncLogger, Void> qcall = loggers.sendEdits(
          segmentTxId, firstTxToFlush,
          numReadyTxns, data);
      // Note: AsyncLoggerSet.waitForWriteQuorum() can throw IOException
      // (still to confirm: what is the value of writeTimeoutMs, and where does it come from? see below)
      loggers.waitForWriteQuorum(qcall, writeTimeoutMs, "sendEdits");


      // Since we successfully wrote this batch, let the loggers know. Any future
      // RPCs will thus let the loggers know of the most recent transaction, even
      // if a logger has fallen behind.
      loggers.setCommittedTxId(firstTxToFlush + numReadyTxns - 1);
    }
  }


}


In AsyncLoggerSet.java:
  public QuorumCall<AsyncLogger, Void> sendEdits(
      long segmentTxId, long firstTxnId, int numTxns, byte[] data) {
    Map<AsyncLogger, ListenableFuture<Void>> calls = Maps.newHashMap();
    for (AsyncLogger logger : loggers) {
      ListenableFuture<Void> future =
        logger.sendEdits(segmentTxId, firstTxnId, numTxns, data);
      calls.put(logger, future);
    }
    return QuorumCall.create(calls);
  }


  /**
   * Wait for a quorum of loggers to respond to the given call. If a quorum
   * can't be achieved, throws a QuorumException.
   * @param q the quorum call
   * @param timeoutMs the number of millis to wait
   * @param operationName textual description of the operation, for logging
   * @return a map of successful results
   * @throws QuorumException if a quorum doesn't respond with success
   * @throws IOException if the thread is interrupted or times out
   */
  Map<AsyncLogger, Void> waitForWriteQuorum(QuorumCall<AsyncLogger, Void> q,
      int timeoutMs, String operationName) throws IOException {
    // Note: this is the crux of the quorum mechanism; see the implementation of AsyncLoggerSet.getMajoritySize() below
    int majority = getMajoritySize();
    try {
      // Note: QuorumCall.waitFor() may log a WARN message here
      q.waitFor(
          loggers.size(), // either all respond
          majority, // or we get a majority successes
          majority, // or we get a majority failures,
          timeoutMs, operationName);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted waiting " + timeoutMs + "ms for a " +
          "quorum of nodes to respond.");
    } catch (TimeoutException e) {
      // Note: the text of this exception matches the exception in the FATAL log
      throw new IOException("Timed out waiting " + timeoutMs + "ms for a " +
          "quorum of nodes to respond.");
    }


    if (q.countSuccesses() < majority) {
      q.rethrowException("Got too many exceptions to achieve quorum size " +
          getMajorityString());
    }


    return q.getResults();
  }


  /**
   * @return the number of nodes which are required to obtain a quorum.
   */
  int getMajoritySize() {
    // Note: still to confirm where loggers ultimately comes from (see the note below)
    return loggers.size() / 2 + 1;
  }
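As far as I can tell from the Hadoop source, loggers is the set of AsyncLoggers that QuorumJournalManager builds from the JournalNodes listed in the qjournal:// URI of dfs.namenode.shared.edits.dir (via QuorumJournalManager.createLoggers()). For the three JNs in this incident, loggers.size() is 3, so getMajoritySize() returns 3 / 2 + 1 = 2: each batch of edits must be acknowledged by at least two JNs within timeoutMs.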


In QuorumCall.java:
  /**
   * Wait for the quorum to achieve a certain number of responses.
   *
   * Note that, even after this returns, more responses may arrive,
   * causing the return value of other methods in this class to change.
   *
   * @param minResponses return as soon as this many responses have been
   * received, regardless of whether they are successes or exceptions
   * @param minSuccesses return as soon as this many successful (non-exception)
   * responses have been received
   * @param maxExceptions return as soon as this many exception responses
   * have been received. Pass 0 to return immediately if any exception is
   * received.
   * @param millis the number of milliseconds to wait for
   * @throws InterruptedException if the thread is interrupted while waiting
   * @throws TimeoutException if the specified timeout elapses before
   * achieving the desired conditions
   */
  public synchronized void waitFor(
      int minResponses, int minSuccesses, int maxExceptions,
      int millis, String operationName)
      throws InterruptedException, TimeoutException {
    long st = Time.monotonicNow();
    long nextLogTime = st + (long)(millis * WAIT_PROGRESS_INFO_THRESHOLD);
    long et = st + millis;
    while (true) {
      checkAssertionErrors();
      if (minResponses > 0 && countResponses() >= minResponses) return;
      if (minSuccesses > 0 && countSuccesses() >= minSuccesses) return;
      if (maxExceptions >= 0 && countExceptions() > maxExceptions) return;
      long now = Time.monotonicNow();


      if (now > nextLogTime) {
        long waited = now - st;
        // Note: the content of this msg matches the WARN log above.
        // Reaching this point also means none of the three return conditions above held,
        // i.e. the majority required by QJM's quorum mechanism has not been reached yet.
        String msg = String.format(
            "Waited %s ms (timeout=%s ms) for a response for %s",
            waited, millis, operationName);
        if (!successes.isEmpty()) {
          msg += ". Succeeded so far: [" + Joiner.on(",").join(successes.keySet()) + "]";
        }
        if (!exceptions.isEmpty()) {
          msg += ". Exceptions so far: [" + getExceptionMapString() + "]";
        }
        if (successes.isEmpty() && exceptions.isEmpty()) {
          msg += ". No responses yet.";
        }
        if (waited > millis * WAIT_PROGRESS_WARN_THRESHOLD) {
          QuorumJournalManager.LOG.warn(msg);
        } else {
          QuorumJournalManager.LOG.info(msg);
        }
        nextLogTime = now + WAIT_PROGRESS_INTERVAL_MILLIS;
      }
      long rem = et - now;
      if (rem <= 0) {
        // Note: the TimeoutException thrown here is what ultimately terminates the active NN process
        throw new TimeoutException();
      }
      rem = Math.min(rem, nextLogTime - now);
      rem = Math.max(rem, 1);
      wait(rem);
    }
  }


In QuorumJournalManager.java:
/**
 * A JournalManager that writes to a set of remote JournalNodes,
 * requiring a quorum of nodes to ack each write.
 */
@InterfaceAudience.Private
public class QuorumJournalManager implements JournalManager {


    private final int writeTxnsTimeoutMs;


    // Note: (this assignment is in the constructor) it determines the timeout value used above
    this.writeTxnsTimeoutMs = conf.getInt(
        DFSConfigKeys.DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_KEY,
        DFSConfigKeys.DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_DEFAULT);


}
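As far as I can tell, this writeTxnsTimeoutMs is handed to QuorumOutputStream when QuorumJournalManager creates the stream (in startLogSegment()), which is how it becomes the writeTimeoutMs used by flushAndSync() above.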


In DFSConfigKeys.java:
    public static final String  DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_KEY = "dfs.qjournal.write-txns.timeout.ms";
    public static final int     DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_DEFAULT = 20000;




At this point, combined with the conclusions above, the picture is clear:
While writing the edit log to the JNs, the active NN failed to get the write acknowledged by a quorum (the write timed out), which caused the exception (TimeoutException) to be thrown.
The deeper cause may be the network, or a JVM GC pause; the GC logs are gone, so this cannot be confirmed.
As for the network, it is not that the network was simply "bad"; rather, network conditions failed to meet the level demanded by the configuration key above (dfs.qjournal.write-txns.timeout.ms).
So the most direct fix is to increase this key's value, which currently sits at the 20000 ms default.
To be safe, it is best to also increase all of the timeout-related keys that QuorumJournalManager.java reads; see the sketch below.
At the same time, be aware that in the extreme case both HA NameNodes can exit for this same reason, leaving the entire HDFS cluster unavailable.
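For example, a sketch of such a change in hdfs-site.xml (the 60000 ms value is an illustrative guess, not a tested recommendation; dfs.qjournal.start-segment.timeout.ms is one of the other dfs.qjournal.* timeout keys defined in DFSConfigKeys):

<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value>
</property>
<property>
  <name>dfs.qjournal.start-segment.timeout.ms</name>
  <value>60000</value>
</property>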
