Hadoop源码分析(26)

Hadoop源码分析(26)

ZKFC源码分析

 在文档(25)中分析了zkfc的启动过程。在zkfc的启动过程中,其会连接zookeeper与namenode。并对namenode进行健康检查。

  namenode的健康检查,实际是通过RPC调用namenode自身的方法来进行健康检查。健康检查的主要的方法是monitorHealth方法,同时在namenode的启动中分析了用于处理namenode的RPC服务的类为:NameNodeRpcServer。该方法在namenode的实现如下:

  public synchronized void monitorHealth() throws HealthCheckFailedException,
      AccessControlException, IOException {
    checkNNStartup();
    nn.monitorHealth();
  }

  这里重点是第4行调用的monitorHealth方法这个方法内容如下:

synchronized void monitorHealth() 
      throws HealthCheckFailedException, AccessControlException {
    namesystem.checkSuperuserPrivilege();
    if (!haEnabled) {
      return; // no-op, if HA is not enabled
    }
    getNamesystem().checkAvailableResources();
    if (!getNamesystem().nameNodeHasResourcesAvailable()) {
      throw new HealthCheckFailedException(
          "The NameNode has no resources available");
    }
  }

  健康检查执行成功后,会回到zkfc远程调用的客户端。在文档(25)中分析了在这个方法调用结束后会执行一个enterState方法,来进入对应的状态,这个方法的调用情况如下:

enterState方法调用片段

  这两个方法的内容如下:

private synchronized void enterState(State newState) {
    if (newState != state) {
      LOG.info("Entering state " + newState);
      state = newState;
      synchronized (callbacks) {
        for (Callback cb : callbacks) {
          cb.enteredState(newState);
        }
      }
    }
  }

  这个方法很简单,首先是对state进行赋值,然后遍历callbacks中的对象,然后调用者个对象的enteredState方法。这里的callbacks在文档(25)中分析过了在创建healthMonitor的时候会调用addCallback方法添加callback到上述的callbacks中。这里设置的callback是HealthCallbacks类的对象,这个对象的enteredState方法内容如下:

  public void enteredState(HealthMonitor.State newState) {
      setLastHealthState(newState);
      recheckElectability();
    }

  这里重点是第3行的recheckElectability方法,其内容如下:

private void recheckElectability() {
    // Maintain lock ordering of elector -> ZKFC
    synchronized (elector) {
      synchronized (this) {
        boolean healthy = lastHealthState == State.SERVICE_HEALTHY;

        long remainingDelay = delayJoiningUntilNanotime - System.nanoTime(); 
        if (remainingDelay > 0) {
          if (healthy) {
            LOG.info("Would have joined master election, but this node is " +
                "prohibited from doing so for " +
                TimeUnit.NANOSECONDS.toMillis(remainingDelay) + " more ms");
          }
          scheduleRecheck(remainingDelay);
          return;
        }

        switch (lastHealthState) {
        case SERVICE_HEALTHY:
          elector.joinElection(targetToData(localTarget));
          if (quitElectionOnBadState) {
            quitElectionOnBadState = false;
          }
          break;

        case INITIALIZING:
          LOG.info("Ensuring that " + localTarget + " does not " +
              "participate in active master election");
          elector.quitElection(false);
          serviceState = HAServiceState.INITIALIZING;
          break;

        case SERVICE_UNHEALTHY:
        case SERVICE_NOT_RESPONDING:
          LOG.info("Quitting master election for " + localTarget +
              " and marking that fencing is necessary");
          elector.quitElection(true);
          serviceState = HAServiceState.INITIALIZING;
          break;

        case HEALTH_MONITOR_FAILED:
          fatalError("Health monitor failed!");
          break;

        default:
          throw new IllegalArgumentException("Unhandled state:" + lastHealthState);
        }
      }
    }
  }

  这里重点是第18行的switch语句,这里会根据lastHealthState的值来执行不同的方法。而这个lastHealthState的会在健康检查后由enterState方法传入。上文提到了传入的值为SERVICE_HEALTHY。所以这里实际会执行第19行到第24行的方法,这里的重点是第19行执行elector的joinElection方法。这个elector在zkfc初始化的时候提到过,其创建的是ActiveStandbyElector类的对象。其joinElection方法的内容如下:

 public synchronized void joinElection(byte[] data)
      throws HadoopIllegalArgumentException {

    if (data == null) {
      throw new HadoopIllegalArgumentException("data cannot be null");
    }

    if (wantToBeInElection) {
      LOG.info("Already in election. Not re-connecting.");
      return;
    }

    appData = new byte[data.length];
    System.arraycopy(data, 0, appData, 0, data.length);

    LOG.debug("Attempting active election for " + this);
    joinElectionInternal();
  }

  这里主要是一些参数的处理,重点在第17行的joinElectionInternal方法。这个方法的内容如下:

  private void joinElectionInternal() {
    Preconditions.checkState(appData != null,
        "trying to join election without any app data");
    if (zkClient == null) {
      if (!reEstablishSession()) {
        fatalError("Failed to reEstablish connection with ZooKeeper");
        return;
      }
    }

    createRetryCount = 0;
    wantToBeInElection = true;
    createLockNodeAsync();
  }

  这里首先判断了zkClient是否为空,然后对几个参数赋值,最后调用createLockNodeAsync方法。这个方法的内容如下:

  private void createLockNodeAsync() {
    zkClient.create(zkLockFilePath, appData, zkAcl, CreateMode.EPHEMERAL,
        this, zkClient);
  }

  从上述代码中可以看出,这里的选举实际是在zookeeper的指定路径下创建一个节点。若这个节点创建成功则代表该节点是选举出的主节点,若失败则为从节点。而在传入的参数中:zkLockFilePath是临时目录的路径,CreateMode.EPHEMERAL是目录的类型(此类目录会在客户端断开的时候被删除),this传入的是elector本身,即ActiveStandbyElector类。这个参数是作为回调函数被传入的,而传ActiveStandbyElector是因为他实现了StatCallback和StringCallback,这两个接口,所以它也可以作为回调函数被传入。

  无论创建节点是成功还是失败,zookeeper都会调用ActiveStandbyElector中的processResult方法,由该方法会判断其是否成功,如果成功则将其对应的namenode变为Active,否则为standby。processResult方法内容如下:

public synchronized void processResult(int rc, String path, Object ctx,
      String name) {
    if (isStaleClient(ctx)) return;
    LOG.debug("CreateNode result: " + rc + " for path: " + path
        + " connectionState: " + zkConnectionState +
        "  for " + this);

    Code code = Code.get(rc);
    if (isSuccess(code)) {
      // we successfully created the znode. we are the leader. start monitoring
      if (becomeActive()) {
        monitorActiveStatus();
      } else {
        reJoinElectionAfterFailureToBecomeActive();
      }
      return;
    }

    if (isNodeExists(code)) {
      if (createRetryCount == 0) {
        // znode exists and we did not retry the operation. so a different
        // instance has created it. become standby and monitor lock.
        becomeStandby();
      }
      // if we had retried then the znode could have been created by our first
      // attempt to the server (that we lost) and this node exists response is
      // for the second attempt. verify this case via ephemeral node owner. this
      // will happen on the callback for monitoring the lock.
      monitorActiveStatus();
      return;
    }

    String errorMessage = "Received create error from Zookeeper. code:"
        + code.toString() + " for path " + path;
    LOG.debug(errorMessage);

    if (shouldRetry(code)) {
      if (createRetryCount < maxRetryNum) {
        LOG.debug("Retrying createNode createRetryCount: " + createRetryCount);
        ++createRetryCount;
        createLockNodeAsync();
        return;
      }
      errorMessage = errorMessage
          + ". Not retrying further znode create connection errors.";
    } else if (isSessionExpired(code)) {
      // This isn't fatal - the client Watcher will re-join the election
      LOG.warn("Lock acquisition failed because session was lost");
      return;
    }

    fatalError(errorMessage);
  }

  这里的重点有两个:第一个是第11行的becomeActive方法,第二个是第23行的becomeStandby方法。这里两个方法用于转换namenode的状态。

  首先是becomeActive方法,其内容如下:

private boolean becomeActive() {
    assert wantToBeInElection;
    if (state == State.ACTIVE) {
      // already active
      return true;
    }
    try {
      Stat oldBreadcrumbStat = fenceOldActive();
      writeBreadCrumbNode(oldBreadcrumbStat);

      LOG.debug("Becoming active for " + this);
      appClient.becomeActive();
      state = State.ACTIVE;
      return true;
    } catch (Exception e) {
      LOG.warn("Exception handling the winning of election", e);
      // Caller will handle quitting and rejoining the election.
      return false;
    }
  }

 这里主要有两个方法,首先是第8行的fenceOldActive方法。这个方法是用来处理切换状态前的active节点。然后是第12行的becomeActive方法。

  这里主要分析becomeActive方法,这个方法内容如下:

   public void becomeActive() throws ServiceFailedException {
      ZKFailoverController.this.becomeActive();
    }

  这里是继续调用ZKFailoverController的becomeActive方法。该方法内容如下:

private synchronized void becomeActive() throws ServiceFailedException {
    LOG.info("Trying to make " + localTarget + " active...");
    try {
      HAServiceProtocolHelper.transitionToActive(localTarget.getProxy(
          conf, FailoverController.getRpcTimeoutToNewActive(conf)),
          createReqInfo());
      String msg = "Successfully transitioned " + localTarget +
          " to active state";
      LOG.info(msg);
      serviceState = HAServiceState.ACTIVE;
      recordActiveAttempt(new ActiveAttemptRecord(true, msg));

    } catch (Throwable t) {
      String msg = "Couldn't make " + localTarget + " active";
      LOG.fatal(msg, t);

      recordActiveAttempt(new ActiveAttemptRecord(false, msg + "\n" +
          StringUtils.stringifyException(t)));

      if (t instanceof ServiceFailedException) {
        throw (ServiceFailedException)t;
      } else {
        throw new ServiceFailedException("Couldn't transition to active",
            t);
      }
    }
  }

  这里的重点在第4行的方法,这个方法传入了两个参数,其中第一个参数是通过localTarget的getProxy方法获取的。这方法在文档(25)中解析过是用来获取namenode的代理对象的。然后调用的transitionToActive方法内容如下:

  public static void transitionToActive(HAServiceProtocol svc,
      StateChangeRequestInfo reqInfo)
      throws IOException {
    try {
      svc.transitionToActive(reqInfo);
    } catch (RemoteException e) {
      throw e.unwrapRemoteException(ServiceFailedException.class);
    }
  }

  这里可以看见第5行直接调用了代理对象的transitionToActive方法,这两个方法会通过RPC直接调用NameNodeRpcServer的方法。该方法内容如下:

 public synchronized void transitionToActive(StateChangeRequestInfo req) 
      throws ServiceFailedException, AccessControlException, IOException {
    checkNNStartup();
    nn.checkHaStateChange(req);
    nn.transitionToActive();
  }

  这里的重点在第5行的transitionToActive方法,这个方法的内容如下:

  synchronized void transitionToActive() 
      throws ServiceFailedException, AccessControlException {
    namesystem.checkSuperuserPrivilege();
    if (!haEnabled) {
      throw new ServiceFailedException("HA for namenode is not enabled");
    }
    state.setState(haContext, ACTIVE_STATE);
  }

  重点是第7行的setState方法,这个方法重新设置state的状态。在之前的文档中解析了在namenode的启动的时候都是以standby状态启动的。所以这里的state是standby状态的。其执行的setState方法内容如下:

  public void setState(HAContext context, HAState s) throws ServiceFailedException {
    if (s == NameNode.ACTIVE_STATE) {
      setStateInternal(context, s);
      return;
    }
    super.setState(context, s);
  }

  这里传入的s的值为ACTIVE_STATE,所以第2行的if 条件的结果是True。即这段代码会执行第3行的setStateInternal方法。这个方法内容如下:

protected final void setStateInternal(final HAContext context, final HAState s)
      throws ServiceFailedException {
    prepareToExitState(context);
    s.prepareToEnterState(context);
    context.writeLock();
    try {
      exitState(context);
      context.setState(s);
      s.enterState(context);
      s.updateLastHATransitionTime();
    } finally {
      context.writeUnlock();
    }
  }

  这里的逻辑很简单,首先需要准备退出当前状态(第3行和第4行),没有问题后开始执行退出程序(第7行),然后再设置新的状态(第8行),然后进入新的状态(第9行)。

  执行退出程序调用的exitState方法,这里主要是需要退出standby状态,在文档(24)中解析了在进入standby状态下的时候主要是启动两个线程,用于同步active的数据与执行checkpoint。这里退出standby状态主要就是停掉上述的两个线程。这里调用的exitState方法的内容如下:

 public void exitState(HAContext context) throws ServiceFailedException {
    try {
      context.stopStandbyServices();
    } catch (IOException e) {
      throw new ServiceFailedException("Failed to stop standby services", e);
    }
  }

  这里会继续调用context的stopStandbyServices方法来处理,这个方法的内容如下:

   public void stopStandbyServices() throws IOException {
      try {
        if (namesystem != null) {
          namesystem.stopStandbyServices();
        }
      } catch (Throwable t) {
        doImmediateShutdown(t);
      }
    }

  重点在第6行会调用 namesystem的stopStandbyServices方法。这个方法的内容如下:

  void stopStandbyServices() throws IOException {
    LOG.info("Stopping services started for standby state");
    if (standbyCheckpointer != null) {
      standbyCheckpointer.stop();
    }
    if (editLogTailer != null) {
      editLogTailer.stop();
    }
    if (dir != null && getFSImage() != null && getFSImage().editLog != null) {
      getFSImage().editLog.close();
    }
  }

  这个方法在第4行和第7行停掉了上文提到的两个进程:standbyCheckpointer和editLogTailer。

  然后再看进入新状态的enterState方法,这里的新状态是active,所以调用的是active的enterState方法,其内容如下:

public void enterState(HAContext context) throws ServiceFailedException {
    try {
      context.startActiveServices();
    } catch (IOException e) {
      throw new ServiceFailedException("Failed to start active services", e);
    }
  }

  public void startActiveServices() throws IOException {
      try {
        namesystem.startActiveServices();
      } catch (Throwable t) {
        doImmediateShutdown(t);
      }
    }

  这里和上文相同逐级调用方法,最后调用的startActiveServices方法内容如下:

void startActiveServices() throws IOException {
    startingActiveService = true;
    LOG.info("Starting services required for active state");
    writeLock();
    try {
      FSEditLog editLog = getFSImage().getEditLog();

      if (!editLog.isOpenForWrite()) {
        // During startup, we're already open for write during initialization.
        editLog.initJournalsForWrite();
        // May need to recover
        editLog.recoverUnclosedStreams();

        LOG.info("Catching up to latest edits from old active before " +
            "taking over writer role in edits logs");
        editLogTailer.catchupDuringFailover();

        blockManager.setPostponeBlocksFromFuture(false);
        blockManager.getDatanodeManager().markAllDatanodesStale();
        blockManager.clearQueues();
        blockManager.processAllPendingDNMessages();

        // Only need to re-process the queue, If not in SafeMode.
        if (!isInSafeMode()) {
          LOG.info("Reprocessing replication and invalidation queues");
          initializeReplQueues();
        }

        if (LOG.isDebugEnabled()) {
          LOG.debug("NameNode metadata after re-processing " +
              "replication and invalidation queues during failover:\n" +
              metaSaveAsString());
        }

        long nextTxId = getFSImage().getLastAppliedTxId() + 1;
        LOG.info("Will take over writing edit logs at txnid " + 
            nextTxId);
        editLog.setNextTxId(nextTxId);

        getFSImage().editLog.openForWrite();
      }

      // Enable quota checks.
      dir.enableQuotaChecks();
      if (haEnabled) {
        // Renew all of the leases before becoming active.
        // This is because, while we were in standby mode,
        // the leases weren't getting renewed on this NN.
        // Give them all a fresh start here.
        leaseManager.renewAllLeases();
      }
      leaseManager.startMonitor();
      startSecretManagerIfNecessary();

      //ResourceMonitor required only at ActiveNN. See HDFS-2914
      this.nnrmthread = new Daemon(new NameNodeResourceMonitor());
      nnrmthread.start();

      nnEditLogRoller = new Daemon(new NameNodeEditLogRoller(
          editLogRollerThreshold, editLogRollerInterval));
      nnEditLogRoller.start();

      if (lazyPersistFileScrubIntervalSec > 0) {
        lazyPersistFileScrubber = new Daemon(new LazyPersistFileScrubber(
            lazyPersistFileScrubIntervalSec));
        lazyPersistFileScrubber.start();
      }

      cacheManager.startMonitorThread();
      blockManager.getDatanodeManager().setShouldSendCachingCommands(true);
    } finally {
      startingActiveService = false;
      checkSafeMode();
      writeUnlock("startActiveServices");
    }
  }

  首先是第8行到第41行的if语句,这个语句中的主要是用来处理未关闭的日志流,即之前分析editlog文件中的inprogress文件,并且会打开日志的写权限。然后是第44行到末尾,这里会启动active需要的一些线程。其中最重要的是第59行启动的NameNodeEditLogRoller。

你可能感兴趣的:(hadoop,java,大数据)