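Document (25) analyzed the zkfc startup process. During startup, zkfc connects to ZooKeeper and to the namenode, and runs health checks against the namenode.
The namenode health check is an RPC call to a method on the namenode itself. The main method involved is monitorHealth. The analysis of namenode startup showed that the class serving the namenode's RPC requests is NameNodeRpcServer, which implements the method as follows: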
public synchronized void monitorHealth() throws HealthCheckFailedException,
    AccessControlException, IOException {
  checkNNStartup();
  nn.monitorHealth();
}
The key part is the call to nn.monitorHealth(). That method reads as follows:
synchronized void monitorHealth()
    throws HealthCheckFailedException, AccessControlException {
  namesystem.checkSuperuserPrivilege();
  if (!haEnabled) {
    return; // no-op, if HA is not enabled
  }
  getNamesystem().checkAvailableResources();
  if (!getNamesystem().nameNodeHasResourcesAvailable()) {
    throw new HealthCheckFailedException(
        "The NameNode has no resources available");
  }
}
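From the caller's side, the check either returns normally or throws. The following is a minimal sketch of how a monitor might map those outcomes to a state. The class names, the `HealthCheckTarget` interface and the exact mapping are assumptions made for illustration; only the state names are taken from the zkfc code analyzed in this document.

```java
/** Hypothetical stand-in for the exception thrown by a failed health check. */
class HealthCheckFailedSketchException extends Exception {
    HealthCheckFailedSketchException(String message) {
        super(message);
    }
}

/** Hypothetical stand-in for the RPC proxy to the namenode. */
interface HealthCheckTarget {
    void monitorHealth() throws Exception;
}

class HealthMonitorSketch {
    enum State { INITIALIZING, SERVICE_HEALTHY, SERVICE_UNHEALTHY, SERVICE_NOT_RESPONDING }

    private State state = State.INITIALIZING;

    /** Runs one health check and records the resulting state. */
    State checkOnce(HealthCheckTarget target) {
        try {
            target.monitorHealth();
            // The call returned normally: the service reported itself healthy.
            state = State.SERVICE_HEALTHY;
        } catch (HealthCheckFailedSketchException e) {
            // The service answered, but the check itself failed.
            state = State.SERVICE_UNHEALTHY;
        } catch (Exception e) {
            // Assumption: any other failure is treated as "no usable answer".
            state = State.SERVICE_NOT_RESPONDING;
        }
        return state;
    }

    State getState() {
        return state;
    }
}
```

For example, `new HealthMonitorSketch().checkOnce(() -> {})` yields SERVICE_HEALTHY, while a target that throws `HealthCheckFailedSketchException` yields SERVICE_UNHEALTHY.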
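After the health check succeeds, control returns to the zkfc side that issued the remote call. Document (25) showed that once this call completes, the enterState method is executed to move into the corresponding state. The method reads as follows: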
private synchronized void enterState(State newState) {
  if (newState != state) {
    LOG.info("Entering state " + newState);
    state = newState;
    synchronized (callbacks) {
      for (Callback cb : callbacks) {
        cb.enteredState(newState);
      }
    }
  }
}
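The method is simple: if the state has changed, it assigns the new value to state, then iterates over callbacks and calls enteredState on each object. As analyzed in document (25), when the healthMonitor is created, addCallback is called to register a callback in callbacks. The callback registered here is an object of the HealthCallbacks class, whose enteredState method reads as follows: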
public void enteredState(HealthMonitor.State newState) {
  setLastHealthState(newState);
  recheckElectability();
}
The key call in enteredState is recheckElectability, which reads as follows:
private void recheckElectability() {
  // Maintain lock ordering of elector -> ZKFC
  synchronized (elector) {
    synchronized (this) {
      boolean healthy = lastHealthState == State.SERVICE_HEALTHY;
      long remainingDelay = delayJoiningUntilNanotime - System.nanoTime();
      if (remainingDelay > 0) {
        if (healthy) {
          LOG.info("Would have joined master election, but this node is " +
              "prohibited from doing so for " +
              TimeUnit.NANOSECONDS.toMillis(remainingDelay) + " more ms");
        }
        scheduleRecheck(remainingDelay);
        return;
      }
      switch (lastHealthState) {
      case SERVICE_HEALTHY:
        elector.joinElection(targetToData(localTarget));
        if (quitElectionOnBadState) {
          quitElectionOnBadState = false;
        }
        break;
      case INITIALIZING:
        LOG.info("Ensuring that " + localTarget + " does not " +
            "participate in active master election");
        elector.quitElection(false);
        serviceState = HAServiceState.INITIALIZING;
        break;
      case SERVICE_UNHEALTHY:
      case SERVICE_NOT_RESPONDING:
        LOG.info("Quitting master election for " + localTarget +
            " and marking that fencing is necessary");
        elector.quitElection(true);
        serviceState = HAServiceState.INITIALIZING;
        break;
      case HEALTH_MONITOR_FAILED:
        fatalError("Health monitor failed!");
        break;
      default:
        throw new IllegalArgumentException("Unhandled state:" + lastHealthState);
      }
    }
  }
}
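The key part of recheckElectability is the switch statement, which runs different code depending on the value of lastHealthState. That value is recorded by setLastHealthState from the state that enterState passes in after the health check. Since the health check succeeded, the value is SERVICE_HEALTHY, so the SERVICE_HEALTHY branch runs, and the important call there is elector.joinElection. As mentioned in the analysis of zkfc initialization, elector is an object of the ActiveStandbyElector class. Its joinElection method reads as follows: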
public synchronized void joinElection(byte[] data)
    throws HadoopIllegalArgumentException {
  if (data == null) {
    throw new HadoopIllegalArgumentException("data cannot be null");
  }
  if (wantToBeInElection) {
    LOG.info("Already in election. Not re-connecting.");
    return;
  }
  appData = new byte[data.length];
  System.arraycopy(data, 0, appData, 0, data.length);
  LOG.debug("Attempting active election for " + this);
  joinElectionInternal();
}
joinElection is mostly argument handling; the key call is joinElectionInternal, which reads as follows:
private void joinElectionInternal() {
  Preconditions.checkState(appData != null,
      "trying to join election without any app data");
  if (zkClient == null) {
    if (!reEstablishSession()) {
      fatalError("Failed to reEstablish connection with ZooKeeper");
      return;
    }
  }
  createRetryCount = 0;
  wantToBeInElection = true;
  createLockNodeAsync();
}
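The method first checks whether zkClient is null and re-establishes the session if it is, then sets a few fields, and finally calls createLockNodeAsync, which reads as follows: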
private void createLockNodeAsync() {
  zkClient.create(zkLockFilePath, appData, zkAcl, CreateMode.EPHEMERAL,
      this, zkClient);
}
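As the code shows, the election amounts to creating a znode at a designated path in ZooKeeper. If the create succeeds, this node has won the election and becomes the primary (active); if it fails, it becomes the secondary (standby). Among the arguments, zkLockFilePath is the path of the ephemeral znode, CreateMode.EPHEMERAL is the znode type (an ephemeral znode is deleted when the session of the client that created it ends), and this is the elector itself, the ActiveStandbyElector object, passed in as the callback. ActiveStandbyElector can serve as the callback because it implements the StatCallback and StringCallback interfaces.
Whether the create succeeds or fails, the processResult method of ActiveStandbyElector is invoked with the result. It determines whether the create succeeded; if so, the corresponding namenode becomes active, otherwise standby. processResult reads as follows: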
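To make the idea concrete, here is a minimal in-memory sketch of electing a leader by creating an ephemeral node. `FakeCoordinator` is a hypothetical stand-in and deliberately not the ZooKeeper client API; the path and session names are also made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for a coordination service; NOT the ZooKeeper API. */
class FakeCoordinator {
    // Maps each ephemeral node path to the session that created it.
    private final Map<String, String> ephemeralOwners = new HashMap<>();

    /** Returns true if the node was created, false if it already exists. */
    synchronized boolean createEphemeral(String path, String sessionId) {
        return ephemeralOwners.putIfAbsent(path, sessionId) == null;
    }

    /** Ending a session removes every ephemeral node that session owns. */
    synchronized void closeSession(String sessionId) {
        ephemeralOwners.values().removeIf(owner -> owner.equals(sessionId));
    }
}

class ElectionSketch {
    static final String LOCK_PATH = "/sketch/lock"; // hypothetical path

    /** Whoever creates the lock node first is active; everyone else is standby. */
    static String joinElection(FakeCoordinator coordinator, String sessionId) {
        return coordinator.createEphemeral(LOCK_PATH, sessionId) ? "ACTIVE" : "STANDBY";
    }
}
```

With two sessions nn1 and nn2 joining in that order, nn1 becomes ACTIVE and nn2 STANDBY. Once nn1's session is closed, the lock node disappears and a new attempt by nn2 succeeds.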
public synchronized void processResult(int rc, String path, Object ctx,
    String name) {
  if (isStaleClient(ctx)) return;
  LOG.debug("CreateNode result: " + rc + " for path: " + path
      + " connectionState: " + zkConnectionState +
      " for " + this);
  Code code = Code.get(rc);
  if (isSuccess(code)) {
    // we successfully created the znode. we are the leader. start monitoring
    if (becomeActive()) {
      monitorActiveStatus();
    } else {
      reJoinElectionAfterFailureToBecomeActive();
    }
    return;
  }
  if (isNodeExists(code)) {
    if (createRetryCount == 0) {
      // znode exists and we did not retry the operation. so a different
      // instance has created it. become standby and monitor lock.
      becomeStandby();
    }
    // if we had retried then the znode could have been created by our first
    // attempt to the server (that we lost) and this node exists response is
    // for the second attempt. verify this case via ephemeral node owner. this
    // will happen on the callback for monitoring the lock.
    monitorActiveStatus();
    return;
  }
  String errorMessage = "Received create error from Zookeeper. code:"
      + code.toString() + " for path " + path;
  LOG.debug(errorMessage);
  if (shouldRetry(code)) {
    if (createRetryCount < maxRetryNum) {
      LOG.debug("Retrying createNode createRetryCount: " + createRetryCount);
      ++createRetryCount;
      createLockNodeAsync();
      return;
    }
    errorMessage = errorMessage
        + ". Not retrying further znode create connection errors.";
  } else if (isSessionExpired(code)) {
    // This isn't fatal - the client Watcher will re-join the election
    LOG.warn("Lock acquisition failed because session was lost");
    return;
  }
  fatalError(errorMessage);
}
Two calls in processResult matter most: becomeActive and becomeStandby. These two methods switch the state of the namenode.
First, becomeActive, which reads as follows:
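The branching in processResult can be condensed into a small decision function. This is only a sketch: the `Code` and `Action` enums are hypothetical simplifications (the real method works with ZooKeeper result codes and performs the actions directly), and `RETRYABLE` stands for whatever codes shouldRetry accepts.

```java
class CreateResultSketch {
    /** Simplified, hypothetical result codes for the create call. */
    enum Code { OK, NODE_EXISTS, RETRYABLE, SESSION_EXPIRED, OTHER }

    /** What the elector does next, as read from the listing above. */
    enum Action {
        BECOME_ACTIVE, BECOME_STANDBY, VERIFY_OWNER,
        RETRY_CREATE, WAIT_FOR_REJOIN, FATAL
    }

    static Action decide(Code code, int createRetryCount, int maxRetryNum) {
        switch (code) {
            case OK:
                // The lock node was created: this elector won.
                return Action.BECOME_ACTIVE;
            case NODE_EXISTS:
                // Without retries, someone else owns the node. After a retry
                // the node may be our own, so the owner must be verified.
                return createRetryCount == 0
                        ? Action.BECOME_STANDBY : Action.VERIFY_OWNER;
            case RETRYABLE:
                return createRetryCount < maxRetryNum
                        ? Action.RETRY_CREATE : Action.FATAL;
            case SESSION_EXPIRED:
                // Not fatal: the watcher re-joins the election later.
                return Action.WAIT_FOR_REJOIN;
            default:
                return Action.FATAL;
        }
    }
}
```

For instance, `decide(Code.NODE_EXISTS, 0, 3)` gives BECOME_STANDBY, while `decide(Code.RETRYABLE, 3, 3)` gives FATAL because the retry budget is used up.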
private boolean becomeActive() {
  assert wantToBeInElection;
  if (state == State.ACTIVE) {
    // already active
    return true;
  }
  try {
    Stat oldBreadcrumbStat = fenceOldActive();
    writeBreadCrumbNode(oldBreadcrumbStat);
    LOG.debug("Becoming active for " + this);
    appClient.becomeActive();
    state = State.ACTIVE;
    return true;
  } catch (Exception e) {
    LOG.warn("Exception handling the winning of election", e);
    // Caller will handle quitting and rejoining the election.
    return false;
  }
}
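Two calls matter here. The first is fenceOldActive, which deals with the node that was active before the switch. The second is appClient.becomeActive.
The analysis here focuses on appClient.becomeActive, which reads as follows: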
public void becomeActive() throws ServiceFailedException {
  ZKFailoverController.this.becomeActive();
}
This delegates to ZKFailoverController's own becomeActive method, which reads as follows:
private synchronized void becomeActive() throws ServiceFailedException {
  LOG.info("Trying to make " + localTarget + " active...");
  try {
    HAServiceProtocolHelper.transitionToActive(localTarget.getProxy(
        conf, FailoverController.getRpcTimeoutToNewActive(conf)),
        createReqInfo());
    String msg = "Successfully transitioned " + localTarget +
        " to active state";
    LOG.info(msg);
    serviceState = HAServiceState.ACTIVE;
    recordActiveAttempt(new ActiveAttemptRecord(true, msg));
  } catch (Throwable t) {
    String msg = "Couldn't make " + localTarget + " active";
    LOG.fatal(msg, t);
    recordActiveAttempt(new ActiveAttemptRecord(false, msg + "\n" +
        StringUtils.stringifyException(t)));
    if (t instanceof ServiceFailedException) {
      throw (ServiceFailedException)t;
    } else {
      throw new ServiceFailedException("Couldn't transition to active",
          t);
    }
  }
}
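The key call is HAServiceProtocolHelper.transitionToActive, which takes two arguments. The first is obtained through localTarget's getProxy method, which document (25) showed returns a proxy object for the namenode. The transitionToActive method being called reads as follows: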
public static void transitionToActive(HAServiceProtocol svc,
    StateChangeRequestInfo reqInfo)
    throws IOException {
  try {
    svc.transitionToActive(reqInfo);
  } catch (RemoteException e) {
    throw e.unwrapRemoteException(ServiceFailedException.class);
  }
}
Here the proxy object's transitionToActive method is called directly. Through RPC, this call reaches the corresponding method of NameNodeRpcServer, which reads as follows:
public synchronized void transitionToActive(StateChangeRequestInfo req)
    throws ServiceFailedException, AccessControlException, IOException {
  checkNNStartup();
  nn.checkHaStateChange(req);
  nn.transitionToActive();
}
The key call is nn.transitionToActive, which reads as follows:
synchronized void transitionToActive()
    throws ServiceFailedException, AccessControlException {
  namesystem.checkSuperuserPrivilege();
  if (!haEnabled) {
    throw new ServiceFailedException("HA for namenode is not enabled");
  }
  state.setState(haContext, ACTIVE_STATE);
}
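The key call is state.setState, which changes the current state. Earlier documents showed that the namenode always starts in the standby state, so state here is the standby state. The setState method it executes reads as follows: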
public void setState(HAContext context, HAState s) throws ServiceFailedException {
  if (s == NameNode.ACTIVE_STATE) {
    setStateInternal(context, s);
    return;
  }
  super.setState(context, s);
}
The value of s passed in is ACTIVE_STATE, so the if condition is true and setStateInternal is executed. That method reads as follows:
protected final void setStateInternal(final HAContext context, final HAState s)
    throws ServiceFailedException {
  prepareToExitState(context);
  s.prepareToEnterState(context);
  context.writeLock();
  try {
    exitState(context);
    context.setState(s);
    s.enterState(context);
    s.updateLastHATransitionTime();
  } finally {
    context.writeUnlock();
  }
}
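The logic of setStateInternal is simple. It first prepares the transition by calling prepareToExitState for the current state and prepareToEnterState for the new one. If that raises no problem, it takes the context's write lock, exits the current state (exitState), sets the new state (context.setState), and then enters the new state (s.enterState).
For exitState, the state being exited is standby. Document (24) showed that entering standby mainly starts two threads, one that syncs data from the active and one that performs checkpoints. Exiting standby therefore mainly means stopping those two threads. The exitState method called here reads as follows: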
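The ordering of these steps can be illustrated with a small sketch that records each hook as it runs. `TransitionSketch` and `SketchState` are hypothetical names introduced for this example, and a plain `ReentrantReadWriteLock` stands in for the context's write lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class TransitionSketch {
    /** Hypothetical state whose hooks only record that they were called. */
    static class SketchState {
        final String name;

        SketchState(String name) {
            this.name = name;
        }

        void prepareToExit(List<String> trace) { trace.add("prepareToExit " + name); }
        void prepareToEnter(List<String> trace) { trace.add("prepareToEnter " + name); }
        void exit(List<String> trace) { trace.add("exit " + name); }
        void enter(List<String> trace) { trace.add("enter " + name); }
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private SketchState current;
    final List<String> trace = new ArrayList<>();

    TransitionSketch(SketchState initial) {
        this.current = initial;
    }

    void transitionTo(SketchState next) {
        // Preparation happens before the lock is taken.
        current.prepareToExit(trace);
        next.prepareToEnter(trace);
        lock.writeLock().lock();
        try {
            // Exit the old state, swap, then enter the new state.
            current.exit(trace);
            current = next;
            current.enter(trace);
        } finally {
            lock.writeLock().unlock();
        }
    }

    String currentName() {
        return current.name;
    }
}
```

Transitioning from a state named standby to one named active records prepareToExit standby, prepareToEnter active, exit standby and enter active, in that order.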
public void exitState(HAContext context) throws ServiceFailedException {
  try {
    context.stopStandbyServices();
  } catch (IOException e) {
    throw new ServiceFailedException("Failed to stop standby services", e);
  }
}
This in turn calls the context's stopStandbyServices method, which reads as follows:
public void stopStandbyServices() throws IOException {
  try {
    if (namesystem != null) {
      namesystem.stopStandbyServices();
    }
  } catch (Throwable t) {
    doImmediateShutdown(t);
  }
}
The key call is namesystem.stopStandbyServices, which reads as follows:
void stopStandbyServices() throws IOException {
  LOG.info("Stopping services started for standby state");
  if (standbyCheckpointer != null) {
    standbyCheckpointer.stop();
  }
  if (editLogTailer != null) {
    editLogTailer.stop();
  }
  if (dir != null && getFSImage() != null && getFSImage().editLog != null) {
    getFSImage().editLog.close();
  }
}
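This method stops the two threads mentioned above, standbyCheckpointer and editLogTailer, and then closes the edit log.
Next is the enterState method of the new state. The new state is active, so active's enterState is called. It reads as follows, together with the context's startActiveServices method that it calls: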
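The internals of the two stop() calls are not shown in this document. As a generic illustration only, the following sketch shows one common way to stop a periodic background thread cooperatively with a flag, an interrupt and a join. `StoppableWorkerSketch` is a hypothetical class and is not claimed to match the Hadoop implementation.

```java
class StoppableWorkerSketch {
    // Checked by the worker loop; volatile so the update is visible across threads.
    private volatile boolean shouldRun = true;
    private final Thread thread;

    StoppableWorkerSketch(String name) {
        thread = new Thread(() -> {
            while (shouldRun) {
                try {
                    Thread.sleep(50); // stand-in for one round of periodic work
                } catch (InterruptedException e) {
                    // Woken up by stop(); the loop condition decides whether to exit.
                }
            }
        }, name);
        thread.setDaemon(true);
    }

    void start() {
        thread.start();
    }

    /** Asks the worker to finish and waits until it has exited. */
    void stop() {
        shouldRun = false;
        thread.interrupt();
        try {
            thread.join();
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status.
            Thread.currentThread().interrupt();
        }
    }

    boolean isAlive() {
        return thread.isAlive();
    }
}
```

Calling start() and later stop() leaves the worker thread terminated by the time stop() returns, which is the property a state transition needs before starting the services of the next state.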
public void enterState(HAContext context) throws ServiceFailedException {
  try {
    context.startActiveServices();
  } catch (IOException e) {
    throw new ServiceFailedException("Failed to start active services", e);
  }
}

public void startActiveServices() throws IOException {
  try {
    namesystem.startActiveServices();
  } catch (Throwable t) {
    doImmediateShutdown(t);
  }
}
As before, the call is delegated level by level. The namesystem's startActiveServices method, which is called last, reads as follows:
void startActiveServices() throws IOException {
  startingActiveService = true;
  LOG.info("Starting services required for active state");
  writeLock();
  try {
    FSEditLog editLog = getFSImage().getEditLog();
    if (!editLog.isOpenForWrite()) {
      // During startup, we're already open for write during initialization.
      editLog.initJournalsForWrite();
      // May need to recover
      editLog.recoverUnclosedStreams();
      LOG.info("Catching up to latest edits from old active before " +
          "taking over writer role in edits logs");
      editLogTailer.catchupDuringFailover();
      blockManager.setPostponeBlocksFromFuture(false);
      blockManager.getDatanodeManager().markAllDatanodesStale();
      blockManager.clearQueues();
      blockManager.processAllPendingDNMessages();
      // Only need to re-process the queue, If not in SafeMode.
      if (!isInSafeMode()) {
        LOG.info("Reprocessing replication and invalidation queues");
        initializeReplQueues();
      }
      if (LOG.isDebugEnabled()) {
        LOG.debug("NameNode metadata after re-processing " +
            "replication and invalidation queues during failover:\n" +
            metaSaveAsString());
      }
      long nextTxId = getFSImage().getLastAppliedTxId() + 1;
      LOG.info("Will take over writing edit logs at txnid " +
          nextTxId);
      editLog.setNextTxId(nextTxId);
      getFSImage().editLog.openForWrite();
    }
    // Enable quota checks.
    dir.enableQuotaChecks();
    if (haEnabled) {
      // Renew all of the leases before becoming active.
      // This is because, while we were in standby mode,
      // the leases weren't getting renewed on this NN.
      // Give them all a fresh start here.
      leaseManager.renewAllLeases();
    }
    leaseManager.startMonitor();
    startSecretManagerIfNecessary();
    //ResourceMonitor required only at ActiveNN. See HDFS-2914
    this.nnrmthread = new Daemon(new NameNodeResourceMonitor());
    nnrmthread.start();
    nnEditLogRoller = new Daemon(new NameNodeEditLogRoller(
        editLogRollerThreshold, editLogRollerInterval));
    nnEditLogRoller.start();
    if (lazyPersistFileScrubIntervalSec > 0) {
      lazyPersistFileScrubber = new Daemon(new LazyPersistFileScrubber(
          lazyPersistFileScrubIntervalSec));
      lazyPersistFileScrubber.start();
    }
    cacheManager.startMonitorThread();
    blockManager.getDatanodeManager().setShouldSendCachingCommands(true);
  } finally {
    startingActiveService = false;
    checkSafeMode();
    writeUnlock("startActiveServices");
  }
}
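First comes the if (!editLog.isOpenForWrite()) block. It mainly handles edit log streams that were not closed, that is, the inprogress files covered in the earlier analysis of editlog files, and it opens the edit log for writing. The rest of the method, from dir.enableQuotaChecks() onward, starts the threads that the active state needs. The most important of these is NameNodeEditLogRoller.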
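The listing does not show what NameNodeEditLogRoller does internally. Assuming, as its constructor arguments (a threshold and an interval) suggest, that it periodically compares the number of transactions in the open segment with a threshold and rolls the log when the threshold is exceeded, one pass of such a roller could be sketched as follows. `EditLogSketch` and all of its members are hypothetical; only the `lastAppliedTxId + 1` rule is taken from the listing.

```java
class EditLogSketch {
    private long nextTxId;         // id the next logged edit will receive
    private long segmentStartTxId; // first txid of the currently open segment

    EditLogSketch(long lastAppliedTxId) {
        // Mirrors the listing: writing resumes at lastAppliedTxId + 1.
        this.nextTxId = lastAppliedTxId + 1;
        this.segmentStartTxId = this.nextTxId;
    }

    /** Records one edit in the open segment. */
    void logEdit() {
        nextTxId++;
    }

    long getNextTxId() {
        return nextTxId;
    }

    long txnsInOpenSegment() {
        return nextTxId - segmentStartTxId;
    }

    /** Closes the open segment and starts a new, empty one. */
    void roll() {
        segmentStartTxId = nextTxId;
    }

    /** One pass of a hypothetical roller; returns true if the log was rolled. */
    boolean rollIfNeeded(long threshold) {
        if (txnsInOpenSegment() > threshold) {
            roll();
            return true;
        }
        return false;
    }
}
```

A background thread would call rollIfNeeded once per interval. Keeping the check separate from the loop makes the threshold logic easy to test on its own.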