Source Code Walkthrough - Yarn - ResourceManager03 - RM Startup: The RM in Detail

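0x00 Articles in This Series

  1. Source Code Walkthrough - Yarn - ResourceManager01 - Basic Concepts
  2. Source Code Walkthrough - Yarn - ResourceManager02 - RM Startup: Scripts
  3. Source Code Walkthrough - Yarn - ResourceManager03 - RM Startup: The RM in Detail
  4. Source Code Walkthrough - Yarn - ResourceManager04 - RM Scheduling: FairScheduler
  5. Source Code Walkthrough - Yarn - ResourceManager05 - MR Job Submission: Client Side
  6. Source Code Walkthrough - Yarn - ResourceManager06 - MR Job Submission: Server Side
  7. Source Code Walkthrough - Yarn - ResourceManager07 - ShutdownHookManager
  8. Source Code Walkthrough - Yarn - ResourceManager08 - Summary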

0x03 RM Startup: The RM in Detail

3.1 The ResourceManager Inheritance Hierarchy

3.1.1 First Impressions of ResourceManager

Let's start with the class itself:

/**
 * The ResourceManager is the main class that is a set of components.
 * "I am the ResourceManager. All your resources belong to us..."
 */
@SuppressWarnings("unchecked")
public class ResourceManager extends CompositeService implements Recoverable

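As we can see, it extends CompositeService and implements the Recoverable interface. The class diagram for ResourceManager is shown below:

[Figure 1: ResourceManager class diagram]

In the diagram, a red line with a circled plus sign denotes an inner class, a solid blue line denotes inheritance, and a dashed green line denotes interface implementation.

Understanding how Service relates to the classes above it matters a great deal; without it, the code that follows quickly becomes confusing. So we will go through the parent interfaces and classes in detail.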

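3.1.2 The Recoverable Interface and Its Implementation

Let's look at the Recoverable interface first:

// This interface is simple: it declares a single recover method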
public interface Recoverable {
  public void recover(RMState state) throws Exception;
}

Now let's look at the recover method as implemented in ResourceManager:

  @Override
  public void recover(RMState state) throws Exception {
    // recover RMdelegationTokenSecretManager
    rmContext.getRMDelegationTokenSecretManager().recover(state);

    // recover AMRMTokenSecretManager
    rmContext.getAMRMTokenSecretManager().recover(state);

    // recover applications
    rmAppManager.recover(state);

    setSchedulerRecoveryStartAndWaitTime(state, conf);
  }

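This method restores the RMDelegationTokenSecretManager, the AMRMTokenSecretManager, and all applications in turn, and finally calls setSchedulerRecoveryStartAndWaitTime to record the start time and wait time of the scheduler's recovery.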

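3.1.3 The Closeable and Service Interfaces

Now for the other branch of ResourceManager's ancestry. To recap, from bottom to top the chain is ResourceManager -> CompositeService -> AbstractService -> Service -> Closeable -> AutoCloseable.

We start with Closeable. It is simple: a class that implements it holds resources that can be closed, and closing is done by calling close(). The caller has to handle the IOException that close() may throw. Its parent, AutoCloseable, differs only in that its close method is declared to throw Exception, so we won't dwell on it.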

public interface Closeable extends AutoCloseable {
	public void close() throws IOException;
}
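
To make the Closeable contract concrete, here is a minimal sketch of how a Closeable is typically consumed with try-with-resources. DemoResource and CloseableDemo are hypothetical names used only for illustration; they are not Hadoop classes.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical resource for illustration only; not a Hadoop class.
class DemoResource implements Closeable {
    private boolean closed = false;

    @Override
    public void close() throws IOException {
        // A real resource would release file handles, sockets, etc. here.
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

class CloseableDemo {
    // Uses the resource inside try-with-resources, then returns it so the
    // caller can observe that close() was invoked automatically.
    static DemoResource useAndClose() {
        DemoResource resource = new DemoResource();
        try (DemoResource r = resource) {
            // Work with r here; close() runs when this block exits.
        } catch (IOException e) {
            // close() is declared to throw IOException, so it must be handled.
            throw new UncheckedIOException(e);
        }
        return resource;
    }
}
```

Because Service extends Closeable, any Service can be used in the same way, and leaving the try block ends up calling stop() through close().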

Next is the Service interface, in the org.apache.hadoop.service package:

// Defines the lifecycle of a service
@Public
@Evolving
public interface Service extends Closeable {
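   // Service state enum
  public enum STATE {
    /** Constructed but not yet initialized */
    NOTINITED(0, "NOTINITED"),

    /** Initialized but not yet started or stopped */
    INITED(1, "INITED"),

    /** Started and not yet stopped */
    STARTED(2, "STARTED"),

    /** Stopped; no further state transitions are permitted */
    STOPPED(3, "STOPPED");

    // An int value, used for array lookups and in JMX interfaces.
    // Enum's ordinal() could serve the same purpose, but an explicit value gives stronger stability guarantees over time
    private final int value;

    private final String statename;

    // Constructor of the state enum; the arguments match the state numbers and names declared above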
    private STATE(int value, String name) {
      this.value = value;
      this.statename = name;
    }

    public int getValue() {
      return value;
    }

    @Override
    public String toString() {
      return statename;
    }
  }

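 /**
  * Initializes the service.
  * 
  * The state must move from NOTINITED -> INITED,
  * unless init fails with an exception, in which case stop() must be called and the service then moves to STOPPED.
  * 
  * The config parameter is the configuration of the service.
  * Note that a RuntimeException is thrown on any failure during the operation.
  */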
  void init(Configuration config);

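 /**
  * Starts the service.
  * 
  * The state must move from INITED -> STARTED,
  * unless start fails with an exception, in which case stop() must be called and the service then moves to STOPPED.
  * 
  * Note that a RuntimeException is thrown on any failure during the operation.
  */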
  void start();

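 /**
  * Stops the service.
  * 
  * If the service is already in the STOPPED state, this operation must be a no-op.
  * Implementations should try to shut down every part of the service, whatever state the service is in.
  * 
  * A RuntimeException is thrown on any failure during the operation.
  */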
  void stop();

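 /**
  * Stops the service.
  * A version of stop() designed to be usable in Java 7 try-with-resources clauses.
  *
  * Should never actually throw IOException.
  * A RuntimeException is thrown on any failure during the operation.
  */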
  void close() throws IOException;

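  // Registers a listener for service state change events; a no-op if the listener is already registered
  void registerServiceListener(ServiceStateChangeListener listener);

  // Unregisters a previously registered listener; a no-op if it is already unregistered
  void unregisterServiceListener(ServiceStateChangeListener listener);

  // Returns the service name
  String getName();

  // Returns the service configuration
  Configuration getConfig();

  STATE getServiceState();

  // Returns the service start time, or 0 if the service has not started yet
  long getStartTime();

  // Checks whether the service is in the given state (the answer is only valid at the time of the call)
  boolean isInState(STATE state);

  // Returns the first exception raised in the service, or null if none has been recorded
  Throwable getFailureCause();

  // Returns the state the service was in when the failure from getFailureCause() occurred, or null if there was no failure
  STATE getFailureState();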

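  /**
  * Blocks for up to the given time waiting for the service to stop.
  * 
  * The method returns only after all stop operations of the service have completed (successfully or not), or after the timeout has elapsed.
  * It can be called before the service is INITED or STARTED; this eliminates any race condition with the service stopping before the call is made.
  *
  * The timeout parameter is in milliseconds; 0 means wait forever.
  * Returns true if the service stopped within the given time.
  */
  boolean waitForServiceToStop(long timeout);

  // Returns a snapshot of the state transition history (a static list); if nothing is recorded, returns a non-null empty list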
  public List<LifecycleEvent> getLifecycleHistory();
}

That covers the Service interface. Its main job is to define the service state enum and the contract for state transitions.
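
The transition rules described in the Javadoc above can be sketched as a small state model. This is a simplified illustration under stated assumptions, not Hadoop's ServiceStateModel: DemoState and DemoStateModel are hypothetical names, and the rule set (NOTINITED -> INITED, INITED -> STARTED, any state -> STOPPED, re-entering INITED or STARTED allowed as a no-op) is derived only from the contract quoted in this article.

```java
// Hypothetical mirror of Service.STATE, for illustration only.
enum DemoState { NOTINITED, INITED, STARTED, STOPPED }

// Simplified state model; not the Hadoop implementation.
class DemoStateModel {
    private DemoState state = DemoState.NOTINITED;

    // Returns true if moving from 'from' to 'to' is allowed by the assumed rules.
    static boolean isValidTransition(DemoState from, DemoState to) {
        if (to == DemoState.STOPPED) {
            return true; // stop() is allowed from any state
        }
        switch (from) {
            case NOTINITED:
                return to == DemoState.INITED;
            case INITED:
                return to == DemoState.INITED || to == DemoState.STARTED;
            case STARTED:
                return to == DemoState.STARTED;
            default:
                return false; // STOPPED is terminal
        }
    }

    // Enters the proposed state and returns the previous one,
    // or throws if the transition is not allowed.
    synchronized DemoState enterState(DemoState proposed) {
        if (!isValidTransition(state, proposed)) {
            throw new IllegalStateException(
                "Cannot move from " + state + " to " + proposed);
        }
        DemoState old = state;
        state = proposed;
        return old;
    }

    synchronized DemoState getState() {
        return state;
    }
}
```

Returning the previous state from enterState is what lets a caller detect a repeated call: if the old state already equals the new one, nothing needs to be done.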

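3.1.4 AbstractService

Next is AbstractService, the base implementation class for all services. It is very important, so we focus on its commonly used fields and methods. Note that I have reordered parts of the code to make it easier to read: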

@Public
@Evolving
public abstract class AbstractService implements Service{
  // Note that the service name is final: once set, it cannot change
  private final String name;

  // Model class that encapsulates the service state
  private final ServiceStateModel stateModel;
  
  // Constructor
  public AbstractService(String name) {
    this.name = name;
    stateModel = new ServiceStateModel(name);
  }
  
  @Override
  public String getName() {
    return name;
  }
  
   // Returns the service state
  @Override
  public final STATE getServiceState() {
    return stateModel.getState();
  }

  // Timestamp at which the service started; 0 before the service has started
  private long startTime;

  @Override
  public long getStartTime() {
    return startTime;
  }


  // Listeners for state changes of this service
  private final ServiceOperations.ServiceListeners listeners
    = new ServiceOperations.ServiceListeners();
    
  // Not to be confused with the listeners above: these are global listeners shared across all services
  private static ServiceOperations.ServiceListeners globalListeners
    = new ServiceOperations.ServiceListeners();

  // List of lifecycle history events
  private final List<LifecycleEvent> lifecycleHistory
    = new ArrayList<LifecycleEvent>(5);

  // Object used as the lock during state transitions
  private final Object stateChangeLock = new Object();

  // Overridden from Service; checks whether the service is in the given state
  @Override
  public final boolean isInState(Service.STATE expected) {
    return stateModel.isInState(expected);
  }
  
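  // The configuration; null until the service has been initialized
  private volatile Configuration config;
  
  // setConfig is called during init. Beyond that, it should only be needed when a service
  // implementation has to override the initial configuration, for example by replacing it with a subclass of Configuration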
  protected void setConfig(Configuration conf) {
    this.config = conf;
  }
  
  @Override
  public synchronized Configuration getConfig() {
    return config;
  }
  
  // Appends a state change event to the lifecycle history
  private void recordLifecycleEvent() {
    LifecycleEvent event = new LifecycleEvent();
    event.time = System.currentTimeMillis();
    event.state = getServiceState();
    lifecycleHistory.add(event);
  }

  // Overridden from Service; returns a copy of the lifecycle history list
  @Override
  public synchronized List<LifecycleEvent> getLifecycleHistory() {
    return new ArrayList<LifecycleEvent>(lifecycleHistory);
  }
  
  // Enters the given state and, if the state actually changed, records the transition via recordLifecycleEvent.
  // Returns the previous state
  private STATE enterState(STATE newState) {
    assert stateModel != null : "null state in " + name + " " + this.getClass();
    STATE oldState = stateModel.enterState(newState);
    if (oldState != newState) {
      if (LOG.isDebugEnabled()) {
        LOG.debug(
          "Service: " + getName() + " entered state " + getServiceState());
      }
      recordLifecycleEvent();
    }
    return oldState;
  }
  
  // Notifies the local and global listeners of a state change. Exceptions raised while notifying are not propagated.
  private void notifyListeners() {
    try {
      listeners.notifyListeners(this);
      globalListeners.notifyListeners(this);
    } catch (Throwable e) {
      LOG.warn("Exception while notifying listeners of " + this + ": " + e,
               e);
    }
  }
  
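  // Overridden from Service; calls serviceInit
  // As the code shows, if this method is called repeatedly, only the first call has any effect,
  // so serviceInit is never entered a second time
  // Throws ServiceStateException if the configuration is null, the state transition is illegal, or anything else goes wrong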
  @Override
  public void init(Configuration conf) {
    if (conf == null) {
      throw new ServiceStateException("Cannot initialize service "
                                      + getName() + ": null configuration");
    }
    if (isInState(STATE.INITED)) {
      return;
    }
    synchronized (stateChangeLock) {
      if (enterState(STATE.INITED) != STATE.INITED) {
        setConfig(conf);
        try {
          serviceInit(config);
          if (isInState(STATE.INITED)) {
            // If the service ended up in the INITED state, notify the listeners
            notifyListeners();
          }
        } catch (Exception e) {
          noteFailure(e);
          ServiceOperations.stopQuietly(LOG, this);
          throw ServiceStateException.convert(e);
        }
      }
    }
  }

  // Overridden from Service; calls serviceStart
  // Throws ServiceStateException if the current state does not permit start
  @Override
  public void start() {
    if (isInState(STATE.STARTED)) {
      return;
    }
    synchronized (stateChangeLock) {
      if (stateModel.enterState(STATE.STARTED) != STATE.STARTED) {
        try {
          startTime = System.currentTimeMillis();
          serviceStart();
          if (isInState(STATE.STARTED)) {
            // If the service started successfully, notify the listeners
            notifyListeners();
          }
        } catch (Exception e) {
          noteFailure(e);
          ServiceOperations.stopQuietly(LOG, this);
          throw ServiceStateException.convert(e);
        }
      }
    }
  }
  
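  // The exception that caused the service to fail
  private Exception failureCause;

  // The state the service was in when it failed; only meaningful if the service failed because of an error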
  private STATE failureState = null;
  
  @Override
  public final synchronized Throwable getFailureCause() {
    return failureCause;
  }

  @Override
  public synchronized STATE getFailureState() {
    return failureState;
  }
  
  // Records the exception that triggered the failure
  protected final void noteFailure(Exception exception) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("noteFailure " + exception, null);
    }
    if (exception == null) {
      //make sure failure logic doesn't itself cause problems
      return;
    }
    //record the failure details, and log it
    synchronized (this) {
      if (failureCause == null) {
        failureCause = exception;
        failureState = getServiceState();
        LOG.info("Service " + getName()
                 + " failed in state " + failureState
                 + "; cause: " + exception,
                 exception);
      }
    }
  }
  
  // Helps waitForServiceToStop coordinate across threads; true means the service has terminated
  private final AtomicBoolean terminationNotification =
    new AtomicBoolean(false);

  // Overridden from Service; calls serviceStop
  // If serviceStop throws, the exception is recorded and rethrown as a ServiceStateException
  @Override
  public void stop() {
    if (isInState(STATE.STOPPED)) {
      return;
    }
    synchronized (stateChangeLock) {
      if (enterState(STATE.STOPPED) != STATE.STOPPED) {
        try {
          serviceStop();
        } catch (Exception e) {
          //stop-time exceptions are logged if they are the first one,
          noteFailure(e);
          throw ServiceStateException.convert(e);
        } finally {
          // Finally, record that the service has terminated
          terminationNotification.set(true);
          // Wake up all threads blocked on terminationNotification
          synchronized (terminationNotification) {
            terminationNotification.notifyAll();
          }
          //notify anything listening for events
          notifyListeners();
        }
      } else {
        // The service was already STOPPED before this call, so just log at debug level
        if (LOG.isDebugEnabled()) {
          LOG.debug("Ignoring re-entrant call to stop()");
        }
      }
    }
  }

  @Override
  public final void close() throws IOException {
    stop();
  }

  // Waits for the service to terminate within the given time
  @Override
  public final boolean waitForServiceToStop(long timeout) {
    boolean completed = terminationNotification.get();
    // While the service has not terminated, wait up to the timeout and then check again
    while (!completed) {
      try {
        // Acquire the monitor of terminationNotification
        synchronized(terminationNotification) {
          terminationNotification.wait(timeout);
        }
        // here there has been a timeout, the object has terminated,
        // or there has been a spurious wakeup (which we ignore)
        completed = true;
      } catch (InterruptedException e) {
        // interrupted; have another look at the flag
        completed = terminationNotification.get();
      }
    }
    // Finally, return whether the service has terminated
    return terminationNotification.get();
  }

  /* ===================================================================== */
  /* Override Points */
  /* ===================================================================== */

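 /**
  * All initialization code needed by a service.
  * 
  * This method is called only once during the lifecycle of a specific service instance.
  * 
  * Implementations do not need to be synchronized, because init already prevents re-entry.
  * 
  * The base implementation checks whether the conf argument is the same object as the current config and, if not, updates the current config.
  */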
  protected void serviceInit(Configuration conf) throws Exception {
    if (conf != config) {
      LOG.debug("Config has been overridden during init");
      setConfig(conf);
    }
  }

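 /**
  * Called during the INITED -> STARTED transition.
  * 
  * This method is called only once during the lifecycle of a specific service instance.
  * 
  * Implementations do not need to be synchronized, because start already prevents re-entry.
  * 
  * Throw exceptions as needed; the caller catches them and triggers a service stop.
  */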
  protected void serviceStart() throws Exception {}

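 /**
  * Called during the transition to STOPPED.
  * 
  * This method is called only once during the lifecycle of a specific service instance.
  * 
  * Implementations do not need to be synchronized, because stop already prevents re-entry.
  * 
  * Implementations must be very robust against failures, including checking for null references.
  * 
  * Throw exceptions as needed; the caller catches and logs them.
  */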
  protected void serviceStop() throws Exception {

  }

  @Override
  public void registerServiceListener(ServiceStateChangeListener l) {
    listeners.add(l);
  }

  @Override
  public void unregisterServiceListener(ServiceStateChangeListener l) {
    listeners.remove(l);
  }

  // Registers a global listener for state changes of every service in the JVM. Not to be confused with the per-instance listeners above
  public static void registerGlobalListener(ServiceStateChangeListener l) {
    globalListeners.add(l);
  }

  // Unregisters a global listener; returns true if the listener was found and removed
  public static boolean unregisterGlobalListener(ServiceStateChangeListener l) {
    return globalListeners.remove(l);
  }

  // Package-scoped method for testing: resets the list of global listeners
  @VisibleForTesting
  static void resetGlobalListeners() {
    globalListeners.reset();
  }

  // Overrides toString
  @Override
  public String toString() {
    return "Service " + name + " in state " + stateModel;
  }

} 

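That covers AbstractService. It implements most of the state and state-transition methods declared in the Service interface, which makes it very important, and we now have a first understanding of how the whole scheme works. For example, calling init on any class that extends AbstractService follows the same routine: AbstractService.init runs first to control the state transition, and then the class's own serviceInit does the actual initialization. Keep this flow in mind, because many of the service classes below work exactly this way.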
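
The init -> serviceInit routine is an instance of the template method pattern. The following is a minimal sketch of that pattern, assuming a much simplified lifecycle with a single boolean flag; LifecycleBase and CountingService are hypothetical names, not Hadoop classes.

```java
// Hypothetical base class: owns the state check, delegates the real work.
abstract class LifecycleBase {
    private final Object stateChangeLock = new Object();
    private boolean inited = false;

    // Final so that subclasses cannot bypass the state handling.
    public final void init() {
        synchronized (stateChangeLock) {
            if (inited) {
                return; // repeated calls are ignored
            }
            inited = true;
            serviceInit(); // hook implemented by the subclass
        }
    }

    public final boolean isInited() {
        synchronized (stateChangeLock) {
            return inited;
        }
    }

    // Override point: subclasses put their initialization logic here.
    protected void serviceInit() {
    }
}

// Hypothetical subclass that counts how often its hook runs.
class CountingService extends LifecycleBase {
    int initCalls = 0;

    @Override
    protected void serviceInit() {
        initCalls++;
    }
}
```

Calling init() twice on a CountingService runs serviceInit only once, which is the same guarantee that lets the real serviceInit implementations skip their own synchronization.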

3.1.5 CompositeService

Next is CompositeService, the direct parent class of ResourceManager:

/**
 * Composition of services.
 */
@Public
@Evolving
public class CompositeService extends AbstractService

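The Javadoc is terse: this class is a composition of services. The inheritance is simple too: its only parent is AbstractService. Here are its main members and methods.

First comes the static nested class CompositeServiceShutdownHook, a Runnable. Its self-description is rather bold: when the JVM shuts down, it stops the CompositeService gracefully.

As we will see later, the RM registers it with ShutdownHookManager.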

/**
   * JVM Shutdown hook for CompositeService which will stop the give
   * CompositeService gracefully in case of JVM shutdown.
   */
  public static class CompositeServiceShutdownHook implements Runnable {

    private CompositeService compositeService;

    public CompositeServiceShutdownHook(CompositeService compositeService) {
      this.compositeService = compositeService;
    }

    @Override
    public void run() {
      ServiceOperations.stopQuietly(compositeService);
    }
  }

Here are a few of its fields and methods:

// This constant means: always stop all services, both STARTED and INITED ones.
protected static final boolean STOP_ONLY_STARTED_SERVICES = false;

// List of child services; as noted above, this class is a composition of several services
private final List<Service> serviceList = new ArrayList<Service>();

// Constructor; takes the service name
public CompositeService(String name) {
    super(name);
  }

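// Returns a copy of the child service list. Note that serviceList itself is used as the lock,
// so a concurrent addService call and this copy cannot run at the same time; one waits for the other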
public List<Service> getServices() {
    synchronized (serviceList) {
      return new ArrayList<Service>(serviceList);
    }
  }

// Adds a service to the list of child services
protected void addService(Service service) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Adding service " + service.getName());
    }
    synchronized (serviceList) {
      serviceList.add(service);
    }
  }

// Adds the object as a service, but only if it implements Service
protected boolean addIfService(Object object) {
    if (object instanceof Service) {
      addService((Service) object);
      return true;
    } else {
      return false;
    }
  }

// Removes a service
protected synchronized boolean removeService(Service service) {
    synchronized (serviceList) {
      return serviceList.remove(service);
    }
  }

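Next come several especially important methods, overridden from AbstractService, that carry out the state transitions:

// Service initialization: take the list of services held by this object and call init on each one in turn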
protected void serviceInit(Configuration conf) throws Exception {
    List<Service> services = getServices();
    if (LOG.isDebugEnabled()) {
      LOG.debug(getName() + ": initing services, size=" + services.size());
    }
    for (Service service : services) {
      service.init(conf);
    }
    super.serviceInit(conf);
  }

  protected void serviceStart() throws Exception {
    List<Service> services = getServices();
    if (LOG.isDebugEnabled()) {
      LOG.debug(getName() + ": starting services, size=" + services.size());
    }
    for (Service service : services) {
      // start the service. If this fails that service
      // will be stopped and an exception raised
      service.start();
    }
    super.serviceStart();
  }

  protected void serviceStop() throws Exception {
    //stop all services that were started
    int numOfServicesToStop = serviceList.size();
    if (LOG.isDebugEnabled()) {
      LOG.debug(getName() + ": stopping services, size=" + numOfServicesToStop);
    }
    stop(numOfServicesToStop, STOP_ONLY_STARTED_SERVICES);
    super.serviceStop();
  }

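That wraps up CompositeService. It is simple: it groups several services together and overrides the state-transition hooks such as serviceInit so that they drive the corresponding transitions of every child service.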
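
The composite idea can be sketched in a few lines. This is an illustration under stated assumptions, not the Hadoop code: MiniService, RecordingService and MiniComposite are hypothetical names, and stopping the children in reverse order of registration is an assumption about the stop(int, boolean) helper, which is not shown in the excerpt above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, much reduced service contract.
interface MiniService {
    void init();
    void stop();
}

// Hypothetical child service that records what happened to it.
class RecordingService implements MiniService {
    private final String name;
    private final List<String> log;

    RecordingService(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    @Override
    public void init() {
        log.add("init " + name);
    }

    @Override
    public void stop() {
        log.add("stop " + name);
    }
}

// Hypothetical composite: forwards lifecycle calls to its children.
class MiniComposite implements MiniService {
    private final List<MiniService> serviceList = new ArrayList<>();

    void addService(MiniService service) {
        synchronized (serviceList) {
            serviceList.add(service);
        }
    }

    // Returns a snapshot so iteration is safe against concurrent adds.
    List<MiniService> getServices() {
        synchronized (serviceList) {
            return new ArrayList<>(serviceList);
        }
    }

    @Override
    public void init() {
        for (MiniService service : getServices()) {
            service.init(); // children are initialized in registration order
        }
    }

    @Override
    public void stop() {
        List<MiniService> services = getServices();
        // Assumption: stop in reverse order, so later services go down first.
        for (int i = services.size() - 1; i >= 0; i--) {
            services.get(i).stop();
        }
    }
}
```

Iterating over a snapshot rather than the live list is the same design choice getServices() makes above: the lock is held only while copying, not while the children run.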

With that, we have gone through all of ResourceManager's ancestors one by one.

The next section turns to the star of this chapter: ResourceManager.

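3.2 ResourceManager in Detail

3.2.1 The main Method

Because this class is large, reading it top to bottom can be disorienting, so let's start with the main method to see what it does: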

public static void main(String argv[]) {
    // Set the default handler for exceptions that a thread does not catch
    Thread.setDefaultUncaughtExceptionHandler(new YarnUncaughtExceptionHandler());
    // Log the startup message
    StringUtils.startupShutdownMessage(ResourceManager.class, argv, LOG);
    try {
      // Create a Yarn configuration instance
      Configuration conf = new YarnConfiguration();
      // If -format-state-store, then delete RMStateStore; else startup normally
      if (argv.length >= 1) {
        if (argv[0].equals("-format-state-store")) {
          deleteRMStateStore(conf);
        } else if (argv[0].equals("-remove-application-from-state-store")
            && argv.length == 2) {
          removeApplication(conf, argv[1]);
        } else {
          printUsage(System.err);
        }
      } else {
        // With the arguments from our startup script, execution takes this branch
        ResourceManager resourceManager = new ResourceManager();
        // Add the shutdown hook for the RM's CompositeService to the shared ShutdownHookManager; a later article covers how ShutdownHookManager works
        ShutdownHookManager.get().addShutdownHook(
          new CompositeServiceShutdownHook(resourceManager),
          SHUTDOWN_HOOK_PRIORITY);
        // This calls AbstractService.init, which in turn calls ResourceManager.serviceInit
        resourceManager.init(conf);
        // Likewise, this ends up calling ResourceManager.serviceStart
        resourceManager.start();
      }
    } catch (Throwable t) {
      LOG.fatal("Error starting ResourceManager", t);
      System.exit(-1);
    }
  }

3.2.2 serviceInit

The code of ResourceManager.serviceInit is as follows:

@Override
  protected void serviceInit(Configuration conf) throws Exception {
    this.conf = conf;
    // The RM context, which holds many of the RM's important members
    this.rmContext = new RMContextImpl();
    
    // Initialize the configuration provider
    this.configurationProvider =
        ConfigurationProviderFactory.getConfigurationProvider(conf);
    this.configurationProvider.init(this.conf);
    rmContext.setConfigurationProvider(configurationProvider);

    // Load core-site.xml
    InputStream coreSiteXMLInputStream =
        this.configurationProvider.getConfigurationInputStream(this.conf,
            YarnConfiguration.CORE_SITE_CONFIGURATION_FILE);
    if (coreSiteXMLInputStream != null) {
      this.conf.addResource(coreSiteXMLInputStream,
          YarnConfiguration.CORE_SITE_CONFIGURATION_FILE);
    }

    // Refresh the user <-> groups mapping from the loaded core-site.xml
    Groups.getUserToGroupsMappingServiceWithLoadedConfiguration(this.conf)
        .refresh();

    // Refresh the superuser <-> groups mapping from the loaded core-site.xml
    // Or use RM specific configurations to overwrite the common ones first
    // if they exist
    RMServerUtils.processRMProxyUsersConf(conf);
    ProxyUsers.refreshSuperUserGroupsConfiguration(this.conf);

    // Load yarn-site.xml
    InputStream yarnSiteXMLInputStream =
        this.configurationProvider.getConfigurationInputStream(this.conf,
            YarnConfiguration.YARN_SITE_CONFIGURATION_FILE);
    if (yarnSiteXMLInputStream != null) {
      this.conf.addResource(yarnSiteXMLInputStream,
          YarnConfiguration.YARN_SITE_CONFIGURATION_FILE);
    }

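    // Validate the configuration
    validateConfigs(this.conf);
    
    // Record whether RM high availability (HA) is configured
    this.rmContext.setHAEnabled(HAUtil.isHAEnabled(this.conf));
    // If HA is enabled, verify that the current configuration supports it; an exception is thrown if verification fails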
    if (this.rmContext.isHAEnabled()) {
      HAUtil.verifyAndSetConfiguration(this.conf);
    }
    
    // Set UGI and do login
    // If security is enabled, use login user
    // If security is not enabled, use current user
    this.rmLoginUGI = UserGroupInformation.getCurrentUser();
    try {
      doSecureLogin();
    } catch(IOException ie) {
      throw new YarnRuntimeException("Failed to login", ie);
    }

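    // Set up an asynchronous Dispatcher: a dedicated thread handles the various EventTypes for all long-running services.
    // Yarn uses an event-driven programming model, and many different events are handled through this dispatcher. More on this later
    rmDispatcher = setupDispatcher();
    // Add rmDispatcher to CompositeService's serviceList
    addIfService(rmDispatcher);
    // And store it in the RM context
    rmContext.setDispatcher(rmDispatcher);

    // Register the admin service
    // AdminService gives administrators a separate service interface, so that a flood of ordinary user requests cannot starve administrative commands.
    // Administrators manage the cluster through it, for example by dynamically refreshing the node list, the ACLs, or the queue information
    adminService = createAdminService();
    addService(adminService);
    rmContext.setRMAdminService(adminService);
    
    // Create and initialize a batch of services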
    createAndInitActiveServices();

    webAppAddress = WebAppUtils.getWebAppBindURL(this.conf,
                      YarnConfiguration.RM_BIND_HOST,
                      WebAppUtils.getRMWebAppURLWithoutScheme(this.conf));

    // Then call serviceInit of the parent class CompositeService, which initializes every service it manages
    super.serviceInit(this.conf);
  }

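As we can see, the code above mainly initializes services that the RM uses and stores them in the RMContext. Let's look at the important methods it mentions.

First, what does setupDispatcher do?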

  private Dispatcher setupDispatcher() {
    Dispatcher dispatcher = createDispatcher();
    dispatcher.register(RMFatalEventType.class,
        new ResourceManager.RMFatalEventDispatcher());
    return dispatcher;
  }

Fine, two more nested calls. Let's take them one at a time:

  protected Dispatcher createDispatcher() {
    return new AsyncDispatcher();
  }
  

So what actually gets created is an AsyncDispatcher, which has its own section in this series. That completes Dispatcher dispatcher = createDispatcher(); a single line that looks simple but has a fair amount behind it.

Next is dispatcher.register(RMFatalEventType.class, new ResourceManager.RMFatalEventDispatcher()).

This uses the AsyncDispatcher.register method to set RMFatalEventDispatcher as the handler for events of type RMFatalEventType. Here is a quick look at the event type and its handler:

// RMFatalEventType is an enum with two values, each representing a fatal RM error
@InterfaceAudience.Private
public enum RMFatalEventType {
  // Source <- Store
  STATE_STORE_OP_FAILED,

  // Source <- Embedded Elector
  EMBEDDED_ELECTOR_FAILED
}

// This handler implements handle, the only method of the EventHandler interface, which defines how the event is processed
public static class RMFatalEventDispatcher
      implements EventHandler<RMFatalEvent> {

    @Override
    public void handle(RMFatalEvent event) {
      LOG.fatal("Received a " + RMFatalEvent.class.getName() + " of type " +
          event.getType().name() + ". Cause:\n" + event.getCause());
      // Blunt but effective: exit. We won't look any closer
      ExitUtil.terminate(1, event.getCause());
    }
 }
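
The register-then-dispatch mechanism can be sketched as a map from event type class to handler. The sketch below is synchronous and deliberately simplified, whereas the article describes AsyncDispatcher as handing events to a dedicated thread; DemoEvent, DemoHandler, DemoFatalType, DemoFatalEvent and DemoDispatcher are hypothetical names, not the Yarn API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical event contract: every event exposes its enum type.
interface DemoEvent {
    Enum<?> getType();
}

// Hypothetical handler contract with a single handle method.
interface DemoHandler {
    void handle(DemoEvent event);
}

// Hypothetical event type enum, modelled on RMFatalEventType.
enum DemoFatalType { STATE_STORE_OP_FAILED, EMBEDDED_ELECTOR_FAILED }

class DemoFatalEvent implements DemoEvent {
    private final DemoFatalType type;

    DemoFatalEvent(DemoFatalType type) {
        this.type = type;
    }

    @Override
    public Enum<?> getType() {
        return type;
    }
}

// Synchronous stand-in for a dispatcher: looks up the handler by enum class.
class DemoDispatcher {
    private final Map<Class<?>, DemoHandler> handlers = new HashMap<>();

    void register(Class<? extends Enum<?>> eventType, DemoHandler handler) {
        handlers.put(eventType, handler);
    }

    void dispatch(DemoEvent event) {
        // getDeclaringClass() yields the enum class even for constants with bodies.
        Class<?> type = event.getType().getDeclaringClass();
        DemoHandler handler = handlers.get(type);
        if (handler == null) {
            throw new IllegalStateException("No handler registered for " + type);
        }
        handler.handle(event);
    }
}
```

Keying the map by the enum class rather than by individual constants is what allows one handler, such as RMFatalEventDispatcher above, to receive every event of a given type.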

With the Dispatcher covered, let's turn to the createAndInitActiveServices method:

 protected void createAndInitActiveServices() throws Exception {
    // Create the RMActiveServices instance
    activeServices = new RMActiveServices(this);
    // Initialize RMActiveServices
    activeServices.init(conf);
  }

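The initialization of RMActiveServices sets up many important services; see section 3.3.1 for details.

That completes the RM's serviceInit method, whose main job is to initialize the many services the RM uses. Next up is RM.serviceStart.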

3.2.3 serviceStart

 @Override
  protected void serviceStart() throws Exception {
    if (this.rmContext.isHAEnabled()) {
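        // HA is enabled, so transition to the Standby state
      transitionToStandby(true);
    } else {
        // Otherwise, transition to the Active state
      transitionToActive();
    }
    // Start the web app service (this includes setting up user authentication)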
    startWepApp();
    if (getConfig().getBoolean(YarnConfiguration.IS_MINI_YARN_CLUSTER,
        false)) {
      int port = webApp.port();
      WebAppUtils.setRMWebAppPort(conf, port);
    }
    // Finally, call start on every service held by CompositeService
    super.serviceStart();
  }

3.3 Important Inner Classes of the RM

3.3.1 RMActiveServices

  /**
   * RMActiveServices extends CompositeService, so it too is a composition of several services.
   * RMActiveServices handles all the active services in the RM.
   */
  @Private
  public class RMActiveServices extends CompositeService {

    private DelegationTokenRenewer delegationTokenRenewer;
    // The EventHandler for the scheduler
    private EventHandler<SchedulerEvent> schedulerDispatcher;
    private ApplicationMasterLauncher applicationMasterLauncher;
    private ContainerAllocationExpirer containerAllocationExpirer;
    private ResourceManager rm;
    private boolean recoveryEnabled;
    // The context that governs all RM active services
    private RMActiveServiceContext activeServiceContext;

    RMActiveServices(ResourceManager rm) {
      super("RMActiveServices");
      this.rm = rm;
    }

    @Override
    protected void serviceInit(Configuration configuration) throws Exception {
      activeServiceContext = new RMActiveServiceContext();
      rmContext.setActiveServiceContext(activeServiceContext);

      // Controls whether the dispatcher exits the process directly when an error occurs
      conf.setBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY, true);

      // RMSecretManagerService mainly provides token-related services
      rmSecretManagerService = createRMSecretManagerService();
      // This actually calls the method of the parent class CompositeService and adds the service to serviceList
      addService(rmSecretManagerService);

      // Monitors whether containers have expired (checked when an ApplicationMaster is submitted)
      containerAllocationExpirer = new ContainerAllocationExpirer(rmDispatcher);
      addService(containerAllocationExpirer);
      rmContext.setContainerAllocationExpirer(containerAllocationExpirer);

      // AM liveness monitor; extends AbstractLivelinessMonitor, and a callback fires when an AM expires
      AMLivelinessMonitor amLivelinessMonitor = createAMLivelinessMonitor();
      addService(amLivelinessMonitor);
      rmContext.setAMLivelinessMonitor(amLivelinessMonitor);

      // Monitor for AMs that are finishing
      AMLivelinessMonitor amFinishingMonitor = createAMLivelinessMonitor();
      addService(amFinishingMonitor);
      rmContext.setAMFinishingMonitor(amFinishingMonitor);
      
      // Manager for RM node labels
      RMNodeLabelsManager nlm = createNodeLabelManager();
      addService(nlm);
      rmContext.setNodeLabelManager(nlm);

      boolean isRecoveryEnabled = conf.getBoolean(
          YarnConfiguration.RECOVERY_ENABLED,
          YarnConfiguration.DEFAULT_RM_RECOVERY_ENABLED);

      // Manages the storage of the RM state; we use ZKRMStateStore
      // The configuration key is yarn.resourcemanager.store.class
      RMStateStore rmStore = null;
      if (isRecoveryEnabled) {
        recoveryEnabled = true;
        rmStore = RMStateStoreFactory.getStore(conf);
        boolean isWorkPreservingRecoveryEnabled =
            conf.getBoolean(
              YarnConfiguration.RM_WORK_PRESERVING_RECOVERY_ENABLED,
              YarnConfiguration.DEFAULT_RM_WORK_PRESERVING_RECOVERY_ENABLED);
        rmContext
          .setWorkPreservingRecoveryEnabled(isWorkPreservingRecoveryEnabled);
      } else {
        recoveryEnabled = false;
        rmStore = new NullRMStateStore();
      }

      try {
        rmStore.init(conf);
        rmStore.setRMDispatcher(rmDispatcher);
        rmStore.setResourceManager(rm);
      } catch (Exception e) {
        // the Exception from stateStore.init() needs to be handled for
        // HA and we need to give up master status if we got fenced
        LOG.error("Failed to init state store", e);
        throw e;
      }
      rmContext.setStateStore(rmStore);

      if (UserGroupInformation.isSecurityEnabled()) {
        delegationTokenRenewer = createDelegationTokenRenewer();
        rmContext.setDelegationTokenRenewer(delegationTokenRenewer);
      }

      // Persists information about RMApp, RMAppAttempt and RMContainer
      RMApplicationHistoryWriter rmApplicationHistoryWriter =
          createRMApplicationHistoryWriter();
      addService(rmApplicationHistoryWriter);
      rmContext.setRMApplicationHistoryWriter(rmApplicationHistoryWriter);

      // Publishes system metrics data
      SystemMetricsPublisher systemMetricsPublisher = createSystemMetricsPublisher();
      addService(systemMetricsPublisher);
      rmContext.setSystemMetricsPublisher(systemMetricsPublisher);

      // Node list manager; it is also registered with rmDispatcher as the handler for NodesListManagerEventType events (node usable / unusable)
      nodesListManager = new NodesListManager(rmContext);
      rmDispatcher.register(NodesListManagerEventType.class, nodesListManager);
      addService(nodesListManager);
      rmContext.setNodesListManager(nodesListManager);

      // Create the ResourceScheduler; FairScheduler is one of its implementations
      scheduler = createScheduler();
      scheduler.setRMContext(rmContext);
      addIfService(scheduler);
      rmContext.setScheduler(scheduler);

      // Register a handler for SchedulerEventType events with rmDispatcher
      schedulerDispatcher = createSchedulerEventDispatcher();
      addIfService(schedulerDispatcher);
      rmDispatcher.register(SchedulerEventType.class, schedulerDispatcher);

      // Register event handler for RmAppEvents (application events)
      rmDispatcher.register(RMAppEventType.class,
          new ApplicationEventDispatcher(rmContext));

      // Register event handler for RmAppAttemptEvents (application attempt events)
      rmDispatcher.register(RMAppAttemptEventType.class,
          new ApplicationAttemptEventDispatcher(rmContext));

      // Register event handler for RmNodes (RM node events)
      rmDispatcher.register(
          RMNodeEventType.class, new NodeEventDispatcher(rmContext));

      // NM liveness monitor
      nmLivelinessMonitor = createNMLivelinessMonitor();
      addService(nmLivelinessMonitor);

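      /**
       * Create the resource tracker service. It handles requests from NodeManagers, of which there are two main kinds: registration and heartbeat.
       * Registration happens when a NodeManager starts; the request carries the node ID, the upper limit of available resources, and similar information.
       * Heartbeats are periodic and carry the running state of each Container, the list of running Applications, and the node health status (which can be set through a script).
       * These calls go over the RPC protocol that Hadoop implements itself; see YarnRPC for details.
       */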
      resourceTracker = createResourceTrackerService();
      addService(resourceTracker);
      rmContext.setResourceTrackerService(resourceTracker);

      // Monitor the health of the JVM and log anomalies
      DefaultMetricsSystem.initialize("ResourceManager");
      JvmMetrics jm = JvmMetrics.initSingleton("ResourceManager", null);
      pauseMonitor = new JvmPauseMonitor();
      addService(pauseMonitor);
      jm.setPauseMonitor(pauseMonitor);

      // Initialize the Reservation system
      if (conf.getBoolean(YarnConfiguration.RM_RESERVATION_SYSTEM_ENABLE,
          YarnConfiguration.DEFAULT_RM_RESERVATION_SYSTEM_ENABLE)) {
        reservationSystem = createReservationSystem();
        if (reservationSystem != null) {
          reservationSystem.setRMContext(rmContext);
          addIfService(reservationSystem);
          rmContext.setReservationSystem(reservationSystem);
          LOG.info("Initialized Reservation system");
        }
      }

      // Resource preemption monitors
      createPolicyMonitors();

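      /**
       * Manages all submitted ApplicationMasters.
       * This component answers every request from the AMs and implements ApplicationMasterProtocol, the only protocol used between an AM and the RM.
       * Its main tasks are:
       * registering new AMs; handling termination/unregistration requests from any AM that is finishing; authenticating all requests from the different AMs
       * so that requests from legitimate AMs reach the application objects in the RM; and receiving Container allocation and release requests from all running AMs and forwarding them asynchronously to the Yarn scheduler.
       * ApplicationMasterService ensures that, at any point in time, only one thread of any given AM can send requests to the RM, because all RPC requests from an AM are serialized on the RM.
       */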
      masterService = createApplicationMasterService();
      addService(masterService) ;
      rmContext.setApplicationMasterService(masterService);

      // Access control for applications
      applicationACLsManager = new ApplicationACLsManager(conf);

      queueACLsManager = createQueueACLsManager(scheduler, conf);

      // Maintains the list of applications and manages application submission, completion, recovery and so on
      rmAppManager = createRMAppManager();
      // Register event handler for RMAppManagerEvents
      rmDispatcher.register(RMAppManagerEventType.class, rmAppManager);
		
      // Handles the client-facing interface; internally implements ApplicationClientProtocol, the protocol between clients and the RM
      clientRM = createClientRMService();
      addService(clientRM);
      rmContext.setClientRMService(clientRM);

      // Responsible for launching and stopping AMs
      applicationMasterLauncher = createAMLauncher();
      rmDispatcher.register(AMLauncherEventType.class,
          applicationMasterLauncher);

      addService(applicationMasterLauncher);
      if (UserGroupInformation.isSecurityEnabled()) {
        addService(delegationTokenRenewer);
        delegationTokenRenewer.setRMContext(rmContext);
      }

      // Exposes NodeManager status information through JMX
      new RMNMInfo(rmContext, scheduler);

      super.serviceInit(conf);
    }

    @Override
    protected void serviceStart() throws Exception {
      RMStateStore rmStore = rmContext.getStateStore();
      // The state store needs to start irrespective of recoveryEnabled as apps
      // need events to move to further states.
      rmStore.start();

      if(recoveryEnabled) {
        try {
          rmStore.checkVersion();
          if (rmContext.isWorkPreservingRecoveryEnabled()) {
            rmContext.setEpoch(rmStore.getAndIncrementEpoch());
          }
          RMState state = rmStore.loadState();
          recover(state);
        } catch (Exception e) {
          // the Exception from loadState() needs to be handled for
          // HA and we need to give up master status if we got fenced
          LOG.error("Failed to load/recover state", e);
          throw e;
        }
      }

      super.serviceStart();
    }

    @Override
    protected void serviceStop() throws Exception {

      super.serviceStop();

      DefaultMetricsSystem.shutdown();
      if (rmContext != null) {
        RMStateStore store = rmContext.getStateStore();
        try {
          store.close();
        } catch (Exception e) {
          LOG.error("Error closing store.", e);
        }
      }

      super.serviceStop();
    }

    protected void createPolicyMonitors() {
      if (scheduler instanceof PreemptableResourceScheduler
          && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS,
          YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) {
        LOG.info("Loading policy monitors");
        List<SchedulingEditPolicy> policies = conf.getInstances(
            YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES,
            SchedulingEditPolicy.class);
        if (policies.size() > 0) {
          rmDispatcher.register(ContainerPreemptEventType.class,
              new RMContainerPreemptEventDispatcher(
                  (PreemptableResourceScheduler) scheduler));
          for (SchedulingEditPolicy policy : policies) {
            LOG.info("LOADING SchedulingEditPolicy:" + policy.getPolicyName());
            // periodically check whether we need to take action to guarantee
            // constraints
            SchedulingMonitor mon = new SchedulingMonitor(rmContext, policy);
            addService(mon);
          }
        } else {
          LOG.warn("Policy monitors configured (" +
              YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS +
              ") but none specified (" +
              YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES + ")");
        }
      }
    }
  }
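
The serviceInit method above only registers its children with addService and leaves the real work to the final super.serviceInit(conf) call. The following is a minimal sketch of that composite pattern, not Hadoop code: the class CompositeSketch and its string-based "services" are hypothetical stand-ins, and it assumes the CompositeService behaviour of initializing and starting children in registration order and stopping them in reverse order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a composite service; not the Hadoop CompositeService class.
class CompositeSketch {
  private final List<String> children = new ArrayList<>();
  private final List<String> log = new ArrayList<>();

  // Registration only records the child; nothing is initialized yet.
  void addService(String name) {
    children.add(name);
  }

  // Plays the role of super.serviceInit(conf): initialize children in registration order.
  void init() {
    for (String child : children) {
      log.add("init " + child);
    }
  }

  // Start children in registration order.
  void start() {
    for (String child : children) {
      log.add("start " + child);
    }
  }

  // Stop children in reverse registration order.
  void stop() {
    for (int i = children.size() - 1; i >= 0; i--) {
      log.add("stop " + children.get(i));
    }
  }

  List<String> log() {
    return log;
  }

  // Runs one full lifecycle and returns the recorded calls.
  static List<String> demo() {
    CompositeSketch rm = new CompositeSketch();
    rm.addService("masterService");
    rm.addService("clientRM");
    rm.init();
    rm.start();
    rm.stop();
    return rm.log();
  }

  public static void main(String[] args) {
    for (String line : demo()) {
      System.out.println(line);
    }
  }
}
```

In this sketch the order of the addService calls is the only thing that decides the lifecycle order, which is why the registration order in serviceInit matters.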

3.3.2 SchedulerEventDispatcher

This class is dedicated to handling events of type SchedulerEventType.

 // It extends AbstractService, which we are already familiar with.
 // It also implements EventHandler, so it is an event handler itself and provides a handle method.
 public static class SchedulerEventDispatcher extends AbstractService
      implements EventHandler<SchedulerEvent> {
    // The scheduler instance
    private final ResourceScheduler scheduler;
    // It also has its own blocking event queue
    private final BlockingQueue<SchedulerEvent> eventQueue =
      new LinkedBlockingQueue<SchedulerEvent>();
    private volatile int lastEventQueueSizeLogged = 0;
    // The thread that processes events
    private final Thread eventProcessor;
    // Flag telling the thread whether it should stop
    private volatile boolean stopped = false;
    // Whether an exception raised while handling an event should make the process exit
    private boolean shouldExitOnError = false;
    // Constructor; as described earlier, it is invoked by the RM during serviceInit
    public SchedulerEventDispatcher(ResourceScheduler scheduler) {
      super(SchedulerEventDispatcher.class.getName());
      this.scheduler = scheduler;
      this.eventProcessor = new Thread(new EventProcessor());
      this.eventProcessor.setName("ResourceManager Event Processor");
    }

    @Override
    protected void serviceInit(Configuration conf) throws Exception {
      this.shouldExitOnError =
          conf.getBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY,
            Dispatcher.DEFAULT_DISPATCHER_EXIT_ON_ERROR);
      super.serviceInit(conf);
    }

    @Override
    protected void serviceStart() throws Exception {
      this.eventProcessor.start();
      super.serviceStart();
    }

    // The event-processing thread body, similar to the runnable built by AsyncDispatcher's createThread
    private final class EventProcessor implements Runnable {
      @Override
      public void run() {

        SchedulerEvent event;

        while (!stopped && !Thread.currentThread().isInterrupted()) {
          try {
            event = eventQueue.take();
          } catch (InterruptedException e) {
            LOG.error("Returning, interrupted : " + e);
            return; // TODO: Kill RM.
          }

          try {
            // Note: the event is handed straight to our protagonist, the scheduler!
            scheduler.handle(event);
          } catch (Throwable t) {
            // An error occurred, but we are shutting down anyway.
            // If it was an InterruptedException, the very act of 
            // shutdown could have caused it and is probably harmless.
            if (stopped) {
              LOG.warn("Exception during shutdown: ", t);
              break;
            }
            LOG.fatal("Error in handling event type " + event.getType()
                + " to the scheduler", t);
            if (shouldExitOnError
                && !ShutdownHookManager.get().isShutdownInProgress()) {
              LOG.info("Exiting, bbye..");
              System.exit(-1);
            }
          }
        }
      }
    }

    @Override
    protected void serviceStop() throws Exception {
      this.stopped = true;
      this.eventProcessor.interrupt();
      try {
        this.eventProcessor.join();
      } catch (InterruptedException e) {
        throw new YarnRuntimeException(e);
      }
      super.serviceStop();
    }

    @Override
    public void handle(SchedulerEvent event) {
      try {
        int qSize = eventQueue.size();
        if (qSize != 0 && qSize % 1000 == 0
            && lastEventQueueSizeLogged != qSize) {
          lastEventQueueSizeLogged = qSize;
          LOG.info("Size of scheduler event-queue is " + qSize);
        }
        int remCapacity = eventQueue.remainingCapacity();
        if (remCapacity < 1000) {
          LOG.info("Very low remaining capacity on scheduler event queue: "
              + remCapacity);
        }
        // "Handling" an event here just means putting it on our own blocking queue for the processor thread
        this.eventQueue.put(event);
      } catch (InterruptedException e) {
        LOG.info("Interrupted. Trying to exit gracefully.");
      }
    }
  }
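
The class above boils down to a producer/consumer hand-off: handle() only enqueues, and one dedicated thread takes events off the queue and passes them to the scheduler. Below is a minimal sketch of that shape under stated assumptions: QueueingDispatcherSketch is a hypothetical name, events are plain strings, and a Consumer<String> stands in for ResourceScheduler.handle. Error handling and exit-on-error are left out.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch of the queue-plus-single-thread pattern; not Hadoop code.
class QueueingDispatcherSketch {
  private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
  private final Thread eventProcessor;
  private volatile boolean stopped = false;

  QueueingDispatcherSketch(Consumer<String> scheduler) {
    this.eventProcessor = new Thread(() -> {
      while (!stopped && !Thread.currentThread().isInterrupted()) {
        String event;
        try {
          event = eventQueue.take();   // blocks until an event arrives
        } catch (InterruptedException e) {
          return;                      // interrupted by stop()
        }
        scheduler.accept(event);       // hand the event to the "scheduler"
      }
    }, "sketch-event-processor");
  }

  void start() {
    eventProcessor.start();
  }

  // Producer side: enqueue and return immediately.
  void handle(String event) {
    try {
      eventQueue.put(event);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  // Set the flag, interrupt the blocked take(), then wait for the thread to exit.
  void stop() {
    stopped = true;
    eventProcessor.interrupt();
    try {
      eventProcessor.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  // Sends three illustrative events and returns them in the order the consumer saw them.
  static List<String> demo() {
    List<String> seen = new CopyOnWriteArrayList<>();
    CountDownLatch done = new CountDownLatch(3);
    QueueingDispatcherSketch dispatcher = new QueueingDispatcherSketch(event -> {
      seen.add(event);
      done.countDown();
    });
    dispatcher.start();
    dispatcher.handle("NODE_ADDED");
    dispatcher.handle("APP_ADDED");
    dispatcher.handle("NODE_UPDATE");
    try {
      done.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    dispatcher.stop();
    return seen;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Because a single thread drains a FIFO queue, the consumer sees events in the order they were enqueued, and callers of handle() never wait for the scheduler itself.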

3.4 Other important classes

Several important classes showed up during the service startup described earlier; this section walks through them.

3.4.1 ServiceStateModel

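This section introduces ServiceStateModel, the state-management class used by AbstractService. The allowed state transitions are shown in the figure below:

[Figure 2: allowed state transitions of a Service]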

// In one sentence: implements the state model of a service class
@Public
@Evolving
public class ServiceStateModel {
  // State transition matrix: true means the state of the row (on the left) may move to the state of that column.
  private static final boolean[][] statemap =
    {
      //                uninited inited started stopped
      /* uninited  */    {false, true,  false,  true},
      /* inited    */    {false, true,  true,   true},
      /* started   */    {false, false, true,   true},
      /* stopped   */    {false, false, false,  true},
    };
  
  // The current state; volatile so that writes are visible to all threads
  private volatile Service.STATE state;

  // Service name
  private String name;

  // The state is initialized to NOTINITED on construction
  public ServiceStateModel(String name) {
    this(name, Service.STATE.NOTINITED);
  }

  public ServiceStateModel(String name, Service.STATE state) {
    this.state = state;
    this.name = name;
  }

  public Service.STATE getState() {
    return state;
  }

  public boolean isInState(Service.STATE proposed) {
    return state.equals(proposed);
  }

  // Verify that the current state is the expected one; throw otherwise
  public void ensureCurrentState(Service.STATE expectedState) {
    if (state != expectedState) {
      throw new ServiceStateException(name+ ": for this operation, the " +
                                      "current service state must be "
                                      + expectedState
                                      + " instead of " + state);
    }
  }

  // Thread-safe attempt to enter the proposed state: throws if the transition is illegal, returns the old state on success
  public synchronized Service.STATE enterState(Service.STATE proposed) {
    checkStateTransition(name, state, proposed);
    Service.STATE oldState = state;
    //atomic write of the new state
    state = proposed;
    return oldState;
  }

  // Check whether a state transition is legal; throw if it is not
  public static void checkStateTransition(String name,
                                          Service.STATE state,
                                          Service.STATE proposed) {
    if (!isValidStateTransition(state, proposed)) {
      throw new ServiceStateException(name + " cannot enter state "
                                      + proposed + " from state " + state);
    }
  }

  // Look up the matrix above to decide whether the transition is legal
  public static boolean isValidStateTransition(Service.STATE current,
                                               Service.STATE proposed) {
    boolean[] row = statemap[current.getValue()];
    return row[proposed.getValue()];
  }

  // toString is overridden to show the service name and current state
  @Override
  public String toString() {
    return (name.isEmpty() ? "" : ((name) + ": "))
            + state.toString();
  }
}

Simple, isn't it? A matrix plus a few small methods is all it takes to manage the state of a service, the RM being one such service.
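
To make the matrix lookup concrete, here is a minimal stand-alone sketch. It copies the matrix values from the listing above, but the class name StateModelSketch and its nested State enum are hypothetical, and it indexes the matrix with ordinal() as an assumption in place of Service.STATE.getValue().

```java
// Hypothetical stand-alone sketch; names mirror ServiceStateModel but this is not Hadoop code.
class StateModelSketch {
  enum State { NOTINITED, INITED, STARTED, STOPPED }

  // Row = current state, column = proposed state, same layout as statemap above.
  private static final boolean[][] STATEMAP = {
      /* NOTINITED */ {false, true,  false, true},
      /* INITED    */ {false, true,  true,  true},
      /* STARTED   */ {false, false, true,  true},
      /* STOPPED   */ {false, false, false, true},
  };

  private State state = State.NOTINITED;

  // Assumption: ordinal() plays the role of Service.STATE.getValue().
  static boolean isValidStateTransition(State current, State proposed) {
    return STATEMAP[current.ordinal()][proposed.ordinal()];
  }

  // Enter the proposed state if the matrix allows it and return the previous state.
  synchronized State enterState(State proposed) {
    if (!isValidStateTransition(state, proposed)) {
      throw new IllegalStateException(
          "cannot enter state " + proposed + " from state " + state);
    }
    State oldState = state;
    state = proposed;
    return oldState;
  }

  synchronized State getState() {
    return state;
  }

  public static void main(String[] args) {
    StateModelSketch model = new StateModelSketch();
    model.enterState(State.INITED);
    model.enterState(State.STARTED);
    model.enterState(State.STOPPED);
    System.out.println("final state: " + model.getState());
    System.out.println("STOPPED -> STARTED allowed? "
        + isValidStateTransition(State.STOPPED, State.STARTED));
  }
}
```

Two details of the matrix are worth noticing: every state may move to STOPPED, including NOTINITED, and each state other than NOTINITED may "transition" to itself, which is what makes repeated init, start or stop calls harmless.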

3.4.2 AsyncDispatcher

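 // Dispatches events in a separate thread. Currently a single thread does all the dispatching,
 // but potentially there could be one channel per event type class, with a thread pool dispatching the events.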
@SuppressWarnings("rawtypes")
@Public
@Evolving
public class AsyncDispatcher extends AbstractService implements Dispatcher{
 // Blocking queue holding the Event objects that wrap the various events
 private final BlockingQueue<Event> eventQueue;
 
 // The thread that handles all events
 private Thread eventHandlingThread;
 
 // Map from event type class to the event handler registered for it
 protected final Map<Class<? extends Enum>, EventHandler> eventDispatchers;
 
 // Marks whether the dispatcher should stop
 private volatile boolean stopped = false;
 
 // Configuration flag enabling/disabling draining of the dispatcher's events on stop
 private volatile boolean drainEventsOnStop = false;

 // Indicates that all of the dispatcher's remaining events have been drained and processed on stop
 private volatile boolean drained = true;
 
 // Object (lock) to wait on until the queue has been drained
 private Object waitForDrained = new Object();
 
 // Only relevant when drainEventsOnStop is true: when set, newly arriving events are kept out of the queue while stopping
 private volatile boolean blockNewEvents = false;
 
 // Queue size at the time it was last logged (used by GenericEventHandler below)
 private volatile int lastEventQueueSizeLogged = 0;
 
 // Whether a dispatch failure should make the process exit (set in serviceInit)
 private boolean exitOnDispatchException;
 
 // GenericEventHandler is responsible for putting events into eventQueue
 private final EventHandler handlerInstance = new GenericEventHandler();
  
 public AsyncDispatcher() {
	this(new LinkedBlockingQueue<Event>());
 }

 public AsyncDispatcher(BlockingQueue<Event> eventQueue) {
    // Uses the AbstractService constructor to create a service named "Dispatcher" whose initial state is NOTINITED
    super("Dispatcher");
    this.eventQueue = eventQueue;
    this.eventDispatchers = new HashMap<Class<? extends Enum>, EventHandler>();
 }
 
 @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // Read from the configuration whether a dispatcher error should make the process exit; defaults to false
    this.exitOnDispatchException =
        conf.getBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY,
          Dispatcher.DEFAULT_DISPATCHER_EXIT_ON_ERROR);
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    super.serviceStart();
    // Start the event-handling thread
    eventHandlingThread = new Thread(createThread());
    eventHandlingThread.setName("AsyncDispatcher event handler");
    eventHandlingThread.start();
  }
  
  // Stops the service
  @Override
  protected void serviceStop() throws Exception {
    if (drainEventsOnStop) {
      // Note: blockNewEvents is set to true here; GenericEventHandler and the event-handling thread below both check it
      blockNewEvents = true;
      LOG.info("AsyncDispatcher is draining to stop, igonring any new events.");
      synchronized (waitForDrained) {
        // Wait until the queue has been drained, as long as the event-handling thread is still alive
        while (!drained && eventHandlingThread.isAlive()) {
          waitForDrained.wait(1000);
          LOG.info("Waiting for AsyncDispatcher to drain. Thread state is :" +
              eventHandlingThread.getState());
        }
      }
    }
    stopped = true;
    // Once the remaining events have been handled, interrupt eventHandlingThread to stop it
    if (eventHandlingThread != null) {
      eventHandlingThread.interrupt();
      try {
        eventHandlingThread.join();
      } catch (InterruptedException ie) {
        LOG.warn("Interrupted Exception while stopping", ie);
      }
    }

    // stop all the components
    super.serviceStop();
  }
 
  @Override
  public EventHandler getEventHandler() {
    return handlerInstance;
  }
  
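  // GenericEventHandler is the producer side.
  // The usual flow: some object calls its handle method, which puts the event into eventQueue;
  // the thread built by createThread then takes the event out and calls dispatch;
  // dispatch looks up the handler registered for the event's type and calls that handler's handle method.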
  class GenericEventHandler implements EventHandler<Event> {
    public void handle(Event event) {
      if (blockNewEvents) {
        return;
      }
      drained = false;

      /* all this method does is enqueue all the events onto the queue */
      int qSize = eventQueue.size();
      if (qSize != 0 && qSize % 1000 == 0
          && lastEventQueueSizeLogged != qSize) {
        lastEventQueueSizeLogged = qSize;
        LOG.info("Size of event-queue is " + qSize);
      }
      int remCapacity = eventQueue.remainingCapacity();
      if (remCapacity < 1000) {
        LOG.warn("Very low remaining capacity in the event-queue: "
            + remCapacity);
      }
      try {
        // This is where the event is put on the blocking event queue
        eventQueue.put(event);
      } catch (InterruptedException e) {
        if (!stopped) {
          LOG.warn("AsyncDispatcher thread interrupted", e);
        }
        // If the queue is empty, reset drained to true; otherwise the thread waiting on waitForDrained would hang on stop
        drained = eventQueue.isEmpty();
        throw new YarnRuntimeException(e);
      }
    };
  }
  
  // This is the body of the thread that handles all events
  Runnable createThread() {
    return new Runnable() {
      @Override
      public void run() {
        while (!stopped && !Thread.currentThread().isInterrupted()) {
          drained = eventQueue.isEmpty();
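          // blockNewEvents is only set to true when the dispatcher is draining in order to stop,
          // so this check avoids the cost of taking the lock and calling notify on every iteration of normal operation.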
          if (blockNewEvents) {
            synchronized (waitForDrained) {
              if (drained) {
                waitForDrained.notify();
              }
            }
          }
          Event event;
          try {
            // Take the next event
            event = eventQueue.take();
          } catch(InterruptedException ie) {
            if (!stopped) {
              LOG.warn("AsyncDispatcher thread interrupted", ie);
            }
            return;
          }
          // Dispatch the event
          if (event != null) {
            dispatch(event);
          }
        }
      }
    };
  }

  // Dispatches a single event
  protected void dispatch(Event event) {
    //all events go thru this loop
    if (LOG.isDebugEnabled()) {
      LOG.debug("Dispatching the event " + event.getClass().getName() + "."
          + event.toString());
    }
    // Get the enum class that declares this event's type.
    // The wildcard is bounded by Enum because every event type is defined as an enum.
    Class<? extends Enum> type = event.getType().getDeclaringClass();
    try{
      // Look up the handler registered for this event type
      EventHandler handler = eventDispatchers.get(type);
      if(handler != null) {
        handler.handle(event);
      } else {
        throw new Exception("No handler for registered for " + type);
      }
    } catch (Throwable t) {
      LOG.fatal("Error in dispatcher thread", t);
      // Exit the process only when all of the following conditions hold
      if (exitOnDispatchException
          && (ShutdownHookManager.get().isShutdownInProgress()) == false
          && stopped == false) {
        Thread shutDownThread = new Thread(createShutDownThread());
        shutDownThread.setName("AsyncDispatcher ShutDown handler");
        shutDownThread.start();
      }
    }
  }
  
  // The shutdown thread used in the exception path above
  Runnable createShutDownThread() {
    return new Runnable() {
      @Override
      public void run() {
        LOG.info("Exiting, bbye..");
        System.exit(-1);
      }
    };
  }
  
  // Registers an event type and its handler with the AsyncDispatcher
  @Override
  public void register(Class<? extends Enum> eventType,
      EventHandler handler) {
    // First check whether a handler is already registered for this event type
    EventHandler<Event> registeredHandler = (EventHandler<Event>)
    eventDispatchers.get(eventType);
    LOG.info("Registering " + eventType + " for " + handler.getClass());
    if (registeredHandler == null) {
      eventDispatchers.put(eventType, handler);
    } else if (!(registeredHandler instanceof MultiListenerHandler)){
      // The existing handler is not a MultiListenerHandler: put it and the new handler into a newly created multiHandler
      MultiListenerHandler multiHandler = new MultiListenerHandler();
      multiHandler.addHandler(registeredHandler);
      multiHandler.addHandler(handler);
      eventDispatchers.put(eventType, multiHandler);
    } else {
      // Already a MultiListenerHandler: just add the new handler
      MultiListenerHandler multiHandler
      = (MultiListenerHandler) registeredHandler;
      multiHandler.addHandler(handler);
    }
  }
}
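
The routing part of AsyncDispatcher (register plus dispatch) can be tried out in isolation. The sketch below is hypothetical and deliberately simplified: DispatchRoutingSketch dispatches synchronously with no queue or thread, the enums are invented for illustration, a list of handlers per type stands in for MultiListenerHandler, and a missing handler throws to the caller instead of being logged as in the listing above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch of type-based event routing; not Hadoop code.
class DispatchRoutingSketch {
  // Invented event type enums, for illustration only.
  enum NodeEventType { ADDED, REMOVED }
  enum AppEventType { SUBMITTED }
  enum UnregisteredEventType { PING }

  // Handlers keyed by the enum class of the event type, like eventDispatchers above.
  private final Map<Class<?>, List<Consumer<Enum<?>>>> handlers = new HashMap<>();

  // Registering a second handler for the same type appends to the list,
  // which plays the role of MultiListenerHandler.
  void register(Class<? extends Enum<?>> eventType, Consumer<Enum<?>> handler) {
    handlers.computeIfAbsent(eventType, key -> new ArrayList<>()).add(handler);
  }

  // getDeclaringClass() returns the enum class itself, so every constant of one
  // enum is routed to the same handlers.
  void dispatch(Enum<?> type) {
    Class<?> key = type.getDeclaringClass();
    List<Consumer<Enum<?>>> registered = handlers.get(key);
    if (registered == null) {
      throw new IllegalStateException("No handler registered for " + key);
    }
    for (Consumer<Enum<?>> handler : registered) {
      handler.accept(type);
    }
  }

  // Registers three handlers, dispatches two events and returns what the handlers recorded.
  static List<String> demo() {
    List<String> log = new ArrayList<>();
    DispatchRoutingSketch dispatcher = new DispatchRoutingSketch();
    dispatcher.register(NodeEventType.class, t -> log.add("nodes:" + t));
    dispatcher.register(NodeEventType.class, t -> log.add("audit:" + t));
    dispatcher.register(AppEventType.class, t -> log.add("apps:" + t));
    dispatcher.dispatch(NodeEventType.ADDED);
    dispatcher.dispatch(AppEventType.SUBMITTED);
    return log;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Keying the map by the enum class rather than by individual constants is what lets one handler own a whole family of events and decide internally what each constant means.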

3.5 Summary

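This chapter covered the ResourceManager's inheritance hierarchy and its own code, giving us a rough picture of what the RM mainly does. It also covered ServiceStateModel, which provides state transitions and shows how that mechanism works. Finally, it covered the event dispatcher AsyncDispatcher, which shows how event-driven processing is actually implemented in the RM.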

In the next chapter we start on the scheduler: 源码走读-Yarn-ResourceManager04-RM调度之FairScheduler
