前言
在之前两周主要学了HDFS中的一些模块知识,其中的许多都或多或少有我们借鉴学习的地方,现在将目光转向另外一个块,被誉为MRv2,就是yarn,在Yarn中,解决了MR中JobTracker单点的问题,将此拆分成了ResourceManager和NodeManager这样的结构,在每个节点上,还会有ApplicationMaster来管理应用程序的整个生命周期,的确在Yarn中,多了许多优秀的设计,而今天,我主要分享的就是这个ApplicationMaster相关的一整套服务,他是隶属于ResoureManager的内部服务中的.了解了AM的启动机制,你将会更进一步了解Yarn的任务启动过程.
ApplicationMaster管理涉及类
ApplicationMaster管理涉及到了4大类,ApplicationMasterLauncher,AMLivelinessMonitor,ApplicationMasterService,以及ApplicationMaster自身类.下面介绍一下这些类的用途,在Yarn中,每个类都会有自己明确的功能模块的区分.
1.ApplicationMasterLauncher--姑且叫做AM启动关闭事件处理器,他既是一个服务也是一个处理器,在这个类中,只处理2类事件,launch和cleanup事件.分别对应启动应用和关闭应用的情形.
2.AMLivelinessMonitor--这个类从名字上可以看出他是监控类,监控的对象是AM存活状态的监控类,检测的方法与之前的HDFS一样,都是采用heartbeat的方式,如果有节点过期了,将会触发一次过期事件.
3.ApplicationMasterService--AM请求服务处理类.AMS存在于ResourceManager,中,服务的对象是各个节点上的ApplicationMaster,负责接收各个AM的注册请求,更新心跳包信息等.
4.ApplicationMaster--节点应用管理类,简单的说,ApplicationMaster负责管理整个应用的生命周期.
简答的描述完AM管理的相关类,下面从源码级别分析一下几个流程.
AM启动
要想让AM启动,启动的背景当然是有用户提交了新的Application的时候,之后ApplicationMasterLauncher会生成Launch事件,与对应的nodemanager通信,让其准备启动的新的AM的Container.在这里,就用到了ApplicationMasterLauncher这个类,之前在上文中已经提到,此类就处理2类事件,Launch启动和Cleanup清洗事件,先来看看这个类的基本变量设置
-
- public class ApplicationMasterLauncher extends AbstractService implements
- EventHandler<AMLauncherEvent> {
- private static final Log LOG = LogFactory.getLog(
- ApplicationMasterLauncher.class);
- private final ThreadPoolExecutor launcherPool;
- private LauncherThread launcherHandlingThread;
-
-
- private final BlockingQueue<Runnable> masterEvents
- = new LinkedBlockingQueue<Runnable>();
-
- protected final RMContext context;
-
- public ApplicationMasterLauncher(RMContext context) {
- super(ApplicationMasterLauncher.class.getName());
- this.context = context;
-
- this.launcherPool = new ThreadPoolExecutor(10, 10, 1,
- TimeUnit.HOURS, new LinkedBlockingQueue<Runnable>());
-
- this.launcherHandlingThread = new LauncherThread();
- }
还算比较简单,有一个masterEvents事件队列,还有执行线程以及所需的线程池执行环境。在RM相关的服务中,基本都是继承自AbstractService这个抽象服务类的。ApplicationMasterLauncher中主要处理2类事件,就是下面的展示的
- @Override
- public synchronized void handle(AMLauncherEvent appEvent) {
- AMLauncherEventType event = appEvent.getType();
- RMAppAttempt application = appEvent.getAppAttempt();
-
- switch (event) {
- case LAUNCH:
- launch(application);
- break;
- case CLEANUP:
- cleanup(application);
- default:
- break;
- }
- }
然后调用具体的实现方法,以启动事件launch事件为例
-
- private void launch(RMAppAttempt application) {
- Runnable launcher = createRunnableLauncher(application,
- AMLauncherEventType.LAUNCH);
-
- masterEvents.add(launcher);
- }
这些事件被加入到事件队列之后,是如何被处理的呢,通过消息队列的形式,在一个独立的线程中逐一被执行
-
- private class LauncherThread extends Thread {
-
- public LauncherThread() {
- super("ApplicationMaster Launcher");
- }
-
- @Override
- public void run() {
- while (!this.isInterrupted()) {
- Runnable toLaunch;
- try {
-
- toLaunch = masterEvents.take();
-
- launcherPool.execute(toLaunch);
- } catch (InterruptedException e) {
- LOG.warn(this.getClass().getName() + " interrupted. Returning.");
- return;
- }
- }
- }
- }
如果论到事件的具体执行方式,就要看具体AMLauch是如何执行的,AMLauch本身就是一个runnable实例。
-
-
-
-
- public class AMLauncher implements Runnable {
-
- private static final Log LOG = LogFactory.getLog(AMLauncher.class);
-
- private ContainerManagementProtocol containerMgrProxy;
-
- private final RMAppAttempt application;
- private final Configuration conf;
- private final AMLauncherEventType eventType;
- private final RMContext rmContext;
- private final Container masterContainer;
在里面主要的run方法如下,就是按照事件类型进行区分操作
- @SuppressWarnings("unchecked")
- public void run() {
-
- switch (eventType) {
- case LAUNCH:
- try {
- LOG.info("Launching master" + application.getAppAttemptId());
-
- launch();
- handler.handle(new RMAppAttemptEvent(application.getAppAttemptId(),
- RMAppAttemptEventType.LAUNCHED));
- ...
- break;
- case CLEANUP:
- try {
- LOG.info("Cleaning master " + application.getAppAttemptId());
-
- cleanup();
- ...
- break;
- default:
- LOG.warn("Received unknown event-type " + eventType + ". Ignoring.");
- break;
- }
- }
后面的launch操作会调用RPC函数与远程的NodeManager通信来启动Container。然后到了ApplicationMaster的run()启动方法,在启动方法中,会进行应用注册的方法,
- @SuppressWarnings({ "unchecked" })
- public boolean run() throws YarnException, IOException {
- LOG.info("Starting ApplicationMaster");
-
- Credentials credentials =
- UserGroupInformation.getCurrentUser().getCredentials();
- DataOutputBuffer dob = new DataOutputBuffer();
- credentials.writeTokenStorageToStream(dob);
-
- Iterator<Token<?>> iter = credentials.getAllTokens().iterator();
- while (iter.hasNext()) {
- Token<?> token = iter.next();
- if (token.getKind().equals(AMRMTokenIdentifier.KIND_NAME)) {
- iter.remove();
- }
- }
- allTokens = ByteBuffer.wrap(dob.getData(), 0, dob.getLength());
-
-
- AMRMClientAsync.CallbackHandler allocListener = new RMCallbackHandler();
- amRMClient = AMRMClientAsync.createAMRMClientAsync(1000, allocListener);
- amRMClient.init(conf);
- amRMClient.start();
- .....
-
-
-
-
- appMasterHostname = NetUtils.getHostname();
- RegisterApplicationMasterResponse response = amRMClient
- .registerApplicationMaster(appMasterHostname, appMasterRpcPort,
- appMasterTrackingUrl);
-
-
- int maxMem = response.getMaximumResourceCapability().getMemory();
- LOG.info("Max mem capabililty of resources in this cluster " + maxMem);
-
-
- if (containerMemory > maxMem) {
- LOG.info("Container memory specified above max threshold of cluster."
- + " Using max value." + ", specified=" + containerMemory + ", max="
- + maxMem);
- containerMemory = maxMem;
- }
在这个操作中,会将自己注册到AMLivelinessMonitor中,此刻开始启动心跳监控。
AMLiveLinessMonitor监控
在这里把重心从ApplicationMaster转移到AMLivelinessMonitor上,首先这是一个激活状态的监控线程,此类线程都有一个共同的父类
-
- public class AMLivelinessMonitor extends AbstractLivelinessMonitor<ApplicationAttemptId> {
在AbstractlinessMonitor中定义监控类线程的一类特征和方法
-
- public abstract class AbstractLivelinessMonitor<O> extends AbstractService {
-
- private static final Log LOG = LogFactory.getLog(AbstractLivelinessMonitor.class);
-
-
-
-
- private Thread checkerThread;
- private volatile boolean stopped;
-
- public static final int DEFAULT_EXPIRE = 5*60*1000;
-
- private int expireInterval = DEFAULT_EXPIRE;
-
- private int monitorInterval = expireInterval/3;
-
- private final Clock clock;
-
-
- private Map<O, Long> running = new HashMap<O, Long>();
心跳检测本身非常的简单,做一次通信记录检查,然后更新一下,记录时间,当一个新的节点加入监控或解除监控操作
-
- public synchronized void register(O ob) {
- running.put(ob, clock.getTime());
- }
-
-
- public synchronized void unregister(O ob) {
- running.remove(ob);
- }
每次做心跳周期检测的时候,调用下述方法
-
- public synchronized void receivedPing(O ob) {
-
- if (running.containsKey(ob)) {
- running.put(ob, clock.getTime());
- }
- }
非常简单的更新方法,O ob对象在这里因场景而异,在AM监控中,为ApplicationID应用ID。在后面的AMS和AM的交互中会看到。新的应用加入AMLivelinessMonitor监控中后,后面的主要操作就是AMS与AM之间的交互操作了。
AM与AMS
在ApplicationMaster运行之后,会周期性的向ApplicationMasterService发送心跳信息,心跳信息包含有许多资源描述信息。
-
- @Override
- public AllocateResponse allocate(AllocateRequest request)
- throws YarnException, IOException {
-
- ApplicationAttemptId appAttemptId = authorizeRequest();
-
- this.amLivelinessMonitor.receivedPing(appAttemptId);
- ....
每次心跳信息一来,就会更新最新监控时间。在AMS也有对应的注册应用的方法
-
- @Override
- public RegisterApplicationMasterResponse registerApplicationMaster(
- RegisterApplicationMasterRequest request) throws YarnException,
- IOException {
-
- ApplicationAttemptId applicationAttemptId = authorizeRequest();
-
- ApplicationId appID = applicationAttemptId.getApplicationId();
- .....
-
-
- this.amLivelinessMonitor.receivedPing(applicationAttemptId);
- RMApp app = this.rmContext.getRMApps().get(appID);
-
-
-
- lastResponse.setResponseId(0);
- responseMap.put(applicationAttemptId, lastResponse);
- LOG.info("AM registration " + applicationAttemptId);
- this.rmContext
如果在心跳监控中出现过期的现象,就会触发一个expire事件,在AMLiveLinessMonitor中,这部分的工作是交给CheckThread执行的
-
- public abstract class AbstractLivelinessMonitor<O> extends AbstractService {
- ...
-
-
-
- private Thread checkerThread;
- ....
-
- public static final int DEFAULT_EXPIRE = 5*60*1000;
-
- private int expireInterval = DEFAULT_EXPIRE;
-
- private int monitorInterval = expireInterval/3;
- ....
-
- private Map<O, Long> running = new HashMap<O, Long>();
- ...
-
- private class PingChecker implements Runnable {
-
- @Override
- public void run() {
- while (!stopped && !Thread.currentThread().isInterrupted()) {
- synchronized (AbstractLivelinessMonitor.this) {
- Iterator<Map.Entry<O, Long>> iterator =
- running.entrySet().iterator();
-
-
- long currentTime = clock.getTime();
-
- while (iterator.hasNext()) {
- Map.Entry<O, Long> entry = iterator.next();
-
- if (currentTime > entry.getValue() + expireInterval) {
- iterator.remove();
-
- expire(entry.getKey());
- LOG.info("Expired:" + entry.getKey().toString() +
- " Timed out after " + expireInterval/1000 + " secs");
- }
- }
- }
check线程主要做的事件就是遍历每个节点的最新心跳更新时间,通过计算差值进行判断是否过期,过期调用expire方法。此方法由其子类实现
-
- public class AMLivelinessMonitor extends AbstractLivelinessMonitor<ApplicationAttemptId> {
-
- private EventHandler dispatcher;
- ...
-
- @Override
- protected void expire(ApplicationAttemptId id) {
-
- dispatcher.handle(
- new RMAppAttemptEvent(id, RMAppAttemptEventType.EXPIRE));
- }
- }
产生应用超期事件,然后发给中央调度器去处理。之所以采用的这样的方式,是因为在RM中,所有的模块设计是以事件驱动的形式工作,最大程度的保证了各个模块间的解耦。不同模块通过不同的事件转变为不同的状态,可以理解为状态机的改变。最后用一张书中的截图简单的展示AM模块相关的调用过程。
全部代码的分析请点击链接https://github.com/linyiqun/hadoop-yarn,后续将会继续更新YARN其他方面的代码分析。
参考文献
《Hadoop技术内部–HDFS结构设计与实现原理》.蔡斌等