YARN 3.2 Source Code Analysis: The Resource Localization Mechanism of startContainer on the NM Side

Overview

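Starting a container goes through three main phases: resource localization, launching and running the container, and resource cleanup. Resource localization creates the container's working directory and downloads the resources the container needs to run (jar files, executables, and so on) from HDFS. Resource cleanup is the reverse of localization and removes those resources. The components involved in resource localization are defined as follows: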

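LocalResource: a LocalResource represents a file, jar, or other resource required to run a container. The NodeManager localizes the resources a container needs before starting it. A LocalResource has the following attributes:

  • url: the remote location from which the resource is downloaded to the local node;
  • size: the size of the LocalResource in bytes;
  • timestamp: the original timestamp of the resource at the remote location, used to verify the resource during localization;
  • LocalResourceType: the type of the LocalResource, one of FILE, ARCHIVE, or PATTERN;
  • pattern: used only when the type is PATTERN; the pattern that selects which entries to extract from the archive;
  • LocalResourceVisibility: the visibility of the LocalResource, one of PUBLIC, PRIVATE, or APPLICATION;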

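ResourceLocalizationService: the NodeManager service responsible for resource localization.

DeletionService: the NodeManager service responsible for cleaning up local resources. ResourceLocalizationService uses it to delete resources.

Localizer: the thread or process that actually performs localization. A resource whose visibility is public is downloaded asynchronously by the thread pool inside the PublicLocalizer thread; a resource whose visibility is private or application is downloaded by a ContainerLocalizer process.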
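To make the attribute list and the visibility-based choice of localizer concrete, here is a minimal sketch. The classes ResourceModel, ResourceType and Visibility are hypothetical stand-ins written for this article, not YARN's own LocalResource API; only the attribute names and the public versus private/application split come from the text above.

```java
// Hypothetical model for illustration; these are not YARN's own classes.
enum ResourceType { FILE, ARCHIVE, PATTERN }

enum Visibility { PUBLIC, PRIVATE, APPLICATION }

final class ResourceModel {
    final String url;            // remote location to download from
    final long size;             // size in bytes
    final long timestamp;        // timestamp recorded for the resource
    final ResourceType type;
    final String pattern;        // only meaningful when type == PATTERN
    final Visibility visibility;

    ResourceModel(String url, long size, long timestamp,
                  ResourceType type, String pattern, Visibility visibility) {
        if (type == ResourceType.PATTERN && pattern == null) {
            throw new IllegalArgumentException("PATTERN resources need a pattern");
        }
        this.url = url;
        this.size = size;
        this.timestamp = timestamp;
        this.type = type;
        this.pattern = pattern;
        this.visibility = visibility;
    }

    /** Public resources use the in-NM thread pool; all others use a separate process. */
    String localizerKind() {
        return visibility == Visibility.PUBLIC
            ? "PublicLocalizer thread pool"
            : "ContainerLocalizer process";
    }
}
```

The single branch in localizerKind() is the whole decision: visibility alone, not the resource type, determines which localizer does the download.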

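Localizing public resources

Public resources are localized by the PublicLocalizer thread. It maintains a fixed-size thread pool internally; the pool size is set by the configuration parameter yarn.nodemanager.localizer.fetch.thread-count (default 4), which caps the number of public resources downloaded in parallel.

Before a public resource is downloaded asynchronously through the thread pool, its permissions are checked first (checkPublicPermsForAll): on the remote file system, the resource's directory is checked recursively to confirm that every file under it is readable by all users. If not, an IOException is thrown, indicating that the resource is not a public resource and cannot be downloaded.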
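The recursive permission check can be sketched as follows. This is a simplified in-memory model, not Hadoop's FSDownload code: RemoteNode and PublicCheck are hypothetical names, and the rule that directories must be both readable and searchable by others is an assumption added to make the sketch complete.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for a file or directory on the remote file system. */
final class RemoteNode {
    final String name;
    final boolean directory;
    final boolean othersRead;
    final boolean othersExecute;
    final List<RemoteNode> children = new ArrayList<>();

    RemoteNode(String name, boolean directory, boolean othersRead, boolean othersExecute) {
        this.name = name;
        this.directory = directory;
        this.othersRead = othersRead;
        this.othersExecute = othersExecute;
    }

    RemoteNode add(RemoteNode child) {
        children.add(child);
        return this;
    }
}

final class PublicCheck {
    /** True only if every file is world-readable and every directory is world-readable and searchable. */
    static boolean readableByAll(RemoteNode node) {
        if (!node.directory) {
            return node.othersRead;
        }
        if (!node.othersRead || !node.othersExecute) {
            return false;
        }
        for (RemoteNode child : node.children) {
            if (!readableByAll(child)) {
                return false;
            }
        }
        return true;
    }

    /** Rejects a resource that is not public, mirroring the IOException described above. */
    static void verifyPublic(RemoteNode root) throws IOException {
        if (!readableByAll(root)) {
            throw new IOException("Resource " + root.name + " is not publicly accessible");
        }
    }
}
```

A single unreadable file anywhere under the directory is enough to reject the whole resource.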

Localizing private/application resources

Private and application resources are downloaded by a separate process, ContainerLocalizer.

Each ContainerLocalizer process is started by a LocalizerRunner thread in the NodeManager.

In ResourceLocalizationService, every user and every application has a corresponding LocalResourcesTracker.

  • the LocalResourcesTracker mapped from a user tracks private resources;
  • the LocalResourcesTracker mapped from an appId tracks application resources;

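Once started, the ContainerLocalizer process exchanges heartbeats with the ResourceLocalizationService inside the NodeManager process. ResourceLocalizationService implements LocalizationProtocol, and ContainerLocalizer creates a remote proxy for LocalizationProtocol through its getProxy() method.

On each heartbeat, the LocalizerRunner either assigns a resource to the ContainerLocalizer process it started or tells that process to die. The ContainerLocalizer process, in turn, reports download progress to the LocalizerRunner on each heartbeat.

If a download fails, the resource is removed from the LocalResourcesTracker and the container is marked as failed. The LocalizerRunner also stops the running ContainerLocalizer process and exits.

If a download succeeds, the LocalizerRunner assigns the next pending resource to its ContainerLocalizer process, and so on until all pending resources have been downloaded.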
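The heartbeat exchange described above can be modelled in a few lines. This is a simplified sketch, not the NodeManager's implementation: RunnerModel, HeartbeatResponse, Action and FetchStatus are hypothetical names, resources are plain strings, and each heartbeat reports at most one resource.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

enum Action { LIVE, DIE }

enum FetchStatus { FETCH_SUCCESS, FETCH_PENDING, FETCH_FAILURE }

final class HeartbeatResponse {
    final Action action;
    final String nextResource; // null when nothing new is assigned

    HeartbeatResponse(Action action, String nextResource) {
        this.action = action;
        this.nextResource = nextResource;
    }
}

/** Hypothetical stand-in for the runner side of the heartbeat protocol. */
final class RunnerModel {
    private final Deque<String> pending = new ArrayDeque<>();
    private final Set<String> scheduled = new HashSet<>();

    void addResource(String resource) {
        pending.add(resource);
    }

    /** Processes one reported status (may be null) and answers LIVE or DIE. */
    HeartbeatResponse processHeartbeat(String reported, FetchStatus status) {
        if (reported != null) {
            if (status == FetchStatus.FETCH_FAILURE) {
                scheduled.remove(reported);
                return new HeartbeatResponse(Action.DIE, null); // one failure ends localization
            }
            if (status == FetchStatus.FETCH_SUCCESS) {
                scheduled.remove(reported);
            }
        }
        String next = pending.poll(); // hand out at most one new resource per heartbeat
        if (next != null) {
            scheduled.add(next);
        }
        return new HeartbeatResponse(Action.LIVE, next);
    }
}
```

The first heartbeat carries no status and simply receives the first pending resource; every later heartbeat both reports and receives.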

Where LocalResources are stored

Successfully downloaded resources end up in the following directories:

  • PUBLIC: /filecache
  • PRIVATE: /usercache//filecache
  • APPLICATION: /usercache//appcache//
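As a rough illustration of this layout, the helper below builds the three directories. It rests on assumptions that the list above does not spell out: that the paths are relative to one of the NodeManager local directories, and that the empty path segments stand for the user name and the application ID, as the `$x/$user/appcache/$appid` usage comment in ContainerLocalizer#main() suggests. CacheLayout is a hypothetical name.

```java
/** Hypothetical helper; the user and appId segments are an assumption (see lead-in). */
final class CacheLayout {
    static String publicDir(String localDir) {
        return localDir + "/filecache";
    }

    static String privateDir(String localDir, String user) {
        return localDir + "/usercache/" + user + "/filecache";
    }

    static String applicationDir(String localDir, String user, String appId) {
        return localDir + "/usercache/" + user + "/appcache/" + appId;
    }
}
```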

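NM-side configuration for resource localization

yarn.nodemanager.local-dirs: the local directories in which LocalResources are stored. Multiple directories may be given, separated by commas.

yarn.nodemanager.localizer.client.thread-count: the number of RPC threads in ResourceLocalizationService that handle requests from Localizers; the default is 5.

yarn.nodemanager.localizer.fetch.thread-count: the number of threads PublicLocalizer uses to download public resources; the default is 4.

yarn.nodemanager.localizer.cache.target-size-mb: the maximum disk space, in MB, used for downloaded LocalResources (default 10240). Once the cache exceeds this limit, DeletionService tries to delete files that are no longer referenced by any running container.

yarn.nodemanager.localizer.cache.cleanup.interval-ms: the interval between cache cleanups (default 600000, i.e. 10 minutes). When the cache is over its size limit and this interval has elapsed, DeletionService starts deleting files that are no longer referenced by any running container.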
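The cleanup rule (delete only unreferenced files, and only while the cache is over its target size) can be sketched like this. It is a simplified model under stated assumptions: CachedResource and CacheCleanup are hypothetical names, and candidates are visited in list order, whereas the real NodeManager may order them differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical record of one cached resource. */
final class CachedResource {
    final String path;
    final long sizeBytes;
    final int refCount; // number of running containers still using the file

    CachedResource(String path, long sizeBytes, int refCount) {
        this.path = path;
        this.sizeBytes = sizeBytes;
        this.refCount = refCount;
    }
}

final class CacheCleanup {
    /** Picks unreferenced resources to delete until the cache fits within targetBytes. */
    static List<String> selectForDeletion(List<CachedResource> cache, long targetBytes) {
        long total = 0;
        for (CachedResource r : cache) {
            total += r.sizeBytes;
        }
        List<String> toDelete = new ArrayList<>();
        for (CachedResource r : cache) {
            if (total <= targetBytes) {
                break; // already within the target size
            }
            if (r.refCount == 0) { // never delete a file a running container still references
                toDelete.add(r.path);
                total -= r.sizeBytes;
            }
        }
        return toDelete;
    }
}
```

Note that the cache can stay above the target if every remaining file is still referenced; the limit is a target, not a hard cap.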

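ResourceLocalizationService

ResourceLocalizationService keeps the LocalResourcesTracker for each user and for each application.

  • the LocalResourcesTracker mapped from a user tracks private resources;
  • the LocalResourcesTracker mapped from an appId tracks application resources;
  • the publicRsrc member variable tracks public resources;

ResourceLocalizationService is the event handler for LocalizationEvent. When it handles a LocalizationEvent of type LOCALIZE_CONTAINER_RESOURCES, it looks up the matching LocalResourcesTracker from the visibility of the requested resource, the user, and the appId. Each LocalResourcesTracker tracks all resources that share the same visibility (and, for private and application resources, the same user or application).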

  LocalResourcesTracker publicRsrc;

  /**
   * Map of LocalResourceTrackers keyed by username, for private
   * resources.
   */
  @VisibleForTesting
  final ConcurrentMap<String, LocalResourcesTracker> privateRsrc =
    new ConcurrentHashMap<String, LocalResourcesTracker>();

  /**
   * Map of LocalResourceTrackers keyed by appid, for application
   * resources.
   */
  private final ConcurrentMap<String, LocalResourcesTracker> appRsrc =
    new ConcurrentHashMap<String, LocalResourcesTracker>();
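The lookup over these three fields can be sketched as below. This is a hypothetical, self-contained model rather than the NodeManager source: trackers are represented as plain strings, and TrackerRegistry and RsrcVisibility are names invented for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

enum RsrcVisibility { PUBLIC, PRIVATE, APPLICATION }

/** Hypothetical stand-in: trackers are modelled as plain strings. */
final class TrackerRegistry {
    final String publicRsrc = "public-tracker";
    final ConcurrentMap<String, String> privateRsrc = new ConcurrentHashMap<>();
    final ConcurrentMap<String, String> appRsrc = new ConcurrentHashMap<>();

    /** Returns the tracker for the given visibility, or null if none is registered. */
    String getTracker(RsrcVisibility visibility, String user, String appId) {
        switch (visibility) {
            case PRIVATE:
                return privateRsrc.get(user);
            case APPLICATION:
                return appRsrc.get(appId);
            case PUBLIC:
            default:
                return publicRsrc;
        }
    }
}
```

A null result is a legitimate outcome, for example when an application's tracker has already been cleaned up, so callers need to handle it.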

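LocalizerTracker

LocalizerTracker manages and tracks all Localizers:

  • it holds a reference to the PublicLocalizer thread, which downloads public resources;
  • it holds a map from localizerId to LocalizerRunner; LocalizerRunners download private/application resources;

LocalizerTracker is also the event handler for LocalizerEvent. When it handles a LocalizerEvent of type REQUEST_RESOURCE_LOCALIZATION, it localizes the resource according to its visibility, of which there are three: public, private, and application. Public resources are localized by PublicLocalizer, which downloads them asynchronously through a thread pool. For private/application resources, a LocalizerRunner starts a separate ContainerLocalizer process that performs the download.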

  /**
   * Sub-component handling the spawning of {@link ContainerLocalizer}s
   */
  class LocalizerTracker extends AbstractService implements EventHandler<LocalizerEvent>  {
    // LocalizerTracker tracks all Localizers.
    // It holds a reference to the PublicLocalizer thread, which downloads public resources.
    // It holds a map from localizerId to LocalizerRunner; LocalizerRunners download private/application resources.
    private final PublicLocalizer publicLocalizer;
    private final Map<String, LocalizerRunner> privLocalizers;

    LocalizerTracker(Configuration conf) {
      this(conf, new HashMap<String, LocalizerRunner>());
    }

    LocalizerTracker(Configuration conf,
        Map<String, LocalizerRunner> privLocalizers) {
      super(LocalizerTracker.class.getName());
      this.publicLocalizer = new PublicLocalizer(conf);
      this.privLocalizers = privLocalizers;
    }
    
    @Override
    public synchronized void serviceStart() throws Exception {
      // Start the PublicLocalizer thread
      publicLocalizer.start();
      super.serviceStart();
    }

    public LocalizerHeartbeatResponse processHeartbeat(LocalizerStatus status) {
      String locId = status.getLocalizerId();
      synchronized (privLocalizers) {
        LocalizerRunner localizer = privLocalizers.get(locId);
        if (null == localizer) {
          // TODO process resources anyway
          LOG.info("Unknown localizer with localizerId " + locId
              + " is sending heartbeat. Ordering it to DIE");
          // A heartbeat from an unknown localizerId: reply with a response telling the Localizer to die
          LocalizerHeartbeatResponse response =
            recordFactory.newRecordInstance(LocalizerHeartbeatResponse.class);
          response.setLocalizerAction(LocalizerAction.DIE);
          return response;
        }
        // Otherwise the matching LocalizerRunner processes the resources reported in the heartbeat
        return localizer.processHeartbeat(status.getResources());
      }
    }
    
    @Override
    public void serviceStop() throws Exception {
      for (LocalizerRunner localizer : privLocalizers.values()) {
        localizer.interrupt();
      }
      publicLocalizer.interrupt();
      super.serviceStop();
    }

    @Override
    public void handle(LocalizerEvent event) {
      String locId = event.getLocalizerId();
      switch (event.getType()) {
      case REQUEST_RESOURCE_LOCALIZATION:
        // 0) find running localizer or start new thread
        LocalizerResourceRequestEvent req =
          (LocalizerResourceRequestEvent)event;
        switch (req.getVisibility()) {
        case PUBLIC:
          // Public resources are localized by PublicLocalizer, which submits the request to its thread pool for asynchronous download
          publicLocalizer.addResource(req);
          break;
        case PRIVATE:
        case APPLICATION:
          // For private and application resources, a LocalizerRunner starts a separate ContainerLocalizer process to do the download
          synchronized (privLocalizers) {
            LocalizerRunner localizer = privLocalizers.get(locId);
            if (localizer != null && localizer.killContainerLocalizer.get()) {
              // Old localizer thread has been stopped, remove it and creates
              // a new localizer thread.
              LOG.info("New " + event.getType() + " localize request for "
                  + locId + ", remove old private localizer.");
              // A LocalizerRunner exists for this localizerId but its ContainerLocalizer has been marked to be killed: remove it
              cleanupPrivLocalizers(locId);
              localizer = null;
            }
            if (null == localizer) {
              // No LocalizerRunner exists for this localizerId: create a new LocalizerRunner thread and start it
              LOG.info("Created localizer for " + locId);
              localizer = new LocalizerRunner(req.getContext(), locId);
              privLocalizers.put(locId, localizer);
              // Start the LocalizerRunner thread
              localizer.start();
            }
            // 1) propagate event
            // The LocalizerRunner thread adds the request to its internal pending list
            localizer.addResource(req);
          }
          break;
        }
        break;
      }
    }

    public void cleanupPrivLocalizers(String locId) {
      synchronized (privLocalizers) {
        LocalizerRunner localizer = privLocalizers.get(locId);
        if (null == localizer) {
          return; // ignore; already gone
        }
        privLocalizers.remove(locId);
        localizer.interrupt();
      }
    }

    public void endContainerLocalization(String locId) {
      LocalizerRunner localizer;
      synchronized (privLocalizers) {
        localizer = privLocalizers.get(locId);
        if (null == localizer) {
          return; // ignore
        }
      }
      localizer.endContainerLocalization();
    }
  }

LocalizerRunner

The run() method starts the ContainerLocalizer process

public void run() {
      Path nmPrivateCTokensPath = null;
      Throwable exception = null;
      try {
        // Get nmPrivateDir
        nmPrivateCTokensPath =
          dirsHandler.getLocalPathForWrite(
                NM_PRIVATE_DIR + Path.SEPARATOR
                    + String.format(ContainerLocalizer.TOKEN_FILE_NAME_FMT,
                        localizerId));

        // 0) init queue, etc.
        // 1) write credentials to private dir
        writeCredentials(nmPrivateCTokensPath);
        // 2) exec initApplication and wait
        if (dirsHandler.areDisksHealthy()) {
          // Call LinuxContainerExecutor#startLocalizer(), which starts the ContainerLocalizer process that downloads the resources
          exec.startLocalizer(new LocalizerStartContext.Builder()
              .setNmPrivateContainerTokens(nmPrivateCTokensPath)
              .setNmAddr(localizationServerAddress)
              .setUser(context.getUser())
              .setAppId(context.getContainerId()
                  .getApplicationAttemptId().getApplicationId().toString())
              .setLocId(localizerId)
              .setDirsHandler(dirsHandler)
              .build());
        } else {
          throw new IOException("All disks failed. "
              + dirsHandler.getDisksHealthReport(false));
        }
      // TODO handle ExitCodeException separately?
      } catch (FSError fe) {
        exception = fe;
      } catch (Exception e) {
        exception = e;
      } finally {
        if (exception != null) {
          LOG.info("Localizer failed for "+localizerId, exception);
          // On error, report failure to Container and signal ABORT
          // Notify resource of failed localization
          ContainerId cId = context.getContainerId();
          dispatcher.getEventHandler().handle(new ContainerResourceFailedEvent(
              cId, null, exception.getMessage()));
        }
        List<Path> paths = new ArrayList<Path>();
        for (LocalizerResourceRequestEvent event : scheduled.values()) {
          // This means some resources were in downloading state. Schedule
          // deletion task for localization dir and tmp dir used for downloading
          Path locRsrcPath = event.getResource().getLocalPath();
          if (locRsrcPath != null) {
            Path locRsrcDirPath = locRsrcPath.getParent();
            paths.add(locRsrcDirPath);
            paths.add(new Path(locRsrcDirPath + "_tmp"));
          }
          event.getResource().unlock();
        }
        if (!paths.isEmpty()) {
          FileDeletionTask deletionTask = new FileDeletionTask(delService,
              context.getUser(), null, paths);
          delService.delete(deletionTask);
        }
        FileDeletionTask deletionTask = new FileDeletionTask(delService, null,
            nmPrivateCTokensPath, null);
        delService.delete(deletionTask);
      }
    }

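processHeartbeat(): heartbeat communication with the ContainerLocalizer process

It processes the resources that the heartbeat reports as downloaded, then assigns the next pending resource to the ContainerLocalizer process.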

    LocalizerHeartbeatResponse processHeartbeat(
        List<LocalResourceStatus> remoteResourceStatuses) {
      LocalizerHeartbeatResponse response =
        recordFactory.newRecordInstance(LocalizerHeartbeatResponse.class);
      String user = context.getUser();
      ApplicationId applicationId =
          context.getContainerId().getApplicationAttemptId().getApplicationId();

      boolean fetchFailed = false;
      // Update resource statuses.
      for (LocalResourceStatus stat : remoteResourceStatuses) {
        LocalResource rsrc = stat.getResource();
        LocalResourceRequest req = null;
        try {
          req = new LocalResourceRequest(rsrc);
        } catch (URISyntaxException e) {
          LOG.error(
              "Got exception in parsing URL of LocalResource:"
                  + rsrc.getResource(), e);
          continue;
        }
        LocalizerResourceRequestEvent assoc = scheduled.get(req);
        if (assoc == null) {
          // internal error
          LOG.error("Unknown resource reported: " + req);
          continue;
        }
        LocalResourcesTracker tracker =
            getLocalResourcesTracker(req.getVisibility(), user, applicationId);
        if (tracker == null) {
          // This is likely due to a race between heartbeat and
          // app cleaning up.
          continue;
        }
        switch (stat.getStatus()) {
          case FETCH_SUCCESS:
            // notify resource
            try {
              // On success, notify the matching LocalResourcesTracker, which handles a ResourceEvent of type LOCALIZED
              tracker.handle(new ResourceLocalizedEvent(req,
                  stat.getLocalPath().toPath(), stat.getLocalSize()));
            } catch (URISyntaxException e) { }

            // unlocking the resource and removing it from scheduled resource
            // list
            assoc.getResource().unlock();
            scheduled.remove(req);
            break;
          case FETCH_PENDING:
            break;
          case FETCH_FAILURE:
            // On failure, notify the matching LocalResourcesTracker, which handles a ResourceEvent of type LOCALIZATION_FAILED
            final String diagnostics = stat.getException().toString();
            LOG.warn(req + " failed: " + diagnostics);
            fetchFailed = true;
            tracker.handle(new ResourceFailedLocalizationEvent(req,
                diagnostics));

            // unlocking the resource and removing it from scheduled resource
            // list
            assoc.getResource().unlock();
            scheduled.remove(req);
            break;
          default:
            LOG.info("Unknown status: " + stat.getStatus());
            fetchFailed = true;
            tracker.handle(new ResourceFailedLocalizationEvent(req,
                stat.getException().getMessage()));
            break;
        }
      }
      if (fetchFailed || killContainerLocalizer.get()) {
        response.setLocalizerAction(LocalizerAction.DIE);
        return response;
      }

      // Give the localizer resources for remote-fetching.
      List<ResourceLocalizationSpec> rsrcs =
          new ArrayList<ResourceLocalizationSpec>();

      /*
       * TODO : It doesn't support multiple downloads per ContainerLocalizer
       * at the same time. We need to think whether we should support this.
       */
      // LocalizerRunner assigns the next pending resource to the ContainerLocalizer.
      // A ContainerLocalizer currently cannot download several resources in parallel.
      LocalResource next = findNextResource();
      if (next != null) {
        try {
          // Find the matching LocalResourcesTracker from the resource's visibility, the user name, and the appId
          LocalResourcesTracker tracker = getLocalResourcesTracker(
              next.getVisibility(), user, applicationId);
          if (tracker != null) {
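            // Get the resource's local storage directory from its visibility, the user name, and the appId.
            // Local directory prefix for a private resource: /usercache//filecache
            // Local directory prefix for an application resource: /usercache//appcache//
            // Local directory suffix: hierarchicalPath/AtomicLongNumber/req.getPath()
            // The suffix is produced by LocalResourcesTracker#getPathForLocalization()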
            Path localPath = getPathForLocalization(next, tracker);
            if (localPath != null) {
              rsrcs.add(NodeManagerBuilderUtils.newResourceLocalizationSpec(
                  next, localPath));
            }
          }
        } catch (IOException e) {
          LOG.error("local path for PRIVATE localization could not be " +
            "found. Disks might have failed.", e);
        } catch (IllegalArgumentException e) {
          LOG.error("Incorrect path for PRIVATE localization."
              + next.getResource().getFile(), e);
        } catch (URISyntaxException e) {
          LOG.error(
              "Got exception in parsing URL of LocalResource:"
                  + next.getResource(), e);
        }
      }

      response.setLocalizerAction(LocalizerAction.LIVE);
      response.setResourceSpecs(rsrcs);
      return response;
    }

The ContainerLocalizer process

LinuxContainerExecutor#startLocalizer() starts the ContainerLocalizer process that downloads resources

public void startLocalizer(LocalizerStartContext ctx)
      throws IOException, InterruptedException {
    Path nmPrivateContainerTokensPath = ctx.getNmPrivateContainerTokens();
    InetSocketAddress nmAddr = ctx.getNmAddr();
    String user = ctx.getUser();
    String appId = ctx.getAppId();
    String locId = ctx.getLocId();
    LocalDirsHandlerService dirsHandler = ctx.getDirsHandler();
    List<String> localDirs = dirsHandler.getLocalDirs();
    List<String> logDirs = dirsHandler.getLogDirs();

    verifyUsernamePattern(user);
    String runAsUser = getRunAsUser(user);
    PrivilegedOperation initializeContainerOp = new PrivilegedOperation(
        PrivilegedOperation.OperationType.INITIALIZE_CONTAINER);
    List<String> prefixCommands = new ArrayList<>();

    addSchedPriorityCommand(prefixCommands);
    initializeContainerOp.appendArgs(
        runAsUser,
        user,
        Integer.toString(
            PrivilegedOperation.RunAsUserCommand.INITIALIZE_CONTAINER
                .getValue()),
        appId,
        locId,
        nmPrivateContainerTokensPath.toUri().getPath().toString(),
        StringUtils.join(PrivilegedOperation.LINUX_FILE_PATH_SEPARATOR,
            localDirs),
        StringUtils.join(PrivilegedOperation.LINUX_FILE_PATH_SEPARATOR,
            logDirs));

    File jvm =                                  // use same jvm as parent
        new File(new File(System.getProperty("java.home"), "bin"), "java");
    // Append the java command located under the JDK installation directory
    initializeContainerOp.appendArgs(jvm.toString());
    // Append the java command's arguments, such as the classpath
    initializeContainerOp.appendArgs("-classpath");
    initializeContainerOp.appendArgs(System.getProperty("java.class.path"));
    String javaLibPath = System.getProperty("java.library.path");
    if (javaLibPath != null) {
      initializeContainerOp.appendArgs("-Djava.library.path=" + javaLibPath);
    }

    initializeContainerOp.appendArgs(ContainerLocalizer.getJavaOpts(getConf()));

    List<String> localizerArgs = new ArrayList<>();
    // Build the ContainerLocalizer class name and its main() arguments into localizerArgs
    buildMainArgs(localizerArgs, user, appId, locId, nmAddr, localDirs);

    Path containerLogDir = getContainerLogDir(dirsHandler, appId, locId);
    localizerArgs = replaceWithContainerLogDir(localizerArgs, containerLogDir);

    initializeContainerOp.appendArgs(localizerArgs);

    try {
      Configuration conf = super.getConf();
      PrivilegedOperationExecutor privilegedOperationExecutor =
          getPrivilegedOperationExecutor();
      // Run ContainerLocalizer's main() as an external command, starting the ContainerLocalizer process that downloads resources
      privilegedOperationExecutor.executePrivilegedOperation(prefixCommands,
          initializeContainerOp, null, null, false, true);

    } catch (PrivilegedOperationException e) {
      int exitCode = e.getExitCode();
      LOG.warn("Exit code from container " + locId + " startLocalizer is : "
          + exitCode, e);

      throw new IOException("Application " + appId + " initialization failed" +
          " (exitCode=" + exitCode + ") with output: " + e.getOutput(), e);
    }
  }
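Stripped of the privileged-operation wrapper, the command that startLocalizer() assembles is an ordinary JVM launch line. The sketch below rebuilds just that part with plain Java. LocalizerCommand is a hypothetical helper; the argument order (user, appId, locId, host, port, then the local dirs) follows what ContainerLocalizer#main() parses, while the run-as-user prefix, token file, log dirs and extra JVM options are deliberately omitted.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles only the JVM part of the localizer command. */
final class LocalizerCommand {
    static final String MAIN_CLASS =
        "org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer";

    static List<String> build(String user, String appId, String locId,
                              String host, int port, List<String> localDirs) {
        List<String> cmd = new ArrayList<>();
        // Use the same JVM as the parent process.
        File jvm = new File(new File(System.getProperty("java.home"), "bin"), "java");
        cmd.add(jvm.toString());
        cmd.add("-classpath");
        cmd.add(System.getProperty("java.class.path"));
        String libPath = System.getProperty("java.library.path");
        if (libPath != null) {
            cmd.add("-Djava.library.path=" + libPath);
        }
        // Main class followed by the arguments that main() reads by position.
        cmd.add(MAIN_CLASS);
        cmd.add(user);
        cmd.add(appId);
        cmd.add(locId);
        cmd.add(host);
        cmd.add(Integer.toString(port));
        cmd.addAll(localDirs);
        return cmd;
    }
}
```

Because main() reads its arguments by position, the order of the trailing arguments matters more than anything else in this command.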

The ContainerLocalizer#main() method

public static void main(String[] argv) throws Throwable {
    Thread.setDefaultUncaughtExceptionHandler(new YarnUncaughtExceptionHandler());
    int nRet = 0;
    // usage: $0 user appId locId host port app_log_dir user_dir [user_dir]*
    // let $x = $x/usercache for $local.dir
    // MKDIR $x/$user/appcache/$appid
    // MKDIR $x/$user/appcache/$appid/output
    // MKDIR $x/$user/appcache/$appid/filecache
    // LOAD $x/$user/appcache/$appid/appTokens
    try {
      String user = argv[0];
      String appId = argv[1];
      String locId = argv[2];
      InetSocketAddress nmAddr =
          new InetSocketAddress(argv[3], Integer.parseInt(argv[4]));
      String[] sLocaldirs = Arrays.copyOfRange(argv, 5, argv.length);
      ArrayList<Path> localDirs = new ArrayList<>(sLocaldirs.length);
      for (String sLocaldir : sLocaldirs) {
        localDirs.add(new Path(sLocaldir));
      }

      final String uid =
          UserGroupInformation.getCurrentUser().getShortUserName();
      if (!user.equals(uid)) {
        // TODO: fail localization
        LOG.warn("Localization running as " + uid + " not " + user);
      }

      ContainerLocalizer localizer =
          new ContainerLocalizer(FileContext.getLocalFSFileContext(), user,
              appId, locId, localDirs,
              RecordFactoryProvider.getRecordFactory(null));
      // The core of runLocalization() is the call to ContainerLocalizer#localizeFiles()
      localizer.runLocalization(nmAddr);
      return;
    } catch (Throwable e) {
      // Print traces to stdout so that they can be logged by the NM address
      // space in both DefaultCE and LCE cases
      e.printStackTrace(System.out);
      LOG.error("Exception in main:", e);
      nRet = -1;
    } finally {
      System.exit(nRet);
    }
  }

The ContainerLocalizer#localizeFiles() method

protected void localizeFiles(LocalizationProtocol nodemanager,
      CompletionService<Path> cs, UserGroupInformation ugi)
      throws IOException, YarnException {
    while (true) {
      try {
        LocalizerStatus status = createStatus();
        LocalizerHeartbeatResponse response = nodemanager.heartbeat(status);
        switch (response.getLocalizerAction()) {
        case LIVE:
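          List<ResourceLocalizationSpec> newRsrcs = response.getResourceSpecs();
          for (ResourceLocalizationSpec newRsrc : newRsrcs) {
            if (!pendingResources.containsKey(newRsrc.getResource())) {
              // Submit the download task to the thread pool; the resulting future is tracked by the
              // CompletionService and stored in the Map<LocalResource, Future<Path>> pendingResources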
              pendingResources.put(newRsrc.getResource(), cs.submit(download(
                new Path(newRsrc.getDestinationDirectory().getFile()),
                newRsrc.getResource(), ugi)));
            }
          }
          break;
        case DIE:
          // killall running localizations
          for (Future<Path> pending : pendingResources.values()) {
            pending.cancel(true);
          }
          status = createStatus();
          // ignore response while dying.
          try {
            nodemanager.heartbeat(status);
          } catch (YarnException e) {
            // Cannot do anything about this during death stage, let's just log
            // it.
            e.printStackTrace(System.out);
            LOG.error("Heartbeat failed while dying: ", e);
          }
          return;
        }
        // Block for up to the given time waiting for the next completed future from the ExecutorCompletionService
        cs.poll(1000, TimeUnit.MILLISECONDS);
      } catch (InterruptedException e) {
        return;
      } catch (YarnException e) {
        // TODO cleanup
        throw e;
      }
    }
  }
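The submit-then-poll pattern at the heart of localizeFiles() can be tried in isolation with the JDK's ExecutorCompletionService. The sketch below is a simplified stand-in, not the ContainerLocalizer code: DownloadLoop is a hypothetical class, the "download" just returns a string, and the heartbeat that the real loop sends on every iteration is reduced to a comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the download loop; downloads are simulated. */
final class DownloadLoop {
    static List<String> runAll(List<String> resources, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CompletionService<String> cs = new ExecutorCompletionService<>(pool);
        try {
            for (String resource : resources) {
                // Each task stands in for one resource download.
                cs.submit(() -> "localized:" + resource);
            }
            List<String> done = new ArrayList<>();
            while (done.size() < resources.size()) {
                // A real localizer would send a heartbeat here before polling.
                Future<String> finished = cs.poll(1000, TimeUnit.MILLISECONDS);
                if (finished != null) {
                    done.add(finished.get());
                }
            }
            return done;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The timed poll() is what paces the loop: it returns early when a download finishes and otherwise wakes up after the timeout so the next heartbeat can go out.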

To be continued.
