The relaxLocality Attribute in Hadoop YARN Resource Scheduling

In YARN, a resource request mainly refers to the container request that an ApplicationMaster sends to the ResourceManager through AMRMClientImpl. AMRMClientImpl does no special handling of the relaxLocality attribute when it sends a request, and on the ResourceManager side only FairScheduler honors relaxLocality; FifoScheduler and CapacityScheduler ignore it. The discussion below therefore assumes the FairScheduler.

Rack layout

All of the nodes in this article and the racks they sit on:

  • node1 is on rack1
  • node2 is on rack2
  • node3 is on rack3

What relaxLocality means

relaxLocality stands for relaxed locality, which is easier to think of as request downgrading: a ContainerRequest may ask the ResourceManager to place its container on node1, yet the container may end up anywhere on node1's rack rack1, or even on a node outside rack1. A request that started at the NODE_LOCAL level is downgraded to RACK_LOCAL or OFF_SWITCH.
Both ContainerRequest#relaxLocality and ResourceRequest#relaxLocality expose a relaxLocality attribute, but their meanings differ substantially.

ContainerRequest#relaxLocality

ContainerRequest#relaxLocality can be seen as downgrading from the client's point of view.
Suppose an ApplicationMaster calls AMRMClientImpl#addContainerRequest to add a containerRequest1 with nodes = {node1} and racks = null.

  • containerRequest1#relaxLocality = false means the request must not be downgraded: the ResourceManager may only place the requested resources on node1;
  • containerRequest1#relaxLocality = true means the request may be downgraded: if after N scheduling attempts (N depends on the FairScheduler configuration) the ResourceManager still has not placed the resources on node1, it downgrades once and tries node1's rack rack1; if after another N attempts the resources still have not been placed on rack1, it downgrades a second time and may place them on any node.

ResourceRequest#relaxLocality

ResourceRequest#relaxLocality can be seen as downgrading from the server's point of view. The ResourceRequests sent to the server are generated in AMRMClientImpl#addContainerRequest.

  1. Within one Priority, multiple requests with the same resourceName must not have different relaxLocality values. This keeps the design simple: requests with the same resourceName can be folded into a single ResourceRequest, and the server provides no way to tell differing relaxLocality values apart.
  2. ResourceRequest#relaxLocality = true means the server may place a container at ResourceRequest#resourceName. It can become true in two ways: a resourceName listed explicitly in the ContainerRequest's nodes or racks always gets relaxLocality = true, while the racks inferred from the nodes and the resourceName = * request inherit ContainerRequest#relaxLocality;
  3. ResourceRequest#relaxLocality = false means the server simply ignores that ResourceRequest;
  4. So one ContainerRequest generates multiple ResourceRequests for the server, yet those ResourceRequests together yield only one container. Two mechanisms guarantee this: first, each ContainerRequest generates exactly one ResourceRequest with resourceName = "*", which the server uses to decide whether the ContainerRequest has been satisfied; second, once the server successfully allocates against one of the ResourceRequests, it removes the now-redundant siblings, with the removal logic in AppSchedulingInfo#allocateNodeLocal.
// AMRMClientImpl#addContainerRequest
public synchronized void addContainerRequest(T req) {
    Preconditions.checkArgument(req != null,
        "Resource request can not be null.");
    Set<String> dedupedRacks = new HashSet<String>();
    if (req.getRacks() != null) {
      dedupedRacks.addAll(req.getRacks());
      if(req.getRacks().size() != dedupedRacks.size()) {
        Joiner joiner = Joiner.on(',');
        LOG.warn("ContainerRequest has duplicate racks: "
            + joiner.join(req.getRacks()));
      }
    }
    Set<String> inferredRacks = resolveRacks(req.getNodes());
    inferredRacks.removeAll(dedupedRacks);

    // check that specific and non-specific requests cannot be mixed within a
    // priority
    checkLocalityRelaxationConflict(req.getPriority(), ANY_LIST,
        req.getRelaxLocality());
    // check that specific rack cannot be mixed with specific node within a 
    // priority. If node and its rack are both specified then they must be 
    // in the same request.
    // For explicitly requested racks, we set locality relaxation to true
    checkLocalityRelaxationConflict(req.getPriority(), dedupedRacks, true);
    checkLocalityRelaxationConflict(req.getPriority(), inferredRacks,
        req.getRelaxLocality());
    // check if the node label expression specified is valid
    checkNodeLabelExpression(req);

    if (req.getNodes() != null) {
      HashSet<String> dedupedNodes = new HashSet<String>(req.getNodes());
      if(dedupedNodes.size() != req.getNodes().size()) {
        Joiner joiner = Joiner.on(',');
        LOG.warn("ContainerRequest has duplicate nodes: "
            + joiner.join(req.getNodes()));        
      }
      for (String node : dedupedNodes) {
        addResourceRequest(req.getPriority(), node, req.getCapability(), req,
            true, req.getNodeLabelExpression());
      }
    }

    for (String rack : dedupedRacks) {
      addResourceRequest(req.getPriority(), rack, req.getCapability(), req,
          true, req.getNodeLabelExpression());
    }

    // Ensure node requests are accompanied by requests for
    // corresponding rack
    for (String rack : inferredRacks) {
      addResourceRequest(req.getPriority(), rack, req.getCapability(), req,
          req.getRelaxLocality(), req.getNodeLabelExpression());
    }

    // Off-switch
    addResourceRequest(req.getPriority(), ResourceRequest.ANY, 
        req.getCapability(), req, req.getRelaxLocality(), req.getNodeLabelExpression());
  }

An example ContainerRequest

The ApplicationMaster issues a ContainerRequest with nodes = {node1, node2} and racks = {rack3}, i.e. it wants its container placed on node1, node2, or rack3. That ContainerRequest generates six ResourceRequests to send to the ResourceManager:

  1. ResourceRequest#resourceName = node1, ResourceRequest#relaxLocality = true
  2. ResourceRequest#resourceName = node2, ResourceRequest#relaxLocality = true
  3. ResourceRequest#resourceName = rack3, ResourceRequest#relaxLocality = true
  4. ResourceRequest#resourceName = rack1, ResourceRequest#relaxLocality = ContainerRequest#relaxLocality
  5. ResourceRequest#resourceName = rack2, ResourceRequest#relaxLocality = ContainerRequest#relaxLocality
  6. ResourceRequest#resourceName = *, ResourceRequest#relaxLocality = ContainerRequest#relaxLocality
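Under the rack layout above (node1 on rack1, node2 on rack2), this expansion can be reproduced with a small dependency-free simulation. Note that `RelaxLocalityExpansion`, its `Req` record, and the hard-coded topology map are illustrative stand-ins, not YARN API, and node deduplication is omitted:

```java
import java.util.*;

class RelaxLocalityExpansion {
    // Hypothetical topology table standing in for RackResolver: node -> rack.
    static final Map<String, String> TOPOLOGY = Map.of(
        "node1", "rack1", "node2", "rack2", "node3", "rack3");

    // One (resourceName, relaxLocality) pair per generated ResourceRequest.
    record Req(String resourceName, boolean relaxLocality) {}

    // Mirrors the expansion done by AMRMClientImpl#addContainerRequest.
    static List<Req> expand(List<String> nodes, List<String> racks,
                            boolean relaxLocality) {
        List<Req> out = new ArrayList<>();
        // Node-level requests are always sent with relaxLocality = true.
        for (String node : nodes) {
            out.add(new Req(node, true));
        }
        // Explicitly requested racks also always get relaxLocality = true.
        for (String rack : racks) {
            out.add(new Req(rack, true));
        }
        // Racks inferred from the nodes inherit ContainerRequest#relaxLocality.
        Set<String> inferred = new LinkedHashSet<>();
        for (String node : nodes) {
            inferred.add(TOPOLOGY.get(node));
        }
        inferred.removeAll(racks);
        for (String rack : inferred) {
            out.add(new Req(rack, relaxLocality));
        }
        // The off-switch ("*") request inherits it as well.
        out.add(new Req("*", relaxLocality));
        return out;
    }

    public static void main(String[] args) {
        // nodes = {node1, node2}, racks = {rack3}, relaxLocality = false
        for (Req r : expand(List.of("node1", "node2"), List.of("rack3"), false)) {
            System.out.println(r.resourceName() + " relaxLocality=" + r.relaxLocality());
        }
    }
}
```

Running it prints six requests: node1, node2, and rack3 with relaxLocality=true, then rack1, rack2, and * with relaxLocality=false, matching the list above.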

How FairScheduler handles relaxLocality on the ResourceManager (server) side

Pre-scheduling

After FairScheduler receives a NODE_UPDATE event from a NodeManager, it allocates resources on that node. The allocation goes through two scheduling layers:

  1. FairScheduler computes each queue's priority with its scheduling algorithm (not covered here);
  2. resources are handed out to the applications in each queue by priority, and allocation within an application is also priority-ordered.

This article focuses on the second layer. When handing a node's resources to an application, the first step is to check whether the application has any resources that need to be placed on that node; the logic is in FSAppAttempt#hasContainerForNode:

  1. fetch anyRequest, the application's request at the current priority for any location;
  2. fetch rackRequest, the application's request at the current priority for the rack the node sits on;
  3. fetch nodeRequest, the application's request at the current priority for the node itself;
  4. check whether anyRequest, rackRequest, or nodeRequest still has resources to allocate.

As the previous section showed, ResourceRequest#relaxLocality = true means the request's resourceName still has resources to allocate, which makes FSAppAttempt#hasContainerForNode easy to read: if anyRequest#relaxLocality = true, anyRequest has resources to allocate and neither the rack nor the node request needs to be consulted, and so on down the levels.

// FSAppAttempt#hasContainerForNode
public boolean hasContainerForNode(Priority prio, FSSchedulerNode node) {
    ResourceRequest anyRequest = getResourceRequest(prio, ResourceRequest.ANY);
    ResourceRequest rackRequest = getResourceRequest(prio, node.getRackName());
    ResourceRequest nodeRequest = getResourceRequest(prio, node.getNodeName());

    return
        // There must be outstanding requests at the given priority:
        anyRequest != null && anyRequest.getNumContainers() > 0 &&
            // If locality relaxation is turned off at *-level, there must be a
            // non-zero request for the node's rack:
            (anyRequest.getRelaxLocality() ||
                (rackRequest != null && rackRequest.getNumContainers() > 0)) &&
            // If locality relaxation is turned off at rack-level, there must be a
            // non-zero request at the node:
            (rackRequest == null || rackRequest.getRelaxLocality() ||
                (nodeRequest != null && nodeRequest.getNumContainers() > 0)) &&
            // The requested container must be able to fit on the node:
            Resources.lessThanOrEqual(RESOURCE_CALCULATOR, null,
                anyRequest.getCapability(), node.getRMNode().getTotalCapability());
  }
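The three-way check above can be exercised with a dependency-free sketch. The `Req` holder and method below are simplified stand-ins for ResourceRequest and the real method, and the node-capacity comparison at the end is omitted:

```java
class HasContainerCheck {
    // Minimal stand-in for ResourceRequest: outstanding count + relaxLocality.
    static final class Req {
        final int numContainers;
        final boolean relaxLocality;
        Req(int numContainers, boolean relaxLocality) {
            this.numContainers = numContainers;
            this.relaxLocality = relaxLocality;
        }
    }

    // Same boolean structure as FSAppAttempt#hasContainerForNode,
    // minus the capacity check.
    static boolean hasContainerForNode(Req any, Req rack, Req node) {
        return any != null && any.numContainers > 0
            // relaxation off at *-level: need a live rack request
            && (any.relaxLocality || (rack != null && rack.numContainers > 0))
            // relaxation off at rack-level: need a live node request
            && (rack == null || rack.relaxLocality
                || (node != null && node.numContainers > 0));
    }

    public static void main(String[] args) {
        Req any = new Req(1, false);   // "*" request, relaxLocality = false
        Req rack = new Req(1, false);  // rack request, relaxLocality = false
        // With relaxation off at both levels, a node-level request is mandatory:
        System.out.println(hasContainerForNode(any, rack, null));             // false
        System.out.println(hasContainerForNode(any, rack, new Req(1, true))); // true
    }
}
```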

The downgrade logic

We keep saying a request can be downgraded, so where does that actually happen? SchedulerApplicationAttempt#schedulingOpportunities records, per Priority, how many times an application has been offered resources (scheduled). A single offer may succeed, or fail because the node lacks resources; once the number of offers passes a threshold without a successful placement, a downgrade occurs.

As mentioned earlier, FSAppAttempt#assignContainer allocates resources per Priority. Before each actual allocation it calls FSAppAttempt#getAllowedLocalityLevel to compute the locality level allowed for this allocation (i.e. whether to downgrade): NODE_LOCAL, RACK_LOCAL, or OFF_SWITCH.

The logic:

  • nodeLocalityThreshold is configured via yarn.scheduler.fair.locality.threshold.node and defaults to -1;
  • rackLocalityThreshold is configured via yarn.scheduler.fair.locality.threshold.rack and defaults to -1;
  1. if nodeLocalityThreshold < 0 or rackLocalityThreshold < 0, OFF_SWITCH is returned immediately;
  2. on the first scheduling attempt for a priority, NODE_LOCAL is returned by default;
  3. the downgrade arithmetic: with nodeLocalityThreshold and rackLocalityThreshold both at 0.5 and a cluster of 4 NodeManagers, once the priority has accumulated 3 scheduling opportunities (3 > 4 * 0.5) a downgrade occurs, first from NODE_LOCAL to RACK_LOCAL and then from RACK_LOCAL to OFF_SWITCH.
// FSAppAttempt#getAllowedLocalityLevel
public synchronized NodeType getAllowedLocalityLevel(Priority priority,
      int numNodes, double nodeLocalityThreshold, double rackLocalityThreshold) {
    // upper limit on threshold
    if (nodeLocalityThreshold > 1.0) { nodeLocalityThreshold = 1.0; }
    if (rackLocalityThreshold > 1.0) { rackLocalityThreshold = 1.0; }

    // If delay scheduling is not being used, can schedule anywhere
    if (nodeLocalityThreshold < 0.0 || rackLocalityThreshold < 0.0) {
      return NodeType.OFF_SWITCH;
    }

    // Default level is NODE_LOCAL
    if (!allowedLocalityLevel.containsKey(priority)) {
      allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL);
      return NodeType.NODE_LOCAL;
    }

    NodeType allowed = allowedLocalityLevel.get(priority);

    // If level is already most liberal, we're done
    if (allowed.equals(NodeType.OFF_SWITCH)) return NodeType.OFF_SWITCH;

    double threshold = allowed.equals(NodeType.NODE_LOCAL) ? nodeLocalityThreshold :
      rackLocalityThreshold;

    // Relax locality constraints once we've surpassed threshold.
    if (getSchedulingOpportunities(priority) > (numNodes * threshold)) {
      if (allowed.equals(NodeType.NODE_LOCAL)) {
        allowedLocalityLevel.put(priority, NodeType.RACK_LOCAL);
        resetSchedulingOpportunities(priority);
      }
      else if (allowed.equals(NodeType.RACK_LOCAL)) {
        allowedLocalityLevel.put(priority, NodeType.OFF_SWITCH);
        resetSchedulingOpportunities(priority);
      }
    }
    return allowedLocalityLevel.get(priority);
  }
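The arithmetic in point 3 can be checked with a stand-alone simulation. `DelaySchedulingSim` is an illustrative reduction, not the Hadoop class; in the real scheduler the opportunity counter lives in SchedulerApplicationAttempt and is incremented elsewhere on each missed offer:

```java
class DelaySchedulingSim {
    enum NodeType { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

    NodeType allowed = null;  // allowedLocalityLevel for a single priority
    int opportunities = 0;    // schedulingOpportunities for that priority

    // Mirrors FSAppAttempt#getAllowedLocalityLevel for one priority.
    NodeType getAllowedLocalityLevel(int numNodes, double nodeThreshold,
                                     double rackThreshold) {
        if (nodeThreshold < 0.0 || rackThreshold < 0.0) {
            return NodeType.OFF_SWITCH;      // delay scheduling disabled
        }
        if (allowed == null) {               // first attempt for this priority
            allowed = NodeType.NODE_LOCAL;
            return allowed;
        }
        if (allowed == NodeType.OFF_SWITCH) {
            return allowed;                  // already the most liberal level
        }
        double threshold = (allowed == NodeType.NODE_LOCAL)
            ? nodeThreshold : rackThreshold;
        if (opportunities > numNodes * threshold) {
            allowed = (allowed == NodeType.NODE_LOCAL)
                ? NodeType.RACK_LOCAL : NodeType.OFF_SWITCH;
            opportunities = 0;               // reset after each downgrade
        }
        return allowed;
    }

    public static void main(String[] args) {
        DelaySchedulingSim sim = new DelaySchedulingSim();
        int numNodes = 4;                    // threshold = 4 * 0.5 = 2
        for (int i = 1; i <= 8; i++) {
            System.out.println("attempt " + i + ": "
                + sim.getAllowedLocalityLevel(numNodes, 0.5, 0.5));
            sim.opportunities++;             // count each attempt as missed
        }
    }
}
```

With every offer missed, attempts 1-3 stay at NODE_LOCAL, attempt 4 (counter 3 > 2) downgrades to RACK_LOCAL, and attempt 7 downgrades again to OFF_SWITCH.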

The upgrade logic

The logic lives in FSAppAttempt#allocate:

  1. if the allocation being made is NODE_LOCAL or RACK_LOCAL while the level computed in the previous section is OFF_SWITCH, the level is upgraded to NODE_LOCAL or RACK_LOCAL accordingly;
  2. if the allocation being made is NODE_LOCAL while the computed level is RACK_LOCAL, the level is upgraded to NODE_LOCAL.
// FSAppAttempt#allocate
synchronized public RMContainer allocate(NodeType type, FSSchedulerNode node,
      Priority priority, ResourceRequest request,
      Container container) {
    // Update allowed locality level
    NodeType allowed = allowedLocalityLevel.get(priority);
    if (allowed != null) {
      if (allowed.equals(NodeType.OFF_SWITCH) &&
          (type.equals(NodeType.NODE_LOCAL) ||
              type.equals(NodeType.RACK_LOCAL))) {
        this.resetAllowedLocalityLevel(priority, type);
      }
      else if (allowed.equals(NodeType.RACK_LOCAL) &&
          type.equals(NodeType.NODE_LOCAL)) {
        this.resetAllowedLocalityLevel(priority, type);
      }
    }
    // ...
    // ...
    // ...
  }
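The two promotion rules can be sketched the same way; `LocalityUpgradeSim` is an illustrative reduction of the level-reset at the top of FSAppAttempt#allocate, expressed as a pure function:

```java
class LocalityUpgradeSim {
    enum NodeType { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

    // An allocation made at a stricter level than currently allowed
    // promotes the allowed level back to that stricter level.
    static NodeType upgrade(NodeType allowed, NodeType allocatedAt) {
        if (allowed == NodeType.OFF_SWITCH
            && (allocatedAt == NodeType.NODE_LOCAL
                || allocatedAt == NodeType.RACK_LOCAL)) {
            return allocatedAt;
        }
        if (allowed == NodeType.RACK_LOCAL
            && allocatedAt == NodeType.NODE_LOCAL) {
            return allocatedAt;
        }
        return allowed;  // no upgrade in any other combination
    }

    public static void main(String[] args) {
        System.out.println(upgrade(NodeType.OFF_SWITCH, NodeType.NODE_LOCAL)); // NODE_LOCAL
        System.out.println(upgrade(NodeType.RACK_LOCAL, NodeType.NODE_LOCAL)); // NODE_LOCAL
        System.out.println(upgrade(NodeType.RACK_LOCAL, NodeType.OFF_SWITCH)); // RACK_LOCAL
    }
}
```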

Scheduling order

The logic lives in FSAppAttempt#assignContainer.

As long as the node has a node-level request, the allocation is made at the NODE_LOCAL level; see comment 1 in the code below. One question: when deciding whether to allocate at NODE_LOCAL, why must the rack request also satisfy rackLocalRequest != null && rackLocalRequest.getNumContainers() != 0?
First, recall that for one ContainerRequest the client builds several ResourceRequests to send to the server; then look at the three methods AppSchedulingInfo#allocateNodeLocal, AppSchedulingInfo#allocateRackLocal, and AppSchedulingInfo#allocateOffSwitch.
When a NODE_LOCAL container is allocated successfully, the container counts of the requests under that Priority whose resourceName is the rack or * are decremented as well; but a successful RACK_LOCAL or OFF_SWITCH allocation does not decrement the NODE_LOCAL count. A container can therefore already have been satisfied at the RACK_LOCAL or OFF_SWITCH level while localRequest != null && localRequest.getNumContainers() != 0 still holds.
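A toy bookkeeping model makes this asymmetry concrete. `AllocationBookkeepingSim` and its count map are illustrative only; the real AppSchedulingInfo tracks full ResourceRequest objects per priority and removes a request once its count reaches zero:

```java
import java.util.*;

class AllocationBookkeepingSim {
    // Outstanding container counts keyed by resourceName (one priority only).
    final Map<String, Integer> counts = new HashMap<>();

    void dec(String name) {
        counts.merge(name, -1, Integer::sum);
    }

    // A NODE_LOCAL allocation decrements the node, its rack, and "*".
    void allocateNodeLocal(String node, String rack) {
        dec(node); dec(rack); dec("*");
    }

    // A RACK_LOCAL allocation decrements only the rack and "*";
    // the node-level count is left untouched.
    void allocateRackLocal(String rack) {
        dec(rack); dec("*");
    }

    // An OFF_SWITCH allocation decrements only "*".
    void allocateOffSwitch() {
        dec("*");
    }

    public static void main(String[] args) {
        AllocationBookkeepingSim sim = new AllocationBookkeepingSim();
        // One ContainerRequest for node1 (on rack1) expands to three counts:
        sim.counts.put("node1", 1);
        sim.counts.put("rack1", 1);
        sim.counts.put("*", 1);
        sim.allocateOffSwitch();  // satisfied off-switch after a downgrade
        // The "*" count reaches 0, but stale node/rack counts remain:
        System.out.println(sim.counts);
    }
}
```

The stale node1 count is exactly why assignContainer also checks the rack request before scheduling at NODE_LOCAL.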

If the relevant conditions are met, the allocation is made at the RACK_LOCAL level; see comment 2 in the code below.

// FSAppAttempt#assignContainer
private Resource assignContainer(FSSchedulerNode node, boolean reserved) {
    // ...
    // ...
    // ...

        // Comment 1
        if (rackLocalRequest != null && rackLocalRequest.getNumContainers() != 0
            && localRequest != null && localRequest.getNumContainers() != 0) {
          return assignContainer(node, localRequest,
              NodeType.NODE_LOCAL, reserved);
        }

        if (rackLocalRequest != null && !rackLocalRequest.getRelaxLocality()) {
          continue;
        }
        // Comment 2
        if (rackLocalRequest != null && rackLocalRequest.getNumContainers() != 0
            && (allowedLocality.equals(NodeType.RACK_LOCAL) ||
            allowedLocality.equals(NodeType.OFF_SWITCH))) {
          return assignContainer(node, rackLocalRequest,
              NodeType.RACK_LOCAL, reserved);
        }

        ResourceRequest offSwitchRequest =
            getResourceRequest(priority, ResourceRequest.ANY);
        if (offSwitchRequest != null && !offSwitchRequest.getRelaxLocality()) {
          continue;
        }

        if (offSwitchRequest != null && offSwitchRequest.getNumContainers() != 0
            && allowedLocality.equals(NodeType.OFF_SWITCH)) {
          return assignContainer(node, offSwitchRequest,
              NodeType.OFF_SWITCH, reserved);
        }
      }
    }
    return Resources.none();
  }
