The relaxLocality Attribute in Hadoop YARN Resource Scheduling

In YARN, a resource request mainly refers to the container request that an ApplicationMaster sends to the ResourceManager through AMRMClientImpl. AMRMClientImpl does no special handling of the relaxLocality attribute when it sends a request, and on the ResourceManager side only FairScheduler honors relaxLocality; FifoScheduler and CapacityScheduler ignore it. The discussion below therefore assumes the FairScheduler.

Rack layout

All of the nodes in this article and the racks they sit on:

  • node1 is on rack1
  • node2 is on rack2
  • node3 is on rack3

What relaxLocality means

relaxLocality stands for relaxed locality, which is easier to think of as request downgrading: a ContainerRequest may ask the ResourceManager to place its container on node1, yet the container may end up anywhere on node1's rack rack1, or even on a node outside rack1. A request that started at the NODE_LOCAL level is downgraded to RACK_LOCAL or OFF_SWITCH.
Both ContainerRequest#relaxLocality and ResourceRequest#relaxLocality expose a relaxLocality attribute, but their meanings differ substantially.

ContainerRequest#relaxLocality

ContainerRequest#relaxLocality can be seen as downgrading from the client's point of view.
Suppose an ApplicationMaster calls AMRMClientImpl#addContainerRequest to add a containerRequest1 with nodes = {node1} and racks = null.

  • containerRequest1#relaxLocality = false means the request must not be downgraded: the ResourceManager may only place the requested resources on node1;
  • containerRequest1#relaxLocality = true means the request may be downgraded: if after N scheduling attempts (N depends on the FairScheduler configuration) the ResourceManager still has not placed the resources on node1, it downgrades once and tries node1's rack rack1; if after another N attempts the resources still have not been placed on rack1, it downgrades a second time and may place them on any node.

ResourceRequest#relaxLocality

ResourceRequest#relaxLocality can be seen as downgrading from the server's point of view. The ResourceRequests sent to the server are generated in AMRMClientImpl#addContainerRequest.

  1. Within one Priority, multiple requests with the same resourceName must not have different relaxLocality values. This keeps the design simple: requests with the same resourceName can be folded into a single ResourceRequest, and the server provides no way to tell differing relaxLocality values apart.
  2. ResourceRequest#relaxLocality = true means the server may place a container at ResourceRequest#resourceName. It can become true in two ways: a resourceName listed explicitly in the ContainerRequest's nodes or racks always gets relaxLocality = true, while the racks inferred from the nodes and the resourceName = * request inherit ContainerRequest#relaxLocality;
  3. ResourceRequest#relaxLocality = false means the server simply ignores that ResourceRequest;
  4. So one ContainerRequest generates multiple ResourceRequests for the server, yet those ResourceRequests together yield only one container. Two mechanisms guarantee this: first, each ContainerRequest generates exactly one ResourceRequest with resourceName = "*", which the server uses to decide whether the ContainerRequest has been satisfied; second, once the server successfully allocates against one of the ResourceRequests, it removes the now-redundant siblings, with the removal logic in AppSchedulingInfo#allocateNodeLocal.
// AMRMClientImpl#addContainerRequest
public synchronized void addContainerRequest(T req) {
    Preconditions.checkArgument(req != null,
        "Resource request can not be null.");
    Set<String> dedupedRacks = new HashSet<String>();
    if (req.getRacks() != null) {
      dedupedRacks.addAll(req.getRacks());
      if(req.getRacks().size() != dedupedRacks.size()) {
        Joiner joiner = Joiner.on(',');
        LOG.warn("ContainerRequest has duplicate racks: "
            + joiner.join(req.getRacks()));
      }
    }
    Set<String> inferredRacks = resolveRacks(req.getNodes());
    inferredRacks.removeAll(dedupedRacks);

    // check that specific and non-specific requests cannot be mixed within a
    // priority
    checkLocalityRelaxationConflict(req.getPriority(), ANY_LIST,
        req.getRelaxLocality());
    // check that specific rack cannot be mixed with specific node within a 
    // priority. If node and its rack are both specified then they must be 
    // in the same request.
    // For explicitly requested racks, we set locality relaxation to true
    checkLocalityRelaxationConflict(req.getPriority(), dedupedRacks, true);
    checkLocalityRelaxationConflict(req.getPriority(), inferredRacks,
        req.getRelaxLocality());
    // check if the node label expression specified is valid
    checkNodeLabelExpression(req);

    if (req.getNodes() != null) {
      HashSet<String> dedupedNodes = new HashSet<String>(req.getNodes());
      if(dedupedNodes.size() != req.getNodes().size()) {
        Joiner joiner = Joiner.on(',');
        LOG.warn("ContainerRequest has duplicate nodes: "
            + joiner.join(req.getNodes()));        
      }
      for (String node : dedupedNodes) {
        addResourceRequest(req.getPriority(), node, req.getCapability(), req,
            true, req.getNodeLabelExpression());
      }
    }

    for (String rack : dedupedRacks) {
      addResourceRequest(req.getPriority(), rack, req.getCapability(), req,
          true, req.getNodeLabelExpression());
    }

    // Ensure node requests are accompanied by requests for
    // corresponding rack
    for (String rack : inferredRacks) {
      addResourceRequest(req.getPriority(), rack, req.getCapability(), req,
          req.getRelaxLocality(), req.getNodeLabelExpression());
    }

    // Off-switch
    addResourceRequest(req.getPriority(), ResourceRequest.ANY, 
        req.getCapability(), req, req.getRelaxLocality(), req.getNodeLabelExpression());
  }

An example ContainerRequest

The ApplicationMaster issues a ContainerRequest with nodes = {node1, node2} and racks = {rack3}, i.e. it wants its container placed on node1, node2, or rack3. That ContainerRequest generates six ResourceRequests to send to the ResourceManager:

  1. ResourceRequest#resourceName = node1, ResourceRequest#relaxLocality = true
  2. ResourceRequest#resourceName = node2, ResourceRequest#relaxLocality = true
  3. ResourceRequest#resourceName = rack3, ResourceRequest#relaxLocality = true
  4. ResourceRequest#resourceName = rack1, ResourceRequest#relaxLocality = ContainerRequest#relaxLocality
  5. ResourceRequest#resourceName = rack2, ResourceRequest#relaxLocality = ContainerRequest#relaxLocality
  6. ResourceRequest#resourceName = *, ResourceRequest#relaxLocality = ContainerRequest#relaxLocality
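Under the rack layout above (node1 on rack1, node2 on rack2), this expansion can be reproduced with a small dependency-free simulation. Note that `RelaxLocalityExpansion`, its `Req` record, and the hard-coded topology map are illustrative stand-ins, not YARN API, and node deduplication is omitted:

```java
import java.util.*;

class RelaxLocalityExpansion {
    // Hypothetical topology table standing in for RackResolver: node -> rack.
    static final Map<String, String> TOPOLOGY = Map.of(
        "node1", "rack1", "node2", "rack2", "node3", "rack3");

    // One (resourceName, relaxLocality) pair per generated ResourceRequest.
    record Req(String resourceName, boolean relaxLocality) {}

    // Mirrors the expansion done by AMRMClientImpl#addContainerRequest.
    static List<Req> expand(List<String> nodes, List<String> racks,
                            boolean relaxLocality) {
        List<Req> out = new ArrayList<>();
        // Node-level requests are always sent with relaxLocality = true.
        for (String node : nodes) {
            out.add(new Req(node, true));
        }
        // Explicitly requested racks also always get relaxLocality = true.
        for (String rack : racks) {
            out.add(new Req(rack, true));
        }
        // Racks inferred from the nodes inherit ContainerRequest#relaxLocality.
        Set<String> inferred = new LinkedHashSet<>();
        for (String node : nodes) {
            inferred.add(TOPOLOGY.get(node));
        }
        inferred.removeAll(racks);
        for (String rack : inferred) {
            out.add(new Req(rack, relaxLocality));
        }
        // The off-switch ("*") request inherits it as well.
        out.add(new Req("*", relaxLocality));
        return out;
    }

    public static void main(String[] args) {
        // nodes = {node1, node2}, racks = {rack3}, relaxLocality = false
        for (Req r : expand(List.of("node1", "node2"), List.of("rack3"), false)) {
            System.out.println(r.resourceName() + " relaxLocality=" + r.relaxLocality());
        }
    }
}
```

Running it prints six requests: node1, node2, and rack3 with relaxLocality=true, then rack1, rack2, and * with relaxLocality=false, matching the list above.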

How FairScheduler handles relaxLocality on the ResourceManager (server) side

Pre-scheduling

After FairScheduler receives a NODE_UPDATE event from a NodeManager, it allocates resources on that node. The allocation goes through two scheduling layers:

  1. FairScheduler computes each queue's priority with its scheduling algorithm (not covered here);
  2. resources are handed out to the applications in each queue by priority, and allocation within an application is also priority-ordered.

This article focuses on the second layer. When handing a node's resources to an application, the first step is to check whether the application has any resources that need to be placed on that node; the logic is in FSAppAttempt#hasContainerForNode:

  1. fetch anyRequest, the application's request at the current priority for any location;
  2. fetch rackRequest, the application's request at the current priority for the rack the node sits on;
  3. fetch nodeRequest, the application's request at the current priority for the node itself;
  4. check whether anyRequest, rackRequest, or nodeRequest still has resources to allocate.

As the previous section showed, ResourceRequest#relaxLocality = true means the request's resourceName still has resources to allocate, which makes FSAppAttempt#hasContainerForNode easy to read: if anyRequest#relaxLocality = true, anyRequest has resources to allocate and neither the rack nor the node request needs to be consulted, and so on down the levels.

// FSAppAttempt#hasContainerForNode
public boolean hasContainerForNode(Priority prio, FSSchedulerNode node) {
    ResourceRequest anyRequest = getResourceRequest(prio, ResourceRequest.ANY);
    ResourceRequest rackRequest = getResourceRequest(prio, node.getRackName());
    ResourceRequest nodeRequest = getResourceRequest(prio, node.getNodeName());

    return
        // There must be outstanding requests at the given priority:
        anyRequest != null && anyRequest.getNumContainers() > 0 &&
            // If locality relaxation is turned off at *-level, there must be a
            // non-zero request for the node's rack:
            (anyRequest.getRelaxLocality() ||
                (rackRequest != null && rackRequest.getNumContainers() > 0)) &&
            // If locality relaxation is turned off at rack-level, there must be a
            // non-zero request at the node:
            (rackRequest == null || rackRequest.getRelaxLocality() ||
                (nodeRequest != null && nodeRequest.getNumContainers() > 0)) &&
            // The requested container must be able to fit on the node:
            Resources.lessThanOrEqual(RESOURCE_CALCULATOR, null,
                anyRequest.getCapability(), node.getRMNode().getTotalCapability());
  }
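The three-way check above can be exercised with a dependency-free sketch. The `Req` holder and method below are simplified stand-ins for ResourceRequest and the real method, and the node-capacity comparison at the end is omitted:

```java
class HasContainerCheck {
    // Minimal stand-in for ResourceRequest: outstanding count + relaxLocality.
    static final class Req {
        final int numContainers;
        final boolean relaxLocality;
        Req(int numContainers, boolean relaxLocality) {
            this.numContainers = numContainers;
            this.relaxLocality = relaxLocality;
        }
    }

    // Same boolean structure as FSAppAttempt#hasContainerForNode,
    // minus the capacity check.
    static boolean hasContainerForNode(Req any, Req rack, Req node) {
        return any != null && any.numContainers > 0
            // relaxation off at *-level: need a live rack request
            && (any.relaxLocality || (rack != null && rack.numContainers > 0))
            // relaxation off at rack-level: need a live node request
            && (rack == null || rack.relaxLocality
                || (node != null && node.numContainers > 0));
    }

    public static void main(String[] args) {
        Req any = new Req(1, false);   // "*" request, relaxLocality = false
        Req rack = new Req(1, false);  // rack request, relaxLocality = false
        // With relaxation off at both levels, a node-level request is mandatory:
        System.out.println(hasContainerForNode(any, rack, null));             // false
        System.out.println(hasContainerForNode(any, rack, new Req(1, true))); // true
    }
}
```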

The downgrade logic

We keep saying a request can be downgraded, so where does that actually happen? SchedulerApplicationAttempt#schedulingOpportunities records, per Priority, how many times an application has been offered resources (scheduled). A single offer may succeed, or fail because the node lacks resources; once the number of offers passes a threshold without a successful placement, a downgrade occurs.

As mentioned earlier, FSAppAttempt#assignContainer allocates resources per Priority. Before each actual allocation it calls FSAppAttempt#getAllowedLocalityLevel to compute the locality level allowed for this allocation (i.e. whether to downgrade): NODE_LOCAL, RACK_LOCAL, or OFF_SWITCH.

The logic:

  • nodeLocalityThreshold is configured via yarn.scheduler.fair.locality.threshold.node and defaults to -1;
  • rackLocalityThreshold is configured via yarn.scheduler.fair.locality.threshold.rack and defaults to -1;
  1. if nodeLocalityThreshold < 0 or rackLocalityThreshold < 0, OFF_SWITCH is returned immediately;
  2. on the first scheduling attempt for a priority, NODE_LOCAL is returned by default;
  3. the downgrade arithmetic: with nodeLocalityThreshold and rackLocalityThreshold both at 0.5 and a cluster of 4 NodeManagers, once the priority has accumulated 3 scheduling opportunities (3 > 4 * 0.5) a downgrade occurs, first from NODE_LOCAL to RACK_LOCAL and then from RACK_LOCAL to OFF_SWITCH.
// FSAppAttempt#getAllowedLocalityLevel
public synchronized NodeType getAllowedLocalityLevel(Priority priority,
      int numNodes, double nodeLocalityThreshold, double rackLocalityThreshold) {
    // upper limit on threshold
    if (nodeLocalityThreshold > 1.0) { nodeLocalityThreshold = 1.0; }
    if (rackLocalityThreshold > 1.0) { rackLocalityThreshold = 1.0; }

    // If delay scheduling is not being used, can schedule anywhere
    if (nodeLocalityThreshold < 0.0 || rackLocalityThreshold < 0.0) {
      return NodeType.OFF_SWITCH;
    }

    // Default level is NODE_LOCAL
    if (!allowedLocalityLevel.containsKey(priority)) {
      allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL);
      return NodeType.NODE_LOCAL;
    }

    NodeType allowed = allowedLocalityLevel.get(priority);

    // If level is already most liberal, we're done
    if (allowed.equals(NodeType.OFF_SWITCH)) return NodeType.OFF_SWITCH;

    double threshold = allowed.equals(NodeType.NODE_LOCAL) ? nodeLocalityThreshold :
      rackLocalityThreshold;

    // Relax locality constraints once we've surpassed threshold.
    if (getSchedulingOpportunities(priority) > (numNodes * threshold)) {
      if (allowed.equals(NodeType.NODE_LOCAL)) {
        allowedLocalityLevel.put(priority, NodeType.RACK_LOCAL);
        resetSchedulingOpportunities(priority);
      }
      else if (allowed.equals(NodeType.RACK_LOCAL)) {
        allowedLocalityLevel.put(priority, NodeType.OFF_SWITCH);
        resetSchedulingOpportunities(priority);
      }
    }
    return allowedLocalityLevel.get(priority);
  }
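The arithmetic in point 3 can be checked with a stand-alone simulation. `DelaySchedulingSim` is an illustrative reduction, not the Hadoop class; in the real scheduler the opportunity counter lives in SchedulerApplicationAttempt and is incremented elsewhere on each missed offer:

```java
class DelaySchedulingSim {
    enum NodeType { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

    NodeType allowed = null;  // allowedLocalityLevel for a single priority
    int opportunities = 0;    // schedulingOpportunities for that priority

    // Mirrors FSAppAttempt#getAllowedLocalityLevel for one priority.
    NodeType getAllowedLocalityLevel(int numNodes, double nodeThreshold,
                                     double rackThreshold) {
        if (nodeThreshold < 0.0 || rackThreshold < 0.0) {
            return NodeType.OFF_SWITCH;      // delay scheduling disabled
        }
        if (allowed == null) {               // first attempt for this priority
            allowed = NodeType.NODE_LOCAL;
            return allowed;
        }
        if (allowed == NodeType.OFF_SWITCH) {
            return allowed;                  // already the most liberal level
        }
        double threshold = (allowed == NodeType.NODE_LOCAL)
            ? nodeThreshold : rackThreshold;
        if (opportunities > numNodes * threshold) {
            allowed = (allowed == NodeType.NODE_LOCAL)
                ? NodeType.RACK_LOCAL : NodeType.OFF_SWITCH;
            opportunities = 0;               // reset after each downgrade
        }
        return allowed;
    }

    public static void main(String[] args) {
        DelaySchedulingSim sim = new DelaySchedulingSim();
        int numNodes = 4;                    // threshold = 4 * 0.5 = 2
        for (int i = 1; i <= 8; i++) {
            System.out.println("attempt " + i + ": "
                + sim.getAllowedLocalityLevel(numNodes, 0.5, 0.5));
            sim.opportunities++;             // count each attempt as missed
        }
    }
}
```

With every offer missed, attempts 1-3 stay at NODE_LOCAL, attempt 4 (counter 3 > 2) downgrades to RACK_LOCAL, and attempt 7 downgrades again to OFF_SWITCH.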

The upgrade logic

The logic lives in FSAppAttempt#allocate:

  1. if the allocation being made is NODE_LOCAL or RACK_LOCAL while the level computed in the previous section is OFF_SWITCH, the level is upgraded to NODE_LOCAL or RACK_LOCAL accordingly;
  2. if the allocation being made is NODE_LOCAL while the computed level is RACK_LOCAL, the level is upgraded to NODE_LOCAL.
// FSAppAttempt#allocate
synchronized public RMContainer allocate(NodeType type, FSSchedulerNode node,
      Priority priority, ResourceRequest request,
      Container container) {
    // Update allowed locality level
    NodeType allowed = allowedLocalityLevel.get(priority);
    if (allowed != null) {
      if (allowed.equals(NodeType.OFF_SWITCH) &&
          (type.equals(NodeType.NODE_LOCAL) ||
              type.equals(NodeType.RACK_LOCAL))) {
        this.resetAllowedLocalityLevel(priority, type);
      }
      else if (allowed.equals(NodeType.RACK_LOCAL) &&
          type.equals(NodeType.NODE_LOCAL)) {
        this.resetAllowedLocalityLevel(priority, type);
      }
    }
    // ...
    // ...
    // ...
  }
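The two promotion rules can be sketched the same way; `LocalityUpgradeSim` is an illustrative reduction of the level-reset at the top of FSAppAttempt#allocate, expressed as a pure function:

```java
class LocalityUpgradeSim {
    enum NodeType { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

    // An allocation made at a stricter level than currently allowed
    // promotes the allowed level back to that stricter level.
    static NodeType upgrade(NodeType allowed, NodeType allocatedAt) {
        if (allowed == NodeType.OFF_SWITCH
            && (allocatedAt == NodeType.NODE_LOCAL
                || allocatedAt == NodeType.RACK_LOCAL)) {
            return allocatedAt;
        }
        if (allowed == NodeType.RACK_LOCAL
            && allocatedAt == NodeType.NODE_LOCAL) {
            return allocatedAt;
        }
        return allowed;  // no upgrade in any other combination
    }

    public static void main(String[] args) {
        System.out.println(upgrade(NodeType.OFF_SWITCH, NodeType.NODE_LOCAL)); // NODE_LOCAL
        System.out.println(upgrade(NodeType.RACK_LOCAL, NodeType.NODE_LOCAL)); // NODE_LOCAL
        System.out.println(upgrade(NodeType.RACK_LOCAL, NodeType.OFF_SWITCH)); // RACK_LOCAL
    }
}
```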

Scheduling order

The logic lives in FSAppAttempt#assignContainer.

As long as the node has a node-level request, the allocation is made at the NODE_LOCAL level; see comment 1 in the code below. One question: when deciding whether to allocate at NODE_LOCAL, why must the rack request also satisfy rackLocalRequest != null && rackLocalRequest.getNumContainers() != 0?
First, recall that for one ContainerRequest the client builds several ResourceRequests to send to the server; then look at the three methods AppSchedulingInfo#allocateNodeLocal, AppSchedulingInfo#allocateRackLocal, and AppSchedulingInfo#allocateOffSwitch.
When a NODE_LOCAL container is allocated successfully, the container counts of the requests under that Priority whose resourceName is the rack or * are decremented as well; but a successful RACK_LOCAL or OFF_SWITCH allocation does not decrement the NODE_LOCAL count. A container can therefore already have been satisfied at the RACK_LOCAL or OFF_SWITCH level while localRequest != null && localRequest.getNumContainers() != 0 still holds.
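A toy bookkeeping model makes this asymmetry concrete. `AllocationBookkeepingSim` and its count map are illustrative only; the real AppSchedulingInfo tracks full ResourceRequest objects per priority and removes a request once its count reaches zero:

```java
import java.util.*;

class AllocationBookkeepingSim {
    // Outstanding container counts keyed by resourceName (one priority only).
    final Map<String, Integer> counts = new HashMap<>();

    void dec(String name) {
        counts.merge(name, -1, Integer::sum);
    }

    // A NODE_LOCAL allocation decrements the node, its rack, and "*".
    void allocateNodeLocal(String node, String rack) {
        dec(node); dec(rack); dec("*");
    }

    // A RACK_LOCAL allocation decrements only the rack and "*";
    // the node-level count is left untouched.
    void allocateRackLocal(String rack) {
        dec(rack); dec("*");
    }

    // An OFF_SWITCH allocation decrements only "*".
    void allocateOffSwitch() {
        dec("*");
    }

    public static void main(String[] args) {
        AllocationBookkeepingSim sim = new AllocationBookkeepingSim();
        // One ContainerRequest for node1 (on rack1) expands to three counts:
        sim.counts.put("node1", 1);
        sim.counts.put("rack1", 1);
        sim.counts.put("*", 1);
        sim.allocateOffSwitch();  // satisfied off-switch after a downgrade
        // The "*" count reaches 0, but stale node/rack counts remain:
        System.out.println(sim.counts);
    }
}
```

The stale node1 count is exactly why assignContainer also checks the rack request before scheduling at NODE_LOCAL.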

If the relevant conditions are met, the allocation is made at the RACK_LOCAL level; see comment 2 in the code below.

// FSAppAttempt#assignContainer
private Resource assignContainer(FSSchedulerNode node, boolean reserved) {
    // ...
    // ...
    // ...

        // Comment 1
        if (rackLocalRequest != null && rackLocalRequest.getNumContainers() != 0
            && localRequest != null && localRequest.getNumContainers() != 0) {
          return assignContainer(node, localRequest,
              NodeType.NODE_LOCAL, reserved);
        }

        if (rackLocalRequest != null && !rackLocalRequest.getRelaxLocality()) {
          continue;
        }
        // Comment 2
        if (rackLocalRequest != null && rackLocalRequest.getNumContainers() != 0
            && (allowedLocality.equals(NodeType.RACK_LOCAL) ||
            allowedLocality.equals(NodeType.OFF_SWITCH))) {
          return assignContainer(node, rackLocalRequest,
              NodeType.RACK_LOCAL, reserved);
        }

        ResourceRequest offSwitchRequest =
            getResourceRequest(priority, ResourceRequest.ANY);
        if (offSwitchRequest != null && !offSwitchRequest.getRelaxLocality()) {
          continue;
        }

        if (offSwitchRequest != null && offSwitchRequest.getNumContainers() != 0
            && allowedLocality.equals(NodeType.OFF_SWITCH)) {
          return assignContainer(node, offSwitchRequest,
              NodeType.OFF_SWITCH, reserved);
        }
      }
    }
    return Resources.none();
  }
