In YARN, a resource request usually means a container request that an ApplicationMaster sends to the ResourceManager through AMRMClientImpl. AMRMClientImpl does not treat the relaxLocality attribute specially when sending requests, and on the ResourceManager side only the FairScheduler acts on relaxLocality; FifoScheduler and CapacityScheduler ignore it. The scheduling discussed below therefore assumes the FairScheduler.
Rack topology
The nodes used throughout this article and the racks they belong to:
- node1 is on rack1
- node2 is on rack2
- node3 is on rack3
What the relaxLocality attribute means
relaxLocality means locality relaxation, which can be understood as request downgrading. Downgrading means that although a ContainerRequest asks the ResourceManager to place its resources on node1, the resources may end up on node1's rack (rack1) or on any node outside rack1: a request that was originally NODE_LOCAL is downgraded to RACK_LOCAL or OFF_SWITCH.
Although ContainerRequest#relaxLocality and ResourceRequest#relaxLocality share the same attribute name, their meanings differ substantially.
ContainerRequest#relaxLocality
ContainerRequest#relaxLocality can be seen as downgrading from the client's perspective.
Suppose an ApplicationMaster adds a containerRequest1 via AMRMClientImpl#addContainerRequest, specifying nodes={node1} and racks=null:
- containerRequest1#relaxLocality = false means the request may not be downgraded; the ResourceManager can only place the requested resources on node1;
- containerRequest1#relaxLocality = true means the request may be downgraded: if after N scheduling attempts (N depends on the FairScheduler configuration) the resources have not been placed on node1, a first downgrade tries to place them on node1's rack, rack1; if after another N attempts (again configuration-dependent) they still have not been placed on rack1, a second downgrade tries to place them on any node.
ResourceRequest#relaxLocality
ResourceRequest#relaxLocality can be seen as downgrading from the server's perspective. The ResourceRequests sent to the server are generated in AMRMClientImpl#addContainerRequest.
- Within the same Priority, multiple requests with the same resourceName are not allowed to have different relaxLocality values. This design keeps things simple: requests with the same resourceName can be folded into a single ResourceRequest, and the server has no logic to distinguish different relaxLocality values for one resourceName.
- ResourceRequest#relaxLocality = true means the server may place a container on ResourceRequest#resourceName. A value of true has two sources: first, the nodes and racks explicitly specified by the ContainerRequest; second, ContainerRequest#relaxLocality = true, in which case the ResourceRequests generated for the racks inferred from the specified nodes have relaxLocality = true, and so does the additional ResourceRequest with resourceName = *.
- ResourceRequest#relaxLocality = false means the server simply ignores that ResourceRequest.
- Evidently one ContainerRequest produces multiple ResourceRequests sent to the server, yet these together yield only one container. Two mechanisms guarantee this: first, each ContainerRequest produces exactly one ResourceRequest with resourceName = "*", which the server uses to decide whether a ContainerRequest has been satisfied; second, once the server successfully allocates against one of the ResourceRequests generated from a ContainerRequest, it removes the now-redundant others; the removal logic lives in AppSchedulingInfo#allocateNodeLocal.
// AMRMClientImpl#addContainerRequest
public synchronized void addContainerRequest(T req) {
  Preconditions.checkArgument(req != null,
      "Resource request can not be null.");
  Set<String> dedupedRacks = new HashSet<String>();
  if (req.getRacks() != null) {
    dedupedRacks.addAll(req.getRacks());
    if (req.getRacks().size() != dedupedRacks.size()) {
      Joiner joiner = Joiner.on(',');
      LOG.warn("ContainerRequest has duplicate racks: "
          + joiner.join(req.getRacks()));
    }
  }
  Set<String> inferredRacks = resolveRacks(req.getNodes());
  inferredRacks.removeAll(dedupedRacks);

  // check that specific and non-specific requests cannot be mixed within a
  // priority
  checkLocalityRelaxationConflict(req.getPriority(), ANY_LIST,
      req.getRelaxLocality());
  // check that specific rack cannot be mixed with specific node within a
  // priority. If node and its rack are both specified then they must be
  // in the same request.
  // For explicitly requested racks, we set locality relaxation to true
  checkLocalityRelaxationConflict(req.getPriority(), dedupedRacks, true);
  checkLocalityRelaxationConflict(req.getPriority(), inferredRacks,
      req.getRelaxLocality());
  // check if the node label expression specified is valid
  checkNodeLabelExpression(req);

  if (req.getNodes() != null) {
    HashSet<String> dedupedNodes = new HashSet<String>(req.getNodes());
    if (dedupedNodes.size() != req.getNodes().size()) {
      Joiner joiner = Joiner.on(',');
      LOG.warn("ContainerRequest has duplicate nodes: "
          + joiner.join(req.getNodes()));
    }
    for (String node : dedupedNodes) {
      addResourceRequest(req.getPriority(), node, req.getCapability(), req,
          true, req.getNodeLabelExpression());
    }
  }

  for (String rack : dedupedRacks) {
    addResourceRequest(req.getPriority(), rack, req.getCapability(), req,
        true, req.getNodeLabelExpression());
  }

  // Ensure node requests are accompanied by requests for
  // corresponding rack
  for (String rack : inferredRacks) {
    addResourceRequest(req.getPriority(), rack, req.getCapability(), req,
        req.getRelaxLocality(), req.getNodeLabelExpression());
  }

  // Off-switch
  addResourceRequest(req.getPriority(), ResourceRequest.ANY,
      req.getCapability(), req, req.getRelaxLocality(),
      req.getNodeLabelExpression());
}
A ContainerRequest example
Suppose the ApplicationMaster issues a ContainerRequest specifying nodes = {node1, node2} and racks = {rack3}, i.e. it wants the requested container to be placed on node1, node2, or rack3. For this ContainerRequest, six ResourceRequests are generated and sent to the ResourceManager:
- ResourceRequest#resourceName = node1, ResourceRequest#relaxLocality = true
- ResourceRequest#resourceName = node2, ResourceRequest#relaxLocality = true
- ResourceRequest#resourceName = rack3, ResourceRequest#relaxLocality = true
- ResourceRequest#resourceName = rack1, ResourceRequest#relaxLocality = ContainerRequest#relaxLocality
- ResourceRequest#resourceName = rack2, ResourceRequest#relaxLocality = ContainerRequest#relaxLocality
- ResourceRequest#resourceName = *, ResourceRequest#relaxLocality = ContainerRequest#relaxLocality
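The expansion above can be sketched with a small, dependency-free model. RequestExpansion and its node-to-rack lookup table are illustrative stand-ins, not YARN's API; the map approximates what resolveRacks() obtains from the cluster's rack-resolution configuration:

```java
import java.util.*;

// Illustrative model of how AMRMClientImpl#addContainerRequest expands one
// ContainerRequest into (resourceName -> relaxLocality) pairs.
// The nodeToRack map stands in for the real resolveRacks() lookup.
class RequestExpansion {
    static Map<String, Boolean> expand(List<String> nodes, List<String> racks,
                                       Map<String, String> nodeToRack,
                                       boolean relaxLocality) {
        Map<String, Boolean> out = new LinkedHashMap<>();
        Set<String> explicitRacks = new LinkedHashSet<>(racks);
        // Racks inferred from the requested nodes, minus explicitly listed racks.
        Set<String> inferredRacks = new LinkedHashSet<>();
        for (String n : nodes) inferredRacks.add(nodeToRack.get(n));
        inferredRacks.removeAll(explicitRacks);
        for (String n : nodes) out.put(n, true);                  // explicit nodes: always true
        for (String r : explicitRacks) out.put(r, true);          // explicit racks: always true
        for (String r : inferredRacks) out.put(r, relaxLocality); // inferred racks: follow request
        out.put("*", relaxLocality);                              // off-switch: follow request
        return out;
    }
}
```

With nodes = {node1, node2}, racks = {rack3}, and ContainerRequest#relaxLocality = false, this reproduces the six pairs listed above.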
How the FairScheduler handles relaxLocality on the ResourceManager (server) side
Scheduling preliminaries
After the FairScheduler receives a NODE_UPDATE event from a NodeManager, it allocates resources on that node. The allocation involves two layers of scheduling:
- the FairScheduler computes each queue's priority via its scheduling algorithm (not covered here);
- resources are then handed to the applications in each queue by priority; within an application, allocation also proceeds by priority.
This article focuses on the second layer. When assigning a node's resources to an application, the first step is to check whether the application has any resources that need to be placed on that node; the logic lives in FSAppAttempt#hasContainerForNode:
- fetch anyRequest, the application's request at the current priority with resourceName = *;
- fetch rackRequest, the request at the current priority for the rack the node is on;
- fetch nodeRequest, the request at the current priority for the node itself;
- the check passes if anyRequest, rackRequest, or nodeRequest has resources to allocate.
Per the previous section, ResourceRequest#relaxLocality = true means that ResourceRequest's resourceName has resources to allocate, which makes FSAppAttempt#hasContainerForNode easy to follow: if anyRequest#relaxLocality = true, anyRequest has resources to allocate and neither the rack nor the node needs to be consulted, and so on down the levels.
// FSAppAttempt#hasContainerForNode
public boolean hasContainerForNode(Priority prio, FSSchedulerNode node) {
  ResourceRequest anyRequest = getResourceRequest(prio, ResourceRequest.ANY);
  ResourceRequest rackRequest = getResourceRequest(prio, node.getRackName());
  ResourceRequest nodeRequest = getResourceRequest(prio, node.getNodeName());

  return
      // There must be outstanding requests at the given priority:
      anyRequest != null && anyRequest.getNumContainers() > 0 &&
      // If locality relaxation is turned off at *-level, there must be a
      // non-zero request for the node's rack:
      (anyRequest.getRelaxLocality() ||
          (rackRequest != null && rackRequest.getNumContainers() > 0)) &&
      // If locality relaxation is turned off at rack-level, there must be a
      // non-zero request at the node:
      (rackRequest == null || rackRequest.getRelaxLocality() ||
          (nodeRequest != null && nodeRequest.getNumContainers() > 0)) &&
      // The requested container must be able to fit on the node:
      Resources.lessThanOrEqual(RESOURCE_CALCULATOR, null,
          anyRequest.getCapability(), node.getRMNode().getTotalCapability());
}
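The same three-way check can be restated as a pure predicate over the request counts. HasContainerCheck is an illustrative model, not YARN code: a negative count encodes "no request at that resourceName" (null in the real code), and the capacity check is omitted:

```java
// Illustrative restatement of the cascade in FSAppAttempt#hasContainerForNode.
// A count < 0 means "no request exists at that resourceName".
class HasContainerCheck {
    static boolean hasContainerForNode(int anyNum, boolean anyRelax,
                                       int rackNum, boolean rackRelax,
                                       int nodeNum) {
        boolean rackExists = rackNum >= 0;  // a rack-level request is present
        // outstanding *-level demand is always required
        return anyNum > 0
            // * relaxed, or there is rack-level demand:
            && (anyRelax || (rackExists && rackNum > 0))
            // rack relaxed (or absent), or there is node-level demand:
            && (!rackExists || rackRelax || nodeNum > 0);
    }
}
```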
Downgrade logic
We keep saying requests can be downgraded, but where does the downgrade happen? SchedulerApplicationAttempt#schedulingOpportunities records, per application and per Priority, how many times resources have been offered (scheduled); each attempt may succeed or fail because the node lacks resources. When the number of attempts exceeds a threshold without a successful allocation, a downgrade occurs.
As mentioned earlier, FSAppAttempt#assignContainer allocates resources per Priority. Before each actual allocation it calls FSAppAttempt#getAllowedLocalityLevel to compute the locality level for this allocation (i.e. whether to downgrade): NODE_LOCAL, RACK_LOCAL, or OFF_SWITCH.
The details
nodeLocalityThreshold is configured via yarn.scheduler.fair.locality.threshold.node (default -1); rackLocalityThreshold is configured via yarn.scheduler.fair.locality.threshold.rack (default -1).
- If nodeLocalityThreshold < 0 or rackLocalityThreshold < 0, OFF_SWITCH is returned directly;
- for the first scheduling attempt at a given priority, the default NODE_LOCAL is returned;
- the downgrade computation: with nodeLocalityThreshold and rackLocalityThreshold both 0.5 and 4 NodeManagers in the cluster, once the scheduling count at the current priority reaches 3 (3 > 4 * 0.5) a downgrade occurs, first from NODE_LOCAL to RACK_LOCAL, then from RACK_LOCAL to OFF_SWITCH.
// FSAppAttempt#getAllowedLocalityLevel
public synchronized NodeType getAllowedLocalityLevel(Priority priority,
    int numNodes, double nodeLocalityThreshold, double rackLocalityThreshold) {
  // upper limit on threshold
  if (nodeLocalityThreshold > 1.0) { nodeLocalityThreshold = 1.0; }
  if (rackLocalityThreshold > 1.0) { rackLocalityThreshold = 1.0; }

  // If delay scheduling is not being used, can schedule anywhere
  if (nodeLocalityThreshold < 0.0 || rackLocalityThreshold < 0.0) {
    return NodeType.OFF_SWITCH;
  }

  // Default level is NODE_LOCAL
  if (!allowedLocalityLevel.containsKey(priority)) {
    allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL);
    return NodeType.NODE_LOCAL;
  }

  NodeType allowed = allowedLocalityLevel.get(priority);
  // If level is already most liberal, we're done
  if (allowed.equals(NodeType.OFF_SWITCH)) return NodeType.OFF_SWITCH;

  double threshold = allowed.equals(NodeType.NODE_LOCAL) ?
      nodeLocalityThreshold : rackLocalityThreshold;

  // Relax locality constraints once we've surpassed threshold.
  if (getSchedulingOpportunities(priority) > (numNodes * threshold)) {
    if (allowed.equals(NodeType.NODE_LOCAL)) {
      allowedLocalityLevel.put(priority, NodeType.RACK_LOCAL);
      resetSchedulingOpportunities(priority);
    } else if (allowed.equals(NodeType.RACK_LOCAL)) {
      allowedLocalityLevel.put(priority, NodeType.OFF_SWITCH);
      resetSchedulingOpportunities(priority);
    }
  }
  return allowedLocalityLevel.get(priority);
}
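The downgrade path can be condensed into a small, dependency-free state machine. LocalityModel is illustrative (in the real scheduler the opportunity counter is maintained separately per priority); only the opportunities > numNodes * threshold arithmetic is taken from the code above:

```java
// Illustrative state machine modeling the downgrade path of
// FSAppAttempt#getAllowedLocalityLevel. Each call to next() represents one
// missed scheduling opportunity at a single priority.
class LocalityModel {
    enum Level { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

    Level allowed = Level.NODE_LOCAL;
    int opportunities = 0; // missed scheduling opportunities at this priority

    Level next(int numNodes, double threshold) {
        if (threshold < 0.0) return Level.OFF_SWITCH; // delay scheduling disabled
        opportunities++;
        if (allowed != Level.OFF_SWITCH && opportunities > numNodes * threshold) {
            // drop one level and reset the counter, as the real code does
            allowed = (allowed == Level.NODE_LOCAL) ? Level.RACK_LOCAL
                                                    : Level.OFF_SWITCH;
            opportunities = 0;
        }
        return allowed;
    }
}
```

With numNodes = 4 and threshold = 0.5 this reproduces the example above: the third missed opportunity (3 > 2) triggers the drop to RACK_LOCAL, and three more trigger the drop to OFF_SWITCH.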
Upgrade logic
The logic lives in FSAppAttempt#allocate:
- if the container just allocated is NODE_LOCAL or RACK_LOCAL while the level computed in the previous section was OFF_SWITCH, the allowed level is upgraded (reset) to NODE_LOCAL or RACK_LOCAL accordingly;
- if the container just allocated is NODE_LOCAL while the computed level was RACK_LOCAL, the allowed level is upgraded to NODE_LOCAL.
// FSAppAttempt#allocate
synchronized public RMContainer allocate(NodeType type, FSSchedulerNode node,
    Priority priority, ResourceRequest request,
    Container container) {
  // Update allowed locality level
  NodeType allowed = allowedLocalityLevel.get(priority);
  if (allowed != null) {
    if (allowed.equals(NodeType.OFF_SWITCH) &&
        (type.equals(NodeType.NODE_LOCAL) ||
            type.equals(NodeType.RACK_LOCAL))) {
      this.resetAllowedLocalityLevel(priority, type);
    } else if (allowed.equals(NodeType.RACK_LOCAL) &&
        type.equals(NodeType.NODE_LOCAL)) {
      this.resetAllowedLocalityLevel(priority, type);
    }
  }
  // ...
}
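Both upgrade branches implement one rule, never keep an allowed level looser than the level just assigned, which can be sketched as (LocalityUpgrade is an illustrative model, not YARN code):

```java
// Illustrative reduction of the upgrade branches in FSAppAttempt#allocate:
// if a container was actually assigned at a stricter level than the currently
// allowed one, the allowed level snaps back to the assigned level.
class LocalityUpgrade {
    enum Level { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH } // ordinal: strict .. loose

    static Level onAllocate(Level allowed, Level assigned) {
        // assigned stricter than allowed -> reset allowed to assigned
        return (assigned.ordinal() < allowed.ordinal()) ? assigned : allowed;
    }
}
```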
Scheduling order
The logic lives in FSAppAttempt#assignContainer.
As long as the current node has a node-level request, NODE_LOCAL is used for the allocation; see comment 1 in the code below. One question: when deciding whether to schedule at NODE_LOCAL, why must the rack request's state, rackLocalRequest != null && rackLocalRequest.getNumContainers() != 0, also be checked?
First, recall that the client creates multiple ResourceRequests for one ContainerRequest; also see the three methods AppSchedulingInfo#allocateNodeLocal, AppSchedulingInfo#allocateRackLocal, and AppSchedulingInfo#allocateOffSwitch.
When a NODE_LOCAL container is successfully allocated, the container counts of the requests with resourceName = rack and * at that Priority are decremented by one. But a successful RACK_LOCAL or OFF_SWITCH allocation does not decrement the NODE_LOCAL count. As a result, a container may already have been satisfied at the RACK_LOCAL or OFF_SWITCH level while localRequest != null && localRequest.getNumContainers() != 0 still holds.
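A minimal sketch of that bookkeeping, assuming simple per-resourceName counters (OutstandingRequests is an illustrative model; the real AppSchedulingInfo methods also adjust capacities and remove exhausted requests):

```java
import java.util.*;

// Illustrative model of the decrement behavior of
// AppSchedulingInfo#allocateNodeLocal / allocateRackLocal / allocateOffSwitch:
// a NODE_LOCAL allocation also decrements the node's rack and the "*" request,
// a RACK_LOCAL allocation decrements the rack and "*", OFF_SWITCH only "*".
class OutstandingRequests {
    final Map<String, Integer> numContainers = new HashMap<>();

    void allocateNodeLocal(String node, String rack) {
        decrement(node); decrement(rack); decrement("*");
    }
    void allocateRackLocal(String rack) { decrement(rack); decrement("*"); }
    void allocateOffSwitch() { decrement("*"); }

    private void decrement(String name) {
        numContainers.merge(name, -1, Integer::sum);
    }
}
```

After allocateRackLocal, the node-level count is untouched even though the container is satisfied, which is why assignContainer must also consult the rack request before choosing NODE_LOCAL.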
When the relevant conditions are met, RACK_LOCAL is used for the allocation; see comment 2 in the code below.
// FSAppAttempt#assignContainer
private Resource assignContainer(FSSchedulerNode node, boolean reserved) {
  // ...
  // (the code below runs inside a loop over the application's priorities;
  //  "continue" moves on to the next priority)
  // ...

      // comment 1
      if (rackLocalRequest != null && rackLocalRequest.getNumContainers() != 0
          && localRequest != null && localRequest.getNumContainers() != 0) {
        return assignContainer(node, localRequest,
            NodeType.NODE_LOCAL, reserved);
      }

      if (rackLocalRequest != null && !rackLocalRequest.getRelaxLocality()) {
        continue;
      }

      // comment 2
      if (rackLocalRequest != null && rackLocalRequest.getNumContainers() != 0
          && (allowedLocality.equals(NodeType.RACK_LOCAL) ||
              allowedLocality.equals(NodeType.OFF_SWITCH))) {
        return assignContainer(node, rackLocalRequest,
            NodeType.RACK_LOCAL, reserved);
      }

      ResourceRequest offSwitchRequest =
          getResourceRequest(priority, ResourceRequest.ANY);
      if (offSwitchRequest != null && !offSwitchRequest.getRelaxLocality()) {
        continue;
      }

      if (offSwitchRequest != null && offSwitchRequest.getNumContainers() != 0
          && allowedLocality.equals(NodeType.OFF_SWITCH)) {
        return assignContainer(node, offSwitchRequest,
            NodeType.OFF_SWITCH, reserved);
      }
    }
  }
  return Resources.none();
}