Resource allocation is passive: it happens when a node sends a heartbeat (NODE_UPDATE), and the scheduler allocates based on the resource status the node reports.
A quick note up front: the resources needed to launch the ApplicationMaster (memory and virtualcores) are initialized on the client side when the application is submitted (in the YARNRunner class); memory defaults to 1536 MB and virtualcores default to 1.
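As a reference point, here is a sketch of how the client side arrives at those defaults. The MRJobConfig keys and defaults below are the real ones (yarn.app.mapreduce.am.resource.mb, default 1536; yarn.app.mapreduce.am.resource.cpu-vcores, default 1); the surrounding lines paraphrase YARNRunner rather than quote it:
// Paraphrased from YARNRunner.createApplicationSubmissionContext (not a verbatim copy).
Resource capability = recordFactory.newRecordInstance(Resource.class);
capability.setMemory(conf.getInt(MRJobConfig.MR_AM_VMEM_MB,
    MRJobConfig.DEFAULT_MR_AM_VMEM_MB));        // 1536
capability.setVirtualCores(conf.getInt(MRJobConfig.MR_AM_CPU_VCORES,
    MRJobConfig.DEFAULT_MR_AM_CPU_VCORES));     // 1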
Code listing (NODE_UPDATE handling in CapacityScheduler):
case NODE_UPDATE:
{
NodeUpdateSchedulerEvent nodeUpdatedEvent = (NodeUpdateSchedulerEvent)event;
RMNode node = nodeUpdatedEvent.getRMNode();
/**
* Update node information:
* 1. Process newly launched containers:
* fires RMContainerEventType.LAUNCHED, handled by the LaunchedTransition, whose main
* job is to remove the container from containerAllocationExpirer monitoring, since the
* container has already launched.
* 2. Process completed containers:
* mainly updates the resource accounting on the queue, user, (FiCaSchedulerApp)
* application, and (FiCaSchedulerNode) node.
*/
nodeUpdate(node);
/**
* Whether to allocate asynchronously; the default is false, and capacity-scheduler.xml
* does not set it by default.
* Config key: yarn.scheduler.capacity.schedule-asynchronously.enable
*/
if (!scheduleAsynchronously) {
/**
* Perform resource allocation.
*/
allocateContainersToNode(getNode(node.getNodeID()));
}
}
NODE_UPDATE event handling logic:
1. Process the node's status update.
2. Allocate resources.
/**
* 1. Process newly launched containers:
* fires RMContainerEventType.LAUNCHED, handled by the LaunchedTransition, whose main job
* is to remove the container from containerAllocationExpirer monitoring since it has
* already launched (the container was put under containerAllocationExpirer monitoring
* while handling the APP_ATTEMPT_ADDED event).
* 2. Process completed containers:
* mainly updates the resource accounting on the queue, user, (FiCaSchedulerApp)
* application, and (FiCaSchedulerNode) node.
* @param nm
*/
private synchronized void nodeUpdate(RMNode nm) {
if (LOG.isDebugEnabled()) {
LOG.debug("nodeUpdate: " + nm + " clusterResources: " + clusterResource);
}
FiCaSchedulerNode node = getNode(nm.getNodeID());
List<UpdatedContainerInfo> containerInfoList = nm.pullContainerUpdates();
List<ContainerStatus> newlyLaunchedContainers = new ArrayList<ContainerStatus>();
List<ContainerStatus> completedContainers = new ArrayList<ContainerStatus>();
for(UpdatedContainerInfo containerInfo : containerInfoList) {
newlyLaunchedContainers.addAll(containerInfo.getNewlyLaunchedContainers());
completedContainers.addAll(containerInfo.getCompletedContainers());
}
// Processing the newly launched containers
for (ContainerStatus launchedContainer : newlyLaunchedContainers) {
/**
* Fires RMContainerEventType.LAUNCHED, handled by the LaunchedTransition, whose main job
* is to remove the container from containerAllocationExpirer monitoring since it has
* already launched (the container was put under monitoring while handling the
* APP_ATTEMPT_ADDED event).
*/
containerLaunchedOnNode(launchedContainer.getContainerId(), node);
}
// Process completed containers
for (ContainerStatus completedContainer : completedContainers) {
ContainerId containerId = completedContainer.getContainerId();
LOG.debug("Container FINISHED: " + containerId);
/**
* Mainly updates the resource accounting on the queue, user, (FiCaSchedulerApp)
* application, and (FiCaSchedulerNode) node.
*/
completedContainer(getRMContainer(containerId),
completedContainer, RMContainerEventType.FINISHED);
}
// Now node data structures are upto date and ready for scheduling.
if(LOG.isDebugEnabled()) {
LOG.debug("Node being looked for scheduling " + nm
+ " availableResource: " + node.getAvailableResource());
}
}
Updating the node information:
1. Process newly launched containers: fires RMContainerEventType.LAUNCHED, handled by the LaunchedTransition, whose main job is to remove the container from containerAllocationExpirer monitoring since it has already launched.
2. Process completed containers: mainly updates the resource accounting on the queue, user, (FiCaSchedulerApp) application, and (FiCaSchedulerNode) node.
Before getting to the allocation code, two questions worth pondering:
1. Allocation is done queue by queue, so how is a queue chosen (in what order and by what criteria)?
2. Once a queue is chosen, in what order are the applications submitted to that queue allocated to?
/**
* To keep the main flow easy to follow, the reserved case is ignored for now.
*/
@VisibleForTesting
public synchronized void allocateContainersToNode(FiCaSchedulerNode node) {
if (rmContext.isWorkPreservingRecoveryEnabled()
&& !rmContext.isSchedulerReadyForAllocatingContainers()) {
return;
}
/**
* The node is not in the scheduler's node map (it has been removed or never registered).
*/
if (!nodes.containsKey(node.getNodeID())) {
LOG.info("Skipping scheduling as the node " + node.getNodeID() +
" has been removed");
return;
}
// Assign new containers...
// 1. Check for reserved applications
// 2. Schedule if there are no reservations
/**
* Check whether the node holds a reserved container; if so, try to fulfill the
* reservation first.
*
* To keep things simple, the reservedContainer path is not analyzed here.
*/
RMContainer reservedContainer = node.getReservedContainer();
if (reservedContainer != null) {
FiCaSchedulerApp reservedApplication =
getCurrentAttemptForContainer(reservedContainer.getContainerId());
// Try to fulfill the reservation
LOG.info("Trying to fulfill reservation for application " +
reservedApplication.getApplicationId() + " on node: " +
node.getNodeID());
LeafQueue queue = ((LeafQueue)reservedApplication.getQueue());
CSAssignment assignment =
queue.assignContainers(
clusterResource,
node,
new ResourceLimits(labelManager.getResourceByLabel(
RMNodeLabelsManager.NO_LABEL, clusterResource)));
RMContainer excessReservation = assignment.getExcessReservation();
if (excessReservation != null) {
Container container = excessReservation.getContainer();
queue.completedContainer(
clusterResource, assignment.getApplication(), node,
excessReservation,
SchedulerUtils.createAbnormalContainerStatus(
container.getId(),
SchedulerUtils.UNRESERVED_CONTAINER),
RMContainerEventType.RELEASED, null, true);
}
}
/**
* minimumAllocation holds the minimum memory and minimum virtual cores; it is set up in
* initScheduler when the CapacityScheduler is initialized.
* Minimum memory: yarn.scheduler.minimum-allocation-mb, default 1024 MB.
* Minimum vcores: yarn.scheduler.minimum-allocation-vcores, default 1.
*/
// Try to schedule more if there are no reservations to fulfill
if (node.getReservedContainer() == null) {
/**
* Can the node's available resources fit at least one minimum allocation?
* Computed as: node.getAvailableResource() / minimumAllocation > 0
*/
if (calculator.computeAvailableContainers(node.getAvailableResource(),
minimumAllocation) > 0) {
if (LOG.isDebugEnabled()) {
LOG.debug("Trying to schedule on node: " + node.getNodeName() +
", available: " + node.getAvailableResource());
}
/**
* Two questions at this point:
* 1. Matching starts from root, so which queue is matched first?
* Queues are iterated in order of available capacity: the more available capacity,
* the earlier the queue is tried.
* 2. In what order are requests matched within a queue?
* Within a queue, requests are matched in FIFO order.
*
* Note: assignContainers starts matching from the root queue. assignContainers and
* assignContainersToChildQueues are mutually recursive, and only at a leaf queue does
* the leaf's assignContainers do the actual allocation.
*/
root.assignContainers(
clusterResource,
node,
new ResourceLimits(labelManager.getResourceByLabel(
RMNodeLabelsManager.NO_LABEL, clusterResource)));
}
} else {
LOG.info("Skipping scheduling since node " + node.getNodeID() +
" is reserved by application " +
node.getReservedContainer().getContainerId().getApplicationAttemptId()
);
}
}
What allocateContainersToNode does:
Starting from the root queue, assignContainers is called and recurses downward (assignContainers and ParentQueue.assignContainersToChildQueues call each other) until a leaf queue performs the actual allocation.
The main "can we allocate" checks are:
1. The available resources reported by the node must be at least the configured minimumAllocation.
2. After the allocation, the queue's total usage must not exceed the queue's resource limit.
Back to the main logic:
@Override
public synchronized CSAssignment ParentQueue.assignContainers(Resource clusterResource,
FiCaSchedulerNode node, ResourceLimits resourceLimits) {
CSAssignment assignment =
new CSAssignment(Resources.createResource(0, 0), NodeType.NODE_LOCAL);
Set<String> nodeLabels = node.getLabels();
/**
* Do the node's labels match this queue:
* 1. If the queue's label is *, it can access any node.
* 2. If the node has no label, any queue can access it.
* 3. If the queue has specific labels, it can only access nodes carrying a matching label.
*/
if (!SchedulerUtils.checkQueueAccessToNode(accessibleLabels, nodeLabels)) {
return assignment;
}
/**
* Check whether the node's available resources meet the minimumAllocation requirement,
* i.e. node.getAvailableResource() >= minimumAllocation.
* 1. DefaultResourceCalculator evaluates this directly and does not need clusterResource.
* 2. DominantResourceCalculator works on resource shares, so it needs clusterResource.
*/
while (canAssign(clusterResource, node)) {
if (LOG.isDebugEnabled()) {
LOG.debug("Trying to assign containers to child-queue of "
+ getQueueName());
}
/**
* Check whether this queue is over its resource limit, i.e. whether this queue can
* still be allocated to.
*/
if (!super.canAssignToThisQueue(clusterResource, nodeLabels, resourceLimits,
minimumAllocation, Resources.createResource(getMetrics()
.getReservedMB(), getMetrics().getReservedVirtualCores()))) {
break;
}
/**
* Checks passed: hand the allocation down to the child queues.
*/
CSAssignment assignedToChild =
assignContainersToChildQueues(clusterResource, node, resourceLimits);
assignment.setType(assignedToChild.getType());
// Done if no child-queue assigned anything
/**
* Any resource assigned means the allocation succeeded.
*/
if (Resources.greaterThan(
resourceCalculator, clusterResource,
assignedToChild.getResource(), Resources.none())) {
// Track resource utilization for the parent-queue
/**
* On success, update the parent queue's resource usage.
*/
super.allocateResource(clusterResource, assignedToChild.getResource(),
nodeLabels);
/**
* Merge the child queue's assignment into this queue's running total.
*/
Resources.addTo(assignment.getResource(), assignedToChild.getResource());
LOG.info("assignedContainer" +
" queue=" + getQueueName() +
" usedCapacity=" + getUsedCapacity() +
" absoluteUsedCapacity=" + getAbsoluteUsedCapacity() +
" used=" + queueUsage.getUsed() +
" cluster=" + clusterResource);
} else {
break;
}
if (LOG.isDebugEnabled()) {
LOG.debug("ParentQ=" + getQueueName()
+ " assignedSoFarInThisIteration=" + assignment.getResource()
+ " usedCapacity=" + getUsedCapacity()
+ " absoluteUsedCapacity=" + getAbsoluteUsedCapacity());
}
if (!rootQueue || assignment.getType() == NodeType.OFF_SWITCH) {
if (LOG.isDebugEnabled()) {
if (rootQueue && assignment.getType() == NodeType.OFF_SWITCH) {
LOG.debug("Not assigning more than one off-switch container," +
" assignments so far: " + assignment);
}
}
break;
}
}
return assignment;
}
The main logic of ParentQueue.assignContainers:
1. Check that the reporting node's labels match the queue.
2. Check that the node's available resources meet the minimumAllocation requirement.
3. Check that the queue would not exceed its resource limit.
4. If all checks pass, hand the allocation down to the child queues.
/**
* Do the node's labels match the queue:
* 1. If the queue's label is * (ANY), it can access any node.
* 2. If the node has no label, any queue can access it.
* 3. If the queue has specific labels, it can only access nodes carrying a matching label.
* @param queueLabels
* @param nodeLabels
* @return
*/
public static boolean checkQueueAccessToNode(Set<String> queueLabels,
Set<String> nodeLabels) {
if (queueLabels != null && queueLabels.contains(RMNodeLabelsManager.ANY)) {
return true;
}
// any queue can access to a node without label
if (nodeLabels == null || nodeLabels.isEmpty()) {
return true;
}
// a queue can access to a node only if it contains any label of the node
if (queueLabels != null
&& Sets.intersection(queueLabels, nodeLabels).size() > 0) {
return true;
}
return false;
}
Checking whether the reporting node's labels match (a small demo follows this list):
1. If the queue's label is * (ANY), it can access any node.
2. If the node has no label, any queue can access it.
3. If the queue has specific labels, it can only access nodes carrying a matching label.
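To make the three rules concrete, here is a self-contained sketch using plain JDK sets ("*" stands in for RMNodeLabelsManager.ANY; the method mirrors checkQueueAccessToNode above):
import java.util.*;
public class QueueNodeLabelDemo {
  static boolean checkQueueAccessToNode(Set<String> queueLabels, Set<String> nodeLabels) {
    if (queueLabels != null && queueLabels.contains("*")) return true; // rule 1: queue label is *
    if (nodeLabels == null || nodeLabels.isEmpty()) return true;      // rule 2: unlabeled node
    if (queueLabels == null) return false;
    Set<String> common = new HashSet<>(queueLabels);                  // rule 3: need a common label
    common.retainAll(nodeLabels);
    return !common.isEmpty();
  }
  public static void main(String[] args) {
    Set<String> gpu = new HashSet<>(Arrays.asList("gpu"));
    System.out.println(checkQueueAccessToNode(new HashSet<>(Arrays.asList("*")), gpu));   // true
    System.out.println(checkQueueAccessToNode(gpu, Collections.<String>emptySet()));      // true
    System.out.println(checkQueueAccessToNode(gpu, new HashSet<>(Arrays.asList("ssd")))); // false
  }
}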
/**
* Check whether the node's reported available resources are usable, i.e.
* node.getAvailableResource() >= minimumAllocation.
* 1. DefaultResourceCalculator evaluates this directly and does not need clusterResource.
* 2. DominantResourceCalculator works on resource shares, so it needs clusterResource.
*/
private boolean canAssign(Resource clusterResource, FiCaSchedulerNode node) {
return (node.getReservedContainer() == null) &&
Resources.greaterThanOrEqual(resourceCalculator, clusterResource,
node.getAvailableResource(), minimumAllocation);
}
Checks that the node's reported available resources meet the minimumAllocation requirement; what "meets" means depends on the ResourceCalculator in use, as sketched below.
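A self-contained sketch of the spirit of the two calculators' computeAvailableContainers (simplified to ints; the real implementations operate on Resource objects):
public class CalculatorDemo {
  // DefaultResourceCalculator: memory is the only dimension considered.
  static int availableContainersDefault(int availMemMb, int minMemMb) {
    return availMemMb / minMemMb;
  }
  // DominantResourceCalculator: constrained by both memory and vcores.
  static int availableContainersDominant(int availMemMb, int availVcores,
      int minMemMb, int minVcores) {
    return Math.min(availMemMb / minMemMb, availVcores / minVcores);
  }
  public static void main(String[] args) {
    // A node reports 3072 MB and 1 vcore free; minimumAllocation is 1024 MB / 1 vcore.
    System.out.println(availableContainersDefault(3072, 1024));        // 3 -> can assign
    System.out.println(availableContainersDominant(3072, 1, 1024, 1)); // 1 -> can assign
  }
}
canAssign requires the result to be greater than zero.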
/**
* Check whether this allocation would push the queue past its resource limit.
*
* @param clusterResource
* @param nodeLabels
* @param currentResourceLimits
* @param nowRequired
* @param resourceCouldBeUnreserved
* @return
*/
synchronized boolean canAssignToThisQueue(Resource clusterResource,
Set<String> nodeLabels, ResourceLimits currentResourceLimits,
Resource nowRequired, Resource resourceCouldBeUnreserved) {
// Get label of this queue can access, it's (nodeLabel AND queueLabel)
Set<String> labelCanAccess;
if (null == nodeLabels || nodeLabels.isEmpty()) {
labelCanAccess = new HashSet<String>();
// Any queue can always access any node without label
labelCanAccess.add(RMNodeLabelsManager.NO_LABEL);
} else {
labelCanAccess = new HashSet<String>(
accessibleLabels.contains(CommonNodeLabelsManager.ANY) ? nodeLabels
: Sets.intersection(accessibleLabels, nodeLabels));
}
for (String label : labelCanAccess) {
// New total resource = used + required
Resource newTotalResource =
Resources.add(queueUsage.getUsed(label), nowRequired);
/**
* Limit for the no-label partition: min(this queue level's own limit, the limit handed
* down by the parent).
* Limit for a labeled partition: this queue level's own limit.
*
* root is passed the whole cluster's resources, so normally this is just this queue
* level's own limit.
*/
Resource currentLimitResource =
getCurrentLimitResource(label, clusterResource, currentResourceLimits);
/**
* Would the queue exceed its limit if this allocation went through?
*/
if (Resources.greaterThan(resourceCalculator, clusterResource,
newTotalResource, currentLimitResource)) {
if (this.reservationsContinueLooking
&& label.equals(RMNodeLabelsManager.NO_LABEL)
&& Resources.greaterThan(resourceCalculator, clusterResource,
resourceCouldBeUnreserved, Resources.none())) {
// resource-without-reserved = used - reserved
Resource newTotalWithoutReservedResource =
Resources.subtract(newTotalResource, resourceCouldBeUnreserved);
if (Resources.lessThanOrEqual(resourceCalculator, clusterResource,
newTotalWithoutReservedResource, currentLimitResource)) {
if (LOG.isDebugEnabled()) {
LOG.debug("try to use reserved: " + getQueueName()
+ " usedResources: " + queueUsage.getUsed()
+ ", clusterResources: " + clusterResource
+ ", reservedResources: " + resourceCouldBeUnreserved
+ ", capacity-without-reserved: "
+ newTotalWithoutReservedResource + ", maxLimitCapacity: "
+ currentLimitResource);
}
currentResourceLimits.setAmountNeededUnreserve(Resources.subtract(newTotalResource,
currentLimitResource));
return true;
}
}
if (LOG.isDebugEnabled()) {
LOG.debug(getQueueName()
+ "Check assign to queue, label=" + label
+ " usedResources: " + queueUsage.getUsed(label)
+ " clusterResources: " + clusterResource
+ " currentUsedCapacity "
+ Resources.divide(resourceCalculator, clusterResource,
queueUsage.getUsed(label),
labelManager.getResourceByLabel(label, clusterResource))
+ " max-capacity: "
+ queueCapacities.getAbsoluteMaximumCapacity(label)
+ ")");
}
return false;
}
return true;
}
return false;
}
Checks whether the allocation would exceed the queue's resource limit; a numeric illustration follows.
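A tiny self-contained illustration of the check, assuming memory-only (DefaultResourceCalculator-style) accounting; all numbers are made up:
public class QueueLimitDemo {
  public static void main(String[] args) {
    int usedMb = 40960;     // the queue currently uses 40 GB
    int requiredMb = 2048;  // this allocation would add 2 GB
    int limitMb = 51200;    // the queue's current limit is 50 GB
    // Mirrors: newTotalResource = used + required, then compare with currentLimitResource.
    int newTotalMb = usedMb + requiredMb;
    System.out.println(newTotalMb > limitMb ? "reject: over the limit" : "allow"); // allow
  }
}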
/**
* Iterate this queue's children, which raises the ordering question.
* CapacityScheduler defines a comparator to sort queues:
* 1. Sort by available capacity: the more available capacity, the earlier the queue.
* 2. On a tie, sort by queue path: the shorter the path, the earlier the queue.
* @param cluster
* @param node
* @param limits
* @return
*/
private synchronized CSAssignment ParentQueue.assignContainersToChildQueues(
Resource cluster, FiCaSchedulerNode node, ResourceLimits limits) {
CSAssignment assignment =
new CSAssignment(Resources.createResource(0, 0), NodeType.NODE_LOCAL);
printChildQueues();
/**
* Try each child in turn; return as soon as one child takes the assignment.
* Two possible outcomes:
* 1. No child satisfies the conditions: the allocation fails and waits for the next
* round (resources may have been released by then, or a misconfiguration may make it
* fail forever).
* 2. The allocation succeeds.
*/
for (Iterator<CSQueue> iter = childQueues.iterator(); iter.hasNext();) {
CSQueue childQueue = iter.next();
if(LOG.isDebugEnabled()) {
LOG.debug("Trying to assign to queue: " + childQueue.getQueuePath()
+ " stats: " + childQueue);
}
/**
* Compute the child queue's resource limit.
*/
ResourceLimits childLimits =
getResourceLimitsOfChild(childQueue, cluster, limits);
/**
* Given the multi-level queue hierarchy, childQueue may be a ParentQueue or a LeafQueue.
* For a ParentQueue this recursively calls assignContainers (which again calls
* assignContainersToChildQueues), until a LeafQueue is reached and its assignContainers
* does the actual allocation.
*/
assignment = childQueue.assignContainers(cluster, node, childLimits);
if(LOG.isDebugEnabled()) {
LOG.debug("Assigned to queue: " + childQueue.getQueuePath() +
" stats: " + childQueue + " --> " +
assignment.getResource() + ", " + assignment.getType());
}
/**
* Remove the queue that just received an assignment and re-insert it so the set
* re-sorts; having received resources, it now sorts later (by available capacity
* and queue path).
*/
if (Resources.greaterThan(
resourceCalculator, cluster,
assignment.getResource(), Resources.none())) {
// Remove and re-insert to sort
iter.remove();
LOG.info("Re-sorting assigned queue: " + childQueue.getQueuePath() +
" stats: " + childQueue);
childQueues.add(childQueue);
if (LOG.isDebugEnabled()) {
printChildQueues();
}
break;
}
}
return assignment;
}
Dispatching to the child queues for matching.
This brings us back to the first question: allocation is done queue by queue, so how is a queue chosen (in what order and by what criteria)?
The code is just a plain for loop, so the answer lies in how childQueues is sorted:
this.childQueues = new TreeSet<CSQueue>(queueComparator);
/**
* Queue comparator:
* 1. The more available capacity, the earlier a queue sorts.
* 2. On a tie, sort by queue path (shorter paths sort earlier).
*/
static final Comparator<CSQueue> queueComparator = new Comparator<CSQueue>() {
@Override
public int compare(CSQueue q1, CSQueue q2) {
if (q1.getUsedCapacity() < q2.getUsedCapacity()) {
return -1;
} else if (q1.getUsedCapacity() > q2.getUsedCapacity()) {
return 1;
}
return q1.getQueuePath().compareTo(q2.getQueuePath());
}
};
The comparator gives the queue-matching rules (see the demo just below):
1. Queues with more available capacity are matched first.
2. On a tie, queues are ordered by path, and shorter paths are matched first.
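A self-contained demo of both rules plus the remove/re-insert trick from assignContainersToChildQueues (a TreeSet orders elements only at insertion time, so a queue whose usedCapacity has changed must be removed and re-added to re-sort):
import java.util.*;
public class QueueOrderDemo {
  static class Q {
    final String path; float used;
    Q(String path, float used) { this.path = path; this.used = used; }
    public String toString() { return path + "(" + used + ")"; }
  }
  // Same shape as queueComparator: lower used capacity first, then queue path.
  static final Comparator<Q> CMP = new Comparator<Q>() {
    public int compare(Q a, Q b) {
      if (a.used != b.used) return Float.compare(a.used, b.used);
      return a.path.compareTo(b.path);
    }
  };
  public static void main(String[] args) {
    TreeSet<Q> queues = new TreeSet<Q>(CMP);
    Q a = new Q("root.a", 0.1f), b = new Q("root.b", 0.5f);
    queues.add(a); queues.add(b);
    System.out.println(queues); // [root.a(0.1), root.b(0.5)] -> root.a is tried first
    queues.remove(a);           // root.a just received an assignment...
    a.used = 0.8f;
    queues.add(a);              // ...re-insert so it re-sorts behind root.b
    System.out.println(queues); // [root.b(0.5), root.a(0.8)]
  }
}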
/**
* 1. Check that the node's labels match.
* 2. Iterate the application list (activeApplications) and try to allocate.
*/
@Override
public synchronized CSAssignment assignContainers(Resource clusterResource,
FiCaSchedulerNode node, ResourceLimits currentResourceLimits) {
updateCurrentResourceLimits(currentResourceLimits, clusterResource);
if(LOG.isDebugEnabled()) {
LOG.debug("assignContainers: node=" + node.getNodeName()
+ " #applications=" + activeApplications.size());
}
// if our queue cannot access this node, just return
/**
* Do the node's labels match the queue:
* 1. If the queue's label is *, it can access any node.
* 2. If the node has no label, any queue can access it.
* 3. If the queue has specific labels, it can only access nodes carrying a matching label.
*/
if (!SchedulerUtils.checkQueueAccessToNode(accessibleLabels,
node.getLabels())) {
return NULL_ASSIGNMENT;
}
// Check for reserved resources
RMContainer reservedContainer = node.getReservedContainer();
/**
* Skipped here to keep things simple.
*/
if (reservedContainer != null) {
FiCaSchedulerApp application =
getApplication(reservedContainer.getApplicationAttemptId());
synchronized (application) {
return assignReservedContainer(application, node, reservedContainer,
clusterResource);
}
}
Resource initAmountNeededUnreserve =
currentResourceLimits.getAmountNeededUnreserve();
// Try to assign containers to applications in order
/**
* activeApplications is populated while handling the APP_ATTEMPT_ADDED event.
* Since we iterate it, order matters; activeApplications uses this comparator:
*
* static final Comparator<FiCaSchedulerApp> applicationComparator =
*   new Comparator<FiCaSchedulerApp>() {
*     public int compare(FiCaSchedulerApp a1, FiCaSchedulerApp a2) {
*       return a1.getApplicationId().compareTo(a2.getApplicationId());
*     }
*   };
*
* which plainly makes FiCaSchedulerApp allocation within a queue FIFO.
*/
for (FiCaSchedulerApp application : activeApplications) {
if(LOG.isDebugEnabled()) {
LOG.debug("pre-assignContainers for application "
+ application.getApplicationId());
application.showRequests();
}
synchronized (application) {
// Check if this resource is on the blacklist
/**
* Check whether the node is blacklisted for this application.
*/
if (SchedulerAppUtils.isBlacklisted(application, node, LOG)) {
continue;
}
// Schedule in priority order
for (Priority priority : application.getPriorities()) {
/**
* The resource requests were already added/updated by the scheduler.allocate call made
* in ScheduleTransition.
*/
ResourceRequest anyRequest =
application.getResourceRequest(priority, ResourceRequest.ANY);
if (null == anyRequest) {
continue;
}
// Required resource
Resource required = anyRequest.getCapability();
// Do we need containers at this 'priority'?
/**
* Does this priority still need containers (is NumContainers > 0)?
*/
if (application.getTotalRequiredResources(priority) <= 0) {
continue;
}
if (!this.reservationsContinueLooking) {
if (!shouldAllocOrReserveNewContainer(application, priority, required)) {
if (LOG.isDebugEnabled()) {
LOG.debug("doesn't need containers based on reservation algo!");
}
continue;
}
}
Set<String> requestedNodeLabels =
getRequestLabelSetByExpression(anyRequest
.getNodeLabelExpression());
Resource userLimit =
computeUserLimitAndSetHeadroom(application, clusterResource,
required, requestedNodeLabels);
currentResourceLimits.setAmountNeededUnreserve(
initAmountNeededUnreserve);
// Check queue max-capacity limit
if (!super.canAssignToThisQueue(clusterResource, node.getLabels(),
currentResourceLimits, required, application.getCurrentReservation())) {
return NULL_ASSIGNMENT;
}
// Check user limit
if (!assignToUser(clusterResource, application.getUser(), userLimit,
application, requestedNodeLabels, currentResourceLimits)) {
break;
}
// Inform the application it is about to get a scheduling opportunity
application.addSchedulingOpportunity(priority);
// Try to schedule
CSAssignment assignment =
assignContainersOnNode(clusterResource, node, application, priority,
null, currentResourceLimits);
// Did the application skip this node?
if (assignment.getSkipped()) {
// Don't count 'skipped nodes' as a scheduling opportunity!
application.subtractSchedulingOpportunity(priority);
continue;
}
Resource assigned = assignment.getResource();
if (Resources.greaterThan(
resourceCalculator, clusterResource, assigned, Resources.none())) {
allocateResource(clusterResource, application, assigned,
node.getLabels());
if (assignment.getType() != NodeType.OFF_SWITCH) {
if (LOG.isDebugEnabled()) {
LOG.debug("Resetting scheduling opportunities");
}
if (assignment.getType() == NodeType.NODE_LOCAL
|| getRackLocalityFullReset()) {
application.resetSchedulingOpportunities(priority);
}
}
return assignment;
} else {
// Do not assign out of order w.r.t priorities
break;
}
}
}
if(LOG.isDebugEnabled()) {
LOG.debug("post-assignContainers for application "
+ application.getApplicationId());
}
application.showRequests();
}
return NULL_ASSIGNMENT;
}
LeafQueue.assignContainers performs the final allocation, setting off the chain of events that launches the Container. Concretely, the work is done by assignContainersOnNode, which triggers a series of events ending with AMLauncher.launch invoking the startContainers RPC to start the Container.
The leaf queue's assignContainers also answers the second question posed at the start: once a queue is chosen, in what order are the applications submitted to it allocated to? The for loop iterates activeApplications, a Set whose ordering comes from this comparator:
static final Comparator<FiCaSchedulerApp> applicationComparator =
new Comparator<FiCaSchedulerApp>() {
@Override
public int compare(FiCaSchedulerApp a1, FiCaSchedulerApp a2) {
return a1.getApplicationId().compareTo(a2.getApplicationId());
}
};
From the comparator it is clear that within a queue, (FiCaSchedulerApp) applications are allocated on a FIFO basis.
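Why ordering by ApplicationId means FIFO: an ApplicationId compares by the RM's cluster timestamp and then by a per-RM monotonically increasing sequence number, so lower ids were submitted earlier. A simplified stand-in (not the Hadoop class itself):
public class AppOrderDemo {
  static class AppId implements Comparable<AppId> {
    final long clusterTs; final int id;
    AppId(long clusterTs, int id) { this.clusterTs = clusterTs; this.id = id; }
    @Override
    public int compareTo(AppId o) {
      int c = Long.compare(clusterTs, o.clusterTs);
      return c != 0 ? c : Integer.compare(id, o.id);
    }
  }
  public static void main(String[] args) {
    AppId first = new AppId(1450000000000L, 1);  // cf. application_1450000000000_0001
    AppId second = new AppId(1450000000000L, 2); // cf. application_1450000000000_0002
    System.out.println(first.compareTo(second) < 0); // true: earlier submission sorts first
  }
}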
/**
* Triggers a series of events, ending with AMLauncher.launch invoking the
* startContainers RPC to start the Container.
* @param clusterResource
* @param node
* @param application
* @param priority
* @param reservedContainer
* @param currentResoureLimits
* @return
*/
private CSAssignment assignContainersOnNode(Resource clusterResource,
FiCaSchedulerNode node, FiCaSchedulerApp application, Priority priority,
RMContainer reservedContainer, ResourceLimits currentResoureLimits) {
Resource assigned = Resources.none();
NodeType requestType = null;
MutableObject allocatedContainer = new MutableObject();
// Data-local
ResourceRequest nodeLocalResourceRequest =
application.getResourceRequest(priority, node.getNodeName());
if (nodeLocalResourceRequest != null) {
requestType = NodeType.NODE_LOCAL;
assigned =
assignNodeLocalContainers(clusterResource, nodeLocalResourceRequest,
node, application, priority, reservedContainer,
allocatedContainer, currentResoureLimits);
if (Resources.greaterThan(resourceCalculator, clusterResource,
assigned, Resources.none())) {
//update locality statistics
if (allocatedContainer.getValue() != null) {
application.incNumAllocatedContainers(NodeType.NODE_LOCAL,
requestType);
}
return new CSAssignment(assigned, NodeType.NODE_LOCAL);
}
}
// Rack-local
ResourceRequest rackLocalResourceRequest =
application.getResourceRequest(priority, node.getRackName());
if (rackLocalResourceRequest != null) {
if (!rackLocalResourceRequest.getRelaxLocality()) {
return SKIP_ASSIGNMENT;
}
if (requestType != NodeType.NODE_LOCAL) {
requestType = NodeType.RACK_LOCAL;
}
assigned =
assignRackLocalContainers(clusterResource, rackLocalResourceRequest,
node, application, priority, reservedContainer,
allocatedContainer, currentResoureLimits);
if (Resources.greaterThan(resourceCalculator, clusterResource,
assigned, Resources.none())) {
//update locality statistics
if (allocatedContainer.getValue() != null) {
application.incNumAllocatedContainers(NodeType.RACK_LOCAL,
requestType);
}
return new CSAssignment(assigned, NodeType.RACK_LOCAL);
}
}
// Off-switch
/**
* The AM's resource request uses ResourceRequest.ANY as its ResourceName.
*/
ResourceRequest offSwitchResourceRequest =
application.getResourceRequest(priority, ResourceRequest.ANY);
if (offSwitchResourceRequest != null) {
if (!offSwitchResourceRequest.getRelaxLocality()) {
return SKIP_ASSIGNMENT;
}
if (requestType != NodeType.NODE_LOCAL
&& requestType != NodeType.RACK_LOCAL) {
requestType = NodeType.OFF_SWITCH;
}
assigned =
assignOffSwitchContainers(clusterResource, offSwitchResourceRequest,
node, application, priority, reservedContainer,
allocatedContainer, currentResoureLimits);
if (allocatedContainer.getValue() != null) {
application.incNumAllocatedContainers(NodeType.OFF_SWITCH, requestType);
}
return new CSAssignment(assigned, NodeType.OFF_SWITCH);
}
return SKIP_ASSIGNMENT;
}
1. Completes the allocation.
2. Triggers a series of events, ending with AMLauncher.launch invoking the startContainers RPC to start the Container.
private Resource assignOffSwitchContainers(Resource clusterResource,
ResourceRequest offSwitchResourceRequest, FiCaSchedulerNode node,
FiCaSchedulerApp application, Priority priority,
RMContainer reservedContainer, MutableObject allocatedContainer,
ResourceLimits currentResoureLimits) {
/**
* Decides whether to allocate here mainly from the delay-scheduling (locality wait)
* angle; see the sketch after this method.
*/
if (canAssign(application, priority, node, NodeType.OFF_SWITCH,
reservedContainer)) {
/**
* The assignContainer method creates the RMContainer and fires the
* RMContainerEventType.START event.
*/
return assignContainer(clusterResource, node, application, priority,
offSwitchResourceRequest, NodeType.OFF_SWITCH, reservedContainer,
allocatedContainer, currentResoureLimits);
}
return Resources.none();
}
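The canAssign used above embodies delay scheduling: an application accrues a "scheduling opportunity" each time it is offered a node, and an off-switch assignment is only permitted after enough opportunities have been missed. A rough standalone sketch of the idea; the threshold's exact shape here is an assumption, not the precise Hadoop formula:
public class DelaySchedulingDemo {
  static boolean canAssignOffSwitch(long missedOpportunities, int requiredContainers,
      float localityWaitFactor) {
    // Allow off-switch only after the app has been passed over often enough.
    return missedOpportunities > requiredContainers * localityWaitFactor;
  }
  public static void main(String[] args) {
    System.out.println(canAssignOffSwitch(0, 2, 1.0f)); // false: keep waiting for locality
    System.out.println(canAssignOffSwitch(5, 2, 1.0f)); // true: give up and go off-switch
  }
}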
private Resource assignContainer(Resource clusterResource, FiCaSchedulerNode node,
FiCaSchedulerApp application, Priority priority,
ResourceRequest request, NodeType type, RMContainer rmContainer,
MutableObject createdContainer, ResourceLimits currentResoureLimits) {
if (LOG.isDebugEnabled()) {
LOG.debug("assignContainers: node=" + node.getNodeName()
+ " application=" + application.getApplicationId()
+ " priority=" + priority.getPriority()
+ " request=" + request + " type=" + type);
}
if (!SchedulerUtils.checkNodeLabelExpression(
node.getLabels(),
request.getNodeLabelExpression())) {
if (rmContainer != null) {
unreserve(application, priority, node, rmContainer);
}
return Resources.none();
}
Resource capability = request.getCapability();
Resource available = node.getAvailableResource();
Resource totalResource = node.getTotalResource();
if (!Resources.lessThanOrEqual(resourceCalculator, clusterResource,
capability, totalResource)) {
LOG.warn("Node : " + node.getNodeID()
+ " does not have sufficient resource for request : " + request
+ " node total capability : " + node.getTotalResource());
return Resources.none();
}
assert Resources.greaterThan(
resourceCalculator, clusterResource, available, Resources.none());
// Create the container if necessary
/**
* Create a Container from the node information.
* A Container's main fields:
* 1. nodeId: the node's id
* 2. containerId: generated from the appAttemptId plus a container sequence number
* 3. priority: the priority
* 4. resource: the requested resources
* 5. httpAddress: the address for talking to the node
* 6. containerToken: the security token
*/
Container container =
getContainer(rmContainer, application, node, capability, priority);
// something went wrong getting/creating the container
if (container == null) {
LOG.warn("Couldn't get container for allocation!");
return Resources.none();
}
boolean shouldAllocOrReserveNewContainer = shouldAllocOrReserveNewContainer(
application, priority, capability);
// Can we allocate a container on this node?
/**
* Is the node's available resource enough for the requested size?
*/
int availableContainers =
resourceCalculator.computeAvailableContainers(available, capability);
boolean needToUnreserve = Resources.greaterThan(resourceCalculator,clusterResource,
currentResoureLimits.getAmountNeededUnreserve(), Resources.none());
if (availableContainers > 0) {
// Allocate...
// Did we previously reserve containers at this 'priority'?
/**
* rmContainer is the reservedContainer parameter; on this path its value is null.
*/
if (rmContainer != null) {
unreserve(application, priority, node, rmContainer);
} else if (this.reservationsContinueLooking && node.getLabels().isEmpty()) {
if (!shouldAllocOrReserveNewContainer || needToUnreserve) {
Resource amountToUnreserve = capability;
if (needToUnreserve) {
amountToUnreserve = currentResoureLimits.getAmountNeededUnreserve();
}
boolean containerUnreserved =
findNodeToUnreserve(clusterResource, node, application, priority,
amountToUnreserve);
if (!containerUnreserved) {
return Resources.none();
}
}
}
// Inform the application
/**
* 1. Create the RMContainer.
* 2. Add it to newlyAllocatedContainers (containers on this list are launched while
* later NODE_UPDATE events are processed).
* 3. Add it to liveContainers (the containers currently alive for this attempt).
* 4. Record the allocated resourceRequests on the RMContainer so they can be restored
* later.
* 5. Fire RMContainerEventType.START.
*/
RMContainer allocatedContainer =
application.allocate(type, node, priority, request, container);
// Does the application need this resource?
if (allocatedContainer == null) {
return Resources.none();
}
/**
* 1. Record the allocated Container on the node (launchedContainers).
* 2. Subtract from the node's available resources and add to its used resources.
*/
node.allocateContainer(allocatedContainer);
String label = RMNodeLabelsManager.NO_LABEL;
if (node.getLabels() != null && !node.getLabels().isEmpty()) {
label = node.getLabels().iterator().next();
}
LOG.info("assignedContainer" +
" application attempt=" + application.getApplicationAttemptId() +
" container=" + container +
" queue=" + this +
" clusterResource=" + clusterResource +
" type=" + type +
" requestedPartition=" + label);
createdContainer.setValue(allocatedContainer);
return container.getResource();
} else {
if (shouldAllocOrReserveNewContainer || rmContainer != null) {
if (reservationsContinueLooking && rmContainer == null) {
if (needToUnreserve) {
if (LOG.isDebugEnabled()) {
LOG.debug("we needed to unreserve to be able to allocate");
}
return Resources.none();
}
}
// Reserve by 'charging' in advance...
reserve(application, priority, node, rmContainer, container);
LOG.info("Reserved container " +
" application=" + application.getApplicationId() +
" resource=" + request.getCapability() +
" queue=" + this.toString() +
" usedCapacity=" + getUsedCapacity() +
" absoluteUsedCapacity=" + getAbsoluteUsedCapacity() +
" used=" + queueUsage.getUsed() +
" cluster=" + clusterResource);
return request.getCapability();
}
return Resources.none();
}
}
/**
* 1. Create the RMContainer.
* 2. Add it to newlyAllocatedContainers (containers on this list are launched while
* later NODE_UPDATE events are processed).
* 3. Add it to liveContainers (the containers currently alive for this attempt).
* 4. Record the allocated resourceRequests on the RMContainer so they can be restored
* later.
* 5. Fire RMContainerEventType.START.
* @param type
* @param node
* @param priority
* @param request
* @param container
* @return
*/
synchronized public RMContainer allocate(NodeType type, FiCaSchedulerNode node,
Priority priority, ResourceRequest request,
Container container) {
if (isStopped) {
return null;
}
if (getTotalRequiredResources(priority) <= 0) {
return null;
}
// Create RMContainer
RMContainer rmContainer = new RMContainerImpl(container, this
.getApplicationAttemptId(), node.getNodeID(),
appSchedulingInfo.getUser(), this.rmContext);
// Add it to allContainers list.
newlyAllocatedContainers.add(rmContainer);
liveContainers.put(container.getId(), rmContainer);
// Update consumption and track allocations
/**
* 1. The allocation is treated as successful at this point: decrement NumContainers on
* the matching resource requests.
* 2. Record the decremented requests into resourceRequests so they can be restored later.
*/
List<ResourceRequest> resourceRequestList = appSchedulingInfo.allocate(
type, node, priority, request, container);
Resources.addTo(currentConsumption, container.getResource());
/**
* Keep the resourceRequests returned by appSchedulingInfo.allocate so they can be
* restored later.
*/
((RMContainerImpl)rmContainer).setResourceRequests(resourceRequestList);
// Inform the container
/**
* Fire RMContainerEventType.START.
*/
rmContainer.handle(
new RMContainerEvent(container.getId(), RMContainerEventType.START));
if (LOG.isDebugEnabled()) {
LOG.debug("allocate: applicationAttemptId="
+ container.getId().getApplicationAttemptId()
+ " container=" + container.getId() + " host="
+ container.getNodeId().getHost() + " type=" + type);
}
RMAuditLogger.logSuccess(getUser(),
AuditConstants.ALLOC_CONTAINER, "SchedulerApp",
getApplicationId(), container.getId());
return rmContainer;
}
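To close the loop: the START event fired in allocate kicks off the RMContainer state machine. An abridged outline of the states relevant to this flow (state names from RMContainerState; transitions simplified):
// NEW --START--> ALLOCATED          (the allocate() call above)
// ALLOCATED --ACQUIRED--> ACQUIRED  (the container is pulled for launching)
// ACQUIRED --LAUNCHED--> RUNNING    (the NM reports the launch; the monitoring in
//                                    containerAllocationExpirer is removed, as seen
//                                    earlier in nodeUpdate)
// RUNNING --FINISHED--> COMPLETED   (handled in nodeUpdate's completed-containers branch)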