Managing Slot Compute Resources in the Flink TaskExecutor

Basic Concepts of Task Slots

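        When Flink is started in cluster mode rather than MiniCluster mode, each TaskManager runs as its own JVM process. A single TaskManager may run several subtasks, each in its own thread. To control how many tasks a TaskManager can run, Flink introduces the concept of a Task Slot.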

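        Each Task Slot represents a fixed subset of the compute resources owned by its TaskManager, and the slots of a TaskManager divide that TaskManager's memory among themselves. For example, in a TaskManager with 3 slots, each slot can use 1/3 of the memory, so subtasks running in different slots do not compete for memory. Flink does not currently isolate CPU resources; only memory is isolated. The number of slots also controls how strongly subtasks are isolated from each other. For example, with 1 slot per TaskManager, the subtasks in each slot run in a JVM process of their own; with several slots per TaskManager, more subtasks run in the same JVM process. Subtasks in the same JVM process can share context information, TCP connections, and heartbeat messages, which reduces network transfer, and they can also share some data structures, which lowers the per-subtask overhead to some extent.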

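        By default, Flink lets subtasks share a slot, provided they belong to the same Job and are not subtasks of the same operator. As a result, a single slot may run an entire pipeline of the Job. Slot sharing has two main benefits: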

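  1. To work out how many slots a Job needs, Flink only has to determine the Job's maximum parallelism; the parallelism of each individual task does not matter.
  2. Resources are used more effectively. Without slot sharing, subtasks with light resource needs would occupy as many resources as subtasks with heavy needs; with slot sharing, they can be placed in the same slot.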
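As a rough illustration of the first benefit, the sketch below compares the number of slots a job needs with and without slot sharing. It assumes that all tasks sit in one default slot sharing group; the class and method names are hypothetical and are not part of Flink.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Flink: estimates how many slots a job needs. */
final class SlotCountSketch {

    /** With slot sharing (one group), the job needs as many slots as its highest parallelism. */
    static int slotsWithSharing(int... parallelisms) {
        return Arrays.stream(parallelisms).max().orElse(0);
    }

    /** Without slot sharing, every subtask occupies a slot of its own. */
    static int slotsWithoutSharing(int... parallelisms) {
        return Arrays.stream(parallelisms).sum();
    }

    public static void main(String[] args) {
        // Assumed job with three tasks: source (parallelism 2), map (4), sink (1).
        System.out.println(slotsWithSharing(2, 4, 1));    // 4
        System.out.println(slotsWithoutSharing(2, 4, 1)); // 7
    }
}
```

With sharing, the slot count follows the widest task only, which is why the job's maximum parallelism is enough to size the cluster.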

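Flink uses SlotSharingGroup and CoLocationGroup to decide how resources are shared when tasks are scheduled. They correspond to two kinds of constraint:

  1. SlotSharingGroup: subtasks of different JobVertices in the same SlotSharingGroup may be placed in the same slot, but this is not guaranteed.
  2. CoLocationGroup: for different JobVertices in the same CoLocationGroup, the n-th subtasks must all be placed in the same slot. This is a hard constraint.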

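       Overall, resource management involves three components: JobMaster, ResourceManager, and TaskManager. The JobMaster is the consumer of slot resources and requests them from the ResourceManager. The ResourceManager allocates slots, reclaims and releases idle resources, and, when slots run short, requests more resources from an external resource framework such as Yarn or k8s. The TaskManager is the holder of slot resources; its slotTable records which task of which job each slot has been assigned to.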

[Figure 1: Slot compute resource management in the Flink TaskExecutor]

Slot Management in the TaskExecutor

TaskSlot

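       TaskSlot is the abstraction the TaskExecutor uses to manage its slots. It holds the slot's index within the TaskExecutor, the resource profile (resourceProfile) describing the resources the slot owns, and the job and tasks the slot may be assigned to. Each TaskSlot also carries its own slot state, which is one of four values: Active (in use by a JobManager), Allocated (assigned to a specific job but not yet claimed by that job's JobManager), Releasing (the tasks in the slot have finished and the slot is waiting for its task resources to be released), and Free (idle). Its main fields are: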

public class TaskSlot {

	/** Index of the task slot. */
	private final int index;

	/** Resource characteristics for this slot. */
	private final ResourceProfile resourceProfile; // resource profile (CPU, memory, network memory, etc.)

	/** Tasks running in this slot. */
	private final Map<ExecutionAttemptID, Task> tasks; // keyed by execution attempt id

	/** State of this slot. */
	private TaskSlotState state;

	/** Job id to which the slot has been allocated; null if not allocated. */
	private JobID jobId;

	/** Allocation id of this slot; null if not allocated. */
	private AllocationID allocationId;
	
	......
}

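       TaskSlot provides methods that change its state. For example, allocate(JobID newJobId, AllocationID newAllocationId) moves the slot to the Allocated state, and markFree() moves it to the Free state, but only succeeds once all Tasks have been removed. Each transition first checks the slot's current state. Tasks are added to a slot with add(Task task), which requires that they all belong to the same Job.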

// Add a task to the task slot
public boolean add(Task task) {
   // Check that this slot has been assigned to the job sending this task
   Preconditions.checkArgument(task.getJobID().equals(jobId), "The task's job id does not match the " +
      "job id for which the slot has been allocated.");
   Preconditions.checkArgument(task.getAllocationId().equals(allocationId), "The task's allocation " +
      "id does not match the allocation id for which the slot has been allocated.");
   Preconditions.checkState(TaskSlotState.ACTIVE == state, "The task slot is not in state active.");

   Task oldTask = tasks.put(task.getExecutionId(), task);

   if (oldTask != null) {
      tasks.put(task.getExecutionId(), oldTask);
      return false;
   } else {
      return true;
   }
}

// Allocate the task slot for a job
public boolean allocate(JobID newJobId, AllocationID newAllocationId) {
   if (TaskSlotState.FREE == state) {
      // sanity checks
      Preconditions.checkState(allocationId == null);
      Preconditions.checkState(jobId == null);

      this.jobId = Preconditions.checkNotNull(newJobId);
      this.allocationId = Preconditions.checkNotNull(newAllocationId);

      state = TaskSlotState.ALLOCATED;

      return true;
   } else if (TaskSlotState.ALLOCATED == state || TaskSlotState.ACTIVE == state) {
      Preconditions.checkNotNull(newJobId);
      Preconditions.checkNotNull(newAllocationId);

      return newJobId.equals(jobId) && newAllocationId.equals(allocationId);
   } else {
      return false;
   }
}

// Try to mark the slot as free
public boolean markFree() {
   if (isEmpty()) {
      state = TaskSlotState.FREE;
      this.jobId = null;
      this.allocationId = null;

      return true;
   } else {
      return false;
   }
}
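The state transitions described above can be sketched as one small, self-contained model. This is a simplified illustration and not Flink's TaskSlot: job, allocation, and task ids are plain strings, and the class name and the addTask/removeTask/markReleasing helpers are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;

/** Simplified, hypothetical model of the TaskSlot lifecycle; ids are plain strings. */
final class SlotLifecycleSketch {

    enum State { FREE, ALLOCATED, ACTIVE, RELEASING }

    private State state = State.FREE;
    private String jobId;
    private String allocationId;
    private final Set<String> tasks = new HashSet<>();

    State getState() {
        return state;
    }

    /** FREE -> ALLOCATED; repeating the same request on an allocated slot is accepted. */
    boolean allocate(String newJobId, String newAllocationId) {
        if (state == State.FREE) {
            jobId = newJobId;
            allocationId = newAllocationId;
            state = State.ALLOCATED;
            return true;
        }
        if (state == State.ALLOCATED || state == State.ACTIVE) {
            return newJobId.equals(jobId) && newAllocationId.equals(allocationId);
        }
        return false;
    }

    /** ALLOCATED -> ACTIVE, once the job's master has accepted the slot. */
    boolean markActive() {
        if (state == State.ALLOCATED || state == State.ACTIVE) {
            state = State.ACTIVE;
            return true;
        }
        return false;
    }

    /** Tasks may only be added while the slot is ACTIVE. */
    boolean addTask(String taskId) {
        if (state != State.ACTIVE) {
            throw new IllegalStateException("The task slot is not in state active.");
        }
        return tasks.add(taskId);
    }

    boolean removeTask(String taskId) {
        return tasks.remove(taskId);
    }

    /** Marks the slot as being released; it stays occupied until its tasks are gone. */
    void markReleasing() {
        state = State.RELEASING;
    }

    /** Moves the slot to FREE, but only when no task is left in it. */
    boolean markFree() {
        if (!tasks.isEmpty()) {
            return false;
        }
        state = State.FREE;
        jobId = null;
        allocationId = null;
        return true;
    }

    public static void main(String[] args) {
        SlotLifecycleSketch slot = new SlotLifecycleSketch();
        slot.allocate("job-1", "alloc-1");
        slot.markActive();
        slot.addTask("task-1");
        System.out.println(slot.markFree()); // false: a task is still in the slot
        slot.markReleasing();
        slot.removeTask("task-1");
        System.out.println(slot.markFree()); // true
    }
}
```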

TaskSlotTable

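        The TaskExecutor manages all the slots it owns (TaskSlot instances) through the TaskSlotTable. Allocating and releasing resources both come down to changing the state of the corresponding TaskSlot and the job it is assigned to.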

public class TaskSlotTable implements TimeoutListener<AllocationID> {
	
	/** Timer service used to time out allocated slots. */
	private final TimerService<AllocationID> timerService; // times out slots that were allocated but never claimed by the JobMaster

	/** The list of all task slots. */
	private final List<TaskSlot> taskSlots; // every task slot, indexed by slot number

	/** Mapping from allocation id to task slot. */
	private final Map<AllocationID, TaskSlot> allocationIDTaskSlotMap;

	/** Mapping from execution attempt id to task and task slot. */
	private final Map<ExecutionAttemptID, TaskSlotMapping> taskSlotMappings;

	/** Mapping from job id to allocated slots for a job. */
	private final Map<JobID, Set<AllocationID>> slotsPerJob;

	/** Interface for slot actions, such as freeing them or timing them out. */
	private SlotActions slotActions;

	/** Whether the table has been started. */
	private boolean started;
}

       The allocateSlot(int index, JobID jobId, AllocationID allocationId, Time slotTimeout) method assigns the slot at the given index to the request identified by the AllocationID. Internally it delegates to the allocate(JobID newJobId, AllocationID newAllocationId) method of the corresponding TaskSlot.

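       The last parameter of TaskSlotTable#allocateSlot() is a timeout, slotTimeout, which is used to detect allocated slots that are never claimed. It works as follows: TaskSlotTable has a member TimerService timerService through which timers can be registered. If a timer is not cancelled before its timeout elapses, the timer service calls timeoutListener.notifyTimeout(key, ticket). TaskSlotTable registers itself as that listener and, in the callback, invokes the timeout method of its SlotActions. In other words, if the timer attached to an allocated slot is not cancelled in time, the slot is released again and marked Free.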

public boolean allocateSlot(int index, JobID jobId, AllocationID allocationId, Time slotTimeout) {
   checkInit();
   TaskSlot taskSlot = taskSlots.get(index); // look up the TaskSlot at this index
   boolean result = taskSlot.allocate(jobId, allocationId); // try to allocate the slot

   if (result) { // allocation succeeded
      // update the allocation id to task slot map
      allocationIDTaskSlotMap.put(allocationId, taskSlot);
      // register a timeout for this slot since it's in state allocated
      timerService.registerTimeout(allocationId, slotTimeout.getSize(), slotTimeout.getUnit()); // start the timeout timer
      // add this slot to the set of job slots
      Set<AllocationID> slots = slotsPerJob.get(jobId);

      if (slots == null) {
         slots = new HashSet<>(4);
         slotsPerJob.put(jobId, slots);
      }
      slots.add(allocationId);
   }
   return result;
}

When a slot is marked Active, the timeout timer that was registered when the slot was allocated is cancelled:

public boolean markSlotActive(AllocationID allocationId) throws SlotNotFoundException {
   checkInit();
   TaskSlot taskSlot = getTaskSlot(allocationId);

   if (taskSlot != null) {
      if (taskSlot.markActive()) {
         // unregister a potential timeout
         LOG.info("Activate slot {}.", allocationId);
         timerService.unregisterTimeout(allocationId); // cancel the timeout timer
         return true;
      } else {
         return false;
      }
   } else {
      throw new SlotNotFoundException(allocationId);
   }
}
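The register/cancel pattern behind the slot timeout can be sketched with a plain ScheduledExecutorService. This is a simplified stand-in and not Flink's TimerService: the class name, the Consumer-based listener, and the demo method are assumptions made for the example. Each registration gets a fresh ticket, and a timer only fires if its ticket is still the current one for that key.

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical, simplified timeout bookkeeping; not Flink's TimerService. */
final class SlotTimeoutSketch<K> {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Map<K, UUID> tickets = new ConcurrentHashMap<>();
    private final Map<K, ScheduledFuture<?>> futures = new ConcurrentHashMap<>();
    private final Consumer<K> listener;

    SlotTimeoutSketch(Consumer<K> listener) {
        this.listener = listener;
    }

    /** Registers a timeout for the key, replacing any earlier one. */
    void registerTimeout(K key, long delay, TimeUnit unit) {
        unregisterTimeout(key);
        UUID ticket = UUID.randomUUID();
        tickets.put(key, ticket);
        futures.put(key, scheduler.schedule(() -> {
            // Fire only if this registration is still the current one for the key.
            if (tickets.remove(key, ticket)) {
                listener.accept(key);
            }
        }, delay, unit));
    }

    /** Cancels the pending timeout for the key, if there is one. */
    void unregisterTimeout(K key) {
        tickets.remove(key);
        ScheduledFuture<?> future = futures.remove(key);
        if (future != null) {
            future.cancel(false);
        }
    }

    void shutdown() {
        scheduler.shutdownNow();
    }

    /** Demo: "alloc-1" is never activated and times out; "alloc-2" is activated in time. */
    static List<String> demo() {
        List<String> timedOut = new CopyOnWriteArrayList<>();
        SlotTimeoutSketch<String> timers = new SlotTimeoutSketch<>(key -> timedOut.add(key));
        timers.registerTimeout("alloc-1", 50, TimeUnit.MILLISECONDS);
        timers.registerTimeout("alloc-2", 50, TimeUnit.MILLISECONDS);
        timers.unregisterTimeout("alloc-2"); // the slot was marked active in time
        try {
            Thread.sleep(500); // give the remaining timer time to fire
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            timers.shutdown();
        }
        return timedOut;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [alloc-1]
    }
}
```

In this model the listener would free the slot; cancelling the timer on activation is what keeps a claimed slot from being reclaimed.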

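        createSlotReport returns a SlotReport object that contains the state and allocation of every slot in the current TaskExecutor. The method builds a SlotID and a SlotStatus for each slot. A SlotID uniquely identifies a slot on a TaskManager and does not change over the slot's lifetime. It has two fields: ResourceID identifies the TaskExecutor that owns the slot, and slotNumber is the slot's index within that TaskExecutor.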

// ---------------------------------------------------------------------
// Slot report methods   (build a SlotReport with the state and allocation of every slot in this TaskExecutor)
// ---------------------------------------------------------------------
public SlotReport createSlotReport(ResourceID resourceId) {
   final int numberSlots = taskSlots.size(); // The list of all task slots

   List<SlotStatus> slotStatuses = Arrays.asList(new SlotStatus[numberSlots]);

   for (int i = 0; i < numberSlots; i++) {
      TaskSlot taskSlot = taskSlots.get(i);
      SlotID slotId = new SlotID(resourceId, taskSlot.getIndex()); // unique SlotID: (ResourceID, index) identifies one slot on one TaskManager

      SlotStatus slotStatus = new SlotStatus(
         slotId,
         taskSlot.getResourceProfile(),
         taskSlot.getJobId(),            // job id the slot is allocated to, or null
         taskSlot.getAllocationId());    // allocation id of the slot, or null
      slotStatuses.set(i, slotStatus);
   }

   final SlotReport slotReport = new SlotReport(slotStatuses);
   return slotReport;
}
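To make the shape of a slot report concrete, here is a minimal, self-contained sketch. It is a simplified model and not Flink's SlotID, SlotStatus, or SlotReport: ids are plain strings, the resource profile and job id are left out, and all class names are assumptions made for the example. The point is that a slot's identity is the pair (resource id, slot number), so it can serve as a stable key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical, simplified model of a slot report; ids are plain strings. */
final class SlotReportSketch {

    /** Identifies a slot by its TaskExecutor's resource id plus the slot index. */
    static final class SlotId {
        final String resourceId;
        final int slotNumber;

        SlotId(String resourceId, int slotNumber) {
            this.resourceId = Objects.requireNonNull(resourceId);
            this.slotNumber = slotNumber;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) {
                return true;
            }
            if (!(other instanceof SlotId)) {
                return false;
            }
            SlotId that = (SlotId) other;
            return slotNumber == that.slotNumber && resourceId.equals(that.resourceId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(resourceId, slotNumber);
        }
    }

    /** One report entry: the slot's identity plus its allocation id, or null when free. */
    static final class SlotStatus {
        final SlotId slotId;
        final String allocationId;

        SlotStatus(SlotId slotId, String allocationId) {
            this.slotId = slotId;
            this.allocationId = allocationId;
        }
    }

    /** Builds one status per slot; allocationIds[i] is null when slot i is free. */
    static List<SlotStatus> createReport(String resourceId, String[] allocationIds) {
        List<SlotStatus> report = new ArrayList<>(allocationIds.length);
        for (int i = 0; i < allocationIds.length; i++) {
            report.add(new SlotStatus(new SlotId(resourceId, i), allocationIds[i]));
        }
        return report;
    }

    public static void main(String[] args) {
        List<SlotStatus> report = createReport("tm-1", new String[] {"alloc-1", null, null});
        long free = report.stream().filter(status -> status.allocationId == null).count();
        System.out.println(free + " of " + report.size() + " slots are free"); // 2 of 3 slots are free
    }
}
```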

TaskExecutor

The TaskExecutor's basic responsibilities, with the corresponding components and interfaces:

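  1. Report the state of all slots to the ResourceManager (ResourceManagerLeaderListener).
  2. Provide each JobManager with the state of the slots for its jobId (JobLeaderListenerImpl).
  3. Manage all slots on the current TaskExecutor (TaskSlotTable).
  4. Manage heartbeats, which also carry slot reports (JobManagerHeartbeatListener, ResourceManagerHeartbeatListener).
  5. Basic task management, checkpoint management, and so on (TaskExecutorGateway).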

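       The TaskExecutor has to report the state of all its slots to the ResourceManager so that the ResourceManager knows how every slot is allocated. The flow is as follows. When the TaskExecutor obtains the ResourceManager's address, the ResourceManagerLeaderListener it registered is called back asynchronously (NotifyOfLeaderCall), and the TaskExecutor then connects to that RM asynchronously via reconnectToResourceManager (TaskExecutorToResourceManagerConnection). Once the connection is established, it calls resourceManager.registerTaskExecutor() asynchronously over RPC to register itself with the RM and set up the heartbeat. After registration succeeds, the result is delivered asynchronously through the ResourceManagerRegistrationListener callback, and the TaskExecutor reports its slot resources to the RM. Slots are reported in two situations: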

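  1. When the TaskExecutor first connects to the ResourceManager, it sends a SlotReport.
  2. The TaskExecutor and the ResourceManager exchange heartbeats periodically, and each heartbeat payload contains a SlotReport.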
class TaskExecutor {
    private void establishResourceManagerConnection(
      ResourceManagerGateway resourceManagerGateway,
      ResourceID resourceManagerResourceId,
      InstanceID taskExecutorRegistrationId,
      ClusterInformation clusterInformation) {
      // first connection: report the slots to the RM
      final CompletableFuture<Acknowledge> slotReportResponseFuture = resourceManagerGateway.sendSlotReport(
        getResourceID(),    // taskManagerLocation.getResourceID()
       taskExecutorRegistrationId,
       taskSlotTable.createSlotReport(getResourceID()),
       taskManagerConfiguration.getTimeout());
      
       // .........
       // monitor the resource manager as heartbeat target
      resourceManagerHeartbeatManager.monitorTarget(resourceManagerResourceId, new HeartbeatTarget<SlotReport>() {
         @Override
         public void receiveHeartbeat(ResourceID resourceID, SlotReport slotReport) {
            resourceManagerGateway.heartbeatFromTaskManager(resourceID, slotReport);
         }
    
         @Override
         public void requestHeartbeat(ResourceID resourceID, SlotReport slotReport) {
            // the TaskManager won't send heartbeat requests to the ResourceManager
         }
      });
      // ........
   }
   
   private class ResourceManagerHeartbeatListener implements HeartbeatListener<Void, SlotReport> {
	  @Override
	  public void notifyHeartbeatTimeout(final ResourceID resourceId) { // on timeout, try to reconnect
		validateRunsInMainThread();
		// first check whether the timeout is still valid
		if (establishedResourceManagerConnection != null && establishedResourceManagerConnection.getResourceManagerResourceId().equals(resourceId)) {
			log.info("The heartbeat of ResourceManager with id {} timed out.", resourceId);
			reconnectToResourceManager(new TaskManagerException(
				String.format("The heartbeat of ResourceManager with id %s timed out.", resourceId)));
		} else {
			log.debug("Received heartbeat timeout for outdated ResourceManager id {}. Ignoring the timeout.", resourceId);
		}
	  }

	  @Override
	  public CompletableFuture<SlotReport> retrievePayload(ResourceID resourceID) { // the heartbeat payload carries a report of all slots of this TaskExecutor
		validateRunsInMainThread();
		return CompletableFuture.completedFuture(taskSlotTable.createSlotReport(getResourceID()));  // taskManagerLocation.getResourceID()
	  }
   }
}

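The ResourceManager sends an RPC request to a TaskExecutor through TaskExecutor.requestSlot(), asking it to allocate a slot. Because the ResourceManager knows which slots are currently allocated and which are free, the request names a specific SlotID: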

// TaskExecutor#requestSlot()
public CompletableFuture<Acknowledge> requestSlot(
   final SlotID slotId,
   final JobID jobId,
   final AllocationID allocationId,
   final String targetAddress,
   final ResourceManagerId resourceManagerId,
   final Time timeout) {
   // TODO: Filter invalid requests from the resource manager by using the instance/registration Id
   log.info("Receive slot request {} for job {} from resource manager with leader id {}.",
      allocationId, jobId, resourceManagerId);

   try {
      // check that the requesting RM is the one this TaskExecutor is registered with
      if (!isConnectedToResourceManager(resourceManagerId)) {
         final String message = String.format("TaskManager is not connected to the resource manager %s.", resourceManagerId);
         log.debug(message);
         throw new TaskManagerException(message);
      }

      // if the slot is currently Free, allocate it
      if (taskSlotTable.isSlotFree(slotId.getSlotNumber())) {
         if (taskSlotTable.allocateSlot(slotId.getSlotNumber(), jobId, allocationId, taskManagerConfiguration.getTimeout())) {
            log.info("Allocated slot for {}.", allocationId);
         } else {
            log.info("Could not allocate slot for {}.", allocationId);
            throw new SlotAllocationException("Could not allocate slot.");
         }
      } else if (!taskSlotTable.isAllocated(slotId.getSlotNumber(), jobId, allocationId)) {
         // the slot is already allocated to a different request: throw an exception
         final String message = "The slot " + slotId + " has already been allocated for a different job.";
         log.info(message);
         final AllocationID allocationID = taskSlotTable.getCurrentAllocation(slotId.getSlotNumber());
         throw new SlotOccupiedException(message, allocationID, taskSlotTable.getOwningJob(allocationID));
      }

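      // offer the allocated slot to the JobManager of the requesting job
      if (jobManagerTable.contains(jobId)) { // jobManagerTable maps each jobId to its JobManagerConnection
         offerSlotsToJobManager(jobId);      // a connection to the JobManager already exists, so offer the slot right away
      } else {
         // otherwise connect to the JobManager first; offerSlotsToJobManager(jobId) is called once the connection is up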
         try {
            jobLeaderService.addJob(jobId, targetAddress);
         } catch (Exception e) {
            // free the allocated slot
            try {
               taskSlotTable.freeSlot(allocationId);
            } catch (SlotNotFoundException slotNotFoundException) {
               // slot no longer existent, this should actually never happen, because we've
               // just allocated the slot. So let's fail hard in this case!
               onFatalError(slotNotFoundException);
            }

            // release local state under the allocation id.
            localStateStoresManager.releaseLocalStateForAllocationId(allocationId);

            // sanity check
            if (!taskSlotTable.isSlotFree(slotId.getSlotNumber())) {
               onFatalError(new Exception("Could not free slot " + slotId));
            }
            throw new SlotAllocationException("Could not add job to job leader service.", e);
         }
      }
   } catch (TaskManagerException taskManagerException) {
      return FutureUtils.completedExceptionally(taskManagerException);
   }
   return CompletableFuture.completedFuture(Acknowledge.get());
}
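The three-way decision at the heart of requestSlot can be isolated in a small model. This is a simplified sketch under stated assumptions and not the Flink code: ids are plain strings, the Slot and Outcome types are hypothetical, and the occupied case returns a value where the method above throws SlotOccupiedException.

```java
/** Hypothetical, simplified model of the decision in requestSlot; ids are plain strings. */
final class RequestSlotSketch {

    enum Outcome { ALLOCATED, ALREADY_ALLOCATED, OCCUPIED }

    /** A slot is free while it has no allocation id. */
    static final class Slot {
        String jobId;
        String allocationId;
    }

    static Outcome requestSlot(Slot slot, String jobId, String allocationId) {
        if (slot.allocationId == null) {
            // Free slot: bind it to the requesting job and allocation.
            slot.jobId = jobId;
            slot.allocationId = allocationId;
            return Outcome.ALLOCATED;
        }
        if (jobId.equals(slot.jobId) && allocationId.equals(slot.allocationId)) {
            // Same request again (for example a retried RPC): nothing to change.
            return Outcome.ALREADY_ALLOCATED;
        }
        // Held by another allocation: the request has to be refused.
        return Outcome.OCCUPIED;
    }

    public static void main(String[] args) {
        Slot slot = new Slot();
        System.out.println(requestSlot(slot, "job-1", "alloc-1")); // ALLOCATED
        System.out.println(requestSlot(slot, "job-1", "alloc-1")); // ALREADY_ALLOCATED
        System.out.println(requestSlot(slot, "job-2", "alloc-2")); // OCCUPIED
    }
}
```

Treating a repeated request as a success makes the RPC idempotent, so the ResourceManager can safely retry it.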

After a slot has been allocated, the TaskExecutor has to offer it to the JobManager. This is done by the offerSlotsToJobManager(jobId) method:

// ------------------------------------------------------------------------
//  Internal job manager connection methods (offer slots to the job manager)
// ------------------------------------------------------------------------
private void offerSlotsToJobManager(final JobID jobId) {
   final JobManagerConnection jobManagerConnection = jobManagerTable.get(jobId); // look up the connection to the job's JobManager
   if (jobManagerConnection == null) {
      log.debug("There is no job manager connection to the leader of job {}.", jobId);
   } else {
      if (taskSlotTable.hasAllocatedSlots(jobId)) {
         log.info("Offer reserved slots to the leader of job {}.", jobId);
         final JobMasterGateway jobMasterGateway = jobManagerConnection.getJobManagerGateway(); // RPC gateway proxy of the JobManager
         // fetch the slots allocated to this job; only slots in state allocated are returned
         final Iterator<TaskSlot> reservedSlotsIterator = taskSlotTable.getAllocatedSlots(jobId);
         final JobMasterId jobMasterId = jobManagerConnection.getJobMasterId();
         final Collection<SlotOffer> reservedSlots = new HashSet<>(2);
         while (reservedSlotsIterator.hasNext()) {
            SlotOffer offer = reservedSlotsIterator.next().generateSlotOffer();
            reservedSlots.add(offer);
         }
         // offer the slots to the JobMaster through an RPC call
         CompletableFuture<Collection<SlotOffer>> acceptedSlotsFuture = jobMasterGateway.offerSlots(
            getResourceID(),
            reservedSlots,
            taskManagerConfiguration.getTimeout());
         // once the offer completes, handle the accepted and rejected slots
         acceptedSlotsFuture.whenCompleteAsync(
            handleAcceptedSlotOffers(jobId, jobMasterGateway, jobMasterId, reservedSlots),
            getMainThreadExecutor());
      } else {
         log.debug("There are no unassigned slots for the job {}.", jobId);
      }
   }
}

// handle the outcome of a slot offer
@Nonnull
private BiConsumer<Iterable<SlotOffer>, Throwable> handleAcceptedSlotOffers(JobID jobId, JobMasterGateway jobMasterGateway, JobMasterId jobMasterId, Collection<SlotOffer> offeredSlots) {
   return (Iterable<SlotOffer> acceptedSlots, Throwable throwable) -> {
      if (throwable != null) { // the offer failed
         if (throwable instanceof TimeoutException) {
            log.info("Slot offering to JobManager did not finish in time. Retrying the slot offering.");
            // We ran into a timeout. Try again.
            offerSlotsToJobManager(jobId); // retry after a timeout
         } else {
            log.warn("Slot offering to JobManager failed. Freeing the slots " +
               "and returning them to the ResourceManager.", throwable);

            // We encountered an exception. Free the slots and return them to the RM.
            // on any other exception, free all offered slots and return them to the RM
            for (SlotOffer reservedSlot: offeredSlots) {
               freeSlotInternal(reservedSlot.getAllocationId(), throwable);
            }
         }
      } else { // the call succeeded
         // check if the response is still valid
         if (isJobManagerConnectionValid(jobId, jobMasterId)) {
            // mark accepted slots active
            // slots the JobMaster confirmed as accepted are marked Active
            for (SlotOffer acceptedSlot : acceptedSlots) {
               try {
                  if (!taskSlotTable.markSlotActive(acceptedSlot.getAllocationId())) {
                     // the slot is either free or releasing at the moment
                     final String message = "Could not mark slot " + jobId + " active.";
                     log.debug(message);
                     jobMasterGateway.failSlot(
                        getResourceID(),
                        acceptedSlot.getAllocationId(),
                        new FlinkException(message));
                  }
               } catch (SlotNotFoundException e) {
                  final String message = "Could not mark slot " + jobId + " active.";
                  jobMasterGateway.failSlot(
                     getResourceID(),
                     acceptedSlot.getAllocationId(),
                     new FlinkException(message));
               }
               offeredSlots.remove(acceptedSlot);
            }

            // free the remaining slots, which the JobManager did not accept
            final Exception e = new Exception("The slot was rejected by the JobManager.");
            for (SlotOffer rejectedSlot : offeredSlots) {
               freeSlotInternal(rejectedSlot.getAllocationId(), e);
            }
         } else {
            // discard the response since there is a new leader for the job
            log.debug("Discard offer slot response since there is a new leader " +
               "for the job {}.", jobId);
         }
      }
   };
}
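The success branch above boils down to a set difference: accepted slots become Active, and whatever was offered but not accepted is freed. A minimal sketch of that split, with allocation ids as plain strings and a hypothetical helper name:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper: splits offered slots by the JobMaster's answer; ids are plain strings. */
final class SlotOfferSketch {

    /** Returns the offered allocation ids that are missing from the accepted ones. */
    static Set<String> rejected(Collection<String> offered, Collection<String> accepted) {
        Set<String> rest = new LinkedHashSet<>(offered); // copy, so the caller's collection is untouched
        rest.removeAll(accepted);
        return rest;
    }

    public static void main(String[] args) {
        Set<String> toFree = rejected(
            Arrays.asList("alloc-1", "alloc-2", "alloc-3"),
            Arrays.asList("alloc-1", "alloc-3"));
        System.out.println(toFree); // [alloc-2]
    }
}
```

Unlike the method above, which removes accepted entries from offeredSlots in place, this sketch works on a copy.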

The freeSlot(AllocationID, Throwable, Time) method asks the TaskExecutor to release the slot associated with an AllocationID:

// TaskExecutor#freeSlot
public CompletableFuture<Acknowledge> freeSlot(AllocationID allocationId, Throwable cause, Time timeout) {
   freeSlotInternal(allocationId, cause); // the actual release logic
   return CompletableFuture.completedFuture(Acknowledge.get());
}

private void freeSlotInternal(AllocationID allocationId, Throwable cause) {
   checkNotNull(allocationId);
   log.debug("Free slot with allocation id {} because: {}", allocationId, cause.getMessage());

   try {
      final JobID jobId = taskSlotTable.getOwningJob(allocationId);
      final int slotIndex = taskSlotTable.freeSlot(allocationId, cause); // try to free the slot bound to allocationId

      if (slotIndex != -1) { // the slot was freed
         if (isConnectedToResourceManager()) {
            // the slot was freed. Tell the RM about it
            ResourceManagerGateway resourceManagerGateway = establishedResourceManagerConnection.getResourceManagerGateway();
            // tell the ResourceManager that the slot is available again
            resourceManagerGateway.notifySlotAvailable(
               establishedResourceManagerConnection.getTaskExecutorRegistrationId(),
               new SlotID(getResourceID(), slotIndex),
               allocationId);
         }

         if (jobId != null) {
            // check whether we still have allocated slots for the same job
            // if the job bound to this allocationId has no allocated slots left, the connection to its JobMaster can be closed
            if (taskSlotTable.getAllocationIdsPerJob(jobId).isEmpty()) {
               // we can remove the job from the job leader service
               try {
                  jobLeaderService.removeJob(jobId);
               } catch (Exception e) {
                  log.info("Could not remove job {} from JobLeaderService.", jobId, e);
               }

               closeJobManagerConnection(
                  jobId,
                  new FlinkException("TaskExecutor " + getAddress() +
                     " has no more allocated slots for job " + jobId + '.'));
            }
         }
      }
   } catch (SlotNotFoundException e) {
      log.debug("Could not free slot for allocation id {}.", allocationId, e);
   }
   localStateStoresManager.releaseLocalStateForAllocationId(allocationId);
}
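The follow-up work in freeSlotInternal can be summarized as: tell the ResourceManager the slot is available, and close the JobMaster connection once the job's last slot is gone. The sketch below models only that bookkeeping. It is a simplified illustration with hypothetical names; ids are plain strings and the follow-up actions are returned as descriptive strings instead of being sent over RPC.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of the follow-up work after a slot is freed. */
final class FreeSlotSketch {

    private final Map<String, Set<String>> slotsPerJob = new HashMap<>();

    void allocate(String jobId, String allocationId) {
        slotsPerJob.computeIfAbsent(jobId, id -> new HashSet<>()).add(allocationId);
    }

    /** Frees one allocation and returns the follow-up actions, in order. */
    List<String> free(String jobId, String allocationId) {
        List<String> actions = new ArrayList<>();
        Set<String> slots = slotsPerJob.get(jobId);
        if (slots == null || !slots.remove(allocationId)) {
            return actions; // unknown allocation: nothing was freed
        }
        actions.add("notify RM: " + allocationId + " is available");
        if (slots.isEmpty()) {
            // Last slot of the job: the connection to its JobMaster is no longer needed.
            slotsPerJob.remove(jobId);
            actions.add("close JobMaster connection for " + jobId);
        }
        return actions;
    }

    public static void main(String[] args) {
        FreeSlotSketch executor = new FreeSlotSketch();
        executor.allocate("job-1", "alloc-1");
        executor.allocate("job-1", "alloc-2");
        System.out.println(executor.free("job-1", "alloc-1").size()); // 1
        System.out.println(executor.free("job-1", "alloc-2").size()); // 2
    }
}
```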

 
