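When Flink runs in cluster mode rather than MiniCluster mode, each TaskManager is a separate JVM process. A TaskManager may run several subtasks, each in its own thread. The Task Slot concept was introduced to control how many tasks a single TaskManager can run.
Each Task Slot represents a fixed subset of the TaskManager's computing resources, and the slots divide that TaskManager's memory between them. For example, a TaskManager with 3 slots gives each slot 1/3 of its memory, so subtasks running in different slots do not compete for memory. Flink currently isolates only memory; it does not isolate CPU. Adjusting the number of slots also controls how strongly subtasks are isolated from each other. With one slot per TaskManager, each subtask runs in its own JVM process; with several slots per TaskManager, more subtasks share one JVM process. Subtasks in the same JVM process can share context information, TCP connections and heartbeat messages, which reduces network transfer. They can also share some data structures, which lowers the per-subtask overhead to some extent.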
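By default, Flink lets subtasks share a slot, provided they belong to the same Job and are not subtasks of the same operator. As a result, a single slot may run an entire pipeline of the Job. Slot sharing has two main benefits: a Flink cluster needs only as many slots as the highest parallelism used in the Job, so there is no need to count the total number of subtasks; and resource utilization improves, because lightweight subtasks no longer occupy slots that resource-intensive subtasks could use.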
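As a rough illustration of how slot sharing changes slot demand, here is a minimal sketch. The class `SlotDemand` and its methods are hypothetical helpers, not Flink API, and the sketch assumes that all operators belong to one slot sharing group and that no operator chaining takes place.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of Flink) that estimates the slot demand of one job. */
final class SlotDemand {

    /** Without sharing, every subtask occupies its own slot: the sum of all parallelisms. */
    static int withoutSharing(int[] operatorParallelism) {
        return Arrays.stream(operatorParallelism).sum();
    }

    /** With sharing in a single group, one slot can hold one subtask of each operator: the maximum parallelism. */
    static int withSharing(int[] operatorParallelism) {
        return Arrays.stream(operatorParallelism).max().orElse(0);
    }

    public static void main(String[] args) {
        int[] parallelism = {2, 6, 6, 1}; // assumed example: source, map, window, sink
        System.out.println("without sharing: " + withoutSharing(parallelism));
        System.out.println("with sharing:    " + withSharing(parallelism));
    }
}
```

For the assumed parallelisms 2, 6, 6 and 1, the job would need 15 slots without sharing and 6 with sharing.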
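Flink uses SlotSharingGroup and CoLocationGroup to decide how resources are shared when tasks are scheduled. They correspond to two kinds of constraints: SlotSharingGroup is a soft constraint, under which subtasks of different JobVertices in the same group may be placed in the same slot, without any guarantee; CoLocationGroup is a hard constraint, under which the n-th subtasks of the JobVertices in the same group must run in the same slot.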
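Overall, resource management involves three components: JobMaster, ResourceManager and TaskManager. JobMaster is the consumer of slots and requests them from ResourceManager. ResourceManager allocates slots, reclaims and releases idle resources, and requests more resources from an external resource framework such as Yarn or k8s when slots run short. TaskManager is the holder of slots; its slotTable records which task of which job each slot has been assigned to.
TaskSlot is the abstraction TaskExecutor uses to manage its slots. It holds the slot's index within the TaskExecutor, the resourceProfile that describes the slot's resources, and the tasks and job the slot may be assigned to. Each TaskSlot also carries a state, which is one of four values: Active (in use by a JobManager), Allocated (assigned to a job but not yet claimed by the JobManager), Releasing (no longer in use and waiting for its tasks' resources to be released), and Free (idle). Its main fields are: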
public class TaskSlot {
    /** Index of the task slot. */
    private final int index;
    /** Resource characteristics for this slot. */
    private final ResourceProfile resourceProfile; // resource description (cpu, memory, network memory, etc.)
    /** Tasks running in this slot. */
    private final Map<ExecutionAttemptID, Task> tasks;
    /** State of this slot. */
    private TaskSlotState state;
    /** Job id to which the slot has been allocated; null if not allocated. */
    private JobID jobId;
    /** Allocation id of this slot; null if not allocated. */
    private AllocationID allocationId;
    // ......
}
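TaskSlot provides methods that change its state. allocate(JobID newJobId, AllocationID newAllocationId) marks the slot as Allocated; markFree() marks it as Free, but succeeds only after all Tasks have been removed. Every transition first checks the slot's current state. add(Task task) adds a Task to the slot; all Tasks in a slot must come from the same Job.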
// Add a task to the task slot
public boolean add(Task task) {
    // Check that this slot has been assigned to the job sending this task
    Preconditions.checkArgument(task.getJobID().equals(jobId), "The task's job id does not match the " +
        "job id for which the slot has been allocated.");
    Preconditions.checkArgument(task.getAllocationId().equals(allocationId), "The task's allocation " +
        "id does not match the allocation id for which the slot has been allocated.");
    Preconditions.checkState(TaskSlotState.ACTIVE == state, "The task slot is not in state active.");
    Task oldTask = tasks.put(task.getExecutionId(), task);
    if (oldTask != null) {
        // a task with this execution id already exists: restore it and reject the new one
        tasks.put(task.getExecutionId(), oldTask);
        return false;
    } else {
        return true;
    }
}

// Allocate the task slot for a job
public boolean allocate(JobID newJobId, AllocationID newAllocationId) {
    if (TaskSlotState.FREE == state) {
        // sanity checks
        Preconditions.checkState(allocationId == null);
        Preconditions.checkState(jobId == null);
        this.jobId = Preconditions.checkNotNull(newJobId);
        this.allocationId = Preconditions.checkNotNull(newAllocationId);
        state = TaskSlotState.ALLOCATED;
        return true;
    } else if (TaskSlotState.ALLOCATED == state || TaskSlotState.ACTIVE == state) {
        // repeated request: succeeds only for the same job and allocation
        Preconditions.checkNotNull(newJobId);
        Preconditions.checkNotNull(newAllocationId);
        return newJobId.equals(jobId) && newAllocationId.equals(allocationId);
    } else {
        return false;
    }
}

// Try to mark the slot as free
public boolean markFree() {
    if (isEmpty()) {
        state = TaskSlotState.FREE;
        this.jobId = null;
        this.allocationId = null;
        return true;
    } else {
        return false;
    }
}
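The state transitions above can be tried out in isolation. The following is a simplified, hypothetical model (`SlotLifecycle` is not a Flink class): plain strings stand in for JobID and AllocationID, a counter stands in for the task map, and, as a simplifying assumption, `markFree()` itself moves a non-empty slot to RELEASING.

```java
/** Simplified, hypothetical model of the TaskSlot life cycle; not Flink's implementation. */
final class SlotLifecycle {
    enum State { FREE, ALLOCATED, ACTIVE, RELEASING }

    private State state = State.FREE;
    private String jobId;        // stands in for JobID
    private String allocationId; // stands in for AllocationID
    private int runningTasks;    // stands in for the task map

    State state() { return state; }

    boolean allocate(String newJobId, String newAllocationId) {
        if (state == State.FREE) {
            jobId = newJobId;
            allocationId = newAllocationId;
            state = State.ALLOCATED;
            return true;
        }
        if (state == State.ALLOCATED || state == State.ACTIVE) {
            // Repeated request: succeeds only for the same job and allocation.
            return newJobId.equals(jobId) && newAllocationId.equals(allocationId);
        }
        return false; // RELEASING
    }

    boolean markActive() {
        if (state == State.ALLOCATED || state == State.ACTIVE) {
            state = State.ACTIVE;
            return true;
        }
        return false;
    }

    boolean addTask() {
        if (state != State.ACTIVE) {
            return false; // tasks may only be added to an active slot
        }
        runningTasks++;
        return true;
    }

    void removeTask() {
        if (runningTasks > 0) {
            runningTasks--;
        }
    }

    boolean markFree() {
        if (runningTasks == 0) {
            state = State.FREE;
            jobId = null;
            allocationId = null;
            return true;
        }
        state = State.RELEASING; // assumption of this sketch: wait here until all tasks are gone
        return false;
    }

    public static void main(String[] args) {
        SlotLifecycle slot = new SlotLifecycle();
        slot.allocate("job-1", "alloc-1");
        slot.markActive();
        slot.addTask();
        System.out.println(slot.markFree() + " " + slot.state()); // a task is still running
        slot.removeTask();
        System.out.println(slot.markFree() + " " + slot.state());
    }
}
```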
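TaskExecutor manages all of its slots (TaskSlot) through TaskSlotTable. Allocating and releasing resources comes down to changing the state of the corresponding TaskSlot and the job it is assigned to: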
public class TaskSlotTable implements TimeoutListener<AllocationID> {
    /** Timer service used to time out allocated slots. */
    private final TimerService<AllocationID> timerService; // times out slots that were allocated but never claimed by the JobMaster
    /** The list of all task slots. */
    private final List<TaskSlot> taskSlots;
    /** Mapping from allocation id to task slot. */
    private final Map<AllocationID, TaskSlot> allocationIDTaskSlotMap;
    /** Mapping from execution attempt id to task and task slot. */
    private final Map<ExecutionAttemptID, TaskSlotMapping> taskSlotMappings;
    /** Mapping from job id to allocated slots for a job. */
    private final Map<JobID, Set<AllocationID>> slotsPerJob;
    /** Interface for slot actions, such as freeing them or timing them out. */
    private SlotActions slotActions;
    /** Whether the table has been started. */
    private boolean started;
}
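Its allocateSlot(int index, JobID jobId, AllocationID allocationId, Time slotTimeout) method assigns the slot at the given index to the request identified by the AllocationID. Internally it delegates to the TaskSlot's allocate(JobID newJobId, AllocationID newAllocationId) method.
The last parameter of TaskSlotTable#allocateSlot() is a timeout, slotTimeout, which is used to detect allocated slots that are never claimed. TaskSlotTable has a TimerService member, and a successful allocation registers a timeout with it under the AllocationID: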
public boolean allocateSlot(int index, JobID jobId, AllocationID allocationId, Time slotTimeout) {
    checkInit();
    TaskSlot taskSlot = taskSlots.get(index); // look up the TaskSlot
    boolean result = taskSlot.allocate(jobId, allocationId); // try to allocate it
    if (result) { // allocation succeeded
        // update the allocation id to task slot map
        allocationIDTaskSlotMap.put(allocationId, taskSlot);
        // register a timeout for this slot since it's in state allocated
        timerService.registerTimeout(allocationId, slotTimeout.getSize(), slotTimeout.getUnit());
        // add this slot to the set of job slots
        Set<AllocationID> slots = slotsPerJob.get(jobId);
        if (slots == null) {
            slots = new HashSet<>(4);
            slotsPerJob.put(jobId, slots);
        }
        slots.add(allocationId);
    }
    return result;
}
When a slot is marked Active, the timeout timer that was registered when the slot was allocated is cancelled:
public boolean markSlotActive(AllocationID allocationId) throws SlotNotFoundException {
    checkInit();
    TaskSlot taskSlot = getTaskSlot(allocationId);
    if (taskSlot != null) {
        if (taskSlot.markActive()) {
            // unregister a potential timeout
            LOG.info("Activate slot {}.", allocationId);
            timerService.unregisterTimeout(allocationId); // cancel the timeout timer
            return true;
        } else {
            return false;
        }
    } else {
        throw new SlotNotFoundException(allocationId);
    }
}
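The register-then-cancel pattern used for allocated slots can be sketched with the JDK alone. `AllocationTimeouts` below is a hypothetical stand-in, not Flink's TimerService: strings replace AllocationID, and a ticket number per registration is assumed as the way to ignore timers that were unregistered or replaced before they fired.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical stand-in for the slot timeout bookkeeping; Flink's TimerService differs in detail. */
final class AllocationTimeouts implements AutoCloseable {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Map<String, Long> tickets = new ConcurrentHashMap<>();
    private final AtomicLong ticketCounter = new AtomicLong();
    private final Consumer<String> onTimeout;

    AllocationTimeouts(Consumer<String> onTimeout) {
        this.onTimeout = onTimeout;
    }

    /** Called when a slot becomes ALLOCATED: start the clock for this allocation id. */
    void registerTimeout(String allocationId, long delay, TimeUnit unit) {
        long ticket = ticketCounter.incrementAndGet();
        tickets.put(allocationId, ticket); // overwrites, and thereby invalidates, any older ticket
        scheduler.schedule(() -> {
            // Fire only if this exact registration is still the current one.
            if (tickets.remove(allocationId, ticket)) {
                onTimeout.accept(allocationId);
            }
        }, delay, unit);
    }

    /** Called when the slot becomes ACTIVE: the pending timer turns into a no-op. */
    void unregisterTimeout(String allocationId) {
        tickets.remove(allocationId);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }

    /** Registers two timeouts, cancels one, and returns the allocation ids that timed out. */
    static List<String> runDemo() {
        List<String> fired = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(1);
        try (AllocationTimeouts timeouts = new AllocationTimeouts(id -> {
            fired.add(id);
            done.countDown();
        })) {
            timeouts.registerTimeout("alloc-claimed", 300, TimeUnit.MILLISECONDS);
            timeouts.unregisterTimeout("alloc-claimed"); // the slot was marked active in time
            timeouts.registerTimeout("alloc-abandoned", 50, TimeUnit.MILLISECONDS);
            done.await(5, TimeUnit.SECONDS); // wait for the abandoned slot to time out
            Thread.sleep(400);               // let the cancelled timer's delay elapse as well
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return fired;
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + runDemo());
    }
}
```

In this sketch only the allocation that was never unregistered reaches the timeout callback, which is the point at which a TaskExecutor would free the unclaimed slot.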
createSlotReport returns a SlotReport, which contains the state and allocation of every slot in the current TaskExecutor. The method builds a SlotID and a SlotStatus for each slot. A SlotID uniquely identifies a slot on a TaskManager and does not change over the slot's lifetime. It has two fields: ResourceID identifies the TaskExecutor that owns the slot, and slotNumber is the slot's index within that TaskExecutor.
// ---------------------------------------------------------------------
// Slot report methods (build a SlotReport with the state and allocation of every slot in this TaskExecutor)
// ---------------------------------------------------------------------
public SlotReport createSlotReport(ResourceID resourceId) {
    final int numberSlots = taskSlots.size(); // The list of all task slots
    List<SlotStatus> slotStatuses = Arrays.asList(new SlotStatus[numberSlots]);
    for (int i = 0; i < numberSlots; i++) {
        TaskSlot taskSlot = taskSlots.get(i);
        // unique SlotID: (ResourceID, index) identifies one slot on one TaskManager
        SlotID slotId = new SlotID(resourceId, taskSlot.getIndex());
        SlotStatus slotStatus = new SlotStatus(
            slotId,
            taskSlot.getResourceProfile(),
            taskSlot.getJobId(),         // job id the slot is allocated to, or null
            taskSlot.getAllocationId()); // allocation id of the slot, or null
        slotStatuses.set(i, slotStatus);
    }
    final SlotReport slotReport = new SlotReport(slotStatuses);
    return slotReport;
}
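The basic responsibilities of TaskExecutor and its related components and interfaces:
TaskExecutor must report the state of all its slots to ResourceManager, so that ResourceManager knows how every slot is allocated. When the ResourceManager leader is retrieved, the registered ResourceManagerLeaderListener is called back asynchronously (NotifyOfLeaderCall), and the TaskExecutor then connects to that RM asynchronously through reconnectToResourceManager (TaskExecutorToResourceManagerConnection). Once the connection is established, it calls resourceManager.registerTaskExecutor() asynchronously over RPC to register itself and set up the heartbeat. A successful registration is delivered asynchronously to ResourceManagerRegistrationListener, at which point the TaskExecutor reports its slots to the RM. Slot reports are sent in two places, on the first connection and as the payload of each heartbeat: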
class TaskExecutor {
    private void establishResourceManagerConnection(
            ResourceManagerGateway resourceManagerGateway,
            ResourceID resourceManagerResourceId,
            InstanceID taskExecutorRegistrationId,
            ClusterInformation clusterInformation) {
        // first connection: report the slots to the RM
        final CompletableFuture<Acknowledge> slotReportResponseFuture = resourceManagerGateway.sendSlotReport(
            getResourceID(), // taskManagerLocation.getResourceID()
            taskExecutorRegistrationId,
            taskSlotTable.createSlotReport(getResourceID()),
            taskManagerConfiguration.getTimeout());
        // .........
        // monitor the resource manager as heartbeat target
        resourceManagerHeartbeatManager.monitorTarget(resourceManagerResourceId, new HeartbeatTarget<SlotReport>() {
            @Override
            public void receiveHeartbeat(ResourceID resourceID, SlotReport slotReport) {
                resourceManagerGateway.heartbeatFromTaskManager(resourceID, slotReport);
            }
            @Override
            public void requestHeartbeat(ResourceID resourceID, SlotReport slotReport) {
                // the TaskManager won't send heartbeat requests to the ResourceManager
            }
        });
        // ........
    }

    private class ResourceManagerHeartbeatListener implements HeartbeatListener<Void, SlotReport> {
        @Override
        public void notifyHeartbeatTimeout(final ResourceID resourceId) { // on timeout, try to reconnect
            validateRunsInMainThread();
            // first check whether the timeout is still valid
            if (establishedResourceManagerConnection != null && establishedResourceManagerConnection.getResourceManagerResourceId().equals(resourceId)) {
                log.info("The heartbeat of ResourceManager with id {} timed out.", resourceId);
                reconnectToResourceManager(new TaskManagerException(
                    String.format("The heartbeat of ResourceManager with id %s timed out.", resourceId)));
            } else {
                log.debug("Received heartbeat timeout for outdated ResourceManager id {}. Ignoring the timeout.", resourceId);
            }
        }
        @Override
        public CompletableFuture<SlotReport> retrievePayload(ResourceID resourceID) { // each heartbeat carries a report of all slots of this TaskExecutor
            validateRunsInMainThread();
            return CompletableFuture.completedFuture(taskSlotTable.createSlotReport(getResourceID())); // taskManagerLocation.getResourceID()
        }
    }
}
ResourceManager sends an RPC to TaskExecutor.requestSlot() to ask the TaskExecutor to allocate a slot. Because ResourceManager knows the current allocation and free capacity of every slot, the request targets a specific SlotID:
// TaskExecutor#requestSlot()
public CompletableFuture<Acknowledge> requestSlot(
        final SlotID slotId,
        final JobID jobId,
        final AllocationID allocationId,
        final String targetAddress,
        final ResourceManagerId resourceManagerId,
        final Time timeout) {
    // TODO: Filter invalid requests from the resource manager by using the instance/registration Id
    log.info("Receive slot request {} for job {} from resource manager with leader id {}.",
        allocationId, jobId, resourceManagerId);
    try {
        // check that the requesting RM is the one this TaskExecutor is registered with
        if (!isConnectedToResourceManager(resourceManagerId)) {
            final String message = String.format("TaskManager is not connected to the resource manager %s.", resourceManagerId);
            log.debug(message);
            throw new TaskManagerException(message);
        }
        // allocate the slot if it is currently free
        if (taskSlotTable.isSlotFree(slotId.getSlotNumber())) {
            if (taskSlotTable.allocateSlot(slotId.getSlotNumber(), jobId, allocationId, taskManagerConfiguration.getTimeout())) {
                log.info("Allocated slot for {}.", allocationId);
            } else {
                log.info("Could not allocate slot for {}.", allocationId);
                throw new SlotAllocationException("Could not allocate slot.");
            }
        } else if (!taskSlotTable.isAllocated(slotId.getSlotNumber(), jobId, allocationId)) {
            // the slot is already allocated to a different job or allocation: throw
            final String message = "The slot " + slotId + " has already been allocated for a different job.";
            log.info(message);
            final AllocationID allocationID = taskSlotTable.getCurrentAllocation(slotId.getSlotNumber());
            throw new SlotOccupiedException(message, allocationID, taskSlotTable.getOwningJob(allocationID));
        }
        // offer the allocated slot to the JobManager of the requesting job
        if (jobManagerTable.contains(jobId)) { // jobManagerTable maps a jobId to its JobManagerConnection
            offerSlotsToJobManager(jobId); // already connected to the JobManager: offer the slot right away
        } else {
            // otherwise connect to the JobManager first; offerSlotsToJobManager(jobId) is called once connected
            try {
                jobLeaderService.addJob(jobId, targetAddress);
            } catch (Exception e) {
                // free the allocated slot
                try {
                    taskSlotTable.freeSlot(allocationId);
                } catch (SlotNotFoundException slotNotFoundException) {
                    // slot no longer existent, this should actually never happen, because we've
                    // just allocated the slot. So let's fail hard in this case!
                    onFatalError(slotNotFoundException);
                }
                // release local state under the allocation id.
                localStateStoresManager.releaseLocalStateForAllocationId(allocationId);
                // sanity check
                if (!taskSlotTable.isSlotFree(slotId.getSlotNumber())) {
                    onFatalError(new Exception("Could not free slot " + slotId));
                }
                throw new SlotAllocationException("Could not add job to job leader service.", e);
            }
        }
    } catch (TaskManagerException taskManagerException) {
        return FutureUtils.completedExceptionally(taskManagerException);
    }
    return CompletableFuture.completedFuture(Acknowledge.get());
}
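Stripped of RPC and error handling, the slot check in requestSlot() has three outcomes. The sketch below models just that decision for a single slot; `SlotRequestSketch` and its `Outcome` enum are hypothetical names, and strings stand in for JobID and AllocationID.

```java
import java.util.Objects;

/** Hypothetical, stripped-down model of the three outcomes of a slot request; not Flink code. */
final class SlotRequestSketch {
    enum Outcome { ALLOCATED, ALREADY_ALLOCATED_TO_REQUESTER, OCCUPIED }

    private String jobId;        // null while the slot is free
    private String allocationId; // null while the slot is free

    Outcome request(String newJobId, String newAllocationId) {
        Objects.requireNonNull(newJobId);
        Objects.requireNonNull(newAllocationId);
        if (allocationId == null) {
            // Free slot: bind it to the requesting job and allocation.
            jobId = newJobId;
            allocationId = newAllocationId;
            return Outcome.ALLOCATED;
        }
        if (newJobId.equals(jobId) && newAllocationId.equals(allocationId)) {
            // Duplicate request for the same allocation: treated as success.
            return Outcome.ALREADY_ALLOCATED_TO_REQUESTER;
        }
        // Held by someone else: the code above throws SlotOccupiedException in this case.
        return Outcome.OCCUPIED;
    }

    public static void main(String[] args) {
        SlotRequestSketch slot = new SlotRequestSketch();
        System.out.println(slot.request("job-1", "alloc-1"));
        System.out.println(slot.request("job-1", "alloc-1"));
        System.out.println(slot.request("job-2", "alloc-2"));
    }
}
```

Treating the duplicate request as success keeps a retried RPC from the ResourceManager harmless, which appears to be why the second branch exists.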
After a slot has been allocated, TaskExecutor must offer it to the JobManager. This is done by the offerSlotsToJobManager(jobId) method:
// ------------------------------------------------------------------------
// Internal job manager connection methods (offering slots to the job manager)
// ------------------------------------------------------------------------
private void offerSlotsToJobManager(final JobID jobId) {
    final JobManagerConnection jobManagerConnection = jobManagerTable.get(jobId); // connection to the job's JobManager
    if (jobManagerConnection == null) {
        log.debug("There is no job manager connection to the leader of job {}.", jobId);
    } else {
        if (taskSlotTable.hasAllocatedSlots(jobId)) {
            log.info("Offer reserved slots to the leader of job {}.", jobId);
            final JobMasterGateway jobMasterGateway = jobManagerConnection.getJobManagerGateway(); // RPC gateway of the JobManager
            // slots allocated to this job; only slots in state allocated are returned
            final Iterator<TaskSlot> reservedSlotsIterator = taskSlotTable.getAllocatedSlots(jobId);
            final JobMasterId jobMasterId = jobManagerConnection.getJobMasterId();
            final Collection<SlotOffer> reservedSlots = new HashSet<>(2);
            while (reservedSlotsIterator.hasNext()) {
                SlotOffer offer = reservedSlotsIterator.next().generateSlotOffer();
                reservedSlots.add(offer);
            }
            // offer the slots to the JobMaster over RPC
            CompletableFuture<Collection<SlotOffer>> acceptedSlotsFuture = jobMasterGateway.offerSlots(
                getResourceID(),
                reservedSlots,
                taskManagerConfiguration.getTimeout());
            // handle the JobMaster's response to the offer
            acceptedSlotsFuture.whenCompleteAsync(
                handleAcceptedSlotOffers(jobId, jobMasterGateway, jobMasterId, reservedSlots),
                getMainThreadExecutor());
        } else {
            log.debug("There are no unassigned slots for the job {}.", jobId);
        }
    }
}

// Handle the result of a slot offer
@Nonnull
private BiConsumer<Iterable<SlotOffer>, Throwable> handleAcceptedSlotOffers(JobID jobId, JobMasterGateway jobMasterGateway, JobMasterId jobMasterId, Collection<SlotOffer> offeredSlots) {
    return (Iterable<SlotOffer> acceptedSlots, Throwable throwable) -> {
        if (throwable != null) { // the offer failed
            if (throwable instanceof TimeoutException) {
                log.info("Slot offering to JobManager did not finish in time. Retrying the slot offering.");
                // We ran into a timeout. Try again.
                offerSlotsToJobManager(jobId);
            } else {
                log.warn("Slot offering to JobManager failed. Freeing the slots " +
                    "and returning them to the ResourceManager.", throwable);
                // We encountered an exception. Free the slots and return them to the RM.
                for (SlotOffer reservedSlot: offeredSlots) {
                    freeSlotInternal(reservedSlot.getAllocationId(), throwable);
                }
            }
        } else { // the offer succeeded
            // check if the response is still valid
            if (isJobManagerConnectionValid(jobId, jobMasterId)) {
                // mark accepted slots active
                for (SlotOffer acceptedSlot : acceptedSlots) {
                    try {
                        if (!taskSlotTable.markSlotActive(acceptedSlot.getAllocationId())) {
                            // the slot is either free or releasing at the moment
                            final String message = "Could not mark slot " + jobId + " active.";
                            log.debug(message);
                            jobMasterGateway.failSlot(
                                getResourceID(),
                                acceptedSlot.getAllocationId(),
                                new FlinkException(message));
                        }
                    } catch (SlotNotFoundException e) {
                        final String message = "Could not mark slot " + jobId + " active.";
                        jobMasterGateway.failSlot(
                            getResourceID(),
                            acceptedSlot.getAllocationId(),
                            new FlinkException(message));
                    }
                    offeredSlots.remove(acceptedSlot);
                }
                // free the remaining slots, which the JobManager did not accept
                final Exception e = new Exception("The slot was rejected by the JobManager.");
                for (SlotOffer rejectedSlot : offeredSlots) {
                    freeSlotInternal(rejectedSlot.getAllocationId(), e);
                }
            } else {
                // discard the response since there is a new leader for the job
                log.debug("Discard offer slot response since there is a new leader " +
                    "for the job {}.", jobId);
            }
        }
    };
}
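The success branch above splits the offered slots into accepted ones, which become active, and the rest, which are freed. A minimal sketch of that split follows; `OfferOutcome` is a hypothetical helper, allocation ids are plain strings, and unlike the code above it returns a new set instead of removing entries from the offered collection in place.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper that computes which offered slots were rejected; not Flink code. */
final class OfferOutcome {

    /** Returns the allocation ids to free: everything offered that the JobMaster did not accept. */
    static Set<String> rejected(Collection<String> offered, Collection<String> accepted) {
        Set<String> rest = new LinkedHashSet<>(offered); // copy, so the caller's collection is untouched
        rest.removeAll(accepted);
        return rest;
    }

    public static void main(String[] args) {
        List<String> offered = Arrays.asList("alloc-a", "alloc-b", "alloc-c");
        List<String> accepted = Arrays.asList("alloc-b");
        System.out.println("to free: " + rejected(offered, accepted));
    }
}
```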
The freeSlot(AllocationID, Throwable) method asks TaskExecutor to release the slot associated with the given AllocationID:
// TaskExecutor#freeSlot
public CompletableFuture<Acknowledge> freeSlot(AllocationID allocationId, Throwable cause, Time timeout) {
    freeSlotInternal(allocationId, cause); // does the actual release
    return CompletableFuture.completedFuture(Acknowledge.get());
}

private void freeSlotInternal(AllocationID allocationId, Throwable cause) {
    checkNotNull(allocationId);
    log.debug("Free slot with allocation id {} because: {}", allocationId, cause.getMessage());
    try {
        final JobID jobId = taskSlotTable.getOwningJob(allocationId);
        final int slotIndex = taskSlotTable.freeSlot(allocationId, cause); // try to free the slot bound to allocationId
        if (slotIndex != -1) { // the slot was freed
            if (isConnectedToResourceManager()) {
                // the slot was freed. Tell the RM about it
                ResourceManagerGateway resourceManagerGateway = establishedResourceManagerConnection.getResourceManagerGateway();
                resourceManagerGateway.notifySlotAvailable(
                    establishedResourceManagerConnection.getTaskExecutorRegistrationId(),
                    new SlotID(getResourceID(), slotIndex),
                    allocationId);
            }
            if (jobId != null) {
                // check whether we still have allocated slots for the same job;
                // if not, the connection to its JobMaster can be closed
                if (taskSlotTable.getAllocationIdsPerJob(jobId).isEmpty()) {
                    // we can remove the job from the job leader service
                    try {
                        jobLeaderService.removeJob(jobId);
                    } catch (Exception e) {
                        log.info("Could not remove job {} from JobLeaderService.", jobId, e);
                    }
                    closeJobManagerConnection(
                        jobId,
                        new FlinkException("TaskExecutor " + getAddress() +
                            " has no more allocated slots for job " + jobId + '.'));
                }
            }
        }
    } catch (SlotNotFoundException e) {
        log.debug("Could not free slot for allocation id {}.", allocationId, e);
    }
    localStateStoresManager.releaseLocalStateForAllocationId(allocationId);
}
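The rule that the JobMaster connection is closed once a job holds no more slots can be sketched with a per-job set of allocation ids, in the spirit of slotsPerJob. `JobSlotBook` is a hypothetical helper rather than Flink code, and strings stand in for JobID and AllocationID.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical bookkeeping helper that mirrors the role of slotsPerJob; not Flink code. */
final class JobSlotBook {
    private final Map<String, Set<String>> slotsPerJob = new HashMap<>();

    void allocate(String jobId, String allocationId) {
        slotsPerJob.computeIfAbsent(jobId, k -> new HashSet<>(4)).add(allocationId);
    }

    /** Frees one allocation; returns true if this released the job's last slot on this TaskExecutor. */
    boolean freeAndCheckIdle(String jobId, String allocationId) {
        Set<String> slots = slotsPerJob.get(jobId);
        if (slots == null || !slots.remove(allocationId)) {
            return false; // unknown job or allocation: nothing was freed
        }
        if (slots.isEmpty()) {
            slotsPerJob.remove(jobId);
            return true; // the caller may now close the JobMaster connection
        }
        return false;
    }

    public static void main(String[] args) {
        JobSlotBook book = new JobSlotBook();
        book.allocate("job-1", "alloc-1");
        book.allocate("job-1", "alloc-2");
        System.out.println(book.freeAndCheckIdle("job-1", "alloc-1")); // one slot is still held
        System.out.println(book.freeAndCheckIdle("job-1", "alloc-2")); // last slot released
    }
}
```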