前言
在Flink原理——容错机制一文中,已对checkpoint的机制有了较为基础的介绍,本文着重从源码方面去分析checkpoint的过程。当然本文只是分析做checkpoint的调度过程,只是尽量弄清楚整体的逻辑,没有弄清楚其实现细节,还是有遗憾的,后期还是努力去分析实现细节。文中若是有误,欢迎大伙留言指出!
本文基于Flink1.9。
1、参数设置
1.1 有关checkpoint常见的参数如下:
1 StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
2 env.enableCheckpointing(10000); //默认是不开启的
3 env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE); //默认为EXACTLY_ONCE
4 env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000); //默认为0,最大值为1年
5 env.getCheckpointConfig().setCheckpointTimeout(150000); //默认为10min
6 env.getCheckpointConfig().setMaxConcurrentCheckpoints(1); //默认为1
上述参数的默认值可见flink-streaming-java*.jar中的CheckpointConfig.java,配置值是通过该类中私有configureCheckpointing()的jobGraph.setSnapshotSettings(settings)传递给runtime层的,更多设置也可以参见该类。
1.2 参数分析
这里着重分析enableCheckpointing()设置的baseInterval和minPauseBetweenCheckpoint之间的关系。为分析两者的关系,这里先给出源码中定义
1 /** The base checkpoint interval. Actual trigger time may be affected by the 2 * max concurrent checkpoints and minimum-pause values */ 3 //checkpoint触发周期,时间触发时间还受maxConcurrentCheckpointAttempts和minPauseBetweenCheckpointsNanos影响 4 private final long baseInterval; 5 6 /** The min time(in ns) to delay after a checkpoint could be triggered. Allows to 7 * enforce minimum processing time between checkpoint attempts */ 8 //在可以触发checkpoint的时,两次checkpoint之间的时间间隔 9 private final long minPauseBetweenCheckpointsNanos;
当baseInterval 从此可以看出,checkpoint的触发虽然设置为周期性的,但是实际触发情况,还得考虑minPauseBetweenCheckpoint和maxConcurrentCheckpointAttempts,若maxConcurrentCheckpointAttempts为1,就算满足触发时间也需等待正在执行的checkpoint结束。 将JobGraph提交到Dispatcher后,会createJobManagerRunner和startJobManagerRunner,可以关注Dispatcher类中的createJobManagerRunner(...)方法。 该阶段会创建一个JobManagerRunner实例,在该过程和checkpoint有关的是会启动listener去监听job的状态。 在创建调度器中核心的语句如下: 通过调用到达生成ExecutionGraph的核心类ExecutionGraphBuilder的在buildGraph()方法,其中该方法主要是生成ExecutionGraph和设置checkpoint,下面给出其中的核心代码: 在enableCheckpointing()方法中主要是创建了checkpoint失败是的manager、设置了checkpoint的核心类CheckpointCoordinator。 至此,createJobManagerRunner阶段结束了,ExecutionGraph中checkpoint的配置就设置好了。 在该阶段中,在获得leaderShip之后,就会启动startJobExecution,这里只给出调用涉及的类和方法: startJobExecution()方法是JobMaster类中的私有方法,具体代码分析如下: 在LegacyScheduler类中的方法scheduleForExecution()调度过程如下: 下面具体分析触发checkpoint的核心方法startCheckpointScheduler()。 startCheckpointScheduler()方法结合注释还是比较好理解的,但由于方法太长这里就不全部贴出来了,先分析一下大致做什么了,然后给出其核心代码: 1)检查触发checkpoint的条件。如coordinator被关闭、周期性checkpoint被禁止、在没有开启强制checkpoint的情况下没有达到最小的checkpoint间隔以及超过并发的checkpoint个数等; 2)检查是否所有需要checkpoint和需要响应checkpoint的ACK(的task都处于running状态,否则抛出异常; 3)若均符合,执行checkpointID = checkpointIdCounter.getAndIncrement();以生成一个新的checkpointID,然后生成一个PendingCheckpoint。其中,PendingCheckpoint仅是一个启动了的checkpoint,但是还没有被确认,直到所有的task都确认了本次checkpoint,该checkpoint对象才转化为一个CompletedCheckpoint; 4)调度timer清理失败的checkpoint; 5)定义一个超时callback,如果checkpoint执行了很久还没完成,就把它取消; 6)触发MasterHooks,用户可以定义一些额外的操作,用以增强checkpoint的功能(如准备和清理外部资源); 核心代码如下: 不管是否以异步的方式触发checkpoint,最终调用的方法是Execution类中的私有方法triggerCheckpointHelper(...),具体代码如下: 至此,checkpointCoordinator就将做checkpoint的命令发送到TaskManager去了,下面着重分析TM中checkpoint的执行过程。 TaskManager 接收到触发checkpoint的RPC后,会触发生成checkpoint barrier。RpcTaskManagerGateway作为消息入口,其triggerCheckpoint(...)会调用TaskExecutor的triggerCheckpoint(...),具体过程如下: 在Task类的triggerCheckpointBarrier(...)方法中生成了一个Runable匿名类用于执行checkpoint,然后以异步的方式触发了该Runable,具体代码如下: SourceStreamTask和StreamTask调用triggerCheckpoint最终都是调用StreamTask类中的triggerCheckpoint(...)方法,其核心代码为: 在performCheckpoint(...)方法中,主要有以下两件事: 1、若task是running,则可以进行checkpoint,主要有以下三件事: 1)为checkpoint做准备,一般是什么不做的,直接接受checkpoint; 2)生成barrier,并以广播的形式发射到下游去; 3)触发本task保存state; 2、若不是running,通知下游取消本次checkpoint,方法是发送一个CancelCheckpointMarker,这是类似于Barrier的另一种消息。 具体代码如下: 接下来分析checkpointState(...)过程。 checkpointState(...)方法最终会调用StreamTask类中executeCheckpointing(),其中会创建一个异步对象AsyncCheckpointRunnable,用以报告该检查点已完成,关键代码如下: 进入AsyncCheckpointRunnable(...)中的run()方法,其中会调用StreamTask类中reportCompletedSnapshotStates(...)(对于一个无状态的job返回的null),进而调用TaskStateManagerImpl类中的reportTaskStateSnapshots(...)将TM的checkpoint汇报给JM,关键代码如下: 其逻辑是逻辑是通过rpc的方式远程调JobManager的相关方法完成报告事件。 通过RpcCheckpointResponder类中acknowledgeCheckpoint(...)来响应checkpoint返回的消息,该方法之后的调度过程和涉及的核心方法如下: Important: This method should only be called in the checkpoint lock scope 至此,checkpoint的整体流程分析完毕建议结合原理去理解,参考的三篇文献都是写的很好的,有时间建议看看。 Ref: [1]https://www.jianshu.com/p/a40a1b92f6a2 [2]https://www.cnblogs.com/bethunebtj/p/9168274.html [3] https://blog.csdn.net/qq475781638/article/details/92698301 1 // it does not make sense to schedule checkpoints more often then the desired
2 // time between checkpoints
3 long baseInterval = chkConfig.getCheckpointInterval();
4 if (baseInterval < minPauseBetweenCheckpoints) {
5 baseInterval = minPauseBetweenCheckpoints;
6 }
2、checkpoint调用过程
2.1 createJobManagerRunner阶段
1 #JobManagerRunner.java
2 public JobManagerRunner(...) throws Exception {
3
4 //..........
5
6 // make sure we cleanly shut down out JobManager services if initialization fails
7 try {
8 //..........
9 //加载JobGraph、library、leader选举等
10
11 // now start the JobManager
12 //启动JobManager
13 this.jobMasterService = jobMasterFactory.createJobMasterService(jobGraph, this, userCodeLoader);
14 }
15 catch (Throwable t) {
16 //......
17 }
18 }
19
20 //在DefaultJobMasterServiceFactory类的createJobMasterService()中新建一个JobMaster对象
21 //#JobMaster.java
22 public JobMaster(...) throws Exception {
23
24 //........
25 //该方法中主要做了参数检查,slotPool的创建、slotPool的schedul的创建等一系列的事情
26
27 //创建一个调度器
28 this.schedulerNG = createScheduler(jobManagerJobMetricGroup);
29 //......
30 }
1 //#LegacyScheduler.java中的LegacyScheduler()
2 //创建ExecutionGraph
3 this.executionGraph = createAndRestoreExecutionGraph(jobManagerJobMetricGroup, checkNotNull(shuffleMaster), checkNotNull(partitionTracker));
4
5
6 private ExecutionGraph createAndRestoreExecutionGraph(
7 JobManagerJobMetricGroup currentJobManagerJobMetricGroup,
8 ShuffleMaster> shuffleMaster,
9 PartitionTracker partitionTracker) throws Exception {
10
11
12 ExecutionGraph newExecutionGraph = createExecutionGraph(currentJobManagerJobMetricGroup, shuffleMaster, partitionTracker);
13
14 final CheckpointCoordinator checkpointCoordinator = newExecutionGraph.getCheckpointCoordinator();
15
16 if (checkpointCoordinator != null) {
17 // check whether we find a valid checkpoint
18 //若state没有被恢复是否可以通过savepoint恢复
19 //......
20 }
21 }
22
23 return newExecutionGraph;
24 }
1 //..............
2 //生成ExecutionGraph的核心方法,这里后期会详细分析
3 executionGraph.attachJobGraph(sortedTopology);
4
5 //.......................
6
7 //在enableCheckpointing中设置CheckpointCoordinator
8 executionGraph.enableCheckpointing(
9 chkConfig,
10 triggerVertices,
11 ackVertices,
12 confirmVertices,
13 hooks,
14 checkpointIdCounter,
15 completedCheckpoints,
16 rootBackend,
17 checkpointStatsTracker);
1 //#ExecutionGraph.java
2 public void enableCheckpointing(
3 CheckpointCoordinatorConfiguration chkConfig,
4 List
2.2 startJobManagerRunner阶段
1 //#JobManagerRunner.java类中
2 //grantLeadership(...)==>verifyJobSchedulingStatusAndStartJobManager(...)
3 //==>startJobMaster(...),该方法中核心代码为
4 startFuture = jobMasterService.start(new JobMasterId(leaderSessionId));
5
6 //进一步调用#JobMaster.java类中的start()==>startJobExecution(...)
1 //----------------------------------------------------------------------------------------------
2 // Internal methods
3 //----------------------------------------------------------------------------------------------
4
5 //-- job starting and stopping -----------------------------------------------------------------
6
7 private Acknowledge startJobExecution(JobMasterId newJobMasterId) throws Exception {
8
9 validateRunsInMainThread();
10
11 checkNotNull(newJobMasterId, "The new JobMasterId must not be null.");
12
13 if (Objects.equals(getFencingToken(), newJobMasterId)) {
14 log.info("Already started the job execution with JobMasterId {}.", newJobMasterId);
15
16 return Acknowledge.get();
17 }
18
19 setNewFencingToken(newJobMasterId);
20 //启动slotPool并申请资源,该方法可以具体看看申请资源的过程
21 startJobMasterServices();
22
23 log.info("Starting execution of job {} ({}) under job master id {}.", jobGraph.getName(), jobGraph.getJobID(), newJobMasterId);
24 //执行ExecuteGraph的切入口,先判断job的状态是否为created的,后调执行executionGraph.scheduleForExecution();
25 resetAndStartScheduler();
26
27 return Acknowledge.get();
28 }
1 public void scheduleForExecution() throws JobException {
2
3 assertRunningInJobMasterMainThread();
4
5 final long currentGlobalModVersion = globalModVersion;
6 //任务执行之前进行状态切换从CREATED到RUNNING,
7 //transitionState(...)方法中会通过notifyJobStatusChange(newState, error)通知jobStatusListeners集合中listeners状态改变
8 if (transitionState(JobStatus.CREATED, JobStatus.RUNNING)) {
9 //根据启动算子调度模式不同,采用不同的调度方案
10 final CompletableFuture
1 // send the messages to the tasks that trigger their checkpoint
2 //遍历ExecutionVertex,是否异步触发checkpoint
3 for (Execution execution: executions) {
4 if (props.isSynchronous()) {
5 execution.triggerSynchronousSavepoint(checkpointID, timestamp, checkpointOptions, advanceToEndOfTime);
6 } else {
7 execution.triggerCheckpoint(checkpointID, timestamp, checkpointOptions);
8 }
9 }
1 //Execution.java
2 private void triggerCheckpointHelper(long checkpointId, long timestamp, CheckpointOptions checkpointOptions, boolean advanceToEndOfEventTime) {
3
4 final CheckpointType checkpointType = checkpointOptions.getCheckpointType();
5 if (advanceToEndOfEventTime && !(checkpointType.isSynchronous() && checkpointType.isSavepoint())) {
6 throw new IllegalArgumentException("Only synchronous savepoints are allowed to advance the watermark to MAX.");
7 }
8
9 final LogicalSlot slot = assignedResource;
10
11 if (slot != null) {
12 //TaskManagerGateway是用于与taskManager通信的组件
13 final TaskManagerGateway taskManagerGateway = slot.getTaskManagerGateway();
14
15 taskManagerGateway.triggerCheckpoint(attemptId, getVertex().getJobId(), checkpointId, timestamp, checkpointOptions, advanceToEndOfEventTime);
16 } else {
17 LOG.debug("The execution has no slot assigned. This indicates that the execution is no longer running.");
18 }
19 }
2.3 TaskManager中checkpoint
1 //RpcTaskManagerGateway.java
2 public void triggerCheckpoint(ExecutionAttemptID executionAttemptID, JobID jobId, long checkpointId, long timestamp, CheckpointOptions checkpointOptions, boolean advanceToEndOfEventTime) {
3 taskExecutorGateway.triggerCheckpoint(
4 executionAttemptID,
5 checkpointId,
6 timestamp,
7 checkpointOptions,
8 advanceToEndOfEventTime);
9 }
10
11 //TaskExecutor.java
12 @Override
13 public CompletableFuture
1 public void triggerCheckpointBarrier(
2 final long checkpointID,
3 final long checkpointTimestamp,
4 final CheckpointOptions checkpointOptions,
5 final boolean advanceToEndOfEventTime) {
6
7 final AbstractInvokable invokable = this.invokable;
8 //创建一个CheckpointMetaData,该对象仅有checkpointID、checkpointTimestamp两个属性
9 final CheckpointMetaData checkpointMetaData = new CheckpointMetaData(checkpointID, checkpointTimestamp);
10
11 if (executionState == ExecutionState.RUNNING && invokable != null) {
12
13 //..............
14
15 Runnable runnable = new Runnable() {
16 @Override
17 public void run() {
18 // set safety net from the task's context for checkpointing thread
19 LOG.debug("Creating FileSystem stream leak safety net for {}", Thread.currentThread().getName());
20 FileSystemSafetyNet.setSafetyNetCloseableRegistryForThread(safetyNetCloseableRegistry);
21
22 try {
23 //根据SourceStreamTask和StreamTask调用不同的方法
24 boolean success = invokable.triggerCheckpoint(checkpointMetaData, checkpointOptions, advanceToEndOfEventTime);
25 if (!success) {
26 checkpointResponder.declineCheckpoint(
27 getJobID(), getExecutionId(), checkpointID,
28 new CheckpointException("Task Name" + taskName, CheckpointFailureReason.CHECKPOINT_DECLINED_TASK_NOT_READY));
29 }
30 }
31 catch (Throwable t) {
32 if (getExecutionState() == ExecutionState.RUNNING) {
33 failExternally(new Exception(
34 "Error while triggering checkpoint " + checkpointID + " for " +
35 taskNameWithSubtask, t));
36 } else {
37 LOG.debug("Encountered error while triggering checkpoint {} for " +
38 "{} ({}) while being not in state running.", checkpointID,
39 taskNameWithSubtask, executionId, t);
40 }
41 } finally {
42 FileSystemSafetyNet.setSafetyNetCloseableRegistryForThread(null);
43 }
44 }
45 };
46 //以异步的方式触发Runnable
47 executeAsyncCallRunnable(
48 runnable,
49 String.format("Checkpoint Trigger for %s (%s).", taskNameWithSubtask, executionId));
50 }
51 else {
52 LOG.debug("Declining checkpoint request for non-running task {} ({}).", taskNameWithSubtask, executionId);
53
54 // send back a message that we did not do the checkpoint
55 checkpointResponder.declineCheckpoint(jobId, executionId, checkpointID,
56 new CheckpointException("Task name with subtask : " + taskNameWithSubtask, CheckpointFailureReason.CHECKPOINT_DECLINED_TASK_NOT_READY));
57 }
58 }
1 //#StreamTask.java
2 return performCheckpoint(checkpointMetaData, checkpointOptions, checkpointMetrics, advanceToEndOfEventTime);
1 //#StreamTask.java
2 private boolean performCheckpoint(
3 CheckpointMetaData checkpointMetaData,
4 CheckpointOptions checkpointOptions,
5 CheckpointMetrics checkpointMetrics,
6 boolean advanceToEndOfTime) throws Exception {
7 //......
8
9 synchronized (lock) {
10 if (isRunning) {
11
12 if (checkpointOptions.getCheckpointType().isSynchronous()) {
13 syncSavepointLatch.setCheckpointId(checkpointId);
14
15 if (advanceToEndOfTime) {
16 advanceToEndOfEventTime();
17 }
18 }
19
20 // All of the following steps happen as an atomic step from the perspective of barriers and
21 // records/watermarks/timers/callbacks.
22 // We generally try to emit the checkpoint barrier as soon as possible to not affect downstream
23 // checkpoint alignments
24
25 // Step (1): Prepare the checkpoint, allow operators to do some pre-barrier work.
26 // The pre-barrier work should be nothing or minimal in the common case.
27 operatorChain.prepareSnapshotPreBarrier(checkpointId);
28
29 // Step (2): Send the checkpoint barrier downstream
30 operatorChain.broadcastCheckpointBarrier(
31 checkpointId,
32 checkpointMetaData.getTimestamp(),
33 checkpointOptions);
34
35 // Step (3): Take the state snapshot. This should be largely asynchronous, to not
36 // impact progress of the streaming topology
37 checkpointState(checkpointMetaData, checkpointOptions, checkpointMetrics);
38
39 return true;
40 }
41 else {
42 //.......
43 }
44 }
45 }
1 //#StreamTask.java类中executeCheckpointing()
2 public void executeCheckpointing() throws Exception {
3 startSyncPartNano = System.nanoTime();
4
5 try {
6 //调用StreamOperator进行snapshotState的入口方法,依算子不同而变
7 for (StreamOperator> op : allOperators) {
8 checkpointStreamOperator(op);
9 }
10 //.........
11
12 // we are transferring ownership over snapshotInProgressList for cleanup to the thread, active on submit
13 AsyncCheckpointRunnable asyncCheckpointRunnable = new AsyncCheckpointRunnable(
14 owner,
15 operatorSnapshotsInProgress,
16 checkpointMetaData,
17 checkpointMetrics,
18 startAsyncPartNano);
19
20 owner.cancelables.registerCloseable(asyncCheckpointRunnable);
21 owner.asyncOperationsThreadPool.execute(asyncCheckpointRunnable);
22
23 //.........
24 } catch (Exception ex) {
25 //.......
26 }
27 }
1 //TaskStateManagerImpl.java
2 checkpointResponder.acknowledgeCheckpoint(
3 jobId,
4 executionAttemptID,
5 checkpointId,
6 checkpointMetrics,
7 acknowledgedState);
2.4 JobManager处理checkpoint
1 //#JobMaster类中acknowledgeCheckpoint==>
2 //#LegacyScheduler类中acknowledgeCheckpoint==>
3 //#CheckpointCoordinator类中receiveAcknowledgeMessage(...)==>
4 //completePendingCheckpoint(checkpoint);
5
6 //