Design and Implementation of JobGraphStore

Contents

  • 1. Creating and Starting the JobGraphStore
  • 2. Recovering JobGraphs When a Session Cluster Starts

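JobGraphStore stores JobGraphs. If the cluster goes down, the JobGraphs that were previously submitted and running can be recovered from the JobGraphStore, so the jobs submitted to the cluster can resume running.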

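When the "factory container" creates the "super component", one of the steps is to create the DispatcherRunner. A key argument passed when creating the DispatcherRunner is HaServicesJobGraphStoreFactory. It is an implementation of JobGraphStoreFactory and builds the JobGraphStore through the factory pattern.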

// Create the DispatcherRunner; it is started later by the LeaderElectionService.
// The DispatcherRunner is created by the DispatcherRunnerFactory, and the Dispatcher component relies on the DispatcherRunner to start and run.
// The DispatcherRunner gives the Dispatcher its start-up, run, and leader-election capabilities.
dispatcherRunner = dispatcherRunnerFactory.createDispatcherRunner(
    highAvailabilityServices.getDispatcherLeaderElectionService(), // obtain the Dispatcher's leader election service from the HA services
    fatalErrorHandler,
    // Create the JobGraphStoreFactory implementation, which in turn creates the JobGraphStore.
    // The DispatcherLeaderProcess registers itself as a JobGraphListener on the JobGraphStore,
    // so additions and removals of JobGraphs in the store are reported to the DispatcherLeaderProcess.
    new HaServicesJobGraphStoreFactory(highAvailabilityServices),
    ioExecutor,
    rpcService,
    partialDispatcherServices);

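Among the JobGraphStore implementations in the Flink version discussed here, only ZooKeeperJobGraphStore persists and recovers JobGraphs. The JobGraphStore uses a JobGraphListener to report JobGraphs being added to or removed from the store, and the listening party is the DispatcherLeaderProcess. When the set of JobGraphs in the store changes, the JobGraphListener notifies the DispatcherLeaderProcess, which then decides whether the job for that JobGraph needs to be started or stopped.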
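The listener relationship described above can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in (the `Listener` interface, the `InMemoryStore` class, and the string job IDs are not Flink classes), and it keeps state in memory where a real HA store would persist it; it only shows how a store might notify a registered listener about added and removed entries.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JobGraphListenerSketch {

    /** Hypothetical listener: one callback per kind of change in the store. */
    interface Listener {
        void onAdded(String jobId);
        void onRemoved(String jobId);
    }

    /** Hypothetical in-memory store; a real HA store would persist its entries. */
    static class InMemoryStore {
        private final Map<String, String> graphs = new HashMap<>();
        private Listener listener;

        /** Registers the listener, loosely mirroring jobGraphStore.start(listener). */
        void start(Listener listener) {
            this.listener = listener;
        }

        void put(String jobId, String graph) {
            // Only a previously unknown job ID counts as an addition.
            boolean isNew = graphs.put(jobId, graph) == null;
            if (isNew && listener != null) {
                listener.onAdded(jobId);
            }
        }

        void remove(String jobId) {
            // Notify only if something was actually removed.
            if (graphs.remove(jobId) != null && listener != null) {
                listener.onRemoved(jobId);
            }
        }
    }

    /** Records the reactions a leader process might have to store changes. */
    public static List<String> demo() {
        List<String> events = new ArrayList<>();
        InMemoryStore store = new InMemoryStore();
        store.start(new Listener() {
            @Override
            public void onAdded(String jobId) {
                events.add("start " + jobId);
            }

            @Override
            public void onRemoved(String jobId) {
                events.add("stop " + jobId);
            }
        });
        store.put("job-1", "graph-1");
        store.put("job-1", "graph-1b"); // update of a known job: no event
        store.remove("job-1");
        store.remove("job-1");          // already gone: no event
        return events;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [start job-1, stop job-1]
    }
}
```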

1. Creating and Starting the JobGraphStore

The JobGraphStoreFactory is already created when the DispatcherRunner is built:

new HaServicesJobGraphStoreFactory(highAvailabilityServices)

When the DispatcherLeaderProcess is created, the JobGraphStore is created along with it through the factory:

/**
 * Create the DispatcherLeaderProcess through the DispatcherLeaderProcessFactory.
 */
@Override
public DispatcherLeaderProcess create(UUID leaderSessionID) {
    return SessionDispatcherLeaderProcess.create(
        leaderSessionID,
        dispatcherGatewayServiceFactory,
        // Create the JobGraphStore through the factory
        jobGraphStoreFactory.create(),
        ioExecutor,
        fatalErrorHandler);
}

HaServicesJobGraphStoreFactory is an implementation of JobGraphStoreFactory; its create method delegates to the high-availability services to obtain the JobGraphStore:

/**
 * Create the JobGraphStore.
 */
@Override
public JobGraphStore create() {
    try {
        // HighAvailabilityServices provides the method that returns the JobGraphStore
        return highAvailabilityServices.getJobGraphStore();
    } catch (Exception e) {
        throw new FlinkRuntimeException(
            String.format(
                "Could not create %s from %s.",
                JobGraphStore.class.getSimpleName(),
                highAvailabilityServices.getClass().getSimpleName()),
            e);
    }
}
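The delegation and error wrapping above can be tried out in isolation with a small sketch. The types below (`Store`, `HaServices`, `WorkingHa`, `FailingHa`, `StoreFactory`) are hypothetical stand-ins rather than Flink classes, and a plain `RuntimeException` replaces `FlinkRuntimeException`; the point is only that the factory hides the checked exception of the HA services behind an unchecked one with a descriptive message.

```java
public class JobGraphStoreFactorySketch {

    /** Hypothetical stand-in for the store type. */
    interface Store {}

    /** Hypothetical stand-in for the HA services, which may fail with a checked exception. */
    interface HaServices {
        Store getStore() throws Exception;
    }

    static class WorkingHa implements HaServices {
        @Override
        public Store getStore() {
            return new Store() {};
        }
    }

    static class FailingHa implements HaServices {
        @Override
        public Store getStore() throws Exception {
            throw new Exception("backend unreachable");
        }
    }

    /** Factory that delegates to the HA services and wraps any failure. */
    static class StoreFactory {
        private final HaServices haServices;

        StoreFactory(HaServices haServices) {
            this.haServices = haServices;
        }

        Store create() {
            try {
                return haServices.getStore();
            } catch (Exception e) {
                // Same message shape as the original snippet, with an unchecked wrapper.
                throw new RuntimeException(
                    String.format(
                        "Could not create %s from %s.",
                        Store.class.getSimpleName(),
                        haServices.getClass().getSimpleName()),
                    e);
            }
        }
    }

    /** Returns "created" on success, or the wrapped exception's message on failure. */
    public static String tryCreate(boolean failing) {
        HaServices haServices = failing ? new FailingHa() : new WorkingHa();
        try {
            new StoreFactory(haServices).create();
            return "created";
        } catch (RuntimeException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryCreate(false)); // prints created
        System.out.println(tryCreate(true));  // prints Could not create Store from FailingHa.
    }
}
```

Because the factory's `create()` declares no checked exception, callers such as the leader-process factory can invoke it inline as an argument expression.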

2. Recovering JobGraphs When a Session Cluster Starts

When the SessionDispatcherLeaderProcess starts, it first starts the JobGraphStore, passing itself in as the listener:

/**
 * Start the JobGraphStore.
 */
private void startServices() {
    try {
        jobGraphStore.start(this);
    } catch (Exception e) {
        throw new FlinkRuntimeException(
            String.format(
                "Could not start %s when trying to start the %s.",
                jobGraphStore.getClass().getSimpleName(),
                getClass().getSimpleName()),
            e);
    }
}

Next, the JobGraphs are recovered from the JobGraphStore asynchronously:

/**
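 * Recover the JobGraphs from the JobGraphStore. The method body itself is synchronous;
 * the asynchrony comes from the caller running it on another thread.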
 */
private Collection<JobGraph> recoverJobs() {
    log.info("Recover all persisted job graphs.");
    // Get the list of JobIDs from the JobGraphStore
    final Collection<JobID> jobIds = getJobIds();
    final Collection<JobGraph> recoveredJobGraphs = new ArrayList<>();

    for (JobID jobId : jobIds) {
        // Look up the JobGraph for this JobID in the JobGraphStore and add it to the list
        recoveredJobGraphs.add(recoverJob(jobId));
    }

    log.info("Successfully recovered {} persisted job graphs.", recoveredJobGraphs.size());

    // Return the list of recovered JobGraphs
    return recoveredJobGraphs;
}
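The recovery loop can be sketched independently of Flink. The sketch assumes that the caller hands the synchronous loop to an I/O executor through `CompletableFuture.supplyAsync`; the map that plays the role of the persisted store, the string job IDs, and the method names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RecoverJobsSketch {

    /** Synchronous loop: list the job IDs, then load the graph for each one. */
    static Collection<String> recoverJobs(Map<String, String> store) {
        Collection<String> recovered = new ArrayList<>();
        for (String jobId : store.keySet()) {
            recovered.add(store.get(jobId));
        }
        return recovered;
    }

    /** Runs the loop on a separate I/O thread and waits for the result. */
    public static List<String> recoverAsync() {
        // Hypothetical persisted state; a TreeMap keeps the iteration order deterministic.
        Map<String, String> store = new TreeMap<>();
        store.put("job-a", "graph-a");
        store.put("job-b", "graph-b");

        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<Collection<String>> future =
                CompletableFuture.supplyAsync(() -> recoverJobs(store), ioExecutor);
            // join() blocks here only for the demo; a real caller would chain further stages.
            return new ArrayList<>(future.join());
        } finally {
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(recoverAsync()); // prints [graph-a, graph-b]
    }
}
```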

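Once the JobGraphs have been obtained, a Dispatcher is created to schedule and execute them. In essence, all JobGraphs to be recovered are placed into a HashSet held by the Dispatcher. The Dispatcher then iterates over this set, dispatches each JobGraph, and arranges for a JobManager to execute it…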

/**
 * Recover the JobGraphs so that the Dispatcher schedules and executes them again.
 */
private void startRecoveredJobs() {
    for (JobGraph recoveredJob : recoveredJobs) {
        // Let the Dispatcher schedule the JobGraph again (arrange for a JobManager to execute it)
        FutureUtils.assertNoException(runJob(recoveredJob)
                                      .handle(handleRecoveredJobStartError(recoveredJob.getJobID())));
    }

    // Every recovered JobGraph has been handed to runJob at this point, so empty the whole set in one go
    recoveredJobs.clear();
}
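The iterate-then-clear behavior can be checked with a small sketch. `MiniDispatcher` and its string "job graphs" are hypothetical and much simpler than Flink's Dispatcher; the sketch only shows that the recovered jobs are copied into a HashSet (which drops duplicates), each one is passed to `runJob`, and the set is emptied once after the loop rather than one element at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

public class StartRecoveredJobsSketch {

    /** Hypothetical, heavily simplified dispatcher. */
    static class MiniDispatcher {
        private final Set<String> recoveredJobs;
        private final List<String> started = new ArrayList<>();
        private final List<String> failures = new ArrayList<>();

        MiniDispatcher(Collection<String> recovered) {
            // Copy into a HashSet so that duplicate entries collapse.
            this.recoveredJobs = new HashSet<>(recovered);
        }

        /** Pretends to start a job; completes immediately in this sketch. */
        CompletableFuture<Void> runJob(String job) {
            started.add(job);
            return CompletableFuture.completedFuture(null);
        }

        void startRecoveredJobs() {
            for (String job : recoveredJobs) {
                // Record a start-up failure instead of silently dropping it.
                runJob(job).handle((ignored, error) -> {
                    if (error != null) {
                        failures.add(job);
                    }
                    return null;
                });
            }
            // Cleared once, after every job has been submitted.
            recoveredJobs.clear();
        }
    }

    /** Returns {number of started jobs, number of jobs still pending}. */
    public static int[] demo() {
        MiniDispatcher dispatcher =
            new MiniDispatcher(Arrays.asList("graph-a", "graph-b", "graph-a"));
        dispatcher.startRecoveredJobs();
        return new int[] {dispatcher.started.size(), dispatcher.recoveredJobs.size()};
    }

    public static void main(String[] args) {
        int[] result = demo();
        System.out.println(result[0] + " started, " + result[1] + " pending"); // prints 2 started, 0 pending
    }
}
```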
