一文搞定 Flink Job 的运行过程

背景

之前我们知道了Flink 是如何生成 StreamGraph 以及 如何生成 job 和 如何生成Task,现在我们通过 Flink Shell 将他们串起来,这样我们就学习了从写代码开始到 Flink 运行 task 的整个过程是怎么样的。

正文

我们经常通过 Flink Shell 提交代码,如 flink run -p 2 -m yarn-cluster -ynm test -c test ./test-1.0-SNAPSHOT.jar "file" "./test.properties"&通过 flink shell 我们可以知道 org.apache.flink.client.cli.CliFrontend 为整个 Flink Job 的入口类

/**
     * Submits the job based on the arguments.
     */
    public static void main(final String[] args) {
        EnvironmentInformation.logEnvironmentInfo(LOG, "Command Line Client", args);

        // 1. find the configuration directory
        final String configurationDirectory = getConfigurationDirectoryFromEnv();

        // 2. load the global configuration
        final Configuration configuration = GlobalConfiguration.loadConfiguration(configurationDirectory);

        // 3. load the custom command lines
        final List> customCommandLines = loadCustomCommandLines(
            configuration,
            configurationDirectory);

        try {
            final CliFrontend cli = new CliFrontend(
                configuration,
                customCommandLines);

            SecurityUtils.install(new SecurityConfiguration(cli.configuration));
            int retCode = SecurityUtils.getInstalledContext()
                    .runSecured(() -> cli.parseParameters(args));
            System.exit(retCode);
        }
        catch (Throwable t) {
            final Throwable strippedThrowable = ExceptionUtils.stripException(t, UndeclaredThrowableException.class);
            LOG.error("Fatal error while running command line interface.", strippedThrowable);
            strippedThrowable.printStackTrace();
            System.exit(31);
        }
    }

main 很简单,主要就两步,发现并加载配置文件,加载并解析命令。在解析命令的过程当中,如果传入的命令是 run,则可以一直追踪到 executeProgram 方法

protected void executeProgram(PackagedProgram program, ClusterClient client, int parallelism) throws ProgramMissingJobException, ProgramInvocationException {
        logAndSysout("Starting execution of program");
        final JobSubmissionResult result = client.run(program, parallelism);
        ......
    }

通过 client run 方法来执行,最终调用我们传入的主方法( 通过 -c 参数),然后就开始执行用户代码了,首先会构建 StreamGraph ,最终调用 StreamContextEnvironment execute(String jobName) 方法

@Override
    public JobExecutionResult execute(String jobName) throws Exception {
        
        Preconditions.checkNotNull(jobName, "Streaming Job name should not be null.");

        StreamGraph streamGraph = this.getStreamGraph();
        streamGraph.setJobName(jobName);

        transformations.clear();

        // execute the programs 存在 -d 时
        if (ctx instanceof DetachedEnvironment) {
            LOG.warn("Job was executed in detached mode, the results will be available on completion.");
            ((DetachedEnvironment) ctx).setDetachedPlan(streamGraph);
            return DetachedEnvironment.DetachedJobExecutionResult.INSTANCE;
        } else {
            return ctx
                .getClient()
                .run(streamGraph, ctx.getJars(), ctx.getClasspaths(), ctx.getUserCodeClassLoader(), ctx.getSavepointRestoreSettings())
                .getJobExecutionResult();
        }
    }

然后

public JobSubmissionResult run(FlinkPlan compiledPlan,
            List libraries, List classpaths, ClassLoader classLoader, SavepointRestoreSettings savepointSettings)
            throws ProgramInvocationException {
        // 构建 jobGraph
        JobGraph job = getJobGraph(flinkConfig, compiledPlan, libraries, classpaths, savepointSettings);
        //将 job 提交至 cluster 上
        return submitJob(job, classLoader);
    }

主要就是构建 jobGraph ,关于构建 jobGraph 的细节可以参考 如何构建 job ,构建成功之后就开始提交 job 了。我们以 MiniCluster 为例

@Override
    public JobSubmissionResult submitJob(JobGraph jobGraph, ClassLoader classLoader) throws ProgramInvocationException {
        final CompletableFuture jobSubmissionResultFuture = submitJob(jobGraph);
        ......
    }
public CompletableFuture submitJob(JobGraph jobGraph) {
        final CompletableFuture dispatcherGatewayFuture = getDispatcherGatewayFuture();
        
        // we have to allow queued scheduling in Flip-6 mode because we need to request slots
        // from the ResourceManager
        jobGraph.setAllowQueuedScheduling(true);
        
        final CompletableFuture blobServerAddressFuture = createBlobServerAddress(dispatcherGatewayFuture);
        
        // cache jars and files
        final CompletableFuture jarUploadFuture = uploadAndSetJobFiles(blobServerAddressFuture, jobGraph);
        
        final CompletableFuture acknowledgeCompletableFuture = jarUploadFuture
            .thenCombine(
                dispatcherGatewayFuture,
                // 这里真正 submit 操作,交给了 dispatcher 去执行
                (Void ack, DispatcherGateway dispatcherGateway) -> dispatcherGateway.submitJob(jobGraph, rpcTimeout))
            .thenCompose(Function.identity());
        
        return acknowledgeCompletableFuture.thenApply(
            (Acknowledge ignored) -> new JobSubmissionResult(jobGraph.getJobID()));
    }

接下来就到了 job 正式运行的时候了

private CompletableFuture runJob(JobGraph jobGraph) {
        Preconditions.checkState(!jobManagerRunnerFutures.containsKey(jobGraph.getJobID()));

        //创建 jobManagerRunner 同时也会创建 jobMaster,在创建 JobMaster 的时候构建了 ExecutionGraph
        final CompletableFuture jobManagerRunnerFuture = createJobManagerRunner(jobGraph);

        jobManagerRunnerFutures.put(jobGraph.getJobID(), jobManagerRunnerFuture);

        // start jobManagerRunner 同时也启动了 jobMaster 等一系列 service,然后就开始调度 executionGraph,execution.deploy task.start
        return jobManagerRunnerFuture
            .thenApply(FunctionUtils.uncheckedFunction(this::startJobManagerRunner))
            .thenApply(FunctionUtils.nullFn())
            .whenCompleteAsync(
                (ignored, throwable) -> {
                    if (throwable != null) {
                        jobManagerRunnerFutures.remove(jobGraph.getJobID());
                    }
                },
                getMainThreadExecutor());
    }

这部分内容与 如何构建Job 是一致的,省略若干,具体可以参考 如何构建 job ,需要强调一点就是当 执行到 ExecutionGraph 的 scheduleForExecution方法时

// 调度 execution
    public void scheduleForExecution() throws JobException {

        assertRunningInJobMasterMainThread();

        final long currentGlobalModVersion = globalModVersion;

        // 会启动 startCheckpointScheduler
        if (transitionState(JobStatus.CREATED, JobStatus.RUNNING)) {
            ......
    }

会启动 CheckpointScheduler 从而开始出发 checkpoint。
接下来就开始部署,可以参考 如何构建 job 和 如何生成Task
至此为止,从写代码到代码的计算执行,整个过程我们都已经学习清楚了。

总结

一文搞定 Flink Job 的运行过程_第1张图片
在这里插入图片描述

你可能感兴趣的:(一文搞定 Flink Job 的运行过程)