Flink Source Code Analysis Series
See the series index: Flink 源码分析系列文档目录
Background
Flink's distributed cache can be used to distribute files to every TaskManager of a job. A typical use case is distributing a trained model across the cluster for a streaming inference job. Flink performs the file distribution automatically, with no user intervention, which makes it very convenient to use.
For usage instructions, see the distributed cache section of Flink 使用之配置与调优 (Flink configuration and tuning).
You can also refer to the usage example in the official documentation:
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/dataset/overview/#distributed-cache
Register files with the distributed cache:
val env = ExecutionEnvironment.getExecutionEnvironment
// register a file from HDFS
env.registerCachedFile("hdfs:///path/to/your/file", "hdfsFile")
// register a local executable file (script, executable, ...)
env.registerCachedFile("file:///path/to/exec/file", "localExecFile", true)
// define your program and execute
...
val input: DataSet[String] = ...
val result: DataSet[Integer] = input.map(new MyMapper())
...
env.execute()
Read a file from the distributed cache inside an operator running on the TaskManager:
// extend a RichFunction to have access to the RuntimeContext
class MyMapper extends RichMapFunction[String, Int] {
override def open(config: Configuration): Unit = {
// access cached file via RuntimeContext and DistributedCache
val myFile: File = getRuntimeContext.getDistributedCache.getFile("hdfsFile")
// read the file (or navigate the directory)
...
}
override def map(value: String): Int = {
// use content of cached file
...
}
}
Using the distributed cache is fairly simple, so the introduction stops here. The following sections analyze the source code of the whole distributed cache processing flow.
Registering cached files
We start the analysis of the distributed cache execution flow from the registerCachedFile method of ExecutionEnvironment.
public void registerCachedFile(String filePath, String name) {
registerCachedFile(filePath, name, false);
}
public void registerCachedFile(String filePath, String name, boolean executable) {
this.cacheFile.add(new Tuple2<>(name, new DistributedCacheEntry(filePath, executable)));
}
This method wraps the file to be cached in a DistributedCacheEntry and adds it to the cacheFile collection: the first tuple field is the cache file's name (the identifier users later pass in to fetch the file), and the second is the wrapped DistributedCacheEntry object.
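For orientation, here is a simplified sketch of the fields a DistributedCacheEntry carries, reconstructed from how the entries are used later in this article; it is not a verbatim copy of the class in org.apache.flink.api.common.cache.DistributedCache:
// Simplified sketch of DistributedCache.DistributedCacheEntry (fields only)
public static class DistributedCacheEntry {
    public String filePath;      // local (file:///) or remote (hdfs://, ...) path of the file
    public Boolean isExecutable; // whether the file should be marked executable after copying
    public Boolean isZipped;     // directories are zipped before being uploaded
    public byte[] blobKey;       // serialized PermanentBlobKey, filled in after upload to the BlobServer

    public DistributedCacheEntry(String filePath, Boolean isExecutable) {
        this(filePath, isExecutable, null, false);
    }

    public DistributedCacheEntry(
            String filePath, Boolean isExecutable, byte[] blobKey, Boolean isZipped) {
        this.filePath = filePath;
        this.isExecutable = isExecutable;
        this.blobKey = blobKey;
        this.isZipped = isZipped;
    }
}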
Note that files can also be added to the distributed cache through the pipeline.cached-files configuration option. The official documentation describes it as follows:
Files to be registered at the distributed cache under the given name. The files will be accessible from any user-defined function in the (distributed) runtime under a local path. Files may be local files (which will be distributed via BlobServer), or files in a distributed file system. The runtime will copy the files temporarily to a local cache, if needed.
Example:
name:file1,path:`file:///tmp/file1`;name:file2,path:`hdfs://
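As a minimal sketch (not from the article's source), the same registration can be expressed programmatically through PipelineOptions.CACHED_FILES; the name and path below simply reuse the file1 entry from the example above:
import java.util.Arrays;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.PipelineOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CachedFilesConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // equivalent to flink-conf.yaml:  pipeline.cached-files: name:file1,path:file:///tmp/file1
        conf.set(
                PipelineOptions.CACHED_FILES,
                Arrays.asList("name:file1,path:file:///tmp/file1"));

        // the environment reads PipelineOptions.CACHED_FILES (see the snippet below)
        // and fills its cacheFile collection from it
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        // ... define the job and call env.execute()
    }
}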
This configuration value is read in StreamExecutionEnvironment (and likewise in ExecutionEnvironment):
configuration
.getOptional(PipelineOptions.CACHED_FILES)
.ifPresent(
f -> {
this.cacheFile.clear();
this.cacheFile.addAll(DistributedCache.parseCachedFilesFromString(f));
});
Once all files to be cached are stored in the cacheFile variable, the next step is StreamGraph generation. StreamExecutionEnvironment hands cacheFile over to the StreamGraphGenerator.
private StreamGraphGenerator getStreamGraphGenerator(List<Transformation<?>> transformations) {
if (transformations.size() <= 0) {
throw new IllegalStateException(
"No operators defined in streaming topology. Cannot execute.");
}
// We copy the transformation so that newly added transformations cannot intervene with the
// stream graph generation.
return new StreamGraphGenerator(
new ArrayList<>(transformations), config, checkpointCfg, configuration)
.setStateBackend(defaultStateBackend)
.setChangelogStateBackendEnabled(changelogStateBackendEnabled)
.setSavepointDir(defaultSavepointDirectory)
.setChaining(isChainingEnabled)
.setUserArtifacts(cacheFile)
.setTimeCharacteristic(timeCharacteristic)
.setDefaultBufferTimeout(bufferTimeout)
.setSlotSharingGroupResource(slotSharingGroupResources);
}
The cacheFile list is then held by StreamGraphGenerator, where it is stored in a member variable named userArtifacts.
StreamGraphGenerator's generate method calls configureStreamGraph(streamGraph), which passes userArtifacts on to the StreamGraph.
private void configureStreamGraph(final StreamGraph graph) {
// ...
graph.setUserArtifacts(userArtifacts);
// ...
}
The flow then reaches the stage where the JobGraph is generated from the StreamGraph. In the createJobGraph method of StreamingJobGraphGenerator, the userArtifacts carried by the StreamGraph are handed over to the JobGraph.
private JobGraph createJobGraph() {
// ...
final Map<String, DistributedCache.DistributedCacheEntry> distributedCacheEntries =
JobGraphUtils.prepareUserArtifactEntries(
streamGraph.getUserArtifacts().stream()
.collect(Collectors.toMap(e -> e.f0, e -> e.f1)),
jobGraph.getJobID());
for (Map.Entry<String, DistributedCache.DistributedCacheEntry> entry :
distributedCacheEntries.entrySet()) {
jobGraph.addUserArtifact(entry.getKey(), entry.getValue());
}
// ...
}
Uploading the cached file content
Tracing the call sites of JobGraph::getUserArtifacts, two of them are worth a closer look:
- ClientUtils.extractAndUploadJobGraphFiles
- YarnClusterDescriptor.startAppMaster
The two cases are analyzed separately below.
ClientUtils
As its name suggests, ClientUtils' extractAndUploadJobGraphFiles method extracts the files referenced by the JobGraph and uploads them. It runs when the client submits the generated JobGraph.
public static void extractAndUploadJobGraphFiles(
JobGraph jobGraph, SupplierWithException<BlobClient, IOException> clientSupplier)
throws FlinkException {
List<Path> userJars = jobGraph.getUserJars();
Collection<Tuple2<String, Path>> userArtifacts =
jobGraph.getUserArtifacts().entrySet().stream()
.map(
entry ->
Tuple2.of(
entry.getKey(),
new Path(entry.getValue().filePath)))
.collect(Collectors.toList());
uploadJobGraphFiles(jobGraph, userJars, userArtifacts, clientSupplier);
}
This method calls uploadJobGraphFiles to upload both userJars and userArtifacts.
public static void uploadJobGraphFiles(
JobGraph jobGraph,
Collection<Path> userJars,
Collection<Tuple2<String, Path>> userArtifacts,
SupplierWithException<BlobClient, IOException> clientSupplier)
throws FlinkException {
if (!userJars.isEmpty() || !userArtifacts.isEmpty()) {
try (BlobClient client = clientSupplier.get()) {
uploadAndSetUserJars(jobGraph, userJars, client);
uploadAndSetUserArtifacts(jobGraph, userArtifacts, client);
} catch (IOException ioe) {
throw new FlinkException("Could not upload job files.", ioe);
}
}
jobGraph.writeUserArtifactEntriesToConfiguration();
}
The logic that uploads userArtifacts lives in the uploadAndSetUserArtifacts method, so we keep following the call chain:
private static void uploadAndSetUserArtifacts(
JobGraph jobGraph,
Collection<Tuple2<String, Path>> artifactPaths,
BlobClient blobClient)
throws IOException {
Collection<Tuple2<String, PermanentBlobKey>> blobKeys =
uploadUserArtifacts(jobGraph.getJobID(), artifactPaths, blobClient);
setUserArtifactBlobKeys(jobGraph, blobKeys);
}
The uploadUserArtifacts method uploads the files to the BlobServer; setUserArtifactBlobKeys then records the blobKey of each uploaded file in the JobGraph.
private static Collection<Tuple2<String, PermanentBlobKey>> uploadUserArtifacts(
JobID jobID, Collection<Tuple2<String, Path>> userArtifacts, BlobClient blobClient)
throws IOException {
Collection<Tuple2<String, PermanentBlobKey>> blobKeys =
new ArrayList<>(userArtifacts.size());
for (Tuple2<String, Path> userArtifact : userArtifacts) {
// only upload local files
if (!userArtifact.f1.getFileSystem().isDistributedFS()) {
final PermanentBlobKey blobKey = blobClient.uploadFile(jobID, userArtifact.f1);
blobKeys.add(Tuple2.of(userArtifact.f0, blobKey));
}
}
return blobKeys;
}
private static void setUserArtifactBlobKeys(
JobGraph jobGraph, Collection<Tuple2<String, PermanentBlobKey>> blobKeys)
throws IOException {
for (Tuple2<String, PermanentBlobKey> blobKey : blobKeys) {
jobGraph.setUserArtifactBlobKey(blobKey.f0, blobKey.f1);
}
}
Finally we reach JobGraph itself; let's look at its setUserArtifactBlobKey method:
public void setUserArtifactBlobKey(String entryName, PermanentBlobKey blobKey)
throws IOException {
byte[] serializedBlobKey;
serializedBlobKey = InstantiationUtil.serializeObject(blobKey);
userArtifacts.computeIfPresent(
entryName,
(key, originalEntry) ->
new DistributedCache.DistributedCacheEntry(
originalEntry.filePath,
originalEntry.isExecutable,
serializedBlobKey,
originalEntry.isZipped));
}
This method attaches the BlobKey of the uploaded file to the corresponding entry in the userArtifacts collection.
YarnClusterDescriptor
The other place where user cache files are uploaded is YarnClusterDescriptor::startAppMaster, which runs when a Flink on YARN cluster is started.
The snippet below is the part of startAppMaster that deals with file upload. When a job is submitted in application or yarn-session mode, jobGraph is null: in application mode the user's main method executes inside the Flink YARN cluster, so the JobGraph has not been generated yet, and yarn-session mode merely starts a Flink YARN cluster without running any job. Only per-job submission carries a non-null jobGraph, so the code below uploads cache files for per-job mode only.
For the startup process of a Flink YARN cluster, see Flink 源码之 yarn-session 启动流程.
// only for per job mode
if (jobGraph != null) {
for (Map.Entry<String, DistributedCache.DistributedCacheEntry> entry :
jobGraph.getUserArtifacts().entrySet()) {
// only upload local files
// the scheme of the file path (file:/// vs. hdfs:// etc.) tells whether the file
// lives on shared remote storage or on the local file system;
// only files on local storage are uploaded to the YARN cluster
if (!Utils.isRemotePath(entry.getValue().filePath)) {
Path localPath = new Path(entry.getValue().filePath);
// upload the local file to the YARN cluster (application directory, application-level visibility)
Tuple2<Path, Long> remoteFileInfo =
fileUploader.uploadLocalFileToRemote(localPath, entry.getKey());
// replace the cached file path stored in the jobGraph with its path on the YARN cluster
jobGraph.setUserArtifactRemotePath(
entry.getKey(), remoteFileInfo.f0.toString());
}
}
// write the cached file info back into the configuration
jobGraph.writeUserArtifactEntriesToConfiguration();
}
JobGraph's setUserArtifactRemotePath method replaces the local path stored in userArtifacts with the file's path on the YARN cluster after the upload. The code is as follows:
public void setUserArtifactRemotePath(String entryName, String remotePath) {
userArtifacts.computeIfPresent(
entryName,
(key, originalEntry) ->
new DistributedCache.DistributedCacheEntry(
remotePath,
originalEntry.isExecutable,
null,
originalEntry.isZipped));
}
Finally, JobGraph::writeUserArtifactEntriesToConfiguration writes the names, paths and related settings of the cached files into the job configuration.
public void writeUserArtifactEntriesToConfiguration() {
for (Map.Entry<String, DistributedCache.DistributedCacheEntry> userArtifact :
userArtifacts.entrySet()) {
DistributedCache.writeFileInfoToConfig(
userArtifact.getKey(), userArtifact.getValue(), jobConfiguration);
}
}
Retrieving the cached files
Cached files are retrieved on the TaskManager side. In the doRun method of the Task class, a series of background copy tasks is kicked off to copy the cached files into a local directory on the TaskManager. The relevant snippet is shown below:
// ...
// next, kick off the background copying of files for the distributed cache
try {
// read all cached file entries registered with the DistributedCache from the configuration
for (Map.Entry<String, DistributedCache.DistributedCacheEntry> entry :
DistributedCache.readFileInfoFromConfig(jobConfiguration)) {
LOG.info("Obtaining local cache file for '{}'.", entry.getKey());
// download the cached file into the local cache directory;
// the copy runs in a dedicated thread pool, and it is decided automatically,
// based on whether the entry carries a BlobKey, whether to download from the
// BlobServer or from the remote file system
Future<Path> cp =
fileCache.createTmpFile(
entry.getKey(), entry.getValue(), jobId, executionId);
// store the Future of the copy task in the distributedCacheEntries map
distributedCacheEntries.put(entry.getKey(), cp);
}
} catch (Exception e) {
throw new Exception(
String.format(
"Exception while adding files to distributed cache of task %s (%s).",
taskNameWithSubtask, executionId),
e);
}
// ...
The Task then passes distributedCacheEntries into the RuntimeEnvironment:
// distributedCacheEntries is handed to the RuntimeEnvironment here
Environment env =
new RuntimeEnvironment(
jobId,
vertexId,
executionId,
executionConfig,
taskInfo,
jobConfiguration,
taskConfiguration,
userCodeClassLoader,
memoryManager,
ioManager,
broadcastVariableManager,
taskStateManager,
aggregateManager,
accumulatorRegistry,
kvStateRegistry,
inputSplitProvider,
distributedCacheEntries,
consumableNotifyingPartitionWriters,
inputGates,
taskEventDispatcher,
checkpointResponder,
operatorCoordinatorEventGateway,
taskManagerConfig,
metrics,
this,
externalResourceInfoProvider);
The Environment is then passed from the Task into AbstractStreamOperator, and from there to StreamingRuntimeContext.
Let's look at AbstractStreamOperator's setup method:
final Environment environment = containingTask.getEnvironment();
// ...
this.runtimeContext =
new StreamingRuntimeContext(
environment,
environment.getAccumulatorRegistry().getUserMap(),
getMetricGroup(),
getOperatorID(),
getProcessingTimeService(),
null,
environment.getExternalResourceInfoProvider());
It creates a new StreamingRuntimeContext and hands the environment to it.
The constructor of StreamingRuntimeContext is shown below:
public StreamingRuntimeContext(
Environment env,
Map<String, Accumulator<?, ?>> accumulators,
OperatorMetricGroup operatorMetricGroup,
OperatorID operatorID,
ProcessingTimeService processingTimeService,
@Nullable KeyedStateStore keyedStateStore,
ExternalResourceInfoProvider externalResourceInfoProvider) {
super(
checkNotNull(env).getTaskInfo(),
env.getUserCodeClassLoader(),
env.getExecutionConfig(),
accumulators,
env.getDistributedCacheEntries(),
operatorMetricGroup);
this.taskEnvironment = env;
this.streamConfig = new StreamConfig(env.getTaskConfiguration());
this.operatorUniqueID = checkNotNull(operatorID).toString();
this.processingTimeService = processingTimeService;
this.keyedStateStore = keyedStateStore;
this.externalResourceInfoProvider = externalResourceInfoProvider;
}
Its parent class is AbstractRuntimeUDFContext; following its constructor, the code is:
public AbstractRuntimeUDFContext(
TaskInfo taskInfo,
UserCodeClassLoader userCodeClassLoader,
ExecutionConfig executionConfig,
Map<String, Accumulator<?, ?>> accumulators,
Map<String, Future<Path>> cpTasks,
OperatorMetricGroup metrics) {
this.taskInfo = checkNotNull(taskInfo);
this.userCodeClassLoader = userCodeClassLoader;
this.executionConfig = executionConfig;
this.distributedCache = new DistributedCache(checkNotNull(cpTasks));
this.accumulators = checkNotNull(accumulators);
this.metrics = metrics;
}
The copy tasks created by the Task finally reach AbstractRuntimeUDFContext, where they are wrapped into the distributedCache field.
The DistributedCache constructor is shown below:
public DistributedCache(Map<String, Future<Path>> cacheCopyTasks) {
this.cacheCopyTasks = cacheCopyTasks;
}
In the end, DistributedCache keeps the copy tasks in its cacheCopyTasks map.
Reading cached files from the DistributedCache in user code
A user operator needs to extend one of the RichXXXFunction classes. A RichXXXFunction can obtain the DistributedCache via the getDistributedCache method of its RuntimeContext and then read whatever content it needs. Example:
val demoFile = getRuntimeContext.getDistributedCache.getFile("demo")
The getRuntimeContext call here returns exactly the AbstractRuntimeUDFContext described above, and its getDistributedCache method returns the distributedCache object:
@Override
public DistributedCache getDistributedCache() {
return this.distributedCache;
}
The logic for reading a file from the cache lives in the getFile method. It waits until the file has been copied from the BlobServer or the remote file system to the TaskManager's local storage, then returns the file's local path. The code is as follows:
public File getFile(String name) {
if (name == null) {
throw new NullPointerException("name must not be null");
}
Future<Path> future = cacheCopyTasks.get(name);
if (future == null) {
throw new IllegalArgumentException(
"File with name '"
+ name
+ "' is not available."
+ " Did you forget to register the file?");
}
try {
// block until the background copy task finishes, then get the path of the copied file
final Path path = future.get();
// build a fully qualified URI (scheme, authority and path) and return it as a File
URI tmp = path.makeQualified(path.getFileSystem()).toUri();
return new File(tmp);
} catch (ExecutionException e) {
throw new RuntimeException("An error occurred while copying the file.", e.getCause());
} catch (Exception e) {
throw new RuntimeException(
"Error while getting the file registered under '"
+ name
+ "' from the distributed cache",
e);
}
}
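Putting the pieces together, below is a minimal Java sketch of an operator that reads a cached file in open(); the cache name hdfsFile reuses the registration example from the beginning of the article, and the lookup-table logic is only illustrative:
import java.io.File;
import java.nio.file.Files;
import java.util.List;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class CachedFileMapper extends RichMapFunction<String, Integer> {

    private transient List<String> cachedLines;

    @Override
    public void open(Configuration parameters) throws Exception {
        // getFile() blocks until the background copy task has finished,
        // then returns the local path of the cached file on this TaskManager
        File cached = getRuntimeContext().getDistributedCache().getFile("hdfsFile");
        cachedLines = Files.readAllLines(cached.toPath());
    }

    @Override
    public Integer map(String value) {
        // use the cached file content, here as a simple lookup table
        return cachedLines.contains(value) ? 1 : 0;
    }
}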
This post is the author's original work. Discussion and corrections are welcome. Please credit the source when reposting.