Flink Source Code: Distributed Cache

This post is part of the Flink Source Code Analysis series; for the full list of posts, see the series index (Flink 源码分析系列文档目录).

Background

Flink's Distributed Cache distributes files to every TaskManager that runs a job. A typical scenario is a streaming inference job that needs a trained model shipped to the whole cluster. Flink performs the distribution automatically, with no user intervention, which makes the feature very convenient.

For usage instructions, see the distributed cache section of 'Flink 使用之配置与调优' (Flink Usage: Configuration and Tuning).

The official documentation also provides a usage example:

https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/dataset/overview/#distributed-cache

Registering files in the distributed cache:

val env = ExecutionEnvironment.getExecutionEnvironment

// register a file from HDFS
env.registerCachedFile("hdfs:///path/to/your/file", "hdfsFile")

// register a local executable file (script, executable, ...)
env.registerCachedFile("file:///path/to/exec/file", "localExecFile", true)

// define your program and execute
...
val input: DataSet[String] = ...
val result: DataSet[Integer] = input.map(new MyMapper())
...
env.execute()

Reading a file from the distributed cache inside an operator on the TaskManager:

// extend a RichFunction to have access to the RuntimeContext
class MyMapper extends RichMapFunction[String, Int] {

  override def open(config: Configuration): Unit = {

    // access cached file via RuntimeContext and DistributedCache
    val myFile: File = getRuntimeContext.getDistributedCache.getFile("hdfsFile")
    // read the file (or navigate the directory)
    ...
  }

  override def map(value: String): Int = {
    // use content of cached file
    ...
  }
}

Using the distributed cache is this simple, so the introduction stops here. The following sections analyze the source code of the entire distributed cache processing flow.

Registering cached files

We start the analysis of the Distributed Cache execution flow from the `registerCachedFile` method of `ExecutionEnvironment`.

public void registerCachedFile(String filePath, String name) {
    registerCachedFile(filePath, name, false);
}


public void registerCachedFile(String filePath, String name, boolean executable) {
    this.cacheFile.add(new Tuple2<>(name, new DistributedCacheEntry(filePath, executable)));
}

These methods wrap the file to be cached in a `DistributedCacheEntry` and add it to the `cacheFile` collection as a tuple: the first field is the cache file name (the identifier users later pass to fetch the file), the second the wrapped `DistributedCacheEntry` object.

Note that files can also be added to the distributed cache through the `pipeline.cached-files` configuration option. The official description reads:

Files to be registered at the distributed cache under the given name. The files will be accessible from any user-defined function in the (distributed) runtime under a local path. Files may be local files (which will be distributed via BlobServer), or files in a distributed file system. The runtime will copy the files temporarily to a local cache, if needed.

Example:
name:file1,path:`file:///tmp/file1`;name:file2,path:`hdfs://
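
To make the option's format concrete, here is a minimal sketch that registers the same two files as the API example above, but via `pipeline.cached-files` (a sketch only; it assumes the `executable` flag uses the same `key:value` syntax as `name` and `path`):

Configuration conf = new Configuration();
// each list element describes one cache entry in key:value form
conf.set(PipelineOptions.CACHED_FILES, Arrays.asList(
        "name:hdfsFile,path:hdfs:///path/to/your/file",
        "name:localExecFile,path:file:///path/to/exec/file,executable:true"));
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);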

This configuration value is read in `StreamExecutionEnvironment` (and likewise in `ExecutionEnvironment`):

configuration
        .getOptional(PipelineOptions.CACHED_FILES)
        .ifPresent(
                f -> {
                    this.cacheFile.clear();
                    this.cacheFile.addAll(DistributedCache.parseCachedFilesFromString(f));
                });

Note that when this option is set, any files registered earlier are discarded first (`cacheFile.clear()`) before the configured ones are added. Once all files to be cached are recorded in the `cacheFile` variable, the next step is StreamGraph generation: `StreamExecutionEnvironment` passes `cacheFile` to the `StreamGraphGenerator`.

private StreamGraphGenerator getStreamGraphGenerator(List<Transformation<?>> transformations) {
    if (transformations.size() <= 0) {
        throw new IllegalStateException(
                "No operators defined in streaming topology. Cannot execute.");
    }

    // We copy the transformation so that newly added transformations cannot intervene with the
    // stream graph generation.
    return new StreamGraphGenerator(
                    new ArrayList<>(transformations), config, checkpointCfg, configuration)
            .setStateBackend(defaultStateBackend)
            .setChangelogStateBackendEnabled(changelogStateBackendEnabled)
            .setSavepointDir(defaultSavepointDirectory)
            .setChaining(isChainingEnabled)
            .setUserArtifacts(cacheFile)
            .setTimeCharacteristic(timeCharacteristic)
            .setDefaultBufferTimeout(bufferTimeout)
            .setSlotSharingGroupResource(slotSharingGroupResources);
}

`cacheFile` is now held by the `StreamGraphGenerator`; there the member variable is named `userArtifacts`.
The `generate` method of `StreamGraphGenerator` calls `configureStreamGraph(streamGraph)`, which hands `userArtifacts` over to the `StreamGraph`:

private void configureStreamGraph(final StreamGraph graph) {
    // ...
    graph.setUserArtifacts(userArtifacts);
    // ...
}

The flow then reaches the stage where the `JobGraph` is generated from the `StreamGraph`. Looking at the `createJobGraph` method of `StreamingJobGraphGenerator`, we see that it passes the `userArtifacts` carried by the `StreamGraph` to the `JobGraph`:

private JobGraph createJobGraph() {
    // ...
    final Map<String, DistributedCache.DistributedCacheEntry> distributedCacheEntries =
        JobGraphUtils.prepareUserArtifactEntries(
                streamGraph.getUserArtifacts().stream()
                        .collect(Collectors.toMap(e -> e.f0, e -> e.f1)),
                jobGraph.getJobID());

    for (Map.Entry<String, DistributedCache.DistributedCacheEntry> entry :
            distributedCacheEntries.entrySet()) {
        jobGraph.addUserArtifact(entry.getKey(), entry.getValue());
    }
    // ...
}

Uploading cached file contents

Tracing the call sites of `JobGraph::getUserArtifacts`, two of them are worth a closer look:

  • ClientUtils.extractAndUploadJobGraphFiles
  • YarnClusterDescriptor.startAppMaster

The two cases are analyzed separately below.

ClientUtils

As its name suggests, the `extractAndUploadJobGraphFiles` method of `ClientUtils` extracts and uploads the files referenced by a `JobGraph`. It runs when the client submits the generated `JobGraph`.

public static void extractAndUploadJobGraphFiles(
        JobGraph jobGraph, SupplierWithException<BlobClient, IOException> clientSupplier)
        throws FlinkException {
    List<Path> userJars = jobGraph.getUserJars();
    Collection<Tuple2<String, Path>> userArtifacts =
            jobGraph.getUserArtifacts().entrySet().stream()
                    .map(
                            entry ->
                                    Tuple2.of(
                                            entry.getKey(),
                                            new Path(entry.getValue().filePath)))
                    .collect(Collectors.toList());

    uploadJobGraphFiles(jobGraph, userJars, userArtifacts, clientSupplier);
}
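
For context, the `clientSupplier` is typically a lambda that opens a `BlobClient` against the cluster's BlobServer, along these lines (a sketch modeled on how the runtime invokes this method; the address and port are placeholder assumptions, in real code they are resolved from the running cluster):

InetSocketAddress blobServerAddress = new InetSocketAddress("jobmanager-host", 50100); // assumed address
ClientUtils.extractAndUploadJobGraphFiles(
        jobGraph, () -> new BlobClient(blobServerAddress, configuration));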

This method calls `uploadJobGraphFiles` to upload both `userJars` and `userArtifacts`:

public static void uploadJobGraphFiles(
        JobGraph jobGraph,
        Collection<Path> userJars,
        Collection<Tuple2<String, Path>> userArtifacts,
        SupplierWithException<BlobClient, IOException> clientSupplier)
        throws FlinkException {
    if (!userJars.isEmpty() || !userArtifacts.isEmpty()) {
        try (BlobClient client = clientSupplier.get()) {
            uploadAndSetUserJars(jobGraph, userJars, client);
            uploadAndSetUserArtifacts(jobGraph, userArtifacts, client);
        } catch (IOException ioe) {
            throw new FlinkException("Could not upload job files.", ioe);
        }
    }
    jobGraph.writeUserArtifactEntriesToConfiguration();
}

The logic for uploading `userArtifacts` sits in the `uploadAndSetUserArtifacts` method; we keep following the call chain:

private static void uploadAndSetUserArtifacts(
        JobGraph jobGraph,
        Collection<Tuple2<String, Path>> artifactPaths,
        BlobClient blobClient)
        throws IOException {
    Collection<Tuple2<String, PermanentBlobKey>> blobKeys =
            uploadUserArtifacts(jobGraph.getJobID(), artifactPaths, blobClient);
    setUserArtifactBlobKeys(jobGraph, blobKeys);
}

The `uploadUserArtifacts` method uploads the files to the BlobServer. `setUserArtifactBlobKeys` then records each file's `blobKey` in the `JobGraph`:

private static Collection<Tuple2<String, PermanentBlobKey>> uploadUserArtifacts(
        JobID jobID, Collection<Tuple2<String, Path>> userArtifacts, BlobClient blobClient)
        throws IOException {
    Collection<Tuple2<String, PermanentBlobKey>> blobKeys =
            new ArrayList<>(userArtifacts.size());
    for (Tuple2<String, Path> userArtifact : userArtifacts) {
        // only upload local files
        if (!userArtifact.f1.getFileSystem().isDistributedFS()) {
            final PermanentBlobKey blobKey = blobClient.uploadFile(jobID, userArtifact.f1);
            blobKeys.add(Tuple2.of(userArtifact.f0, blobKey));
        }
    }
    return blobKeys;
}

private static void setUserArtifactBlobKeys(
        JobGraph jobGraph, Collection<Tuple2<String, PermanentBlobKey>> blobKeys)
        throws IOException {
    for (Tuple2<String, PermanentBlobKey> blobKey : blobKeys) {
        jobGraph.setUserArtifactBlobKey(blobKey.f0, blobKey.f1);
    }
}
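
Note the `isDistributedFS` check in `uploadUserArtifacts`: only files on a non-distributed (local) file system are pushed to the BlobServer; files already on shared storage are left where they are, to be read by the TaskManagers directly. A quick illustration (the paths are made-up examples):

// the scheme of the path decides which FileSystem implementation answers
new Path("file:///tmp/model.bin").getFileSystem().isDistributedFS();    // false -> uploaded to the BlobServer
new Path("hdfs:///models/model.bin").getFileSystem().isDistributedFS(); // true  -> not uploaded here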

Finally we arrive at `JobGraph`. Let's look at its `setUserArtifactBlobKey` method:

public void setUserArtifactBlobKey(String entryName, PermanentBlobKey blobKey)
        throws IOException {
    byte[] serializedBlobKey;
    serializedBlobKey = InstantiationUtil.serializeObject(blobKey);

    userArtifacts.computeIfPresent(
            entryName,
            (key, originalEntry) ->
                    new DistributedCache.DistributedCacheEntry(
                            originalEntry.filePath,
                            originalEntry.isExecutable,
                            serializedBlobKey,
                            originalEntry.isZipped));
}

This method adds the `BlobKey` of each cached file to the corresponding entry in the `userArtifacts` collection.

YarnClusterDescriptor

The other place where user cache files are uploaded is `YarnClusterDescriptor::startAppMaster`. It is invoked when a Flink on YARN cluster starts up.

The following fragment of `startAppMaster` handles the file upload. When a job is submitted in application or yarn-session mode, `jobGraph` is null (in application mode the user's `main` method runs inside the Flink YARN cluster, so no `JobGraph` exists yet; yarn-session mode merely starts a Flink YARN cluster without running any job). Only per-job submission produces a non-null `jobGraph`, so this fragment only covers cache file upload in per-job mode.

For the startup flow of a Flink YARN cluster, see 'Flink 源码之 yarn-session 启动流程' (Flink Source Code: yarn-session Startup Flow).

// only for per job mode
if (jobGraph != null) {
    for (Map.Entry<String, DistributedCache.DistributedCacheEntry> entry :
            jobGraph.getUserArtifacts().entrySet()) {
        // only upload local files
        // the scheme of the file path (file:///, hdfs:// etc.) tells us whether the file
        // lives on remote shared storage or on local storage;
        // only files on local storage are uploaded to the YARN cluster
        if (!Utils.isRemotePath(entry.getValue().filePath)) {
            Path localPath = new Path(entry.getValue().filePath);
            // upload the local file to the YARN cluster (application dir, application visibility)
            Tuple2<Path, Long> remoteFileInfo =
                    fileUploader.uploadLocalFileToRemote(localPath, entry.getKey());
            // replace the cached file path stored in the jobGraph with the path after upload
            jobGraph.setUserArtifactRemotePath(
                    entry.getKey(), remoteFileInfo.f0.toString());
        }
    }

    // write the cache file information back into the configuration
    jobGraph.writeUserArtifactEntriesToConfiguration();
}

The `setUserArtifactRemotePath` method of `JobGraph` replaces a cached file's local path in `userArtifacts` with its path after the upload to the YARN cluster:

public void setUserArtifactRemotePath(String entryName, String remotePath) {
    userArtifacts.computeIfPresent(
            entryName,
            (key, originalEntry) ->
                    new DistributedCache.DistributedCacheEntry(
                            remotePath,
                            originalEntry.isExecutable,
                            null,
                            originalEntry.isZipped));
}

Finally, `JobGraph::writeUserArtifactEntriesToConfiguration` writes the cached files' names, paths and related settings into the job `configuration`:

public void writeUserArtifactEntriesToConfiguration() {
    for (Map.Entry<String, DistributedCache.DistributedCacheEntry> userArtifact :
            userArtifacts.entrySet()) {
        DistributedCache.writeFileInfoToConfig(
                userArtifact.getKey(), userArtifact.getValue(), jobConfiguration);
    }
}
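
The counterpart of `writeFileInfoToConfig` is `DistributedCache.readFileInfoFromConfig`, which the TaskManager calls in the next section. A minimal round-trip sketch built only from the two methods shown in this post (the entry values are made up):

Configuration jobConfiguration = new Configuration();
// write one cache entry, as writeUserArtifactEntriesToConfiguration does per artifact
DistributedCache.writeFileInfoToConfig(
        "hdfsFile",
        new DistributedCache.DistributedCacheEntry("hdfs:///path/to/your/file", false),
        jobConfiguration);
// read all entries back, as Task#doRun does on the TaskManager side
for (Map.Entry<String, DistributedCache.DistributedCacheEntry> entry :
        DistributedCache.readFileInfoFromConfig(jobConfiguration)) {
    // entry.getKey() is "hdfsFile"; entry.getValue().filePath is the registered path
}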

Retrieving cached files

Cached files are retrieved on the TaskManager. Looking at the `doRun` method of the `Task` class, it creates a series of background copy tasks that copy the cached files to the TM's local directory. The relevant fragment:

// ...

// next, kick off the background copying of files for the distributed cache
try {
    // read the info of all files cached in the DistributedCache from the configuration
    for (Map.Entry<String, DistributedCache.DistributedCacheEntry> entry :
            DistributedCache.readFileInfoFromConfig(jobConfiguration)) {
        LOG.info("Obtaining local cache file for '{}'.", entry.getKey());
        // download the cached file into the cache directory;
        // this runs in a dedicated thread pool and automatically checks whether the file
        // has a BlobKey, choosing between the BlobServer and the remote file system
        Future<Path> cp =
                fileCache.createTmpFile(
                        entry.getKey(), entry.getValue(), jobId, executionId);
        // store the download task's Future in the distributedCacheEntries collection
        distributedCacheEntries.put(entry.getKey(), cp);
    }
    }
} catch (Exception e) {
    throw new Exception(
            String.format(
                    "Exception while adding files to distributed cache of task %s (%s).",
                    taskNameWithSubtask, executionId),
            e);
}

// ...

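The comments above note that the download automatically chooses between the BlobServer and a remote file system. That decision happens inside `FileCache.createTmpFile`, roughly as follows (a paraphrased, simplified sketch rather than the verbatim source):

DistributedCache.DistributedCacheEntry e = entry.getValue(); // the entry read from the configuration
if (e.blobKey != null) {
    // the serialized PermanentBlobKey was set by setUserArtifactBlobKey earlier
    PermanentBlobKey blobKey = InstantiationUtil.deserializeObject(
            e.blobKey, ClassLoader.getSystemClassLoader());
    // the copy task fetches the file from the BlobServer
} else {
    // the copy task reads the file from the (remote) file system that e.filePath points to
}
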
`Task` then passes `distributedCacheEntries` into the `RuntimeEnvironment`:

// distributedCacheEntries is passed into the RuntimeEnvironment
Environment env =
        new RuntimeEnvironment(
                jobId,
                vertexId,
                executionId,
                executionConfig,
                taskInfo,
                jobConfiguration,
                taskConfiguration,
                userCodeClassLoader,
                memoryManager,
                ioManager,
                broadcastVariableManager,
                taskStateManager,
                aggregateManager,
                accumulatorRegistry,
                kvStateRegistry,
                inputSplitProvider,
                distributedCacheEntries,
                consumableNotifyingPartitionWriters,
                inputGates,
                taskEventDispatcher,
                checkpointResponder,
                operatorCoordinatorEventGateway,
                taskManagerConfig,
                metrics,
                this,
                externalResourceInfoProvider);

The `Environment` is next passed from the task into `AbstractStreamOperator`, and from there to the `StreamingRuntimeContext`.

Let's look at the `setup` method of `AbstractStreamOperator`:

final Environment environment = containingTask.getEnvironment();
// ...

this.runtimeContext =
        new StreamingRuntimeContext(
                environment,
                environment.getAccumulatorRegistry().getUserMap(),
                getMetricGroup(),
                getOperatorID(),
                getProcessingTimeService(),
                null,
                environment.getExternalResourceInfoProvider());

It creates a new `StreamingRuntimeContext` and passes the `environment` in.

The constructor of `StreamingRuntimeContext` looks like this:

public StreamingRuntimeContext(
        Environment env,
        Map<String, Accumulator<?, ?>> accumulators,
        OperatorMetricGroup operatorMetricGroup,
        OperatorID operatorID,
        ProcessingTimeService processingTimeService,
        @Nullable KeyedStateStore keyedStateStore,
        ExternalResourceInfoProvider externalResourceInfoProvider) {
    super(
            checkNotNull(env).getTaskInfo(),
            env.getUserCodeClassLoader(),
            env.getExecutionConfig(),
            accumulators,
            env.getDistributedCacheEntries(),
            operatorMetricGroup);
    this.taskEnvironment = env;
    this.streamConfig = new StreamConfig(env.getTaskConfiguration());
    this.operatorUniqueID = checkNotNull(operatorID).toString();
    this.processingTimeService = processingTimeService;
    this.keyedStateStore = keyedStateStore;
    this.externalResourceInfoProvider = externalResourceInfoProvider;
}

Its parent class is `AbstractRuntimeUDFContext`. Following its constructor, the code is:

public AbstractRuntimeUDFContext(
        TaskInfo taskInfo,
        UserCodeClassLoader userCodeClassLoader,
        ExecutionConfig executionConfig,
        Map<String, Accumulator<?, ?>> accumulators,
        Map<String, Future<Path>> cpTasks,
        OperatorMetricGroup metrics) {
    this.taskInfo = checkNotNull(taskInfo);
    this.userCodeClassLoader = userCodeClassLoader;
    this.executionConfig = executionConfig;
    this.distributedCache = new DistributedCache(checkNotNull(cpTasks));
    this.accumulators = checkNotNull(accumulators);
    this.metrics = metrics;
}

The copy tasks created by `Task` have finally arrived in `AbstractRuntimeUDFContext`, where they are wrapped into `distributedCache`.

The `DistributedCache` constructor is shown below:

public DistributedCache(Map<String, Future<Path>> cacheCopyTasks) {
    this.cacheCopyTasks = cacheCopyTasks;
}

In the end, `DistributedCache` keeps the copy tasks in `cacheCopyTasks`.

Accessing files cached in the DistributedCache from user code

A user operator must extend one of the `RichXXXFunction` classes. A `RichXXXFunction` can obtain the `DistributedCache` through the `getDistributedCache` method of the `RuntimeContext` and then read whatever it needs. Sample code:

val demoFile = getRuntimeContext.getDistributedCache.getFile("demo")

The `getRuntimeContext` call here returns exactly the `AbstractRuntimeUDFContext` object, whose `getDistributedCache` returns the `distributedCache` field:

@Override
public DistributedCache getDistributedCache() {
    return this.distributedCache;
}

The logic for reading a file from the cache sits in the `getFile` method. It waits until the file has been copied from the BlobServer or the remote file system to the TM's local disk, then returns the file's local path. The code:

public File getFile(String name) {
    if (name == null) {
        throw new NullPointerException("name must not be null");
    }

    Future<Path> future = cacheCopyTasks.get(name);
    if (future == null) {
        throw new IllegalArgumentException(
                "File with name '"
                        + name
                        + "' is not available."
                        + " Did you forget to register the file?");
    }

    try {
        // block until the background copy finishes, then get the path of the copied file
        final Path path = future.get();
        // build a fully qualified URI (scheme, authority and path) and return it as a File
        URI tmp = path.makeQualified(path.getFileSystem()).toUri();
        return new File(tmp);
    } catch (ExecutionException e) {
        throw new RuntimeException("An error occurred while copying the file.", e.getCause());
    } catch (Exception e) {
        throw new RuntimeException(
                "Error while getting the file registered under '"
                        + name
                        + "' from the distributed cache",
                e);
    }
}
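
Putting the user side together, a Java equivalent of the Scala snippet above might look like this (a sketch; the name "demo" must match the name used at registration time):

public class DemoMapper extends RichMapFunction<String, Integer> {
    private transient List<String> lines;

    @Override
    public void open(Configuration config) throws Exception {
        // blocks until the background copy of the cached file has finished
        File demoFile = getRuntimeContext().getDistributedCache().getFile("demo");
        lines = Files.readAllLines(demoFile.toPath());
    }

    @Override
    public Integer map(String value) {
        return lines.size(); // use the cached file's content
    }
}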

This post is the author's original work; discussion and corrections are welcome. Please credit the source when reposting.
