Reading the Flink Source Code (1) --- StreamGraph Generation

This article is based on Flink 1.9. When a Flink job is executed, three graphs are involved: the StreamGraph, the JobGraph, and the ExecutionGraph. The StreamGraph and the JobGraph are generated on the client side, while the ExecutionGraph is generated inside the JobMaster.

  • The StreamGraph is the most primitive execution graph, produced directly from the user code; it is essentially a one-to-one translation of the user's logic.
  • The JobGraph is an optimized version of the StreamGraph; for example, it determines which operators can be chained together to reduce network overhead.
  • The ExecutionGraph is the graph used for job scheduling; it adds the notion of parallelism on top of the JobGraph.

This post walks through how the StreamGraph is generated.

1. Generating the transformations

The Flink engine provides many operators, such as map, flatMap, and join, and each of them produces a Transformation. Take flatMap as an example and follow the source code into DataStream#flatMap:

    public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper) {

        TypeInformation<R> outType = TypeExtractor.getFlatMapReturnTypes(clean(flatMapper),
                getType(), Utils.getCallLocationName(), true);

        return transform("Flat Map", outType, new StreamFlatMap<>(clean(flatMapper)));

    }
  • First, the type information of the return value is extracted.
  • Then transform is called; if you follow it into the source, it builds a OneInputTransformation and adds it to the transformations list of the StreamExecutionEnvironment, as sketched right after this list.
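
For reference, DataStream#transform in Flink 1.9 looks roughly like the abridged version below (based on my reading of the source; annotations and an unchecked cast in the original are tidied up). The key point is the last call, which registers the new transformation with the environment:

    public <R> SingleOutputStreamOperator<R> transform(
            String operatorName,
            TypeInformation<R> outTypeInfo,
            OneInputStreamOperator<T, R> operator) {

        // read the output type of the input transformation to trigger errors about MissingTypeInfo early
        transformation.getOutputType();

        // the new transformation wraps this stream's transformation as its input
        OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(
                this.transformation,
                operatorName,
                operator,
                outTypeInfo,
                environment.getParallelism());

        SingleOutputStreamOperator<R> returnStream =
                new SingleOutputStreamOperator<>(environment, resultTransform);

        // this is what appends the transformation to StreamExecutionEnvironment#transformations
        getExecutionEnvironment().addOperator(resultTransform);

        return returnStream;
    }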

2. The entry point of StreamGraph generation: StreamGraphGenerator#generate()

    public StreamGraph generate() {
        streamGraph = new StreamGraph(executionConfig, checkpointConfig);
        streamGraph.setStateBackend(stateBackend);
        streamGraph.setChaining(chaining);
        streamGraph.setScheduleMode(scheduleMode);
        streamGraph.setUserArtifacts(userArtifacts);
        streamGraph.setTimeCharacteristic(timeCharacteristic);
        streamGraph.setJobName(jobName);
        streamGraph.setBlockingConnectionsBetweenChains(blockingConnectionsBetweenChains);

        alreadyTransformed = new HashMap<>();

        for (Transformation<?> transformation: transformations) {
            transform(transformation);
        }

        final StreamGraph builtStreamGraph = streamGraph;

        alreadyTransformed.clear();
        alreadyTransformed = null;
        streamGraph = null;

        return builtStreamGraph;
    }

This generate() method transforms every entry in transformations. Let's look next at the transform() logic:

    /**
     * Transforms one {@code Transformation}.
     *
     * <p>This checks whether we already transformed it and exits early in that case. If not it
     * delegates to one of the transformation specific methods.
     */
    private Collection<Integer> transform(Transformation<?> transform) {

        if (alreadyTransformed.containsKey(transform)) {
            return alreadyTransformed.get(transform);
        }

        LOG.debug("Transforming " + transform);

        if (transform.getMaxParallelism() <= 0) {

            // if the max parallelism hasn't been set, then first use the job wide max parallelism
            // from the ExecutionConfig.
            int globalMaxParallelismFromConfig = executionConfig.getMaxParallelism();
            if (globalMaxParallelismFromConfig > 0) {
                transform.setMaxParallelism(globalMaxParallelismFromConfig);
            }
        }

        // call at least once to trigger exceptions about MissingTypeInfo
        transform.getOutputType();

        Collection<Integer> transformedIds;
        if (transform instanceof OneInputTransformation<?, ?>) {
            transformedIds = transformOneInputTransform((OneInputTransformation<?, ?>) transform);
        } else if (transform instanceof TwoInputTransformation<?, ?, ?>) {
            transformedIds = transformTwoInputTransform((TwoInputTransformation<?, ?, ?>) transform);
        } else if (transform instanceof SourceTransformation<?>) {
            transformedIds = transformSource((SourceTransformation<?>) transform);
        } else if (transform instanceof SinkTransformation<?>) {
            transformedIds = transformSink((SinkTransformation<?>) transform);
        } else if (transform instanceof UnionTransformation<?>) {
            transformedIds = transformUnion((UnionTransformation<?>) transform);
        } else if (transform instanceof SplitTransformation<?>) {
            transformedIds = transformSplit((SplitTransformation<?>) transform);
        } else if (transform instanceof SelectTransformation<?>) {
            transformedIds = transformSelect((SelectTransformation<?>) transform);
        } else if (transform instanceof FeedbackTransformation<?>) {
            transformedIds = transformFeedback((FeedbackTransformation<?>) transform);
        } else if (transform instanceof CoFeedbackTransformation<?>) {
            transformedIds = transformCoFeedback((CoFeedbackTransformation<?>) transform);
        } else if (transform instanceof PartitionTransformation<?>) {
            transformedIds = transformPartition((PartitionTransformation<?>) transform);
        } else if (transform instanceof SideOutputTransformation<?>) {
            transformedIds = transformSideOutput((SideOutputTransformation<?>) transform);
        } else {
            throw new IllegalStateException("Unknown transformation: " + transform);
        }

        // need this check because the iterate transformation adds itself before
        // transforming the feedback edges
        if (!alreadyTransformed.containsKey(transform)) {
            alreadyTransformed.put(transform, transformedIds);
        }

        if (transform.getBufferTimeout() >= 0) {
            streamGraph.setBufferTimeout(transform.getId(), transform.getBufferTimeout());
        } else {
            streamGraph.setBufferTimeout(transform.getId(), defaultBufferTimeout);
        }

        if (transform.getUid() != null) {
            streamGraph.setTransformationUID(transform.getId(), transform.getUid());
        }
        if (transform.getUserProvidedNodeHash() != null) {
            streamGraph.setTransformationUserHash(transform.getId(), transform.getUserProvidedNodeHash());
        }

        if (!streamGraph.getExecutionConfig().hasAutoGeneratedUIDsEnabled()) {
            if (transform instanceof PhysicalTransformation &&
                    transform.getUserProvidedNodeHash() == null &&
                    transform.getUid() == null) {
                throw new IllegalStateException("Auto generated UIDs have been disabled " +
                    "but no UID or hash has been assigned to operator " + transform.getName());
            }
        }

        if (transform.getMinResources() != null && transform.getPreferredResources() != null) {
            streamGraph.setResources(transform.getId(), transform.getMinResources(), transform.getPreferredResources());
        }

        return transformedIds;
    }

As the source shows, a Transformation can be of several types, such as Source, Sink, OneInput, Split, and so on; flatMap, for instance, corresponds to a OneInputTransformation. The small pipeline right after this paragraph shows which Transformation the common API calls produce; after that we walk through transformOneInputTransform, one of the most common cases.
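
As an illustration, here is a minimal pipeline (a made-up example, not from the Flink code base); the comments mark which Transformation each call is expected to produce:

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class TransformationDemo {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromElements("a b", "b c")                   // SourceTransformation
                .flatMap((String line, Collector<String> out) -> {
                    for (String word : line.split(" ")) {
                        out.collect(word);
                    }
                })                                           // OneInputTransformation
                .returns(Types.STRING)                       // the lambda's output type is erased, so declare it
                .rebalance()                                 // PartitionTransformation (handled via a virtual node, see section 3)
                .print();                                    // SinkTransformation

            env.execute("transformation demo");
        }
    }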

    /**
     * Transforms a {@code OneInputTransformation}.
     *
     * <p>This recursively transforms the inputs, creates a new {@code StreamNode} in the graph and
     * wired the inputs to this new node.
     */
    private <IN, OUT> Collection<Integer> transformOneInputTransform(OneInputTransformation<IN, OUT> transform) {

        Collection<Integer> inputIds = transform(transform.getInput());

        // the recursive call might have already transformed this
        if (alreadyTransformed.containsKey(transform)) {
            return alreadyTransformed.get(transform);
        }

        String slotSharingGroup = determineSlotSharingGroup(transform.getSlotSharingGroup(), inputIds);

        streamGraph.addOperator(transform.getId(),
                slotSharingGroup,
                transform.getCoLocationGroupKey(),
                transform.getOperatorFactory(),
                transform.getInputType(),
                transform.getOutputType(),
                transform.getName());

        if (transform.getStateKeySelector() != null) {
            TypeSerializer<?> keySerializer = transform.getStateKeyType().createSerializer(executionConfig);
            streamGraph.setOneInputStateKey(transform.getId(), transform.getStateKeySelector(), keySerializer);
        }

        int parallelism = transform.getParallelism() != ExecutionConfig.PARALLELISM_DEFAULT ?
            transform.getParallelism() : executionConfig.getParallelism();
        streamGraph.setParallelism(transform.getId(), parallelism);
        streamGraph.setMaxParallelism(transform.getId(), transform.getMaxParallelism());

        for (Integer inputId: inputIds) {
            streamGraph.addEdge(inputId, transform.getId(), 0);
        }

        return Collections.singleton(transform.getId());
    }

  • It first recursively calls transform() on all of the operator's input transformations.
  • It then adds the operator to the streamGraph; concretely, a StreamNode is created and put into the StreamGraph:
       1. streamGraph.addOperator is called first,
       2. which in turn calls addNode:
    protected StreamNode addNode(Integer vertexID,
        @Nullable String slotSharingGroup,
        @Nullable String coLocationGroup,
        Class<? extends AbstractInvokable> vertexClass,
        StreamOperatorFactory<?> operatorFactory,
        String operatorName) {

        if (streamNodes.containsKey(vertexID)) {
            throw new RuntimeException("Duplicate vertexID " + vertexID);
        }

        StreamNode vertex = new StreamNode(
            vertexID,
            slotSharingGroup,
            coLocationGroup,
            operatorFactory,
            operatorName,
            new ArrayList<OutputSelector<?>>(),
            vertexClass);

        streamNodes.put(vertexID, vertex);

        return vertex;
    }

This addNode method wraps the operator in a StreamNode and registers it in the StreamGraph; the vertexID here is transform.getId().

  • Finally, a StreamEdge is added between each of the operator's inputs and the operator itself; a sketch of how the edge is wired follows below.
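
The edge wiring is handled by StreamGraph#addEdge, which delegates to a private addEdgeInternal method. The fragment below is an abridged paraphrase of its core logic as I read the 1.9 source (virtual-node handling and the full parameter list are omitted), so treat it as a sketch rather than the literal code:

    // inside StreamGraph#addEdgeInternal (abridged paraphrase):
    // if the upstream id refers to a virtual partition/select/side-output node, the call
    // recurses to the real upstream StreamNode, carrying that virtual node's properties along;
    // otherwise a concrete StreamEdge is created. When no partitioner was specified,
    // forward partitioning is used if both ends have the same parallelism, rebalance otherwise:
    if (partitioner == null && upstreamNode.getParallelism() == downstreamNode.getParallelism()) {
        partitioner = new ForwardPartitioner<Object>();
    } else if (partitioner == null) {
        partitioner = new RebalancePartitioner<Object>();
    }
    // the resulting StreamEdge is then added to the upstream node's out-edges
    // and to the downstream node's in-edges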

In this way, StreamNode and StreamEdge describe all the nodes and edges of the DAG, together with how they connect, and the topology of the graph is complete.
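
If you want to inspect the resulting graph for your own job, you can print its JSON plan; env.getExecutionPlan() builds the StreamGraph described above and renders it as JSON (the pipeline itself is whatever you have defined):

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // ... define the pipeline here ...

    // prints the StreamGraph of the current pipeline as a JSON plan,
    // which can be pasted into Flink's plan visualizer
    System.out.println(env.getExecutionPlan());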

3. Summary

The StreamGraph is, in essence, built from the transformations produced by the user code: a StreamNode represents a concrete operator, and a StreamEdge represents the connection between transformations. The generation process can be summarized as follows:

  • generate() walks the collected transformations; for each one, the input transformations are transformed recursively first, so the upstream part of the graph is always built before the nodes that consume it.
  • Each transformation creates a new StreamNode in the StreamGraph, and a StreamEdge is added between the new StreamNode and each of its inputs.
  • Partitioning, split/select and union do not create a real StreamNode; instead they register virtual nodes with special properties (the partitioner, the selected names, and so on), which effectively puts that information on the edges. A sketch of this follows below.
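
As an illustration of the last point, transformPartition in 1.9 looks roughly like the abridged sketch below (paraphrased from my reading of the source; details such as the ShuffleMode parameter may differ slightly). Instead of calling addOperator/addNode, it only registers a virtual node that remembers the partitioner:

    private <T> Collection<Integer> transformPartition(PartitionTransformation<T> partition) {
        Transformation<T> input = partition.getInput();
        List<Integer> resultIds = new ArrayList<>();

        // transform the real upstream transformation first
        Collection<Integer> transformedIds = transform(input);
        for (Integer transformedId : transformedIds) {
            // no StreamNode is created here, only a virtual node carrying the partitioner
            int virtualId = Transformation.getNewNodeId();
            streamGraph.addVirtualPartitionNode(
                    transformedId, virtualId, partition.getPartitioner(), partition.getShuffleMode());
            resultIds.add(virtualId);
        }

        return resultIds;
    }

When addEdgeInternal later sees such a virtual id on the upstream side of an edge, it recurses to the real upstream StreamNode and attaches the stored partitioner to the StreamEdge.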
