Flink Source Code: How the Flink Streaming Execution Plan Is Generated

    Ignorance is not what ruins you; pride is.

1. Example

       Many Flink users never stop to ask how the execution plan is generated, the way one might ask how Spark builds its RDD DAG, or how to read the plan once it is printed. Let's start with an example: build a simple pipeline and run System.out.println(env.getExecutionPlan()); the JSON output appears after the sketch below.
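The job that produced the plan is not shown in the original article; the following is a minimal, hedged sketch of a word-count-style pipeline that yields the same node layout (the source name matches the JSON, everything else, including the data and parallelism, is an assumption):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.util.Collector;

public class PlanExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.addSource(new SourceFunction<String>() {
                    @Override
                    public void run(SourceContext<String> ctx) throws Exception {
                        ctx.collect("hello flink");      // one dummy record is enough to build the plan
                    }
                    @Override
                    public void cancel() {}
                }, "添加了一个source")                      // the source name that shows up in the JSON
                .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                    for (String word : line.split(" ")) {
                        out.collect(Tuple2.of(word, 1));
                    }
                })
                .returns(Types.TUPLE(Types.STRING, Types.INT)) // needed because the lambda erases generics
                .keyBy(t -> t.f0)
                .sum(1)
                .print();

        // printing the plan does not execute the job
        System.out.println(env.getExecutionPlan());
    }
}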

{
  "nodes" : [ {
    "id" : 1,    图节点ID,也就是transform的ID
    "type" : "Source: 添加了一个source",   这个就是图名称
    "pact" : "Data Source",   类型,数据源
    "contents" : "Source: 添加了一个source",   描述内容
    "parallelism" : 1   并行度
  }, {
    "id" : 2,    
    "type" : "Flat Map",
    "pact" : "Operator",
    "contents" : "Flat Map",
    "parallelism" : 8,
    "predecessors" : [ {    这个是源节点
      "id" : 1,
      "ship_strategy" : "REBALANCE", 策略
      "side" : "second"
    } ]
  }, {
    "id" : 4,
    "type" : "Keyed Aggregation",
    "pact" : "Operator",
    "contents" : "Keyed Aggregation",
    "parallelism" : 8,
    "predecessors" : [ {
      "id" : 2,
      "ship_strategy" : "HASH",
      "side" : "second"
    } ]
  }, {
    "id" : 5,
    "type" : "Sink: Print to Std. Out",
    "pact" : "Data Sink",
    "contents" : "Sink: Print to Std. Out",
    "parallelism" : 8,
    "predecessors" : [ {
      "id" : 4,
      "ship_strategy" : "FORWARD",
      "side" : "second"
    } ]
  } ]
}

2. Code Walkthrough

public String getExecutionPlan() {
    return getStreamGraph(getJobName(), false).getStreamingPlanAsJSON();
}
Look at this method first. To produce an execution plan, Flink first builds the directed acyclic graph (much like Spark and Hive; data-lineage tooling for these engines can borrow the same approach). Once the StreamGraph object has been obtained, it is serialized into a JSON string, and that string is our execution plan.
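For reference, getStreamingPlanAsJSON() is only a thin wrapper; roughly (a simplified sketch of the Flink sources, error handling may differ by version) it hands the graph to a JSONGenerator, whose getJSON() method is shown at the end of this article:

// Simplified sketch of StreamGraph#getStreamingPlanAsJSON.
public String getStreamingPlanAsJSON() {
    try {
        return new JSONGenerator(this).getJSON();
    } catch (Exception e) {
        throw new RuntimeException("JSON plan creation failed", e);
    }
}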

3. Drilling into the Code

public StreamGraph getStreamGraph(String jobName, boolean clearTransformations) {
        // this method mainly wraps graph generation and, via the if-block below, optionally clears the registered transformations
        StreamGraph streamGraph = getStreamGraphGenerator().setJobName(jobName).generate();
        if (clearTransformations) {
            this.transformations.clear();
        }
        return streamGraph;
}
// builds the StreamGraphGenerator that will produce the StreamGraph
private StreamGraphGenerator getStreamGraphGenerator() {
    if (transformations.size() <= 0) { // verify that at least one operator was registered; this ties back to how source/sink/transform attach themselves (see the previous article)
        throw new IllegalStateException(
                "No operators defined in streaming topology. Cannot execute.");
    }
    
    final RuntimeExecutionMode executionMode = configuration.get(ExecutionOptions.RUNTIME_MODE);
    // a whole series of properties is set here; each setter is worth stepping into if you are curious
    return new StreamGraphGenerator(transformations, config, checkpointCfg, getConfiguration())
            .setRuntimeExecutionMode(executionMode)
            .setStateBackend(defaultStateBackend)
            .setChaining(isChainingEnabled)
            .setUserArtifacts(cacheFile)
            .setTimeCharacteristic(timeCharacteristic)
            .setDefaultBufferTimeout(bufferTimeout);
}
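Most of these properties mirror settings made on the StreamExecutionEnvironment itself. A hedged illustration of where they come from (standard public env setters; HashMapStateBackend needs Flink 1.13+, and the concrete values here are arbitrary):

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.STREAMING); // feeds setRuntimeExecutionMode(executionMode)
env.setStateBackend(new HashMapStateBackend());     // feeds setStateBackend(defaultStateBackend), Flink 1.13+
env.setBufferTimeout(100);                          // feeds setDefaultBufferTimeout(bufferTimeout)
env.disableOperatorChaining();                      // feeds setChaining(isChainingEnabled) with false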


// the key part: the method that actually builds the execution-plan graph
public StreamGraph generate() {
    // create a fresh StreamGraph instance
    streamGraph = new StreamGraph(executionConfig, checkpointConfig, savepointRestoreSettings);
    // determine whether the job should be executed in batch mode
    shouldExecuteInBatchMode = shouldExecuteInBatchMode(runtimeExecutionMode);
    // apply various settings to the graph
    configureStreamGraph(streamGraph);
    // create a HashMap that records already-transformed transformations; it is what later terminates the recursion
    alreadyTransformed = new HashMap<>();
    // walk the registered transformations and translate each one
    for (Transformation<?> transformation : transformations) {
        transform(transformation); // the important call: this is where graph nodes and their properties get added
    }

    final StreamGraph builtStreamGraph = streamGraph;

    alreadyTransformed.clear();
    alreadyTransformed = null;
    streamGraph = null;

    return builtStreamGraph;
}
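Before following transform() into the details, it helps to know what the transformations list holds for the example job. This is my hedged reading of how the DataStream API registers operators (only operators added through addOperator end up in the list; exact class names vary slightly across versions):

// id 1: source transformation ("Source: 添加了一个source")  - NOT in the list; only reached as a parent during recursion
// id 2: OneInputTransformation ("Flat Map")                  - in the list
// id 3: PartitionTransformation created by keyBy()           - NOT in the list; later becomes a virtual node
// id 4: OneInputTransformation ("Keyed Aggregation")         - in the list
// id 5: sink transformation ("Sink: Print to Std. Out")      - in the list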
private Collection<Integer> transform(Transformation<?> transform) {
    // the map created earlier records transformations that have already been handled
    if (alreadyTransformed.containsKey(transform)) {
        return alreadyTransformed.get(transform);
    }
    // log the transformation that is about to be processed
    LOG.debug("Transforming " + transform);
    // if the max parallelism has not been set on this transformation, fall back to the job-wide setting
    if (transform.getMaxParallelism() <= 0) {
        // if the max parallelism hasn't been set, then first use the job wide max parallelism
        // from the ExecutionConfig.
        int globalMaxParallelismFromConfig = executionConfig.getMaxParallelism();
        if (globalMaxParallelismFromConfig > 0) {
            transform.setMaxParallelism(globalMaxParallelismFromConfig);
        }
    }

    // call at least once to trigger exceptions about MissingTypeInfo
    transform.getOutputType(); // this call is purely a validation step

    // look up the translator responsible for this concrete Transformation type
    @SuppressWarnings("unchecked")
    final TransformationTranslator<?, Transformation<?>> translator =
            (TransformationTranslator<?, Transformation<?>>)
                    translatorMap.get(transform.getClass());

    Collection<Integer> transformedIds;
    if (translator != null) {
        // a translator is registered for this type: it performs the actual graph translation
        transformedIds = translate(translator, transform);
    } else {
        transformedIds = legacyTransform(transform);
    }

    // need this check because the iterate transformation adds itself before
    // transforming the feedback edges
    if (!alreadyTransformed.containsKey(transform)) {
        alreadyTransformed.put(transform, transformedIds);
    }

    return transformedIds;
}
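The translatorMap lookup above works because StreamGraphGenerator registers one translator per concrete Transformation class in a static initializer. An abbreviated, hedged sketch of that registration (class names taken from the Flink sources; the real map has many more entries):

// Abbreviated sketch of StreamGraphGenerator's static translator registration.
static {
    Map<Class<? extends Transformation>, TransformationTranslator<?, ? extends Transformation>> tmp =
            new HashMap<>();
    tmp.put(OneInputTransformation.class, new OneInputTransformationTranslator<>());
    tmp.put(PartitionTransformation.class, new PartitionTransformationTranslator<>());
    tmp.put(LegacySourceTransformation.class, new LegacySourceTransformationTranslator<>());
    tmp.put(LegacySinkTransformation.class, new LegacySinkTransformationTranslator<>());
    // ... two-input, union, side-output, reduce, timestamp assignment, etc.
    translatorMap = Collections.unmodifiableMap(tmp);
}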

private Collection<Integer> translate(
            final TransformationTranslator<?, Transformation<?>> translator,
            final Transformation<?> transform) {
        checkNotNull(translator);
        checkNotNull(transform);

        // The recursion happens here: if an upstream (input) transformation has not been translated yet,
        // it is translated first. For example, when the flatMap transformation is handled, its source
        // input has not been translated, so the recursive call handles the source before continuing.
        final List<Collection<Integer>> allInputIds = getParentInputIds(transform.getInputs());

        // the recursive call might have already transformed this transformation; if so, return its ids
        if (alreadyTransformed.containsKey(transform)) {
            return alreadyTransformed.get(transform);
        }
        // determine the slot sharing group for this transformation
        final String slotSharingGroup =
                determineSlotSharingGroup(
                        transform.getSlotSharingGroup(),
                        allInputIds.stream()
                                .flatMap(Collection::stream)
                                .collect(Collectors.toList()));

        final TransformationTranslator.Context context =
                new ContextImpl(this, streamGraph, slotSharingGroup, configuration);

        return shouldExecuteInBatchMode
                ? translator.translateForBatch(transform, context)
                : translator.translateForStreaming(transform, context);
}
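To make the recursion concrete, here is a hedged trace of the calls for the example job (recall that only the flatMap, the keyed aggregation, and the sink sit in the transformations list):

// Hedged trace of generate()'s loop for the sample job:
// transform(flatMap id=2)
//   -> translate(...) -> getParentInputIds([source id=1])
//        -> transform(source id=1) -> translate(...) -> adds node 1, returns [1]
//   -> adds node 2 and edge 1->2 (REBALANCE: the default when parallelisms differ)
// transform(keyedAgg id=4)
//   -> getParentInputIds([partition id=3])
//        -> transform(partition id=3) -> registers a *virtual* partition node, no JSON node
//   -> adds node 4 and edge 2->4 (HASH: from the key-group partitioner)
// transform(sink id=5)
//   -> getParentInputIds([keyedAgg id=4]) -> already transformed, returns [4]
//   -> adds node 5 and edge 4->5 (FORWARD: same parallelism, no repartitioning)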

private List<Collection<Integer>> getParentInputIds(
            @Nullable final Collection<Transformation<?>> parentTransformations) {
        final List<Collection<Integer>> allInputIds = new ArrayList<>();
        if (parentTransformations == null) {
            return allInputIds;
        }

        for (Transformation<?> transformation : parentTransformations) {
            // the key step: recursively call transform() on every upstream transformation
            allInputIds.add(transform(transformation));
        }
        return allInputIds;
    }
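Between translate() and the translateInternal() shown next sits the per-type translator. For a one-input operator such as the flatMap, OneInputTransformationTranslator essentially just forwards the pieces of the transformation to translateInternal(); roughly (a simplified sketch of the Flink sources, signatures may differ slightly by version):

// Simplified sketch of OneInputTransformationTranslator#translateForStreamingInternal.
// IN and OUT are the class-level type parameters of the translator.
public Collection<Integer> translateForStreamingInternal(
        final OneInputTransformation<IN, OUT> transformation, final Context context) {
    return translateInternal(
            transformation,
            transformation.getOperatorFactory(),
            transformation.getInputType(),
            transformation.getStateKeySelector(),
            transformation.getStateKeyType(),
            context);
}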

// this method adds the node (and its incoming edges) to the StreamGraph
protected Collection<Integer> translateInternal(
            final Transformation<OUT> transformation,
            final StreamOperatorFactory<OUT> operatorFactory,
            final TypeInformation<IN> inputType,
            @Nullable final KeySelector<IN, ?> stateKeySelector,
            @Nullable final TypeInformation<?> stateKeyType,
            final Context context) {
        checkNotNull(transformation);
        checkNotNull(operatorFactory);
        checkNotNull(inputType);
        checkNotNull(context);

        final StreamGraph streamGraph = context.getStreamGraph();
        final String slotSharingGroup = context.getSlotSharingGroup();
        final int transformationId = transformation.getId();
        final ExecutionConfig executionConfig = streamGraph.getExecutionConfig();

        // register the operator as a StreamNode in the graph
        streamGraph.addOperator(
                transformationId,
                slotSharingGroup,
                transformation.getCoLocationGroupKey(),
                operatorFactory,
                inputType,
                transformation.getOutputType(),
                transformation.getName());

        if (stateKeySelector != null) {
            TypeSerializer<?> keySerializer = stateKeyType.createSerializer(executionConfig);
            streamGraph.setOneInputStateKey(transformationId, stateKeySelector, keySerializer);
        }

        int parallelism =
                transformation.getParallelism() != ExecutionConfig.PARALLELISM_DEFAULT
                        ? transformation.getParallelism()
                        : executionConfig.getParallelism();
        streamGraph.setParallelism(transformationId, parallelism);
        streamGraph.setMaxParallelism(transformationId, transformation.getMaxParallelism());

        final List<Transformation<?>> parentTransformations = transformation.getInputs();
        checkState(
                parentTransformations.size() == 1,
                "Expected exactly one input transformation but found "
                        + parentTransformations.size());

        // connect the new node to every stream-node id of its single upstream transformation
        for (Integer inputId : context.getStreamNodeIds(parentTransformations.get(0))) {
            streamGraph.addEdge(inputId, transformationId, 0);
        }

        return Collections.singleton(transformationId);
    }
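Before drilling into streamGraph.addOperator (shown right after this sketch), note that partitioning transformations, like the keyBy in the example, take a different path. Instead of addOperator, their translator registers a virtual partition node; when the downstream node later calls addEdge, the virtual node is resolved into the partitioner on the edge, which is where the HASH ship strategy in the JSON comes from. A hedged sketch of the idea (simplified, names taken from the Flink sources):

// Hedged sketch of PartitionTransformationTranslator's streaming path.
private Collection<Integer> translateInternal(
        final PartitionTransformation<OUT> transformation, final Context context) {
    final StreamGraph streamGraph = context.getStreamGraph();
    final Transformation<?> input = transformation.getInputs().get(0);

    List<Integer> resultIds = new ArrayList<>();
    for (Integer inputId : context.getStreamNodeIds(input)) {
        // allocate a new virtual id and remember (upstream id, partitioner, shuffle mode) under it;
        // no StreamNode is created, which is why id 3 never shows up in the printed plan
        final int virtualId = Transformation.getNewNodeId();
        streamGraph.addVirtualPartitionNode(
                inputId, virtualId, transformation.getPartitioner(), transformation.getShuffleMode());
        resultIds.add(virtualId);
    }
    return resultIds;
}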

private <IN, OUT> void addOperator(
            Integer vertexID,
            @Nullable String slotSharingGroup,
            @Nullable String coLocationGroup,
            StreamOperatorFactory<OUT> operatorFactory,
            TypeInformation<IN> inTypeInfo,
            TypeInformation<OUT> outTypeInfo,
            String operatorName,
            Class<? extends AbstractInvokable> invokableClass) {

        // create the StreamNode for this operator
        addNode(
                vertexID,
                slotSharingGroup,
                coLocationGroup,
                invokableClass,
                operatorFactory,
                operatorName);
        setSerializers(vertexID, createSerializer(inTypeInfo), null, createSerializer(outTypeInfo));
        // ... the remainder of the method (input/output type configuration on the operator factory) is omitted in this excerpt
}
With that, graph construction is complete. Once the StreamGraph is fully populated, getStreamingPlanAsJSON() is called, which delegates to the following getJSON() method to produce the final JSON string:
public String getJSON() {
        ObjectNode json = mapper.createObjectNode();
        ArrayNode nodes = mapper.createArrayNode();
        json.put("nodes", nodes);

        List<Integer> operatorIDs = new ArrayList<>(streamGraph.getVertexIDs());
        Comparator<Integer> operatorIDComparator =
                Comparator.comparingInt(
                                (Integer id) -> streamGraph.getSinkIDs().contains(id) ? 1 : 0)
                        .thenComparingInt(id -> id);
        operatorIDs.sort(operatorIDComparator);

        visit(nodes, operatorIDs, new HashMap<>());

        return json.toPrettyString();
    }
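A quick, hedged way to use the output: dump the plan to a file and paste it into the visualizer mentioned below (plain Java; the file name and the tiny pipeline are assumptions of this sketch):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DumpPlan {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3).print();   // any pipeline works; replace with your own job
        // write the plan JSON to a file so it can be pasted into the Flink plan visualizer
        Files.write(Paths.get("plan.json"),
                env.getExecutionPlan().getBytes(StandardCharsets.UTF_8));
    }
}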

That completes the generation of Flink's execution plan. You can paste the JSON into https://flink.apache.org/visualizer/ to render the plan and inspect your pipeline's topology. As a capable big-data developer you should know how to read execution plans: Spark, Hive, and Flink all have them, and the plan tells you whether your SQL or API code is well written.
