Many Flink users never stop to ask how the execution plan is generated (much like asking how Spark builds its RDD DAG), or how to read the plan once it is printed. Let's start with an example: run System.out.println(env.getExecutionPlan()); and you get output like the JSON below.
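For reference, here is a minimal sketch (class and variable names are mine, not taken from the original job) of a pipeline that produces a plan of essentially this shape; only the source name differs, since the original job names its source explicitly:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class ExecutionPlanDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(8);

        env.fromElements("a b", "b c")                      // Data Source (non-parallel, parallelism 1)
            .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                for (String word : line.split(" ")) {
                    out.collect(Tuple2.of(word, 1));
                }
            })
            .returns(Types.TUPLE(Types.STRING, Types.INT))  // Flat Map, parallelism 8
            .keyBy(t -> t.f0)
            .sum(1)                                         // Keyed Aggregation, parallelism 8
            .print();                                       // Sink: Print to Std. Out, parallelism 8

        // Print the JSON plan; note this does not submit the job.
        System.out.println(env.getExecutionPlan());
    }
}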
{
"nodes" : [ {
"id" : 1, 图节点ID,也就是transform的ID
"type" : "Source: 添加了一个source", 这个就是图名称
"pact" : "Data Source", 类型,数据源
"contents" : "Source: 添加了一个source", 描述内容
"parallelism" : 1 并行度
}, {
"id" : 2,
"type" : "Flat Map",
"pact" : "Operator",
"contents" : "Flat Map",
"parallelism" : 8,
"predecessors" : [ { 这个是源节点
"id" : 1,
"ship_strategy" : "REBALANCE", 策略
"side" : "second"
} ]
}, {
"id" : 4,
"type" : "Keyed Aggregation",
"pact" : "Operator",
"contents" : "Keyed Aggregation",
"parallelism" : 8,
"predecessors" : [ {
"id" : 2,
"ship_strategy" : "HASH",
"side" : "second"
} ]
}, {
"id" : 5,
"type" : "Sink: Print to Std. Out",
"pact" : "Data Sink",
"contents" : "Sink: Print to Std. Out",
"parallelism" : 8,
"predecessors" : [ {
"id" : 4,
"ship_strategy" : "FORWARD",
"side" : "second"
} ]
} ]
}
public String getExecutionPlan() {
return getStreamGraph(getJobName(), false).getStreamingPlanAsJSON();
}
Look at getExecutionPlan() above: building an execution plan means first generating the directed acyclic graph (similar to Spark and Hive; data-lineage tooling for Flink can borrow the same idea), and once the StreamGraph object is obtained, serializing it to JSON. That JSON string is our execution plan.
public StreamGraph getStreamGraph(String jobName, boolean clearTransformations) {
// a thin wrapper: it delegates to the generator and, when asked, clears the transformations list afterwards
StreamGraph streamGraph = getStreamGraphGenerator().setJobName(jobName).generate();
if (clearTransformations) {
this.transformations.clear();
}
return streamGraph;
}
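A quick usage note (hedged against the Flink version this walkthrough follows, roughly 1.12/1.13): getExecutionPlan() passes clearTransformations = false, so printing the plan does not consume the pipeline, whereas execute() rebuilds the StreamGraph with clearTransformations = true before submitting:

System.out.println(env.getExecutionPlan()); // transformations list is left intact
env.execute("execution plan demo");         // builds the StreamGraph again, this time clearing the list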
// builds the StreamGraphGenerator used in the step above
private StreamGraphGenerator getStreamGraphGenerator() {
if (transformations.size() <= 0) { // make sure at least one operator was registered (see the previous chapter on how source/sink/transform attach themselves)
throw new IllegalStateException(
"No operators defined in streaming topology. Cannot execute.");
}
final RuntimeExecutionMode executionMode = configuration.get(ExecutionOptions.RUNTIME_MODE);
// a batch of properties is configured on the generator here; step into each setter if you are curious
return new StreamGraphGenerator(transformations, config, checkpointCfg, getConfiguration())
.setRuntimeExecutionMode(executionMode)
.setStateBackend(defaultStateBackend)
.setChaining(isChainingEnabled)
.setUserArtifacts(cacheFile)
.setTimeCharacteristic(timeCharacteristic)
.setDefaultBufferTimeout(bufferTimeout);
}
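A small aside on where RUNTIME_MODE comes from (assuming Flink 1.12+): it can be set on the environment or passed as configuration before the plan is generated, for example:

// STREAMING, BATCH or AUTOMATIC; read back above via ExecutionOptions.RUNTIME_MODE
env.setRuntimeMode(org.apache.flink.api.common.RuntimeExecutionMode.BATCH);
// or from the command line: bin/flink run -Dexecution.runtime-mode=BATCH ...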
// the key part: the method that actually builds the execution-plan graph
public StreamGraph generate() {
// create the StreamGraph object directly
streamGraph = new StreamGraph(executionConfig, checkpointConfig, savepointRestoreSettings);
// decide whether the job should run in batch mode
shouldExecuteInBatchMode = shouldExecuteInBatchMode(runtimeExecutionMode);
// apply a few settings to the graph
configureStreamGraph(streamGraph);
// create a HashMap that remembers already-processed transformations; it is what terminates the recursion later
alreadyTransformed = new HashMap<>();
// walk every registered transformation and translate it into graph nodes
for (Transformation<?> transformation : transformations) {
transform(transformation); // worth a close look: this call is where nodes, edges and their properties are added
}
final StreamGraph builtStreamGraph = streamGraph;
alreadyTransformed.clear();
alreadyTransformed = null;
streamGraph = null;
return builtStreamGraph;
}
private Collection<Integer> transform(Transformation<?> transform) {
// the map created earlier: skip transformations that have already been processed
if (alreadyTransformed.containsKey(transform)) {
return alreadyTransformed.get(transform);
}
// log the transformation that is about to be processed
LOG.debug("Transforming " + transform);
// if the max parallelism is not set on the transformation, fall back to the job-wide value
if (transform.getMaxParallelism() <= 0) {
// if the max parallelism hasn't been set, then first use the job wide max parallelism
// from the ExecutionConfig.
int globalMaxParallelismFromConfig = executionConfig.getMaxParallelism();
if (globalMaxParallelismFromConfig > 0) {
transform.setMaxParallelism(globalMaxParallelismFromConfig);
}
}
// call at least once to trigger exceptions about MissingTypeInfo
transform.getOutputType(); // just a sanity check that the output type can be determined
@SuppressWarnings("unchecked")
// look up the translator registered for this concrete Transformation class
final TransformationTranslator<?, Transformation<?>> translator =
(TransformationTranslator<?, Transformation<?>>)
translatorMap.get(transform.getClass());
// if a translator is registered, use it; otherwise fall back to the legacy path
Collection<Integer> transformedIds;
if (translator != null) {
// step into this method: it does the actual graph translation
transformedIds = translate(translator, transform);
} else {
transformedIds = legacyTransform(transform);
}
// need this check because the iterate transformation adds itself before
// transforming the feedback edges
if (!alreadyTransformed.containsKey(transform)) {
alreadyTransformed.put(transform, transformedIds);
}
return transformedIds;
}
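The translatorMap consulted above is a static lookup from concrete Transformation classes to their translators. Here is an abridged sketch of how it is populated in the StreamGraphGenerator static initializer (the exact entries vary between Flink versions, so treat this as illustrative only):

static {
    Map<Class<? extends Transformation>, TransformationTranslator<?, ? extends Transformation>> tmp =
            new HashMap<>();
    tmp.put(OneInputTransformation.class, new OneInputTransformationTranslator<>());
    tmp.put(PartitionTransformation.class, new PartitionTransformationTranslator<>());
    tmp.put(SourceTransformation.class, new SourceTransformationTranslator<>());
    tmp.put(SinkTransformation.class, new SinkTransformationTranslator<>());
    // ... one entry per Transformation subclass; anything missing falls back to legacyTransform()
    translatorMap = Collections.unmodifiableMap(tmp);
}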
private Collection<Integer> translate(
final TransformationTranslator<?, Transformation<?>> translator,
final Transformation<?> transform) {
checkNotNull(translator);
checkNotNull(transform);
// inputs are added recursively: if a parent has not been translated yet, it is translated first
// (e.g. the first transformation handled is the flat map, whose source input has not been translated yet)
final List<Collection<Integer>> allInputIds = getParentInputIds(transform.getInputs());
// the recursive call might have already transformed this transformation; if so, reuse the registered IDs
if (alreadyTransformed.containsKey(transform)) {
return alreadyTransformed.get(transform);
}
// determine the slot sharing group for this node
final String slotSharingGroup =
determineSlotSharingGroup(
transform.getSlotSharingGroup(),
allInputIds.stream()
.flatMap(Collection::stream)
.collect(Collectors.toList()));
final TransformationTranslator.Context context =
new ContextImpl(this, streamGraph, slotSharingGroup, configuration);
return shouldExecuteInBatchMode
? translator.translateForBatch(transform, context)
: translator.translateForStreaming(transform, context);
}
private List<Collection<Integer>> getParentInputIds(
@Nullable final Collection<Transformation<?>> parentTransformations) {
final List> allInputIds = new ArrayList<>();
if (parentTransformations == null) {
return allInputIds;
}
for (Transformation<?> transformation : parentTransformations) {
// the heart of this method: recurse back into transform() for every parent
allInputIds.add(transform(transformation));
}
return allInputIds;
}
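Putting transform(), translate() and getParentInputIds() together, here is my reconstruction of how the recursion plays out for the Source -> Flat Map -> Keyed Aggregation -> Sink job from the beginning (note that keyBy's PartitionTransformation takes id 3 but only becomes a virtual partition node, which is why the JSON jumps from node 2 to node 4 and shows HASH on the edge instead):

// transform(flatMap, id 2)
//   getParentInputIds -> transform(source, id 1)      -> node 1 added
//   node 2 added, edge 1 -> 2 with REBALANCE (parallelism 1 vs 8)
// transform(sum, id 4)
//   getParentInputIds -> transform(partition, id 3)   -> virtual node only, ship strategy HASH
//     getParentInputIds -> transform(flatMap, id 2)   -> already in alreadyTransformed, returns [2]
//   node 4 added, edge 2 -> 4 with HASH
// transform(sink, id 5)
//   getParentInputIds -> transform(sum, id 4)         -> already transformed, returns [4]
//   node 5 added, edge 4 -> 5 with FORWARD (same parallelism)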
// the translator eventually lands here to add the node itself
protected Collection<Integer> translateInternal(
final Transformation<OUT> transformation,
final StreamOperatorFactory<OUT> operatorFactory,
final TypeInformation<IN> inputType,
@Nullable final KeySelector<IN, ?> stateKeySelector,
@Nullable final TypeInformation<?> stateKeyType,
final Context context) {
checkNotNull(transformation);
checkNotNull(operatorFactory);
checkNotNull(inputType);
checkNotNull(context);
final StreamGraph streamGraph = context.getStreamGraph();
final String slotSharingGroup = context.getSlotSharingGroup();
final int transformationId = transformation.getId();
final ExecutionConfig executionConfig = streamGraph.getExecutionConfig();
streamGraph.addOperator(
transformationId,
slotSharingGroup,
transformation.getCoLocationGroupKey(),
operatorFactory,
inputType,
transformation.getOutputType(),
transformation.getName());
if (stateKeySelector != null) {
TypeSerializer<?> keySerializer = stateKeyType.createSerializer(executionConfig);
streamGraph.setOneInputStateKey(transformationId, stateKeySelector, keySerializer);
}
int parallelism =
transformation.getParallelism() != ExecutionConfig.PARALLELISM_DEFAULT
? transformation.getParallelism()
: executionConfig.getParallelism();
streamGraph.setParallelism(transformationId, parallelism);
streamGraph.setMaxParallelism(transformationId, transformation.getMaxParallelism());
final List<Transformation<?>> parentTransformations = transformation.getInputs();
checkState(
parentTransformations.size() == 1,
"Expected exactly one input transformation but found "
+ parentTransformations.size());
for (Integer inputId : context.getStreamNodeIds(parentTransformations.get(0))) {
streamGraph.addEdge(inputId, transformationId, 0);
}
return Collections.singleton(transformationId);
}
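As for where the ship strategies in the JSON come from: addEdge eventually reaches StreamGraph#addEdgeInternal, which, when no partitioner was set explicitly, chooses FORWARD for equal producer/consumer parallelism and REBALANCE otherwise, while keyBy installs a hash partitioner up front. A simplified sketch of that decision (not the literal Flink code):

if (partitioner == null && upstreamParallelism == downstreamParallelism) {
    partitioner = new ForwardPartitioner<>();   // e.g. Keyed Aggregation (8) -> Sink (8): FORWARD
} else if (partitioner == null) {
    partitioner = new RebalancePartitioner<>(); // e.g. Source (1) -> Flat Map (8): REBALANCE
}
// keyBy sets a KeyGroupStreamPartitioner explicitly, which shows up as HASH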
private <IN, OUT> void addOperator(
Integer vertexID,
@Nullable String slotSharingGroup,
@Nullable String coLocationGroup,
StreamOperatorFactory<OUT> operatorFactory,
TypeInformation<IN> inTypeInfo,
TypeInformation<OUT> outTypeInfo,
String operatorName,
Class<? extends AbstractInvokable> invokableClass) {
addNode(
vertexID,
slotSharingGroup,
coLocationGroup,
invokableClass,
operatorFactory,
operatorName);
setSerializers(vertexID, createSerializer(inTypeInfo), null, createSerializer(outTypeInfo));
// ... (remainder of the method omitted)
}
With the nodes and edges in place, the StreamGraph is complete; getStreamingPlanAsJSON() then delegates to JSONGenerator's getJSON() to produce the JSON string we started with:
public String getJSON() {
ObjectNode json = mapper.createObjectNode();
ArrayNode nodes = mapper.createArrayNode();
json.put("nodes", nodes);
List<Integer> operatorIDs = new ArrayList<>(streamGraph.getVertexIDs());
Comparator<Integer> operatorIDComparator =
Comparator.comparingInt(
(Integer id) -> streamGraph.getSinkIDs().contains(id) ? 1 : 0)
.thenComparingInt(id -> id);
operatorIDs.sort(operatorIDComparator);
visit(nodes, operatorIDs, new HashMap<>());
return json.toPrettyString();
}
That is the whole story of how Flink's execution plan is generated. You can paste the JSON into https://flink.apache.org/visualizer/ to render the topology and inspect your job's data flow. Any serious big-data engineer should be comfortable reading execution plans: Spark, Hive and Flink all expose them, and the plan is the quickest way to check whether the SQL or API code you wrote behaves the way you intended.