Introduction
A JobGraph can be seen as an optimized StreamGraph: operators that meet certain conditions are merged into operator chains, which reduces the cost of serializing/deserializing records and shipping them over the network between nodes.
Entry Point
As with StreamGraph generation, calling StreamGraph.getJobGraph() yields the corresponding JobGraph. Under the hood a StreamingJobGraphGenerator is created to build it: new StreamingJobGraphGenerator(streamGraph, jobID).createJobGraph().
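Before looking at the internals, here is a minimal sketch of triggering the generation by hand (the pipeline itself is illustrative; note that in recent Flink versions getStreamGraph() clears the environment's recorded transformations, so call it only once):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.fromElements(1, 2, 3).map(i -> i * 2).print();

StreamGraph streamGraph = env.getStreamGraph();
JobGraph jobGraph = streamGraph.getJobGraph(); // delegates to StreamingJobGraphGenerator
// one JobVertex per operator chain
jobGraph.getVertices().forEach(v -> System.out.println(v.getName()));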
private JobGraph createJobGraph() {
    // ...
    // Generate deterministic hashes for the nodes in order to identify them across
    // submission iff they didn't change.
    Map<Integer, byte[]> hashes =
            defaultStreamGraphHasher.traverseStreamGraphAndGenerateHashes(streamGraph);

    // Generate legacy version hashes for backwards compatibility
    List<Map<Integer, byte[]>> legacyHashes = new ArrayList<>(legacyStreamGraphHashers.size());
    for (StreamGraphHasher hasher : legacyStreamGraphHashers) {
        legacyHashes.add(hasher.traverseStreamGraphAndGenerateHashes(streamGraph));
    }

    setChaining(hashes, legacyHashes);
    // ...
}
The core is these two steps:
- traverseStreamGraphAndGenerateHashes generates a hash for each node as its unique identifier;
- setChaining optimizes the operator topology by chaining certain operators together, cutting serialization/deserialization and other network overhead.
traverseStreamGraphAndGenerateHashes
As mentioned in the StreamGraph article, the StreamNode IDs created while building the StreamGraph are derived from Transformation IDs, and Transformation IDs come from an ever-increasing static counter. This leads to the following situation: if, in the same process, we use the DataStream API to build two jobs A and B with exactly the same operator topology, their underlying Transformation IDs will be completely different, even though the two jobs are identical in terms of job and graph structure. We therefore need a separate ID scheme to identify operators: the Operator ID.
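To make the problem concrete, here is a toy model of the static counter; FakeTransformation and idCounter are hypothetical names, not Flink's actual fields:

// Toy model: one static counter shared by every job built in this JVM.
class FakeTransformation {
    private static int idCounter = 0;
    final int id = ++idCounter; // job A gets 1, 2, 3...; job B gets 4, 5, 6...
}
// Two structurally identical jobs thus end up with different Transformation IDs,
// so these IDs cannot identify "the same operator" across submissions.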
traverseStreamGraphAndGenerateHashes generates a hash for each node based on its position in the StreamGraph, and that hash serves as the node's identifier. By default, Flink uses StreamGraphHasherV2 to generate node hashes.
// The hash function used to generate the hash
final HashFunction hashFunction = Hashing.murmur3_128(0);
final Map<Integer, byte[]> hashes = new HashMap<>();
First, the method collects all sources of the StreamGraph. To guarantee that the same StreamGraph always yields the same hashes, the source IDs are sorted after they are collected.
We need to make the source order deterministic. The source IDs are not returned in the same order, which means that submitting the same program twice might result in different traversal, which breaks the deterministic hash assignment.
List<Integer> sources = new ArrayList<>();
for (Integer sourceNodeId : streamGraph.getSourceIDs()) {
    sources.add(sourceNodeId);
}
Collections.sort(sources);
StreamGraphHasherV2 then walks the nodes breadth-first using a queue:
- for each node polled from the queue, it tries to generate the hash; on success, all of that node's downstream nodes are added to the queue;
- on failure, the node is not ready yet (some of its upstream nodes have not been visited), so it simply drops out of the queue for now and will be re-enqueued once another of its upstream nodes is visited.
1     2
|     |
|     |
|     3
 \   /
  \ /
   4
In the example above, nodes 1 and 2 are both sources and are enqueued first, so the queue is [1, 2]. After the first round of traversal the hashes of nodes 1 and 2 are computed and their downstream nodes are enqueued in order, so the queue becomes [4, 3]. We then poll node 4 and try to compute its hash; as described above, this fails, so we enter the else branch and node 4 drops out of the queue (it is also removed from the visited set). Next we poll node 3, compute its hash, and enqueue its downstream node 4 again: [4]. In the next round node 4 is hashed again, and since all of its upstream nodes have now been visited, the hash is computed successfully.
//
// Traverse the graph in a breadth-first manner. Keep in mind that
// the graph is not a tree and multiple paths to nodes can exist.
//
Set<Integer> visited = new HashSet<>();
Queue<StreamNode> remaining = new ArrayDeque<>();

// Start with source nodes
for (Integer sourceNodeId : sources) {
    remaining.add(streamGraph.getStreamNode(sourceNodeId));
    visited.add(sourceNodeId);
}

StreamNode currentNode;
while ((currentNode = remaining.poll()) != null) {
    // Generate the hash code. Because multiple paths exist to each
    // node, we might not have all required inputs available to
    // generate the hash code.
    if (generateNodeHash(
            currentNode,
            hashFunction,
            hashes,
            streamGraph.isChainingEnabled(),
            streamGraph)) {
        // Add the child nodes
        for (StreamEdge outEdge : currentNode.getOutEdges()) {
            StreamNode child = streamGraph.getTargetVertex(outEdge);

            if (!visited.contains(child.getId())) {
                remaining.add(child);
                visited.add(child.getId());
            }
        }
    } else {
        // We will revisit this later.
        visited.remove(currentNode.getId());
    }
}
generateNodeHash shows the concrete hash computation: depending on whether the user specified a Transformation UID, it calls either generateDeterministicHash or generateUserSpecifiedHash, and stores the mapping from StreamNode ID to hash in hashes.
private boolean generateNodeHash(
        StreamNode node,
        HashFunction hashFunction,
        Map<Integer, byte[]> hashes,
        boolean isChainingEnabled,
        StreamGraph streamGraph) {

    // Check for user-specified ID
    String userSpecifiedHash = node.getTransformationUID();

    if (userSpecifiedHash == null) {
        // Check that all input nodes have their hashes computed
        for (StreamEdge inEdge : node.getInEdges()) {
            // If the input node has not been visited yet, the current
            // node will be visited again at a later point when all input
            // nodes have been visited and their hashes set.
            if (!hashes.containsKey(inEdge.getSourceId())) {
                return false;
            }
        }

        Hasher hasher = hashFunction.newHasher();
        byte[] hash =
                generateDeterministicHash(node, hasher, hashes, isChainingEnabled, streamGraph);

        if (hashes.put(node.getId(), hash) != null) {
            // Sanity check
            throw new IllegalStateException(
                    "Unexpected state. Tried to add node hash "
                            + "twice. This is probably a bug in the JobGraph generator.");
        }

        return true;
    } else {
        Hasher hasher = hashFunction.newHasher();
        byte[] hash = generateUserSpecifiedHash(node, hasher);

        for (byte[] previousHash : hashes.values()) {
            if (Arrays.equals(previousHash, hash)) {
                throw new IllegalArgumentException(
                        "Hash collision on user-specified ID "
                                + "\""
                                + userSpecifiedHash
                                + "\". "
                                + "Most likely cause is a non-unique ID. Please check that all IDs "
                                + "specified via `uid(String)` are unique.");
            }
        }

        if (hashes.put(node.getId(), hash) != null) {
            // Sanity check
            throw new IllegalStateException(
                    "Unexpected state. Tried to add node hash "
                            + "twice. This is probably a bug in the JobGraph generator.");
        }

        return true;
    }
}
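The practical upshot of the user-specified branch: assigning an explicit uid makes the OperatorID independent of the operator's position in the graph, which is what allows a restructured job to restore operator state from an old savepoint. A small illustration, where words is an assumed DataStream<String>:

// With an explicit uid, the operator keeps the same OperatorID even if the
// surrounding topology changes between job versions.
DataStream<String> upperCased = words
        .map(String::toUpperCase)
        .uid("to-upper"); // takes the user-specified-hash branch above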
generateDeterministicHash
The reason Flink feeds hashes.size() into the hash function (i.e. uses the node's traversal position in the StreamGraph as the hash input) is given in the comment:
Include stream node to hash. We use the current size of the computed hashes as the ID. We cannot use the node's ID, because it is assigned from a static counter. This will result in two identical programs having different hashes.
Using the StreamNode ID could cause two identical programs to produce different hashes.
Note that the node's hash also depends on the number of downstream nodes that can be chained to it, and is finally XOR-ed with the hash of each upstream node.
private byte[] generateDeterministicHash(
        StreamNode node,
        Hasher hasher,
        Map<Integer, byte[]> hashes,
        boolean isChainingEnabled,
        StreamGraph streamGraph) {

    // Include stream node to hash. We use the current size of the computed
    // hashes as the ID. We cannot use the node's ID, because it is
    // assigned from a static counter. This will result in two identical
    // programs having different hashes.
    // hasher.putInt(hashes.size())
    generateNodeLocalHash(hasher, hashes.size());

    // Include chained nodes to hash
    for (StreamEdge outEdge : node.getOutEdges()) {
        if (isChainable(outEdge, isChainingEnabled, streamGraph)) {
            // Use the hash size again, because the nodes are chained to
            // this node. This does not add a hash for the chained nodes.
            // hasher.putInt(hashes.size())
            generateNodeLocalHash(hasher, hashes.size());
        }
    }

    byte[] hash = hasher.hash().asBytes();

    // Make sure that all input nodes have their hash set before entering
    // this loop (calling this method).
    for (StreamEdge inEdge : node.getInEdges()) {
        byte[] otherHash = hashes.get(inEdge.getSourceId());

        // Sanity check
        if (otherHash == null) {
            throw new IllegalStateException(
                    "Missing hash for input node "
                            + streamGraph.getSourceVertex(inEdge)
                            + ". Cannot generate hash for "
                            + node
                            + ".");
        }

        for (int j = 0; j < hash.length; j++) {
            hash[j] = (byte) (hash[j] * 37 ^ otherHash[j]);
        }
    }

    // ...
    return hash;
}
generateUserSpecifiedHash
generateUserSpecifiedHash is much simpler than generateDeterministicHash: it just feeds the user-specified Transformation UID into the hash function to obtain the hash.
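Its body essentially boils down to the following (a sketch of the Flink source; the exact charset handling may differ between versions):

private byte[] generateUserSpecifiedHash(StreamNode node, Hasher hasher) {
    // Only the user-specified uid is fed into murmur3; graph position plays no role.
    hasher.putString(node.getTransformationUID(), StandardCharsets.UTF_8);
    return hasher.hash().asBytes();
}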
setChaining
The central task of building the JobGraph is merging operators into chains; the entry point is the setChaining call in createJobGraph():
private void setChaining(Map<Integer, byte[]> hashes, List<Map<Integer, byte[]>> legacyHashes) {
    // ...
}
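The elided body looks roughly like this (paraphrased from a Flink 1.12-era source tree; helper names such as buildChainedInputsAndGetHeadInputs may differ in other versions):

// Split out sources that run as chained inputs of another operator; the rest
// become chain entry points (heads).
final Map<Integer, OperatorChainInfo> chainEntryPoints =
        buildChainedInputsAndGetHeadInputs(hashes, legacyHashes);
final Collection<OperatorChainInfo> initialEntryPoints =
        chainEntryPoints.entrySet().stream()
                .sorted(Comparator.comparing(Map.Entry::getKey))
                .map(Map.Entry::getValue)
                .collect(Collectors.toList());

// Build a chain greedily from every entry point.
for (OperatorChainInfo info : initialEntryPoints) {
    createChain(
            info.getStartNodeId(),
            1, // operators start at position 1 because 0 is for chained source inputs
            info,
            chainEntryPoints);
}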
Flink first creates a chain head for every source of the StreamGraph (chainEntryPoints), then greedily grows a chain from each head; see createChain:
private List<StreamEdge> createChain(
        final Integer currentNodeId,
        final int chainIndex,
        final OperatorChainInfo chainInfo,
        final Map<Integer, OperatorChainInfo> chainEntryPoints) {

    Integer startNodeId = chainInfo.getStartNodeId();
    if (!builtVertices.contains(startNodeId)) {

        // All edges leading from the current chain to the next chain(s)
        List<StreamEdge> transitiveOutEdges = new ArrayList<>();

        List<StreamEdge> chainableOutputs = new ArrayList<>();
        List<StreamEdge> nonChainableOutputs = new ArrayList<>();

        StreamNode currentNode = streamGraph.getStreamNode(currentNodeId);

        // 1. For each out edge, check whether the current node and the downstream node
        // can be chained together (the exact conditions are covered below), and add the
        // StreamEdge to chainableOutputs or nonChainableOutputs accordingly.
        for (StreamEdge outEdge : currentNode.getOutEdges()) {
            if (isChainable(outEdge, streamGraph)) {
                chainableOutputs.add(outEdge);
            } else {
                nonChainableOutputs.add(outEdge);
            }
        }

        // 2. For chainable downstream nodes, recursively call createChain to add them
        // to the current chain.
        for (StreamEdge chainable : chainableOutputs) {
            transitiveOutEdges.addAll(
                    createChain(
                            chainable.getTargetId(),
                            chainIndex + 1,
                            chainInfo,
                            chainEntryPoints));
        }

        // 3. For non-chainable downstream nodes, start a new OperatorChainInfo, and add
        // the edge between the current node and the downstream node to transitiveOutEdges,
        // which is returned to the caller.
        for (StreamEdge nonChainable : nonChainableOutputs) {
            transitiveOutEdges.add(nonChainable);
            createChain(
                    nonChainable.getTargetId(),
                    1, // operators start at position 1 because 0 is for chained source inputs
                    chainEntryPoints.computeIfAbsent(
                            nonChainable.getTargetId(),
                            (k) -> chainInfo.newChain(nonChainable.getTargetId())),
                    chainEntryPoints);
        }

        chainedNames.put(
                currentNodeId,
                createChainedName(
                        currentNodeId,
                        chainableOutputs,
                        Optional.ofNullable(chainEntryPoints.get(currentNodeId))));
        chainedMinResources.put(
                currentNodeId, createChainedMinResources(currentNodeId, chainableOutputs));
        chainedPreferredResources.put(
                currentNodeId,
                createChainedPreferredResources(currentNodeId, chainableOutputs));

        // 4. Add the current node to the chain and create its OperatorID. The OperatorID
        // is the hash computed earlier, and it also becomes the ID of the JobVertex below.
        OperatorID currentOperatorId =
                chainInfo.addNodeToChain(currentNodeId, chainedNames.get(currentNodeId));

        if (currentNode.getInputFormat() != null) {
            getOrCreateFormatContainer(startNodeId)
                    .addInputFormat(currentOperatorId, currentNode.getInputFormat());
        }

        if (currentNode.getOutputFormat() != null) {
            getOrCreateFormatContainer(startNodeId)
                    .addOutputFormat(currentOperatorId, currentNode.getOutputFormat());
        }

        // 5. If the current node is the head of its chain, create the JobVertex and return
        // its configuration; otherwise create a fresh configuration.
        StreamConfig config =
                currentNodeId.equals(startNodeId)
                        ? createJobVertex(startNodeId, chainInfo)
                        : new StreamConfig(new Configuration());

        setVertexConfig(
                currentNodeId,
                config,
                chainableOutputs,
                nonChainableOutputs,
                chainInfo.getChainedSources());

        if (currentNodeId.equals(startNodeId)) {
            config.setChainStart();
            config.setChainIndex(chainIndex);
            config.setOperatorName(streamGraph.getStreamNode(currentNodeId).getOperatorName());

            // 6. Connect this JobVertex to the downstream JobVertices by creating an
            // IntermediateDataSet and a JobEdge for each transitive out edge: the
            // IntermediateDataSet is added to the upstream vertex's results, and the
            // JobEdge to the downstream vertex's inputs.
            for (StreamEdge edge : transitiveOutEdges) {
                connect(startNodeId, edge);
            }

            config.setOutEdgesInOrder(transitiveOutEdges);
            config.setTransitiveChainedTaskConfigs(chainedConfigs.get(startNodeId));

        } else {
            // 7. If the current node is not the head of its chain, record its
            // configuration in chainedConfigs.
            chainedConfigs.computeIfAbsent(
                    startNodeId, k -> new HashMap<Integer, StreamConfig>());

            config.setChainIndex(chainIndex);
            StreamNode node = streamGraph.getStreamNode(currentNodeId);
            config.setOperatorName(node.getOperatorName());
            chainedConfigs.get(startNodeId).put(currentNodeId, config);
        }

        config.setOperatorID(currentOperatorId);

        if (chainableOutputs.isEmpty()) {
            config.setChainEnd();
        }
        return transitiveOutEdges;

    } else {
        return new ArrayList<>();
    }
}
The chain-building flow is involved, but the idea is plain greedy construction. The method carries an OperatorChainInfo chainInfo describing the chain currently being built; the current node joins that chain, and each of its downstream nodes is tested for chainability. If a downstream node is chainable, createChain recurses with the same chainInfo; otherwise a new OperatorChainInfo is created as the downstream node's chain. The current node's information is then recorded into chainInfo, yielding its OperatorID, which is exactly the hash computed in traverseStreamGraphAndGenerateHashes (and, if the node is a chain head, also the JobVertex ID).
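To make the greedy chaining concrete, consider this small pipeline (a sketch, assuming a local environment and default settings):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.fromElements("a", "b", "c")   // source
        .map(String::toUpperCase) // forward edge, same parallelism -> chains with source
        .keyBy(s -> s)            // hash partitioner, not ForwardPartitioner -> breaks the chain
        .reduce((x, y) -> x + y)  // head of a new chain
        .print();                 // forward edge -> chains with reduce

// Resulting JobGraph: [source -> map] --JobEdge--> [reduce -> sink]
env.execute("chaining-demo");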
Finally, the method checks whether the current node is the head of its chain. If so, a JobVertex is created for it; otherwise the node's configuration is recorded in chainedConfigs for the chain head to pick up later. When the JobVertex is created, connections to the downstream JobVertices are also established: an IntermediateDataSet plus a JobEdge. Unlike an ordinary graph, JobVertices are connected as JobVertex(A) - IntermediateDataSet - JobEdge - JobVertex(B), as the demo below shows.
JobVertexA                                    JobVertexB
  results:                                      inputs:
    - [0]:                                        - [0]:
        consumer ---------- JobEdge ----------    - [1]:
    - [1]:                                        - [2]:
    - ...                                         - ...
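Concretely, connect() wires the two vertices through the JobGraph API, roughly as follows (a sketch; in the real method the DistributionPattern and ResultPartitionType are derived from the edge's partitioner and shuffle mode):

JobVertex headVertex = jobVertices.get(startNodeId);              // upstream chain head
JobVertex downStreamVertex = jobVertices.get(edge.getTargetId());

// Creates an IntermediateDataSet on headVertex and a JobEdge feeding downStreamVertex.
JobEdge jobEdge = downStreamVertex.connectNewDataSetAsInput(
        headVertex,
        DistributionPattern.ALL_TO_ALL,   // POINTWISE for forward/rescale partitioners
        ResultPartitionType.PIPELINED);   // BLOCKING for batch-style shuffles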
isChainable
public static boolean isChainable(StreamEdge edge, StreamGraph streamGraph) {
    StreamNode downStreamVertex = streamGraph.getTargetVertex(edge);

    return downStreamVertex.getInEdges().size() == 1 && isChainableInput(edge, streamGraph);
}

private static boolean isChainableInput(StreamEdge edge, StreamGraph streamGraph) {
    StreamNode upStreamVertex = streamGraph.getSourceVertex(edge);
    StreamNode downStreamVertex = streamGraph.getTargetVertex(edge);

    if (!(upStreamVertex.isSameSlotSharingGroup(downStreamVertex)
            && areOperatorsChainable(upStreamVertex, downStreamVertex, streamGraph)
            && (edge.getPartitioner() instanceof ForwardPartitioner)
            && edge.getShuffleMode() != ShuffleMode.BATCH
            && upStreamVertex.getParallelism() == downStreamVertex.getParallelism()
            && streamGraph.isChainingEnabled())) {

        return false;
    }

    // check that we do not have a union operation, because unions currently only work
    // through the network/byte-channel stack.
    // we check that by testing that each "type" (which means input position) is used only once
    for (StreamEdge inEdge : downStreamVertex.getInEdges()) {
        if (inEdge != edge && inEdge.getTypeNumber() == edge.getTypeNumber()) {
            return false;
        }
    }
    return true;
}
From these checks we can summarize the conditions that must all hold for two nodes to be chained together (a sketch of the corresponding DataStream API knobs follows the list):
- the downstream node has exactly one input, namely the upstream node;
- the upstream and downstream nodes are in the same slotSharingGroup (satisfied by default);
- the ChainingStrategy of the upstream and downstream operators must be compatible (for example, neither may be NEVER);
- the upstream node sends data to the downstream node through a ForwardPartitioner;
- the shuffle mode is not BATCH;
- the upstream and downstream operators have the same parallelism;
- chaining is enabled on the StreamGraph (org.apache.flink.streaming.api.environment.StreamExecutionEnvironment#disableOperatorChaining has not been called).
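All of these conditions can be influenced from the DataStream API. The following sketch shows the common knobs; stream and the MapX functions are assumed placeholders:

// env.disableOperatorChaining();  // flips the StreamGraph chaining flag to false

stream.map(new MapA())
        .startNewChain()            // still chainable upstream, but forces a new chain here
        .map(new MapB())
        .disableChaining()          // this operator never chains with its neighbors
        .map(new MapC())
        .slotSharingGroup("other")  // a different slot sharing group breaks the chain
        .map(new MapD())
        .setParallelism(2);         // a parallelism mismatch breaks the chain too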