- Window Join
Tumbling Window Join
Sliding Window Join
Session Window Join
- Interval Join
- CoGroup
Window Join and CoGroup
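- Window Join joins two streams over time windows.
- Compared with Join, CoGroup offers a more general way to process the elements of two streams that match within the same window. Join reuses CoGroup's implementation. Both are used as follows: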
// join
stream.join(otherStream)
    .where()
    .equalTo()
    .window()
    .apply()
// coGroup
stream.coGroup(otherStream)
    .where()
    .equalTo()
    .window()
    .apply()
The definitions of the JoinFunction and CoGroupFunction interfaces give a rough idea of how they differ:
public interface JoinFunction<IN1, IN2, OUT> extends Function, Serializable {
    OUT join(IN1 first, IN2 second) throws Exception;
}
public interface CoGroupFunction<IN1, IN2, O> extends Function, Serializable {
    void coGroup(Iterable<IN1> first, Iterable<IN2> second, Collector<O> out) throws Exception;
}
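As these definitions show, JoinFunction deals with each pair of elements from the two streams that match on key, whereas CoGroupFunction receives all elements from both streams that share the same key. JoinFunction therefore behaves like an INNER JOIN, while CoGroupFunction can implement an OUTER JOIN as well as an INNER JOIN.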
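To make the OUTER JOIN point concrete, here is a minimal plain-Java sketch with no Flink dependency. The class and method names are hypothetical; the method only mirrors the shape of CoGroupFunction.coGroup(), where all left and right elements for one key and one window arrive together. Because the function sees both groups at once, it can notice that the right side is empty and still emit the left element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java sketch, not Flink API. All names here are hypothetical.
class CoGroupOuterJoinSketch {

    // Receives every left and right element for one key, like coGroup().
    static List<String> leftOuterJoin(Iterable<String> first, Iterable<Integer> second) {
        List<String> out = new ArrayList<>();
        for (String left : first) {
            boolean matched = false;
            for (Integer right : second) {
                out.add(left + "," + right);
                matched = true;
            }
            if (!matched) {
                // No right-side element: emit the left element padded with null.
                out.add(left + ",null");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(leftOuterJoin(Arrays.asList("a", "b"), Arrays.asList(1, 2)));
        System.out.println(leftOuterJoin(Arrays.asList("a"), Collections.<Integer>emptyList()));
    }
}
```

A JoinFunction could not express the second call, since it is only invoked for pairs that actually exist.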
Window Join comes in three forms: Tumbling Window Join, Sliding Window Join, and Session Window Join. Window-based joins work by buffering data in window state and performing the join when the window fires.
public class JoinedStreams<T1, T2> {
    public static class WithWindow<T1, T2, KEY, W extends Window> {
        public <T> DataStream<T> apply(JoinFunction<T1, T2, T> function, TypeInformation<T> resultType) {
            // clean the closure
            function = input1.getExecutionEnvironment().clean(function);
            // The join is translated into a CoGroup
            coGroupedWindowedStream = input1.coGroup(input2)
                .where(keySelector1)
                .equalTo(keySelector2)
                .window(windowAssigner)
                .trigger(trigger)
                .evictor(evictor)
                .allowedLateness(allowedLateness);
            // The JoinFunction is wrapped in a CoGroupFunction
            return coGroupedWindowedStream
                .apply(new JoinCoGroupFunction<>(function), resultType);
        }
    }
    /**
     * CoGroup function that does a nested-loop join to get the join result.
     */
    private static class JoinCoGroupFunction<T1, T2, T>
            extends WrappingFunction<JoinFunction<T1, T2, T>>
            implements CoGroupFunction<T1, T2, T> {
        private static final long serialVersionUID = 1L;
        public JoinCoGroupFunction(JoinFunction<T1, T2, T> wrappedFunction) {
            super(wrappedFunction);
        }
        @Override
        public void coGroup(Iterable<T1> first, Iterable<T2> second, Collector<T> out) throws Exception {
            for (T1 val1: first) {
                for (T2 val2: second) {
                    // Each matching pair of elements
                    out.collect(wrappedFunction.join(val1, val2));
                }
            }
        }
    }
}
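So how does CoGroup operate on two streams? Flink applies a transformation that turns the two streams into one. After the transformation, every record carries a tag that records whether it came from the left or the right stream, so the window operation works exactly as it does for a single stream. When the window fires, the elements in the window are split by tag into a left group and a right group and passed to the CoGroupFunction: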
public class CoGroupedStreams<T1, T2> {
    public static class WithWindow<T1, T2, KEY, W extends Window> {
        public <T> DataStream<T> apply(CoGroupFunction<T1, T2, T> function, TypeInformation<T> resultType) {
            // clean the closure
            function = input1.getExecutionEnvironment().clean(function);
            UnionTypeInfo<T1, T2> unionType = new UnionTypeInfo<>(input1.getType(), input2.getType());
            UnionKeySelector<T1, T2, KEY> unionKeySelector = new UnionKeySelector<>(keySelector1, keySelector2);
            DataStream<TaggedUnion<T1, T2>> taggedInput1 = input1
                .map(new Input1Tagger<T1, T2>())
                .setParallelism(input1.getParallelism())
                .returns(unionType); // left stream
            DataStream<TaggedUnion<T1, T2>> taggedInput2 = input2
                .map(new Input2Tagger<T1, T2>())
                .setParallelism(input2.getParallelism())
                .returns(unionType); // right stream
            // Merge into a single stream
            DataStream<TaggedUnion<T1, T2>> unionStream = taggedInput1.union(taggedInput2);
            windowedStream =
                new KeyedStream<TaggedUnion<T1, T2>, KEY>(unionStream, unionKeySelector, keyType)
                    .window(windowAssigner);
            if (trigger != null) {
                windowedStream.trigger(trigger);
            }
            if (evictor != null) {
                windowedStream.evictor(evictor);
            }
            if (allowedLateness != null) {
                windowedStream.allowedLateness(allowedLateness);
            }
            return windowedStream.apply(new CoGroupWindowFunction<T1, T2, T, KEY, W>(function), resultType);
        }
    }
    // Wraps the CoGroupFunction in a WindowFunction
    private static class CoGroupWindowFunction<T1, T2, T, KEY, W extends Window>
            extends WrappingFunction<CoGroupFunction<T1, T2, T>>
            implements WindowFunction<TaggedUnion<T1, T2>, T, KEY, W> {
        public CoGroupWindowFunction(CoGroupFunction<T1, T2, T> userFunction) {
            super(userFunction);
        }
        @Override
        public void apply(KEY key,
                W window,
                Iterable<TaggedUnion<T1, T2>> values,
                Collector<T> out) throws Exception {
            List<T1> oneValues = new ArrayList<>();
            List<T2> twoValues = new ArrayList<>();
            // Split all elements in the window by tag into a left group and a right group
            for (TaggedUnion<T1, T2> val: values) {
                if (val.isOne()) {
                    oneValues.add(val.getOne());
                } else {
                    twoValues.add(val.getTwo());
                }
            }
            // Invoke the CoGroupFunction
            wrappedFunction.coGroup(oneValues, twoValues, out);
        }
    }
}
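The tag, union, and split steps described above can be tried outside Flink. The following is a minimal plain-Java sketch under stated assumptions: Tagged is a hypothetical stand-in for Flink's TaggedUnion, and plain lists stand in for the streams and the window contents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch, not Flink API. All names here are hypothetical.
class TaggedUnionSketch {

    // Stand-in for TaggedUnion: exactly one of the two fields carries a value.
    static final class Tagged<A, B> {
        final A one;
        final B two;
        final boolean isOne;

        private Tagged(A one, B two, boolean isOne) {
            this.one = one;
            this.two = two;
            this.isOne = isOne;
        }

        static <A, B> Tagged<A, B> one(A value) {
            return new Tagged<>(value, null, true);
        }

        static <A, B> Tagged<A, B> two(B value) {
            return new Tagged<>(null, value, false);
        }
    }

    // Tag both inputs and merge them into one sequence of a single type.
    static <A, B> List<Tagged<A, B>> union(List<A> left, List<B> right) {
        List<Tagged<A, B>> merged = new ArrayList<>();
        for (A a : left) {
            merged.add(Tagged.<A, B>one(a));
        }
        for (B b : right) {
            merged.add(Tagged.<A, B>two(b));
        }
        return merged;
    }

    // Split the merged elements back into a left group and a right group by tag.
    static <A, B> void split(List<Tagged<A, B>> values, List<A> oneValues, List<B> twoValues) {
        for (Tagged<A, B> value : values) {
            if (value.isOne) {
                oneValues.add(value.one);
            } else {
                twoValues.add(value.two);
            }
        }
    }

    public static void main(String[] args) {
        List<Tagged<String, Integer>> merged = union(Arrays.asList("x", "y"), Arrays.asList(1));
        List<String> lefts = new ArrayList<>();
        List<Integer> rights = new ArrayList<>();
        split(merged, lefts, rights);
        System.out.println(lefts + " " + rights);
    }
}
```

Merging into one element type is what lets the existing single-stream window machinery be reused unchanged.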
Connected Streams
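Window Join makes it easy to join two streams. In some scenarios, however, what we need is not a join, and ConnectedStreams provides a more general way to operate on two streams.
ConnectedStreams is used together with CoProcessFunction or KeyedCoProcessFunction. KeyedCoProcessFunction requires both connected streams to be KeyedStreams with the same key type.
ConnectedStreams combined with CoProcessFunction produces a CoProcessOperator, which is scheduled at runtime as a TwoInputStreamTask. As the name suggests, this task processes two inputs. Here is a brief look at the implementation of CoProcessOperator: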
public class CoProcessOperator<IN1, IN2, OUT>
        extends AbstractUdfStreamOperator<OUT, CoProcessFunction<IN1, IN2, OUT>>
        implements TwoInputStreamOperator<IN1, IN2, OUT> {
    @Override
    public void processElement1(StreamRecord<IN1> element) throws Exception {
        collector.setTimestamp(element);
        context.element = element;
        userFunction.processElement1(element.getValue(), context, collector);
        context.element = null;
    }
    @Override
    public void processElement2(StreamRecord<IN2> element) throws Exception {
        collector.setTimestamp(element);
        context.element = element;
        userFunction.processElement2(element.getValue(), context, collector);
        context.element = null;
    }
}
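CoProcessOperator handles the two streams separately, calling CoProcessFunction.processElement1() and CoProcessFunction.processElement2() respectively. KeyedCoProcessOperator works in a similar way.
With state shared internally between the two inputs, many complex operations can be implemented on two streams. Next we look at Interval Join, another two-stream join that Flink builds on Connected Streams.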
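As an illustration of what shared state between two inputs makes possible, here is a minimal plain-Java simulation, not Flink API. The class is hypothetical and only borrows the processElement1/processElement2 naming: the second input acts as a control stream that updates a threshold, and the first input is filtered against it. In a real Flink job the shared field would normally be kept in managed state rather than in a plain member variable.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation, not Flink API. All names here are hypothetical.
class TwoInputSharedStateSketch {

    // State visible to both inputs; nothing passes until a threshold arrives.
    private int threshold = Integer.MAX_VALUE;
    private final List<Integer> output = new ArrayList<>();

    // First input: data records, filtered against the current threshold.
    void processElement1(int value) {
        if (value >= threshold) {
            output.add(value);
        }
    }

    // Second input: control records that update the shared threshold.
    void processElement2(int newThreshold) {
        threshold = newThreshold;
    }

    List<Integer> getOutput() {
        return output;
    }

    public static void main(String[] args) {
        TwoInputSharedStateSketch op = new TwoInputSharedStateSketch();
        op.processElement1(5); // dropped: no threshold has been set yet
        op.processElement2(3); // control record sets the threshold to 3
        op.processElement1(5); // emitted
        op.processElement1(2); // dropped: below the threshold
        System.out.println(op.getOutput());
    }
}
```

This is not a join at all, which is the kind of two-stream logic that Window Join cannot express.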
Interval Join
By default the lower and upper bounds are inclusive; applying .lowerBoundExclusive() or .upperBoundExclusive() excludes the corresponding bound.
stream
.keyBy()
.intervalJoin(otherStream.keyBy())
.between(
Interval Join is implemented on top of ConnectedStreams:
public class KeyedStream<T, KEY> extends DataStream<T> {
    public static class IntervalJoined<IN1, IN2, KEY> {
        public <OUT> SingleOutputStreamOperator<OUT> process(
                ProcessJoinFunction<IN1, IN2, OUT> processJoinFunction,
                TypeInformation<OUT> outputType) {
            Preconditions.checkNotNull(processJoinFunction);
            Preconditions.checkNotNull(outputType);
            final ProcessJoinFunction<IN1, IN2, OUT> cleanedUdf = left.getExecutionEnvironment().clean(processJoinFunction);
            final IntervalJoinOperator<KEY, IN1, IN2, OUT> operator =
                new IntervalJoinOperator<>(
                    lowerBound,
                    upperBound,
                    lowerBoundInclusive,
                    upperBoundInclusive,
                    left.getType().createSerializer(left.getExecutionConfig()),
                    right.getType().createSerializer(right.getExecutionConfig()),
                    cleanedUdf
                );
            return left
                .connect(right)
                .keyBy(keySelector1, keySelector2)
                .transform("Interval Join", outputType, operator);
        }
    }
}
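IntervalJoinOperator uses two MapState instances to store the records that have arrived on each stream; the MapState key is the record's timestamp. When a new record arrives on one stream, the operator looks in the other stream's state for records whose timestamps fall within the matching range and joins them. Each record also registers a timer that removes it from state once time has passed beyond the record's valid range.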
public class IntervalJoinOperator<K, T1, T2, OUT>
        extends AbstractUdfStreamOperator<OUT, ProcessJoinFunction<T1, T2, OUT>>
        implements TwoInputStreamOperator<T1, T2, OUT>, Triggerable<K, String> {
    // State buffer for the left stream
    private transient MapState<Long, List<BufferEntry<T1>>> leftBuffer;
    // State buffer for the right stream
    private transient MapState<Long, List<BufferEntry<T2>>> rightBuffer;
    @Override
    public void processElement1(StreamRecord<T1> record) throws Exception {
        // Process a left-stream element; the last argument of processElement
        // marks whether the element comes from the left stream
        processElement(record, leftBuffer, rightBuffer, lowerBound, upperBound, true);
    }
    @Override
    public void processElement2(StreamRecord<T2> record) throws Exception {
        // Process a right-stream element; the bounds are negated and swapped
        processElement(record, rightBuffer, leftBuffer, -upperBound, -lowerBound, false);
    }
    private <THIS, OTHER> void processElement(
            final StreamRecord<THIS> record,
            final MapState<Long, List<BufferEntry<THIS>>> ourBuffer,
            final MapState<Long, List<BufferEntry<OTHER>>> otherBuffer,
            final long relativeLowerBound,
            final long relativeUpperBound,
            final boolean isLeft) throws Exception {
        final THIS ourValue = record.getValue();
        // Event-time timestamp of the record
        final long ourTimestamp = record.getTimestamp();
        if (ourTimestamp == Long.MIN_VALUE) {
            throw new FlinkException("Long.MIN_VALUE timestamp: Elements used in " +
                "interval stream joins need to have timestamps meaningful timestamps.");
        }
        // Drop the record if its event time is earlier than the current watermark
        if (isLate(ourTimestamp)) {
            return;
        }
        // Add the record to state; the MapState key is the record's timestamp
        addToBuffer(ourBuffer, ourValue, ourTimestamp);
        // Look for matching records in the other stream's state by iterating over its MapState
        for (Map.Entry<Long, List<BufferEntry<OTHER>>> bucket: otherBuffer.entries()) {
            final long timestamp = bucket.getKey();
            // Skip the bucket unless
            // ourTimestamp + relativeLowerBound <= timestamp <= ourTimestamp + relativeUpperBound
            if (timestamp < ourTimestamp + relativeLowerBound ||
                    timestamp > ourTimestamp + relativeUpperBound) {
                continue;
            }
            // Emit every entry in the bucket downstream
            for (BufferEntry<OTHER> entry: bucket.getValue()) {
                if (isLeft) {
                    collect((T1) ourValue, (T2) entry.element, ourTimestamp, timestamp);
                } else {
                    collect((T1) entry.element, (T2) ourValue, timestamp, ourTimestamp);
                }
            }
        }
        // Register the state cleanup timer; it fires once the watermark passes cleanupTime
        long cleanupTime = (relativeUpperBound > 0L) ? ourTimestamp + relativeUpperBound : ourTimestamp;
        if (isLeft) {
            internalTimerService.registerEventTimeTimer(CLEANUP_NAMESPACE_LEFT, cleanupTime);
        } else {
            internalTimerService.registerEventTimeTimer(CLEANUP_NAMESPACE_RIGHT, cleanupTime);
        }
    }
    // Callback invoked when a timer fires
    @Override
    public void onEventTime(InternalTimer<K, String> timer) throws Exception {
        long timerTimestamp = timer.getTimestamp();
        String namespace = timer.getNamespace();
        logger.trace("onEventTime @ {}", timerTimestamp);
        // The namespace tells whether the timer belongs to left or right state.
        // The cleanup logic differs between the two sides because the bounds
        // are defined relative to the left stream.
        switch (namespace) {
            case CLEANUP_NAMESPACE_LEFT: {
                // Left side: with a positive upperBound the timer was registered at
                // timestamp + upperBound, so subtract it to recover the buffer key
                long timestamp = (upperBound <= 0L) ? timerTimestamp : timerTimestamp - upperBound;
                logger.trace("Removing from left buffer @ {}", timestamp);
                leftBuffer.remove(timestamp);
                break;
            }
            case CLEANUP_NAMESPACE_RIGHT: {
                // Right side: with a non-negative lowerBound there is nothing to wait for,
                // and the entry is removed once the watermark passes its timestamp
                long timestamp = (lowerBound <= 0L) ? timerTimestamp + lowerBound : timerTimestamp;
                logger.trace("Removing from right buffer @ {}", timestamp);
                rightBuffer.remove(timestamp);
                break;
            }
            default:
                throw new RuntimeException("Invalid namespace " + namespace);
        }
    }
}
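The matching rule can be tried in isolation with a minimal plain-Java sketch, not Flink API. It assumes a single key and inclusive bounds, and it omits watermarks, lateness checks, and cleanup timers. The class name is hypothetical, and a TreeMap range lookup is swapped in for the full iteration over MapState entries used by the operator. A left record at time t matches right records with timestamps in [t + lowerBound, t + upperBound]; a right record uses the negated, swapped bounds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch for a single key, not Flink API. All names here are hypothetical.
class IntervalJoinSketch {
    private final long lowerBound;
    private final long upperBound;
    // Buffers keyed by timestamp, one per side.
    private final TreeMap<Long, List<String>> leftBuffer = new TreeMap<>();
    private final TreeMap<Long, List<String>> rightBuffer = new TreeMap<>();
    private final List<String> output = new ArrayList<>();

    IntervalJoinSketch(long lowerBound, long upperBound) {
        if (lowerBound > upperBound) {
            throw new IllegalArgumentException("lowerBound must not exceed upperBound");
        }
        this.lowerBound = lowerBound;
        this.upperBound = upperBound;
    }

    void processLeft(String value, long timestamp) {
        process(value, timestamp, leftBuffer, rightBuffer, lowerBound, upperBound, true);
    }

    void processRight(String value, long timestamp) {
        // Seen from the right side, the bounds are negated and swapped.
        process(value, timestamp, rightBuffer, leftBuffer, -upperBound, -lowerBound, false);
    }

    private void process(String value, long timestamp,
            TreeMap<Long, List<String>> ours, TreeMap<Long, List<String>> others,
            long relativeLower, long relativeUpper, boolean isLeft) {
        // Buffer the record under its own timestamp.
        ours.computeIfAbsent(timestamp, k -> new ArrayList<>()).add(value);
        // Visit only the buckets of the other side inside the inclusive range.
        Map<Long, List<String>> range =
            others.subMap(timestamp + relativeLower, true, timestamp + relativeUpper, true);
        for (List<String> bucket : range.values()) {
            for (String other : bucket) {
                output.add(isLeft ? value + "-" + other : other + "-" + value);
            }
        }
    }

    List<String> getOutput() {
        return output;
    }

    public static void main(String[] args) {
        IntervalJoinSketch join = new IntervalJoinSketch(-2, 1);
        join.processLeft("L5", 5);
        join.processRight("R3", 3); // 3 lies in [5 - 2, 5 + 1], so it joins with L5
        join.processRight("R7", 7); // 7 lies outside [3, 6], no match yet
        join.processLeft("L6", 6);  // [6 - 2, 6 + 1] contains 7, so it joins with R7
        System.out.println(join.getOutput());
    }
}
```

Without the cleanup timers the buffers in this sketch grow without bound, which is exactly what the timers registered in processElement prevent in the real operator.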