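The main ways to join two streams with the Flink DataStream API are the window join (join / coGroup), connect, and the interval join. This article walks through how each one is used and how it is implemented, based on the source code.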
-
Union
1.1 Usage
union merges two or more streams into one. The caller must make sure that all of the streams have the same element type.
stream
.union(otherStream)
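1.2 How it works
union creates a new UnionTransformation whose inputs are the Transformations of the left and right DataStreams. If the element types of the DataStreams differ, it throws an IllegalArgumentException.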
public final DataStream<T> union(DataStream<T>... streams) {
    List<Transformation<T>> unionedTransforms = new ArrayList<>();
    // The current stream's transformation is always the first input.
    unionedTransforms.add(this.transformation);
    for (DataStream<T> newStream : streams) {
        // Every unioned stream must have the same element type as this one.
        if (!getType().equals(newStream.getType())) {
            throw new IllegalArgumentException(
                    "Cannot union streams of different types: "
                            + getType()
                            + " and "
                            + newStream.getType());
        }
        unionedTransforms.add(newStream.getTransformation());
    }
    return new DataStream<>(this.environment, new UnionTransformation<>(unionedTransforms));
}
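The behavior of union can be modelled without a Flink dependency. The following is a minimal plain-Java sketch, not Flink code: UnionSketch, Stream, and source are hypothetical names, and a stream is reduced to an element type plus the names of the inputs it merges. It only illustrates the two things the method above does: reject mismatched types and collect all inputs into one list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java model of the check in DataStream.union; all names here are hypothetical.
final class UnionSketch {
    // Stand-in for a DataStream: an element type plus the names of the inputs it merges.
    static final class Stream {
        final Class<?> type;
        final List<String> inputs;

        private Stream(Class<?> type, List<String> inputs) {
            this.type = type;
            this.inputs = inputs;
        }

        static Stream source(String name, Class<?> type) {
            return new Stream(type, Collections.singletonList(name));
        }

        Stream union(Stream... others) {
            List<String> merged = new ArrayList<>(inputs);
            for (Stream other : others) {
                // Same rule as in the source above: element types must match exactly.
                if (!type.equals(other.type)) {
                    throw new IllegalArgumentException(
                            "Cannot union streams of different types: " + type + " and " + other.type);
                }
                merged.addAll(other.inputs);
            }
            return new Stream(type, merged);
        }
    }

    public static void main(String[] args) {
        Stream merged =
                Stream.source("left", String.class).union(Stream.source("right", String.class));
        System.out.println(merged.inputs); // prints [left, right]
    }
}
```

Unioning a String stream with an Integer stream in this model throws the same kind of IllegalArgumentException as the real method.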
-
Cogroup
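The user supplies the join keys, a window, and a CoGroupFunction. Records from the left and right streams that share the same key and fall into the same window are processed together: the CoGroupFunction receives all left-stream records and all right-stream records of the current window.
1.1 Usage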
stream.coGroup(otherStream)
.where()
.equalTo()
.window()
.apply()
public interface CoGroupFunction<IN1, IN2, O> extends Function, Serializable {
    void coGroup(Iterable<IN1> first, Iterable<IN2> second, Collector<O> out) throws Exception;
}
1.2 How it works
1.2.1 Evolution of the DataStream object
Calling coGroup on a DataStream creates a CoGroupedStreams.
public <T2> CoGroupedStreams<T, T2> coGroup(DataStream<T2> otherStream) {
    return new CoGroupedStreams<>(this, otherStream);
}
Calling where on CoGroupedStreams creates a CoGroupedStreams.Where (Where below).
Calling equalTo on Where creates a CoGroupedStreams.Where.EqualTo (EqualTo below).
Calling window on EqualTo creates a CoGroupedStreams.WithWindow (WithWindow below). The optional trigger, evictor, and allowedLateness methods each return a new WithWindow, and apply finally returns a DataStream.
Conclusion: coGroup is built on union and window. For the window implementation, see "Flink Source Code Reading (3): Timer & WaterMark & Window".
public class CoGroupedStreams {
public Where where(KeySelector keySelector) {
Preconditions.checkNotNull(keySelector);
final TypeInformation keyType = TypeExtractor.getKeySelectorTypes(keySelector, input1.getType());
return where(keySelector, keyType);
}
public Where where(KeySelector keySelector, TypeInformation keyType) {
Preconditions.checkNotNull(keySelector);
Preconditions.checkNotNull(keyType);
return new Where<>(input1.clean(keySelector), keyType);
}
public class Where {
private final KeySelector keySelector1;
private final TypeInformation keyType;
Where(KeySelector keySelector1, TypeInformation keyType) {
this.keySelector1 = keySelector1;
this.keyType = keyType;
}
public EqualTo equalTo(KeySelector keySelector) {
Preconditions.checkNotNull(keySelector);
final TypeInformation otherKey =
TypeExtractor.getKeySelectorTypes(keySelector, input2.getType());
return equalTo(keySelector, otherKey);
}
public EqualTo equalTo(KeySelector keySelector, TypeInformation keyType) {
return new EqualTo(input2.clean(keySelector));
}
@Public
public class EqualTo {
private final KeySelector keySelector2;
EqualTo(KeySelector keySelector2) {
this.keySelector2 = requireNonNull(keySelector2);
}
@PublicEvolving
public <W extends Window> WithWindow<T1, T2, KEY, W> window(
WindowAssigner<? super TaggedUnion<T1, T2>, W> assigner) {
return new WithWindow<>(
input1,
input2,
keySelector1,
keySelector2,
keyType,
assigner,
null,
null,
null);
}
}
}
}
public class CoGroupedStreams {
public static class WithWindow {
@PublicEvolving
public WithWindow<T1, T2, KEY, W> trigger(
Trigger<? super TaggedUnion<T1, T2>, ? super W> newTrigger) {
return new WithWindow<>(
input1,
input2,
keySelector1,
keySelector2,
keyType,
windowAssigner,
newTrigger,
evictor,
allowedLateness);
}
@PublicEvolving
public WithWindow<T1, T2, KEY, W> evictor(
Evictor<? super TaggedUnion<T1, T2>, ? super W> newEvictor) {
return new WithWindow<>(
input1,
input2,
keySelector1,
keySelector2,
keyType,
windowAssigner,
trigger,
newEvictor,
allowedLateness);
}
@PublicEvolving
public WithWindow allowedLateness(Time newLateness) {
return new WithWindow<>(
input1,
input2,
keySelector1,
keySelector2,
keyType,
windowAssigner,
trigger,
evictor,
newLateness);
}
public DataStream apply(CoGroupFunction function) {
TypeInformation resultType =
TypeExtractor.getCoGroupReturnTypes(
function, input1.getType(), input2.getType(), "CoGroup", false);
return apply(function, resultType);
}
public DataStream apply(
CoGroupFunction function, TypeInformation resultType) {
function = input1.getExecutionEnvironment().clean(function);
UnionTypeInfo<T1, T2> unionType =
new UnionTypeInfo<>(input1.getType(), input2.getType());
UnionKeySelector<T1, T2, KEY> unionKeySelector =
new UnionKeySelector<>(keySelector1, keySelector2);
// Tag every record with its side so that both inputs share one type and can be unioned.
DataStream<TaggedUnion<T1, T2>> taggedInput1 =
input1.map(new Input1Tagger<T1, T2>())
.setParallelism(input1.getParallelism())
.returns(unionType);
DataStream<TaggedUnion<T1, T2>> taggedInput2 =
input2.map(new Input2Tagger<T1, T2>())
.setParallelism(input2.getParallelism())
.returns(unionType);
DataStream<TaggedUnion<T1, T2>> unionStream = taggedInput1.union(taggedInput2);
// Key the unioned stream by the join key and apply the window assigner.
windowedStream =
new KeyedStream<TaggedUnion<T1, T2>, KEY>(unionStream, unionKeySelector, keyType)
.window(windowAssigner);
if (trigger != null) {
windowedStream.trigger(trigger);
}
if (evictor != null) {
windowedStream.evictor(evictor);
}
if (allowedLateness != null) {
windowedStream.allowedLateness(allowedLateness);
}
return windowedStream.apply(new CoGroupWindowFunction(function), resultType);
}
}
public class CoGroupedStreams<T1, T2> {
    private static class CoGroupWindowFunction<T1, T2, T, KEY, W extends Window>
            extends WrappingFunction<CoGroupFunction<T1, T2, T>>
            implements WindowFunction<TaggedUnion<T1, T2>, T, KEY, W> {
        private static final long serialVersionUID = 1L;
        public CoGroupWindowFunction(CoGroupFunction<T1, T2, T> userFunction) {
            super(userFunction);
        }
        @Override
        public void apply(KEY key, W window, Iterable<TaggedUnion<T1, T2>> values, Collector<T> out)
                throws Exception {
            // Split the window contents back into left and right records by their tag.
            List<T1> oneValues = new ArrayList<>();
            List<T2> twoValues = new ArrayList<>();
            for (TaggedUnion<T1, T2> val : values) {
                if (val.isOne()) {
                    oneValues.add(val.getOne());
                } else {
                    twoValues.add(val.getTwo());
                }
            }
            // Hand both sides of this key/window to the user's CoGroupFunction.
            wrappedFunction.coGroup(oneValues, twoValues, out);
        }
    }
}
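The tag, union, key, window, and split steps above can be replayed in plain Java. The sketch below is an illustration under simplifying assumptions, not Flink code: CoGroupSketch, Event, and Tagged are hypothetical names, the window is a fixed-size tumbling window over non-negative timestamps, and each pane is rendered as a string instead of being passed to a CoGroupFunction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java replay of the coGroup pipeline; all names are hypothetical.
final class CoGroupSketch {
    // Hypothetical record: join key, event-time timestamp in ms, payload.
    static final class Event {
        final String key;
        final long ts;
        final String value;

        Event(String key, long ts, String value) {
            this.key = key;
            this.ts = ts;
            this.value = value;
        }
    }

    // Stand-in for TaggedUnion: exactly one of the two sides is set.
    static final class Tagged {
        final Event one;
        final Event two;

        Tagged(Event one, Event two) {
            this.one = one;
            this.two = two;
        }

        boolean isOne() {
            return one != null;
        }
    }

    // Returns one line per (key, window) pane: "key@windowStart left=[...] right=[...]".
    static List<String> coGroup(List<Event> left, List<Event> right, long windowSize) {
        // Step 1: tag both inputs and union them into a single stream.
        List<Tagged> union = new ArrayList<>();
        for (Event e : left) {
            union.add(new Tagged(e, null));
        }
        for (Event e : right) {
            union.add(new Tagged(null, e));
        }
        // Step 2: key by the join key and assign each record to a tumbling window.
        Map<String, List<Tagged>> panes = new TreeMap<>();
        for (Tagged t : union) {
            Event e = t.isOne() ? t.one : t.two;
            long windowStart = e.ts - (e.ts % windowSize);
            panes.computeIfAbsent(e.key + "@" + windowStart, k -> new ArrayList<>()).add(t);
        }
        // Step 3: per pane, split by tag, as CoGroupWindowFunction.apply does.
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> pane : panes.entrySet()) {
            List<String> ones = new ArrayList<>();
            List<String> twos = new ArrayList<>();
            for (Tagged t : pane.getValue()) {
                if (t.isOne()) {
                    ones.add(t.one.value);
                } else {
                    twos.add(t.two.value);
                }
            }
            out.add(pane.getKey() + " left=" + ones + " right=" + twos);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> left = List.of(new Event("a", 1, "l1"), new Event("a", 12, "l2"));
        List<Event> right = List.of(new Event("a", 3, "r1"), new Event("b", 4, "r2"));
        coGroup(left, right, 10).forEach(System.out::println);
    }
}
```

Note that a pane is produced even when one side is empty, which is why coGroup can express outer-join style logic.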
1.2.2 Evolution of the Transformation object
UnionTransformation --> PartitionTransformation --> OneInputTransformation(SimpleOperatorFactory(WindowOperator(
InternalIterableWindowFunction(
WindowFunction(CoGroupFunction)))/EvictingWindowOperator))
EvictingWindowOperator takes the place of WindowOperator when an evictor is configured.
-
Join
2.1 Usage
stream.join(otherStream)
.where()
.equalTo()
.window()
.apply()
2.2 How it works
2.2.1 Evolution of the DataStream object
Calling join on a DataStream creates a JoinedStreams.
public <T2> JoinedStreams<T, T2> join(DataStream<T2> otherStream) {
    return new JoinedStreams<>(this, otherStream);
}
Calling where on JoinedStreams creates a JoinedStreams.Where (Where below).
Calling equalTo on Where creates a JoinedStreams.Where.EqualTo (EqualTo below).
Calling window on EqualTo creates a JoinedStreams.WithWindow (WithWindow below). The optional trigger, evictor, and allowedLateness methods each return a new WithWindow, and apply finally returns a DataStream.
Conclusion: join is built on coGroup. For the coGroup implementation, see section 1.2 of the CoGroup chapter above.
public class JoinedStreams {
public Where where(KeySelector keySelector) {
requireNonNull(keySelector);
final TypeInformation keyType =
TypeExtractor.getKeySelectorTypes(keySelector, input1.getType());
return where(keySelector, keyType);
}
public Where where(KeySelector keySelector, TypeInformation keyType) {
requireNonNull(keySelector);
requireNonNull(keyType);
return new Where<>(input1.clean(keySelector), keyType);
}
@Public
public class Where {
private final KeySelector keySelector1;
private final TypeInformation keyType;
Where(KeySelector keySelector1, TypeInformation keyType) {
this.keySelector1 = keySelector1;
this.keyType = keyType;
}
public EqualTo equalTo(KeySelector keySelector) {
requireNonNull(keySelector);
final TypeInformation otherKey =
TypeExtractor.getKeySelectorTypes(keySelector, input2.getType());
return equalTo(keySelector, otherKey);
}
public EqualTo equalTo(KeySelector keySelector, TypeInformation keyType) {
return new EqualTo(input2.clean(keySelector));
}
@Public
public class EqualTo {
private final KeySelector keySelector2;
EqualTo(KeySelector keySelector2) {
this.keySelector2 = requireNonNull(keySelector2);
}
public <W extends Window> WithWindow<T1, T2, KEY, W> window(
WindowAssigner<? super TaggedUnion<T1, T2>, W> assigner) {
return new WithWindow<>(input1,input2,keySelector1,keySelector2,keyType,assigner,null,null,null);
}
}
}
}
public class JoinedStreams {
public static class WithWindow {
public WithWindow<T1, T2, KEY, W> trigger(
Trigger<? super TaggedUnion<T1, T2>, ? super W> newTrigger) {
return new WithWindow<>(input1, input2, keySelector1, keySelector2, keyType, windowAssigner, newTrigger, evictor, allowedLateness);
}
public WithWindow<T1, T2, KEY, W> evictor(
Evictor<? super TaggedUnion<T1, T2>, ? super W> newEvictor) {
return new WithWindow<>(input1, input2, keySelector1, keySelector2, keyType, windowAssigner, trigger, newEvictor, allowedLateness);
}
public WithWindow<T1, T2, KEY, W> allowedLateness(Time newLateness) {
return new WithWindow<>(input1, input2, keySelector1, keySelector2, keyType, windowAssigner, trigger, evictor, newLateness);
}
public <T> DataStream<T> apply(JoinFunction<T1, T2, T> function) {
// resultType is derived from the function's signature via TypeExtractor (elided in this excerpt).
return apply(function, resultType);
}
public DataStream apply(
FlatJoinFunction function, TypeInformation resultType) {
coGroupedWindowedStream =
input1.coGroup(input2)
.where(keySelector1)
.equalTo(keySelector2)
.window(windowAssigner)
.trigger(trigger)
.evictor(evictor)
.allowedLateness(allowedLateness);
return coGroupedWindowedStream.apply(new FlatJoinCoGroupFunction<>(function), resultType);
}
}
}
2.2.2 Evolution of the Transformation object
UnionTransformation --> PartitionTransformation --> OneInputTransformation(SimpleOperatorFactory(WindowOperator(
InternalIterableWindowFunction(
WindowFunction(CoGroupFunction(JoinFunction))))/EvictingWindowOperator))
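The CoGroupFunction(JoinFunction) layer in the chain above is the wrapper that turns a coGroup into an inner join: for each key/window pane it pairs every left record with every right record. The following plain-Java sketch shows that idea only. JoinOnCoGroupSketch, CoGroup, and wrapJoin are hypothetical names, a BiFunction stands in for JoinFunction, and a List stands in for Flink's Collector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

// Plain-Java sketch of a join implemented on top of coGroup; all names are hypothetical.
final class JoinOnCoGroupSketch {
    // Simplified counterpart of CoGroupFunction: sees both sides of one key/window pane.
    interface CoGroup<L, R, O> {
        void coGroup(Iterable<L> first, Iterable<R> second, List<O> out);
    }

    // Wraps a pairwise join function so that it can run where a CoGroup is expected.
    static <L, R, O> CoGroup<L, R, O> wrapJoin(BiFunction<L, R, O> join) {
        return (first, second, out) -> {
            // Nested loop: emit one result per (left, right) combination in the pane.
            for (L left : first) {
                for (R right : second) {
                    out.add(join.apply(left, right));
                }
            }
        };
    }

    public static void main(String[] args) {
        CoGroup<String, Integer, String> joined =
                JoinOnCoGroupSketch.<String, Integer, String>wrapJoin((l, r) -> l + "-" + r);
        List<String> out = new ArrayList<>();
        joined.coGroup(Arrays.asList("a", "b"), Arrays.asList(1, 2), out);
        System.out.println(out); // prints [a-1, a-2, b-1, b-2]
    }
}
```

Because the inner loop never runs when one side is empty, a pane with records from only one stream produces no output, which is the inner-join behavior of the window join.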
-
Connect
4.1 Usage
stream
.connect(otherStream)
.process()
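4.2 How it works
4.2.1 Evolution of the DataStream object
Calling connect on a DataStream creates a ConnectedStreams. Calling process on it wraps the CoProcessFunction in a CoProcessOperator, or in a LegacyKeyedCoProcessOperator when both inputs are KeyedStreams.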
public ConnectedStreams connect(DataStream dataStream) {
return new ConnectedStreams<>(environment, this, dataStream);
}
public class ConnectedStreams {
public SingleOutputStreamOperator process(
CoProcessFunction coProcessFunction, TypeInformation outputType) {
TwoInputStreamOperator operator;
if ((inputStream1 instanceof KeyedStream) && (inputStream2 instanceof KeyedStream)) {
operator = new LegacyKeyedCoProcessOperator<>(inputStream1.clean(coProcessFunction));
} else {
operator = new CoProcessOperator<>(inputStream1.clean(coProcessFunction));
}
return transform("Co-Process", outputType, operator);
}
}
4.2.2 Evolution of the Transformation object
TwoInputTransformation(SimpleOperatorFactory(CoProcessOperator(CoProcessFunction)))
4.2.3 CoProcessOperator
public class CoProcessOperator<IN1, IN2, OUT>
        extends AbstractUdfStreamOperator<OUT, CoProcessFunction<IN1, IN2, OUT>>
        implements TwoInputStreamOperator<IN1, IN2, OUT> {
    @Override
    public void processElement1(StreamRecord<IN1> element) throws Exception {
        collector.setTimestamp(element);
        context.element = element;
        userFunction.processElement1(element.getValue(), context, collector);
        context.element = null;
    }
    @Override
    public void processElement2(StreamRecord<IN2> element) throws Exception {
        collector.setTimestamp(element);
        context.element = element;
        userFunction.processElement2(element.getValue(), context, collector);
        context.element = null;
    }
}
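Since the operator only forwards each record to processElement1 or processElement2, the join logic itself lives in the user function. The sketch below is one possible shape of that logic in plain Java, not Flink code: ConnectJoinSketch is a hypothetical name, HashMaps stand in for keyed state, and the join is assumed to be one-to-one (each key appears at most once per side) with no timers or state cleanup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a one-to-one join written in the style of a CoProcessFunction.
// All names are hypothetical; the maps stand in for Flink keyed state.
final class ConnectJoinSketch {
    private final Map<String, String> pendingLeft = new HashMap<>();
    private final Map<String, String> pendingRight = new HashMap<>();
    final List<String> output = new ArrayList<>();

    // Called for every record of the first (left) stream.
    void processElement1(String key, String value) {
        String other = pendingRight.remove(key);
        if (other != null) {
            output.add(key + ":" + value + "|" + other);
        } else {
            // No partner yet: remember the record until the right side arrives.
            pendingLeft.put(key, value);
        }
    }

    // Called for every record of the second (right) stream.
    void processElement2(String key, String value) {
        String other = pendingLeft.remove(key);
        if (other != null) {
            output.add(key + ":" + other + "|" + value);
        } else {
            pendingRight.put(key, value);
        }
    }

    public static void main(String[] args) {
        ConnectJoinSketch join = new ConnectJoinSketch();
        join.processElement1("o1", "order1");
        join.processElement2("o2", "pay2");
        join.processElement2("o1", "pay1");
        join.processElement1("o2", "order2");
        System.out.println(join.output); // prints [o1:order1|pay1, o2:order2|pay2]
    }
}
```

In a real job the buffered records would have to be expired, for example with timers, because records whose partner never arrives would otherwise stay in state forever.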
-
IntervalJoin
5.1 Usage
stream
.keyBy()
.intervalJoin(otherStream.keyBy())
.between()
.process()
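5.2 How it works
5.2.1 Evolution of the KeyedStream object
Calling intervalJoin on a KeyedStream creates an IntervalJoin. Calling between on it creates an IntervalJoined, and process builds the IntervalJoinOperator.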
public class KeyedStream extends DataStream {
public IntervalJoin intervalJoin(KeyedStream otherStream) {
return new IntervalJoin<>(this, otherStream);
}
}
public static class IntervalJoin {
public IntervalJoined between(Time lowerBound, Time upperBound) {
if (timeBehaviour != TimeBehaviour.EventTime) {
throw new UnsupportedTimeCharacteristicException(
"Time-bounded stream joins are only supported in event time");
}
checkNotNull(lowerBound, "A lower bound needs to be provided for a time-bounded join");
checkNotNull(upperBound, "An upper bound needs to be provided for a time-bounded join");
return new IntervalJoined<>(
streamOne,
streamTwo,
lowerBound.toMilliseconds(),
upperBound.toMilliseconds(),
true,
true);
}
public static class IntervalJoined {
public SingleOutputStreamOperator process(
ProcessJoinFunction processJoinFunction,
TypeInformation outputType) {
Preconditions.checkNotNull(processJoinFunction);
Preconditions.checkNotNull(outputType);
final ProcessJoinFunction cleanedUdf =
left.getExecutionEnvironment().clean(processJoinFunction);
final IntervalJoinOperator operator =
new IntervalJoinOperator<>(
lowerBound,
upperBound,
lowerBoundInclusive,
upperBoundInclusive,
left.getType().createSerializer(left.getExecutionConfig()),
right.getType().createSerializer(right.getExecutionConfig()),
cleanedUdf);
return left.connect(right)
.keyBy(keySelector1, keySelector2)
.transform("Interval Join", outputType, operator);
}
}
}
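Conclusion: intervalJoin is built on connect. The IntervalJoinOperator uses relativeLowerBound and relativeUpperBound to keep only the pairs whose timestamps fall inside the interval.
5.2.2 Evolution of the Transformation object
PartitionTransformation --> TwoInputTransformation(SimpleOperatorFactory(IntervalJoinOperator(ProcessJoinFunction)))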
5.2.3 IntervalJoinOperator
public class IntervalJoinOperator<K, T1, T2, OUT>
        extends AbstractUdfStreamOperator<OUT, ProcessJoinFunction<T1, T2, OUT>>
        implements TwoInputStreamOperator<T1, T2, OUT>, Triggerable<K, String> {
    @Override
    public void processElement1(StreamRecord<T1> record) throws Exception {
        processElement(record, leftBuffer, rightBuffer, lowerBound, upperBound, true);
    }
    @Override
    public void processElement2(StreamRecord<T2> record) throws Exception {
        // For a right-side record the bounds are mirrored: [-upperBound, -lowerBound].
        processElement(record, rightBuffer, leftBuffer, -upperBound, -lowerBound, false);
    }
    private <THIS, OTHER> void processElement(
            final StreamRecord<THIS> record,
            final MapState<Long, List<BufferEntry<THIS>>> ourBuffer,
            final MapState<Long, List<BufferEntry<OTHER>>> otherBuffer,
            final long relativeLowerBound,
            final long relativeUpperBound,
            final boolean isLeft)
            throws Exception {
        final THIS ourValue = record.getValue();
        final long ourTimestamp = record.getTimestamp();
        // Drop records that are already behind the current watermark.
        if (isLate(ourTimestamp)) {
            return;
        }
        addToBuffer(ourBuffer, ourValue, ourTimestamp);
        // Scan the other side's buffer and emit every record inside the relative interval.
        for (Map.Entry<Long, List<BufferEntry<OTHER>>> bucket : otherBuffer.entries()) {
            final long timestamp = bucket.getKey();
            if (timestamp < ourTimestamp + relativeLowerBound
                    || timestamp > ourTimestamp + relativeUpperBound) {
                continue;
            }
            for (BufferEntry<OTHER> entry : bucket.getValue()) {
                if (isLeft) {
                    collect((T1) ourValue, (T2) entry.element, ourTimestamp, timestamp);
                } else {
                    collect((T1) entry.element, (T2) ourValue, timestamp, ourTimestamp);
                }
            }
        }
        // Register a timer that removes this record from state once it can no longer match.
        long cleanupTime =
                (relativeUpperBound > 0L) ? ourTimestamp + relativeUpperBound : ourTimestamp;
        if (isLeft) {
            internalTimerService.registerEventTimeTimer(CLEANUP_NAMESPACE_LEFT, cleanupTime);
        } else {
            internalTimerService.registerEventTimeTimer(CLEANUP_NAMESPACE_RIGHT, cleanupTime);
        }
    }
}
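The bound arithmetic is easier to follow with concrete numbers. The plain-Java sketch below mirrors the buffer scan and the cleanup-time formula from processElement for a single key. It is an illustration under assumptions, not Flink code: IntervalJoinSketch and its members are hypothetical names, TreeMaps replace MapState, results are rendered as strings, and the late-record check and the timers are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the interval filter for one key; all names are hypothetical.
final class IntervalJoinSketch {
    private final long lowerBound;
    private final long upperBound;
    private final Map<Long, List<String>> leftBuffer = new TreeMap<>();
    private final Map<Long, List<String>> rightBuffer = new TreeMap<>();
    final List<String> output = new ArrayList<>();

    IntervalJoinSketch(long lowerBound, long upperBound) {
        this.lowerBound = lowerBound;
        this.upperBound = upperBound;
    }

    void processLeft(String value, long ts) {
        process(value, ts, leftBuffer, rightBuffer, lowerBound, upperBound, true);
    }

    void processRight(String value, long ts) {
        // Seen from the right side, the interval is mirrored.
        process(value, ts, rightBuffer, leftBuffer, -upperBound, -lowerBound, false);
    }

    private void process(String value, long ts,
            Map<Long, List<String>> ours, Map<Long, List<String>> others,
            long relativeLowerBound, long relativeUpperBound, boolean isLeft) {
        ours.computeIfAbsent(ts, k -> new ArrayList<>()).add(value);
        for (Map.Entry<Long, List<String>> bucket : others.entrySet()) {
            long otherTs = bucket.getKey();
            // Same filter as in the operator: skip buckets outside the relative interval.
            if (otherTs < ts + relativeLowerBound || otherTs > ts + relativeUpperBound) {
                continue;
            }
            for (String other : bucket.getValue()) {
                // Always render the left record first, whichever side triggered the match.
                output.add(isLeft
                        ? value + "@" + ts + "+" + other + "@" + otherTs
                        : other + "@" + otherTs + "+" + value + "@" + ts);
            }
        }
    }

    // Same formula as cleanupTime in the operator above.
    static long cleanupTime(long ts, long relativeUpperBound) {
        return relativeUpperBound > 0L ? ts + relativeUpperBound : ts;
    }

    public static void main(String[] args) {
        IntervalJoinSketch join = new IntervalJoinSketch(-2, 5);
        join.processLeft("L", 10);   // would match right records in [8, 15]
        join.processRight("R1", 7);  // would match left records in [2, 9]
        join.processRight("R2", 15); // would match left records in [10, 17]
        join.processLeft("L2", 9);   // would match right records in [7, 14]
        System.out.println(join.output); // prints [L@10+R2@15, L2@9+R1@7]
    }
}
```

With bounds [-2, 5], a left record at timestamp 10 gets cleanupTime(10, 5) = 15, the latest right timestamp that could still match it. A right record at 15 uses the mirrored upper bound 2 and gets cleanupTime(15, 2) = 17.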