Flink Source Code Walkthrough (4): Two-Stream Join in Flink DataStream

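The main ways to join two streams in Flink DataStream are window join, connect, and interval join. This article covers how each is used and how it is implemented, from the source code's point of view. Window join is built on cogroup, and cogroup is built on union, so those two are covered first.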

1. Union

1.1 Usage

union merges two streams into one. The caller must make sure that the left and right streams have the same element type.

 stream
    .union(otherStream)

1.2 How it works

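union creates a UnionTransformation whose inputs are the Transformations of the left and right DataStreams. If the DataStreams have different element types, it throws an IllegalArgumentException.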


public final DataStream<T> union(DataStream<T>... streams) {
    List<Transformation<T>> unionedTransforms = new ArrayList<>();
    unionedTransforms.add(this.transformation);

    for (DataStream<T> newStream : streams) {
        if (!getType().equals(newStream.getType())) {
            throw new IllegalArgumentException(
                    "Cannot union streams of different types: "
                            + getType()
                            + " and "
                            + newStream.getType());
        }

        unionedTransforms.add(newStream.getTransformation());
    }
    return new DataStream<>(this.environment, new UnionTransformation<>(unionedTransforms));
}
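
To make the check-then-collect shape concrete, here is a minimal sketch in plain Java with no Flink dependency. MiniStream, its fields, and the of factory are hypothetical stand-ins for DataStream and its Transformation; only the control flow mirrors the method above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for DataStream: an element type plus the names of its upstream inputs.
class MiniStream {
    final Class<?> type;
    final List<String> inputs;

    MiniStream(Class<?> type, List<String> inputs) {
        this.type = type;
        this.inputs = inputs;
    }

    static MiniStream of(Class<?> type, String name) {
        List<String> inputs = new ArrayList<>();
        inputs.add(name);
        return new MiniStream(type, inputs);
    }

    // Same shape as DataStream#union: reject mismatched types, then collect every input.
    MiniStream union(MiniStream... others) {
        List<String> unioned = new ArrayList<>(inputs);
        for (MiniStream other : others) {
            if (!type.equals(other.type)) {
                throw new IllegalArgumentException(
                        "Cannot union streams of different types: " + type + " and " + other.type);
            }
            unioned.addAll(other.inputs);
        }
        return new MiniStream(type, unioned);
    }
}
```

In this sketch, a union of two String streams yields a stream with both inputs, while a union of a String stream with an Integer stream fails immediately, before anything is executed.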

2. Cogroup

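The user supplies the join key, the window, and a CoGroupFunction. The function processes the elements of the left and right streams that share the same key and fall into the same window. Its inputs are the left stream's elements and the right stream's elements in the current window.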

2.1 Usage

stream.coGroup(otherStream)
    .where()
    .equalTo()
    .window()
    .apply()

public interface CoGroupFunction<IN1, IN2, O> extends Function, Serializable {
    void coGroup(Iterable<IN1> first, Iterable<IN2> second, Collector<O> out) throws Exception;
}

2.2 How it works

2.2.1 DataStream object evolution

Calling coGroup on a DataStream produces a CoGroupedStreams.


public <T2> CoGroupedStreams<T, T2> coGroup(DataStream<T2> otherStream) {
   return new CoGroupedStreams<>(this, otherStream);
}

Calling where on CoGroupedStreams produces a CoGroupedStreams.Where (Where below). Calling equalTo on Where produces a CoGroupedStreams.Where.EqualTo (EqualTo below). Calling window on EqualTo produces a CoGroupedStreams.WithWindow (WithWindow below). On WithWindow, the optional trigger, evictor, and allowedLateness calls each return a new WithWindow, and the final apply call produces a DataStream.

Inside apply, each input is mapped to a TaggedUnion, the two tagged streams are unioned, and the result is keyed and windowed. Conclusion: cogroup is built on union and window. For the window implementation, see "Flink Source Code Walkthrough (3): Timer & WaterMark & Window".

public class CoGroupedStreams<T1, T2> {
    public <KEY> Where<KEY> where(KeySelector<T1, KEY> keySelector) {
        Preconditions.checkNotNull(keySelector);
        final TypeInformation<KEY> keyType =
                TypeExtractor.getKeySelectorTypes(keySelector, input1.getType());
        return where(keySelector, keyType);
    }

    public <KEY> Where<KEY> where(KeySelector<T1, KEY> keySelector, TypeInformation<KEY> keyType) {
        Preconditions.checkNotNull(keySelector);
        Preconditions.checkNotNull(keyType);
        return new Where<>(input1.clean(keySelector), keyType);
    }

    public class Where<KEY> {

        private final KeySelector<T1, KEY> keySelector1;
        private final TypeInformation<KEY> keyType;

        Where(KeySelector<T1, KEY> keySelector1, TypeInformation<KEY> keyType) {
            this.keySelector1 = keySelector1;
            this.keyType = keyType;
        }

        public EqualTo equalTo(KeySelector<T2, KEY> keySelector) {
            Preconditions.checkNotNull(keySelector);
            final TypeInformation<KEY> otherKey =
                    TypeExtractor.getKeySelectorTypes(keySelector, input2.getType());
            return equalTo(keySelector, otherKey);
        }

        public EqualTo equalTo(KeySelector<T2, KEY> keySelector, TypeInformation<KEY> keyType) {
            return new EqualTo(input2.clean(keySelector));
        }

        @Public
        public class EqualTo {

            private final KeySelector<T2, KEY> keySelector2;

            EqualTo(KeySelector<T2, KEY> keySelector2) {
                this.keySelector2 = requireNonNull(keySelector2);
            }

            @PublicEvolving
            public <W extends Window> WithWindow<T1, T2, KEY, W> window(
                    WindowAssigner<? super TaggedUnion<T1, T2>, W> assigner) {
                // trigger, evictor and allowedLateness start out unset (null).
                return new WithWindow<>(
                        input1,
                        input2,
                        keySelector1,
                        keySelector2,
                        keyType,
                        assigner,
                        null,
                        null,
                        null);
            }
        }
    }
}

public class CoGroupedStreams<T1, T2> {

    public static class WithWindow<T1, T2, KEY, W extends Window> {

        @PublicEvolving
        public WithWindow<T1, T2, KEY, W> trigger(
                Trigger<? super TaggedUnion<T1, T2>, ? super W> newTrigger) {
            return new WithWindow<>(
                    input1,
                    input2,
                    keySelector1,
                    keySelector2,
                    keyType,
                    windowAssigner,
                    newTrigger,
                    evictor,
                    allowedLateness);
        }

        @PublicEvolving
        public WithWindow<T1, T2, KEY, W> evictor(
                Evictor<? super TaggedUnion<T1, T2>, ? super W> newEvictor) {
            return new WithWindow<>(
                    input1,
                    input2,
                    keySelector1,
                    keySelector2,
                    keyType,
                    windowAssigner,
                    trigger,
                    newEvictor,
                    allowedLateness);
        }

        @PublicEvolving
        public WithWindow<T1, T2, KEY, W> allowedLateness(Time newLateness) {
            return new WithWindow<>(
                    input1,
                    input2,
                    keySelector1,
                    keySelector2,
                    keyType,
                    windowAssigner,
                    trigger,
                    evictor,
                    newLateness);
        }

        public <T> DataStream<T> apply(CoGroupFunction<T1, T2, T> function) {

            TypeInformation<T> resultType =
                    TypeExtractor.getCoGroupReturnTypes(
                            function, input1.getType(), input2.getType(), "CoGroup", false);

            return apply(function, resultType);
        }

        public <T> DataStream<T> apply(
                CoGroupFunction<T1, T2, T> function, TypeInformation<T> resultType) {

            function = input1.getExecutionEnvironment().clean(function);

            UnionTypeInfo<T1, T2> unionType =
                    new UnionTypeInfo<>(input1.getType(), input2.getType());
            UnionKeySelector<T1, T2, KEY> unionKeySelector =
                    new UnionKeySelector<>(keySelector1, keySelector2);

            // Tag each input so both streams share the element type TaggedUnion<T1, T2>.
            DataStream<TaggedUnion<T1, T2>> taggedInput1 =
                    input1.map(new Input1Tagger<T1, T2>())
                            .setParallelism(input1.getParallelism())
                            .returns(unionType);
            DataStream<TaggedUnion<T1, T2>> taggedInput2 =
                    input2.map(new Input2Tagger<T1, T2>())
                            .setParallelism(input2.getParallelism())
                            .returns(unionType);

            // Union the tagged streams, then key and window the result.
            DataStream<TaggedUnion<T1, T2>> unionStream = taggedInput1.union(taggedInput2);

            windowedStream =
                    new KeyedStream<TaggedUnion<T1, T2>, KEY>(
                                    unionStream, unionKeySelector, keyType)
                            .window(windowAssigner);

            if (trigger != null) {
                windowedStream.trigger(trigger);
            }
            if (evictor != null) {
                windowedStream.evictor(evictor);
            }
            if (allowedLateness != null) {
                windowedStream.allowedLateness(allowedLateness);
            }

            return windowedStream.apply(
                    new CoGroupWindowFunction<T1, T2, T, KEY, W>(function), resultType);
        }
    }
}

public class CoGroupedStreams<T1, T2> {
    private static class CoGroupWindowFunction<T1, T2, T, KEY, W extends Window>
            extends WrappingFunction<CoGroupFunction<T1, T2, T>>
            implements WindowFunction<TaggedUnion<T1, T2>, T, KEY, W> {

        private static final long serialVersionUID = 1L;

        public CoGroupWindowFunction(CoGroupFunction<T1, T2, T> userFunction) {
            super(userFunction);
        }

        @Override
        public void apply(KEY key, W window, Iterable<TaggedUnion<T1, T2>> values, Collector<T> out)
                throws Exception {

            // Split the tagged values back into their original left and right inputs.
            List<T1> oneValues = new ArrayList<>();
            List<T2> twoValues = new ArrayList<>();

            for (TaggedUnion<T1, T2> val : values) {
                if (val.isOne()) {
                    oneValues.add(val.getOne());
                } else {
                    twoValues.add(val.getTwo());
                }
            }
            wrappedFunction.coGroup(oneValues, twoValues, out);
        }
    }
}
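
The tag, union, key, and split steps above can be sketched end to end in plain Java. This is an illustration under assumptions, not Flink code: Tagged, CoGroup, and coGroup are hypothetical names, the inputs are in-memory lists, and a single window is assumed so that grouping by key alone stands in for keyBy plus window.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class CoGroupSketch {
    // Hypothetical counterpart of TaggedUnion: exactly one of the two fields is set.
    static final class Tagged<A, B> {
        final A one;
        final B two;
        final boolean isOne;

        private Tagged(A one, B two, boolean isOne) {
            this.one = one;
            this.two = two;
            this.isOne = isOne;
        }

        static <A, B> Tagged<A, B> one(A a) { return new Tagged<>(a, null, true); }
        static <A, B> Tagged<A, B> two(B b) { return new Tagged<>(null, b, false); }
    }

    // Hypothetical callback shaped like CoGroupFunction#coGroup, returning a value instead of using a Collector.
    interface CoGroup<A, B, R> {
        R apply(List<A> first, List<B> second);
    }

    static <A, B, K, R> Map<K, R> coGroup(
            List<A> left, List<B> right,
            Function<A, K> leftKey, Function<B, K> rightKey,
            CoGroup<A, B, R> fn) {
        // Step 1: tag both inputs and union them into one list.
        List<Tagged<A, B>> unioned = new ArrayList<>();
        for (A a : left) unioned.add(Tagged.<A, B>one(a));
        for (B b : right) unioned.add(Tagged.<A, B>two(b));

        // Step 2: group the unioned elements by key (one window assumed).
        Map<K, List<Tagged<A, B>>> byKey = new LinkedHashMap<>();
        for (Tagged<A, B> t : unioned) {
            K key = t.isOne ? leftKey.apply(t.one) : rightKey.apply(t.two);
            byKey.computeIfAbsent(key, k -> new ArrayList<>()).add(t);
        }

        // Step 3: per key, split the tagged values into two lists and call the user function.
        Map<K, R> result = new LinkedHashMap<>();
        for (Map.Entry<K, List<Tagged<A, B>>> e : byKey.entrySet()) {
            List<A> ones = new ArrayList<>();
            List<B> twos = new ArrayList<>();
            for (Tagged<A, B> t : e.getValue()) {
                if (t.isOne) ones.add(t.one); else twos.add(t.two);
            }
            result.put(e.getKey(), fn.apply(ones, twos));
        }
        return result;
    }
}
```

Note that the user function is also called for keys that appear on only one side, with the other list empty. That is what lets cogroup express outer-join-like logic.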

2.2.2 Transformation object evolution

UnionTransformation --> PartitionTransformation --> OneInputTransformation(SimpleOperatorFactory(WindowOperator or EvictingWindowOperator(InternalIterableWindowFunction(WindowFunction(CoGroupFunction)))))

3. Join

3.1 Usage

stream.join(otherStream)
    .where()
    .equalTo()
    .window()
    .apply()

3.2 How it works

3.2.1 DataStream object evolution

Calling join on a DataStream produces a JoinedStreams.

public <T2> JoinedStreams<T, T2> join(DataStream<T2> otherStream) {
    return new JoinedStreams<>(this, otherStream);
}

Calling where on JoinedStreams produces a JoinedStreams.Where (Where below). Calling equalTo on Where produces a JoinedStreams.Where.EqualTo (EqualTo below). Calling window on EqualTo produces a JoinedStreams.WithWindow (WithWindow below). On WithWindow, the trigger, evictor, and allowedLateness calls each return a new WithWindow, and the apply call produces a DataStream.

Conclusion: join is built on cogroup. For the cogroup implementation, see the "How it works" part of the Cogroup section above.

public class JoinedStreams<T1, T2> {

    public <KEY> Where<KEY> where(KeySelector<T1, KEY> keySelector) {
        requireNonNull(keySelector);
        final TypeInformation<KEY> keyType =
                TypeExtractor.getKeySelectorTypes(keySelector, input1.getType());
        return where(keySelector, keyType);
    }

    public <KEY> Where<KEY> where(KeySelector<T1, KEY> keySelector, TypeInformation<KEY> keyType) {
        requireNonNull(keySelector);
        requireNonNull(keyType);
        return new Where<>(input1.clean(keySelector), keyType);
    }

    @Public
    public class Where<KEY> {

        private final KeySelector<T1, KEY> keySelector1;
        private final TypeInformation<KEY> keyType;

        Where(KeySelector<T1, KEY> keySelector1, TypeInformation<KEY> keyType) {
            this.keySelector1 = keySelector1;
            this.keyType = keyType;
        }

        public EqualTo equalTo(KeySelector<T2, KEY> keySelector) {
            requireNonNull(keySelector);
            final TypeInformation<KEY> otherKey =
                    TypeExtractor.getKeySelectorTypes(keySelector, input2.getType());
            return equalTo(keySelector, otherKey);
        }

        public EqualTo equalTo(KeySelector<T2, KEY> keySelector, TypeInformation<KEY> keyType) {
            return new EqualTo(input2.clean(keySelector));
        }

        @Public
        public class EqualTo {

            private final KeySelector<T2, KEY> keySelector2;

            EqualTo(KeySelector<T2, KEY> keySelector2) {
                this.keySelector2 = requireNonNull(keySelector2);
            }

            public <W extends Window> WithWindow<T1, T2, KEY, W> window(
                    WindowAssigner<? super TaggedUnion<T1, T2>, W> assigner) {
                return new WithWindow<>(
                        input1, input2, keySelector1, keySelector2, keyType, assigner, null, null, null);
            }
        }
    }
}

public class JoinedStreams<T1, T2> {

    public static class WithWindow<T1, T2, KEY, W extends Window> {

        public WithWindow<T1, T2, KEY, W> trigger(
                Trigger<? super TaggedUnion<T1, T2>, ? super W> newTrigger) {
            return new WithWindow<>(
                    input1, input2, keySelector1, keySelector2, keyType,
                    windowAssigner, newTrigger, evictor, allowedLateness);
        }

        public WithWindow<T1, T2, KEY, W> evictor(
                Evictor<? super TaggedUnion<T1, T2>, ? super W> newEvictor) {
            return new WithWindow<>(
                    input1, input2, keySelector1, keySelector2, keyType,
                    windowAssigner, trigger, newEvictor, allowedLateness);
        }

        public WithWindow<T1, T2, KEY, W> allowedLateness(Time newLateness) {
            return new WithWindow<>(
                    input1, input2, keySelector1, keySelector2, keyType,
                    windowAssigner, trigger, evictor, newLateness);
        }

        public <T> DataStream<T> apply(JoinFunction<T1, T2, T> function) {
            // resultType is derived from the function via TypeExtractor (elided in this excerpt).
            return apply(function, resultType);
        }

        public <T> DataStream<T> apply(
                FlatJoinFunction<T1, T2, T> function, TypeInformation<T> resultType) {
            // Delegate to coGroup with the same keys and window settings.
            coGroupedWindowedStream =
                    input1.coGroup(input2)
                            .where(keySelector1)
                            .equalTo(keySelector2)
                            .window(windowAssigner)
                            .trigger(trigger)
                            .evictor(evictor)
                            .allowedLateness(allowedLateness);

            return coGroupedWindowedStream.apply(new FlatJoinCoGroupFunction<>(function), resultType);
        }
    }
}
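
The excerpt above passes a FlatJoinCoGroupFunction to cogroup but does not show what it does. Conceptually it turns one cogroup call into a join by pairing every left element with every right element of the same key and window. A minimal plain-Java sketch of that pairing follows; joinGroup is a hypothetical helper, and a BiFunction stands in for the user's join function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

class JoinOverCoGroupSketch {
    // Emits fn(a, b) for every pair: the Cartesian product of the two groups.
    static <A, B, R> List<R> joinGroup(List<A> first, List<B> second, BiFunction<A, B, R> fn) {
        List<R> out = new ArrayList<>();
        for (A a : first) {
            for (B b : second) {
                out.add(fn.apply(a, b));
            }
        }
        return out;
    }
}
```

Because of the nested loop, a group with an empty side emits nothing, which gives inner-join semantics. Logic that needs unmatched elements has to use cogroup directly.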

3.2.2 Transformation object evolution

UnionTransformation --> PartitionTransformation --> OneInputTransformation(SimpleOperatorFactory(WindowOperator or EvictingWindowOperator(InternalIterableWindowFunction(WindowFunction(CoGroupFunction(JoinFunction))))))

4. Connect

4.1 Usage

 stream
    .connect(otherStream)
    .process()

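4.2 How it works

4.2.1 DataStream object evolution

Calling connect on a DataStream produces a ConnectedStreams. Calling process on it wraps the CoProcessFunction in a CoProcessOperator, or in a LegacyKeyedCoProcessOperator when both inputs are KeyedStreams.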


public <R> ConnectedStreams<T, R> connect(DataStream<R> dataStream) {
    return new ConnectedStreams<>(environment, this, dataStream);
}


public class ConnectedStreams<IN1, IN2> {

    public <R> SingleOutputStreamOperator<R> process(
            CoProcessFunction<IN1, IN2, R> coProcessFunction, TypeInformation<R> outputType) {

        TwoInputStreamOperator<IN1, IN2, R> operator;

        if ((inputStream1 instanceof KeyedStream) && (inputStream2 instanceof KeyedStream)) {
            operator = new LegacyKeyedCoProcessOperator<>(inputStream1.clean(coProcessFunction));
        } else {
            operator = new CoProcessOperator<>(inputStream1.clean(coProcessFunction));
        }

        return transform("Co-Process", outputType, operator);
    }
}

4.2.2 Transformation object evolution

TwoInputTransformation(SimpleOperatorFactory(CoProcessOperator(CoProcessFunction)))

4.2.3 CoProcessOperator


public class CoProcessOperator<IN1, IN2, OUT>
        extends AbstractUdfStreamOperator<OUT, CoProcessFunction<IN1, IN2, OUT>>
        implements TwoInputStreamOperator<IN1, IN2, OUT> {

    @Override
    public void processElement1(StreamRecord<IN1> element) throws Exception {
        collector.setTimestamp(element);
        context.element = element;
        userFunction.processElement1(element.getValue(), context, collector);
        context.element = null;
    }

    @Override
    public void processElement2(StreamRecord<IN2> element) throws Exception {
        collector.setTimestamp(element);
        context.element = element;
        userFunction.processElement2(element.getValue(), context, collector);
        context.element = null;
    }
}
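
The operator only forwards each element to processElement1 or processElement2, so the join logic itself lives in the user's function. The article does not show such a function; one common pattern is to buffer each side per key and probe the opposite buffer on arrival. The sketch below illustrates that pattern in plain Java under assumptions: ConnectJoinSketch and its fields are hypothetical, HashMaps stand in for Flink keyed state, and nothing is ever evicted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ConnectJoinSketch {
    // Per-key buffers; in a real job these would be keyed state with a cleanup policy.
    private final Map<String, List<String>> leftBuffer = new HashMap<>();
    private final Map<String, List<Integer>> rightBuffer = new HashMap<>();
    final List<String> output = new ArrayList<>();

    // Called for every element of the first (left) stream.
    void processElement1(String key, String value) {
        leftBuffer.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        for (Integer r : rightBuffer.getOrDefault(key, Collections.emptyList())) {
            output.add(key + ":" + value + "," + r);
        }
    }

    // Called for every element of the second (right) stream.
    void processElement2(String key, Integer value) {
        rightBuffer.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        for (String l : leftBuffer.getOrDefault(key, Collections.emptyList())) {
            output.add(key + ":" + l + "," + value);
        }
    }
}
```

Each matching pair is emitted exactly once, by whichever side arrives later. Without a window or timer the buffers grow without bound, which is the problem the interval join in the next section addresses with cleanup timers.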

5. IntervalJoin

5.1 Usage

 stream
    .keyBy()
    .intervalJoin(otherStream.keyBy())
    .between(

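5.2 How it works

5.2.1 KeyedStream object evolution

Calling intervalJoin on a KeyedStream produces an IntervalJoin. Calling between on it produces an IntervalJoined, and calling process on that produces the resulting stream.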


public class KeyedStream<T, KEY> extends DataStream<T> {
    public <T1> IntervalJoin<T, T1, KEY> intervalJoin(KeyedStream<T1, KEY> otherStream) {
        return new IntervalJoin<>(this, otherStream);
    }
}

public static class IntervalJoin<T1, T2, KEY> {
    public IntervalJoined<T1, T2, KEY> between(Time lowerBound, Time upperBound) {
        if (timeBehaviour != TimeBehaviour.EventTime) {
            throw new UnsupportedTimeCharacteristicException(
                    "Time-bounded stream joins are only supported in event time");
        }

        checkNotNull(lowerBound, "A lower bound needs to be provided for a time-bounded join");
        checkNotNull(upperBound, "An upper bound needs to be provided for a time-bounded join");

        return new IntervalJoined<>(
                streamOne,
                streamTwo,
                lowerBound.toMilliseconds(),
                upperBound.toMilliseconds(),
                true,
                true);

    }

    public static class IntervalJoined<IN1, IN2, KEY> {

        public <OUT> SingleOutputStreamOperator<OUT> process(
                ProcessJoinFunction<IN1, IN2, OUT> processJoinFunction,
                TypeInformation<OUT> outputType) {
            Preconditions.checkNotNull(processJoinFunction);
            Preconditions.checkNotNull(outputType);

            final ProcessJoinFunction<IN1, IN2, OUT> cleanedUdf =
                    left.getExecutionEnvironment().clean(processJoinFunction);

            final IntervalJoinOperator<KEY, IN1, IN2, OUT> operator =
                    new IntervalJoinOperator<>(
                            lowerBound,
                            upperBound,
                            lowerBoundInclusive,
                            upperBoundInclusive,
                            left.getType().createSerializer(left.getExecutionConfig()),
                            right.getType().createSerializer(right.getExecutionConfig()),
                            cleanedUdf);

            return left.connect(right)
                    .keyBy(keySelector1, keySelector2)
                    .transform("Interval Join", outputType, operator);
        }
    }        
}

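Conclusion: interval join is built on connect. The operator uses relativeLowerBound and relativeUpperBound to filter candidate pairs: a left element with timestamp t is joined with the right elements whose timestamps lie in [t + lowerBound, t + upperBound].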

5.2.2 Transformation object evolution

PartitionTransformation --> TwoInputTransformation(SimpleOperatorFactory(IntervalJoinOperator(ProcessJoinFunction)))

5.2.3 IntervalJoinOperator


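public class IntervalJoinOperator<K, T1, T2, OUT>
        extends AbstractUdfStreamOperator<OUT, ProcessJoinFunction<T1, T2, OUT>>
        implements TwoInputStreamOperator<T1, T2, OUT>, Triggerable<K, String> {

    public void processElement1(StreamRecord<T1> record) throws Exception {
        processElement(record, leftBuffer, rightBuffer, lowerBound, upperBound, true);
    }

    public void processElement2(StreamRecord<T2> record) throws Exception {
        // The bounds are mirrored when the element comes from the right side.
        processElement(record, rightBuffer, leftBuffer, -upperBound, -lowerBound, false);
    }

    private <THIS, OTHER> void processElement(
            final StreamRecord<THIS> record,
            final MapState<Long, List<BufferEntry<THIS>>> ourBuffer,
            final MapState<Long, List<BufferEntry<OTHER>>> otherBuffer,
            final long relativeLowerBound,
            final long relativeUpperBound,
            final boolean isLeft)
            throws Exception {

        final THIS ourValue = record.getValue();
        final long ourTimestamp = record.getTimestamp();

        // Late elements are dropped without being buffered or joined.
        if (isLate(ourTimestamp)) {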
            return;
        }

        addToBuffer(ourBuffer, ourValue, ourTimestamp);

        for (Map.Entry<Long, List<BufferEntry<OTHER>>> bucket : otherBuffer.entries()) {
            final long timestamp = bucket.getKey();

            // Skip buckets whose timestamp falls outside the join interval.
            if (timestamp < ourTimestamp + relativeLowerBound
                    || timestamp > ourTimestamp + relativeUpperBound) {
                continue;
            }

            for (BufferEntry<OTHER> entry : bucket.getValue()) {
                if (isLeft) {
                    collect((T1) ourValue, (T2) entry.element, ourTimestamp, timestamp);
                } else {
                    collect((T1) entry.element, (T2) ourValue, timestamp, ourTimestamp);
                }
            }
        }

        long cleanupTime =
                (relativeUpperBound > 0L) ? ourTimestamp + relativeUpperBound : ourTimestamp;
        if (isLeft) {
            internalTimerService.registerEventTimeTimer(CLEANUP_NAMESPACE_LEFT, cleanupTime);
        } else {
            internalTimerService.registerEventTimeTimer(CLEANUP_NAMESPACE_RIGHT, cleanupTime);
        }
    }              
}
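
The mirrored bounds are easiest to check with concrete timestamps. The following plain-Java sketch replays the buffering and filtering logic for a single key. It is an illustration under assumptions: IntervalJoinSketch and its members are hypothetical, TreeMaps stand in for MapState, elements are plain strings, and the lateness check and timer-driven cleanup are left out. Only the cleanupTime helper mirrors the expression from the operator above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class IntervalJoinSketch {
    private final long lowerBound;
    private final long upperBound;
    // Timestamp -> buffered elements, one buffer per side.
    private final Map<Long, List<String>> leftBuffer = new TreeMap<>();
    private final Map<Long, List<String>> rightBuffer = new TreeMap<>();
    final List<String> output = new ArrayList<>();

    IntervalJoinSketch(long lowerBound, long upperBound) {
        this.lowerBound = lowerBound;
        this.upperBound = upperBound;
    }

    void processLeft(String value, long ts) {
        process(value, ts, leftBuffer, rightBuffer, lowerBound, upperBound, true);
    }

    void processRight(String value, long ts) {
        // Seen from the right side, the interval is mirrored.
        process(value, ts, rightBuffer, leftBuffer, -upperBound, -lowerBound, false);
    }

    private void process(String value, long ts,
            Map<Long, List<String>> ours, Map<Long, List<String>> others,
            long relLower, long relUpper, boolean isLeft) {
        ours.computeIfAbsent(ts, k -> new ArrayList<>()).add(value);
        for (Map.Entry<Long, List<String>> bucket : others.entrySet()) {
            long otherTs = bucket.getKey();
            if (otherTs < ts + relLower || otherTs > ts + relUpper) {
                continue; // outside the join interval
            }
            for (String other : bucket.getValue()) {
                // Always emit as left+right, regardless of which side arrived last.
                output.add(isLeft
                        ? value + "@" + ts + "+" + other + "@" + otherTs
                        : other + "@" + otherTs + "+" + value + "@" + ts);
            }
        }
    }

    // Same expression as cleanupTime in the operator: the time after which an element can no longer match.
    static long cleanupTime(long ts, long relativeUpperBound) {
        return relativeUpperBound > 0L ? ts + relativeUpperBound : ts;
    }
}
```

With bounds of -2 and 1, a left element at timestamp 10 matches right elements at 8 through 11. A right element at 8 sees the mirrored bounds -1 and 2, so it looks for left elements at 7 through 10, which is the same condition viewed from the other side.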
