Flink中的多流转换

1. 分流

在Flink的使用过程中,经常会遇到需要将一个流的数据拆分到多个流中的场景,此时就需要将一个DataStream拆分成两个或多个独立的DataStream,一般是根据一些条件将不同的数据过滤出来,写入不同的流。

在Flink 1.13版本中,可以使用处理函数(process function)的侧输出流(side output)对一个流进行拆分。处理函数本身可以认为是一个转换算子,它的输出类型比较单一,处理之后得到的仍然是一个DataStream;而侧输出流并不受此限制,可以任意自定义输出数据,看起来就像从主流分叉出去的支流。

将一个流拆分成多个流时,首先需要定义输出标签(OutputTag),在处理时将需要拆分的数据通过输出标签写入单独的流中,之后通过getSideOutput方法获取对应输出标签所标记的流。注意OutputTag要以匿名内部类的方式创建(带花括号),这样Flink才能保留泛型的类型信息。

public class SplitStreamByOutputTag {
    private static final Logger log = LoggerFactory.getLogger(SplitStreamByOutputTag.class);

    private static final OutputTag<Tuple3<String, String, Long>> MaryTag = new OutputTag<Tuple3<String, String, Long>>("Mary-pv") {
    };
    private static final OutputTag<Tuple3<String, String, Long>> BobTag = new OutputTag<Tuple3<String, String, Long>>("Bob-pv") {
    };

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        DataStreamSource<Event> stream = env.addSource(new ClickSource());

        SingleOutputStreamOperator<Event> processedStream = stream.process(new ProcessFunction<Event, Event>() {
            @Override
            public void processElement(Event value, ProcessFunction<Event, Event>.Context ctx, Collector<Event> out) throws Exception {
                if (value.getUser().equals("Mary")) {
                    ctx.output(MaryTag, new Tuple3<>(value.getUser(), value.getUrl(), value.getTimestamp()));
                } else if (value.getUser().equals("Bob")) {
                    ctx.output(BobTag, Tuple3.of(value.getUser(), value.getUrl(), value.getTimestamp()));
                } else {
                    out.collect(value);
                }
            }
        });

        processedStream.print("Other>>>>>");
        processedStream.getSideOutput(MaryTag).print("Mary>>>>>");
        processedStream.getSideOutput(BobTag).print("Bob>>>>>");

        env.execute();
    }
}

2. 合流

2.1 基本合流操作
2.1.1 Union

最简单的合流操作就是将多条流直接合并在一起,Flink中对应的算子是Union。Union操作要求所有参与合并的流的数据类型必须一致,合并之后的新流会包含所有流中的元素,数据类型不变。这种操作比较简单粗暴,就类似于高速路上的岔道,两条道路上的车直接汇入主路一样。需要注意的是,Union操作的参数可以是多个DataStream,最终的结果也是一个DataStream。
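
union支持一次性合并多条流。下面是一个最小示意(省略import,流中的数据为假设的示例元素):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStreamSource<String> s1 = env.fromElements("a");
DataStreamSource<String> s2 = env.fromElements("b");
DataStreamSource<String> s3 = env.fromElements("c");

// union的参数是可变长参数,可以同时传入多条类型相同的流,结果仍是DataStream<String>
DataStream<String> merged = s1.union(s2, s3);
merged.print();

env.execute();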

同时还需要注意,在事件时间语义下,水位线是时间进度的标志,不同流中水位线的推进快慢可能不一样。将它们合并之后,合流的水位线以所有输入流中最小的水位线为准,这样才能保证不会再有比水位线更早的数据从任何一条流中到来。

public class UnionExample {
    private static final Logger log = LoggerFactory.getLogger(UnionExample.class);

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        SingleOutputStreamOperator<Event> stream1 = env.socketTextStream("localhost", 9998)
                .map(data -> {
                    String[] fields = data.split(",");
                    return new Event(fields[0].trim(), fields[1].trim(), Long.valueOf(fields[2].trim()));
                })
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                        .withTimestampAssigner((SerializableTimestampAssigner<Event>) (element, recordTimestamp) -> element.getTimestamp()));

        stream1.print("stream1>>>>>");

        SingleOutputStreamOperator<Event> stream2 = env.socketTextStream("localhost", 9997)
                .map(data -> {
                    String[] fields = data.split(",");
                    return new Event(fields[0].trim(), fields[1].trim(), Long.valueOf(fields[2].trim()));
                })
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner((SerializableTimestampAssigner<Event>) (element, recordTimestamp) -> element.getTimestamp()));

        stream2.print("stream2>>>>>");

        // 合并
        stream1.union(stream2)
                .process(new ProcessFunction<Event, String>() {
                    @Override
                    public void processElement(Event value, ProcessFunction<Event, String>.Context ctx, Collector<String> out) throws Exception {
                        out.collect("watermark:" + ctx.timerService().currentWatermark());
                    }
                }).print();

        env.execute();
    }
}
2.1.2 Connect

流的Union操作虽然简单粗暴,但是严重受限于数据类型不能改变,所以在实际应用中使用得比较少。除了Union之外,还有一种合流操作Connect。为了让处理更灵活,Connect允许两条流的数据类型不同,连接后得到的是一种新的数据类型ConnectedStreams。Connect可以看作是形式上的统一:两条流被放在了同一个流中,但内部的数据仍然是相互独立的。要想得到新的DataStream,还需要进一步定义一个“同处理”(co-process)转换操作,说明对两条流中的数据分别进行怎样的转换和处理,最终得到统一的输出类型。

  • ConnectedStreams

    首先将两条流通过connect连接,得到一个ConnectedStreams,然后调用同处理方法得到DataStream,这里可以使用的同处理方法有map、flatMap以及process。

    public class ConnectExample {
        private static final Logger log = LoggerFactory.getLogger(ConnectExample.class);
    
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(1);
    
            DataStreamSource<Integer> left = env.fromElements(1, 2, 3);
            DataStreamSource<Long> right = env.fromElements(1L, 2L, 3L);
    
            ConnectedStreams<Integer, Long> connectedStreams = left.connect(right);
            SingleOutputStreamOperator<String> result = connectedStreams.map(new CoMapFunction<Integer, Long, String>() {
                @Override
                public String map1(Integer value) throws Exception {
                    return "Integer: " + value;
                }
    
                @Override
                public String map2(Long value) throws Exception {
                    return "Long: " + value;
                }
            });
    
            result.print();
    
            env.execute();
        }
    }
    

    调用map方法传入的是一个CoMapFunction,分别对两条流中的数据进行处理。这个接口的三个类型参数分别是第一条流的数据类型、第二条流的数据类型以及合并后输出的数据类型,需要实现的方法也非常见名知义:map1对应第一条流的处理,map2对应第二条流的处理。ConnectedStreams也可以直接调用keyBy进行分区操作,得到的仍然是一个ConnectedStreams,这和对两条流先分别keyBy再connect的效果是一样的。需要注意的是,两条流定义的分区键类型必须相同,否则会抛出异常。下面给出两种等价写法的最小示意。
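
    一个最小示意(省略import,流中的数据为假设的示例元素),展示两种等价写法:

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    DataStreamSource<Tuple2<String, Integer>> orders = env.fromElements(Tuple2.of("a", 1));
    DataStreamSource<Tuple2<String, Long>> pays = env.fromElements(Tuple2.of("a", 1L));

    // 写法一:先connect,再对ConnectedStreams调用keyBy,两个参数分别是两条流的分区键选择器
    ConnectedStreams<Tuple2<String, Integer>, Tuple2<String, Long>> c1 =
            orders.connect(pays).keyBy(data -> data.f0, data -> data.f0);

    // 写法二:两条流先各自keyBy再connect,效果与写法一相同(分区键类型必须一致)
    ConnectedStreams<Tuple2<String, Integer>, Tuple2<String, Long>> c2 =
            orders.keyBy(data -> data.f0).connect(pays.keyBy(data -> data.f0));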

  • CoProcessFunction

    和使用CoMapFunction、CoFlatMapFunction一样,CoProcessFunction需要实现processElement1和processElement2两个方法,分别处理两条流中的数据。与CoMapFunction、CoFlatMapFunction不同的是,CoProcessFunction中还可以使用生命周期方法、状态以及定时器。

    public class CoProcessFunctionExample {
        private static final Logger log = LoggerFactory.getLogger(CoProcessFunctionExample.class);
    
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(1);
    
            SingleOutputStreamOperator<Tuple3<String, String, Long>> appStream = env.fromElements(
                    Tuple3.of("order-1", "app", 1000L),
                    Tuple3.of("order-2", "app", 2000L)
            ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple3<String, String, Long>>forMonotonousTimestamps()
                    .withTimestampAssigner((SerializableTimestampAssigner<Tuple3<String, String, Long>>) (element, recordTimestamp) -> element.f2));
    
            SingleOutputStreamOperator<Tuple4<String, String, String, Long>> thirdPartyStream = env.fromElements(
                    Tuple4.of("order-1", "third-party", "success", 3000L),
                    Tuple4.of("order-3", "third-party", "success", 4000L)
            ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple4<String, String, String, Long>>forMonotonousTimestamps()
                    .withTimestampAssigner((SerializableTimestampAssigner<Tuple4<String, String, String, Long>>) (element, recordTimestamp) -> element.f3));
    
            appStream.connect(thirdPartyStream)
                    .keyBy(data -> data.f0, data -> data.f0)
                    .process(new CoProcessFunction<Tuple3<String, String, Long>, Tuple4<String, String, String, Long>, String>() {
                        private ValueState<Tuple3<String, String, Long>> appEventState;
                        private ValueState<Tuple4<String, String, String, Long>> thirdPartyEventState;
    
                        @Override
                        public void open(Configuration parameters) throws Exception {
                            appEventState = getRuntimeContext().getState(new ValueStateDescriptor<>("app-event", Types.TUPLE(Types.STRING, Types.STRING, Types.LONG)));
                            thirdPartyEventState = getRuntimeContext().getState(new ValueStateDescriptor<>("third-party-event", Types.TUPLE(Types.STRING, Types.STRING, Types.STRING, Types.LONG)));
                        }
    
                        @Override
                        public void processElement1(Tuple3<String, String, Long> value, CoProcessFunction<Tuple3<String, String, Long>, Tuple4<String, String, String, Long>, String>.Context ctx, Collector<String> out) throws Exception {
                            if (thirdPartyEventState.value() != null) {
                                out.collect("对账成功:" + value + "    " + thirdPartyEventState.value());
                                thirdPartyEventState.clear();
                            } else {
                                appEventState.update(value);
                                ctx.timerService().registerEventTimeTimer(value.f2 + 5000L);
                            }
                        }
    
                        @Override
                        public void processElement2(Tuple4<String, String, String, Long> value, CoProcessFunction<Tuple3<String, String, Long>, Tuple4<String, String, String, Long>, String>.Context ctx, Collector<String> out) throws Exception {
                            if (appEventState.value() != null) {
                                out.collect("对账成功:" + appEventState.value() + "    " + value);
                                appEventState.clear();
                            } else {
                                thirdPartyEventState.update(value);
                                ctx.timerService().registerEventTimeTimer(value.f3 + 5000L);
                            }
                        }
    
                        @Override
                        public void onTimer(long timestamp, CoProcessFunction<Tuple3<String, String, Long>, Tuple4<String, String, String, Long>, String>.OnTimerContext ctx, Collector<String> out) throws Exception {
                            // 定时器触发
                            if (appEventState.value() != null) {
                                out.collect("对账失败:" + appEventState.value() + "    第三方平台支付信息未到");
                            }
                            if (thirdPartyEventState.value() != null) {
                                out.collect("对账失败:" + thirdPartyEventState.value() + "    app信息未到");
                            }
                            appEventState.clear();
                            thirdPartyEventState.clear();
                        }
                    }).print();
    
            env.execute();
        }
    }
    
  • BroadcastConnectedStream

    在两条流的连接中,当DataStream调用connect时,传入的第二条流可能是一个广播流(BroadcastStream),此时两条流合并得到的就是一个BroadcastConnectedStream。

    这种方式一般用于需要动态下发某些规则或者配置的场景。因为规则是实时变动的,我们可以用一条流来实时获取规则数据;这些规则对整个应用全局有效,所以必须把这条流广播给所有的下游子任务。下游任务收到广播出来的规则后,会把它保存成一个状态,这就是所谓的“广播状态”(broadcast state)。

    下面的例子从MySQL中周期性地读取配置,将配置数据广播给所有子任务,再与Kafka中的数据进行关联。

    public class BroadcastConnectedStream {
        private static final Logger log = LoggerFactory.getLogger(BroadcastConnectedStream.class);
    
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(1);
    
            Properties properties = new Properties();
            properties.put("bootstrap.servers", "localhost:9092");
            properties.put("group.id", "event");
    
            // kafka结构:{"userID": "user_1", "eventTime": "2019-08-17 12:19:47", "eventType": "browse", "productID": 1}
            FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>("user", new SimpleStringSchema(), properties);
            kafkaConsumer.setStartFromLatest();
            SingleOutputStreamOperator<String> kafkaSource = env.addSource(kafkaConsumer).name("kafkaSource").uid("source-id-kafka-source");
            SingleOutputStreamOperator<Tuple4<String, String, String, Integer>> eventStream = kafkaSource.flatMap(new FlatMapFunction<String, Tuple4<String, String, String, Integer>>() {
                @Override
                public void flatMap(String value, Collector<Tuple4<String, String, String, Integer>> out) throws Exception {
                    try {
                        JSONObject jsonObject = JSON.parseObject(value);
                        String userID = jsonObject.getString("userID");
                        String eventTime = jsonObject.getString("eventTime");
                        String eventType = jsonObject.getString("eventType");
                        Integer productID = jsonObject.getInteger("productID");
                        out.collect(Tuple4.of(userID, eventTime, eventType, productID));
                    } catch (Exception e) {
                        log.warn("异常数据:{}", value, e);
                    }
                }
            });
    
            // 从MySQL获取配置流,MySQL结构:userID, userName, userAge
            DataStreamSource<HashMap<String, Tuple2<String, Integer>>> configStream = env.addSource(new MySQLSource("localhost", 3306, "demo", "root", "xiaoer", 5000L));
    
            // MapStateDescriptor
            MapStateDescriptor<Void, Map<String, Tuple2<String, Integer>>> configDescriptor = new MapStateDescriptor<>("config", Types.VOID, Types.MAP(Types.STRING, Types.TUPLE(Types.STRING, Types.INT)));
    
            // 将配置流广播
            BroadcastStream<HashMap<String, Tuple2<String, Integer>>> broadcastConfigStream = configStream.broadcast(configDescriptor);
    
            SingleOutputStreamOperator<Tuple6<String, String, String, Integer, String, Integer>> resultStream = eventStream.connect(broadcastConfigStream)
                    .process(new BroadcastProcessFunction<Tuple4<String, String, String, Integer>, HashMap<String, Tuple2<String, Integer>>, Tuple6<String, String, String, Integer, String, Integer>>() {
                        MapStateDescriptor<Void, Map<String, Tuple2<String, Integer>>> configDescriptor = new MapStateDescriptor<>("config", Types.VOID, Types.MAP(Types.STRING, Types.TUPLE(Types.STRING, Types.INT)));
    
                        @Override
                        public void processElement(Tuple4<String, String, String, Integer> value, BroadcastProcessFunction<Tuple4<String, String, String, Integer>, HashMap<String, Tuple2<String, Integer>>, Tuple6<String, String, String, Integer, String, Integer>>.ReadOnlyContext ctx, Collector<Tuple6<String, String, String, Integer, String, Integer>> out) throws Exception {
                            // 获取事件流中的userID
                            String userID = value.f0;
    
                            // 获取状态
                            ReadOnlyBroadcastState<Void, Map<String, Tuple2<String, Integer>>> broadcastState = ctx.getBroadcastState(configDescriptor);
                            Map<String, Tuple2<String, Integer>> broadcastStateUserInfo = broadcastState.get(null);
                            // 广播的配置还未到达时状态为空,直接跳过,避免空指针
                            if (broadcastStateUserInfo == null) {
                                return;
                            }
    
                            Tuple2<String, Integer> userInfo = broadcastStateUserInfo.get(userID);
                            if (userInfo != null) {
                                out.collect(Tuple6.of(value.f0, value.f1, value.f2, value.f3, userInfo.f0, userInfo.f1));
                            }
                        }
    
                        @Override
                        public void processBroadcastElement(HashMap<String, Tuple2<String, Integer>> value, BroadcastProcessFunction<Tuple4<String, String, String, Integer>, HashMap<String, Tuple2<String, Integer>>, Tuple6<String, String, String, Integer, String, Integer>>.Context ctx, Collector<Tuple6<String, String, String, Integer, String, Integer>> out) throws Exception {
                            BroadcastState<Void, Map<String, Tuple2<String, Integer>>> broadcastState = ctx.getBroadcastState(configDescriptor);
                            // 清空状态
                            broadcastState.clear();
                            // 更新状态
                            broadcastState.put(null, value);
                        }
                    });
    
            resultStream.print();
    
            env.execute();
        }
    }
    
    class MySQLSource extends RichSourceFunction<HashMap<String, Tuple2<String, Integer>>> {
        private static final Logger log = LoggerFactory.getLogger(MySQLSource.class);
    
        private String host;
        private Integer port;
        private String db;
        private String user;
        private String password;
        private Long interval;
    
        private volatile boolean isRunning = true;
    
        private Connection connection;
        private PreparedStatement preparedStatement;
    
        public MySQLSource(String host, Integer port, String db, String user, String password, Long interval) {
            this.host = host;
            this.port = port;
            this.db = db;
            this.user = user;
            this.password = password;
    
            /**
             * 间隔多少毫秒查询
             */
            this.interval = interval;
        }
    
        @Override
        public void open(Configuration parameters) throws Exception {
            super.open(parameters);
            Class.forName("com.mysql.cj.jdbc.Driver");
            connection = DriverManager.getConnection("jdbc:mysql://" + host + ":" + port + "/" + db + "?useUnicode=true&characterEncoding=UTF-8", user, password);
            String sql = "select userID, userName, userAge from user_info";
            preparedStatement = connection.prepareStatement(sql);
        }
    
        @Override
        public void close() throws Exception {
            super.close();
            // 先关闭PreparedStatement,再关闭Connection
            if (preparedStatement != null) {
                preparedStatement.close();
            }
            if (connection != null) {
                connection.close();
            }
        }
    
        @Override
        public void run(SourceContext<HashMap<String, Tuple2<String, Integer>>> ctx) throws Exception {
            try {
                while (isRunning) {
                    HashMap<String, Tuple2<String, Integer>> output = new HashMap<>();
                    ResultSet resultSet = preparedStatement.executeQuery();
                    while (resultSet.next()) {
                        String userID = resultSet.getString("userID");
                        String userName = resultSet.getString("userName");
                        int userAge = resultSet.getInt("userAge");
                        output.put(userID, Tuple2.of(userName, userAge));
                    }
                    ctx.collect(output);
                    Thread.sleep(interval);
                }
            } catch (Exception e) {
                log.error("从MySQL获取配置流数据异常...", e);
            }
        }
    
        @Override
        public void cancel() {
            isRunning = false;
        }
    }
    
2.2 基于时间的合流操作

对于两条流的合并,有时候我们并不是将两条流的数据简单地合在一起,而是要根据某个字段将它们匹配起来,这有点像关系型数据库中的join操作。事实上,Flink中的connect操作配合keyBy指定分区键,就可以实现类似于SQL中join的效果;使用connect以及处理函数,可以覆盖双流合并的大多数场景。

不过处理函数属于底层接口,用它来实现一些具体的需求还是比较繁琐的,比如要统计固定时间内两条流的匹配情况,就需要自己注册定时器、自定义触发逻辑才能实现。所以Flink内置了两种join算子,以及coGroup算子。

2.2.1 Window Join

基于时间的操作,最基本的就是时间窗口。Flink提供了一个窗口连接算子(Window Join),可以定义时间窗口,并将两条流中共享一个公共分区键的数据放在窗口中进行配对处理。

Window Join的通用调用形式如下:

stream.join(otherStream)
	.where()
	.equalTo()
	.window()
	.apply()

上面的where的参数是stream的分区键选择器,equalTo的参数是otherStream的分区键选择器,两条流中键相同的元素如果落在同一个窗口中,就可以匹配起来,然后调用apply,通过JoinFunction进行处理。window传入的是一个窗口分配器,可以使用滚动窗口、滑动窗口以及会话窗口。需要注意的是,这里最后进行处理时只能调用apply方法,没有其他方法可供选择;而且JoinFunction并不是真正的窗口函数,它只是定义了窗口函数在调用时对匹配数据的处理逻辑。

JoinFunction接口有三个类型参数,分别是两条流中的数据类型以及最终输出结果的数据类型,JoinFunction中只有一个join方法需要实现。两条流的数据到来之后,首先会按照key分组并进入对应的窗口中存储;当窗口结束时,算子会统计出窗口内两条流数据的所有组合,即笛卡尔积,然后进行遍历,把每一对匹配的数据作为参数传入JoinFunction的join方法中进行处理。

apply中除了JoinFunction之外,还可以传入FlatJoinFunction,用法非常类似,只是需要实现的join方法没有返回值,而是通过收集器(Collector)输出,所以对于一对匹配的数据可以输出任意条结果(下文Window Join示例之后给出一个FlatJoinFunction的示意)。

Flink中的Window Join类似于SQL中的inner join,最后输出的只有两条流中按key匹配成功的那些数据。

public class WindowJoinExample {
    public static void main(String[] args) throws Exception {
        // 获取流执行环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 设置并行度
        env.setParallelism(1);

        // 添加数据源
        SingleOutputStreamOperator<Tuple2<String, Long>> stream1 = env.fromElements(
                        Tuple2.of("a", 1000L),
                        Tuple2.of("b", 1000L),
                        Tuple2.of("a", 2000L),
                        Tuple2.of("b", 2000L))
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps()
                        .withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
                            @Override
                            public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
                                return element.f1;
                            }
                        }));

        SingleOutputStreamOperator<Tuple2<String, Long>> stream2 = env.fromElements(
                        Tuple2.of("a", 3000L),
                        Tuple2.of("b", 3000L),
                        Tuple2.of("a", 4000L),
                        Tuple2.of("b", 4000L))
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps()
                        .withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
                            @Override
                            public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
                                return element.f1;
                            }
                        }));

        // join
        stream1.join(stream2)
                .where(data -> data.f0)
                .equalTo(data -> data.f0)
                .window(TumblingEventTimeWindows.of(Time.seconds(2)))
                .apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
                    @Override
                    public String join(Tuple2<String, Long> first, Tuple2<String, Long> second) throws Exception {
                        return first + " ==> " + second;
                    }
                })
                .print();

        env.execute();
    }
}
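
如果把上面例子中的JoinFunction换成FlatJoinFunction,就可以对每一对匹配数据输出零条或多条结果。下面是一个示意(流的定义沿用上例中的stream1和stream2,过滤条件是假设的示例逻辑):

stream1.join(stream2)
        .where(data -> data.f0)
        .equalTo(data -> data.f0)
        .window(TumblingEventTimeWindows.of(Time.seconds(2)))
        .apply(new FlatJoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
            @Override
            public void join(Tuple2<String, Long> first, Tuple2<String, Long> second, Collector<String> out) throws Exception {
                // 通过Collector输出:不满足条件时可以一条都不输出
                if (!first.f1.equals(second.f1)) {
                    out.collect(first + " ==> " + second);
                }
            }
        })
        .print();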

Window Join在底层也是通过CoGroup实现的,传入的JoinFunction会被包装成JoinCoGroupFunction再交给CoGroup处理。

2.2.2 Interval Join

在某些场景下,我们要匹配的数据之间的时间间隔可能不是固定的,基于固定窗口的Window Join很明显不能满足需求。为此Flink提供了Interval Join(间隔连接)操作:针对一条流中的每一条数据,开辟出其时间戳前后的一段时间间隔,看这段时间内是否有来自另一条流的数据与之匹配。

Interval Join中给定了两个时间点,一个是上界(upperBound),一个是下界(lowerBound)。对于left流中的一个元素a来说,就可以开辟出一段范围为[a.timestamp + lowerBound, a.timestamp + upperBound]的时间间隔,此时如果right流中某个元素b的时间戳落在这个范围内,那么a和b就可以成功匹配。

需要注意的是,进行Interval Join的两条流必须基于相同的分区键key,并且a.timestamp + lowerBound必须小于等于a.timestamp + upperBound,即lowerBound <= upperBound,其中lowerBound和upperBound都可正可负。例如当lowerBound = -2秒、upperBound = 1秒时,时间戳为5秒的元素a的匹配区间就是[3秒, 6秒]。另外,Interval Join目前仅支持事件时间语义。

(图:Interval Join的匹配区间示意)

Interval Join的通用调用形式如下:

orangeStream
    .keyBy()
    .intervalJoin(greenStream.keyBy())
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessJoinFunction<Integer, Integer, String>() {
        @Override
        public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
            out.collect(left + "," + right);
        }
    });

Interval Join同样也是一种inner join。与Window Join不同的是,Interval Join的匹配区间是基于流中每条数据的时间戳开辟的,并不是固定的;而且另一条流中的同一条数据可能落入多条数据的匹配区间,从而被匹配多次。

public class IntervalJoinExample {
    public static void main(String[] args) throws Exception {
        // 获取流执行环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 设置并行度
        env.setParallelism(1);

        // 获取数据源
        SingleOutputStreamOperator<Tuple3<String, String, Long>> stream1 = env.fromElements(
                Tuple3.of("Mary", "order-1", 5000L),
                Tuple3.of("Alice", "order-2", 5000L),
                Tuple3.of("Bob", "order-3", 20000L),
                Tuple3.of("Alice", "order-4", 20000L),
                Tuple3.of("Cary", "order-5", 51000L)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple3<String, String, Long>>forMonotonousTimestamps()
                .withTimestampAssigner(new SerializableTimestampAssigner<Tuple3<String, String, Long>>() {
                    @Override
                    public long extractTimestamp(Tuple3<String, String, Long> element, long recordTimestamp) {
                        return element.f2;
                    }
                }));

        SingleOutputStreamOperator<Event> stream2 = env.fromElements(
                new Event("Bob", "./cart", 2000L),
                new Event("Alice", "./prod?id=100", 3000L),
                new Event("Alice", "./prod?id=200", 3500L),
                new Event("Bob", "./prod?id=2", 2500L),
                new Event("Alice", "./prod?id=300", 36000L),
                new Event("Bob", "./home", 30000L),
                new Event("Bob", "./prod?id=1", 23000L),
                new Event("Bob", "./prod?id=3", 33000L)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forMonotonousTimestamps()
                .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                    @Override
                    public long extractTimestamp(Event element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }));

        // interval join
        stream1.keyBy(data -> data.f0)
                .intervalJoin(stream2.keyBy(Event::getUser))
                .between(Time.seconds(-2), Time.seconds(2))
                .process(new ProcessJoinFunction<Tuple3<String, String, Long>, Event, String>() {
                    @Override
                    public void processElement(Tuple3<String, String, Long> left, Event right, Context ctx, Collector<String> out) throws Exception {
                        // 输出每一对匹配成功的数据
                        out.collect(right + " => " + left);
                    }
                })
                .print();

        env.execute();
    }
}

Interval Join中只能调用process方法,传入处理函数ProcessJoinFunction。Interval Join的底层是通过connect实现的。

2.2.3 Window CoGroup

除了Window Join和Interval Join之外,Flink还提供了Window CoGroup操作。因为Window Join的底层就是通过Window CoGroup实现的,所以两者用法非常类似,也是将两条流合并后开窗处理匹配的元素,调用时只需要将join换成coGroup就可以了。

Window CoGroup的通用调用形式如下:

dataStream.coGroup(otherStream)
    .where()
    .equalTo()
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply()

CoGroupFunction中的coGroup方法有点类似于FlatJoinFunction中的join方法,同样是三个参数,分别是两条流中的数据以及用于输出的收集器(Collector)。不同的是,join中的前两个参数是一对匹配的数据,而coGroup中的前两个参数则是可遍历的集合。也就是说,此时不会再去计算窗口内两条流数据的笛卡尔积,而是把窗口内的所有数据一次性传入,要怎样处理完全可以自定义。所以CoGroup比Join更加灵活,可以用CoGroup实现left join、right join以及full join。下面的例子就用CoGroup实现了left join的效果。

public class CoGroupExample {
    public static void main(String[] args) throws Exception {
        // 获取流执行环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 设置并行度
        env.setParallelism(1);

        // 添加数据源
        SingleOutputStreamOperator<Tuple2<String, Long>> stream1 = env.fromElements(
                Tuple2.of("a", 1000L),
                Tuple2.of("b", 1000L),
                Tuple2.of("c", 1000L),
                Tuple2.of("a", 3000L),
                Tuple2.of("b", 3000L)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps()
                .withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
                    @Override
                    public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
                        return element.f1;
                    }
                }));

        SingleOutputStreamOperator<Tuple2<String, Long>> stream2 = env.fromElements(
                Tuple2.of("a", 2000L),
                Tuple2.of("b", 2000L),
                Tuple2.of("a", 5000L),
                Tuple2.of("b", 5000L)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps()
                .withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
                    @Override
                    public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
                        return element.f1;
                    }
                }));

        // coGroup
        stream1.coGroup(stream2)
                .where(data -> data.f0)
                .equalTo(data -> data.f0)
                .window(TumblingEventTimeWindows.of(Time.seconds(2)))
                .apply(new CoGroupFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
                    @Override
                    public void coGroup(Iterable<Tuple2<String, Long>> first, Iterable<Tuple2<String, Long>> second, Collector<String> out) throws Exception {
                        for (Tuple2<String, Long> left : first) {
                            boolean isMatched = false;
                            for (Tuple2<String, Long> right : second) {
                                // second中存在相同key的数据,逐条匹配输出
                                out.collect(left + " ==> " + right);
                                isMatched = true;
                            }
                            if (!isMatched) {
                                out.collect(left + " ==> " + "null");
                            }
                        }
                    }
                })
                .print();

        env.execute();
    }
}