When working with Flink you frequently need to split one stream into several, that is, break a DataStream into two or more independent DataStreams, typically by filtering records into different streams according to some condition.
As of version 1.13, a stream is split using the side outputs of a process function. A process function is itself a transformation operator with a single output type: the result of processing is still one DataStream. Side outputs are not bound by that restriction; they can emit arbitrary user-defined types and look like branches forking off the main stream.
To split a stream, first define an output tag (OutputTag), write tagged records to a separate output inside the process function, and then retrieve the stream for a given tag via getSideOutput.
public class SplitStreamByOutputTag {

    private static final Logger log = LoggerFactory.getLogger(SplitStreamByOutputTag.class);

    // Anonymous subclasses so the generic type information is preserved at runtime.
    private static final OutputTag<Tuple3<String, String, Long>> MaryTag = new OutputTag<Tuple3<String, String, Long>>("Mary-pv") {
    };
    private static final OutputTag<Tuple3<String, String, Long>> BobTag = new OutputTag<Tuple3<String, String, Long>>("Bob-pv") {
    };

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        DataStreamSource<Event> stream = env.addSource(new ClickSource());

        SingleOutputStreamOperator<Event> processedStream = stream.process(new ProcessFunction<Event, Event>() {
            @Override
            public void processElement(Event value, ProcessFunction<Event, Event>.Context ctx, Collector<Event> out) throws Exception {
                if (value.getUser().equals("Mary")) {
                    ctx.output(MaryTag, new Tuple3<>(value.getUser(), value.getUrl(), value.getTimestamp()));
                } else if (value.getUser().equals("Bob")) {
                    ctx.output(BobTag, Tuple3.of(value.getUser(), value.getUrl(), value.getTimestamp()));
                } else {
                    out.collect(value);
                }
            }
        });

        processedStream.print("Other>>>>>");
        processedStream.getSideOutput(MaryTag).print("Mary>>>>>");
        processedStream.getSideOutput(BobTag).print("Bob>>>>>");

        env.execute();
    }
}
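The examples in this article reference an Event POJO and a ClickSource that are never shown. As a minimal sketch of what they could look like, inferred from the getters and the constructor used in the examples (the field names and the source's behavior are assumptions; put each class in its own file):

// Hypothetical POJO, inferred from the calls getUser(), getUrl(), getTimestamp()
// and the constructor new Event(user, url, timestamp) used later in this article.
public class Event {
    private String user;
    private String url;
    private Long timestamp;

    public Event() {
        // Flink POJOs need a public no-arg constructor
    }

    public Event(String user, String url, Long timestamp) {
        this.user = user;
        this.url = url;
        this.timestamp = timestamp;
    }

    public String getUser() { return user; }
    public String getUrl() { return url; }
    public Long getTimestamp() { return timestamp; }
}

// Hypothetical source that emits one random click per second until cancelled.
public class ClickSource implements SourceFunction<Event> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Event> ctx) throws Exception {
        String[] users = {"Mary", "Bob", "Alice"};
        String[] urls = {"./home", "./cart", "./prod?id=1"};
        Random random = new Random();
        while (running) {
            ctx.collect(new Event(users[random.nextInt(users.length)], urls[random.nextInt(urls.length)], System.currentTimeMillis()));
            Thread.sleep(1000L);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}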
The simplest way to merge streams is to union them, which Flink exposes as the union operator. Union requires all merged streams to have the same type; the resulting stream contains every element of every input, with the data type unchanged. It is a blunt instrument, much like an on-ramp where the cars from two roads merge straight onto the highway. Note that union accepts multiple DataStreams as arguments, and the result is again a DataStream.
Also note that under event-time semantics the watermark marks the progress of time, and watermarks in different streams may advance at different speeds. After merging, the watermark of the combined stream is the minimum of the input watermarks; this guarantees that none of the inputs will later deliver data older than the merged watermark.
public class UnionExample {

    private static final Logger log = LoggerFactory.getLogger(UnionExample.class);

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        SingleOutputStreamOperator<Event> stream1 = env.socketTextStream("localhost", 9998)
                .map(data -> {
                    String[] fields = data.split(",");
                    return new Event(fields[0].trim(), fields[1].trim(), Long.valueOf(fields[2].trim()));
                })
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                        .withTimestampAssigner((SerializableTimestampAssigner<Event>) (element, recordTimestamp) -> element.getTimestamp()));
        stream1.print("stream1>>>>>");

        SingleOutputStreamOperator<Event> stream2 = env.socketTextStream("localhost", 9997)
                .map(data -> {
                    String[] fields = data.split(",");
                    return new Event(fields[0].trim(), fields[1].trim(), Long.valueOf(fields[2].trim()));
                })
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner((SerializableTimestampAssigner<Event>) (element, recordTimestamp) -> element.getTimestamp()));
        stream2.print("stream2>>>>>");

        // union the two streams and print the current watermark for every element
        stream1.union(stream2)
                .process(new ProcessFunction<Event, String>() {
                    @Override
                    public void processElement(Event value, ProcessFunction<Event, String>.Context ctx, Collector<String> out) throws Exception {
                        out.collect("watermark:" + ctx.timerService().currentWatermark());
                    }
                }).print();

        env.execute();
    }
}
Although union is simple and blunt, it is severely limited by the requirement that the data types must not change, so it sees relatively little use in practice. Besides union there is a second merging operation, connect. For more flexibility, connect allows the two input streams to have different types; since the merged stream must still end up with a single type, connecting produces a new abstraction, ConnectedStreams. Connect is best seen as a merely formal union: the two streams live in one container, but internally their data remain separate. To get a new DataStream you must further define a co-process transformation that specifies, for each of the two inputs, how its records are transformed, arriving at a single output type.
ConnectedStreams
First connect the two streams to obtain a ConnectedStreams, then call a co-processing method to get back a DataStream; the co-processing methods available here are map, flatMap, and process.
public class ConnectExample {

    private static final Logger log = LoggerFactory.getLogger(ConnectExample.class);

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        DataStreamSource<Integer> left = env.fromElements(1, 2, 3);
        DataStreamSource<Long> right = env.fromElements(1L, 2L, 3L);

        ConnectedStreams<Integer, Long> connectedStreams = left.connect(right);
        SingleOutputStreamOperator<String> result = connectedStreams.map(new CoMapFunction<Integer, Long, String>() {
            @Override
            public String map1(Integer value) throws Exception {
                return "Integer: " + value;
            }

            @Override
            public String map2(Long value) throws Exception {
                return "Long: " + value;
            }
        });

        result.print();
        env.execute();
    }
}
The map call takes a CoMapFunction, which processes the records of the two streams separately. The interface has three type parameters: the first stream's type, the second stream's type, and the merged output type. The methods to implement are aptly named: map1 handles the first stream, map2 the second. A ConnectedStreams can also call keyBy directly for partitioning, which again yields a ConnectedStreams; this has the same effect as keying each stream first and then connecting them, as the snippet below shows. Note that the key types of the two streams must be the same, otherwise an exception is thrown.
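As a minimal sketch of that equivalence, reusing left and right from the ConnectExample above (the modulo keys are purely illustrative):

// Keying the connected streams directly...
ConnectedStreams<Integer, Long> viaConnected = left.connect(right)
        .keyBy(x -> String.valueOf(x % 2), y -> String.valueOf(y % 2));

// ...behaves the same as keying each stream first and then connecting.
// Both key selectors must produce the same key type (String here).
ConnectedStreams<Integer, Long> viaKeyedStreams = left.keyBy(x -> String.valueOf(x % 2))
        .connect(right.keyBy(y -> String.valueOf(y % 2)));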
CoProcessFunction
As with CoMapFunction and CoFlatMapFunction, a CoProcessFunction implements two methods, processElement1 and processElement2, that handle the records of the two streams separately. What sets CoProcessFunction apart from CoMapFunction and CoFlatMapFunction is that it additionally exposes lifecycle methods as well as state and timer operations.
public class CoProcessFunctionExample {

    private static final Logger log = LoggerFactory.getLogger(CoProcessFunctionExample.class);

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        SingleOutputStreamOperator<Tuple3<String, String, Long>> appStream = env.fromElements(
                Tuple3.of("order-1", "app", 1000L),
                Tuple3.of("order-2", "app", 2000L)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple3<String, String, Long>>forMonotonousTimestamps()
                .withTimestampAssigner((SerializableTimestampAssigner<Tuple3<String, String, Long>>) (element, recordTimestamp) -> element.f2));

        SingleOutputStreamOperator<Tuple4<String, String, String, Long>> thirdPartyStream = env.fromElements(
                Tuple4.of("order-1", "third-party", "success", 3000L),
                Tuple4.of("order-3", "third-party", "success", 4000L)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple4<String, String, String, Long>>forMonotonousTimestamps()
                .withTimestampAssigner((SerializableTimestampAssigner<Tuple4<String, String, String, Long>>) (element, recordTimestamp) -> element.f3));

        appStream.connect(thirdPartyStream)
                .keyBy(data -> data.f0, data -> data.f0)
                .process(new CoProcessFunction<Tuple3<String, String, Long>, Tuple4<String, String, String, Long>, String>() {
                    // one value state per side, holding the event still waiting for its counterpart
                    private ValueState<Tuple3<String, String, Long>> appEventState;
                    private ValueState<Tuple4<String, String, String, Long>> thirdPartyEventState;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        appEventState = getRuntimeContext().getState(new ValueStateDescriptor<>("app-event", Types.TUPLE(Types.STRING, Types.STRING, Types.LONG)));
                        thirdPartyEventState = getRuntimeContext().getState(new ValueStateDescriptor<>("third-party-event", Types.TUPLE(Types.STRING, Types.STRING, Types.STRING, Types.LONG)));
                    }

                    @Override
                    public void processElement1(Tuple3<String, String, Long> value, CoProcessFunction<Tuple3<String, String, Long>, Tuple4<String, String, String, Long>, String>.Context ctx, Collector<String> out) throws Exception {
                        if (thirdPartyEventState.value() != null) {
                            out.collect("reconciliation succeeded: " + value + " " + thirdPartyEventState.value());
                            thirdPartyEventState.clear();
                        } else {
                            appEventState.update(value);
                            // wait up to 5 seconds of event time for the third-party record
                            ctx.timerService().registerEventTimeTimer(value.f2 + 5000L);
                        }
                    }

                    @Override
                    public void processElement2(Tuple4<String, String, String, Long> value, CoProcessFunction<Tuple3<String, String, Long>, Tuple4<String, String, String, Long>, String>.Context ctx, Collector<String> out) throws Exception {
                        if (appEventState.value() != null) {
                            out.collect("reconciliation succeeded: " + appEventState.value() + " " + value);
                            appEventState.clear();
                        } else {
                            thirdPartyEventState.update(value);
                            // wait up to 5 seconds of event time for the app record
                            ctx.timerService().registerEventTimeTimer(value.f3 + 5000L);
                        }
                    }

                    @Override
                    public void onTimer(long timestamp, CoProcessFunction<Tuple3<String, String, Long>, Tuple4<String, String, String, Long>, String>.OnTimerContext ctx, Collector<String> out) throws Exception {
                        // timer fired: whichever side is still buffered never found a match
                        if (appEventState.value() != null) {
                            out.collect("reconciliation failed: " + appEventState.value() + ", third-party payment record never arrived");
                        }
                        if (thirdPartyEventState.value() != null) {
                            out.collect("reconciliation failed: " + thirdPartyEventState.value() + ", app record never arrived");
                        }
                        appEventState.clear();
                        thirdPartyEventState.clear();
                    }
                }).print();

        env.execute();
    }
}
BroadcastConnectedStream
When connecting two streams, the second argument passed to connect may be a broadcast stream (BroadcastStream); connecting to one yields a BroadcastConnectedStream.
This pattern is typically used when rules or configuration must be defined dynamically. Because the rules change at runtime, we read them from a dedicated stream, and since they apply globally to the whole application, that stream must be broadcast to every subtask. Each downstream task saves the broadcast rules it receives as state, the so-called "broadcast state".
The following example dynamically reads configuration from MySQL, broadcasts it to all subtasks, and joins it against the data arriving from Kafka.
public class BroadcastConnectedStream {

    private static final Logger log = LoggerFactory.getLogger(BroadcastConnectedStream.class);

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        Properties properties = new Properties();
        properties.put("bootstrap.servers", "localhost:9092");
        properties.put("group.id", "event");

        // Kafka record layout: {"userID": "user_1", "eventTime": "2019-08-17 12:19:47", "eventType": "browse", "productID": 1}
        FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>("user", new SimpleStringSchema(), properties);
        kafkaConsumer.setStartFromLatest();
        SingleOutputStreamOperator<String> kafkaSource = env.addSource(kafkaConsumer).name("kafkaSource").uid("source-id-kafka-source");

        SingleOutputStreamOperator<Tuple4<String, String, String, Integer>> eventStream = kafkaSource.flatMap(new FlatMapFunction<String, Tuple4<String, String, String, Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple4<String, String, String, Integer>> out) throws Exception {
                try {
                    JSONObject jsonObject = JSON.parseObject(value);
                    String userID = jsonObject.getString("userID");
                    String eventTime = jsonObject.getString("eventTime");
                    String eventType = jsonObject.getString("eventType");
                    Integer productID = jsonObject.getInteger("productID");
                    out.collect(Tuple4.of(userID, eventTime, eventType, productID));
                } catch (Exception e) {
                    log.warn("malformed record: {}", value, e);
                }
            }
        });

        // configuration stream from MySQL; table columns: userID, userName, userAge
        DataStreamSource<HashMap<String, Tuple2<String, Integer>>> configStream = env.addSource(new MySQLSource("localhost", 3306, "demo", "root", "xiaoer", 5000L));

        // descriptor for the broadcast state: a single map stored under a Void key
        MapStateDescriptor<Void, Map<String, Tuple2<String, Integer>>> configDescriptor = new MapStateDescriptor<>("config", Types.VOID, Types.MAP(Types.STRING, Types.TUPLE(Types.STRING, Types.INT)));

        // broadcast the configuration stream
        BroadcastStream<HashMap<String, Tuple2<String, Integer>>> broadcastConfigStream = configStream.broadcast(configDescriptor);

        SingleOutputStreamOperator<Tuple6<String, String, String, Integer, String, Integer>> resultStream = eventStream.connect(broadcastConfigStream)
                .process(new BroadcastProcessFunction<Tuple4<String, String, String, Integer>, HashMap<String, Tuple2<String, Integer>>, Tuple6<String, String, String, Integer, String, Integer>>() {

                    private final MapStateDescriptor<Void, Map<String, Tuple2<String, Integer>>> configDescriptor = new MapStateDescriptor<>("config", Types.VOID, Types.MAP(Types.STRING, Types.TUPLE(Types.STRING, Types.INT)));

                    @Override
                    public void processElement(Tuple4<String, String, String, Integer> value, BroadcastProcessFunction<Tuple4<String, String, String, Integer>, HashMap<String, Tuple2<String, Integer>>, Tuple6<String, String, String, Integer, String, Integer>>.ReadOnlyContext ctx, Collector<Tuple6<String, String, String, Integer, String, Integer>> out) throws Exception {
                        // userID from the event stream
                        String userID = value.f0;
                        // read the broadcast state
                        ReadOnlyBroadcastState<Void, Map<String, Tuple2<String, Integer>>> broadcastState = ctx.getBroadcastState(configDescriptor);
                        Map<String, Tuple2<String, Integer>> broadcastStateUserInfo = broadcastState.get(null);
                        // the state may still be empty if no configuration has been broadcast yet
                        if (broadcastStateUserInfo == null) {
                            return;
                        }
                        Tuple2<String, Integer> userInfo = broadcastStateUserInfo.get(userID);
                        if (userInfo != null) {
                            out.collect(Tuple6.of(value.f0, value.f1, value.f2, value.f3, userInfo.f0, userInfo.f1));
                        }
                    }

                    @Override
                    public void processBroadcastElement(HashMap<String, Tuple2<String, Integer>> value, BroadcastProcessFunction<Tuple4<String, String, String, Integer>, HashMap<String, Tuple2<String, Integer>>, Tuple6<String, String, String, Integer, String, Integer>>.Context ctx, Collector<Tuple6<String, String, String, Integer, String, Integer>> out) throws Exception {
                        BroadcastState<Void, Map<String, Tuple2<String, Integer>>> broadcastState = ctx.getBroadcastState(configDescriptor);
                        // drop the previous configuration
                        broadcastState.clear();
                        // store the new configuration
                        broadcastState.put(null, value);
                    }
                });

        resultStream.print();
        env.execute();
    }
}
class MySQLSource extends RichSourceFunction<HashMap<String, Tuple2<String, Integer>>> {

    private static final Logger log = LoggerFactory.getLogger(MySQLSource.class);

    private final String host;
    private final Integer port;
    private final String db;
    private final String user;
    private final String password;
    /** poll interval in milliseconds */
    private final Long interval;

    private volatile boolean isRunning = true;
    private Connection connection;
    private PreparedStatement preparedStatement;

    public MySQLSource(String host, Integer port, String db, String user, String password, Long interval) {
        this.host = host;
        this.port = port;
        this.db = db;
        this.user = user;
        this.password = password;
        this.interval = interval;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        Class.forName("com.mysql.cj.jdbc.Driver");
        connection = DriverManager.getConnection("jdbc:mysql://" + host + ":" + port + "/" + db + "?useUnicode=true&characterEncoding=UTF-8", user, password);
        String sql = "select userID, userName, userAge from user_info";
        preparedStatement = connection.prepareStatement(sql);
    }

    @Override
    public void close() throws Exception {
        super.close();
        // close the statement before the connection
        if (preparedStatement != null) {
            preparedStatement.close();
        }
        if (connection != null) {
            connection.close();
        }
    }

    @Override
    public void run(SourceContext<HashMap<String, Tuple2<String, Integer>>> ctx) throws Exception {
        try {
            while (isRunning) {
                HashMap<String, Tuple2<String, Integer>> output = new HashMap<>();
                ResultSet resultSet = preparedStatement.executeQuery();
                while (resultSet.next()) {
                    String userID = resultSet.getString("userID");
                    String userName = resultSet.getString("userName");
                    int userAge = resultSet.getInt("userAge");
                    output.put(userID, Tuple2.of(userName, userAge));
                }
                ctx.collect(output);
                Thread.sleep(interval);
            }
        } catch (Exception e) {
            log.error("failed to fetch the configuration from MySQL", e);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
When merging two streams we often do not want to simply throw their data together, but to match records on some field, much like a join in a relational database. In fact Flink's connect, given a key via keyBy, already implements something close to a SQL join, and connect plus a process function covers most dual-stream merging scenarios.
Process functions are a low-level interface, though, and some concrete tasks remain cumbersome with them; for instance, counting matches between two streams within a fixed time span requires registering timers and writing custom trigger logic. Flink therefore ships two built-in join operators, as well as a coGroup operator.
The most basic time-based operation is the time window. Flink provides a window join operator (Window Join) that defines a time window and pairs up, inside that window, the records of the two streams that share a common key.
The general form of a Window Join call is:
stream.join(otherStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(<WindowAssigner>)
    .apply(<JoinFunction>)
Here where takes the key selector for stream, and equalTo takes the key selector for otherStream; elements with equal keys that land in the same window can be matched, after which apply is called with a JoinFunction to process them. window takes a window assigner, which may be a tumbling, sliding, or session window. Note that apply is the only method available for the final processing step; there is no alternative. Note also that the JoinFunction is not itself a real window function; it merely defines the logic the window function applies to each matched pair when invoked.
The JoinFunction interface has three type parameters, the two input stream types and the output type, and declares a single join method to implement. Incoming records from both streams are first grouped by key and buffered in the corresponding window. When the window reaches its end time, the operator forms every combination of the window's records across the two streams, i.e. their Cartesian product, iterates over it, and passes each matched pair as arguments to the join method of the JoinFunction.
Besides a JoinFunction, apply also accepts a FlatJoinFunction. Usage is very similar, except that its join method has no return value and instead emits through a collector (Collector), so a single matched pair may produce any number of results, as the hedged sketch below shows.
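For illustration, a sketch of a FlatJoinFunction using the same tuple types as the window-join example below; the filter condition is made up for the sketch:

// Hypothetical FlatJoinFunction: emits zero or more results per matched pair
// instead of exactly one.
FlatJoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, String> flatJoin =
        new FlatJoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
            @Override
            public void join(Tuple2<String, Long> first, Tuple2<String, Long> second, Collector<String> out) throws Exception {
                // emit the pair only when the second event is not older than the first
                if (second.f1 >= first.f1) {
                    out.collect(first + " ==> " + second);
                }
            }
        };
// flatJoin can then be passed to .apply(...) in place of a JoinFunction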
Flink's window join behaves like an inner join: the output contains only those records of the two streams whose keys were successfully matched.
public class WindowJoinExample {

    public static void main(String[] args) throws Exception {
        // create the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // set the parallelism
        env.setParallelism(1);

        // two sources with monotonously increasing timestamps
        SingleOutputStreamOperator<Tuple2<String, Long>> stream1 = env.fromElements(
                        Tuple2.of("a", 1000L),
                        Tuple2.of("b", 1000L),
                        Tuple2.of("a", 2000L),
                        Tuple2.of("b", 2000L))
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps()
                        .withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
                            @Override
                            public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
                                return element.f1;
                            }
                        }));

        SingleOutputStreamOperator<Tuple2<String, Long>> stream2 = env.fromElements(
                        Tuple2.of("a", 3000L),
                        Tuple2.of("b", 3000L),
                        Tuple2.of("a", 4000L),
                        Tuple2.of("b", 4000L))
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps()
                        .withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
                            @Override
                            public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
                                return element.f1;
                            }
                        }));

        // join the two streams on f0 within 2-second tumbling event-time windows
        stream1.join(stream2)
                .where(data -> data.f0)
                .equalTo(data -> data.f0)
                .window(TumblingEventTimeWindows.of(Time.seconds(2)))
                .apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
                    @Override
                    public String join(Tuple2<String, Long> first, Tuple2<String, Long> second) throws Exception {
                        return first + " ==> " + second;
                    }
                })
                .print();

        env.execute();
    }
}
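Tracing the example: with 2-second tumbling windows, only the window [2000, 4000) contains elements from both streams ((a,2000) and (b,2000) from stream1, (a,3000) and (b,3000) from stream2), so the job should print just the per-key pairs, modulo ordering:

(a,2000) ==> (a,3000)
(b,2000) ==> (b,3000)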
Under the hood, the window join is implemented via coGroup: the JoinFunction is wrapped into a JoinCoGroupFunction through that class's constructor and processed as a coGroup.
In some scenarios the time span of interest is not fixed, and a window join clearly cannot cope. For these, Flink offers a merging operation called Interval Join: for every record of one stream, it opens an interval around that record's timestamp and checks whether any record of the other stream falls inside it.
An interval join is given two offsets, an upper bound (upperBound) and a lower bound (lowerBound). For an element a of stream left, this spans the interval [a.timestamp + lowerBound, a.timestamp + upperBound]; if the timestamp of an element b of stream right lies within this interval, a and b match.
Note that the two streams of an interval join must also be keyed by the same key, and that a.timestamp + lowerBound must be less than or equal to a.timestamp + upperBound, where lowerBound and upperBound may each be positive or negative; currently only event-time semantics are supported.
The general form of an Interval Join call is:
orangeStream
    .keyBy(<KeySelector>)
    .intervalJoin(greenStream.keyBy(<KeySelector>))
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessJoinFunction<Integer, Integer, String>() {
        @Override
        public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
            out.collect(left + "," + right);
        }
    });
An interval join is likewise a kind of inner join. Unlike a window join, its matching interval is anchored to each individual record rather than to fixed, aligned windows, so the interval boundaries vary from record to record; moreover, a record of one stream may fall into the intervals of several records of the other and thus be matched more than once.
public class IntervalJoinExample {

    public static void main(String[] args) throws Exception {
        // create the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // set the parallelism
        env.setParallelism(1);

        // order stream: (user, order id, timestamp)
        SingleOutputStreamOperator<Tuple3<String, String, Long>> stream1 = env.fromElements(
                Tuple3.of("Mary", "order-1", 5000L),
                Tuple3.of("Alice", "order-2", 5000L),
                Tuple3.of("Bob", "order-3", 20000L),
                Tuple3.of("Alice", "order-4", 20000L),
                Tuple3.of("Cary", "order-5", 51000L)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple3<String, String, Long>>forMonotonousTimestamps()
                .withTimestampAssigner(new SerializableTimestampAssigner<Tuple3<String, String, Long>>() {
                    @Override
                    public long extractTimestamp(Tuple3<String, String, Long> element, long recordTimestamp) {
                        return element.f2;
                    }
                }));

        // click stream
        SingleOutputStreamOperator<Event> stream2 = env.fromElements(
                new Event("Bob", "./cart", 2000L),
                new Event("Alice", "./prod?id=100", 3000L),
                new Event("Alice", "./prod?id=200", 3500L),
                new Event("Bob", "./prod?id=2", 2500L),
                new Event("Alice", "./prod?id=300", 36000L),
                new Event("Bob", "./home", 30000L),
                new Event("Bob", "./prod?id=1", 23000L),
                new Event("Bob", "./prod?id=3", 33000L)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forMonotonousTimestamps()
                .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                    @Override
                    public long extractTimestamp(Event element, long recordTimestamp) {
                        return element.getTimestamp();
                    }
                }));

        // for each order, look for clicks by the same user within [-2s, +2s] of the order
        stream1.keyBy(data -> data.f0)
                .intervalJoin(stream2.keyBy(Event::getUser))
                .between(Time.seconds(-2), Time.seconds(2))
                .process(new IntervalJoinProcessJoinFunction())
                .print();

        env.execute();
    }
}
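The example references an IntervalJoinProcessJoinFunction that is not shown in the article. A minimal sketch, assuming it simply emits each matched (order, click) pair:

// Hypothetical implementation of the ProcessJoinFunction referenced above.
public class IntervalJoinProcessJoinFunction
        extends ProcessJoinFunction<Tuple3<String, String, Long>, Event, String> {

    @Override
    public void processElement(Tuple3<String, String, Long> left, Event right, Context ctx, Collector<String> out) throws Exception {
        // left is the order, right is the click that fell into the order's interval
        out.collect(left + " => " + right);
    }
}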
With an interval join, process is the only method that can be called, taking a ProcessJoinFunction; under the hood the interval join is implemented on top of connect.
Besides Window Join and Interval Join, Flink also offers a Window CoGroup operation. Since the window join is itself implemented on top of window coGroup, the usage is very similar: the two streams are combined, windowed, and their elements matched, and the call differs only in replacing join with coGroup.
The general form of a Window CoGroup call is:
dataStream.coGroup(otherStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply(<CoGroupFunction>)
The coGroup method of CoGroupFunction is somewhat similar to the join method of FlatJoinFunction: it also takes three parameters, the data from the two streams plus a collector (Collector) for emitting results. The difference is that its first two parameters are iterable collections rather than a single matched pair. In other words, the operator no longer computes the Cartesian product of the two streams' window contents; it hands all the data over at once and leaves the matching logic entirely up to you. This makes CoGroup more flexible than Join: with CoGroup you can implement left join, right join, and full join.
public class CoGroupExample {

    public static void main(String[] args) throws Exception {
        // create the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // set the parallelism
        env.setParallelism(1);

        // add the sources
        SingleOutputStreamOperator<Tuple2<String, Long>> stream1 = env.fromElements(
                Tuple2.of("a", 1000L),
                Tuple2.of("b", 1000L),
                Tuple2.of("c", 1000L),
                Tuple2.of("a", 3000L),
                Tuple2.of("b", 3000L)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps()
                .withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
                    @Override
                    public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
                        return element.f1;
                    }
                }));

        SingleOutputStreamOperator<Tuple2<String, Long>> stream2 = env.fromElements(
                Tuple2.of("a", 2000L),
                Tuple2.of("b", 2000L),
                Tuple2.of("a", 5000L),
                Tuple2.of("b", 5000L)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps()
                .withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
                    @Override
                    public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
                        return element.f1;
                    }
                }));

        // coGroup with left-join semantics: unmatched left elements are padded with null
        stream1.coGroup(stream2)
                .where(data -> data.f0)
                .equalTo(data -> data.f0)
                .window(TumblingEventTimeWindows.of(Time.seconds(2)))
                .apply(new CoGroupFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
                    @Override
                    public void coGroup(Iterable<Tuple2<String, Long>> first, Iterable<Tuple2<String, Long>> second, Collector<String> out) throws Exception {
                        for (Tuple2<String, Long> left : first) {
                            boolean isMatched = false;
                            for (Tuple2<String, Long> right : second) {
                                // emit every matched pair
                                out.collect(left + " ==> " + right);
                                isMatched = true;
                            }
                            if (!isMatched) {
                                // no counterpart in this window: left-join padding
                                out.collect(left + " ==> " + "null");
                            }
                        }
                    }
                })
                .print();

        env.execute();
    }
}
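The CoGroupFunction above implements a left join. To illustrate the flexibility claimed earlier, here is a hedged sketch of a full (outer) join variant with the same types; within a window every (left, right) combination counts as a match, so the unmatched side is simply the one whose counterpart iterable is empty:

// Hypothetical full-join variant: pads whichever side has no counterpart.
CoGroupFunction<Tuple2<String, Long>, Tuple2<String, Long>, String> fullJoin =
        new CoGroupFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
            @Override
            public void coGroup(Iterable<Tuple2<String, Long>> first, Iterable<Tuple2<String, Long>> second, Collector<String> out) throws Exception {
                boolean leftEmpty = !first.iterator().hasNext();
                boolean rightEmpty = !second.iterator().hasNext();
                if (leftEmpty) {
                    for (Tuple2<String, Long> right : second) {
                        out.collect("null ==> " + right);         // right-join padding
                    }
                } else if (rightEmpty) {
                    for (Tuple2<String, Long> left : first) {
                        out.collect(left + " ==> null");          // left-join padding
                    }
                } else {
                    for (Tuple2<String, Long> left : first) {
                        for (Tuple2<String, Long> right : second) {
                            out.collect(left + " ==> " + right);  // inner matches
                        }
                    }
                }
            }
        };
// fullJoin can be passed to .apply(...) exactly like the left-join version above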