It's been a while since my last post, so here is an update. This post is mainly about dual-stream (two-stream) joins in Flink.
Before getting into dual-stream joins, first get familiar with how windows are defined (see the recommended post), and then with how watermarks are defined (see the recommended post).
After reading the second post, I switched the data source to Kafka, but the window was never triggered and no window information was printed. I then found the answer on the Flink developer mailing list, and it makes the cause obvious: "One of the source parallelisms doesn't have data."
The topic being consumed has 10 partitions, and Flink reads those partitions in parallel. Data from every partition flows into the window, and the watermark that triggers the window is the minimum watermark across all partitions. The problem was that during testing I only ever sent messages to a single partition, so the window could never fire. Changing the Kafka topic to a single partition solved the problem. (See the official explanation.)
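The minimum rule above can be sketched in plain Java (the helper below is mine for illustration, not Flink API): an operator's effective watermark is the minimum over its input partitions, so a single idle partition that never emits a watermark pins the whole pipeline at `Long.MIN_VALUE` and no event-time window can fire.

```java
import java.util.Arrays;

public class MinWatermarkSketch {
    // The effective watermark of an operator is the minimum over all input channels.
    static long effectiveWatermark(long[] partitionWatermarks) {
        return Arrays.stream(partitionWatermarks).min().orElse(Long.MIN_VALUE);
    }

    public static void main(String[] args) {
        // 10 partitions, but test data was only ever sent to partition 0:
        long[] watermarks = new long[10];
        Arrays.fill(watermarks, Long.MIN_VALUE); // idle partitions never advance
        watermarks[0] = 1_545_000_010_000L;      // only this partition makes progress

        // The window operator sees the minimum, so it stays stuck at MIN_VALUE.
        System.out.println(effectiveWatermark(watermarks) == Long.MIN_VALUE); // true

        // With a single partition (or data in every partition), the watermark moves.
        System.out.println(effectiveWatermark(new long[]{1_545_000_010_000L}));
    }
}
```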
Test code:
public class WatermarkTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", Constants.KAFKA_BROKER);
        properties.setProperty("group.id", "crm_stream_window");
        properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");

        DataStream<String> stream =
                env.addSource(new FlinkKafkaConsumer011<>("test", new SimpleStringSchema(), properties));

        // Parse "key timestamp" messages into (key, eventTime) pairs.
        DataStream<Tuple2<String, Long>> inputMap = stream.map(new MapFunction<String, Tuple2<String, Long>>() {
            private static final long serialVersionUID = -8812094804806854937L;

            @Override
            public Tuple2<String, Long> map(String value) throws Exception {
                String[] parts = value.split("\\W+");
                return new Tuple2<>(parts[0], Long.valueOf(parts[1]));
            }
        });

        DataStream<Tuple2<String, Long>> watermark =
                inputMap.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Tuple2<String, Long>>() {
                    private static final long serialVersionUID = 8252616297345284790L;
                    Long currentMaxTimestamp = 0L;
                    Long maxOutOfOrderness = 10000L; // maximum allowed out-of-orderness: 10s
                    // Initialized up front so the debug print below never dereferences null
                    // (extractTimestamp can run before the first periodic watermark).
                    Watermark watermark = new Watermark(currentMaxTimestamp - maxOutOfOrderness);
                    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");

                    @Nullable
                    @Override
                    public Watermark getCurrentWatermark() {
                        watermark = new Watermark(currentMaxTimestamp - maxOutOfOrderness);
                        return watermark;
                    }

                    @Override
                    public long extractTimestamp(Tuple2<String, Long> element, long previousElementTimestamp) {
                        Long timestamp = element.f1;
                        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
                        System.out.println("timestamp : " + element.f1 + "|" + format.format(element.f1)
                                + " currentMaxTimestamp : " + currentMaxTimestamp + "|" + format.format(currentMaxTimestamp)
                                + ", watermark : " + watermark.getTimestamp() + "|" + format.format(watermark.getTimestamp()));
                        return timestamp;
                    }
                });

        watermark.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(3)))
                .apply(new WindowFunction<Tuple2<String, Long>, String, Tuple, TimeWindow>() {
                    private static final long serialVersionUID = 7813420265419629362L;

                    @Override
                    public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple2<String, Long>> input, Collector<String> out) throws Exception {
                        System.out.println(tuple);
                        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
                        out.collect("window " + format.format(window.getStart()) + " window " + format.format(window.getEnd()));
                    }
                }).print();

        env.execute("window test");
    }
}
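The interaction between the periodic watermark and the tumbling window in the code above can be checked in isolation. A minimal plain-Java sketch (the class and method names here are mine, not Flink's), assuming Flink's usual trigger condition that a window [start, end) fires once the watermark reaches end - 1:

```java
public class WatermarkTriggerSketch {
    static long currentMaxTimestamp = 0L;
    static final long MAX_OUT_OF_ORDERNESS = 10_000L; // 10s, as in the assigner above

    // Mirrors extractTimestamp(): track the largest event time seen so far.
    static void onEvent(long eventTimestamp) {
        currentMaxTimestamp = Math.max(eventTimestamp, currentMaxTimestamp);
    }

    // Mirrors getCurrentWatermark(): lag behind the max timestamp by the allowed lateness.
    static long watermark() {
        return currentMaxTimestamp - MAX_OUT_OF_ORDERNESS;
    }

    // An event-time window [start, end) fires once the watermark passes end - 1.
    static boolean windowFires(long windowEnd) {
        return watermark() >= windowEnd - 1;
    }

    public static void main(String[] args) {
        onEvent(1_000L);
        // watermark = 1000 - 10000 = -9000, so the 3s window [0, 3000) cannot fire yet
        System.out.println(windowFires(3_000L)); // false
        onEvent(13_000L);
        // watermark = 13000 - 10000 = 3000 >= 2999, so the window [0, 3000) fires
        System.out.println(windowFires(3_000L)); // true
    }
}
```

This is why sending one event is never enough to see window output: a later event at least `windowEnd + maxOutOfOrderness` must arrive to push the watermark past the window boundary.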
With the single-stream problem solved, on to the dual-stream join. The event-time window of a two-stream join works much like the single-stream case: assign event time to both streams and specify the allowed lateness. The one thing to watch out for is that the watermark that triggers the window is the slower of the two streams' watermarks: if stream A's watermark is 10 and stream B's is 20, the join operates with a watermark of 10. (See the recommended answer.) The join below is a left join: all records in the window are taken out, and after a logical check the needed records are sent downstream.
Dual-stream code:
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tableEnv = StreamTableEnvironment.getTableEnvironment(env);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    Kafka011JsonTableSource orderSource = new OrderStreamKafkaReader("test").getTableSource("crm_stream");
    tableEnv.registerTableSource("orderDetail", orderSource);
    Table orderDetailTable = tableEnv.sqlQuery("SELECT * FROM orderDetail");

    // getItemTable(): JSON data read from Kafka
    Table itemIdTable = new StreamJoinTest().getItemTable(orderDetailTable);
    DataStream<OrderDetailInfo4Stat> dataStream = tableEnv.toAppendStream(itemIdTable, OrderDetailInfo4Stat.class)
            .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<OrderDetailInfo4Stat>(Time.seconds(5)) {
                private static final long serialVersionUID = -4075075569327086395L;

                @Override
                public long extractTimestamp(OrderDetailInfo4Stat element) {
                    return element.getOrderLastModifyTime().getTime();
                }
            });
    dataStream.print();

    Kafka011JsonTableSource couponCodeSource = new CouponCodeStreamKafkaReader(Constants.COUPON_CODE_INFO).getTableSource("crm_stream");
    tableEnv.registerTableSource("CouponCodeTable", couponCodeSource);
    DataStream<CouponCodeInfo4Stat> couponCodeStream = tableEnv.toAppendStream(tableEnv.sqlQuery("select * from CouponCodeTable"), CouponCodeInfo4Stat.class)
            .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<CouponCodeInfo4Stat>(Time.seconds(5)) {
                private static final long serialVersionUID = 8867904924628932955L;

                @Override
                public long extractTimestamp(CouponCodeInfo4Stat element) {
                    return element.getLastModifyTime().getTime();
                }
            });
    couponCodeStream.print();

    DataStream<OrderDetailInfo4Stat> realDataStream = dataStream.coGroup(couponCodeStream)
            .where((KeySelector<OrderDetailInfo4Stat, String>) OrderDetailInfo4Stat::getOrderBatchId)
            .equalTo((KeySelector<CouponCodeInfo4Stat, String>) CouponCodeInfo4Stat::getOrderBatchId)
            .window(TumblingEventTimeWindows.of(Time.seconds(5)))
            .apply(new CoGroupFunction<OrderDetailInfo4Stat, CouponCodeInfo4Stat, OrderDetailInfo4Stat>() {
                private static final long serialVersionUID = -1240462675125711867L;

                @Override
                public void coGroup(Iterable<OrderDetailInfo4Stat> first, Iterable<CouponCodeInfo4Stat> second, Collector<OrderDetailInfo4Stat> out) throws Exception {
                    // Left join: every order is emitted, enriched when a coupon code matches.
                    for (OrderDetailInfo4Stat orderDetailInfo4Stat : first) {
                        for (CouponCodeInfo4Stat couponCodeInfo4Stat : second) {
                            if (orderDetailInfo4Stat.getOrderBatchId().equals(couponCodeInfo4Stat.getOrderBatchId())) {
                                orderDetailInfo4Stat.setOrderSalesmanId(couponCodeInfo4Stat.getReceiveUserId());
                            }
                        }
                        out.collect(orderDetailInfo4Stat);
                    }
                }
            });
    realDataStream.print();

    env.execute("stream join test");
}
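The left-join semantics of the coGroup above can be illustrated without Flink. A plain-Java sketch with simplified stand-in records (my own `Order`/`Coupon` types in place of OrderDetailInfo4Stat/CouponCodeInfo4Stat) showing that every order in the window is emitted, enriched only when a matching coupon code exists:

```java
import java.util.ArrayList;
import java.util.List;

public class CoGroupLeftJoinSketch {
    // Simplified stand-ins for OrderDetailInfo4Stat / CouponCodeInfo4Stat.
    record Order(String batchId, String salesmanId) {}
    record Coupon(String batchId, String receiveUserId) {}

    // Mirrors coGroup(): emit every order; copy the user id when a coupon matches.
    static List<Order> leftJoin(Iterable<Order> first, Iterable<Coupon> second) {
        List<Order> out = new ArrayList<>();
        for (Order order : first) {
            String salesmanId = order.salesmanId();
            for (Coupon coupon : second) {
                if (order.batchId().equals(coupon.batchId())) {
                    salesmanId = coupon.receiveUserId();
                }
            }
            out.add(new Order(order.batchId(), salesmanId)); // emitted even with no match
        }
        return out;
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(new Order("b1", null), new Order("b2", null));
        List<Coupon> coupons = List.of(new Coupon("b1", "user42")); // no coupon for b2
        List<Order> joined = leftJoin(orders, coupons);
        System.out.println(joined.get(0).salesmanId()); // user42
        System.out.println(joined.get(1).salesmanId()); // null (left side still emitted)
    }
}
```

Note that this (like the Flink version) only sees records that fall into the same 5-second event-time window; an order whose coupon arrives in a later window goes downstream unenriched.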
That's it for this post.
Keep at it, Pikachu.