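0) This article works through several code examples of Flink watermarks to build a better understanding of how they behave.
1) Kinds of time
This article is mainly concerned with event time.
2) Flink windows include tumbling windows and sliding windows; this article uses tumbling windows.
3) Using the code examples, this article covers watermarks, windows, the window's allowed lateness setting, and the side output for late data (sideOutputLateData), along with what each of them is for.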
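1) Definition (my own summary, not the official one): a watermark is a timestamped marker that Flink inserts into the data stream. When a watermark reaches an operator (a window operator, for example), the operator assumes that every business record with a timestamp no later than the watermark has already arrived. (A record here can be thought of as a log line, or a temperature reading collected by a sensor.)
2) Purpose: watermarks make it possible to process out-of-order streams (the code examples below demonstrate this).
3) How is a watermark generated?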
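One way to picture the generation step is the plain-Java sketch below. It does not use Flink; the class name BoundedWatermarkSketch and its methods are made up for illustration. It assumes the bounded-out-of-orderness rule used later in this article: remember the largest event timestamp seen so far and report that value minus the configured delay minus 1 ms. In a real job Flink emits the watermark periodically rather than on demand.

```java
import java.time.Duration;

/** Plain-Java sketch of a bounded-out-of-orderness watermark (hypothetical class, no Flink dependency). */
final class BoundedWatermarkSketch {
    private final long delayMillis;
    private long maxTimestamp;

    BoundedWatermarkSketch(Duration maxOutOfOrderness) {
        this.delayMillis = maxOutOfOrderness.toMillis();
        // Start low enough that currentWatermark() cannot underflow before the first event.
        this.maxTimestamp = Long.MIN_VALUE + delayMillis + 1;
    }

    /** Called for every event: remember the largest event timestamp seen so far. */
    void onEvent(long eventTimestampMillis) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestampMillis);
    }

    /** The watermark claims that no event with a timestamp at or below this value is still expected. */
    long currentWatermark() {
        return maxTimestamp - delayMillis - 1;
    }

    public static void main(String[] args) {
        BoundedWatermarkSketch wm = new BoundedWatermarkSketch(Duration.ofSeconds(5));
        wm.onEvent(11_000L); // event at second 11
        wm.onEvent(9_000L);  // an out-of-order event does not move the watermark backwards
        System.out.println(wm.currentWatermark()); // 11000 - 5000 - 1 = 5999
    }
}
```

The point of the sketch is that an out-of-order event never lowers the watermark, and that a larger delay makes the watermark lag further behind the newest event.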
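1) Build a 10-second tumbling window operator (every 10 s a new 10 s window opens). The watermark is derived from the timestamp of the temperature bean, with a delay of 0 seconds, as follows:
The window collects id values: the ids of all elements that fall into the same window are gathered together.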
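import java.math.BigDecimal;
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowTest3_EventTimeWatermarkWindow3 {
    public static void main(String[] args) throws Exception {
        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // Read data from a socket; for the data format see sensorTimeWatermarkWindow.txt
        // DataStream<String> textStream = env.readTextFile("D:\\workbench_idea\\diydata\\flinkdemo2\\src\\main\\resources\\sensorTimeWatermarkWindow.txt");
        // Start the sender first, e.g. nc -lk 7778 (the port must match the one below)
        DataStream<String> textStream = env.socketTextStream("192.168.163.201", 7778);
        // Convert each line (id,type,time,temperature) into the sensor-reading POJO
        DataStream<SensorReadingTimeWatermarkWindow> sensorStream = textStream.map(x -> {
            String[] arr = x.split(",");
            return new SensorReadingTimeWatermarkWindow(arr[0], arr[1], arr[2], new BigDecimal(arr[3]));
        });
        // Extract timestamps. The watermark delay is 0 seconds here, so the watermark follows
        // the largest event time seen so far. Windows are driven by the watermark,
        // not by the event time of each record.
        SingleOutputStreamOperator<SensorReadingTimeWatermarkWindow> streamWithWatermark =
            sensorStream.assignTimestampsAndWatermarks(
                WatermarkStrategy.<SensorReadingTimeWatermarkWindow>forBoundedOutOfOrderness(Duration.ofSeconds(0))
                    .withTimestampAssigner((event, timestamp) -> event.getTimestamp().getTime())
            );
        // Window and aggregate: collect the ids of all elements in the same window
        SingleOutputStreamOperator<String> aggForWindowStream =
            streamWithWatermark.keyBy(SensorReadingTimeWatermarkWindow::getType)
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .aggregate(new AggregateFunction<SensorReadingTimeWatermarkWindow, String, String>() {
                    @Override
                    public String createAccumulator() {
                        return "";
                    }
                    @Override
                    public String add(SensorReadingTimeWatermarkWindow value, String acc) {
                        // Avoid a leading separator for the first id in the window
                        return acc.isEmpty() ? value.getId() : acc + ", " + value.getId();
                    }
                    @Override
                    public String getResult(String acc) {
                        return acc;
                    }
                    @Override
                    public String merge(String a, String b) {
                        return a + ", " + b;
                    }
                });
        // Print
        aggForWindowStream.print("aggForWindowStream");
        // Execute
        env.execute("aggForWindowStream");
    }
}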
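In the code above the watermark delay is 0 s, so the watermark follows the largest event timestamp seen so far (strictly speaking, Flink's bounded-out-of-orderness generator emits that maximum minus the delay minus 1 ms).
Each element is modelled as a sensor-reading bean, as follows: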
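import java.math.BigDecimal;
import java.sql.Timestamp;
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class SensorReadingTimeWatermarkWindow {
    private String id;
    private String type;
    private Timestamp timestamp;
    private BigDecimal temperature;

    public SensorReadingTimeWatermarkWindow() {
    }

    public SensorReadingTimeWatermarkWindow(String id, String type, String timeStr, BigDecimal temperature) {
        this.id = id;
        this.type = type;
        this.temperature = temperature;
        this.parseTimestamp(timeStr);
    }

    // Parse "yyyy-MM-dd HH:mm:ss"; fall back to the current time if the text cannot be parsed
    private void parseTimestamp(String timeStr) {
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        try {
            this.timestamp = new Timestamp(simpleDateFormat.parse(timeStr).getTime());
        } catch (ParseException e) {
            this.timestamp = new Timestamp(System.currentTimeMillis());
        }
    }

    // Getters used by the job (keyBy, timestamp assigner, aggregate)
    public String getId() { return id; }
    public String getType() { return type; }
    public Timestamp getTimestamp() { return timestamp; }
    public BigDecimal getTemperature() { return temperature; }
}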
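The job reads the socket text stream sent by an nc client. The window operator's results are as follows:
Details (the text after -> is what the window operator printed once that line had been entered):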
1,sensor1,2022-04-17 22:07:01,36.1
7,sensor1,2022-04-17 22:07:02,36.7
8,sensor1,2022-04-17 22:07:04,36.8
11,sensor1,2022-04-17 22:07:07,36.9
12,sensor1,2022-04-17 22:07:11,36.9 -> 1, 7, 8, 11
13,sensor1,2022-04-17 22:07:09,36.9
15,sensor1,2022-04-17 22:07:16,36.9
16,sensor1,2022-04-17 22:07:23,36.9 -> 12,15
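[Result analysis]
Where did event 13 go? Flink dropped it because it arrived late: its timestamp 22:07:09 belongs to the first window, which had already fired and closed when event 12 (22:07:11) pushed the watermark past 22:07:10.
[Note] Windows are left-closed and right-open. In this example the first window covers seconds [0, 10) of minute 22:07 and the second covers [10, 20).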
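The left-closed, right-open rule can be checked with a few lines of plain Java. This is only a sketch: TumblingWindowSketch is a made-up helper, not a Flink class, and it assumes windows aligned to the epoch with no offset, which is the default for the tumbling windows used here.

```java
/** Plain-Java sketch of tumbling-window assignment (hypothetical helper, windows aligned to the epoch). */
final class TumblingWindowSketch {

    /** Inclusive start of the window that contains the given timestamp. */
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    /** Exclusive end of that window. */
    static long windowEnd(long timestampMillis, long sizeMillis) {
        return windowStart(timestampMillis, sizeMillis) + sizeMillis;
    }

    public static void main(String[] args) {
        long size = 10_000L;
        // Timestamps are written as milliseconds within minute 22:07 to keep the numbers readable.
        System.out.println(windowStart(9_000L, size));  // 0: second 9 is in [0, 10)
        System.out.println(windowStart(10_000L, size)); // 10000: second 10 already belongs to [10, 20)
        System.out.println(windowEnd(11_000L, size));   // 20000
    }
}
```

Second 10 itself starts the next window, which is why an event stamped 22:07:10 would be collected together with event 12 rather than with events 1 to 11.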
1) Change the watermark code above to use a delay of 5 s and enter the data again. The result is as follows:
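// Extract timestamps and let the watermark lag 5 seconds behind the largest event time seen so far
// (e.g. when the largest event time is 20:00:10, the watermark is about 20:00:05).
// Windows are driven by the watermark, not by the event time of each record.
SingleOutputStreamOperator<SensorReadingTimeWatermarkWindow> streamWithWatermark =
    sensorStream.assignTimestampsAndWatermarks(
        WatermarkStrategy.<SensorReadingTimeWatermarkWindow>forBoundedOutOfOrderness(Duration.ofSeconds(5)) // watermark delay changed to 5 s
            .withTimestampAssigner((event, timestamp) -> event.getTimestamp().getTime())
    );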
1,sensor1,2022-04-17 22:07:01,36.1
7,sensor1,2022-04-17 22:07:02,36.7
8,sensor1,2022-04-17 22:07:04,36.8
11,sensor1,2022-04-17 22:07:07,36.9
12,sensor1,2022-04-17 22:07:11,36.9
13,sensor1,2022-04-17 22:07:09,36.9
15,sensor1,2022-04-17 22:07:16,36.9 -> 1, 7, 8, 11, 13
16,sensor1,2022-04-17 22:07:23,36.9
21,sensor1,2022-04-17 22:07:20,36.9
22,sensor1,2022-04-17 22:07:25,36.9 -> 12, 15
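[Result analysis]
With a 5 s delay the watermark does not pass 22:07:10 until event 15 (22:07:16) arrives, so event 13 (22:07:09) is still collected into the first window. The second window [10, 20) fires when event 22 (22:07:25) brings the watermark up to 22:07:20. Events 16 and 21 belong to the third window [20, 30), which has not fired yet.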
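The two runs so far can be replayed without a Flink cluster. The sketch below is my own simplified model, not Flink code: WatermarkDelayReplay and its replay method are made-up names, timestamps are milliseconds within minute 22:07, the watermark is assumed to advance right after every event, and there is no allowed lateness, so an event whose window has already fired is dropped.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Flink-free replay of a 10 s tumbling event-time window (hypothetical, simplified model). */
final class WatermarkDelayReplay {
    static final long WINDOW_SIZE = 10_000L;

    /** events[i] = {id, event time in ms}. Returns the collected ids of each fired window, in firing order. */
    static List<String> replay(long[][] events, long delayMillis) {
        Map<Long, List<Long>> open = new TreeMap<>(); // window start -> ids collected so far
        List<String> fired = new ArrayList<>();
        long maxTs = Long.MIN_VALUE + delayMillis + 1;
        long watermark = Long.MIN_VALUE;
        for (long[] e : events) {
            long id = e[0];
            long ts = e[1];
            long start = ts - Math.floorMod(ts, WINDOW_SIZE);
            long lastTs = start + WINDOW_SIZE - 1; // largest timestamp the window can hold
            if (lastTs > watermark) {
                open.computeIfAbsent(start, k -> new ArrayList<>()).add(id);
            } // otherwise the window has already fired and the event is dropped
            maxTs = Math.max(maxTs, ts);
            watermark = maxTs - delayMillis - 1;
            // Fire and close every window whose last timestamp the watermark has reached.
            Iterator<Map.Entry<Long, List<Long>>> it = open.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<Long, List<Long>> w = it.next();
                if (w.getKey() + WINDOW_SIZE - 1 <= watermark) {
                    fired.add(w.getValue().toString());
                    it.remove();
                }
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        long[][] run1 = {{1, 1_000}, {7, 2_000}, {8, 4_000}, {11, 7_000},
                         {12, 11_000}, {13, 9_000}, {15, 16_000}, {16, 23_000}};
        System.out.println(replay(run1, 0L));     // [[1, 7, 8, 11], [12, 15]]
        long[][] run2 = {{1, 1_000}, {7, 2_000}, {8, 4_000}, {11, 7_000}, {12, 11_000},
                         {13, 9_000}, {15, 16_000}, {16, 23_000}, {21, 20_000}, {22, 25_000}};
        System.out.println(replay(run2, 5_000L)); // [[1, 7, 8, 11, 13], [12, 15]]
    }
}
```

With a delay of 0 the model loses event 13, and with a delay of 5 s it keeps it, which agrees with the two traces recorded above.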
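In addition, a window has an allowed lateness setting, which specifies how long Flink keeps the window open after the watermark has passed the end of the window.
The code below still creates a 10 s window every 10 s, and the window still fires as soon as the watermark passes its end. For a further 2 s of watermark time the window is kept open, and each late element that arrives for it triggers another firing with the updated result. (Without allowedLateness, or with a value of 0, the window closes as soon as it fires.)
The code changes as follows: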
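SingleOutputStreamOperator<String> aggForWindowStream =
    streamWithWatermark.keyBy(SensorReadingTimeWatermarkWindow::getType)
        .window(TumblingEventTimeWindows.of(Time.seconds(10)))
        .allowedLateness(Time.seconds(2)) // keep the window open for 2 more seconds before closing it
        // the .aggregate(...) call follows here, unchanged from the first example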
The window operator's results are as follows:
Details:
1,sensor1,2022-04-17 22:07:01,36.1
7,sensor1,2022-04-17 22:07:02,36.7
8,sensor1,2022-04-17 22:07:04,36.8
11,sensor1,2022-04-17 22:07:07,36.9
12,sensor1,2022-04-17 22:07:11,36.9
13,sensor1,2022-04-17 22:07:09,36.9
15,sensor1,2022-04-17 22:07:15,36.9 -> 1, 7, 8, 11, 13
16,sensor1,2022-04-17 22:07:09,36.9 -> 1, 7, 8, 11, 13, 16
17,sensor1,2022-04-17 22:07:16,36.9
18,sensor1,2022-04-17 22:07:09,36.9 -> 1, 7, 8, 11, 13, 16, 18
19,sensor1,2022-04-17 22:07:17,36.9 window closed
20,sensor1,2022-04-17 22:07:09,36.9 dropped
21,sensor1,2022-04-17 22:07:20,36.9
22,sensor1,2022-04-17 22:07:25,36.9 -> 12, 15, 17, 19
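[Result analysis]
The example above shows how allowed lateness works. The first window fires when event 15 (22:07:15) brings the watermark up to the end of the window, fires again for each of the late events 16 and 18, and closes when event 19 (22:07:17) moves the watermark a further 2 s, the allowed lateness. Event 20 arrives after that and is dropped.
[Problem] If event 20 is simply dropped, the business requirement for data consistency is not met.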
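The three possible fates of an element in the trace above can be written as one small decision function. This is a plain-Java sketch under my own naming (LatenessSketch, Fate, classify); it assumes the rule that a window [start, end) holds timestamps up to end - 1 ms and that its state is discarded once the watermark reaches end - 1 ms plus the allowed lateness.

```java
/** Plain-Java sketch of what happens to an element arriving for a given window (hypothetical names). */
final class LatenessSketch {
    enum Fate { ON_TIME, LATE_BUT_ACCEPTED, DROPPED }

    static Fate classify(long windowEndMillis, long allowedLatenessMillis, long watermarkMillis) {
        long lastTs = windowEndMillis - 1;                  // largest timestamp inside [start, end)
        long cleanupTime = lastTs + allowedLatenessMillis;  // window state is discarded at this watermark
        if (cleanupTime <= watermarkMillis) {
            return Fate.DROPPED;                            // window already closed for good
        }
        if (lastTs <= watermarkMillis) {
            return Fate.LATE_BUT_ACCEPTED;                  // window already fired, fires again with the new element
        }
        return Fate.ON_TIME;                                // window has not fired yet
    }

    public static void main(String[] args) {
        long end = 10_000L;      // first window [0, 10) in ms within minute 22:07
        long lateness = 2_000L;  // allowedLateness(Time.seconds(2))
        // Watermark = largest event time - 5 s delay - 1 ms, taken from the trace above.
        System.out.println(classify(end, lateness, 5_999L));  // event 13, after event 12 (second 11): ON_TIME
        System.out.println(classify(end, lateness, 9_999L));  // event 16, after event 15 (second 15): LATE_BUT_ACCEPTED
        System.out.println(classify(end, lateness, 11_999L)); // event 20, after event 19 (second 17): DROPPED
    }
}
```

Setting the lateness argument to 0 collapses the middle case, so an element is then either on time or dropped.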
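Getting late data from a side output
Flink's side output feature gives you the late data as a stream of its own.
First, call sideOutputLateData(OutputTag) on the windowed stream to state that late data should be sent to a side output.
The code changes as follows, adding a side output stream: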
// Side output: late records that can no longer enter their window are sent here
// (requires: import org.apache.flink.util.OutputTag;)
OutputTag<SensorReadingTimeWatermarkWindow> lateOutputTag =
    new OutputTag<SensorReadingTimeWatermarkWindow>("late") {
    };
// Window and aggregate
SingleOutputStreamOperator<String> aggForWindowStream =
    streamWithWatermark.keyBy(SensorReadingTimeWatermarkWindow::getType)
        .window(TumblingEventTimeWindows.of(Time.seconds(10)))
        .allowedLateness(Time.seconds(2)) // keep the window open for 2 more seconds before closing it
        .sideOutputLateData(lateOutputTag) // records that cannot enter their window go to the side output
        .aggregate(new AggregateFunction<SensorReadingTimeWatermarkWindow, String, String>() {
            @Override
            public String createAccumulator() {
                return "";
            }
            @Override
            public String add(SensorReadingTimeWatermarkWindow value, String acc) {
                // Avoid a leading separator for the first id in the window
                return acc.isEmpty() ? value.getId() : acc + ", " + value.getId();
            }
            @Override
            public String getResult(String acc) {
                return acc;
            }
            @Override
            public String merge(String a, String b) {
                return a + ", " + b;
            }
        });
// Print the window operator's results
aggForWindowStream.print("aggForWindowStream");
// Print the side output
aggForWindowStream.getSideOutput(lateOutputTag).print("lateOutputTag");
// Execute
env.execute("aggForWindowStream");
The events in detail:
1,sensor1,2022-04-17 22:07:01,36.1
7,sensor1,2022-04-17 22:07:02,36.7
8,sensor1,2022-04-17 22:07:04,36.8
11,sensor1,2022-04-17 22:07:07,36.9
12,sensor1,2022-04-17 22:07:11,36.9
13,sensor1,2022-04-17 22:07:09,36.9
15,sensor1,2022-04-17 22:07:15,36.9 -> 1, 7, 8, 11, 13
16,sensor1,2022-04-17 22:07:09,36.9 -> 1, 7, 8, 11, 13, 16
17,sensor1,2022-04-17 22:07:16,36.9
18,sensor1,2022-04-17 22:07:09,36.9 -> 1, 7, 8, 11, 13, 16, 18
19,sensor1,2022-04-17 22:07:17,36.9 window closed
20,sensor1,2022-04-17 22:07:09,36.9 -> lateOutputTag> SensorReadingTimeWindow{id='20', type='sensor1', timestamp=2022-04-17 22:07:09.0, temperature=36.9}
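The trace above can be replayed in plain Java to see where each record ends up. This is a simplified model of my own, not Flink code: SideOutputReplay and its fields are made-up names, timestamps are milliseconds within minute 22:07, the watermark is assumed to advance right after every event, and the 5 s delay and 2 s allowed lateness are hard-coded to match the settings used in this article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Flink-free replay of a 10 s window with allowed lateness and a late-data side output (hypothetical model). */
final class SideOutputReplay {
    static final long SIZE = 10_000L;    // window length
    static final long DELAY = 5_000L;    // watermark delay
    static final long LATENESS = 2_000L; // allowed lateness

    final List<String> mainOutput = new ArrayList<>(); // one entry per window firing
    final List<Long> lateOutput = new ArrayList<>();   // ids routed to the side output

    /** events[i] = {id, event time in ms}. */
    void run(long[][] events) {
        Map<Long, List<Long>> state = new TreeMap<>(); // window start -> ids collected so far
        long maxTs = Long.MIN_VALUE + DELAY + 1;
        long watermark = Long.MIN_VALUE;
        for (long[] e : events) {
            long id = e[0];
            long ts = e[1];
            long start = ts - Math.floorMod(ts, SIZE);
            long lastTs = start + SIZE - 1;
            if (lastTs + LATENESS <= watermark) {
                lateOutput.add(id); // window closed for good: send to the side output
            } else {
                List<Long> ids = state.computeIfAbsent(start, k -> new ArrayList<>());
                ids.add(id);
                if (lastTs <= watermark) {
                    mainOutput.add(ids.toString()); // late but accepted: fire again
                }
            }
            maxTs = Math.max(maxTs, ts);
            final long advanced = maxTs - DELAY - 1;
            // On-time firing: the watermark has just reached the window's last timestamp.
            for (Map.Entry<Long, List<Long>> w : state.entrySet()) {
                long wLast = w.getKey() + SIZE - 1;
                if (watermark < wLast && wLast <= advanced) {
                    mainOutput.add(w.getValue().toString());
                }
            }
            // Close windows whose allowed lateness has expired.
            state.keySet().removeIf(s -> s + SIZE - 1 + LATENESS <= advanced);
            watermark = advanced;
        }
    }

    public static void main(String[] args) {
        SideOutputReplay replay = new SideOutputReplay();
        replay.run(new long[][] {{1, 1_000}, {7, 2_000}, {8, 4_000}, {11, 7_000}, {12, 11_000}, {13, 9_000},
                {15, 15_000}, {16, 9_000}, {17, 16_000}, {18, 9_000}, {19, 17_000}, {20, 9_000}});
        replay.mainOutput.forEach(System.out::println);
        System.out.println("late: " + replay.lateOutput); // late: [20]
    }
}
```

In this model event 20 shows up in lateOutput instead of disappearing, which is the behaviour sideOutputLateData adds on top of allowedLateness.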
Result analysis: