Flink中的窗口算子,是处理无界流的核心,窗口将流分割成很多的“buckets”,每个bucket可以基于时间、元素个数等进行分隔,针对每一个bucket可以做一些数据的计算。
窗口的类型:
滚动窗口:将数据根据固定的窗口长度进行分隔,窗口的长度固定,数据没有重叠。
滑动窗口:窗口的长度固定,根据滑动间隔确定窗口的返回,数据可以重叠。
回话窗口:一段时间内没有接收到新数据会生成新的窗口。
本文主要对滚动窗口做一些示例。
基于事件处理时间(ProcessingTime)的滚动窗口
(1)自定义数据源
每秒生成一个数字
public class IntegerSource implements SourceFunction<Integer> {
int i = 0;
@Override
public void run(SourceContext ctx) throws Exception {
while (true) {
ctx.collect(i++);
Thread.sleep(1000);
}
}
@Override
public void cancel() {
}
}
(2)示例
@Test
public void tumblingProcessingTimeWindowsTest() throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
DataStreamSource<Integer> source = env.addSource(new IntegerSource());
//基于ProcessingTime的滚动窗口,窗口长度时3s
source.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(3)))
.process(new ProcessAllWindowFunction<Integer, Integer, TimeWindow>() {
@Override
public void process(Context context, Iterable<Integer> elements, Collector<Integer> out) throws Exception {
Iterator<Integer> it = elements.iterator();
int sum = 0;
while (it.hasNext()) {
Integer next = it.next();
sum += next;
System.out.println("元素: " + next + " ,处理时间:" + new Date());
}
out.collect(sum);
}
})
.print();
env.execute("TumblingProcessingTimeWindows");
}
结果:
元素: 0 ,处理时间:Sat Apr 16 21:39:15 CST 2022
元素: 1 ,处理时间:Sat Apr 16 21:39:15 CST 2022
3> 1
元素: 2 ,处理时间:Sat Apr 16 21:39:18 CST 2022
元素: 3 ,处理时间:Sat Apr 16 21:39:18 CST 2022
元素: 4 ,处理时间:Sat Apr 16 21:39:18 CST 2022
4> 9
元素: 5 ,处理时间:Sat Apr 16 21:39:21 CST 2022
元素: 6 ,处理时间:Sat Apr 16 21:39:21 CST 2022
元素: 7 ,处理时间:Sat Apr 16 21:39:21 CST 2022
5> 18
带偏移量的基于基于ProcessingTime的滚动窗口
示例:
@Test
public void tumblingProcessingTimeWindowsWithOffsetTest() throws Exception {
SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.zzz");
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
DataStreamSource<Integer> source = env.addSource(new IntegerSource());
//基于ProcessingTime的滚动窗口,窗口长度时3s,偏移量是2s
source.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(3), Time.seconds(2)))
.process(new ProcessAllWindowFunction<Integer, Integer, TimeWindow>() {
@Override
public void process(Context context, Iterable<Integer> elements, Collector<Integer> out) throws Exception {
Iterator<Integer> it = elements.iterator();
int sum = 0;
while (it.hasNext()) {
Integer next = it.next();
sum += next;
System.out.println("元素: " + next + " ,处理时间:" + format.format(new Date()));
}
out.collect(sum);
}
})
.print();
env.execute("tumbling window");
}
结果:
元素: 0 ,处理时间:2022-04-16 21:43:02.CST
元素: 1 ,处理时间:2022-04-16 21:43:02.CST
元素: 2 ,处理时间:2022-04-16 21:43:02.CST
8> 3
元素: 3 ,处理时间:2022-04-16 21:43:05.CST
元素: 4 ,处理时间:2022-04-16 21:43:05.CST
元素: 5 ,处理时间:2022-04-16 21:43:05.CST
1> 12
元素: 6 ,处理时间:2022-04-16 21:43:08.CST
元素: 7 ,处理时间:2022-04-16 21:43:08.CST
元素: 8 ,处理时间:2022-04-16 21:43:08.CST
2> 21
设置偏移量为2s后,处理时间往后偏移了2s,本来应该在2022-04-16 21:43:03、2022-04-16 21:43:06时间点处理元素,现在偏移到了2022-04-16 21:43:05、2022-04-16 21:43:08
(1)数据源
为了将时间和数量区分开,将数据源调整为 0.5s 生产一个数据
public class IntegerSource implements SourceFunction<Integer> {
int i = 0;
@Override
public void run(SourceContext ctx) throws Exception {
while (true) {
ctx.collect(i++);
Thread.sleep(500);
}
}
@Override
public void cancel() {
}
}
(2)示例
@Test
public void tumblingCountWindowsTest() throws Exception {
SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.zzz");
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
DataStreamSource<Integer> source = env.addSource(new IntegerSource());
//每5个数据计算一次
source.countWindowAll(5)
.process(new ProcessAllWindowFunction<Integer, Integer, GlobalWindow>() {
@Override
public void process(Context context, Iterable<Integer> elements, Collector<Integer> out) throws Exception {
Iterator<Integer> it = elements.iterator();
int sum = 0;
while (it.hasNext()) {
Integer next = it.next();
sum += next;
System.out.println("元素: " + next + " ,处理时间:" + format.format(new Date()));
}
out.collect(sum);
}
})
.print();
env.execute("countWindowAll");
}
结果:
元素: 0 ,处理时间:2022-04-16 21:50:10.CST
元素: 1 ,处理时间:2022-04-16 21:50:10.CST
元素: 2 ,处理时间:2022-04-16 21:50:10.CST
元素: 3 ,处理时间:2022-04-16 21:50:10.CST
元素: 4 ,处理时间:2022-04-16 21:50:10.CST
8> 10
元素: 5 ,处理时间:2022-04-16 21:50:13.CST
元素: 6 ,处理时间:2022-04-16 21:50:13.CST
元素: 7 ,处理时间:2022-04-16 21:50:13.CST
元素: 8 ,处理时间:2022-04-16 21:50:13.CST
元素: 9 ,处理时间:2022-04-16 21:50:13.CST
1> 35
基于事件时间(EventTime)的滚动窗口
(1)数据实体类
@Data
@AllArgsConstructor
public class EventElement {
private Integer id;
//数据时间
private Long time;
//为了方便打印
private Date date;
}
(2)数据源
public class EventElementSource implements SourceFunction<EventElement> {
@Override
public void run(SourceContext ctx) throws Exception {
int id = 0;
Random random = new Random();
while (true) {
long time = new Date().getTime() + random.nextInt(5000);
Date date = new Date(time);
EventElement eventElement = new EventElement(id++, time, date);
ctx.collect(eventElement);
Thread.sleep(1000);
}
}
@Override
public void cancel() {
}
}
(3)示例
@Test
public void tumblingEventTimeWindowsTest() throws Exception {
SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.zzz");
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.STREAMING)
.setParallelism(1);
//添加数据源
env.addSource(new EventElementSource())
//设置时间戳和水印,使用数据里的时间
.assignTimestampsAndWatermarks(WatermarkStrategy.<EventElement>forMonotonousTimestamps().withTimestampAssigner((eventTElement, re) -> eventTElement.getTime()))
//基于事件时间的滚动窗口
.windowAll(TumblingEventTimeWindows.of(Time.seconds(3)))
.process(new ProcessAllWindowFunction<EventElement, Integer, TimeWindow>() {
@Override
public void process(Context context, Iterable<EventElement> elements, Collector<Integer> out) throws Exception {
Iterator<EventElement> it = elements.iterator();
int id = 0;
while (it.hasNext()) {
EventElement next = it.next();
id = next.getId();
System.out.println("元素: " + next + " ,处理时间:" + format.format(new Date()));
}
out.collect(id);
}
})
.print();
env.execute("TumblingEventTimeWindows");
}
结果:
数据产生时间
EventElement(id=0, time=1650120291747, date=Sat Apr 16 22:44:51 CST 2022)
EventElement(id=1, time=1650120289932, date=Sat Apr 16 22:44:49 CST 2022)
EventElement(id=2, time=1650120293657, date=Sat Apr 16 22:44:53 CST 2022)
EventElement(id=3, time=1650120291129, date=Sat Apr 16 22:44:51 CST 2022)
EventElement(id=4, time=1650120291191, date=Sat Apr 16 22:44:51 CST 2022)
EventElement(id=5, time=1650120294228, date=Sat Apr 16 22:44:54 CST 2022)
EventElement(id=6, time=1650120293441, date=Sat Apr 16 22:44:53 CST 2022)
EventElement(id=7, time=1650120294474, date=Sat Apr 16 22:44:54 CST 2022)
EventElement(id=8, time=1650120296386, date=Sat Apr 16 22:44:56 CST 2022)
EventElement(id=9, time=1650120298108, date=Sat Apr 16 22:44:58 CST 2022)
EventElement(id=13, time=1650120303753, date=Sat Apr 16 22:45:03 CST 2022)
EventElement(id=14, time=1650120305941, date=Sat Apr 16 22:45:05 CST 2022)
EventElement(id=15, time=1650120306923, date=Sat Apr 16 22:45:06 CST 2022
数据处理时间:
元素: EventElement(id=0, time=1650120291747, date=Sat Apr 16 22:44:51 CST 2022) ,处理时间:2022-04-16 22:44:52.CST
元素: EventElement(id=2, time=1650120293657, date=Sat Apr 16 22:44:53 CST 2022) ,处理时间:2022-04-16 22:44:52.CST
元素: EventElement(id=3, time=1650120291129, date=Sat Apr 16 22:44:51 CST 2022) ,处理时间:2022-04-16 22:44:52.CST
元素: EventElement(id=4, time=1650120291191, date=Sat Apr 16 22:44:51 CST 2022) ,处理时间:2022-04-16 22:44:52.CST
4
元素: EventElement(id=5, time=1650120294228, date=Sat Apr 16 22:44:54 CST 2022) ,处理时间:2022-04-16 22:44:56.CST
元素: EventElement(id=7, time=1650120294474, date=Sat Apr 16 22:44:54 CST 2022) ,处理时间:2022-04-16 22:44:56.CST
元素: EventElement(id=8, time=1650120296386, date=Sat Apr 16 22:44:56 CST 2022) ,处理时间:2022-04-16 22:44:56.CST
8
EventElement(id=10, time=1650120297360, date=Sat Apr 16 22:44:57 CST 2022)
EventElement(id=11, time=1650120299111, date=Sat Apr 16 22:44:59 CST 2022)
EventElement(id=12, time=1650120303800, date=Sat Apr 16 22:45:03 CST 2022)
元素: EventElement(id=9, time=1650120298108, date=Sat Apr 16 22:44:58 CST 2022) ,处理时间:2022-04-16 22:44:59.CST
元素: EventElement(id=10, time=1650120297360, date=Sat Apr 16 22:44:57 CST 2022) ,处理时间:2022-04-16 22:44:59.CST
元素: EventElement(id=11, time=1650120299111, date=Sat Apr 16 22:44:59 CST 2022) ,处理时间:2022-04-16 22:44:59.CST
11
元素: EventElement(id=12, time=1650120303800, date=Sat Apr 16 22:45:03 CST 2022) ,处理时间:2022-04-16 22:45:02.CST
元素: EventElement(id=13, time=1650120303753, date=Sat Apr 16 22:45:03 CST 2022) ,处理时间:2022-04-16 22:45:02.CST
元素: EventElement(id=14, time=1650120305941, date=Sat Apr 16 22:45:05 CST 2022) ,处理时间:2022-04-16 22:45:02.CST
14
通过水印,flink会在属于窗口的最后一个元素到达时关闭当前窗口,当然如果数据的时间不是递增的,也可能会出现丢失数据的情况。
1、id为0的数据生成时间是22:44:51,id 为1的数据生成的时间16 22:44:49已经不再第一个窗口的范围内。
2、第一个窗口处理完成后,id为6的数据生成时间是16 22:44:53处于第一个窗口的范围内,当前窗口也不会处理该数据了,造成数据丢失。