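It has been a long time since I last wrote on my CSDN blog; I moved over to GitBook a while ago. I am recording this article here and, while I am at it, reminding people that I still exist. On to the topic.
This article investigates the boundaries of the tumbling windows created by Flink's timeWindow and how late-arriving data is handled. Readers should have a basic understanding of Flink's event time + watermark + window mechanism, and it is best to first read the CSDN article "Flink流计算编程–watermark(水位线)简介" (an introduction to watermarks in Flink stream processing). This post records some ideas that article prompted.
In the Flink source, the assignWindows method of the class org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows has a variable start that holds the window start time.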
@Override
public Collection<TimeWindow> assignWindows(Object element, long timestamp, WindowAssignerContext context) {
    if (timestamp > Long.MIN_VALUE) {
        // Long.MIN_VALUE is currently assigned when no timestamp is present
        long start = TimeWindow.getWindowStartWithOffset(timestamp, offset, size);
        return Collections.singletonList(new TimeWindow(start, start + size));
    } else {
        throw new RuntimeException("Record has Long.MIN_VALUE timestamp (= no timestamp marker). " +
                "Is the time characteristic set to 'ProcessingTime', or did you forget to call " +
                "'DataStream.assignTimestampsAndWatermarks(...)'?");
    }
}
To see how start is computed, we need to look at the source of getWindowStartWithOffset(). Here timestamp is the element's event time. The code is as follows:
/**
 * Method to get the window start for a timestamp.
 *
 * @param timestamp epoch millisecond to get the window start.
 * @param offset The offset which window start would be shifted by.
 * @param windowSize The size of the generated windows.
 * @return window start
 */
public static long getWindowStartWithOffset(long timestamp, long offset, long windowSize) {
    return timestamp - (timestamp - offset + windowSize) % windowSize;
}
That is the logic that determines a window's start time. The end time is simply start time + windowSize.
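To see why the formula adds windowSize before taking the remainder, the following minimal sketch compares it with a variant that omits the term. The class name WindowStartSketch and the method naiveStart are hypothetical and exist only for this illustration; getWindowStartWithOffset is copied from the Flink source quoted above. The point is that Java's % keeps the sign of the dividend, so timestamp - offset can produce a negative remainder when the offset is larger than the timestamp.

```java
public class WindowStartSketch {
    // Copied from the Flink source quoted above.
    static long getWindowStartWithOffset(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset + windowSize) % windowSize;
    }

    // Hypothetical variant without "+ windowSize", for comparison only.
    static long naiveStart(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset) % windowSize;
    }

    public static void main(String[] args) {
        long windowSize = 10L;
        long offset = 3L;
        long timestamp = 1L;

        long start = getWindowStartWithOffset(timestamp, offset, windowSize);
        long naive = naiveStart(timestamp, offset, windowSize);

        // Flink's formula: the window contains the timestamp.
        System.out.println("[" + start + ", " + (start + windowSize) + ")");
        // Naive variant: the window starts after the timestamp.
        System.out.println("[" + naive + ", " + (naive + windowSize) + ")");
    }
}
```

With these small illustrative numbers the first line prints [-7, 3), which contains timestamp 1, while the naive variant prints [3, 13), which does not. For the epoch-millisecond timestamps and offset = 0 used in the rest of this article both variants agree, so the extra term only matters near zero or with an offset.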
Below is my own code, which computes the start time of the tumbling window each element belongs to for a given window size.
import java.text.SimpleDateFormat;

public class BoundaryForTimeWindowTest {
    public static void main(String[] args) {
        // Window size, in milliseconds
        long windowSize = 14000L;
        // Offset, in milliseconds; 0 for a plain tumbling window
        long offset = 0L;
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        // Event timestamps (epoch milliseconds) of the test elements
        long[] timestamps = {
            1000000050000L, // a1
            1000000054000L, // a2
            1000000079900L, // a3
            1000000115000L, // a4
            1000000100000L, // b5
            1000000109000L  // b6
        };
        for (long ts : timestamps) {
            long start = getWindowStartWithOffset(ts, offset, windowSize);
            System.out.println(ts + " -> " + format.format(ts)
                    + "\twindow start: " + start + " -> " + format.format(start));
        }
    }

    // Same formula as Flink's TimeWindow.getWindowStartWithOffset
    private static long getWindowStartWithOffset(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset + windowSize) % windowSize;
    }
}
Output (times are formatted in the JVM's default time zone, UTC+8 here):
1000000050000 -> 2001-09-09 09:47:30.000	window start: 1000000050000 -> 2001-09-09 09:47:30.000
1000000054000 -> 2001-09-09 09:47:34.000	window start: 1000000050000 -> 2001-09-09 09:47:30.000
1000000079900 -> 2001-09-09 09:47:59.900	window start: 1000000078000 -> 2001-09-09 09:47:58.000
1000000115000 -> 2001-09-09 09:48:35.000	window start: 1000000106000 -> 2001-09-09 09:48:26.000
1000000100000 -> 2001-09-09 09:48:20.000	window start: 1000000092000 -> 2001-09-09 09:48:12.000
1000000109000 -> 2001-09-09 09:48:29.000	window start: 1000000106000 -> 2001-09-09 09:48:26.000
So for a window declared as timeWindow(Time.seconds(14)), the parameters passed to getWindowStartWithOffset() in the Flink source are:
offset = 0
windowSize = 14000
The output above shows that with a 14-second window, element 1000000109000 belongs to the window [1000000106000, 1000000120000).
Here is my test code first. The long delay, the int windowSize and the input source Tuple3[] elements in it are all configurable parameters.
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.text.SimpleDateFormat;

public class TimeWindowDemo {
    public static void main(String[] args) throws Exception {
        // Maximum out-of-orderness, in milliseconds
        long delay = 5100L;
        // Window size, in seconds
        int windowSize = 15;
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Set up the data source
        env.setParallelism(1);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        DataStream<Tuple3<String, String, Long>> dataStream = env.addSource(new TimeWindowDemo.DataSource()).name("Demo Source");
        // Assign timestamps and watermarks
        DataStream<Tuple3<String, String, Long>> watermark = dataStream.assignTimestampsAndWatermarks(
                new AssignerWithPeriodicWatermarks<Tuple3<String, String, Long>>() {
                    private final long maxOutOfOrderness = delay;
                    private long currentMaxTimestamp = 0L;

                    @Override
                    public Watermark getCurrentWatermark() {
                        // Watermark = largest event time seen so far minus the allowed delay
                        return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
                    }

                    @Override
                    public long extractTimestamp(Tuple3<String, String, Long> element, long previousElementTimestamp) {
                        long timestamp = element.f2;
                        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
                        System.out.println(element.f1 + " -> " + timestamp + " -> " + format.format(timestamp));
                        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
                        return timestamp;
                    }
                });
        // Process each window: concatenate the f1 values of the elements with the same key
        DataStream<Tuple3<String, String, Long>> resStream = watermark.keyBy(0).timeWindow(Time.seconds(windowSize)).reduce(
                new ReduceFunction<Tuple3<String, String, Long>>() {
                    @Override
                    public Tuple3<String, String, Long> reduce(Tuple3<String, String, Long> value1, Tuple3<String, String, Long> value2) throws Exception {
                        return Tuple3.of(value1.f0, value1.f1 + "" + value2.f1, 1L);
                    }
                }
        );
        resStream.print();
        env.execute("event time demo");
    }

    private static class DataSource extends RichParallelSourceFunction<Tuple3<String, String, Long>> {
        private volatile boolean running = true;

        @Override
        public void run(SourceContext<Tuple3<String, String, Long>> ctx) throws InterruptedException {
            // (key, label, event time in epoch milliseconds)
            Tuple3[] elements = new Tuple3[]{
                Tuple3.of("a", "1", 1000000050000L),
                Tuple3.of("a", "2", 1000000054000L),
                Tuple3.of("a", "3", 1000000079900L),
                Tuple3.of("a", "4", 1000000115000L),
                Tuple3.of("b", "5", 1000000100000L),
                Tuple3.of("b", "6", 1000000108000L)
            };
            int count = 0;
            // Emit one element per second
            while (running && count < elements.length) {
                ctx.collect(new Tuple3<>((String) elements[count].f0, (String) elements[count].f1, (Long) elements[count].f2));
                count++;
                Thread.sleep(1000);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }
}
Parameters:
long delay = 5000L;
int windowSize = 10;
Input:
Tuple3[] elements = new Tuple3[]{
    Tuple3.of("a", "1", 1000000050000L),
    Tuple3.of("a", "2", 1000000054000L),
    Tuple3.of("a", "3", 1000000079900L),
    Tuple3.of("a", "4", 1000000120000L),
    Tuple3.of("b", "5", 1000000111000L),
    Tuple3.of("b", "6", 1000000089000L)
};
Output:
1 -> 1000000050000 -> 2001-09-09 09:47:30.000
2 -> 1000000054000 -> 2001-09-09 09:47:34.000
3 -> 1000000079900 -> 2001-09-09 09:47:59.900
(a,12,1)
4 -> 1000000120000 -> 2001-09-09 09:48:40.000
(a,3,1000000079900)
5 -> 1000000111000 -> 2001-09-09 09:48:31.000
6 -> 1000000089000 -> 2001-09-09 09:48:09.000
(b,5,1000000111000)
(a,4,1000000120000)
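When the 4th record (4 -> 1000000120000 -> 2001-09-09 09:48:40.000) arrives, the watermark becomes 2001-09-09 09:48:35.000, that is, the record's timestamp minus the 5-second delay.
When the 5th record (5 -> 1000000111000 -> 2001-09-09 09:48:31.000) then arrives, the window it belongs to has not fired yet, because the 5th record belongs to the window [2001-09-09 09:48:30.000, 2001-09-09 09:48:40.000). In other words, watermark < window end time, so the 5th record is still included in the computation.
The 6th record (6 -> 1000000089000 -> 2001-09-09 09:48:09.000), however, belongs to the window [2001-09-09 09:48:00.000, 2001-09-09 09:48:10.000), so the current watermark > window end time. A window whose end time is already behind the watermark will never be triggered for this record (no allowed lateness is configured here), so the 6th record is discarded as late data.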
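The drop decision described above can be replayed without a Flink cluster. The sketch below is a simplification under stated assumptions, not Flink's implementation: the watermark is taken to be the largest timestamp seen so far minus delay and to have advanced before the next record arrives (plausible here because the source sleeps one second between records), allowed lateness is 0, and a record counts as late when the last timestamp of its window (end - 1) is at or below the watermark, which is my understanding of the check in Flink's window operator. LateRecordSketch, windowStart and lateRecords are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class LateRecordSketch {
    // Same formula as Flink's TimeWindow.getWindowStartWithOffset, with offset = 0.
    static long windowStart(long timestamp, long windowSize) {
        return timestamp - (timestamp + windowSize) % windowSize;
    }

    // Returns the 1-based positions of the records that would be dropped as late.
    static List<Integer> lateRecords(long[] timestamps, long delay, long windowSize) {
        List<Integer> late = new ArrayList<>();
        long maxTimestamp = 0L;
        for (int i = 0; i < timestamps.length; i++) {
            // Assumption: the watermark already reflects every earlier record.
            long watermark = maxTimestamp - delay;
            long windowEnd = windowStart(timestamps[i], windowSize) + windowSize;
            // Assumption: allowed lateness is 0, so the window is late once
            // its last timestamp (windowEnd - 1) is at or below the watermark.
            if (windowEnd - 1 <= watermark) {
                late.add(i + 1);
            }
            maxTimestamp = Math.max(maxTimestamp, timestamps[i]);
        }
        return late;
    }

    public static void main(String[] args) {
        // Input of the run above: delay = 5000 ms, window size = 10 s.
        long[] timestamps = {
            1000000050000L, 1000000054000L, 1000000079900L,
            1000000120000L, 1000000111000L, 1000000089000L
        };
        System.out.println(lateRecords(timestamps, 5000L, 10000L)); // prints [6]
    }
}
```

For this input only the 6th record is reported as late, which matches the Flink output shown above, where record 5 appears in a result and record 6 does not.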
The next run simply confirms the conclusion above once more.
Parameters:
long delay = 5000L;
int windowSize = 10;
Input:
Tuple3[] elements = new Tuple3[]{
    Tuple3.of("a", "1", 1000000050000L),
    Tuple3.of("a", "2", 1000000054000L),
    Tuple3.of("a", "3", 1000000079900L),
    Tuple3.of("a", "4", 1000000120000L),
    Tuple3.of("b", "5", 1000000100001L),
    Tuple3.of("b", "6", 1000000109000L)
};
Output:
1 -> 1000000050000 -> 2001-09-09 09:47:30.000
2 -> 1000000054000 -> 2001-09-09 09:47:34.000
3 -> 1000000079900 -> 2001-09-09 09:47:59.900
(a,12,1)
4 -> 1000000120000 -> 2001-09-09 09:48:40.000
(a,3,1000000079900)
5 -> 1000000100001 -> 2001-09-09 09:48:20.001
6 -> 1000000109000 -> 2001-09-09 09:48:29.000
(a,4,1000000120000)
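This time the 5th and 6th elements both belong to the window [2001-09-09 09:48:20.000, 2001-09-09 09:48:30.000).
But for this window, window end time < watermark (09:48:35.000, set by the 4th record), so it can no longer be triggered. Put differently, by the time this window would have been created, its end time was already behind the watermark, so the 5th and 6th records are both lost.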
Now let us change the input again and set the maximum delay to 5.1 s. The window length is still 10 seconds.
Parameters:
long delay = 5100L;
int windowSize = 10;
Input:
Tuple3[] elements = new Tuple3[]{
    Tuple3.of("a", "1", 1000000050000L),
    Tuple3.of("a", "2", 1000000054000L),
    Tuple3.of("a", "3", 1000000079900L),
    Tuple3.of("a", "4", 1000000115000L),
    Tuple3.of("b", "5", 1000000100000L),
    Tuple3.of("b", "6", 1000000108000L)
};
Output:
1 -> 1000000050000 -> 2001-09-09 09:47:30.000
2 -> 1000000054000 -> 2001-09-09 09:47:34.000
3 -> 1000000079900 -> 2001-09-09 09:47:59.900
(a,12,1)
4 -> 1000000115000 -> 2001-09-09 09:48:35.000
(a,3,1000000079900)
5 -> 1000000100000 -> 2001-09-09 09:48:20.000
6 -> 1000000108000 -> 2001-09-09 09:48:28.000
(b,56,1)
(a,4,1000000115000)
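Why does this happen? Because after the 4th record (4 -> 1000000115000 -> 2001-09-09 09:48:35.000) arrives, the watermark is 2001-09-09 09:48:29.900 (09:48:35.000 minus the 5.1-second delay).
The 5th and 6th records, however, belong to the window [2001-09-09 09:48:20.000, 2001-09-09 09:48:30.000). So when the 5th and 6th records arrive, watermark < window end time. In this situation, even though a record's event time < watermark, the record is still kept and is not lost.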
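The extra 100 ms of delay matters because of where the firing threshold sits. As a rough sketch, assuming the watermark is the largest timestamp seen minus delay and that an event-time window fires once the watermark reaches the window's last timestamp (end - 1), the smallest event timestamp that closes a window can be computed directly. FiringThresholdSketch and minTimestampToFire are hypothetical names, and the second comparison is a hypothetical variation (record 4 at 1000000115000 combined with a 5000 ms delay) that is not one of the runs above.

```java
public class FiringThresholdSketch {
    // Smallest event timestamp whose watermark (timestamp - delay) reaches
    // the window's last timestamp (windowEnd - 1), assuming allowed lateness 0.
    static long minTimestampToFire(long windowEnd, long delay) {
        return windowEnd - 1 + delay;
    }

    public static void main(String[] args) {
        long windowEnd = 1000000110000L; // end of [09:48:20.000, 09:48:30.000)
        long record4 = 1000000115000L;   // 09:48:35.000

        // delay = 5100 ms: threshold is 1000000115099, record 4 falls short,
        // so the window is still open when records 5 and 6 arrive.
        System.out.println(record4 >= minTimestampToFire(windowEnd, 5100L)); // false

        // delay = 5000 ms (hypothetical): threshold is 1000000114999, record 4
        // passes it, so the window would already be closed.
        System.out.println(record4 >= minTimestampToFire(windowEnd, 5000L)); // true
    }
}
```

Under these assumptions the 4th record misses the threshold by 99 ms when delay is 5100 ms, which is consistent with records 5 and 6 being reduced into (b,56,1) in the output above.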
Here are the conclusions I drew from these tests: