A study of tumbling-window boundaries and late data with Flink timeWindow

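1. Preface

It has been a long time since I last wrote on my CSDN blog; I moved over to Gitbook a while ago. This post is a chance to take notes and, while I am at it, remind people that I am still around. On to the topic:

This post looks at how Flink's timeWindow places the boundaries of tumbling windows and how it treats late data. It assumes a basic understanding of Flink's event time, watermark, and window mechanisms, and it helps to first read the CSDN article "Flink流计算编程–watermark(水位线)简介" (an introduction to watermarks in Flink stream programming). The notes below record some ideas that came out of reading that article.

2. Window boundaries of timeWindow

2.1 Introduction

In the Flink source, the assignWindows() method of org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows computes a variable named start, which holds the start time of the window.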

@Override
public Collection<TimeWindow> assignWindows(Object element, long timestamp, WindowAssignerContext context) {
    if (timestamp > Long.MIN_VALUE) {
        // Long.MIN_VALUE is currently assigned when no timestamp is present
        long start = TimeWindow.getWindowStartWithOffset(timestamp, offset, size);
        return Collections.singletonList(new TimeWindow(start, start + size));
    } else {
        throw new RuntimeException("Record has Long.MIN_VALUE timestamp (= no timestamp marker). " +
                "Is the time characteristic set to 'ProcessingTime', or did you forget to call " +
                "'DataStream.assignTimestampsAndWatermarks(...)'?");
    }
}

How start is computed can be seen in the source of getWindowStartWithOffset(). The timestamp argument is the element's event time. The code is:

/**
 * Method to get the window start for a timestamp.
 *
 * @param timestamp epoch millisecond to get the window start.
 * @param offset The offset which window start would be shifted by.
 * @param windowSize The size of the generated windows.
 * @return window start
 */
public static long getWindowStartWithOffset(long timestamp, long offset, long windowSize) {
    return timestamp - (timestamp - offset + windowSize) % windowSize;
}

That is the whole logic for the window start time. The end time is simply start time + windowSize, and the window covers the half-open range [start, end).
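
To see what the offset parameter does, here is a minimal sketch that reuses the formula above in a hypothetical helper class WindowBounds. The class name, the windowEnd() method, and the 3-second offset are my own illustration and not Flink code. It prints the [start, end) range of one timestamp with and without an offset.

```java
// Hypothetical helper; only the arithmetic in windowStart() comes from the Flink source quoted above.
class WindowBounds {

    // Same formula as TimeWindow.getWindowStartWithOffset().
    static long windowStart(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset + windowSize) % windowSize;
    }

    // The window is half-open: [start, start + windowSize).
    static long windowEnd(long timestamp, long offset, long windowSize) {
        return windowStart(timestamp, offset, windowSize) + windowSize;
    }

    public static void main(String[] args) {
        long size = 14_000L;          // 14 s in milliseconds
        long ts = 1_000_000_050_000L; // an exact multiple of 14 000

        // offset = 0: the timestamp sits on a boundary and opens its own window.
        System.out.println(windowStart(ts, 0L, size) + " .. " + windowEnd(ts, 0L, size));

        // offset = 3 000 ms (illustrative value): every boundary moves 3 s later,
        // so the same timestamp now falls in a window that started 11 s before it.
        System.out.println(windowStart(ts, 3_000L, size) + " .. " + windowEnd(ts, 3_000L, size));
    }
}
```

With offset = 0 the timestamp 1000000050000 is itself a window start. With a 3000 ms offset the start becomes 1000000039000, because all boundaries are shifted by the offset.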

2.2 Code

Below is my own code. For a given window size it computes the start time of the tumbling window that each element belongs to.

import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class BoundaryForTimeWindowTest {
    public static void main(String[] args) {

        // Window size, in milliseconds
        long windowSize = 14000L;
        // Offset, in milliseconds; a plain tumbling window uses offset = 0L
        long offset = 0L;

        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        // The output below is in UTC+8; pin the zone so it does not depend on the JVM default
        format.setTimeZone(TimeZone.getTimeZone("GMT+8"));

        // Event times of a1, a2, a3, a4, b5, b6
        long[] timestamps = {
            1000000050000L, 1000000054000L, 1000000079900L,
            1000000115000L, 1000000100000L, 1000000109000L
        };

        for (long ts : timestamps) {
            long start = getWindowStartWithOffset(ts, offset, windowSize);
            System.out.println(ts + " -> " + format.format(ts)
                    + "\tstart of its window: " + start + " -> " + format.format(start));
        }
    }

    // Copied from Flink's TimeWindow.getWindowStartWithOffset()
    private static long getWindowStartWithOffset(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset + windowSize) % windowSize;
    }
}

2.3 Output (times shown in UTC+8):

1000000050000 -> 2001-09-09 09:47:30.000	start of its window: 1000000050000 -> 2001-09-09 09:47:30.000
1000000054000 -> 2001-09-09 09:47:34.000	start of its window: 1000000050000 -> 2001-09-09 09:47:30.000
1000000079900 -> 2001-09-09 09:47:59.900	start of its window: 1000000078000 -> 2001-09-09 09:47:58.000
1000000115000 -> 2001-09-09 09:48:35.000	start of its window: 1000000106000 -> 2001-09-09 09:48:26.000
1000000100000 -> 2001-09-09 09:48:20.000	start of its window: 1000000092000 -> 2001-09-09 09:48:12.000
1000000109000 -> 2001-09-09 09:48:29.000	start of its window: 1000000106000 -> 2001-09-09 09:48:26.000

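2.4 Explanation:

So for a window declared as timeWindow(Time.seconds(14)), the arguments passed to getWindowStartWithOffset() in the Flink source are:

offset = 0
windowSize = 14000

The output above shows that with a 14-second window, element 1000000109000 belongs to the window [1000000106000, 1000000120000).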

3. Late data

Here is my test code first. long delay, int windowSize, and the input array Tuple3[] elements are all meant to be changed from one run to the next.

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.time.Time;
import java.text.SimpleDateFormat;

public class TimeWindowDemo {

    public static void main(String[] args) throws Exception {
        long delay = 5100L;
        int windowSize = 15;

        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Set up the data source
        env.setParallelism(1);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        DataStream<Tuple3<String, String, Long>> dataStream = env.addSource(new TimeWindowDemo.DataSource()).name("Demo Source");

        // Assign timestamps and watermarks
        DataStream<Tuple3<String, String, Long>> watermark = dataStream.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Tuple3<String, String, Long>>() {
            private final long maxOutOfOrderness = delay;
            private long currentMaxTimestamp = 0L;

            @Override
            public Watermark getCurrentWatermark() {
                return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
            }

            @Override
            public long extractTimestamp(Tuple3<String, String, Long> element, long previousElementTimestamp) {
                long timestamp = element.f2;
                SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
                System.out.println(element.f1 + " -> " + timestamp + " -> " + format.format(timestamp));
                currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
                return timestamp;
            }
        });

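        // Aggregate the elements of each window
        DataStream<Tuple3<String, String, Long>> resStream = watermark.keyBy(0).timeWindow(Time.seconds(windowSize)).reduce(
            new ReduceFunction<Tuple3<String, String, Long>>() {
                @Override
                public Tuple3<String, String, Long> reduce(Tuple3<String, String, Long> value1, Tuple3<String, String, Long> value2) throws Exception {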
                    return Tuple3.of(value1.f0, value1.f1 + "" + value2.f1, 1L);
                }
            }
        );

        resStream.print();

        env.execute("event time demo");
    }

    private static class DataSource extends RichParallelSourceFunction<Tuple3<String, String, Long>> {
        private volatile boolean running = true;

        @Override
        public void run(SourceContext<Tuple3<String, String, Long>> ctx) throws InterruptedException {
            Tuple3[] elements = new Tuple3[]{
                Tuple3.of("a", "1", 1000000050000L),
                Tuple3.of("a", "2", 1000000054000L),
                Tuple3.of("a", "3", 1000000079900L),
                Tuple3.of("a", "4", 1000000115000L),
                Tuple3.of("b", "5", 1000000100000L),
                Tuple3.of("b", "6", 1000000108000L)
            };

            int count = 0;
            while (running && count < elements.length) {
                ctx.collect(new Tuple3<>((String) elements[count].f0, (String) elements[count].f1, (Long) elements[count].f2));
                count++;
                Thread.sleep(1000);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }
}

3.1 Case 1: the element is below the watermark, but its window has not fired yet

3.1.1 Parameters:

    long delay = 5000L;
    int windowSize = 10;

3.1.2 Input:

    Tuple3[] elements = new Tuple3[]{
                    Tuple3.of("a", "1", 1000000050000L),
                    Tuple3.of("a", "2", 1000000054000L),
                    Tuple3.of("a", "3", 1000000079900L),
                    Tuple3.of("a", "4", 1000000120000L),
                    Tuple3.of("b", "5", 1000000111000L),
                    Tuple3.of("b", "6", 1000000089000L)
                };

3.1.3 Output:

1 -> 1000000050000 -> 2001-09-09 09:47:30.000
2 -> 1000000054000 -> 2001-09-09 09:47:34.000
3 -> 1000000079900 -> 2001-09-09 09:47:59.900
(a,12,1)
4 -> 1000000120000 -> 2001-09-09 09:48:40.000
(a,3,1000000079900)
5 -> 1000000111000 -> 2001-09-09 09:48:31.000
6 -> 1000000089000 -> 2001-09-09 09:48:09.000
(b,5,1000000111000)
(a,4,1000000120000)

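3.1.4 Explanation:

When record 4 (4 -> 1000000120000 -> 2001-09-09 09:48:40.000) arrives, the watermark becomes 1000000120000 - 5000 = 1000000115000, i.e. 2001-09-09 09:48:35.000.

When record 5 (5 -> 1000000111000 -> 2001-09-09 09:48:31.000) arrives after that, its window has not fired yet, because record 5 belongs to the window [2001-09-09 09:48:30.000, 2001-09-09 09:48:40.000). In other words watermark < window end time, so record 5 is still included in the computation.

Record 6 (6 -> 1000000089000 -> 2001-09-09 09:48:09.000), however, belongs to the window [2001-09-09 09:48:00.000, 2001-09-09 09:48:10.000), so the current watermark > window end time. That window can never fire any more, and record 6 is therefore lost.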
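
The reasoning above can be replayed without running Flink. The following is a rough sketch in plain Java, not Flink's implementation. The class LateElementReplay and its methods are hypothetical. It assumes that the watermark has caught up with all earlier records before the next one arrives (plausible here because the source sleeps 1 s between records), that the allowed lateness is 0, and that a record counts as late once the watermark has reached the last millisecond of its window (window end - 1), which is how I understand Flink's window operator to decide.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java replay of the scenario; it does not depend on Flink.
class LateElementReplay {

    // Window start for offset = 0, same arithmetic as getWindowStartWithOffset().
    static long windowStart(long timestamp, long windowSize) {
        return timestamp - (timestamp + windowSize) % windowSize;
    }

    // Returns one "kept" or "dropped" verdict per timestamp, in arrival order.
    static List<String> replay(long[] timestamps, long delay, long windowSize) {
        List<String> verdicts = new ArrayList<>();
        long maxTimestamp = 0L; // mirrors currentMaxTimestamp in the demo
        for (long ts : timestamps) {
            // Watermark produced by the records seen before this one.
            long watermark = maxTimestamp - delay;
            long windowEnd = windowStart(ts, windowSize) + windowSize;
            // Assumption: late once the watermark reaches windowEnd - 1.
            verdicts.add(watermark >= windowEnd - 1 ? "dropped" : "kept");
            maxTimestamp = Math.max(maxTimestamp, ts);
        }
        return verdicts;
    }

    public static void main(String[] args) {
        long[] scenario1 = {
            1000000050000L, 1000000054000L, 1000000079900L,
            1000000120000L, 1000000111000L, 1000000089000L
        };
        // delay = 5000 ms, windowSize = 10 s
        System.out.println(replay(scenario1, 5000L, 10000L));
    }
}
```

For the input of 3.1.2 this reports records 1 to 5 as kept and record 6 as dropped, which agrees with the output in 3.1.3.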

3.2 Case 2: the element is below the watermark, and its window can no longer fire

This simply verifies the conclusion above a second time.

3.2.1 Parameters:

    long delay = 5000L;
    int windowSize = 10;

3.2.2 Input:

      Tuple3[] elements = new Tuple3[]{
            Tuple3.of("a", "1", 1000000050000L),
            Tuple3.of("a", "2", 1000000054000L),
            Tuple3.of("a", "3", 1000000079900L),
            Tuple3.of("a", "4", 1000000120000L),
            Tuple3.of("b", "5", 1000000100001L),
            Tuple3.of("b", "6", 1000000109000L)
        };

3.2.3 Output:

1 -> 1000000050000 -> 2001-09-09 09:47:30.000
2 -> 1000000054000 -> 2001-09-09 09:47:34.000
3 -> 1000000079900 -> 2001-09-09 09:47:59.900
(a,12,1)
4 -> 1000000120000 -> 2001-09-09 09:48:40.000
(a,3,1000000079900)
5 -> 1000000100001 -> 2001-09-09 09:48:20.001
6 -> 1000000109000 -> 2001-09-09 09:48:29.000
(a,4,1000000120000)

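3.2.4 Explanation:

Elements 5 and 6 now both belong to the window [2001-09-09 09:48:20.000, 2001-09-09 09:48:30.000).

But the end time of this window is already below the watermark (2001-09-09 09:48:35.000 after record 4), so the window can no longer fire. Put differently, at the moment the window would have been created its end time was already behind the watermark, so records 5 and 6 are both lost.

Now change the input and raise the maximum delay to 5.1 s, keeping the window length at 10 s.
Parameters: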

    long delay = 5100L;
    int windowSize = 10;

Input:

 Tuple3[] elements = new Tuple3[]{
                Tuple3.of("a", "1", 1000000050000L),
                Tuple3.of("a", "2", 1000000054000L),
                Tuple3.of("a", "3", 1000000079900L),
                Tuple3.of("a", "4", 1000000115000L),
                Tuple3.of("b", "5", 1000000100000L),
                Tuple3.of("b", "6", 1000000108000L)
            };

Output:

1 -> 1000000050000 -> 2001-09-09 09:47:30.000
2 -> 1000000054000 -> 2001-09-09 09:47:34.000
3 -> 1000000079900 -> 2001-09-09 09:47:59.900
(a,12,1)
4 -> 1000000115000 -> 2001-09-09 09:48:35.000
(a,3,1000000079900)
5 -> 1000000100000 -> 2001-09-09 09:48:20.000
6 -> 1000000108000 -> 2001-09-09 09:48:28.000
(b,56,1)
(a,4,1000000115000)

Why does this happen? After record 4 (4 -> 1000000115000 -> 2001-09-09 09:48:35.000) arrives, the watermark is 1000000115000 - 5100 = 1000000109900, i.e. 2001-09-09 09:48:29.900.
Records 5 and 6 belong to the window [2001-09-09 09:48:20.000, 2001-09-09 09:48:30.000), so when they arrive watermark < window end time. In this situation the data is kept and not lost, even though its event time < watermark.
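
The same comparison can be turned around to ask how large delay has to be for a late window to stay open. The sketch below is only an illustration under the assumptions already used in this post (allowed lateness 0, watermark = largest timestamp seen - delay), plus my understanding that Flink treats a window as late once the watermark reaches window end - 1. The class MinimumDelay and both of its methods are hypothetical helpers, not a Flink API.

```java
// Hypothetical helper for reasoning about the delay threshold; not part of Flink.
class MinimumDelay {

    // True if a window ending at windowEnd still accepts records after a record
    // with timestamp maxTimestampSeen has pushed the watermark forward.
    static boolean keepsWindowOpen(long maxTimestampSeen, long delay, long windowEnd) {
        long watermark = maxTimestampSeen - delay;
        return watermark < windowEnd - 1;
    }

    // Smallest delay (ms) for which keepsWindowOpen() is true:
    //   maxTimestampSeen - delay < windowEnd - 1
    //   delay > maxTimestampSeen - windowEnd + 1
    static long minimumDelay(long maxTimestampSeen, long windowEnd) {
        return Math.max(0L, maxTimestampSeen - windowEnd + 2);
    }

    public static void main(String[] args) {
        long record4 = 1000000115000L;   // largest timestamp seen so far
        long windowEnd = 1000000110000L; // end of [09:48:20.000, 09:48:30.000)

        System.out.println("minimum delay: " + minimumDelay(record4, windowEnd) + " ms");
        System.out.println("delay 5100 keeps window open: " + keepsWindowOpen(record4, 5100L, windowEnd));
        System.out.println("delay 5000 keeps window open: " + keepsWindowOpen(record4, 5000L, windowEnd));
    }
}
```

Under these assumptions the threshold for this input is 5002 ms, which is why delay = 5100 keeps records 5 and 6 while delay = 5000 would not.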

4. timeWindow summary

The conclusions from my tests:

  1. If the element's EventTime is above the WaterMark, i.e. EventTime > WaterMark: the WindowEndTime of the element's window is always greater than its EventTime, so WindowEndTime > EventTime > WaterMark, and the element is never lost.
  2. If the element's EventTime is below the WaterMark, i.e. WaterMark > EventTime, there are two cases:
      2.1 If WindowEndTime > WaterMark, the window has not fired yet: WindowEndTime > WaterMark > EventTime, and the element is not lost either.
      2.2 If WaterMark > WindowEndTime, the window can no longer fire: WaterMark > WindowEndTime > EventTime, and the element is lost.
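
The three cases can be written as a small decision function. This is a sketch of the summary, not Flink code: LatenessCases, Outcome, and classify() are hypothetical names, the window offset and the allowed lateness are both taken as 0, and the boundary check uses window end - 1 because, as far as I can tell, that is the timestamp Flink compares against the watermark.

```java
// Hypothetical classifier for the cases in the summary; it does not depend on Flink.
class LatenessCases {

    enum Outcome { ON_TIME, LATE_BUT_KEPT, DROPPED }

    static Outcome classify(long eventTime, long watermark, long windowSize) {
        // Tumbling window with offset 0, as in getWindowStartWithOffset().
        long windowStart = eventTime - (eventTime + windowSize) % windowSize;
        long windowEnd = windowStart + windowSize;

        if (eventTime > watermark) {
            return Outcome.ON_TIME;       // case 1: EventTime > WaterMark
        }
        if (windowEnd - 1 > watermark) {
            return Outcome.LATE_BUT_KEPT; // case 2.1: window has not fired yet
        }
        return Outcome.DROPPED;           // case 2.2: window can no longer fire
    }

    public static void main(String[] args) {
        long watermark = 1000000115000L; // 09:48:35.000, after record 4 in section 3.1
        System.out.println(classify(1000000111000L, watermark, 10000L)); // record 5
        System.out.println(classify(1000000089000L, watermark, 10000L)); // record 6
    }
}
```

Applied to section 3.1, record 5 falls into case 2.1 and record 6 into case 2.2.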
