Flink DataStream: Watermark Generation and Window Triggering under EventTime

This article summarizes how Watermarks are generated and when window computation is triggered in Flink DataStream under EventTime, aiming to clarify the following questions:

  1. Window start and end times

  2. Understanding MaxOutOfOrderness and AllowedLateness

  3. When a window computation fires for the first time

  4. When a window computation fires again

  5. When a window is destroyed

  6. Watermark generation and window triggering with multiple parallel subtasks

  7. Windows not firing in a parallel Kafka stream

Code Implementation

package com.bigdata.flink.dataStreamEventTimeWatermark;

import lombok.extern.slf4j.Slf4j;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.ConfigConstants;
import org.apache.flink.configuration.ConfigOptions;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.shaded.guava18.com.google.common.collect.Iterables;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.util.Collector;
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

import java.util.Properties;

/**
 * Summary:
 *    Watermark generation and window triggering under EventTime.
 */
@Slf4j
public class DataStreamEventTimeWatermark {
    public static void main(String[] args) throws Exception{

        args=new String[]{"--application","flink/src/main/java/com/bigdata/flink/dataStreamEventTimeWatermark/application.properties"}; // hardcoded for local debugging

        // 1. Parse command-line arguments
        ParameterTool fromArgs = ParameterTool.fromArgs(args);
        ParameterTool parameterTool = ParameterTool.fromPropertiesFile(fromArgs.getRequired("application"));

        String bootstrapServers = parameterTool.getRequired("bootstrapServers");
        String topic = parameterTool.getRequired("topic");
        String groupID = parameterTool.getRequired("groupID");
        int maxOutOfOrderness = parameterTool.getInt("maxOutOfOrderness"); // 10 seconds
        int windowLength = parameterTool.getInt("windowLength"); // 30 seconds
        int allowedLateness = parameterTool.getInt("allowedLateness"); // 5 seconds

        // 2. Set up the execution environment
        Configuration config = new Configuration();
        config.setInteger(ConfigOptions.key("rest.port").defaultValue(8081),8081);
        config.setBoolean(ConfigConstants.LOCAL_START_WEBSERVER, true);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(config);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.setParallelism(1); // parallelism = 1

        // 3. Add the data source
        Properties kafkaProperties = new Properties();
        kafkaProperties.put("bootstrap.servers",bootstrapServers);
        kafkaProperties.put("group.id",groupID);
        FlinkKafkaConsumer010<String> kafkaConsumer = new FlinkKafkaConsumer010<>(topic, new SimpleStringSchema(), kafkaProperties);
        kafkaConsumer.setStartFromLatest();
        DataStream<String> source = env.addSource(kafkaConsumer).rebalance();

        // 4. Parse the source data
        SingleOutputStreamOperator<Tuple2<String, String>> parsedData = source.process(new CustomProcessFunctionParseLog());

        // 5. Extract timestamps and generate watermarks
        SingleOutputStreamOperator<Tuple2<String, String>> watermarkedData = parsedData.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor(Time.seconds(maxOutOfOrderness)));

        // 6. Window computation
        watermarkedData
                .keyBy((KeySelector<Tuple2<String, String>, String>) value -> value.f1)
                .window(TumblingEventTimeWindows.of(Time.seconds(windowLength)))
                .allowedLateness(Time.seconds(allowedLateness))
                .process(new CustomProcessWindowFunction());


        // 7. Execute
        env.execute(DataStreamEventTimeWatermark.class.getSimpleName());


    }

    /**
     * Custom WindowFunction that performs the window computation.
     */
    static class CustomProcessWindowFunction extends ProcessWindowFunction<Tuple2<String, String>, String, String, TimeWindow>{

        int subTaskID;

        @Override
        public void open(Configuration parameters) throws Exception {
            subTaskID=getRuntimeContext().getIndexOfThisSubtask();
        }

        @Override
        public void process(String key, Context context, Iterable<Tuple2<String, String>> elements, Collector<String> out) throws Exception {
            int count = Iterables.size(elements);

            TimeWindow window = context.window();
            String windowStart = new DateTime(window.getStart(), DateTimeZone.forID("+08:00")).toString("yyyy-MM-dd HH:mm:ss");
            String windowEnd = new DateTime(window.getEnd(), DateTimeZone.forID("+08:00")).toString("yyyy-MM-dd HH:mm:ss");

            String record ="SubtaskID: "+subTaskID+ " WindowRange: " + windowStart + " ~ " + windowEnd + " Key: " + key + " Count: " + count;

            log.warn(record);

            out.collect(record);
        }
    }


    /**
     * Custom ProcessFunction that parses the comma-separated records
     * (eventTime,eventType) read from Kafka.
     */
    static class CustomProcessFunctionParseLog extends ProcessFunction<String,Tuple2<String, String>>{
        @Override
        public void processElement(String value, Context ctx, Collector<Tuple2<String, String>> out) throws Exception {
            try {
                String[] values = value.split(",");
                String eventTime=values[0];
                String eventType=values[1];

                Tuple2<String, String> parsedValue = new Tuple2<>(eventTime, eventType);
                out.collect(parsedValue);
            } catch (Exception ex) {
                log.error("解析Kafka数据异常,Record: {}", value, ex);
            }
        }
    }

    /**
     * Periodic watermark generator with a fixed allowed delay.
     * Based on {@link org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor},
     * with minor modifications for easier debugging.
     *
     * The watermark emission interval (period) can be set via
     * StreamExecutionEnvironment.getConfig().setAutoWatermarkInterval(milliseconds).
     * Under EventTime, the period defaults to 200 ms.
     */
    static class BoundedOutOfOrdernessTimestampExtractor implements AssignerWithPeriodicWatermarks<Tuple2<String, String>>{

        private static final long serialVersionUID = 1L;

        private long currentMaxTimestamp;

        private long lastEmittedWatermark = Long.MIN_VALUE;

        private final long maxOutOfOrderness;


        public BoundedOutOfOrdernessTimestampExtractor(Time maxOutOfOrderness) {
            if (maxOutOfOrderness.toMilliseconds() < 0) {
                throw new RuntimeException("Tried to set the maximum allowed " +
                        "lateness to " + maxOutOfOrderness + ". This parameter cannot be negative.");
            }
            this.maxOutOfOrderness = maxOutOfOrderness.toMilliseconds();
            this.currentMaxTimestamp = Long.MIN_VALUE + this.maxOutOfOrderness;
        }

        @Override
        public final Watermark getCurrentWatermark() {
            // this guarantees that the watermark never goes backwards.
            long potentialWM = currentMaxTimestamp - maxOutOfOrderness;
            if (potentialWM >= lastEmittedWatermark) {
                lastEmittedWatermark = potentialWM;
            }
            Watermark watermark = new Watermark(lastEmittedWatermark);
            log.warn("当前水印: "+new DateTime(watermark.getTimestamp(), DateTimeZone.forID("+08:00")).toString("yyyy-MM-dd HH:mm:ss"));
            return watermark;
        }

        @Override
        public final long extractTimestamp(Tuple2<String, String> element, long previousElementTimestamp) {
            DateTimeFormatter dateTimeFormatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss");
            DateTime dateTime = DateTime.parse(element.f0, dateTimeFormatter);
            long timestamp = dateTime.getMillis();
            if (timestamp > currentMaxTimestamp) {
                currentMaxTimestamp = timestamp;
            }
            return timestamp;
        }
    }

}

Debugging and Verification

Comma-separated test records of the form eventTime,eventType were sent to the Kafka topic one at a time; the table shows the watermark and window behavior observed after each record.

| No. | Record | EventTime | Watermark after this record | Window for this EventTime | Window computation triggered | Window Key | Window Value |
|-----|--------|-----------|-----------------------------|---------------------------|------------------------------|------------|--------------|
| 1 | 2019-10-19 08:46:00,browse | 2019-10-19 08:46:00 | 2019-10-19 08:45:50 | [2019-10-19 08:46:00, 2019-10-19 08:46:30) | | | |
| 2 | 2019-10-19 08:46:10,browse | 2019-10-19 08:46:10 | 2019-10-19 08:46:00 | [2019-10-19 08:46:00, 2019-10-19 08:46:30) | | | |
| 3 | 2019-10-19 08:46:20,browse | 2019-10-19 08:46:20 | 2019-10-19 08:46:10 | [2019-10-19 08:46:00, 2019-10-19 08:46:30) | | | |
| 4 | 2019-10-19 08:46:30,browse | 2019-10-19 08:46:30 | 2019-10-19 08:46:20 | [2019-10-19 08:46:30, 2019-10-19 08:47:00) | | | |
| 5 | 2019-10-19 08:46:40,browse | 2019-10-19 08:46:40 | 2019-10-19 08:46:30 | [2019-10-19 08:46:30, 2019-10-19 08:47:00) | First fire of [08:46:00, 08:46:30) | browse | 3 |
| 6 | 2019-10-19 08:46:09,browse | 2019-10-19 08:46:09 | 2019-10-19 08:46:30 | [2019-10-19 08:46:00, 2019-10-19 08:46:30) | Fires again | browse | 3+1=4 |
| 7 | 2019-10-19 08:46:05,browse | 2019-10-19 08:46:05 | 2019-10-19 08:46:30 | [2019-10-19 08:46:00, 2019-10-19 08:46:30) | Fires again | browse | 3+1+1=5 |
| 8 | 2019-10-19 08:46:15,browse | 2019-10-19 08:46:15 | 2019-10-19 08:46:30 | [2019-10-19 08:46:00, 2019-10-19 08:46:30) | Fires again | browse | 3+1+1+1=6 |
| 9 | 2019-10-19 08:46:41,browse | 2019-10-19 08:46:41 | 2019-10-19 08:46:31 | [2019-10-19 08:46:30, 2019-10-19 08:47:00) | | | |
| 10 | 2019-10-19 08:46:42,browse | 2019-10-19 08:46:42 | 2019-10-19 08:46:32 | [2019-10-19 08:46:30, 2019-10-19 08:47:00) | | | |
| 11 | 2019-10-19 08:46:15,browse | 2019-10-19 08:46:15 | 2019-10-19 08:46:32 | [2019-10-19 08:46:00, 2019-10-19 08:46:30) | Fires again | browse | 3+1+1+1+1=7 |
| 12 | 2019-10-19 08:46:43,browse | 2019-10-19 08:46:43 | 2019-10-19 08:46:33 | [2019-10-19 08:46:30, 2019-10-19 08:47:00) | | | |
| 13 | 2019-10-19 08:46:44,browse | 2019-10-19 08:46:44 | 2019-10-19 08:46:34 | [2019-10-19 08:46:30, 2019-10-19 08:47:00) | | | |
| 14 | 2019-10-19 08:46:15,browse | 2019-10-19 08:46:15 | 2019-10-19 08:46:34 | [2019-10-19 08:46:00, 2019-10-19 08:46:30) | Fires again | browse | 3+1+1+1+1+1=8 |
| 15 | 2019-10-19 08:46:45,browse | 2019-10-19 08:46:45 | 2019-10-19 08:46:35 | [2019-10-19 08:46:30, 2019-10-19 08:47:00) | Destroys window [08:46:00, 08:46:30) | | |
| 16 | 2019-10-19 08:46:15,browse | 2019-10-19 08:46:15 | 2019-10-19 08:46:35 | [2019-10-19 08:46:00, 2019-10-19 08:46:30) | Already destroyed, does not fire | | |

Key Points

  1. The window length is 30 seconds. Before a window fires for the first time, the tolerated out-of-orderness is maxOutOfOrderness: 10 seconds. After a window has fired, the tolerated lateness is allowedLateness: 5 seconds.

  2. When record 5 (2019-10-19 08:46:40,browse) arrives: EventTime: 2019-10-19 08:46:40, Watermark: 2019-10-19 08:46:30. First fire: the Watermark is now >= the end time of window [2019-10-19 08:46:00, 2019-10-19 08:46:30), so that window's computation is triggered (see the replay sketch after this list).

  3. When record 6 (2019-10-19 08:46:09,browse) arrives: EventTime: 2019-10-19 08:46:09, Watermark: 2019-10-19 08:46:30. Late data for a window that has already fired. Since Watermark < end time of [2019-10-19 08:46:00, 2019-10-19 08:46:30) + AllowedLateness, the window computation is triggered again.

  4. When record 7 (2019-10-19 08:46:05,browse) arrives: EventTime: 2019-10-19 08:46:05, Watermark: 2019-10-19 08:46:30. Same as above: late data within AllowedLateness, so the window fires again.

  5. When record 8 (2019-10-19 08:46:15,browse) arrives: EventTime: 2019-10-19 08:46:15, Watermark: 2019-10-19 08:46:30. Same as above: the window fires again.

  6. When record 11 (2019-10-19 08:46:15,browse) arrives: EventTime: 2019-10-19 08:46:15, Watermark: 2019-10-19 08:46:32. Still Watermark < end time of [2019-10-19 08:46:00, 2019-10-19 08:46:30) + AllowedLateness, so the window fires again.

  7. When record 14 (2019-10-19 08:46:15,browse) arrives: EventTime: 2019-10-19 08:46:15, Watermark: 2019-10-19 08:46:34. Still Watermark < end time of [2019-10-19 08:46:00, 2019-10-19 08:46:30) + AllowedLateness, so the window fires again.

  8. When record 15 (2019-10-19 08:46:45,browse) arrives: EventTime: 2019-10-19 08:46:45, Watermark: 2019-10-19 08:46:35. The Watermark is now >= the end time of window [2019-10-19 08:46:00, 2019-10-19 08:46:30), which is 2019-10-19 08:46:30, plus 5 s (i.e. 2019-10-19 08:46:35), so the window is destroyed.

  9. When record 16 (2019-10-19 08:46:15,browse) arrives: EventTime: 2019-10-19 08:46:15, Watermark: 2019-10-19 08:46:35. This is again late data for window [2019-10-19 08:46:00, 2019-10-19 08:46:30), but since that window has already been destroyed, no computation is triggered.
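
To make these rules concrete, below is a minimal replay sketch in plain Java (no Flink dependency; the class name, the use of seconds relative to 08:46:00, and the lazy detection of window destruction are illustrative assumptions, not Flink's implementation). It replays the sixteen records from the table and reproduces the first-fire, fire-again, and destroy decisions:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/**
 * Replays the sixteen records from the table above. Timestamps are seconds
 * relative to 08:46:00; windowLength = 30 s, maxOutOfOrderness = 10 s,
 * allowedLateness = 5 s, matching the job configuration. Unlike Flink,
 * destruction is detected lazily, only when a too-late element arrives.
 */
public class WatermarkReplaySketch {

    static final long WINDOW = 30, OUT_OF_ORDER = 10, LATENESS = 5;

    public static void main(String[] args) {
        long[] events = {0, 10, 20, 30, 40, 9, 5, 15, 41, 42, 15, 43, 44, 15, 45, 15};
        long maxTs = Long.MIN_VALUE;
        Map<Long, Integer> counts = new TreeMap<>(); // window start -> element count
        Set<Long> fired = new HashSet<>();           // windows that fired at least once

        for (long ts : events) {
            maxTs = Math.max(maxTs, ts);
            long watermark = maxTs - OUT_OF_ORDER;          // bounded-out-of-orderness rule
            long winStart = ts - Math.floorMod(ts, WINDOW); // tumbling-window alignment
            long winEnd = winStart + WINDOW;

            if (watermark >= winEnd + LATENESS) {           // window already destroyed
                System.out.printf("t=%02d: window [%d,%d) destroyed, element dropped%n", ts, winStart, winEnd);
                continue;
            }
            counts.merge(winStart, 1, Integer::sum);
            if (fired.contains(winStart) && watermark >= winEnd) { // late but within allowedLateness
                System.out.printf("t=%02d: window [%d,%d) fires again, count=%d%n", ts, winStart, winEnd, counts.get(winStart));
            }
            // Advancing the watermark can fire other windows for the first time.
            for (Map.Entry<Long, Integer> e : counts.entrySet()) {
                long start = e.getKey();
                if (!fired.contains(start) && watermark >= start + WINDOW) {
                    fired.add(start);
                    System.out.printf("t=%02d: window [%d,%d) first fire, count=%d%n", ts, start, start + WINDOW, e.getValue());
                }
            }
        }
    }
}

Running it prints the same sequence of decisions as the table: the first fire at t=40 with count 3, re-fires with counts 4 through 8, and the dropped element at the final t=15.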

Summary

  1. The verification above was done with a parallelism of 1.

  2. Window start and end times: natural (epoch-aligned) time ranges, closed at the start and open at the end.

    Example: with 30 s windows, the ranges are [00:00:00, 00:00:30), [00:00:30, 00:01:00), [00:01:00, 00:01:30), and so on. The alignment arithmetic is shown in the first sketch after this list.

  3. Understanding MaxOutOfOrderness and AllowedLateness.

    In Flink, lateness and out-of-orderness are effectively the same thing; both MaxOutOfOrderness and AllowedLateness exist to deal with out-of-order data. The difference:

    A. MaxOutOfOrderness: how much out-of-orderness (i.e., how long to wait) is tolerated before a window fires for the first time.

    B. AllowedLateness: after a window has fired, how much longer late data is still accepted.

  4. When a window fires for the first time: Watermark >= Window End Time, i.e., EventTime >= Window End Time + MaxOutOfOrderness.

  5. When a window fires again: if late data arrives after the window has already fired, the window computation is triggered again (possibly several times) as long as Watermark < Window End Time + AllowedLateness.

  6. When a window is destroyed: Watermark >= Window End Time + AllowedLateness.

  7. Watermark generation and window triggering with multiple parallel subtasks.

    Each parallel subtask generates its own Watermark, but window triggering is governed by the minimum Watermark among them. Only after all Watermarks have advanced (aligned) are windows triggered according to the rules above.

  8. Windows not firing in a parallel Kafka stream.

    Example: a Kafka topic with 3 partitions and Flink parallelism 3, with data sent to only one partition. Debugging shows that only one subtask updates its Watermark; the Watermarks of the other two subtasks stay at Long.MIN_VALUE.

    As a result, even if one subtask reaches what looks like the trigger condition, the window does not actually fire, because the Watermarks of the other two subtasks have not advanced. This can be worked around by sending heartbeat records to the other partitions, or by other means (to be investigated; see the second sketch below).
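
For point 2, window boundaries depend only on the element timestamp, never on arrival time. A minimal sketch of the alignment arithmetic, using the same formula as Flink's TimeWindow.getWindowStartWithOffset (reproduced here so the snippet runs without Flink; class and variable names are illustrative):

import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class WindowAlignmentSketch {

    // Same formula as Flink's TimeWindow.getWindowStartWithOffset(timestamp, offset, windowSize):
    // it aligns an element timestamp down to the start of its tumbling window.
    static long getWindowStartWithOffset(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset + windowSize) % windowSize;
    }

    public static void main(String[] args) {
        ZoneOffset cst = ZoneOffset.of("+08:00");
        long windowSize = 30_000L; // 30-second windows, as in the job above
        long ts = LocalDateTime.of(2019, 10, 19, 8, 46, 9).toInstant(cst).toEpochMilli();
        long start = getWindowStartWithOffset(ts, 0L, windowSize);
        // Prints 2019-10-19T08:46+08:00 ~ 2019-10-19T08:46:30+08:00, i.e. the
        // window [08:46:00, 08:46:30), regardless of when the element arrives.
        System.out.println(Instant.ofEpochMilli(start).atOffset(cst)
                + " ~ " + Instant.ofEpochMilli(start + windowSize).atOffset(cst));
    }
}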
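
For point 8, a pointer for newer versions only: Flink 1.11+ replaced AssignerWithPeriodicWatermarks with WatermarkStrategy, whose withIdleness option marks a subtask that receives no data as idle so it stops holding back the minimum-watermark alignment. A sketch against that newer API (it does not compile against the legacy job above; the Duration values are illustrative):

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class IdlenessSketch {
    // Flink 1.11+ API, shown for reference. A subtask that receives no data for
    // 1 minute is marked idle and no longer holds back the minimum watermark,
    // so windows fed by the active partitions can still fire.
    static WatermarkStrategy<String> strategy() {
        return WatermarkStrategy
                .<String>forBoundedOutOfOrderness(Duration.ofSeconds(10)) // same 10 s bound as above
                .withIdleness(Duration.ofMinutes(1));
    }
}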
