Flink watermark

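1. Introduction

A Flink watermark is a special element in a DataStream, and each watermark carries a timestamp. When a watermark with timestamp T appears, it signals that all data with event time t <= T has arrived, so only data with event time t > T should follow it. In other words, the watermark is the criterion Flink uses to decide whether data is late, and it is also the marker that triggers windows. Its purpose is to handle out-of-order data in real-time streams, which is usually done by combining watermarks with windows.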
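
The contract described above can be written down in a few lines. This is only a minimal sketch in plain Java, not Flink code: the class and method names are made up for illustration, and it assumes timestamps are epoch milliseconds.

```java
/** Minimal sketch of the watermark contract; names are illustrative, not Flink API. */
class WatermarkContractSketch {

    /**
     * A watermark T promises that all data with event time t <= T has arrived,
     * so an element that shows up afterwards with t <= T counts as late.
     */
    static boolean isLate(long eventTime, long currentWatermark) {
        return eventTime <= currentWatermark;
    }

    public static void main(String[] args) {
        long watermark = 10_000L; // watermark T = 10 s
        System.out.println(isLate(9_500L, watermark));  // event time before T
        System.out.println(isLate(10_001L, watermark)); // event time after T
    }
}
```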

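2. When watermarks trigger windows

As mentioned above, out-of-order data is handled with watermark + window. So when should a window be triggered?

For event-time processing, Flink's default trigger conditions are:

For out-of-order and in-order data:

watermark timestamp >= window_end_time

Data exists in [window_start_time, window_end_time).

For data with many late elements:

Event Time > watermark timestamp

A watermark acts like an end line: once the watermark reaches or passes a window's window_end_time, that window starts its computation.

In other words, we compute watermarks according to some rule and build in a delay to give late data a chance. Late data is waited for only for a limited time; after that it no longer gets into the window.

The watermark time can be derived from Flink's system (wall-clock) time or from the event time carried by the data being processed.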

To sum up, a watermark triggers a window computation when:

1: watermark time >= window_end_time, i.e. max(timestamp, currentMaxTimestamp) - maxOutOfOrderness >= window_end_time
2: data exists in [window_start_time, window_end_time)
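
To make the two conditions concrete, here is a small plain-Java simulation. It is a sketch under stated assumptions, not Flink code: the 3-second out-of-orderness bound, the window [0, 5000) and the arrival order are all made up, and it uses the simplified rule from this article (watermark >= window_end_time).

```java
/** Sketch of the two trigger conditions; plain Java, not Flink code. */
class WindowTriggerSketch {
    static final long MAX_OUT_OF_ORDERNESS = 3_000L; // assumed bound, in ms

    /** Watermark = largest event time seen so far minus the out-of-orderness bound. */
    static long watermark(long currentMaxTimestamp) {
        return currentMaxTimestamp - MAX_OUT_OF_ORDERNESS;
    }

    /** Fires once the watermark has reached the window end and the window holds data. */
    static boolean shouldFire(long watermark, long windowEnd, int elementsInWindow) {
        return watermark >= windowEnd && elementsInWindow > 0;
    }

    public static void main(String[] args) {
        long windowEnd = 5_000L; // window [0, 5000)
        long[] arrivals = {1_000L, 4_000L, 3_000L, 7_000L, 8_000L}; // out of order
        long maxTs = Long.MIN_VALUE;
        int inWindow = 0;
        for (long ts : arrivals) {
            maxTs = Math.max(maxTs, ts);
            if (ts < windowEnd) {
                inWindow++;
            }
            long wm = watermark(maxTs);
            System.out.printf("ts=%d watermark=%d fire=%b%n",
                    ts, wm, shouldFire(wm, windowEnd, inWindow));
        }
    }
}
```

With these numbers the window only fires when the event with timestamp 8000 arrives, because that is the first point at which 8000 - 3000 >= 5000.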

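Handling of out-of-order events can be summarized as follows:


A window collects data periodically.
A watermark is a safeguard against out-of-order data (which is common), where not all of the expected data arrives within its event-time range.
allowedLateness postpones the final closing of a window by an additional period.
A side output is the last resort: once the target window has been closed for good, any data that still arrives for it is sent to the side output stream.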
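
The interplay of the four mechanisms can be sketched as a simple decision on the current watermark. This is an illustrative plain-Java model using the simplified comparisons from this article; the enum and method names are hypothetical, not Flink API.

```java
/** Sketch: what happens to an element depending on how far the watermark has advanced. */
class LatenessSketch {
    enum Fate { ON_TIME, LATE_BUT_ACCEPTED, SIDE_OUTPUT }

    /**
     * @param windowEnd       end of the window the element belongs to
     * @param allowedLateness extra time the window state is kept after firing
     * @param watermark       current watermark when the element arrives
     */
    static Fate classify(long windowEnd, long allowedLateness, long watermark) {
        if (watermark < windowEnd) {
            return Fate.ON_TIME;            // window has not fired yet
        }
        if (watermark < windowEnd + allowedLateness) {
            return Fate.LATE_BUT_ACCEPTED;  // window fired, but its state is still kept
        }
        return Fate.SIDE_OUTPUT;            // window is gone; only the side output remains
    }

    public static void main(String[] args) {
        System.out.println(classify(5_000L, 2_000L, 4_000L));
        System.out.println(classify(5_000L, 2_000L, 6_000L));
        System.out.println(classify(5_000L, 2_000L, 7_000L));
    }
}
```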
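3. Ways to generate watermarks
3.1 Punctuated Watermark

A punctuated watermark is generated when certain special marker events appear in the data stream. With this approach, window triggering does not depend on time but on when a marker event is received.

In production, the punctuated approach produces a large number of watermarks when TPS is high, which puts some pressure on downstream operators. It is therefore chosen only when real-time requirements are very strict.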


class PunctuatedAssigner extends AssignerWithPunctuatedWatermarks[MyEvent] {
    // Extract the event time from the element
    override def extractTimestamp(element: MyEvent, previousElementTimestamp: Long): Long = {
        element.getCreationTime
    }
    // Emit a watermark only when the element carries the marker; null means "no new watermark"
    override def checkAndGetNextWatermark(lastElement: MyEvent, extractedTimestamp: Long): Watermark = {
        if (lastElement.hasWatermarkMarker()) new Watermark(extractedTimestamp) else null
    }
}

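Here extractTimestamp extracts the event time from the message, and checkAndGetNextWatermark checks whether the event is a punctuation event and, if so, generates a new watermark. Unlike periodic watermarks, where getCurrentWatermark is called at a fixed interval, checkAndGetNextWatermark is called for every event received. If the return value is non-null and the new watermark is greater than the current one, window computation is triggered.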
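
The "non-null and greater than the current watermark" rule can be modeled without Flink. The following is only a sketch in plain Java: the Event type and its marker flag are hypothetical stand-ins for MyEvent and hasWatermarkMarker().

```java
/** Sketch of punctuated watermark advancement; plain Java, not Flink code. */
class PunctuatedSketch {

    /** Hypothetical event: a timestamp plus a flag that marks punctuation events. */
    static final class Event {
        final long timestamp;
        final boolean marker;

        Event(long timestamp, boolean marker) {
            this.timestamp = timestamp;
            this.marker = marker;
        }
    }

    private long currentWatermark = Long.MIN_VALUE;

    /** Called for every event; returns true only when the watermark actually advanced. */
    boolean onEvent(Event e) {
        if (e.marker && e.timestamp > currentWatermark) {
            currentWatermark = e.timestamp;
            return true;
        }
        return false; // not a marker, or the candidate watermark is not newer
    }

    long currentWatermark() {
        return currentWatermark;
    }

    public static void main(String[] args) {
        PunctuatedSketch sketch = new PunctuatedSketch();
        System.out.println(sketch.onEvent(new Event(100L, false))); // ordinary event
        System.out.println(sketch.onEvent(new Event(200L, true)));  // marker advances the watermark
        System.out.println(sketch.onEvent(new Event(150L, true)));  // older marker is ignored
    }
}
```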

Note: every increasing event time in the data stream produces a watermark.

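3.2 Periodic Watermark

A watermark is produced periodically (after a certain time interval or after a certain number of records). The interval at which the watermark advances is set by the user, regardless of whether new messages arrive. Some messages flow in between two watermark advances, and the user can compute the new watermark from that data.

In production, the periodic approach should combine both time and accumulated record count when producing watermarks; otherwise latency can become very large in extreme cases.

For example, the simplest watermark algorithm takes the largest event time seen so far. This approach is crude, however: it tolerates out-of-order events poorly and easily produces many late events.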


class BoundedOutOfOrdernessGenerator extends AssignerWithPeriodicWatermarks[MyEvent] {
    val maxOutOfOrderness = 3500L // 3.5 seconds
    var currentMaxTimestamp: Long = _ // initialized to 0
    override def extractTimestamp(element: MyEvent, previousElementTimestamp: Long): Long = {
        val timestamp = element.getCreationTime()
        currentMaxTimestamp = math.max(timestamp, currentMaxTimestamp)
        timestamp
    }
    override def getCurrentWatermark(): Watermark = {
        // return the watermark as current highest timestamp minus the out-of-orderness bound
        new Watermark(currentMaxTimestamp - maxOutOfOrderness)
    }
}

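Here extractTimestamp extracts the event time from the message, and getCurrentWatermark generates the new watermark; a new watermark takes effect only if it is greater than the current one. Each parallel instance of the operator has its own instance of this class, so member variables can be used to keep state, such as the current maximum timestamp in the example above.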
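
To see why subtracting a bound helps, the sketch below counts how many events would be late with the naive "largest event time" watermark (bound 0) versus the 3.5-second bound used in BoundedOutOfOrdernessGenerator. It is plain Java with made-up timestamps, and it updates the watermark after every event, which is the strictest case; with periodic emission the watermark lags behind this.

```java
/** Sketch: late-event count as a function of the out-of-orderness bound. */
class PeriodicSketch {

    /** Counts events whose timestamp is at or below the watermark when they arrive. */
    static int countLate(long[] timestamps, long maxOutOfOrderness) {
        long currentMax = Long.MIN_VALUE;
        int late = 0;
        for (long ts : timestamps) {
            // Watermark derived from the events seen before this one
            boolean seenAny = currentMax != Long.MIN_VALUE;
            if (seenAny && ts <= currentMax - maxOutOfOrderness) {
                late++;
            }
            currentMax = Math.max(currentMax, ts);
        }
        return late;
    }

    public static void main(String[] args) {
        long[] arrivals = {1_000L, 3_000L, 2_000L, 5_000L, 4_500L, 1_600L};
        System.out.println("bound 0 ms:    " + countLate(arrivals, 0L));
        System.out.println("bound 3500 ms: " + countLate(arrivals, 3_500L));
    }
}
```

The price of the larger bound is latency: every window fires roughly maxOutOfOrderness later than it would with the naive watermark.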

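4. WatermarkStrategy: the new watermark generation strategy since Flink 1.11

Versions before Flink 1.11 provided two watermark generation strategies, AssignerWithPunctuatedWatermarks and AssignerWithPeriodicWatermarks, both of which extend the TimestampAssigner interface. To avoid code duplication, Flink 1.11 reworked the watermark generation interfaces: watermarks are now set up uniformly through the assignTimestampsAndWatermarks method, which takes a WatermarkStrategy object.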

assignTimestampsAndWatermarks(WatermarkStrategy<T>)
4.1 WatermarkStrategy source code:

@Public
public interface WatermarkStrategy<T> extends
    TimestampAssignerSupplier<T>, WatermarkGeneratorSupplier<T> {
  /**
   * Instantiates a WatermarkGenerator that generates watermarks according to this strategy.
   */
  @Override
  WatermarkGenerator<T> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context);

  /**
   * Instantiates a {@link TimestampAssigner} for assigning timestamps according to this
   * strategy.
   */
  @Override
  default TimestampAssigner<T> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
    // By default, this is {@link RecordTimestampAssigner},
    // for cases where records come out of a source with valid timestamps, for example from Kafka.
    return new RecordTimestampAssigner<>();
  }

  // ------------------------------------------------------------------------
  //  Builder methods for enriching a base WatermarkStrategy
  // ------------------------------------------------------------------------

  /**
   * Creates a new {@code WatermarkStrategy} that wraps this strategy but instead uses the given
   * {@link TimestampAssigner} (via a {@link TimestampAssignerSupplier}).
   *
   * <p>You can use this when a {@link TimestampAssigner} needs additional context, for example
   * access to the metrics system.
   *
   * <pre>
   * {@code WatermarkStrategy<Object> wmStrategy = WatermarkStrategy
   *   .forMonotonousTimestamps()
   *   .withTimestampAssigner((ctx) -> new MetricsReportingAssigner(ctx));
   * }</pre>
   */
  default WatermarkStrategy<T> withTimestampAssigner(TimestampAssignerSupplier<T> timestampAssigner) {
    checkNotNull(timestampAssigner, "timestampAssigner");
    return new WatermarkStrategyWithTimestampAssigner<>(this, timestampAssigner);
  }

  /**
   * Creates a new {@code WatermarkStrategy} that wraps this strategy but instead uses the given
   * {@link SerializableTimestampAssigner}.
   *
   * <p>You can use this in case you want to specify a {@link TimestampAssigner} via a lambda
   * function.
   *
   * <pre>
   * {@code WatermarkStrategy<CustomObject> wmStrategy = WatermarkStrategy
   *   .forMonotonousTimestamps()
   *   .withTimestampAssigner((event, timestamp) -> event.getTimestamp());
   * }</pre>
   */
  default WatermarkStrategy<T> withTimestampAssigner(SerializableTimestampAssigner<T> timestampAssigner) {
    checkNotNull(timestampAssigner, "timestampAssigner");
    return new WatermarkStrategyWithTimestampAssigner<>(this,
        TimestampAssignerSupplier.of(timestampAssigner));
  }

  /**
   * Creates a new enriched {@link WatermarkStrategy} that also does idleness detection in the
   * created {@link WatermarkGenerator}.
   *
   * <p>Add an idle timeout to the watermark strategy. If no records flow in a partition of a
   * stream for that amount of time, then that partition is considered "idle" and will not hold
   * back the progress of watermarks in downstream operators.
   *
   * <p>Idleness can be important if some partitions have little data and might not have events
   * during some periods. Without idleness, these streams can stall the overall event time
   * progress of the application.
   */
  default WatermarkStrategy<T> withIdleness(Duration idleTimeout) {
    checkNotNull(idleTimeout, "idleTimeout");
    checkArgument(!(idleTimeout.isZero() || idleTimeout.isNegative()),
        "idleTimeout must be greater than zero");
    return new WatermarkStrategyWithIdleness<>(this, idleTimeout);
  }

  // ------------------------------------------------------------------------
  //  Convenience methods for common watermark strategies
  // ------------------------------------------------------------------------

  /**
   * Creates a watermark strategy for situations with monotonously ascending timestamps.
   *
   * <p>The watermarks are generated periodically and tightly follow the latest
   * timestamp in the data. The delay introduced by this strategy is mainly the periodic interval
   * in which the watermarks are generated.
   *
   * @see AscendingTimestampsWatermarks
   */
  static <T> WatermarkStrategy<T> forMonotonousTimestamps() {
    return (ctx) -> new AscendingTimestampsWatermarks<>();
  }

  /**
   * @see BoundedOutOfOrdernessWatermarks
   */
  static <T> WatermarkStrategy<T> forBoundedOutOfOrderness(Duration maxOutOfOrderness) {
    return (ctx) -> new BoundedOutOfOrdernessWatermarks<>(maxOutOfOrderness);
  }

  /**
   * Creates a watermark strategy based on an existing {@link WatermarkGeneratorSupplier}.
   */
  static <T> WatermarkStrategy<T> forGenerator(WatermarkGeneratorSupplier<T> generatorSupplier) {
    return generatorSupplier::createWatermarkGenerator;
  }

  /**
   * Creates a watermark strategy that generates no watermarks at all. This may be useful in
   * scenarios that do pure processing-time based stream processing.
   */
  static <T> WatermarkStrategy<T> noWatermarks() {
    return (ctx) -> new NoWatermarksGenerator<>();
  }
}

After creating a source (Kafka, for example), set a watermark strategy with a fixed delay:

wordSource.assignTimestampsAndWatermarks(
    WatermarkStrategy
            .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))      // tolerate events up to 5 seconds out of order
            .withTimestampAssigner((event, timestamp) -> event.f1));
4.2 Watermarks for monotonically increasing timestamps:
dataStream.assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps());

Demo: generating watermarks with WatermarkStrategy:

package it.kenn.eventtime;
 
import com.alibaba.fastjson.JSONObject;
import it.kenn.util.DateUtils;
import org.apache.flink.api.common.eventtime.*;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
 
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.Iterator;
import java.util.Properties;
 
 
/**
 * Demonstrates event time and watermarks
 */
public class EventTimeDemo {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.setParallelism(6);
        Properties properties = new Properties();
        properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "1test_34fldink182ddddd344356");
        properties.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
 
        SingleOutputStreamOperator<JSONObject> kafkaSource = env.addSource(new FlinkKafkaConsumer<>("metric-topic", new SimpleStringSchema(), properties)).map(JSONObject::parseObject);
 
        kafkaSource
                .assignTimestampsAndWatermarks(WatermarkStrategy
                        .<JSONObject>forBoundedOutOfOrderness(Duration.ofSeconds(5))// watermark strategy
                        .withTimestampAssigner((record, ts) -> {
                            DateTimeFormatter pattern = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
//                            LocalDateTime parse = LocalDateTime.parse(record.getString("@timestamp"), pattern).plusHours(8);
//                            return parse.toInstant(ZoneOffset.of("+8")).toEpochMilli();
                            return DateUtils.parseStringToLong(record.getString("@timestamp"),pattern,8, ChronoUnit.HOURS);
                        })// parse the event time
                        .withIdleness(Duration.ofMinutes(1))// how to treat idle streams, i.e. sources that may deliver no data for a while
                )
                .keyBy(new KeySelector<JSONObject, String>() {
                    @Override
                    public String getKey(JSONObject record){
                        if (record.containsKey("process") && record.getJSONObject("process").containsKey("name")){
                            return record.getJSONObject("process").getString("name");
                        }else {
                            return "unknown-process";
                        }
                    }
                })
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                // The four type parameters are the input type, output type, key type and TimeWindow; this process function receives all data of the 5 s window
                .process(new ProcessWindowFunction<JSONObject, Tuple2<String,Long>, String, TimeWindow>() {
                    @Override
                    public void process(String key, Context context, Iterable<JSONObject> iterable, Collector<Tuple2<String,Long>> collector) throws Exception {
                        String time = null;
                        Long ts = 0L;
                        Iterator<JSONObject> iterator = iterable.iterator();
                        if (iterator.hasNext()){
                            JSONObject next = iterator.next();
                            time = next.getString("@timestamp");
                            DateTimeFormatter pattern = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
//                            time = LocalDateTime.parse(time, pattern).plusHours(8).toString().replace("T"," ");
                            ts = DateUtils.parseStringToLong(time, pattern, 8, ChronoUnit.HOURS);
                        }
                        collector.collect(new Tuple2<>(key,ts));
                    }
                })
                .print();
//        kafkaSource.print();
        env.execute();
    }
}
package it.kenn.util;
 
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalUnit;
 
/**
 * Time utility class
 *
 * @author kenn
 * 2020-11-25 23:10
 */
public final class DateUtils {
 
    public static Long parseStringToLong(String time, DateTimeFormatter pattern, int offset, TemporalUnit unit) {
//        DateTimeFormatter pattern = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
        LocalDateTime dateTime = null;
        if (offset > 0){
            dateTime = LocalDateTime.parse(time, pattern).plus(offset, unit);
        }else if (offset < 0){
            dateTime = LocalDateTime.parse(time, pattern).minus(Math.abs(offset), unit);
        }else {
            dateTime = LocalDateTime.parse(time, pattern);
        }
        return dateTime.toInstant(ZoneOffset.of("+8")).toEpochMilli();
    }
 
    public static Long parseStringToLong(String time, DateTimeFormatter pattern) {
        return parseStringToLong(time, pattern, 0, null);
    }
 
    public static Long parseStringToLong(String time) {
        return parseStringToLong(time, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"));
    }
 
    public static LocalDateTime parseStringToDateTime(String time, DateTimeFormatter pattern) {
        return LocalDateTime.parse(time, pattern);
    }
 
    public static LocalDateTime parseStringToDateTime(String time) {
        return parseStringToDateTime(time, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"));
    }
}
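
The "plus 8 hours, then interpret as UTC+8" arithmetic in DateUtils is easy to get wrong, so here is a small cross-check. It is a sketch under the assumption that the trailing Z in @timestamp really denotes UTC; the class and method names are made up. Under that assumption, java.time.Instant.parse yields the same epoch milliseconds directly.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Sketch comparing two ways to turn "...Z" timestamps into epoch milliseconds. */
class TimestampParsingSketch {
    private static final DateTimeFormatter PATTERN =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");

    /** Same arithmetic as DateUtils with offset 8 hours: shift by +8 h, then read as UTC+8. */
    static long viaLocalDateTime(String time) {
        return LocalDateTime.parse(time, PATTERN)
                .plusHours(8)
                .toInstant(ZoneOffset.ofHours(8))
                .toEpochMilli();
    }

    /** Treats the trailing 'Z' as the UTC designator instead of a literal character. */
    static long viaInstant(String time) {
        return Instant.parse(time).toEpochMilli();
    }

    public static void main(String[] args) {
        String ts = "2020-11-25T15:10:00.123Z";
        System.out.println(viaLocalDateTime(ts) == viaInstant(ts));
    }
}
```

Note that the DateUtils overloads that pass an offset of 0 still convert with ZoneOffset.of("+8"), so they only agree with this result when the offset of 8 hours is supplied.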
4.3 Periodic watermarks

public class MonkeyPeriodicWatermarkGenerator implements WatermarkGenerator<Tuple2<String, Long>> {

    // Watermarks only move forward, so we always keep the largest event time seen so far
    private long currentTimestamp;
    // Maximum out-of-orderness allowed, in milliseconds
    private long maxOutOfOrderness = 3000;

    @Override
    public void onEvent(Tuple2<String, Long> event, long eventTimestamp, WatermarkOutput output) {
        currentTimestamp = Math.max(event.f1, currentTimestamp);
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // Emit the watermark, lagging behind by the allowed out-of-orderness
        output.emitWatermark(new Watermark(currentTimestamp - maxOutOfOrderness));
    }
}
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
ExecutionConfig config = env.getConfig();
// Set the watermark generation interval to 1 second, i.e. a watermark is added to the stream every second
config.setAutoWatermarkInterval(1000);

DataStreamSource<Tuple2<String, Long>> wordSource = env.addSource(new RichSourceFunction<Tuple2<String, Long>>() {
    private volatile Boolean isCancel;
    private int totalCount;

    @Override
    public void open(Configuration parameters) throws Exception {
        this.isCancel = false;
        this.totalCount = 0;
    }

    @Override
    public void run(SourceContext<Tuple2<String, Long>> ctx) throws Exception {
        while(!this.isCancel) {
            String word = RandomStringUtils.randomAlphabetic(10);
            ctx.collect(Tuple2.of(word, System.currentTimeMillis()));
            this.totalCount++;

            if(this.totalCount % 100 == 0) {
                TimeUnit.SECONDS.sleep(1);
            }
        }
    }

    @Override
    public void cancel() {
        this.isCancel = true;
    }
});

SingleOutputStreamOperator<Tuple2<String, Long>> wordWithTsDS =
 wordSource.assignTimestampsAndWatermarks(new WatermarkStrategy<Tuple2<String, Long>>() {
            @Override
            public WatermarkGenerator<Tuple2<String, Long>> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
                return new MonkeyPeriodicWatermarkGenerator();
            }

            @Override
            public TimestampAssigner<Tuple2<String, Long>> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
                return (event, ts) -> event.f1;
            }
        });

wordWithTsDS.map(tuple -> tuple.f0)
        .map(word -> Tuple2.of(word, 1), TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() {}))
        .keyBy(wordAndCnt -> wordAndCnt.f0)
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .reduce((wc1, wc2) -> Tuple2.of(wc1.f0, wc1.f1 + wc2.f1)).name("reduce")
        .print();

env.execute("Flink Eventtime and Watermark");
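Punctuated watermark
Next, I simulate punctuated watermarks in code. The source needs one change: a message it emits may or may not carry a timestamp, and as soon as we detect a timestamp we emit a watermark immediately.

First, watermarks here are emitted based on punctuation events: as soon as the second field of the tuple is found to be something other than -1, a watermark is emitted. Note one small detail when extracting the event time: the first time around there is no event time yet, so the default is Long.MIN_VALUE and the system reports an error, which is why we fall back to 0.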

public class PunctuatedWatermarkGenerator
        implements WatermarkGenerator<Tuple2<String, Long>>, TimestampAssigner<Tuple2<String, Long>> {
    @Override
    public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
        // Before extracting the event time, check whether the timestamp field is -1
        if (element.f1 != -1) {
            return element.f1;
        } else {
            // No timestamp on this element: return the previous event time, or 0 if there is none yet
            return recordTimestamp > 0 ? recordTimestamp : 0;
        }
    }

    @Override
    public void onEvent(Tuple2<String, Long> event, long eventTimestamp, WatermarkOutput output) {
        // Emit a watermark immediately for every element that carries a timestamp
        if (event.f1 != -1) {
            output.emitWatermark(new Watermark(event.f1));
        }
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // nothing to do: watermarks are emitted in onEvent
    }
}
4.4 Using the punctuated watermark generator

SingleOutputStreamOperator<Tuple2<String, Long>> wordWithTsDS =
        wordSource.assignTimestampsAndWatermarks(new WatermarkStrategy<Tuple2<String, Long>>() {
            @Override
            public WatermarkGenerator<Tuple2<String, Long>> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
                return new PunctuatedWatermarkGenerator();
            }

            @Override
            public TimestampAssigner<Tuple2<String, Long>> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
                return new PunctuatedWatermarkGenerator();
            }
        });
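4.5 Handling idle sources

In some cases little data is produced, so no data arrives for a while and no watermarks are generated, which causes problems for downstream operations that rely on watermarks. For example, when an operator has several upstream operators, its watermark is the minimum of the upstream watermarks. If one upstream operator produces no watermark for a long time because it has no data, event time becomes skewed and the downstream computation cannot be triggered.

Flink therefore provides WatermarkStrategy.withIdleness(), which lets users mark a stream as idle when no records arrive within the configured timeout. Downstream operators then no longer have to wait for watermarks from that stream.

The next time a watermark is generated and emitted downstream, the stream becomes active again.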
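
The effect of marking an input as idle can be sketched as "take the minimum over active inputs only". This is a plain-Java illustration, not Flink's implementation; in particular, returning Long.MIN_VALUE when every input is idle is an assumption made here to signal "no progress".

```java
/** Sketch: combining upstream watermarks while skipping idle inputs. */
class IdleInputsSketch {

    /** Combined watermark = minimum over the inputs that are not marked idle. */
    static long combinedWatermark(long[] inputWatermarks, boolean[] idle) {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (int i = 0; i < inputWatermarks.length; i++) {
            if (!idle[i]) {
                min = Math.min(min, inputWatermarks[i]);
                anyActive = true;
            }
        }
        // Assumption of this sketch: with no active input there is nothing to advance to
        return anyActive ? min : Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        long[] watermarks = {9_000L, 2_000L};
        // Both inputs active: the slow input holds everything back at 2000
        System.out.println(combinedWatermark(watermarks, new boolean[] {false, false}));
        // Slow input marked idle: the operator can advance to 9000
        System.out.println(combinedWatermark(watermarks, new boolean[] {false, true}));
    }
}
```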

In Flink, we can use withIdleness to configure an idle source.


SingleOutputStreamOperator<Tuple2<String, Long>> wordWithTsDS =
        wordSource.assignTimestampsAndWatermarks(WatermarkStrategy
                .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))      // tolerate events up to 5 seconds out of order
                .withIdleness(Duration.ofSeconds(15))                                       // mark the source idle after 15 seconds without records
                .withTimestampAssigner((event, timestamp) -> event.f1));

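Most of the time the built-in BoundedOutOfOrdernessWatermarks is all we need, together with a lambda expression that extracts the timestamp from the event. It is still worth understanding how it works, so that when something goes wrong we can locate the problem quickly.

Example demo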

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import javax.annotation.Nullable;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;


/**
 *
 * Watermark example
 *
 * Created by xuwei.tech.
 */
public class StreamingWindowWatermark {

    public static void main(String[] args) throws Exception {
        // Port of the socket to read from
        int port = 9000;
        // Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Use event time; the default is processing time
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Set parallelism to 1; the default parallelism is the number of CPUs on the current machine
        env.setParallelism(1);

        // Connect to the socket to read the input data
        DataStream<String> text = env.socketTextStream("hadoop100", port, "\n");

        // Parse the input data, expected as "key,timestampMillis"
        DataStream<Tuple2<String, Long>> inputMap = text.map(new MapFunction<String, Tuple2<String, Long>>() {
            @Override
            public Tuple2<String, Long> map(String value) throws Exception {
                String[] arr = value.split(",");
                return new Tuple2<>(arr[0], Long.parseLong(arr[1]));
            }
        });

        // Extract timestamps and generate watermarks
        DataStream<Tuple2<String, Long>> waterMarkStream = inputMap.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Tuple2<String, Long>>() {

            Long currentMaxTimestamp = 0L;
            final Long maxOutOfOrderness = 10000L;// maximum allowed out-of-orderness is 10 s

            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
            /**
             * Defines how the watermark is generated.
             * Called periodically at the configured auto-watermark interval
             * (200 ms by default in event-time mode)
             */
            @Nullable
            @Override
            public Watermark getCurrentWatermark() {
                return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
            }

            // Defines how the timestamp is extracted
            @Override
            public long extractTimestamp(Tuple2<String, Long> element, long previousElementTimestamp) {
                long timestamp = element.f1;
                currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
                System.out.println("key:"+element.f0+",eventtime:["+element.f1+"|"+sdf.format(element.f1)+"],currentMaxTimestamp:["+currentMaxTimestamp+"|"+
                        sdf.format(currentMaxTimestamp)+"],watermark:["+getCurrentWatermark().getTimestamp()+"|"+sdf.format(getCurrentWatermark().getTimestamp())+"]");
                return timestamp;
            }
        });
        // Tag for collecting data that would otherwise be dropped
        OutputTag<Tuple2<String,Long>> outputTag = new OutputTag<Tuple2<String,Long>>("late-data"){};
        // Group and aggregate; the result type must be SingleOutputStreamOperator so that getSideOutput is available
        SingleOutputStreamOperator<String> window = waterMarkStream.keyBy(0)
                .window(TumblingEventTimeWindows.of(Time.seconds(3)))// assign windows by event time; same effect as calling timeWindow
                .allowedLateness(Time.seconds(2))// allow data to be up to 2 s late
                .sideOutputLateData(outputTag)   // collect late data in a side output so it can be counted, stored and inspected later
                .apply(new WindowFunction<Tuple2<String, Long>, String, Tuple, TimeWindow>() {
                    /**
                     * Sorts the data in the window to guarantee ordering
                     * @param tuple
                     * @param window
                     * @param input
                     * @param out
                     * @throws Exception
                     */
                    @Override
                    public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple2<String, Long>> input, Collector<String> out) throws Exception {
                        String key = tuple.toString();
                        List<Long> arrarList = new ArrayList<Long>();
                        Iterator<Tuple2<String, Long>> it = input.iterator();
                        while (it.hasNext()) {
                            Tuple2<String, Long> next = it.next();
                            arrarList.add(next.f1);
                        }
                        Collections.sort(arrarList);
                        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
                        String result = key + "," + arrarList.size() + "," + sdf.format(arrarList.get(0)) + "," + sdf.format(arrarList.get(arrarList.size() - 1))
                                + "," + sdf.format(window.getStart()) + "," + sdf.format(window.getEnd());
                        out.collect(result);
                    }
                });
        // Print late data to the console for now; in practice it can be written to other storage
        DataStream<Tuple2<String,Long>> sideOut = window.getSideOutput(outputTag);
        sideOut.print();
        // For testing, simply print the results to the console
        window.print();

        // Note: Flink is lazily evaluated, so execute must be called for the code above to run
        env.execute("eventtime-watermark");

    }


}
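
When feeding test input to the demo over the socket, it helps to work out in advance which window an event lands in and when that window fires. The following plain-Java sketch does the arithmetic with the demo's settings (3 s tumbling windows, 10 s out-of-orderness, 2 s allowed lateness). It assumes non-negative timestamps, a zero window offset, and the simplified rule used in this article (watermark >= window_end_time); the helper names are made up.

```java
/** Sketch: window and firing arithmetic for the demo's settings. */
class DemoTimelineSketch {
    static final long WINDOW_SIZE = 3_000L;           // TumblingEventTimeWindows.of(Time.seconds(3))
    static final long MAX_OUT_OF_ORDERNESS = 10_000L; // maxOutOfOrderness in the demo
    static final long ALLOWED_LATENESS = 2_000L;      // allowedLateness(Time.seconds(2))

    /** Start of the tumbling window containing ts (non-negative ts, zero offset). */
    static long windowStart(long ts) {
        return ts - (ts % WINDOW_SIZE);
    }

    static long windowEnd(long ts) {
        return windowStart(ts) + WINDOW_SIZE;
    }

    /** Smallest "largest event time seen" for which watermark >= window end. */
    static long firesAtMaxTimestamp(long ts) {
        return windowEnd(ts) + MAX_OUT_OF_ORDERNESS;
    }

    /** From this "largest event time seen" on, late data for the window goes to the side output. */
    static long sideOutputFromMaxTimestamp(long ts) {
        return firesAtMaxTimestamp(ts) + ALLOWED_LATENESS;
    }

    public static void main(String[] args) {
        long ts = 10_000L; // hypothetical input line "a,10000"
        System.out.printf("window=[%d,%d) fires at maxTs=%d, side output from maxTs=%d%n",
                windowStart(ts), windowEnd(ts), firesAtMaxTimestamp(ts), sideOutputFromMaxTimestamp(ts));
    }
}
```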
