Flink Time Explained

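Preface

This article collects my study notes from the book "Flink入门与实战" (Flink: Introduction and Practice) and the 网易云课堂 (NetEase Cloud Classroom) course "flink大数据项目实战" (Flink Big Data Project Practice).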

 

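I. Time

1. A stream has three notions of time: Event Time, Ingestion Time, and Processing Time.

2. The relationship between the three kinds of time

[Figure 1: the relationship between the three kinds of time]

3. How to set the time characteristic: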

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

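II. How Flink Handles Out-of-Order Data

Data often arrives out of order while Flink is processing a stream. A window computation cannot wait indefinitely, so there has to be a mechanism that guarantees the window is triggered once a certain point in time has passed. That mechanism is the watermark.

Watermarks and timestamps only need to be specified when using Event Time. Both are measured in milliseconds.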
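The triggering rule described above can be sketched in plain Java, without any Flink dependency. This is a simplified model that assumes the behavior of Flink's default event-time trigger (a window covering [start, end) fires once the watermark reaches end minus one millisecond). The class and method names are hypothetical and exist only for this illustration.

```java
// A minimal sketch, not Flink code: models when an event-time window may fire.
final class WindowFiringSketch {

    /** Largest timestamp (in ms) that still belongs to a window ending at windowEndMillis (exclusive). */
    static long maxTimestamp(long windowEndMillis) {
        return windowEndMillis - 1;
    }

    /** The window may fire once the watermark has reached the window's last timestamp. */
    static boolean shouldFire(long watermarkMillis, long windowEndMillis) {
        return watermarkMillis >= maxTimestamp(windowEndMillis);
    }

    public static void main(String[] args) {
        long windowEnd = 10_000L; // the window covers [0 ms, 10000 ms)
        for (long watermark : new long[] {5_000L, 9_998L, 9_999L, 12_000L}) {
            System.out.println("watermark=" + watermark
                    + " fires=" + shouldFire(watermark, windowEnd));
        }
    }
}
```

The point of the sketch is that the window does not wait for "all" data: it only waits until the watermark says that no element with a timestamp inside the window is expected any more.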

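2.1 Watermark

1. Scenarios:

Watermarks in an in-order stream:

[Figure 2: watermarks in an in-order stream]

Watermarks in an out-of-order stream:

[Figure 3: watermarks in an out-of-order stream]

Watermarks in a stream with parallelism greater than one:

[Figure 4: watermarks in a parallel stream]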

2. Watermark alignment across parallel inputs:

When an operator has multiple inputs, its watermark is the minimum of the current watermarks of all its inputs.
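The alignment rule can be illustrated with a small plain-Java model. It is only a sketch under simplifying assumptions: the class name and method are hypothetical, each input starts at Long.MIN_VALUE (no watermark seen yet), and the operator's own watermark is never allowed to move backwards.

```java
import java.util.Arrays;

// A minimal sketch, not Flink code: an operator tracks one watermark per input
// and forwards the minimum of them.
final class WatermarkAlignmentSketch {
    private final long[] inputWatermarks;
    private long emitted = Long.MIN_VALUE;

    WatermarkAlignmentSketch(int numInputs) {
        inputWatermarks = new long[numInputs];
        Arrays.fill(inputWatermarks, Long.MIN_VALUE); // nothing received yet
    }

    /** Records a watermark from one input and returns the operator's watermark afterwards. */
    long onWatermark(int input, long watermark) {
        inputWatermarks[input] = Math.max(inputWatermarks[input], watermark);
        long min = Arrays.stream(inputWatermarks).min().getAsLong();
        emitted = Math.max(emitted, min); // the operator's watermark only moves forward
        return emitted;
    }

    public static void main(String[] args) {
        WatermarkAlignmentSketch op = new WatermarkAlignmentSketch(2);
        System.out.println(op.onWatermark(0, 100L)); // input 1 has not reported yet
        System.out.println(op.onWatermark(1, 50L));  // the slower input decides
        System.out.println(op.onWatermark(1, 200L)); // now input 0 is the slower one
    }
}
```

Taking the minimum means the slowest input decides how far event time has progressed, so a fast input cannot cause windows to fire before the slow input's data has arrived.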

2.2 How Watermarks Are Generated

1. When they are generated:

a. Immediately after data is received from the source

b. After operations such as map/filter (timestamp assigner / watermark generator)

Example code:

package com.zzh.testWindow;

import com.zzh.testJoin.Transcript;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import java.sql.Timestamp;


public class testWindow {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env=StreamExecutionEnvironment.createLocalEnvironment();
        // Use event time as the time characteristic
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        DataStream<Transcript> dataStream=env.fromElements(getTranscriptDataSource());

        // Assign timestamps and watermarks after the filter operator
        DataStream<Transcript> dataStreamWithTimeStamp=dataStream.filter(new FilterFunction<Transcript>() {
            @Override
            public boolean filter(Transcript transcript) throws Exception {
                // Keep only records with a score above 60
                return transcript.getScore()>60;
            }
        }).assignTimestampsAndWatermarks(new MyWaterMark(3500));

        dataStreamWithTimeStamp.timeWindowAll(Time.seconds(10)).reduce(new ReduceFunction<Transcript>(){
                    @Override
                    public Transcript reduce(Transcript lastData, Transcript newData) throws Exception {
                        System.out.println(lastData);
                        System.out.println(newData);
                        System.out.println("=====================");
                        lastData.setScore((lastData.getScore()+newData.getScore())/2);
                        return lastData;
                    }
        }).print();

        env.execute("finish");
    }



    private static Transcript[] getTranscriptDataSource(){
        return new Transcript[]{
                new Transcript("1","张三","语文",100, Timestamp.valueOf("2020-07-01 11:1:1").getTime()),
                new Transcript("2","李四","语文",78,Timestamp.valueOf("2020-07-01 11:3:1").getTime()),
                new Transcript("3","王五","语文",99,Timestamp.valueOf("2020-07-01 11:3:4").getTime()),
                new Transcript("4","赵六","语文",81,Timestamp.valueOf("2020-07-01 11:3:9").getTime()),
                new Transcript("5","钱七","语文",59,Timestamp.valueOf("2020-07-01 11:1:10").getTime()),
                new Transcript("6","马二","语文",97,Timestamp.valueOf("2020-07-01 11:1:12").getTime()),
        };
    }
}
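To see which 10-second windows the sample records above would fall into, the following plain-Java sketch computes the start of a tumbling window for a timestamp. It assumes windows aligned to multiples of the window size (no offset). The class name is hypothetical, and the timestamps are given as milliseconds since 11:00:00 of the sample day rather than as epoch milliseconds, which keeps the example independent of the time zone.

```java
// A minimal sketch, not Flink code: start of the tumbling window containing a timestamp.
final class TumblingWindowSketch {

    /** Returns the start of the window [start, start + sizeMillis) that contains the timestamp. */
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    public static void main(String[] args) {
        long size = 10_000L; // 10-second windows, as in timeWindowAll(Time.seconds(10))
        // Offsets from 11:00:00 for 11:01:01, 11:03:01, 11:03:04, 11:03:09 and 11:01:12.
        // The record at 11:01:10 is left out because its score of 59 does not pass the filter.
        long[] offsets = {61_000L, 181_000L, 184_000L, 189_000L, 72_000L};
        for (long ts : offsets) {
            long start = windowStart(ts, size);
            System.out.println(ts + " -> [" + start + ", " + (start + size) + ")");
        }
    }
}
```

Under these assumptions the three records at 11:03:01, 11:03:04 and 11:03:09 share one window, while the records at 11:01:01 and 11:01:12 each land in a window of their own.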

 

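2. Generation methods:

a. With periodic watermarks

Overview:

Flink calls getCurrentWatermark() periodically. If the returned watermark is not null and is greater than the previous watermark, it is emitted downstream.

Characteristics:

  • Triggered periodically
  • A watermark is injected into the stream automatically every N seconds
  • A maximum allowed out-of-orderness can be defined
  • Implement the AssignerWithPeriodicWatermarks interface
  • The watermark emission interval can be configured:
    ExecutionConfig.setAutoWatermarkInterval();

Example code: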

package com.zzh.testWindow;

import com.zzh.testJoin.Transcript;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

import javax.annotation.Nullable;

public class MyWaterMark implements AssignerWithPeriodicWatermarks<Transcript> {
    private long currentMaxTimeStamp;
    private long timeBounded;

    public MyWaterMark(long timeBounded){
        this.timeBounded=timeBounded;
    }

    @Nullable
    @Override
    public Watermark getCurrentWatermark() {
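        // A watermark is only emitted when it is greater than the previous one,
        // so return the largest timestamp seen so far minus the allowed out-of-orderness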
        return new Watermark(this.currentMaxTimeStamp-this.timeBounded);
    }

    @Override
    public long extractTimestamp(Transcript transcript, long l) {
        // Track the largest timestamp seen so far
        long currentTimeStamp=transcript.getTime();
        this.currentMaxTimeStamp=Math.max(currentTimeStamp,this.currentMaxTimeStamp);
        return currentTimeStamp;
    }
}
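The periodic approach can be modelled without Flink to make the two steps visible: every element updates the maximum timestamp, and a separate periodic call decides whether a new watermark is emitted. This is a sketch under assumptions: the class and method names are hypothetical, and the "emit only if greater than the previous watermark" check, which the runtime performs in Flink, is folded into the generator here for illustration.

```java
import java.util.OptionalLong;

// A minimal sketch, not Flink code: bounded-out-of-orderness watermarks emitted periodically.
final class PeriodicWatermarkSketch {
    private final long maxOutOfOrdernessMillis;
    private long currentMaxTimestamp = Long.MIN_VALUE;
    private long lastEmitted = Long.MIN_VALUE;

    PeriodicWatermarkSketch(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrdernessMillis = maxOutOfOrdernessMillis;
    }

    /** Called for every element: remembers the largest timestamp seen so far. */
    void onEvent(long timestampMillis) {
        currentMaxTimestamp = Math.max(currentMaxTimestamp, timestampMillis);
    }

    /** Called on a timer: returns a watermark only if it advances past the last emitted one. */
    OptionalLong onPeriodicEmit() {
        if (currentMaxTimestamp == Long.MIN_VALUE) {
            return OptionalLong.empty(); // no element seen yet
        }
        long candidate = currentMaxTimestamp - maxOutOfOrdernessMillis;
        if (candidate <= lastEmitted) {
            return OptionalLong.empty(); // watermarks never move backwards
        }
        lastEmitted = candidate;
        return OptionalLong.of(candidate);
    }

    public static void main(String[] args) {
        PeriodicWatermarkSketch generator = new PeriodicWatermarkSketch(3_500L);
        generator.onEvent(10_000L);
        generator.onEvent(8_000L); // out of order, does not lower the maximum
        System.out.println(generator.onPeriodicEmit());
        System.out.println(generator.onPeriodicEmit()); // nothing new to emit
    }
}
```

With a bound of 3500 ms, as passed to MyWaterMark in the job above, an element may arrive up to 3.5 seconds behind the largest timestamp seen so far and still be ahead of the watermark.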

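b. With punctuated watermarks

Characteristics:

  • Watermark generation is triggered by certain events
  • Every element is checked to decide whether a watermark should be generated
  • Implement AssignerWithPunctuatedWatermarks

Example code: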

package com.zzh.testWindow;

import com.zzh.testJoin.Transcript;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

import javax.annotation.Nullable;


public class PunctuatedWaterMark implements AssignerWithPunctuatedWatermarks<Transcript> {

    @Nullable
    @Override
    public Watermark checkAndGetNextWatermark(Transcript transcript, long l) {
        // l is the timestamp that extractTimestamp returned for this transcript
        return transcript.getTime()>0?new Watermark(l):null;
    }

    @Override
    public long extractTimestamp(Transcript transcript, long l) {
        return transcript.getTime();
    }
}
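In contrast to the periodic variant, a punctuated generator decides per element. The plain-Java sketch below assumes a hypothetical event type with an isMarker flag that signals "a watermark may be emitted here"; neither the flag nor the class names come from the original code, where the condition is simply transcript.getTime()>0.

```java
import java.util.OptionalLong;

// A minimal sketch, not Flink code: a watermark is produced only for marker elements.
final class PunctuatedWatermarkSketch {

    /** Hypothetical event type used only for this illustration. */
    static final class Event {
        final long timestamp;
        final boolean isMarker;

        Event(long timestamp, boolean isMarker) {
            this.timestamp = timestamp;
            this.isMarker = isMarker;
        }
    }

    private long lastEmitted = Long.MIN_VALUE;

    /** Checked for every element: emits a watermark only for markers that advance event time. */
    OptionalLong onEvent(Event event) {
        if (!event.isMarker || event.timestamp <= lastEmitted) {
            return OptionalLong.empty();
        }
        lastEmitted = event.timestamp;
        return OptionalLong.of(event.timestamp);
    }

    public static void main(String[] args) {
        PunctuatedWatermarkSketch generator = new PunctuatedWatermarkSketch();
        System.out.println(generator.onEvent(new Event(1_000L, false))); // ordinary element
        System.out.println(generator.onEvent(new Event(2_000L, true)));  // marker element
    }
}
```

Because the check runs for every element, a condition that is true for almost all elements (such as a timestamp greater than zero) produces a watermark per element, which adds overhead compared with the periodic variant.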

III. Predefined Timestamp Extractors and Watermark Emitters

3.1 For monotonically increasing timestamps

.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Transcript>() {
            @Override
            public long extractAscendingTimestamp(Transcript element) {
                return element.getTime();
            }
        });

3.2 For a fixed amount of delay

.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Transcript>(Time.seconds(10)) {
            @Override
            public long extractTimestamp(Transcript element) {
                return element.getTime();
            }
        });

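3.3 Handling late data

1. allowedLateness() sets the maximum time for which late data is still processed.

2. A side output tag (OutputTag) provides a way to retrieve late data afterwards, so that it is not dropped.

Example code:

        // Created as an anonymous subclass ({}) so that Flink can capture the element type
        OutputTag<Transcript> lateOutputTag=new OutputTag<Transcript>("late-date"){};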
        dataStreamWithTimeStamp.timeWindowAll(Time.seconds(10)).
                allowedLateness(Time.seconds(10)).
                sideOutputLateData(lateOutputTag).
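How the two settings interact can be sketched in plain Java. The model assumes that a window is considered expired once the watermark has reached the window's last timestamp plus the allowed lateness, and that an element whose window has expired goes to the side output. The class, enum and method names are hypothetical.

```java
// A minimal sketch, not Flink code: what happens to an element depending on how far
// the watermark has advanced relative to the element's window.
final class LateDataSketch {

    enum Outcome { ON_TIME, LATE_BUT_ACCEPTED, SIDE_OUTPUT }

    /**
     * @param windowEndMillis       exclusive end of the window the element belongs to
     * @param allowedLatenessMillis value passed to allowedLateness()
     * @param watermarkMillis       current watermark when the element arrives
     */
    static Outcome classify(long windowEndMillis, long allowedLatenessMillis, long watermarkMillis) {
        long maxTimestamp = windowEndMillis - 1;
        if (watermarkMillis < maxTimestamp) {
            return Outcome.ON_TIME; // the window has not fired yet
        }
        if (maxTimestamp + allowedLatenessMillis > watermarkMillis) {
            return Outcome.LATE_BUT_ACCEPTED; // the window fired but is still kept
        }
        return Outcome.SIDE_OUTPUT; // the window is gone, the element goes to the late-data tag
    }

    public static void main(String[] args) {
        long windowEnd = 10_000L; // 10-second window
        long lateness = 10_000L;  // allowedLateness(Time.seconds(10))
        for (long watermark : new long[] {5_000L, 9_999L, 19_998L, 19_999L}) {
            System.out.println("watermark=" + watermark + " -> "
                    + classify(windowEnd, lateness, watermark));
        }
    }
}
```

In a real job the elements classified as SIDE_OUTPUT here would be read back from the result stream through the same OutputTag instance.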

 
