Flink Internals: EventTime and Watermark

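Version: 1.11
Watermarks only matter for EventTime applications (IngestionTime can also generate watermarks, but it is rarely used). A watermark is a timestamp asserting that the EventTime of every later record is greater than the watermark; records that violate this are treated as late and are not included in the computation. For window operators in particular, watermarks trigger window computation and cleanup.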

1. Assigning EventTime and generating Watermarks

1.1 Via SourceFunction

Call the following in the run method:

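@Override
public void run(SourceContext<T> ctx) throws Exception {
    // element, timestamp and watermarkTimestamp are placeholders for your own values
    ctx.collectWithTimestamp(element, timestamp);
    ctx.emitWatermark(new Watermark(watermarkTimestamp));
}

1.2 Via the assignTimestampsAndWatermarks method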

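Older versions shipped two interfaces for generating watermarks: AssignerWithPeriodicWatermarks (time-based) and AssignerWithPunctuatedWatermarks (driven by specific records). Both are deprecated as of 1.11. The method now takes a WatermarkStrategy, which supplies a WatermarkGenerator (produces watermarks) and a TimestampAssigner (assigns EventTime). Users can implement these themselves or start from the built-in factory methods such as WatermarkStrategy.forMonotonousTimestamps() and WatermarkStrategy.forBoundedOutOfOrderness(Duration).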

public interface WatermarkStrategy<T> extends
        TimestampAssignerSupplier<T>, WatermarkGeneratorSupplier<T> {
    @Override
    WatermarkGenerator<T> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context);

    @Override
    default TimestampAssigner<T> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
        return new RecordTimestampAssigner<>();
    }
}

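When the Task is scheduled onto a TaskManager and executed, these two methods are called from the open method of TimestampsAndWatermarksOperator, the operator behind assignTimestampsAndWatermarks (that is, during operator initialization).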

    @Override
    public void open() throws Exception {
        super.open();
        // create the object that assigns EventTime
        timestampAssigner = watermarkStrategy.createTimestampAssigner(this::getMetricGroup);
        // create the object that generates Watermarks
        watermarkGenerator = watermarkStrategy.createWatermarkGenerator(this::getMetricGroup);

        wmOutput = new WatermarkEmitter(output, getContainingTask().getStreamStatusMaintainer());

        watermarkInterval = getExecutionConfig().getAutoWatermarkInterval();
        if (watermarkInterval > 0) {
            final long now = getProcessingTimeService().getCurrentProcessingTime();
            // register a timer that fires at (now + watermarkInterval) to generate a Watermark
            getProcessingTimeService().registerTimer(now + watermarkInterval, this);
        }
    }

For every record, the operator calls the timestampAssigner's extractTimestamp method and then the watermarkGenerator's onEvent method.

    @Override
    public void processElement(final StreamRecord<T> element) throws Exception {
        final T event = element.getValue();
        final long previousTimestamp = element.hasTimestamp() ? element.getTimestamp() : Long.MIN_VALUE;
        // assign EventTime
        final long newTimestamp = timestampAssigner.extractTimestamp(event, previousTimestamp);

        element.setTimestamp(newTimestamp);
        output.collect(element);
        watermarkGenerator.onEvent(event, newTimestamp, wmOutput);
    }

When the timer fires, it invokes TimestampsAndWatermarksOperator's onProcessingTime method, which in turn calls the watermarkGenerator's onPeriodicEmit method. The timer interval is set to 200 ms when env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) is called.

    @Override
    public void onProcessingTime(long timestamp) throws Exception {
        // important: this is where the Watermark is generated
        watermarkGenerator.onPeriodicEmit(wmOutput);

        final long now = getProcessingTimeService().getCurrentProcessingTime();
        // register the next timer
        getProcessingTimeService().registerTimer(now + watermarkInterval, this);
    }

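TimestampsAndWatermarksOperator does not blindly emit a watermark on every tick. Its WatermarkEmitter only forwards a watermark that is strictly greater than the last one it emitted: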

    // from TimestampsAndWatermarksOperator.WatermarkEmitter
    public void emitWatermark(Watermark watermark) {
        final long ts = watermark.getTimestamp();

        // drop watermarks that do not advance event time
        if (ts <= currentWatermark) {
            return;
        }

        currentWatermark = ts;

        output.emitWatermark(new org.apache.flink.streaming.api.watermark.Watermark(ts));
    }
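
The guard above can be modeled outside Flink in a few lines. This is a minimal sketch rather than Flink's code: the class name MonotonicEmitter and its list-based output are hypothetical stand-ins for the operator's emitter and its downstream output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the operator's watermark emitter; not a Flink class.
class MonotonicEmitter {
    private long currentWatermark = Long.MIN_VALUE;
    // Collects the watermarks that would be forwarded downstream.
    final List<Long> emitted = new ArrayList<>();

    void emitWatermark(long ts) {
        // Same guard as the operator: ignore anything that does not advance the watermark.
        if (ts <= currentWatermark) {
            return;
        }
        currentWatermark = ts;
        emitted.add(ts);
    }
}
```

Feeding this model the sequence 5, 5, 3, 9 forwards only 5 and 9, which is why a periodic generator that keeps reporting the same value causes no downstream traffic.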

Watermarks are sent downstream by broadcast:

// from RecordWriterOutput
public void emitWatermark(Watermark mark) {
    // abridged: the watermark is broadcast to every output channel
    recordWriter.broadcastEmit(serializationDelegate);
}

1.3 Example

This example migrates a program that uses the deprecated API. The original program:

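    // Generic parameters were lost in the original listing; Tuple4<Integer, Integer, Double, Long>
    // is assumed here, matching Flink's TopSpeedWindowing example.
    class CarTimestamp extends AscendingTimestampExtractor<Tuple4<Integer, Integer, Double, Long>> {
        private static final long serialVersionUID = 1L;

        @Override
        public long extractAscendingTimestamp(Tuple4<Integer, Integer, Double, Long> element) {
            return element.f3;
        }
    }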

The migrated version:

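import org.apache.flink.api.common.eventtime.TimestampAssigner;
import org.apache.flink.api.common.eventtime.TimestampAssignerSupplier;
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkGeneratorSupplier;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple4;

// Generic parameters were lost in the original listing; Tuple4<Integer, Integer, Double, Long> is assumed.
public class CarTimestampStrategy implements WatermarkStrategy<Tuple4<Integer, Integer, Double, Long>> {
    private TimestampAssigner<Tuple4<Integer, Integer, Double, Long>> assigner;

    @Override
    public WatermarkGenerator<Tuple4<Integer, Integer, Double, Long>> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {

        return new WatermarkGenerator<Tuple4<Integer, Integer, Double, Long>>() {
            // largest event timestamp seen so far
            long currentTimestamp = Long.MIN_VALUE;

            @Override
            public void onEvent(Tuple4<Integer, Integer, Double, Long> event, long eventTimestamp, WatermarkOutput output) {
                if (eventTimestamp >= this.currentTimestamp) {
                    this.currentTimestamp = eventTimestamp;
                } else {
                    // TODO: the ascending-timestamp assumption is violated; handle it (for example, log it)
                }
            }

            @Override
            public void onPeriodicEmit(WatermarkOutput output) {
                // the watermark trails the largest timestamp by 1 ms
                Watermark watermark = new Watermark(currentTimestamp == Long.MIN_VALUE ? Long.MIN_VALUE : currentTimestamp - 1);
                output.emitWatermark(watermark);
            }
        };
    }

    @Override
    public TimestampAssigner<Tuple4<Integer, Integer, Double, Long>> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
        assigner = new TimestampAssigner<Tuple4<Integer, Integer, Double, Long>>() {
            @Override
            public long extractTimestamp(Tuple4<Integer, Integer, Double, Long> element, long recordTimestamp) {
                // the fourth field carries the event timestamp
                return element.f3;
            }
        };
        return assigner;
    }
}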
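
To see what the generator above produces without running a Flink job, its onEvent/onPeriodicEmit logic can be replayed in plain Java. The following is a sketch under that assumption: AscendingWatermarkModel is a hypothetical name, and it returns the watermark instead of writing to a WatermarkOutput.

```java
// Hypothetical, Flink-free model of the generator's state machine.
class AscendingWatermarkModel {
    private long currentTimestamp = Long.MIN_VALUE;

    // Mirrors onEvent: remember the largest timestamp seen so far.
    void onEvent(long eventTimestamp) {
        if (eventTimestamp >= currentTimestamp) {
            currentTimestamp = eventTimestamp;
        }
        // A smaller timestamp violates the ascending assumption and is ignored here.
    }

    // Mirrors onPeriodicEmit: the watermark trails the largest timestamp by 1 ms.
    long onPeriodicEmit() {
        return currentTimestamp == Long.MIN_VALUE ? Long.MIN_VALUE : currentTimestamp - 1;
    }
}
```

The subtraction of 1 matters: after an event with timestamp 1500 the watermark is 1499, so a later record that carries the same timestamp 1500 is still ahead of the watermark and is not considered late.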

2. Watermark propagation

Some background is needed before the discussion:

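  • Watermarks are broadcast downstream, so every InputChannel receives them.
  • Each Task has an InputGate, and an InputGate has multiple InputChannels, as many as the sum of the upstream parallelism. Each InputChannel has a corresponding InputChannelStatus, which the Task manages through the StatusWatermarkValve inside the StreamTaskNetworkInput it uses to read data.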


[Figure: partial ExecutionGraph]

    // from StatusWatermarkValve
    protected static class InputChannelStatus {
        protected long watermark;
        protected StreamStatus streamStatus;
        protected boolean isWatermarkAligned;
    }

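Here, watermark records the latest watermark the upstream sent to this InputChannel, streamStatus indicates whether the InputChannel is ACTIVE or IDLE, and isWatermarkAligned indicates whether the channel's watermark is aligned with the Task's watermark (that is, greater than or equal to it). The Task has a watermark of its own. In an EventTime application, whenever the Task's watermark advances, the operator's processWatermark method is invoked. For window operators in particular, this fires the event-time timers whose timestamp is less than or equal to the watermark (each window registers one at its max timestamp, end - 1), which ultimately triggers window computation or state cleanup.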
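
The firing condition for a window can be written down as a one-line predicate. This is a minimal sketch, assuming Flink's convention that a window [start, end) has a max timestamp of end - 1 and that its event-time timer is registered at that timestamp; the WindowFiring class and its method names are hypothetical.

```java
// Hypothetical helper illustrating when a window's event-time timer fires; not Flink code.
class WindowFiring {
    // Max timestamp of a window [start, end): the last millisecond that belongs to it.
    static long maxTimestamp(long windowEnd) {
        return windowEnd - 1;
    }

    // The timer fires once the watermark reaches the window's max timestamp.
    static boolean fires(long windowEnd, long watermark) {
        return watermark >= maxTimestamp(windowEnd);
    }
}
```

For a window ending at 10000, a watermark of 9998 leaves the window open, while 9999 or anything larger fires it.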

2.1 Step 1: updating the InputChannel's watermark

// from StatusWatermarkValve
public void inputWatermark(Watermark watermark, int channelIndex) throws Exception {
        if (lastOutputStreamStatus.isActive() && channelStatuses[channelIndex].streamStatus.isActive()) {
            long watermarkMillis = watermark.getTimestamp();

            if (watermarkMillis > channelStatuses[channelIndex].watermark) {
                channelStatuses[channelIndex].watermark = watermarkMillis;

                if (!channelStatuses[channelIndex].isWatermarkAligned && watermarkMillis >= lastOutputWatermark) {
                    channelStatuses[channelIndex].isWatermarkAligned = true;
                }

                // now, attempt to find a new min watermark across all aligned channels
                findAndOutputNewMinWatermarkAcrossAlignedChannels();
            }
        }
    }

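In the code above:

  1. lastOutputStreamStatus.isActive() returns true when at least one InputChannelStatus is active, and false when all of them are IDLE.
  2. If the InputChannel currently being read is active and the watermark the upstream sent to it is greater than the (old) value recorded in its InputChannelStatus, the recorded value is updated.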

2.2 Step 2: updating the Task's watermark

// from StatusWatermarkValve
private void findAndOutputNewMinWatermarkAcrossAlignedChannels() throws Exception {
       long newMinWatermark = Long.MAX_VALUE;
       boolean hasAlignedChannels = false;

       for (InputChannelStatus channelStatus : channelStatuses) {
           if (channelStatus.isWatermarkAligned) {
               hasAlignedChannels = true;
               newMinWatermark = Math.min(channelStatus.watermark, newMinWatermark);
           }
       }

       if (hasAlignedChannels && newMinWatermark > lastOutputWatermark) {
           lastOutputWatermark = newMinWatermark;
           // send the watermark to the headOperator, which invokes that operator's processWatermark method
           output.emitWatermark(new Watermark(lastOutputWatermark));
       }
   }

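The code is not complicated. The logic is:

  1. An IDLE InputChannel is never watermark-aligned, so it is skipped.
  2. Take the minimum watermark across all watermark-aligned channelStatuses.
  3. If the new watermark is greater than the last one emitted, update it and emit.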
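
The two steps can be replayed in a small Flink-free model. This is a simplified sketch, not the real StatusWatermarkValve: the class name MinWatermarkValve is hypothetical, every channel is assumed to stay ACTIVE, and the method returns the Task's watermark instead of emitting it to an output.

```java
import java.util.Arrays;

// Hypothetical, simplified model of steps 1 and 2; all channels are assumed ACTIVE.
class MinWatermarkValve {
    private final long[] channelWatermarks;
    private final boolean[] aligned;
    private long lastOutputWatermark = Long.MIN_VALUE;

    MinWatermarkValve(int numChannels) {
        channelWatermarks = new long[numChannels];
        aligned = new boolean[numChannels];
        Arrays.fill(channelWatermarks, Long.MIN_VALUE); // no watermark received yet
        Arrays.fill(aligned, true);                     // channels start out aligned
    }

    // Step 1: update the channel. Step 2: recompute the Task's watermark. Returns the Task's watermark.
    long inputWatermark(long watermark, int channelIndex) {
        if (watermark > channelWatermarks[channelIndex]) {
            channelWatermarks[channelIndex] = watermark;
            if (!aligned[channelIndex] && watermark >= lastOutputWatermark) {
                aligned[channelIndex] = true;
            }
            long newMin = Long.MAX_VALUE;
            boolean hasAligned = false;
            for (int i = 0; i < channelWatermarks.length; i++) {
                if (aligned[i]) {
                    hasAligned = true;
                    newMin = Math.min(newMin, channelWatermarks[i]);
                }
            }
            if (hasAligned && newMin > lastOutputWatermark) {
                lastOutputWatermark = newMin;
            }
        }
        return lastOutputWatermark;
    }
}
```

With two channels, a watermark of 10 on channel 0 alone does not move the Task's watermark, because channel 1 has not reported anything yet. Once channel 1 reports 5 the Task advances to 5, and when channel 1 later reports 20 the Task advances to 10, the minimum across both channels.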

2.3 Question 1: when is an InputChannelStatus's isWatermarkAligned false?

// from StatusWatermarkValve
    public void inputStreamStatus(StreamStatus streamStatus, int channelIndex) throws Exception {
        if (streamStatus.isIdle() && channelStatuses[channelIndex].streamStatus.isActive()) {
            channelStatuses[channelIndex].streamStatus = StreamStatus.IDLE;
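            channelStatuses[channelIndex].isWatermarkAligned = false;
            // ... (rest of the method omitted)
        }
    }

It becomes false when the channel is still active and the upstream sends it a StreamStatus.IDLE message.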

2.4 Question 2: when does the upstream send StreamStatus.IDLE?

// from WatermarkContext
public void markAsTemporarilyIdle() {
            synchronized (checkpointLock) {
                streamStatusMaintainer.toggleStreamStatus(StreamStatus.IDLE);
            }
        }

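It is sent when the user calls ctx.markAsTemporarilyIdle() in the run method of their SourceFunction implementation. This is typically done when the source expects no data for a while.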
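
The effect of an idle upstream on the downstream Task's watermark can be shown with another Flink-free model. This is a simplified sketch under stated assumptions: IdleAwareValve and markIdle are hypothetical names, and the model deliberately ignores what the real valve does when every channel is idle or when an idle channel becomes active again.

```java
import java.util.Arrays;

// Hypothetical, simplified model: an idle channel stops holding back the Task's watermark.
class IdleAwareValve {
    private final long[] watermarks;
    private final boolean[] active;
    private final boolean[] aligned;
    private long lastOutputWatermark = Long.MIN_VALUE;

    IdleAwareValve(int numChannels) {
        watermarks = new long[numChannels];
        active = new boolean[numChannels];
        aligned = new boolean[numChannels];
        Arrays.fill(watermarks, Long.MIN_VALUE);
        Arrays.fill(active, true);
        Arrays.fill(aligned, true);
    }

    // A watermark arrives on a channel; watermarks on idle channels are ignored.
    long inputWatermark(long watermark, int channelIndex) {
        if (active[channelIndex] && watermark > watermarks[channelIndex]) {
            watermarks[channelIndex] = watermark;
            if (!aligned[channelIndex] && watermark >= lastOutputWatermark) {
                aligned[channelIndex] = true;
            }
            advance();
        }
        return lastOutputWatermark;
    }

    // The upstream of this channel reported IDLE: the channel is no longer aligned.
    long markIdle(int channelIndex) {
        if (active[channelIndex]) {
            active[channelIndex] = false;
            aligned[channelIndex] = false;
            advance();
        }
        return lastOutputWatermark;
    }

    // Take the minimum over aligned channels and move forward only if it grew.
    private void advance() {
        long newMin = Long.MAX_VALUE;
        boolean hasAligned = false;
        for (int i = 0; i < watermarks.length; i++) {
            if (aligned[i]) {
                hasAligned = true;
                newMin = Math.min(newMin, watermarks[i]);
            }
        }
        if (hasAligned && newMin > lastOutputWatermark) {
            lastOutputWatermark = newMin;
        }
    }
}
```

In this model, a silent channel 1 blocks the Task's watermark until it is marked idle. At that point the Task jumps to channel 0's watermark, which is the situation markAsTemporarilyIdle() is meant to resolve.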
