Background
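Besides firing when a window ends, Flink SQL window computation can also emit intermediate accumulated results before the window closes, and can correct the accumulated result when late data arrives. Corrections are emitted as retractions: the old value is sent as UPDATE_BEFORE, followed by the new value as UPDATE_AFTER.
In the code on the Flink 1.15 master branch at the time of writing, these are experimental features. Starting from the configuration parameters, this post analyzes how they are implemented.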
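WindowEmitStrategy
WindowEmitStrategy receives and stores the parameters from the table configuration and creates the corresponding trigger.
It defines the following experimental parameters: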
@Experimental
val TABLE_EXEC_EMIT_EARLY_FIRE_ENABLED: ConfigOption[JBoolean] =
  key("table.exec.emit.early-fire.enabled")
    .booleanType()
    .defaultValue(Boolean.box(false))
    .withDescription("Specifies whether to enable early-fire emit." +
      "Early-fire is an emit strategy before watermark advanced to end of window.")

// It is a experimental config, will may be removed later.
@Experimental
val TABLE_EXEC_EMIT_EARLY_FIRE_DELAY: ConfigOption[Duration] =
  key("table.exec.emit.early-fire.delay")
    .durationType()
    .noDefaultValue()
    .withDescription("The early firing delay in milli second, early fire is " +
      "the emit strategy before watermark advanced to end of window. " +
      "< 0 is illegal configuration. " +
      "0 means no delay (fire on every element). " +
      "> 0 means the fire interval. ")

// It is a experimental config, will may be removed later.
@Experimental
val TABLE_EXEC_EMIT_LATE_FIRE_ENABLED: ConfigOption[JBoolean] =
  key("table.exec.emit.late-fire.enabled")
    .booleanType()
    .defaultValue(Boolean.box(false))
    .withDescription("Specifies whether to enable late-fire emit. " +
      "Late-fire is an emit strategy after watermark advanced to end of window.")

// It is a experimental config, will may be removed later.
@Experimental
val TABLE_EXEC_EMIT_LATE_FIRE_DELAY: ConfigOption[Duration] =
  key("table.exec.emit.late-fire.delay")
    .durationType()
    .noDefaultValue()
    .withDescription("The late firing delay in milli second, late fire is " +
      "the emit strategy after watermark advanced to end of window. " +
      "< 0 is illegal configuration. " +
      "0 means no delay (fire on every element). " +
      "> 0 means the fire interval.")

// It is a experimental config, will may be removed later.
@Experimental
val TABLE_EXEC_EMIT_ALLOW_LATENESS: ConfigOption[Duration] =
  key("table.exec.emit.allow-lateness")
    .durationType()
    .noDefaultValue()
    .withDescription("Sets the time by which elements are allowed to be late. " +
      "Elements that arrive behind the watermark by more than the specified time " +
      "will be dropped. " +
      "Note: use the value if it is set, else use 'minIdleStateRetentionTime' in table config." +
      "< 0 is illegal configuration. " +
      "0 means disable allow lateness. " +
      "> 0 means allow-lateness.")
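These parameters work as follows:
- table.exec.emit.early-fire.enabled: whether to enable early fire. Early fire means emitting results before the watermark reaches the end of the window.
- table.exec.emit.early-fire.delay: the interval between early-fire emissions. A value greater than 0 emits a result once per interval. A value of 0 emits a result for every arriving element.
- table.exec.emit.late-fire.enabled: whether to enable late fire. Late fire means emitting results after the watermark has passed the end of the window.
- table.exec.emit.late-fire.delay: the interval between late-fire emissions. The logic mirrors the early-fire delay: a value greater than 0 emits a result once per interval, and a value of 0 emits a result for every arriving element.
- table.exec.emit.allow-lateness: the extra time for which a window's data is retained after the window ends. Once this time has elapsed, the window's data is cleared. Note how this differs from the watermark. A watermark also tolerates late elements, but elements that are late within the watermark's bound do not affect the correctness of the result, because the computation is simply deferred: it fires only once the watermark reaches the window end. allow-lateness also permits late elements, as the name suggests, but the results appear as retractions. Firing is decided by the watermark and does not wait for allow-lateness to elapse. So an element that arrives after the watermark has passed the window end but before allow-lateness expires makes Flink emit a retraction: first the previous result is invalidated (UPDATE_BEFORE), then the corrected result is emitted (UPDATE_AFTER).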
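As a configuration sketch, the options above could be set on a TableEnvironment roughly as follows. The option keys come from the listing above; the class name EmitConfigSketch and the concrete values (5 s, 0 s, 1 min) are illustrative assumptions, and the snippet needs the Flink Table API on the classpath.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class EmitConfigSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        Configuration conf = tEnv.getConfig().getConfiguration();

        // Emit an intermediate result every 5 seconds before the window ends (illustrative value).
        conf.setString("table.exec.emit.early-fire.enabled", "true");
        conf.setString("table.exec.emit.early-fire.delay", "5 s");

        // After the window end, a delay of 0 means "fire on every late element".
        conf.setString("table.exec.emit.late-fire.enabled", "true");
        conf.setString("table.exec.emit.late-fire.delay", "0 s");

        // Keep window state for one extra minute so late elements can still correct the result.
        conf.setString("table.exec.emit.allow-lateness", "1 min");
    }
}
```

Because these options are marked @Experimental, their keys and behavior may differ in other Flink versions.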
The getTrigger method creates the triggers for early fire and late fire. It is called from the createWindowOperator method of the StreamExecGroupWindowAggregate class.
The logic of getTrigger is as follows:
def getTrigger: Trigger[TimeWindow] = {
  // create earlyTrigger and lateTrigger
  val earlyTrigger = createTriggerFromFireDelay(earlyFireDelayEnabled, earlyFireDelay)
  val lateTrigger = createTriggerFromFireDelay(lateFireDelayEnabled, lateFireDelay)

  // build a different composite trigger depending on the time type
  if (isEventTime) {
    val trigger = EventTimeTriggers.afterEndOfWindow[TimeWindow]()

    (earlyTrigger, lateTrigger) match {
      // trigger of type EventTimeTriggers.AfterEndOfWindowEarlyAndLate
      case (Some(early), Some(late)) => trigger.withEarlyFirings(early).withLateFirings(late)
      // trigger of type EventTimeTriggers.AfterEndOfWindowNoLate
      case (Some(early), None) => trigger.withEarlyFirings(early)
      // EventTimeTriggers.AfterEndOfWindow (if the late trigger is an ElementTriggers.EveryElement)
      // or EventTimeTriggers.AfterEndOfWindowEarlyAndLate
      case (None, Some(late)) => trigger.withLateFirings(late)
      // trigger of type EventTimeTriggers.AfterEndOfWindow
      case (None, None) => trigger
    }
  } else {
    val trigger = ProcessingTimeTriggers.afterEndOfWindow[TimeWindow]()

    // late trigger is ignored, as no late element in processing time
    earlyTrigger match {
      case Some(early) => trigger.withEarlyFirings(early)
      case None => trigger
    }
  }
}
EventTimeTriggers is a composite trigger type that supports early fire and late fire. It is analyzed later.
Next, consider createTriggerFromFireDelay, the method that creates earlyTrigger and lateTrigger:
private def createTriggerFromFireDelay(
    enableDelayEmit: JBoolean,
    fireDelay: Duration): Option[Trigger[TimeWindow]] = {
  // check whether this emit strategy is enabled
  if (!enableDelayEmit) {
    None
  } else {
    if (fireDelay.toMillis > 0) {
      // if the delay is greater than 0, return a periodically firing processing-time trigger
      // of type ProcessingTimeTriggers.AfterFirstElementPeriodic
      Some(ProcessingTimeTriggers.every(fireDelay))
    } else {
      // otherwise return ElementTriggers.EveryElement, which fires only when an element arrives
      Some(ElementTriggers.every())
    }
  }
}
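As described in the parameter explanation above, a delay of 0 fires on every arriving element, while a delay greater than 0 produces a trigger that fires periodically in processing time, with the delay as its interval.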
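The decision made by createTriggerFromFireDelay can be restated as a small standalone Java sketch. FireDelaySketch, TriggerKind and triggerFor are hypothetical names that stand in for the Flink classes, and the explicit rejection of negative delays is an assumption taken from the option description ("< 0 is illegal configuration"), not something the method shown above does itself.

```java
import java.time.Duration;
import java.util.Optional;

public class FireDelaySketch {
    /** Hypothetical stand-ins for the trigger types named in the article. */
    public enum TriggerKind { PERIODIC_PROCESSING_TIME, EVERY_ELEMENT }

    /** Maps (enabled, delay) to the kind of trigger that would be created. */
    public static Optional<TriggerKind> triggerFor(boolean enabled, Duration fireDelay) {
        if (!enabled) {
            // strategy disabled: no trigger at all
            return Optional.empty();
        }
        if (fireDelay.isNegative()) {
            // assumption: negative delays are rejected, as the option description says they are illegal
            throw new IllegalArgumentException("fire delay must not be negative: " + fireDelay);
        }
        return fireDelay.toMillis() > 0
                ? Optional.of(TriggerKind.PERIODIC_PROCESSING_TIME) // fires once per interval
                : Optional.of(TriggerKind.EVERY_ELEMENT);           // fires on every element
    }

    public static void main(String[] args) {
        System.out.println(triggerFor(false, Duration.ofSeconds(5)));
        System.out.println(triggerFor(true, Duration.ofSeconds(5)));
        System.out.println(triggerFor(true, Duration.ZERO));
    }
}
```

Note that the comparison is made in milliseconds, so a sub-millisecond delay behaves like a delay of 0.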
EventTimeTriggers
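EventTimeTriggers contains several implementations that combine event-time early and late triggers. We analyze the most complex one, AfterEndOfWindowEarlyAndLate, which holds both an early trigger and a late trigger.
Its three key methods, onElement, onProcessingTime and onEventTime, are analyzed below. The trigger keeps a piece of state, hasFired, that records whether the window has entered the late-fire phase and therefore whether the early trigger or the late trigger is consulted.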
@Override
public boolean onElement(Object element, long timestamp, W window) throws Exception {
    // check whether the late trigger should be consulted
    Boolean hasFired = ctx.getPartitionedState(hasFiredOnTimeStateDesc).value();
    if (hasFired != null && hasFired) {
        // this is to cover the case where we recover from a failure and the watermark
        // is Long.MIN_VALUE but the window is already in the late phase.
        // delegate to lateTrigger.onElement
        return lateTrigger != null && lateTrigger.onElement(element, timestamp, window);
    } else {
        if (triggerTime(window) <= ctx.getCurrentWatermark()) {
            // we are in the late phase

            // if there is no late trigger then we fire on every late element
            // This also covers the case of recovery after a failure
            // where the currentWatermark will be Long.MIN_VALUE
            return true;
        } else {
            // we are in the early phase
            // register an event-time timer that fires at the end of the window
            ctx.registerEventTimeTimer(triggerTime(window));
            return earlyTrigger != null
                    && earlyTrigger.onElement(element, timestamp, window);
        }
    }
}

@Override
public boolean onProcessingTime(long time, W window) throws Exception {
    Boolean hasFired = ctx.getPartitionedState(hasFiredOnTimeStateDesc).value();
    if (hasFired != null && hasFired) {
        // late fire
        return lateTrigger != null && lateTrigger.onProcessingTime(time, window);
    } else {
        // early fire
        return earlyTrigger != null && earlyTrigger.onProcessingTime(time, window);
    }
}

@Override
public boolean onEventTime(long time, W window) throws Exception {
    ValueState<Boolean> hasFiredState = ctx.getPartitionedState(hasFiredOnTimeStateDesc);
    Boolean hasFired = hasFiredState.value();
    if (hasFired != null && hasFired) {
        // late fire
        return lateTrigger != null && lateTrigger.onEventTime(time, window);
    } else {
        if (time == triggerTime(window)) {
            // every element arriving in the early phase registers an event-time timer
            // for the end of the window, so reaching this branch means that event time
            // has reached the window end
            // fire on time and update state, so that late fire is used from now on
            hasFiredState.update(true);
            return true;
        } else {
            // early fire
            return earlyTrigger != null && earlyTrigger.onEventTime(time, window);
        }
    }
}
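The phase switch driven by hasFired can be illustrated with a small standalone simulation. EarlyLatePhaseSketch is a hypothetical class, not Flink code: the early and late triggers are reduced to a fixed "fires on element" answer (null meaning no trigger configured), and the nested triggers' own onEventTime behavior is assumed never to fire.

```java
public class EarlyLatePhaseSketch {
    private final long triggerTime;        // the window's on-time trigger time
    private final Boolean earlyOnElement;  // null = no early trigger configured
    private final Boolean lateOnElement;   // null = no late trigger configured
    private boolean hasFired = false;      // stands in for the hasFiredOnTime state

    public EarlyLatePhaseSketch(long triggerTime, Boolean earlyOnElement, Boolean lateOnElement) {
        this.triggerTime = triggerTime;
        this.earlyOnElement = earlyOnElement;
        this.lateOnElement = lateOnElement;
    }

    /** Mirrors the branch structure of onElement in the code above. */
    public boolean onElement(long currentWatermark) {
        if (hasFired) {
            // late phase: only the late trigger decides
            return lateOnElement != null && lateOnElement;
        }
        if (triggerTime <= currentWatermark) {
            // watermark already passed the window end but no on-time firing is recorded
            return true;
        }
        // early phase: only the early trigger decides
        return earlyOnElement != null && earlyOnElement;
    }

    /** Mirrors onEventTime; nested triggers are assumed not to fire on event time. */
    public boolean onEventTime(long time) {
        if (!hasFired && time == triggerTime) {
            hasFired = true; // on-time firing switches the window into the late phase
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // early trigger is periodic (does not fire on elements), late trigger fires on every element
        EarlyLatePhaseSketch trigger = new EarlyLatePhaseSketch(9_999L, false, true);
        System.out.println("early element fires: " + trigger.onElement(5_000L));
        System.out.println("on-time timer fires: " + trigger.onEventTime(9_999L));
        System.out.println("late element fires:  " + trigger.onElement(12_000L));
    }
}
```

In this model, a window with no late trigger stops firing on elements once the on-time firing has been recorded, which matches the `lateTrigger != null && ...` guard in the original method.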
WindowOperator
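The WindowOperator used by the Table runtime is located at flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/operators/window/WindowOperator.java. It contains the logic Flink SQL uses to store and compute window data.
WindowOperator has two subclasses: AggregateWindowOperator (one result per aggregation group) and TableAggregateWindowOperator (multiple results per aggregation group).
When the operator receives a record, its processElement method is called. The annotated logic is as follows: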
@Override
public void processElement(StreamRecord<RowData> record) throws Exception {
    RowData inputRow = record.getValue();
    long timestamp;
    // get the event time or processing time of the element
    if (windowAssigner.isEventTime()) {
        timestamp = inputRow.getLong(rowtimeIndex);
    } else {
        timestamp = internalTimerService.currentProcessingTime();
    }
    // convert to a UTC timestamp
    timestamp = TimeWindowUtil.toUtcTimestampMills(timestamp, shiftTimeZone);

    // the windows which the input row should be placed into
    // for GeneralWindowProcessFunction and MergingWindowProcessFunction,
    // assignStateNamespace and assignActualWindows return the same windows
    // for PanedWindowProcessFunction, assignStateNamespace returns the panes (sliced windows)
    Collection<W> affectedWindows = windowFunction.assignStateNamespace(inputRow, timestamp);
    // tracks whether the element is dropped
    boolean isElementDropped = true;
    for (W window : affectedWindows) {
        // the element belongs to at least one window, so it is not dropped
        isElementDropped = false;

        windowState.setCurrentNamespace(window);
        // get the current accumulator
        RowData acc = windowState.value();
        if (acc == null) {
            // no accumulator yet, create one
            acc = windowAggregator.createAccumulators();
        }
        windowAggregator.setAccumulators(window, acc);

        // accumulate the new value, or retract it
        if (RowDataUtil.isAccumulateMsg(inputRow)) {
            windowAggregator.accumulate(inputRow);
        } else {
            windowAggregator.retract(inputRow);
        }
        acc = windowAggregator.getAccumulators();
        // store the updated accumulator
        windowState.update(acc);
    }

    // the actual window which the input row is belongs to
    Collection<W> actualWindows = windowFunction.assignActualWindows(inputRow, timestamp);
    for (W window : actualWindows) {
        // the element belongs to at least one window, so it is not dropped
        isElementDropped = false;
        triggerContext.window = window;
        // if the trigger fires when the element arrives, emit the window's aggregation result
        boolean triggerResult = triggerContext.onElement(inputRow, timestamp);
        if (triggerResult) {
            emitWindowResult(window);
        }
        // register a clean up timer for the window
        registerCleanupTimer(window);
    }

    if (isElementDropped) {
        // markEvent will increase numLateRecordsDropped
        // record the dropped element so that the drop rate can be computed
        lateRecordsDroppedRate.markEvent();
    }
}
registerCleanupTimer registers a timer that cleans up the window.
private void registerCleanupTimer(W window) {
    long cleanupTime = toEpochMillsForTimer(cleanupTime(window), shiftTimeZone);
    if (cleanupTime == Long.MAX_VALUE) {
        // don't set a GC timer for "end of time"
        return;
    }

    // register a different kind of timer depending on the time type
    if (windowAssigner.isEventTime()) {
        triggerContext.registerEventTimeTimer(cleanupTime);
    } else {
        triggerContext.registerProcessingTimeTimer(cleanupTime);
    }
}
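The firing time of the timer registered by registerCleanupTimer is determined by the cleanupTime method, which computes the point in time at which the window should be cleaned up.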
private long cleanupTime(W window) {
    if (windowAssigner.isEventTime()) {
        // with event time, the configured allowedLateness (how late elements may be) is taken
        // into account, so the window's data is retained for an extra allowedLateness
        long cleanupTime = Math.max(0, window.maxTimestamp() + allowedLateness);
        return cleanupTime >= window.maxTimestamp() ? cleanupTime : Long.MAX_VALUE;
    } else {
        return Math.max(0, window.maxTimestamp());
    }
}
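This shows that for event time, the cleanup time of a window is normally window.maxTimestamp() + allowedLateness. This matches the explanation of the table.exec.emit.allow-lateness parameter above: after the window ends, its data is retained for the additional time configured by allow-lateness.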
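The arithmetic in cleanupTime, including its overflow guard, can be checked with a standalone sketch. CleanupTimeSketch is a hypothetical class that takes the window's maximum timestamp and the allowed lateness as plain long values instead of reading them from a window and an operator field.

```java
public class CleanupTimeSketch {
    /** Same arithmetic as cleanupTime above, with the window reduced to its max timestamp. */
    public static long cleanupTime(long maxTimestamp, long allowedLateness, boolean isEventTime) {
        if (isEventTime) {
            long cleanupTime = Math.max(0, maxTimestamp + allowedLateness);
            // if the addition overflowed, the sum is smaller than maxTimestamp: treat as "never"
            return cleanupTime >= maxTimestamp ? cleanupTime : Long.MAX_VALUE;
        } else {
            return Math.max(0, maxTimestamp);
        }
    }

    public static void main(String[] args) {
        // a window covering [0, 10000) has max timestamp 9999; 5 seconds of allowed lateness
        System.out.println(cleanupTime(9_999L, 5_000L, true));
        // processing time ignores allowed lateness
        System.out.println(cleanupTime(9_999L, 5_000L, false));
        // overflow case: the sum wraps around, so no cleanup timer should be set
        System.out.println(cleanupTime(Long.MAX_VALUE - 1, 10L, true) == Long.MAX_VALUE);
    }
}
```

The Long.MAX_VALUE result is the value that registerCleanupTimer checks for before deciding not to register a timer.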
Next, we analyze the onEventTime method, which runs when an event-time timer fires:
@Override
public void onEventTime(InternalTimer<K, W> timer) throws Exception {
    setCurrentKey(timer.getKey());

    // get the window associated with the timer
    triggerContext.window = timer.getNamespace();
    if (triggerContext.onEventTime(timer.getTimestamp())) {
        // fire
        // emit the accumulated result of the window
        emitWindowResult(triggerContext.window);
    }

    // if the window's cleanup time has been reached, clear the window's data
    // the cleanup time is computed with the same logic as the cleanupTime method
    if (windowAssigner.isEventTime()) {
        windowFunction.cleanWindowIfNeeded(triggerContext.window, timer.getTimestamp());
    }
}
The onProcessingTime method is shown below. It mirrors onEventTime, so it is not analyzed again.
@Override
public void onProcessingTime(InternalTimer<K, W> timer) throws Exception {
    setCurrentKey(timer.getKey());

    triggerContext.window = timer.getNamespace();
    if (triggerContext.onProcessingTime(timer.getTimestamp())) {
        // fire
        emitWindowResult(triggerContext.window);
    }

    if (!windowAssigner.isEventTime()) {
        windowFunction.cleanWindowIfNeeded(triggerContext.window, timer.getTimestamp());
    }
}
How AggregateWindowOperator generates retract data
emitWindowResult is an abstract method of WindowOperator. The logic that generates retract data lives in its subclass AggregateWindowOperator, whose code is analyzed next.
@Override
protected void emitWindowResult(W window) throws Exception {
    // get the aggregation result of the window
    windowFunction.prepareAggregateAccumulatorForEmit(window);
    RowData acc = aggWindowAggregator.getAccumulators();
    RowData aggResult = aggWindowAggregator.getValue(window);
    // if update-type records should be produced
    if (produceUpdates) {
        // get the previously emitted result
        previousState.setCurrentNamespace(window);
        RowData previousAggResult = previousState.value();

        // recordCounter is initialized in the WindowOperator constructor from inputCountIndex
        // inputCountIndex is the position of COUNT(*) in the SQL, or -1 if there is no COUNT(*)
        // see the documentation of RecordCounter.of for details
        // recordCounter.recordCountIsZero returns whether the COUNT(*) field of the RowData is 0
        // if it is not 0, emit UPDATE_BEFORE and UPDATE_AFTER when the old and new results differ,
        // and emit nothing when they are equal
        // if it is 0, emit a DELETE retraction
        if (!recordCounter.recordCountIsZero(acc)) {
            // has emitted result for the window
            if (previousAggResult != null) {
                // current agg is not equal to the previous emitted, should emit retract
                if (!equaliser.equals(aggResult, previousAggResult)) {
                    // send UPDATE_BEFORE
                    collect(
                            RowKind.UPDATE_BEFORE,
                            (RowData) getCurrentKey(),
                            previousAggResult);
                    // send UPDATE_AFTER
                    collect(RowKind.UPDATE_AFTER, (RowData) getCurrentKey(), aggResult);
                    // update previousState
                    // the current result becomes the previously emitted result
                    previousState.update(aggResult);
                }
                // if the previous agg equals to the current agg, no need to send retract and
                // accumulate
            }
            // the first fire for the window, only send INSERT
            else {
                // send INSERT
                collect(RowKind.INSERT, (RowData) getCurrentKey(), aggResult);
                // update previousState
                previousState.update(aggResult);
            }
        } else {
            // emit a DELETE retraction
            // has emitted result for the window
            // we retracted the last record for this key
            if (previousAggResult != null) {
                // send DELETE
                collect(RowKind.DELETE, (RowData) getCurrentKey(), previousAggResult);
                // clear previousState
                previousState.clear();
            }
            // if the counter is zero, no need to send accumulate
        }
    } else {
        // update-type records are not produced
        if (!recordCounter.recordCountIsZero(acc)) {
            // the COUNT result is not 0
            // send INSERT
            collect(RowKind.INSERT, (RowData) getCurrentKey(), aggResult);
        }
        // otherwise nothing is emitted
        // if the counter is zero, no need to send accumulate
        // there is no possible skip `if` branch when `produceUpdates` is false
    }
}
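The branching in emitWindowResult can be replayed with a small standalone model. RetractEmitSketch is a hypothetical class, not Flink code: the aggregation result is reduced to a single long, the record count is passed in directly, and emitted rows are recorded as strings prefixed with +I, -U, +U and -D as shorthand for INSERT, UPDATE_BEFORE, UPDATE_AFTER and DELETE.

```java
import java.util.ArrayList;
import java.util.List;

public class RetractEmitSketch {
    private final boolean produceUpdates;
    private Long previousResult;                        // null until a result has been emitted
    public final List<String> out = new ArrayList<>();  // records what would be collected

    public RetractEmitSketch(boolean produceUpdates) {
        this.produceUpdates = produceUpdates;
    }

    /** Follows the same branches as emitWindowResult for one firing of one window. */
    public void emit(long recordCount, long aggResult) {
        if (!produceUpdates) {
            if (recordCount != 0) {
                out.add("+I " + aggResult);
            }
            return;
        }
        if (recordCount != 0) {
            if (previousResult == null) {
                // first firing for this window: plain INSERT
                out.add("+I " + aggResult);
                previousResult = aggResult;
            } else if (previousResult.longValue() != aggResult) {
                // result changed: retract the old value, then send the new one
                out.add("-U " + previousResult);
                out.add("+U " + aggResult);
                previousResult = aggResult;
            }
            // unchanged result: nothing to emit
        } else if (previousResult != null) {
            // all records were retracted: delete the previously emitted row
            out.add("-D " + previousResult);
            previousResult = null;
        }
    }

    public static void main(String[] args) {
        RetractEmitSketch window = new RetractEmitSketch(true);
        window.emit(3, 30); // early fire
        window.emit(3, 30); // fires again with an unchanged result
        window.emit(4, 45); // a late element changes the result
        window.emit(0, 0);  // every input row has been retracted
        window.out.forEach(System.out::println);
    }
}
```

The second call emits nothing because the result is unchanged, which is why repeated early or late firings do not flood downstream operators with identical updates.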
This post is the author's original work. Discussion and corrections are welcome. Please credit the source when reposting.