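First, with a parallelism of 1, we set the window size to 3 seconds and the maximum out-of-orderness used for the watermark to 10 s. A window fires only once the watermark passes the end of the window that the record's event time falls into. If an allowed lateness is configured, then for as long as the watermark is past the window end by less than the allowed lateness, each late record fires the window again, and the result is computed over the late record together with the data from the previous firing.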
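import java.text.SimpleDateFormat
import scala.collection.mutable.ListBuffer
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

env.setParallelism(1)
val wrterMarkSteam = mapData.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[(String, Long)] {
  var currentMaxTimestamp = 0L    // largest event time seen so far
  val maxOutOfOrderness = 10000L  // the watermark lags the largest event time by 10 s
  // "yyyy" is the calendar year; "YYYY" would be the week year
  val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")

  override def getCurrentWatermark: Watermark = new Watermark(currentMaxTimestamp - maxOutOfOrderness)

  override def extractTimestamp(t: (String, Long), l: Long): Long = {
    val timestamp = t._2
    currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp)
    // Printed as a tuple: (thread name, record details)
    println((Thread.currentThread().getName, s"[key:${t._1},eventtime:[${t._2}| ${sdf.format(t._2)}],| [watermark:${sdf.format(getCurrentWatermark.getTimestamp)}]"))
    timestamp
  }
})

wrterMarkSteam.keyBy(_._1)
  .timeWindow(Time.seconds(3))
  .allowedLateness(Time.seconds(10))
  .apply(new WindowFunction[(String, Long), String, String, TimeWindow] {
    override def apply(key: String, window: TimeWindow, input: Iterable[(String, Long)], out: Collector[String]): Unit = {
      // Collect the event times of all records in this window
      val arrarList = new ListBuffer[Long]
      input.foreach(record => arrarList += record._2)
      val sorted_arrarList = arrarList.sortWith(_ < _)
      val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
      // Output: key, record count, window start, window end
      val result = key + "," + sorted_arrarList.size + "," + sdf.format(window.getStart) + "," + sdf.format(window.getEnd)
      out.collect(result)
    }
  }).print()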
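The firing rule described at the start of this section can be checked with a few lines of plain Scala, without Flink. This is only a sketch: `WindowMath` and its method names are made up for illustration, and the conditions reflect my understanding of how Flink's default event-time trigger and window cleanup behave (fire when the watermark reaches `end - 1`, keep state until the watermark reaches `end - 1 + allowedLateness`).

```scala
object WindowMath {
  // Start of the tumbling window containing `ts` (no offset); all values in ms.
  def windowStart(ts: Long, size: Long): Long = ts - (ts % size)

  // Last timestamp that still belongs to the window [start, start + size).
  def maxTimestamp(start: Long, size: Long): Long = start + size - 1

  // The window fires once the watermark has reached its max timestamp.
  def fires(watermark: Long, start: Long, size: Long): Boolean =
    watermark >= maxTimestamp(start, size)

  // After this point the window state is gone and late records are dropped.
  def expired(watermark: Long, start: Long, size: Long, lateness: Long): Boolean =
    watermark >= maxTimestamp(start, size) + lateness

  def main(args: Array[String]): Unit = {
    val size = 3000L
    val delay = 10000L
    val lateness = 10000L
    val start = windowStart(1538359881000L, size)
    println(start)                                       // 1538359881000
    println(fires(1538359893000L - delay, start, size))  // false
    println(fires(1538359894000L - delay, start, size))  // true
    println(expired(1538359897000L - delay, start, size, lateness)) // false
  }
}
```

With these numbers, the window starting at 10:11:21 fires when the record with event time 10:11:34 arrives (watermark 10:11:24), and it still accepts late records when the watermark is at 10:11:27.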
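As a concrete example: for our input, each firing outputs the key, the number of records for that key in the window, and the window's start and end time. When a record arrives, the watermark is computed from its event time, and the default trigger then decides whether the window fires. If the event time seems to have pushed the watermark far enough and the window still has not fired, the likely reason is that your parallelism is not 1: under the default partitioning, records are distributed one by one across the parallel subtasks. The watermark that counts is then not the one from the subtask you were looking at, but possibly the watermark of another subtask.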
0001,1538359881000
0001,1538359882000
0001,1538359883000
0001,1538359884000
0001,1538359885000
0001,1538359886000
0001,1538359887000
0001,1538359888000
0001,1538359889000
0001,1538359891000
0001,1538359892000
0001,1538359893000
0001,1538359894000
0001,1538359895000
0001,1538359896000
0001,1538359897000
0001,1538359898000
0001,1538359899000
0001,1538359900000
0001,1538359901000
0001,1538359902000
0001,1538359903000
0001,1538359904000
0001,1538359905000
0001,1538359906000
0001,1538359907000
0001,1538359908000
0001,1538359909000
0001,1538359910000
0001,1538359911000
0001,1538359912000
0001,1538359913000
0001,1538359914000
0001,1538359915000
0001,1538359916000
0001,1538359917000
0001,1538359918000
0001,1538359919000
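We can feed this test data in one record at a time. The results are shown below: for every record, the log line shows the current thread, the event time, and the watermark that results from it. For every firing, the output line shows the key, the count, and the window start and end.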
(Source: Socket Stream -> Filter -> Map -> Timestamps/Watermarks (1/1),[key:0001,eventtime:[1538359881000| 2018-10-01 10:11:21.000],| [watermark:2018-10-01 10:11:11.000])
(Source: Socket Stream -> Filter -> Map -> Timestamps/Watermarks (1/1),[key:0001,eventtime:[1538359882000| 2018-10-01 10:11:22.000],| [watermark:2018-10-01 10:11:12.000])
(Source: Socket Stream -> Filter -> Map -> Timestamps/Watermarks (1/1),[key:0001,eventtime:[1538359883000| 2018-10-01 10:11:23.000],| [watermark:2018-10-01 10:11:13.000])
(Source: Socket Stream -> Filter -> Map -> Timestamps/Watermarks (1/1),[key:0001,eventtime:[1538359884000| 2018-10-01 10:11:24.000],| [watermark:2018-10-01 10:11:14.000])
(Source: Socket Stream -> Filter -> Map -> Timestamps/Watermarks (1/1),[key:0001,eventtime:[1538359885000| 2018-10-01 10:11:25.000],| [watermark:2018-10-01 10:11:15.000])
(Source: Socket Stream -> Filter -> Map -> Timestamps/Watermarks (1/1),[key:0001,eventtime:[1538359886000| 2018-10-01 10:11:26.000],| [watermark:2018-10-01 10:11:16.000])
(Source: Socket Stream -> Filter -> Map -> Timestamps/Watermarks (1/1),[key:0001,eventtime:[1538359887000| 2018-10-01 10:11:27.000],| [watermark:2018-10-01 10:11:17.000])
(Source: Socket Stream -> Filter -> Map -> Timestamps/Watermarks (1/1),[key:0001,eventtime:[1538359888000| 2018-10-01 10:11:28.000],| [watermark:2018-10-01 10:11:18.000])
(Source: Socket Stream -> Filter -> Map -> Timestamps/Watermarks (1/1),[key:0001,eventtime:[1538359889000| 2018-10-01 10:11:29.000],| [watermark:2018-10-01 10:11:19.000])
(Source: Socket Stream -> Filter -> Map -> Timestamps/Watermarks (1/1),[key:0001,eventtime:[1538359891000| 2018-10-01 10:11:31.000],| [watermark:2018-10-01 10:11:21.000])
(Source: Socket Stream -> Filter -> Map -> Timestamps/Watermarks (1/1),[key:0001,eventtime:[1538359892000| 2018-10-01 10:11:32.000],| [watermark:2018-10-01 10:11:22.000])
(Source: Socket Stream -> Filter -> Map -> Timestamps/Watermarks (1/1),[key:0001,eventtime:[1538359893000| 2018-10-01 10:11:33.000],| [watermark:2018-10-01 10:11:23.000])
(Source: Socket Stream -> Filter -> Map -> Timestamps/Watermarks (1/1),[key:0001,eventtime:[1538359894000| 2018-10-01 10:11:34.000],| [watermark:2018-10-01 10:11:24.000])
0001,3,2018-10-01 10:11:21.000,2018-10-01 10:11:24.000
(Source: Socket Stream -> Filter -> Map -> Timestamps/Watermarks (1/1),[key:0001,eventtime:[1538359895000| 2018-10-01 10:11:35.000],| [watermark:2018-10-01 10:11:25.000])
(Source: Socket Stream -> Filter -> Map -> Timestamps/Watermarks (1/1),[key:0001,eventtime:[1538359896000| 2018-10-01 10:11:36.000],| [watermark:2018-10-01 10:11:26.000])
(Source: Socket Stream -> Filter -> Map -> Timestamps/Watermarks (1/1),[key:0001,eventtime:[1538359897000| 2018-10-01 10:11:37.000],| [watermark:2018-10-01 10:11:27.000])
0001,3,2018-10-01 10:11:24.000,2018-10-01 10:11:27.000
(Source: Socket Stream -> Filter -> Map -> Timestamps/Watermarks (1/1),[key:0001,eventtime:[1538359881000| 2018-10-01 10:11:21.000],| [watermark:2018-10-01 10:11:27.000])
0001,4,2018-10-01 10:11:21.000,2018-10-01 10:11:24.000
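The three firings in the trace above can be reproduced with a small simulation in plain Scala. It is a simplified model, not Flink's implementation: `SingleSubtaskSim` is a hypothetical name, the watermark is assumed to advance right after every record, and state cleanup of expired windows is not modeled.

```scala
import scala.collection.mutable

object SingleSubtaskSim {
  val Size = 3000L      // window length in ms
  val Delay = 10000L    // max out-of-orderness in ms
  val Lateness = 10000L // allowed lateness in ms

  // Event times from the trace, as seconds after 10:10:00 (81 = 10:11:21).
  // The final 81 is the late record.
  val sampleEvents: Seq[Long] =
    (((81 to 89) ++ (91 to 97)) :+ 81).map(s => 1538359800000L + s * 1000L)

  // Returns one (windowStart, count) pair per firing, in firing order.
  def run(events: Seq[Long]): Vector[(Long, Int)] = {
    val windows = mutable.SortedMap.empty[Long, Int] // window start -> record count
    val fired = mutable.Set.empty[Long]
    var maxTs = 0L
    var out = Vector.empty[(Long, Int)]
    for (ts <- events) {
      maxTs = math.max(maxTs, ts)
      val wm = maxTs - Delay
      val start = ts - ts % Size
      if (wm < start + Size - 1 + Lateness) { // otherwise too late: dropped
        windows(start) = windows.getOrElse(start, 0) + 1
        // A late record for an already fired window fires it again at once.
        if (fired(start)) out = out :+ ((start, windows(start)))
      }
      // First firing of every window whose end the watermark has reached.
      for ((s, n) <- windows if !fired(s) && wm >= s + Size - 1) {
        fired += s
        out = out :+ ((s, n))
      }
    }
    out
  }

  def main(args: Array[String]): Unit = run(sampleEvents).foreach(println)
  // (1538359881000,3)
  // (1538359884000,3)
  // (1538359881000,4)
}
```

The last line corresponds to the late record at 10:11:21: the window starting at 10:11:21 fires a second time, now with a count of 4.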
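The official documentation has a figure for the multi-parallelism case: with parallelism greater than 1, window operations are triggered by the lowest watermark across all parallel subtasks.
After changing the parallelism of the code above, the test results are as follows. Only at the last record have the watermarks of all subtasks passed the end of the window, so only then does the window fire.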
(Filter -> Map -> Timestamps/Watermarks (1/8),[key:0001,eventtime:[1538359881000| 2018-10-01 10:11:21.000],| [watermark:2018-10-01 10:11:11.000])
(Filter -> Map -> Timestamps/Watermarks (4/8),[key:0001,eventtime:[1538359884000| 2018-10-01 10:11:24.000],| [watermark:2018-10-01 10:11:14.000])
(Filter -> Map -> Timestamps/Watermarks (6/8),[key:0001,eventtime:[1538359886000| 2018-10-01 10:11:26.000],| [watermark:2018-10-01 10:11:16.000])
(Filter -> Map -> Timestamps/Watermarks (5/8),[key:0001,eventtime:[1538359885000| 2018-10-01 10:11:25.000],| [watermark:2018-10-01 10:11:15.000])
(Filter -> Map -> Timestamps/Watermarks (3/8),[key:0001,eventtime:[1538359883000| 2018-10-01 10:11:23.000],| [watermark:2018-10-01 10:11:13.000])
(Filter -> Map -> Timestamps/Watermarks (2/8),[key:0001,eventtime:[1538359882000| 2018-10-01 10:11:22.000],| [watermark:2018-10-01 10:11:12.000])
(Filter -> Map -> Timestamps/Watermarks (2/8),[key:0001,eventtime:[1538359891000| 2018-10-01 10:11:31.000],| [watermark:2018-10-01 10:11:21.000])
(Filter -> Map -> Timestamps/Watermarks (8/8),[key:0001,eventtime:[1538359888000| 2018-10-01 10:11:28.000],| [watermark:2018-10-01 10:11:18.000])
(Filter -> Map -> Timestamps/Watermarks (3/8),[key:0001,eventtime:[1538359892000| 2018-10-01 10:11:32.000],| [watermark:2018-10-01 10:11:22.000])
(Filter -> Map -> Timestamps/Watermarks (7/8),[key:0001,eventtime:[1538359887000| 2018-10-01 10:11:27.000],| [watermark:2018-10-01 10:11:17.000])
(Filter -> Map -> Timestamps/Watermarks (1/8),[key:0001,eventtime:[1538359889000| 2018-10-01 10:11:29.000],| [watermark:2018-10-01 10:11:19.000])
(Filter -> Map -> Timestamps/Watermarks (4/8),[key:0001,eventtime:[1538359893000| 2018-10-01 10:11:33.000],| [watermark:2018-10-01 10:11:23.000])
(Filter -> Map -> Timestamps/Watermarks (6/8),[key:0001,eventtime:[1538359895000| 2018-10-01 10:11:35.000],| [watermark:2018-10-01 10:11:25.000])
(Filter -> Map -> Timestamps/Watermarks (5/8),[key:0001,eventtime:[1538359894000| 2018-10-01 10:11:34.000],| [watermark:2018-10-01 10:11:24.000])
(Filter -> Map -> Timestamps/Watermarks (8/8),[key:0001,eventtime:[1538359897000| 2018-10-01 10:11:37.000],| [watermark:2018-10-01 10:11:27.000])
(Filter -> Map -> Timestamps/Watermarks (7/8),[key:0001,eventtime:[1538359896000| 2018-10-01 10:11:36.000],| [watermark:2018-10-01 10:11:26.000])
(Filter -> Map -> Timestamps/Watermarks (1/8),[key:0001,eventtime:[1538359898000| 2018-10-01 10:11:38.000],| [watermark:2018-10-01 10:11:28.000])
(Filter -> Map -> Timestamps/Watermarks (2/8),[key:0001,eventtime:[1538359899000| 2018-10-01 10:11:39.000],| [watermark:2018-10-01 10:11:29.000])
(Filter -> Map -> Timestamps/Watermarks (3/8),[key:0001,eventtime:[1538359900000| 2018-10-01 10:11:40.000],| [watermark:2018-10-01 10:11:30.000])
(Filter -> Map -> Timestamps/Watermarks (4/8),[key:0001,eventtime:[1538359901000| 2018-10-01 10:11:41.000],| [watermark:2018-10-01 10:11:31.000])
1> 0001,3,2018-10-01 10:11:21.000,2018-10-01 10:11:24.000
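To see why the window fires only at the last record, the effective watermark can be recomputed by hand as the minimum over the eight subtasks. The sketch below does this in plain Scala. `MinWatermark` and `downstream` are illustrative names, the (subtask, event time) pairs are read off the log above, and each subtask's maximum timestamp starts at 0 as in the assigner shown earlier.

```scala
object MinWatermark {
  val Delay = 10000L // max out-of-orderness in ms

  // records: (0-based subtask index, event time) in arrival order.
  // Returns the lowest per-subtask watermark after each record.
  def downstream(records: Seq[(Int, Long)], parallelism: Int): Seq[Long] = {
    val maxTs = Array.fill(parallelism)(0L)
    records.map { case (subtask, ts) =>
      maxTs(subtask) = math.max(maxTs(subtask), ts)
      maxTs.min - Delay
    }
  }

  // (subtask as printed in the log, seconds after 10:10:00), in log order.
  val logged: Vector[(Int, Int)] = Vector(
    (1, 81), (4, 84), (6, 86), (5, 85), (3, 83), (2, 82), (2, 91), (8, 88),
    (3, 92), (7, 87), (1, 89), (4, 93), (6, 95), (5, 94), (8, 97), (7, 96),
    (1, 98), (2, 99), (3, 100), (4, 101))

  val records: Vector[(Int, Long)] =
    logged.map { case (subtask, s) => (subtask - 1, 1538359800000L + s * 1000L) }

  def main(args: Array[String]): Unit = {
    val wms = downstream(records, 8)
    println(wms(wms.length - 2)) // 1538359883000
    println(wms.last)            // 1538359884000
  }
}
```

Before the last record, subtask 4/8 is the slowest with a watermark of 10:11:23, which is still inside the window [10:11:21, 10:11:24). The last record moves 4/8 ahead, the minimum becomes the 10:11:24 of subtask 5/8, and the window fires.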
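After a window has fired, late records are added to the contents of that earlier window. If that is not what we need, we have to write a custom trigger. Everything else is the same as in the default EventTimeTrigger, except that on firing it clears the data from the window's previous firing.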
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}

wrterMarkSteam.keyBy(_._1)
  .timeWindow(Time.seconds(3))
  .allowedLateness(Time.seconds(10))
  .trigger(new Trigger[(String, Long), TimeWindow] {
    override def onElement(element: (String, Long), timestamp: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = {
      if (window.maxTimestamp() <= ctx.getCurrentWatermark()) {
        // The watermark is already past the window: fire immediately and clear the window contents
        TriggerResult.FIRE_AND_PURGE
      } else {
        ctx.registerEventTimeTimer(window.maxTimestamp())
        TriggerResult.CONTINUE
      }
    }

    // Processing time plays no role for this trigger
    override def onProcessingTime(time: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = TriggerResult.CONTINUE

    override def onEventTime(time: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = {
      // Fire and clear the contents when the timer for the end of the window goes off
      if (time == window.maxTimestamp()) TriggerResult.FIRE_AND_PURGE else TriggerResult.CONTINUE
    }

    override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit = ctx.deleteEventTimeTimer(window.maxTimestamp)
  })
  .apply(new WindowFunction[(String, Long), String, String, TimeWindow] {
    override def apply(key: String, window: TimeWindow, input: Iterable[(String, Long)], out: Collector[String]): Unit = {
      // Collect the event times of all records currently held by this window
      val arrarList = new ListBuffer[Long]
      input.foreach(record => arrarList += record._2)
      val sorted_arrarList = arrarList.sortWith(_ < _)
      val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
      // Output: key, record count, window start, window end
      val result = key + "," + sorted_arrarList.size + "," + sdf.format(window.getStart) + "," + sdf.format(window.getEnd)
      out.collect(result)
    }
  }).print()
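The difference between the default trigger (FIRE) and the custom one (FIRE_AND_PURGE) comes down to whether the window keeps its contents between firings. The toy model below shows this in plain Scala. `PurgeDemo` and `firingCounts` are hypothetical names and nothing here calls Flink.

```scala
object PurgeDemo {
  // `batches` holds how many records arrived before each firing of one window.
  // Returns the count that each firing would report.
  def firingCounts(batches: Seq[Int], purge: Boolean): Seq[Int] = {
    var buffered = 0
    batches.map { n =>
      buffered += n
      val emitted = buffered
      if (purge) buffered = 0 // FIRE_AND_PURGE drops the window contents
      emitted
    }
  }

  def main(args: Array[String]): Unit = {
    // 3 on-time records, then 1 late record, as for the 10:11:21 window.
    println(firingCounts(Seq(3, 1), purge = false)) // List(3, 4)
    println(firingCounts(Seq(3, 1), purge = true))  // List(3, 1)
  }
}
```

Under this model, the late record at 10:11:21 would be reported with a count of 1 by the custom trigger, instead of the 4 seen with the default trigger.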