There are two ways to specify timestamps and watermarks:
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.watermark.Watermark

object watermark {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.createLocalEnvironment()
    // Set the time characteristic to event time
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    val input = List(("a", 1L, 1), ("b", 1L, 1), ("b", 3L, 1))
    val source: DataStream[(String, Long, Int)] = env.addSource(
      new SourceFunction[(String, Long, Int)]() {
        override def run(sourceContext: SourceFunction.SourceContext[(String, Long, Int)]): Unit = {
          input.foreach(value => {
            // Assign the timestamp: the second tuple element is the event timestamp
            sourceContext.collectWithTimestamp(value, value._2)
            // Emit a watermark equal to the timestamp minus 1
            sourceContext.emitWatermark(new Watermark(value._2 - 1))
          })
          // Emit a final watermark of Long.MaxValue once all elements are sent
          sourceContext.emitWatermark(new Watermark(Long.MaxValue))
        }
        override def cancel(): Unit = {}
      })
    source.print()
    env.execute("test")
  }
}
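There are two kinds of Periodic Watermark Assigner. The first works in ascending mode: it extracts the timestamp from a specified field of the data and uses the current timestamp as the latest watermark. This Timestamp Assigner suits events that are generated in order, with no out-of-order events. The second sets a fixed interval that specifies how far the watermark lags behind the timestamp, that is, the maximum lateness the system tolerates for arriving data.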
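The difference between the two periodic strategies can be sketched in plain Java, without any Flink dependency. The class and method names below are hypothetical, and the ascending variant follows the description above literally (watermark equal to the latest timestamp); a real assigner may differ in detail.

```java
final class WatermarkSketch {
    /** Ascending mode: the watermark follows the largest timestamp seen so far. */
    static long ascendingWatermark(long[] timestamps) {
        long max = Long.MIN_VALUE;
        for (long ts : timestamps) {
            max = Math.max(max, ts);
        }
        return max;
    }

    /** Bounded-lateness mode: the watermark trails the largest timestamp by maxDelay. */
    static long boundedWatermark(long[] timestamps, long maxDelay) {
        long max = ascendingWatermark(timestamps);
        // Avoid underflow when no element has been seen yet.
        return max == Long.MIN_VALUE ? Long.MIN_VALUE : max - maxDelay;
    }

    public static void main(String[] args) {
        long[] inOrder = {1L, 1L, 3L};
        long[] outOfOrder = {1L, 5L, 3L};
        System.out.println(ascendingWatermark(inOrder));      // 3
        System.out.println(boundedWatermark(outOfOrder, 2L)); // 3
    }
}
```

With a delay of 2, the element with timestamp 3 that arrives after timestamp 5 is still at or above the watermark, so it would not be treated as late.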
Flink provides two predefined implementation classes for generating watermarks.
Assigning timestamps from a field of the data:
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object ascendingTimestamps {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.createLocalEnvironment()
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    val input = env.fromCollection(List(("a", 1L, 1), ("b", 1L, 1), ("b", 3L, 1)))
    // Use the third tuple element as the event timestamp; values must be ascending
    val inputWithTimestampAndWatermark = input.assignAscendingTimestamps(t => t._3)
    val result = inputWithTimestampAndWatermark
      .keyBy(0) // key by first element
      .timeWindow(Time.seconds(10))
      .sum("_2") // sum by second element
    result.print()
    env.execute()
  }
}
inputstream.keyBy(t=>t.id).window(new myWindowAssigner())
inputstream.windowAll(new myWindowAssigner())
Time windows and count windows
Tumbling Windows
The stream is split by a fixed time span or a fixed size, and the elements of different windows never overlap.
DataStream<T> input = ...;

// tumbling event-time windows
input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .<windowed transformation>(<window function>);

// tumbling processing-time windows
input
    .keyBy(<key selector>)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .<windowed transformation>(<window function>);

// daily tumbling event-time windows offset by -8 hours.
input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.days(1), Time.hours(-8)))
    .<windowed transformation>(<window function>);
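How a timestamp maps to a tumbling window, including the offset, can be sketched in plain Java. This is a minimal illustration with hypothetical names, not Flink's own code; all values are assumed to be epoch milliseconds.

```java
final class TumblingWindowSketch {
    /** Start (inclusive) of the tumbling window of length size that contains timestamp. */
    static long windowStart(long timestamp, long offset, long size) {
        // floorMod keeps the remainder non-negative even for negative inputs.
        return timestamp - Math.floorMod(timestamp - offset, size);
    }

    public static void main(String[] args) {
        long second = 1000L;
        long hour = 3_600_000L;
        long day = 24 * hour;
        // 7 s falls into the 5-second window [5 s, 10 s).
        System.out.println(windowStart(7 * second, 0L, 5 * second)); // 5000
        // Daily windows with a -8 h offset start at 16:00 UTC of the previous day.
        System.out.println(windowStart(0L, -8 * hour, day)); // -28800000
    }
}
```

The window end is simply the start plus the size, so every timestamp belongs to exactly one window.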
Sliding Windows
DataStream<T> input = ...;

// sliding event-time windows
input
    .keyBy(<key selector>)
    .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .<windowed transformation>(<window function>);

// sliding processing-time windows
input
    .keyBy(<key selector>)
    .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .<windowed transformation>(<window function>);

// sliding processing-time windows offset by -8 hours
input
    .keyBy(<key selector>)
    .window(SlidingProcessingTimeWindows.of(Time.hours(12), Time.hours(1), Time.hours(-8)))
    .<windowed transformation>(<window function>);
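In the first and second examples, of(param1, param2) takes two parameters:
The first parameter is the window size.
The second parameter is the slide, i.e. how far the window advances each time.
The third example takes a third parameter, the offset.
For example, with windows.of(1 hour, 30 min, 15 min) you get windows such as
1:15:00.000 - 2:14:59.999, 1:45:00.000 - 2:44:59.999 etc.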
An important use case for offsets is to adjust windows to timezones other than UTC-0. For example, in China you would have to specify an offset of Time.hours(-8).
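The size, slide, and offset parameters can be checked with a small plain-Java sketch that lists every sliding window containing a given timestamp. The names are hypothetical and the code is an illustration under the assumption that all values are milliseconds, not Flink's implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class SlidingWindowSketch {
    /** Returns {start, end} pairs (end exclusive) of every sliding window containing timestamp. */
    static List<long[]> assign(long timestamp, long size, long slide, long offset) {
        List<long[]> windows = new ArrayList<>();
        // Latest window start that is not after the timestamp.
        long lastStart = timestamp - Math.floorMod(timestamp - offset, slide);
        // Walk backwards one slide at a time while the window still covers the timestamp.
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            windows.add(new long[] {start, start + size});
        }
        return windows;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        // Size 1 h, slide 30 min, offset 15 min; the element arrives at 2:00.
        for (long[] w : assign(120 * minute, 60 * minute, 30 * minute, 15 * minute)) {
            System.out.println(w[0] / minute + " min to " + w[1] / minute + " min");
        }
    }
}
```

An element at 2:00 lands in the windows starting at 1:45 and 1:15, which matches the ranges listed above. Because the size is twice the slide, each element belongs to two windows.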
Session Windows
DataStream<T> input = ...;

// event-time session windows with static gap
input
    .keyBy(<key selector>)
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>);

// event-time session windows with dynamic gap
input
    .keyBy(<key selector>)
    .window(EventTimeSessionWindows.withDynamicGap((element) -> {
        // determine and return session gap
    }))
    .<windowed transformation>(<window function>);

// processing-time session windows with static gap
input
    .keyBy(<key selector>)
    .window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>);

// processing-time session windows with dynamic gap
input
    .keyBy(<key selector>)
    .window(ProcessingTimeSessionWindows.withDynamicGap((element) -> {
        // determine and return session gap
    }))
    .<windowed transformation>(<window function>);
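Internal mechanism:
A new window is created for every arriving element.
If the gap between arriving elements is smaller than the configured session gap, the adjacent windows are merged.
The session window operator therefore needs a merging Trigger and a merging Window Function.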
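The create-then-merge behaviour described above can be sketched in plain Java. This is a simplified batch illustration with hypothetical names: it sorts the timestamps first and merges windows that overlap or touch, whereas a streaming operator would merge incrementally as elements arrive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SessionWindowSketch {
    /** Builds one window [ts, ts + gap) per element, then merges windows that overlap or touch. */
    static List<long[]> sessions(long[] timestamps, long gap) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<long[]> merged = new ArrayList<>();
        for (long ts : sorted) {
            long[] window = {ts, ts + gap};
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && window[0] <= last[1]) {
                // The element arrived within the gap: extend the current session.
                last[1] = Math.max(last[1], window[1]);
            } else {
                // The gap was exceeded: start a new session.
                merged.add(window);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        for (long[] s : sessions(new long[] {1000L, 3000L, 20000L}, 5000L)) {
            System.out.println(s[0] + " to " + s[1]);
        }
    }
}
```

With a gap of 5000, the elements at 1000 and 3000 end up in one session from 1000 to 8000, while the element at 20000 opens a separate session.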
Session window underlying implementation
See the referenced article.
Flink 1.9:
/**
 * A {@code WindowAssigner} that can merge windows.
 *
 * @param <T> The type of elements that this WindowAssigner can assign windows to.
 * @param <W> The type of {@code Window} that this assigner assigns.
 */
@PublicEvolving
public abstract class MergingWindowAssigner<T, W extends Window> extends WindowAssigner<T, W> {
    private static final long serialVersionUID = 1L;

    /**
     * Determines which windows (if any) should be merged.
     *
     * @param windows The window candidates.
     * @param callback A callback that can be invoked to signal which windows should be merged.
     */
    public abstract void mergeWindows(Collection<W> windows, MergeCallback<W> callback);

    /**
     * Callback to be used in {@link #mergeWindows(Collection, MergeCallback)} for specifying which
     * windows should be merged.
     */
    public interface MergeCallback<W> {

        /**
         * Specifies that the given windows should be merged into the result window.
         *
         * @param toBeMerged The list of windows that should be merged into one window.
         * @param mergeResult The resulting merged window.
         */
        void merge(Collection<W> toBeMerged, W mergeResult);
    }
}
Global Windows
This blogger has published many Flink source-code walkthroughs.
Excerpted from: 《Flink原理、实战与性能优化》 by 张利兵
View in the Douban Read bookstore: https://read.douban.com/ebook/114289022/