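IngestionTime: the time at which a record enters the Flink system.
ProcessingTime: the time at which a Flink operator processes the record.
Here is a program that illustrates the difference between them: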
import java.text.SimpleDateFormat
import java.util.Date

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Order record (userid, amount spent `total`)
case class Order(userid: Long, total: Long)
case class OrderSummary(startTime: String, endTime: String, userid: Long, total: Long)

object IngestionOrProcessTime extends App {
  val streamenv = StreamExecutionEnvironment.createLocalEnvironment()
  streamenv.setParallelism(1)
  // Option 1: IngestionTime. The four records clearly enter Flink at the same time,
  // so they are computed in the same window and aggregated by userid.
  // streamenv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)
  // Option 2: ProcessingTime. Add a time-consuming step inside the aggregate function,
  // otherwise no data is output.
  streamenv.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
  // Aggregate by key
  streamenv.fromElements(Order(100L, 1000L), Order(101L, 2000L), Order(100L, 3000L), Order(101L, 4000L))
    .keyBy(_.userid)
    .timeWindow(Time.seconds(10))
    .aggregate(
      new AggregateFunction[Order, (Long, Long), (Long, Long)] {
        // Create the accumulator: (userid, running total)
        override def createAccumulator(): (Long, Long) = {
          // Thread.sleep(5000) // sleep 5 s per event; no effect under IngestionTime, only under ProcessingTime
          Thread.sleep(10000) // delay that affects which window a record lands in
          (0L, 0L)
        }
        // Add one order to the accumulator
        override def add(value: Order, accumulator: (Long, Long)): (Long, Long) =
          (value.userid, accumulator._2 + value.total)
        override def getResult(accumulator: (Long, Long)): (Long, Long) = accumulator
        // Merge two accumulators
        override def merge(a: (Long, Long), b: (Long, Long)): (Long, Long) =
          (a._1, a._2 + b._2)
      },
      new WindowFunction[(Long, Long), OrderSummary, Long, TimeWindow]() {
        override def apply(key: Long, window: TimeWindow, inputs: Iterable[(Long, Long)], out: Collector[OrderSummary]): Unit = {
          val simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
          val winStartTime = simpleDateFormat.format(new Date(window.getStart))
          val winEndTime = simpleDateFormat.format(new Date(window.getEnd))
          // inputs holds the pre-aggregated result for this key and window,
          // so the loop emits one summary per key per window
          for (value <- inputs) {
            out.collect(OrderSummary("winStartTime :" + winStartTime, "winEndTime :" + winEndTime, value._1, value._2))
          }
        }
      }
    )
    .print()
  streamenv.execute("IngestionOrProcessTime_starting")
}
Output with setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime):
All four records enter Flink at the same time, so there is only one window.
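Output with TimeCharacteristic.ProcessingTime and a 5-second sleep:
Two windows are produced, because each record takes 5 seconds to process. The first and second records fall into the first window, and the last two fall into the second window.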
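Output with TimeCharacteristic.ProcessingTime and a 10-second sleep:
Four windows are produced, because each record takes 10 seconds to process (sleep of 10 seconds), so each 10-second window can hold exactly one record.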
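To make the window counts above easier to reason about, here is a small plain-Scala sketch (no Flink required) that mimics how records land in 10-second tumbling windows when each one is delayed by a fixed sleep. The names WindowSim, windowStart, arrivals and windowCount are made up for this illustration, and the model assumes windows are aligned to multiples of the window size and that the sleep is the only delay. Under those assumptions, the count for a 5-second sleep depends on where within a window the first record arrives, which may be one reason ProcessingTime output varies between runs.

```scala
object WindowSim {
  // Start of the tumbling window that contains `ts` (milliseconds, zero offset assumed)
  def windowStart(ts: Long, sizeMs: Long): Long = ts - (ts % sizeMs)

  // Arrival times of `n` records when each record delays the next one by `sleepMs`
  def arrivals(t0: Long, sleepMs: Long, n: Int): Seq[Long] =
    (0 until n).map(i => t0 + i * sleepMs)

  // Number of distinct windows touched by the `n` records
  def windowCount(t0: Long, sleepMs: Long, n: Int = 4, sizeMs: Long = 10000L): Int =
    arrivals(t0, sleepMs, n).map(windowStart(_, sizeMs)).distinct.size
}

object WindowSimDemo extends App {
  println(WindowSim.windowCount(t0 = 1000L, sleepMs = 0L))     // 1: all records arrive together
  println(WindowSim.windowCount(t0 = 1000L, sleepMs = 5000L))  // 2: first record arrives early in a window
  println(WindowSim.windowCount(t0 = 7000L, sleepMs = 5000L))  // 3: first record arrives late in a window
  println(WindowSim.windowCount(t0 = 1000L, sleepMs = 10000L)) // 4: one record per window
}
```

In this model a 10-second sleep always gives four windows, while a 5-second sleep gives two or three depending on the start time.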
The examples above show how choosing ProcessingTime or IngestionTime affects which windows are produced. Thanks, everyone!
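With ProcessingTime the output can change from run to run. From my tests this seems to be related to how fast the local machine computes; I was testing on a virtual machine and the results were not very stable.
So results based on ProcessingTime or IngestionTime are not stable. Use EventTime instead!!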
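For comparison, here is a minimal sketch of the same kind of job using EventTime, written against the same older Flink Scala API as the program above (setStreamTimeCharacteristic, timeWindow). The original Order has no timestamp field, so TimedOrder and its ts field (epoch milliseconds) are assumptions made for this example, as are the timestamp values. The aggregation is simplified to a reduce.

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

// Hypothetical variant of Order that carries its own event timestamp (epoch millis)
case class TimedOrder(userid: Long, total: Long, ts: Long)

object EventTimeSketch extends App {
  val env = StreamExecutionEnvironment.createLocalEnvironment()
  env.setParallelism(1)
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

  env.fromElements(
      TimedOrder(100L, 1000L, 1000L),
      TimedOrder(101L, 2000L, 2000L),
      TimedOrder(100L, 3000L, 12000L),
      TimedOrder(101L, 4000L, 13000L))
    // Timestamps in this sample are already ascending, so the simple extractor is enough
    .assignAscendingTimestamps(_.ts)
    .keyBy(_.userid)
    .timeWindow(Time.seconds(10))
    // Sum totals per key and window; keep the latest timestamp seen
    .reduce((a, b) => a.copy(total = a.total + b.total, ts = math.max(a.ts, b.ts)))
    .print()

  env.execute("EventTimeSketch")
}
```

Because windows are assigned from the ts field rather than from the wall clock, the first two records should share one window and the last two another, regardless of how slow the machine is or whether a Thread.sleep is added.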