Incremental aggregate functions offer high computational performance and a small storage footprint: the computation is based on intermediate results, and the window maintains only that intermediate state value, so the raw input data does not need to be buffered.
A ReduceFunction can be defined in two ways:
(1) as a lambda expression
(2) as a class implementing the ReduceFunction interface
The code below demonstrates both.
package com.windowfunction
import com.fouth_sink.CustomSourceFunction
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
object ReduceFunctionDemo {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// A custom SourceFunction to make testing easier
val customSourceFunction: CustomSourceFunction = new CustomSourceFunction
// source
val kafkaDS: DataStream[(String, Int)] = env.addSource(customSourceFunction)
// Print the input stream for inspection
kafkaDS.print("streamPPP")
// transform
val resultDS: DataStream[(String, Int)] = kafkaDS
.keyBy(0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
// 1. Lambda-expression form
// .reduce((v1, v2) => (v1._1, v1._2 + v2._2))
// 2. Anonymous class implementing the ReduceFunction interface
.reduce(new ReduceFunction[(String, Int)] {
override def reduce(t: (String, Int), t1: (String, Int)): (String, Int) = {
(t._1, t._2 + t1._2)
}
})
// sink
resultDS.print("stream")
// execute
env.execute("ReduceFunctionDemo")
}
}
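As a sanity check outside Flink, the combine logic of the lambda form can be exercised on a plain Scala list. This is only a sketch of the per-key, per-window semantics; the `sensor_1` sample data is made up for illustration:

```scala
object ReduceExprSketch {
  // Same combine logic as the lambda form: keep the key, sum the second field
  def combine(v1: (String, Int), v2: (String, Int)): (String, Int) =
    (v1._1, v1._2 + v2._2)

  def main(args: Array[String]): Unit = {
    // Hypothetical elements of one key within one 10s window
    val windowElements = List(("sensor_1", 3), ("sensor_1", 5), ("sensor_1", 2))
    // Flink applies the function pairwise as elements arrive,
    // keeping only the running result -- no raw elements are buffered
    println(windowElements.reduce(combine)) // (sensor_1,10)
  }
}
```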
package com.windowfunction
import com.fouth_sink.CustomSourceFunction
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
object AggregateFunctionDemo {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// A custom SourceFunction to make testing easier
val customSourceFunction: CustomSourceFunction = new CustomSourceFunction
// source
val kafkaDS: DataStream[(String, Int)] = env.addSource(customSourceFunction)
// Print the input stream for inspection
kafkaDS.print("streamPPP")
// transform
val resultDS: DataStream[(String, Double)] = kafkaDS
.keyBy(0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
// Compute the average of the values for each key
.aggregate(new AggregateFunction[(String, Int), (String, Long, Long), (String, Double)] {
// Create the accumulator tuple: (key, running sum, element count)
override def createAccumulator(): (String, Long, Long) = ("", 0L, 0L)
override def add(in: (String, Int), acc: (String, Long, Long)): (String, Long, Long) = {
(in._1, in._2 + acc._2, acc._3 + 1L)
}
override def getResult(acc: (String, Long, Long)): (String, Double) = {
(acc._1, acc._2.toDouble / acc._3) // toDouble avoids truncating Long division
}
override def merge(acc: (String, Long, Long), acc1: (String, Long, Long)): (String, Long, Long) = {
(acc._1, acc._2 + acc1._2, acc._3 + acc1._3)
}
})
resultDS.print("result")
env.execute("AggregateFunctionDemo")
}
}
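The accumulator logic above can be sketched in plain Scala (no Flink runtime) to see how `add` folds each element into `(key, sum, count)` and how `getResult` divides at the end; note the `toDouble`, without which Long division would truncate the average. The sample data is made up for illustration:

```scala
object AvgAccumulatorSketch {
  type Acc = (String, Long, Long) // (key, running sum, element count)

  // Same logic as AggregateFunction.add: fold one element into the accumulator
  def add(in: (String, Int), acc: Acc): Acc =
    (in._1, acc._2 + in._2, acc._3 + 1L)

  // Same logic as AggregateFunction.getResult: sum / count as a Double
  def getResult(acc: Acc): (String, Double) =
    (acc._1, acc._2.toDouble / acc._3)

  def main(args: Array[String]): Unit = {
    // Hypothetical elements of one key within one window
    val elements = List(("sensor_1", 1), ("sensor_1", 2), ("sensor_1", 3))
    val acc = elements.foldLeft(("", 0L, 0L): Acc)((a, e) => add(e, a))
    println(getResult(acc)) // (sensor_1,2.0)
  }
}
```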
FoldFunction defines the logic for merging each window input element with an external (initial) element. It is marked @Deprecated in the official documentation, meaning it may be removed in a future release; AggregateFunction is the recommended replacement, so refer to the AggregateFunction example above.
ProcessWindowFunction is expensive to use and comparatively slow: the operator must buffer every element belonging to the window, and only when the window fires does it aggregate over all of the raw data.
In some cases, however, computing more complex metrics requires all of the window's data elements, or access to the window's state data and metadata, for example: computing the median or mode of some field across the window's elements.
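Median and mode are good examples of metrics that cannot be maintained incrementally and need every element at hand. A minimal plain-Scala sketch of both over one window's values (the sample values are made up for illustration):

```scala
object MedianModeSketch {
  // Median requires a full sort of the window's values
  def median(values: Seq[Int]): Double = {
    val sorted = values.sorted
    val n = sorted.length
    if (n % 2 == 1) sorted(n / 2)
    else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
  }

  // Mode requires counting occurrences across all values
  def mode(values: Seq[Int]): Int =
    values.groupBy(identity).maxBy(_._2.size)._1

  def main(args: Array[String]): Unit = {
    // Hypothetical values of one key within one window
    val windowValues = Seq(3, 1, 4, 1, 5)
    println(median(windowValues)) // 3.0
    println(mode(windowValues))   // 1
  }
}
```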
package com.windowfunction
import com.fouth_sink.CustomSourceFunction
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
object ProcessWindowsFunctionDemo {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// A custom SourceFunction to make testing easier
val customSourceFunction: CustomSourceFunction = new CustomSourceFunction
// source
val kafkaDS: DataStream[(String, Int)] = env.addSource(customSourceFunction)
// Print the input stream for inspection
kafkaDS.print("streamPPP")
// transform
val resultDS: DataStream[(String, Long, Long, Long, Long, Long)] = kafkaDS
.keyBy(_._1) // Note: keying by a field index here would produce a tuple key and break the downstream results
.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
.process(new ProcessWindowFunction[(String, Int), (String, Long, Long, Long, Long, Long), String, TimeWindow] {
override def process(key: String,
context: Context,
elements: Iterable[(String, Int)],
out: Collector[(String, Long, Long, Long, Long, Long)]): Unit = {
val SUM: Long = elements.map(_._2).sum.toLong
val MIN: Long = elements.map(_._2).min.toLong
val MAX: Long = elements.map(_._2).max.toLong
val AVG: Long = SUM / elements.size
val windowEnd: Long = context.window.getEnd
out.collect((key, SUM, MIN, MAX, AVG, windowEnd))
}
})
resultDS.print("result")
env.execute("ProcessWindowsFunctionDemo")
}
}
Although incremental aggregate functions improve window-computation performance to some degree, they are not as flexible as ProcessWindowFunction, for example in manipulating window state or reading window metadata. Conversely, using ProcessWindowFunction for basic incremental statistics wastes resources.
In that case, combine an incremental aggregate function with a full window function.
package com.windowfunction
import com.fouth_sink.CustomSourceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
object ReduceFunctionAndProcessWindowFunctionDemo {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// A custom SourceFunction to make testing easier
val customSourceFunction: CustomSourceFunction = new CustomSourceFunction
// source
val kafkaDS: DataStream[(String, Int)] = env.addSource(customSourceFunction)
// Print the input stream for inspection
kafkaDS.print("streamPPP")
// transform
val resultDS: DataStream[(Long, (String, Int))] = kafkaDS
.keyBy(_._1) // Note: keying by a field index here would produce a tuple key and break the downstream results
.timeWindow(Time.seconds(10))
.reduce(
// ReduceFunction: keep the element with the smaller second field (the minimum)
(r1: (String, Int), r2: (String, Int)) => {
if (r1._2 < r2._2) r1 else r2
},
// ProcessWindowFunction: collect the window metadata
(key: String,
window: TimeWindow,
minReadings: Iterable[(String, Int)],
out: Collector[(Long, (String, Int))]) => {
val min: (String, Int) = minReadings.iterator.next()
// Emit the window end time together with the minimum element
out.collect((window.getEnd, min))
}
)
resultDS.print("result")
env.execute()
}
}
The reduce method here takes two functions: a ReduceFunction and a ProcessWindowFunction.
The ReduceFunction defines the logic for finding, per key, the element with the minimum second field; the ProcessWindowFunction reads the window end time from the window metadata and emits it together with that minimum element.
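The two-stage behavior can be sketched in plain Scala: the incremental step retains only the minimum element, so by the time the window fires, the ProcessWindowFunction's Iterable holds exactly one element, which is then paired with the window metadata. The sample data and the timestamp below are made up for illustration:

```scala
object MinThenMetadataSketch {
  // Incremental step: same logic as the ReduceFunction above
  def minElem(r1: (String, Int), r2: (String, Int)): (String, Int) =
    if (r1._2 < r2._2) r1 else r2

  def main(args: Array[String]): Unit = {
    // Hypothetical elements of one key within one 10s window
    val windowElements = List(("sensor_1", 7), ("sensor_1", 2), ("sensor_1", 5))
    val min = windowElements.reduce(minElem)
    // Full-window step: pair the single retained element with window metadata
    val windowEnd = 1700000010000L // hypothetical window end timestamp
    println((windowEnd, min)) // (1700000010000,(sensor_1,2))
  }
}
```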