Window Functions in Detail

Table of Contents

  • A CustomSourceFunction for Testing
  • Incremental Aggregation Functions
    • ReduceFunction
    • AggregateFunction
    • FoldFunction
  • Full Window Functions
    • ProcessWindowFunction
  • Combining Incremental Aggregation with Full Window Functions

A CustomSourceFunction for Testing
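
All of the demos below read from a custom source that emits (String, Int) tuples. The original class is not reproduced in this post, so the following is a minimal sketch, assuming a SourceFunction[(String, Int)] that emits a random key/value pair once per second (the key set and emission rate are illustrative assumptions):

package com.fouth_sink

import org.apache.flink.streaming.api.functions.source.SourceFunction

import scala.util.Random

// A sketch of the test source: emits a random (key, value) pair
// every second until the job cancels it.
class CustomSourceFunction extends SourceFunction[(String, Int)] {

  @volatile private var running = true

  override def run(ctx: SourceFunction.SourceContext[(String, Int)]): Unit = {
    val keys = Array("a", "b", "c")
    val random = new Random()
    while (running) {
      ctx.collect((keys(random.nextInt(keys.length)), random.nextInt(10)))
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    running = false
  }
}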

Incremental Aggregation Functions

Incremental aggregation functions are fast and use little storage. They compute on an intermediate result: the window maintains only that intermediate state value and never needs to buffer the raw input data.

ReduceFunction

A ReduceFunction can be defined in two ways:
(1) as a lambda expression;
(2) as a class implementing the ReduceFunction interface.
The code below demonstrates both.

package com.windowfunction

import com.fouth_sink.CustomSourceFunction
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object ReduceFunctionDemo {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    // A custom SourceFunction to make testing easy
    val customSourceFunction: CustomSourceFunction = new CustomSourceFunction
    // source
    val kafkaDS: DataStream[(String, Int)] = env.addSource(customSourceFunction)
    // Print the raw stream for inspection
    kafkaDS.print("streamPPP")

    // transform
    val resultDS: DataStream[(String, Int)] = kafkaDS
      .keyBy(0)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
      // 1、使用表达式方式
      //      .reduce((v1, v2) => (v1._1, v1._2 + v2._2))
      // 2、创建class实现ReduceFunction接口
      .reduce(new ReduceFunction[(String, Int)] {
      override def reduce(t: (String, Int), t1: (String, Int)): (String, Int) = {
        (t._1, t._2 + t1._2)
      }
    })
    // sink
    resultDS.print("stream")

    // execute
    env.execute("ReduceFunctionDemo")
  }
}

AggregateFunction

package com.windowfunction

import com.fouth_sink.CustomSourceFunction
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object AggregateFunctionDemo {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    // A custom SourceFunction to make testing easy
    val customSourceFunction: CustomSourceFunction = new CustomSourceFunction
    // source
    val kafkaDS: DataStream[(String, Int)] = env.addSource(customSourceFunction)
    // Print the raw stream for inspection
    kafkaDS.print("streamPPP")

    // transform
    val resultDS: DataStream[(String, Double)] = kafkaDS
      .keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
      // Compute the per-key average of the values
      .aggregate(new AggregateFunction[(String, Int), (String, Long, Long), (String, Double)] {
        // The accumulator is a 3-tuple: (key, running sum, element count)
        override def createAccumulator(): (String, Long, Long) = ("", 0L, 0L)

        override def add(in: (String, Int), acc: (String, Long, Long)): (String, Long, Long) = {
          (in._1, in._2 + acc._2, acc._3 + 1L)
        }

        override def getResult(acc: (String, Long, Long)): (String, Double) = {
          // Convert to Double before dividing so the average isn't truncated
          (acc._1, acc._2.toDouble / acc._3)
        }

        override def merge(acc: (String, Long, Long), acc1: (String, Long, Long)): (String, Long, Long) = {
          (acc._1, acc._2 + acc1._2, acc._3 + acc1._3)
        }
      })

    resultDS.print("result")


    env.execute("AggregateFunctionDemo")
  }
}

FoldFunction

A FoldFunction merges each input element of the window into an externally supplied initial value. It is marked @Deprecated in the official docs and may be removed in a future release, with AggregateFunction as the recommended replacement, so refer to the AggregateFunction example above.
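
For reference only, here is a minimal sketch of what the deprecated call looked like, assuming the same kafkaDS stream and imports as the ReduceFunction demo above; it folds each window element into an initial value to sum the second field per key:

// Deprecated API, shown only to illustrate the fold semantics
val foldedDS: DataStream[(String, Int)] = kafkaDS
  .keyBy(_._1)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
  .fold(("", 0)) { (acc: (String, Int), in: (String, Int)) =>
    // Merge each input element into the external initial value ("", 0)
    (in._1, acc._2 + in._2)
  }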

Full Window Functions

Full window functions are expensive and comparatively slow: the operator must buffer every element belonging to the window, and only when the window fires does it compute over all of the raw data.
They are still needed when a metric depends on every element in the window, or when the computation must access the window's state and metadata, for example finding the median or mode of a field across the window's elements (a median sketch follows the demo below).

ProcessWindowFunction

package com.windowfunction

import com.fouth_sink.CustomSourceFunction
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector


object ProcessWindowsFunctionDemo {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    // A custom SourceFunction to make testing easy
    val customSourceFunction: CustomSourceFunction = new CustomSourceFunction
    // source
    val kafkaDS: DataStream[(String, Int)] = env.addSource(customSourceFunction)
    // Print the raw stream for inspection
    kafkaDS.print("streamPPP")

    // transform
    val resultDS: DataStream[(String, Long, Long, Long, Long, Long)] = kafkaDS
      .keyBy(_._1) // Note: keying by a field index would make the key a Tuple and break the String key type below
      .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
      .process(new ProcessWindowFunction[(String, Int), (String, Long, Long, Long, Long, Long), String, TimeWindow] {
        override def process(key: String,
                             context: Context,
                             elements: Iterable[(String, Int)],
                             out: Collector[(String, Long, Long, Long, Long, Long)]): Unit = {
          val SUM: Long = elements.map(_._2).sum.toLong
          val MIN: Long = elements.map(_._2).min.toLong
          val MAX: Long = elements.map(_._2).max.toLong
          val AVG: Long = SUM / elements.size // integer average (truncated)
          val windowEnd: Long = context.window.getEnd
          out.collect((key, SUM, MIN, MAX, AVG, windowEnd))
        }
      })
    resultDS.print("result")


    env.execute("ProcessWindowsFunctionDemo")
  }
}
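
As a concrete case of the point above, here is a minimal sketch of the median example, reusing the same stream, imports, and window as the demo (for an even number of elements it takes the upper median):

// The median needs every element of the window, so it can only be
// computed in a full window function such as ProcessWindowFunction.
val medianDS: DataStream[(String, Int)] = kafkaDS
  .keyBy(_._1)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
  .process(new ProcessWindowFunction[(String, Int), (String, Int), String, TimeWindow] {
    override def process(key: String,
                         context: Context,
                         elements: Iterable[(String, Int)],
                         out: Collector[(String, Int)]): Unit = {
      val sorted = elements.map(_._2).toSeq.sorted
      out.collect((key, sorted(sorted.size / 2))) // upper median
    }
  })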

Combining Incremental Aggregation with Full Window Functions

Incremental aggregation functions improve window performance to a degree, but they are less flexible than ProcessWindowFunction; for example, they cannot operate on window state or read window metadata. Conversely, using ProcessWindowFunction for a basic incremental aggregation wastes resources.
In such cases, combine an incremental aggregation function with a full window function.

package com.windowfunction

import com.fouth_sink.CustomSourceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object ReduceFunctionAndProcessWindowFunctionDemo {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    // A custom SourceFunction to make testing easy
    val customSourceFunction: CustomSourceFunction = new CustomSourceFunction
    // source
    val kafkaDS: DataStream[(String, Int)] = env.addSource(customSourceFunction)
    // Print the raw stream for inspection
    kafkaDS.print("streamPPP")

    // transform
    val resultDS: DataStream[(Long, (String, Int))] = kafkaDS
      .keyBy(_._1) // Note: keying by a field index would make the key a Tuple and break the String key type below
      .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
      .reduce(
        // The ReduceFunction: keep the element with the smaller second field
        (r1: (String, Int), r2: (String, Int)) => {
          if (r1._2 < r2._2) r1 else r2
        },
        // The ProcessWindowFunction: collect the window metadata
        (key: String,
         window: TimeWindow,
         minReadings: Iterable[(String, Int)],
         out: Collector[(Long, (String, Int))]) => {
          val min: (String, Int) = minReadings.iterator.next()
          // Emit the window end time together with the minimum element
          out.collect((window.getEnd, min))
        }
      )
    resultDS.print("result")


    env.execute("ReduceFunctionAndProcessWindowFunctionDemo")
  }
}

The reduce call above takes two functions: a ReduceFunction and a ProcessWindowFunction.
The ReduceFunction keeps, per key, the element whose second field is smallest; the ProcessWindowFunction reads the window end time from the window metadata and emits it together with that minimum element.
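
The same combination also works with aggregate(): the AggregateFunction runs incrementally, and the ProcessWindowFunction receives only its single result plus the window metadata. Here is a minimal sketch reusing the average logic from the AggregateFunction demo above (same stream and imports assumed; the output field order is an arbitrary choice):

// Incremental average enriched with the window end time
val avgWithEndDS: DataStream[(Long, String, Double)] = kafkaDS
  .keyBy(_._1)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
  .aggregate(
    new AggregateFunction[(String, Int), (String, Long, Long), (String, Double)] {
      override def createAccumulator(): (String, Long, Long) = ("", 0L, 0L)
      override def add(in: (String, Int), acc: (String, Long, Long)): (String, Long, Long) =
        (in._1, acc._2 + in._2, acc._3 + 1L)
      override def getResult(acc: (String, Long, Long)): (String, Double) =
        (acc._1, acc._2.toDouble / acc._3)
      override def merge(a: (String, Long, Long), b: (String, Long, Long)): (String, Long, Long) =
        (a._1, a._2 + b._2, a._3 + b._3)
    },
    new ProcessWindowFunction[(String, Double), (Long, String, Double), String, TimeWindow] {
      override def process(key: String,
                           context: Context,
                           averages: Iterable[(String, Double)],
                           out: Collector[(Long, String, Double)]): Unit = {
        // The iterable holds exactly one element: the aggregate's result
        val (_, avg) = averages.iterator.next()
        out.collect((context.window.getEnd, key, avg))
      }
    }
  )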
