Window Functions in Flink, with Example Code

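After defining the window assigner, we need to specify the computation to perform on each window. This is the job of the window function, which processes the elements of each window once the system determines that the window is ready for processing.

A window function can be one of ReduceFunction, AggregateFunction, FoldFunction, WindowFunction, or ProcessWindowFunction. The first three execute more efficiently because Flink can aggregate the elements of each window incrementally as they arrive. A ProcessWindowFunction (like the legacy WindowFunction) instead receives an Iterable of all the elements in the window, along with additional meta information about the window the elements belong to.

A window transformation with a ProcessWindowFunction cannot be executed as efficiently as the other cases, because Flink has to buffer all the elements of the window internally before invoking the function. This can be mitigated by combining a ProcessWindowFunction with a ReduceFunction, AggregateFunction, or FoldFunction, which gives both incremental aggregation of the window elements and the additional window metadata that the ProcessWindowFunction receives.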

ReduceFunction

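A ReduceFunction specifies how two elements from the input are combined to produce an output element of the same type. Flink uses the ReduceFunction to incrementally aggregate the elements of a window.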

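import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Sums the counts of two (word, count) pairs that share the same key.
class UserDefineReduceFunction extends ReduceFunction[(String,Int)]{
  override def reduce(v1: (String, Int), v2: (String, Int)): (String, Int) = {
    (v1._1,v1._2+v2._2)
  }
}
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS", 9999)
  .flatMap(_.split("\\s+"))
  .map((_,1))
  .keyBy(t =>t._1)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .reduce(new UserDefineReduceFunction)
  .print()
env.execute()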
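
To see why a ReduceFunction needs only constant state per window, the following plain-Scala sketch models the incremental behaviour without Flink. The `IncrementalReducer` class and the `IncrementalReducerDemo` object are hypothetical names used only for illustration: the reducer keeps one running value and combines each arriving element into it, which is roughly what the window operator does with the reduce function above.

```scala
// Plain-Scala model of incremental reduction (not Flink API).
// `IncrementalReducer` is a hypothetical helper for illustration only.
final class IncrementalReducer[T](reduce: (T, T) => T) {
  // The only state kept for the window: the running aggregate, if any.
  private var current: Option[T] = None

  // Called once per arriving element; combines it into the running value.
  def add(element: T): Unit =
    current = Some(current.fold(element)(acc => reduce(acc, element)))

  // Called when the window fires; None means no element ever arrived.
  def result: Option[T] = current
}

object IncrementalReducerDemo {
  // Same combining logic as the ReduceFunction above: keep the word, add the counts.
  val sumCounts: ((String, Int), (String, Int)) => (String, Int) =
    (a, b) => (a._1, a._2 + b._2)

  // Feeds `occurrences` copies of (word, 1) through the reducer.
  def countWord(word: String, occurrences: Int): Option[(String, Int)] = {
    val reducer = new IncrementalReducer[(String, Int)](sumCounts)
    (1 to occurrences).foreach(_ => reducer.add((word, 1)))
    reducer.result
  }
}
```

However many elements are added, the reducer holds a single value, so memory use does not grow with the window size.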

AggregateFunction

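An AggregateFunction is a generalized version of ReduceFunction with three types: an input type (IN), an accumulator type (ACC), and an output type (OUT). The input type is the type of the elements in the input stream, and the AggregateFunction has a method for adding one input element to an accumulator. The interface also has methods for creating an initial accumulator, for merging two accumulators into one, and for extracting the output (of type OUT) from an accumulator. As with ReduceFunction, Flink aggregates the input elements of a window incrementally as they arrive.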

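import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// IN, ACC and OUT are all (word, count) here; the accumulator holds the running count.
class UserDefineAggregateFunction extends AggregateFunction[(String,Int),(String,Int),(String,Int)]{
  override def createAccumulator(): (String, Int) = ("",0)
  override def add(value: (String, Int), accumulator: (String, Int)): (String, Int) = (value._1,value._2+accumulator._2)
  override def getResult(accumulator: (String, Int)): (String, Int) = accumulator
  override def merge(a: (String, Int), b: (String, Int)): (String, Int) = (a._1,a._2+b._2)
}
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS", 9999)
  .flatMap(_.split("\\s+"))
  .map((_,1))
  .keyBy(t =>t._1)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .aggregate(new UserDefineAggregateFunction)
  .print()
env.execute()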
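
The word-count example uses the same tuple type for IN, ACC, and OUT, which hides the reason for having three types. The sketch below computes an average, where the three types differ. It is plain Scala: the `SimpleAggregate` trait is a hypothetical stand-in that mirrors the four methods described above so the code runs without Flink on the classpath, and `AverageAggregate` and `AverageAggregateDemo` are illustrative names. Returning 0.0 for an empty accumulator is an assumption of this sketch.

```scala
// Hypothetical stand-in mirroring the four methods described above
// (not the Flink interface itself).
trait SimpleAggregate[IN, ACC, OUT] {
  def createAccumulator(): ACC
  def add(value: IN, accumulator: ACC): ACC
  def getResult(accumulator: ACC): OUT
  def merge(a: ACC, b: ACC): ACC
}

// IN = (key, value), ACC = (sum, count), OUT = average.
object AverageAggregate extends SimpleAggregate[(String, Long), (Long, Long), Double] {
  override def createAccumulator(): (Long, Long) = (0L, 0L)

  // Adds one element: increase the sum by its value and the count by one.
  override def add(value: (String, Long), accumulator: (Long, Long)): (Long, Long) =
    (accumulator._1 + value._2, accumulator._2 + 1L)

  // Assumption: an empty accumulator yields 0.0 instead of dividing by zero.
  override def getResult(accumulator: (Long, Long)): Double =
    if (accumulator._2 == 0L) 0.0 else accumulator._1.toDouble / accumulator._2

  // Combines two partial results, e.g. when two windows are merged.
  override def merge(a: (Long, Long), b: (Long, Long)): (Long, Long) =
    (a._1 + b._1, a._2 + b._2)
}

object AverageAggregateDemo {
  // Drives the aggregate the way a window would: one add() per element, then getResult().
  def average(values: Seq[Long]): Double = {
    val acc = values.foldLeft(AverageAggregate.createAccumulator()) {
      (accumulator, v) => AverageAggregate.add(("k", v), accumulator)
    }
    AverageAggregate.getResult(acc)
  }
}
```

The accumulator carries a sum and a count, something a ReduceFunction could not express directly because its output must have the same type as its input.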

FoldFunction

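A FoldFunction specifies how an input element of the window is combined with an element of the output type. The FoldFunction is called incrementally for each element added to the window, together with the current output value. The first element is combined with a predefined initial value of the output type. Note that FoldFunction is deprecated in later Flink releases, where AggregateFunction is the recommended replacement.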

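import org.apache.flink.api.common.functions.FoldFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Folds each (word, 1) element into the running (word, count) accumulator.
class UserDefineFoldFunction extends FoldFunction[(String,Int),(String,Int)]{
  override def fold(accumulator: (String, Int), value: (String, Int)): (String, Int) = {
    (value._1,accumulator._2+value._2)
  }
}
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("Flink", 9999)
  .flatMap(_.split("\\s+"))
  .map((_,1))
  .keyBy(t =>t._1)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .fold(("",0),new UserDefineFoldFunction) // ("",0) is the initial accumulator value
  .print()
env.execute()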

Note: fold() cannot be used with session windows or other mergeable windows. Attempting to do so fails with:

Exception in thread "main" java.lang.UnsupportedOperationException: Fold cannot be used with a merging WindowAssigner.
	at org.apache.flink.streaming.api.datastream.WindowedStream.fold(WindowedStream.java:506)
	at org.apache.flink.streaming.api.datastream.WindowedStream.fold(WindowedStream.java:450)
	at org.apache.flink.streaming.api.scala.WindowedStream.fold(WindowedStream.scala:391)
	at com.baizhi.windowfunction.FlinkFoldFunction$.main(FlinkFoldFunction.scala:19)
	at com.baizhi.windowfunction.FlinkFoldFunction.main(FlinkFoldFunction.scala)
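
The fold semantics described above can be modelled in plain Scala with `foldLeft`. This is a sketch, not Flink API, and `FoldDemo` is a hypothetical name. It uses an output type (String) that differs from the input type, and it hints at a likely reason for the restriction: a fold only knows how to combine an accumulator with one more input element, and has no operation for merging two accumulators, which is what merging windows would require.

```scala
// Plain-Scala model of fold semantics (not Flink API).
object FoldDemo {
  // Output type (String) differs from the input type ((String, Int)).
  // Arguments are (accumulator, value), the same order as in the fold method above.
  def appendWord(acc: String, value: (String, Int)): String =
    if (acc.isEmpty) value._1 else acc + "," + value._1

  // The initial value "" plays the role of the first argument to .fold(...).
  def foldWindow(elements: Seq[(String, Int)]): String =
    elements.foldLeft("")(appendWord)
}
```

For example, `FoldDemo.foldWindow(Seq(("a", 1), ("b", 1)))` starts from the empty string and appends one word per element.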

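WindowFunction (Legacy)

In some places where a ProcessWindowFunction can be used, a WindowFunction can be used instead. It is the older version of the interface: it provides less contextual information and lacks some advanced features, such as per-window keyed state.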

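import java.text.SimpleDateFormat
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Type parameters: input (word, count), output String, key String, window TimeWindow.
class UserDefineWindowFunction extends WindowFunction[(String,Int),String,String,TimeWindow]{
  override def apply(key: String,
                     window: TimeWindow,
                     input: Iterable[(String, Int)],
                     out: Collector[String]): Unit = {
    // Window metadata is available directly from the window object.
    val sdf = new SimpleDateFormat("HH:mm:ss")
    val start = sdf.format(window.getStart)
    val end = sdf.format(window.getEnd)
    val maxTimestamp = sdf.format(window.maxTimestamp())
    println(s"key:${key},start:${start},end:${end},maxTimestamp:${maxTimestamp}")
    // All buffered elements of the window are summed only when the window fires.
    out.collect(s"${key},${input.map(_._2).sum}")
  }
}
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("Flink", 9999)
  .flatMap(_.split("\\s+"))
  .map((_,1))
  .keyBy(t =>t._1)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .apply(new UserDefineWindowFunction)
  .print()
env.execute()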

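Note: the key in keyBy must be specified here with a KeySelector (such as the lambda above), not by field position. With a position-based key the key type becomes a Tuple, which does not match the String key type declared in the WindowFunction.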

ProcessWindowFunction

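A ProcessWindowFunction receives an Iterable containing all the elements of the window, plus a Context object that gives access to time and state information, which makes it more flexible than the other window functions. This flexibility comes at the cost of performance and resource consumption: elements cannot be aggregated incrementally and must instead be buffered internally until the window is considered ready for processing.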

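import java.text.SimpleDateFormat
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Type parameters: input (word, count), output String, key String, window TimeWindow.
class UserDefineProcessWindowFunction extends ProcessWindowFunction[(String,Int),String,String,TimeWindow]{
  override def process(key: String,
                       context: Context,
                       elements: Iterable[(String, Int)],
                       out: Collector[String]): Unit = {
    // The window is reached through the Context object.
    val sdf = new SimpleDateFormat("HH:mm:ss")
    val start = sdf.format(context.window.getStart)
    val end = sdf.format(context.window.getEnd)
    val maxTimestamp = sdf.format(context.window.maxTimestamp())
    println(s"key:${key},start:${start},end:${end},maxTimestamp:${maxTimestamp}")
    out.collect(s"${key},${elements.toList.map(_._2).sum}")
  }
}
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("Flink", 9999)
  .flatMap(_.split("\\s+"))
  .map((_,1))
  .keyBy(t =>t._1)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .process(new UserDefineProcessWindowFunction)
  .print()
env.execute()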

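ProcessWindowFunction/WindowFunction with Incremental Aggregation

As the WindowFunction and ProcessWindowFunction examples show, these two functions are less efficient than ReduceFunction, AggregateFunction, and FoldFunction, but both give access to the window object and therefore to the window metadata. To get the window object and incremental computation at the same time, combine a WindowFunction or ProcessWindowFunction with one of the three incremental functions.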

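import java.text.SimpleDateFormat
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Incremental part: runs as each element arrives.
class UserDefineReduceFunction1 extends ReduceFunction[(String,Int)]{
  override def reduce(v1: (String, Int), v2: (String, Int)): (String, Int) = {
    (v1._1,v1._2+v2._2)
  }
}
// Window part: runs once when the window fires, with access to window metadata.
class UserDefineProcessWindowFunction1 extends ProcessWindowFunction[(String,Int),String,String,TimeWindow]{
  override def process(key: String,
                       context: Context,
                       elements: Iterable[(String, Int)],
                       out: Collector[String]): Unit = {
    val sdf = new SimpleDateFormat("HH:mm:ss")
    val start = sdf.format(context.window.getStart)
    val end = sdf.format(context.window.getEnd)
    val maxTimestamp = sdf.format(context.window.maxTimestamp())

    // Because of the pre-aggregation, elements holds only the single reduced value.
    val list = elements.toList

    println(s"key:${key},start:${start},end:${end},maxTimestamp:${maxTimestamp},list:${list.mkString(",")}")
    out.collect(s"${key},${list.map(_._2).sum}")
  }
}
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("Flink", 9999)
  .flatMap(_.split("\\s+"))
  .map((_,1))
  .keyBy(t=>t._1)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .reduce(new UserDefineReduceFunction1,new UserDefineProcessWindowFunction1)
  .print()
env.execute()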
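
The division of labour between the two functions can be modelled in plain Scala. The sketch below is not Flink API; `WindowMeta`, `PreAggregatingWindow`, and `PreAggregationDemo` are hypothetical names. The reduce step runs on every arriving element and keeps a single value, and the window step runs once on firing and sees only that value together with the window metadata.

```scala
// Plain-Scala model (not Flink API) of reduce combined with a window function.
// `WindowMeta` and `PreAggregatingWindow` are hypothetical names.
final case class WindowMeta(start: Long, end: Long)

final class PreAggregatingWindow[T, R](
    meta: WindowMeta,
    reduce: (T, T) => T,
    emit: (WindowMeta, Iterable[T]) => R) {

  // Window contents after pre-aggregation: at most one element.
  private var aggregate: Option[T] = None

  // Incremental step: state stays at one element regardless of input size.
  def add(element: T): Unit =
    aggregate = Some(aggregate.fold(element)(acc => reduce(acc, element)))

  // Firing step: the window function sees only the pre-aggregated element.
  def fire(): R = emit(meta, aggregate.toList)
}

object PreAggregationDemo {
  // Sums the counts incrementally, then formats the result with the window bounds.
  def run(counts: Seq[Int]): String = {
    val window = new PreAggregatingWindow[Int, String](
      WindowMeta(0L, 5000L),
      _ + _,
      (meta, elems) => s"[${meta.start},${meta.end}) size=${elems.size} sum=${elems.sum}")
    counts.foreach(window.add)
    window.fire()
  }
}
```

Calling `PreAggregationDemo.run(Seq(1, 1, 1))` reports a size of 1 and a sum of 3: three elements arrived, but the window function received only one.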

Using per-window state in ProcessWindowFunction

ProcessWindowFunction differs from WindowFunction in that it gives access not only to the window's metadata but also to WindowState and GlobalState.

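  • WindowState - state scoped to the current key and the current window; it is discarded when the window is cleaned up.
  • GlobalState - state scoped to the key only, not to any single window, so it can accumulate values across multiple windows.
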
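import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

class UserDefineProcessWindowFunction2 extends ProcessWindowFunction[(String,Int),String,String,TimeWindow]{
  var windowStateDescriptor:ValueStateDescriptor[Int]=_
  var globalStateDescriptor:ValueStateDescriptor[Int]=_

  override def open(parameters: Configuration): Unit = {
    windowStateDescriptor=new ValueStateDescriptor[Int]("window",createTypeInformation[Int])
    globalStateDescriptor=new ValueStateDescriptor[Int]("global",createTypeInformation[Int])
  }

  override def process(key: String,
                       context: Context,
                       elements: Iterable[(String, Int)],
                       out: Collector[String]): Unit = {
    val sum = elements.toList.map(_._2).sum

    // Per-window state: scoped to this key and this window.
    val windowState = context.windowState.getState(windowStateDescriptor)
    // Global state: scoped to this key across all windows.
    val globalState = context.globalState.getState(globalStateDescriptor)

    // An unset ValueState[Int] reads as 0 here, because Scala unboxes null to 0.
    windowState.update(sum+windowState.value())
    globalState.update(sum+globalState.value())

    out.collect(s"${key},window:${windowState.value()}\tglobal:${globalState.value()}")
  }

  // Release the per-window state when the window is cleaned up.
  override def clear(context: Context): Unit = {
    context.windowState.getState(windowStateDescriptor).clear()
  }
}
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("Flink", 9999)
  .flatMap(_.split("\\s+"))
  .map((_,1))
  .keyBy(t =>t._1)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .process(new UserDefineProcessWindowFunction2)
  .print()
env.execute()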
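
The difference between the two state scopes can be modelled in plain Scala with two maps. This is a sketch, not Flink API, and `StateScopes` is a hypothetical name: window state is indexed by key and window, global state by key alone, and cleaning up a window removes only the former.

```scala
import scala.collection.mutable

// Plain-Scala model (not Flink API) of the two state scopes.
// `StateScopes` is a hypothetical name used only for this sketch.
final class StateScopes {
  // Scoped to (key, window start): discarded when the window is cleaned up.
  private val windowState = mutable.Map.empty[(String, Long), Int]
  // Scoped to the key only: survives across windows.
  private val globalState = mutable.Map.empty[String, Int]

  // Models one firing of the window function; returns (window total, global total).
  def fire(key: String, windowStart: Long, sum: Int): (Int, Int) = {
    val w = windowState.getOrElse((key, windowStart), 0) + sum
    val g = globalState.getOrElse(key, 0) + sum
    windowState.update((key, windowStart), w)
    globalState.update(key, g)
    (w, g)
  }

  // Models window cleanup: only the per-window entry is removed.
  def clearWindow(key: String, windowStart: Long): Unit = {
    windowState.remove((key, windowStart))
    ()
  }
}
```

Firing a window with a sum of 3, clearing it, and then firing the next window for the same key with a sum of 2 returns a window total of 2 but a global total of 5.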
