Flink's AggregateFunction

What is AggregateFunction

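Flink's AggregateFunction is a function that computes incrementally on an intermediate result, the accumulator state. Because each element is folded into that state as it arrives, a window does not need to buffer all of its elements, so execution is comparatively efficient.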
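To make the difference concrete, here is a minimal plain-Java sketch (no Flink involved) that contrasts buffering a whole window with folding each element into a single intermediate value. The class and method names (`IncrementalVsBuffered`, `bufferedMax`, `incrementalMax`) are made up for illustration and are not part of any Flink API.

```java
import java.util.ArrayList;
import java.util.List;

class IncrementalVsBuffered {

    // Buffered style: hold every element of the window, then scan once the window closes.
    // The state grows linearly with the number of elements.
    static int bufferedMax(int[] window) {
        List<Integer> buffer = new ArrayList<>();
        for (int v : window) {
            buffer.add(v);
        }
        int max = Integer.MIN_VALUE;
        for (int v : buffer) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Incremental style: fold each element into one intermediate value as it arrives.
    // The state is a single int, regardless of how many elements the window sees.
    static int incrementalMax(int[] window) {
        int acc = Integer.MIN_VALUE;
        for (int v : window) {
            acc = Math.max(acc, v);
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] window = {3, 9, 4};
        System.out.println(bufferedMax(window));    // prints 9
        System.out.println(incrementalMax(window)); // prints 9
    }
}
```

Both methods return the same answer; only the amount of state held while the window is open differs.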

Definition of AggregateFunction


/**
 * The {@code AggregateFunction} is a flexible aggregation function, characterized by the
 * following features:
 *
 * <ul>
 *   <li>The aggregates may use different types for input values, intermediate aggregates,
 *       and result type, to support a wide range of aggregation types.
 *   <li>Support for distributive aggregations: Different intermediate aggregates can be
 *       merged together, to allow for pre-aggregation/final-aggregation optimizations.
 * </ul>
 *
 * <p>The {@code AggregateFunction}'s intermediate aggregate (in-progress aggregation state)
 * is called the accumulator. Values are added to the accumulator, and final aggregates are
 * obtained by finalizing the accumulator state. This supports aggregation functions where the
 * intermediate state needs to be different than the aggregated values and the final result type,
 * such as for example average (which typically keeps a count and sum).
 * Merging intermediate aggregates (partial aggregates) means merging the accumulators.
 *
 * <p>The AggregationFunction itself is stateless. To allow a single AggregationFunction
 * instance to maintain multiple aggregates (such as one aggregate per key), the
 * AggregationFunction creates a new accumulator whenever a new aggregation is started.
 *
 * <p>Aggregation functions must be {@link Serializable} because they are sent around
 * between distributed processes during distributed execution.
 *
 * <h1>Example: Average and Weighted Average</h1>
 *
 * <pre>{@code
 * // the accumulator, which holds the state of the in-flight aggregate
 * public class AverageAccumulator {
 *     long count;
 *     long sum;
 * }
 *
 * // implementation of an aggregation function for an 'average'
 * public class Average implements AggregateFunction<Integer, AverageAccumulator, Double> {
 *
 *     public AverageAccumulator createAccumulator() {
 *         return new AverageAccumulator();
 *     }
 *
 *     public AverageAccumulator merge(AverageAccumulator a, AverageAccumulator b) {
 *         a.count += b.count;
 *         a.sum += b.sum;
 *         return a;
 *     }
 *
 *     public AverageAccumulator add(Integer value, AverageAccumulator acc) {
 *         acc.sum += value;
 *         acc.count++;
 *         return acc;
 *     }
 *
 *     public Double getResult(AverageAccumulator acc) {
 *         return acc.sum / (double) acc.count;
 *     }
 * }
 *
 * // implementation of a weighted average
 * // this reuses the same accumulator type as the aggregate function for 'average'
 * public class WeightedAverage implements AggregateFunction<Datum, AverageAccumulator, Double> {
 *
 *     public AverageAccumulator createAccumulator() {
 *         return new AverageAccumulator();
 *     }
 *
 *     public AverageAccumulator merge(AverageAccumulator a, AverageAccumulator b) {
 *         a.count += b.count;
 *         a.sum += b.sum;
 *         return a;
 *     }
 *
 *     public AverageAccumulator add(Datum value, AverageAccumulator acc) {
 *         acc.count += value.getWeight();
 *         acc.sum += value.getValue();
 *         return acc;
 *     }
 *
 *     public Double getResult(AverageAccumulator acc) {
 *         return acc.sum / (double) acc.count;
 *     }
 * }
 * }</pre>
 *
 * @param <IN> The type of the values that are aggregated (input values)
 * @param <ACC> The type of the accumulator (intermediate aggregate state).
 * @param <OUT> The type of the aggregated result
 */
@PublicEvolving
public interface AggregateFunction<IN, ACC, OUT> extends Function, Serializable {

    /**
     * Creates a new accumulator, starting a new aggregate.
     *
     * <p>The new accumulator is typically meaningless unless a value is added
     * via {@link #add(Object, Object)}.
     *
     * <p>The accumulator is the state of a running aggregation. When a program has multiple
     * aggregates in progress (such as per key and window), the state (per key and window)
     * is the size of the accumulator.
     *
     * @return A new accumulator, corresponding to an empty aggregate.
     */
    ACC createAccumulator();

    /**
     * Adds the given input value to the given accumulator, returning the
     * new accumulator value.
     *
     * <p>For efficiency, the input accumulator may be modified and returned.
     *
     * @param value The value to add
     * @param accumulator The accumulator to add the value to
     */
    ACC add(IN value, ACC accumulator);

    /**
     * Gets the result of the aggregation from the accumulator.
     *
     * @param accumulator The accumulator of the aggregation
     * @return The final aggregation result.
     */
    OUT getResult(ACC accumulator);

    /**
     * Merges two accumulators, returning an accumulator with the merged state.
     *
     * <p>This function may reuse any of the given accumulators as the target for the merge
     * and return that. The assumption is that the given accumulators will not be used any
     * more after having been passed to this function.
     *
     * @param a An accumulator to merge
     * @param b Another accumulator to merge
     *
     * @return The accumulator with the merged state
     */
    ACC merge(ACC a, ACC b);
}

From the definition, there are four methods to implement:

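  1. ACC createAccumulator(); the initial value of the accumulator state
  2. ACC add(IN value, ACC accumulator); how each input record is folded into the accumulator
  3. ACC merge(ACC a, ACC b); how the accumulators of multiple partitions are merged
  4. OUT getResult(ACC accumulator); how the final accumulator is turned into the returned result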
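As an illustration of how the four methods fit together when the input, accumulator, and result types all differ, here is a small sketch that counts distinct strings. To keep it runnable without Flink on the classpath, it uses a local interface `SimpleAggregate` that only mirrors the four method signatures; that interface, `DistinctCount`, and `demo` are hypothetical names for this example, not Flink classes.

```java
import java.util.HashSet;
import java.util.Set;

class DistinctCountSketch {

    // Hypothetical stand-in that mirrors the four methods of Flink's AggregateFunction,
    // so that this sketch compiles without a Flink dependency.
    interface SimpleAggregate<IN, ACC, OUT> {
        ACC createAccumulator();
        ACC add(IN value, ACC accumulator);
        ACC merge(ACC a, ACC b);
        OUT getResult(ACC accumulator);
    }

    // IN = String, ACC = Set<String>, OUT = Integer: three different types.
    static class DistinctCount implements SimpleAggregate<String, Set<String>, Integer> {

        // 1. Start with an empty set.
        public Set<String> createAccumulator() {
            return new HashSet<>();
        }

        // 2. Fold one input value into the accumulator (modified in place and returned).
        public Set<String> add(String value, Set<String> acc) {
            acc.add(value);
            return acc;
        }

        // 3. Combine two partial accumulators, reusing the first as the target.
        public Set<String> merge(Set<String> a, Set<String> b) {
            a.addAll(b);
            return a;
        }

        // 4. Turn the final accumulator into the result.
        public Integer getResult(Set<String> acc) {
            return acc.size();
        }
    }

    // Builds two partial accumulators, merges them, and returns the distinct count.
    static int demo() {
        DistinctCount f = new DistinctCount();
        Set<String> left = f.createAccumulator();
        f.add("a", left);
        f.add("b", left);
        Set<String> right = f.createAccumulator();
        f.add("b", right);
        f.add("c", right);
        return f.getResult(f.merge(left, right));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 3 (a, b, c)
    }
}
```

Unlike the average case, this accumulator grows with the number of distinct values, so it is meant only to show the role of each method, not as a memory-efficient design.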

Below is a demo that computes an average.


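import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val input: DataStream[(String, Int)] = …………
val result: DataStream[Double] = input.keyBy(_._1)
      // Sliding event-time window: window size 1 hour, slide 10 seconds
      .window(SlidingEventTimeWindows.of(Time.hours(1), Time.seconds(10)))
      .aggregate(new AggregateFunction[(String, Int), (Int, Int), Double] {
        // Initial accumulator: (sum of scores, number of records)
        override def createAccumulator(): (Int, Int) = (0, 0)

        // Fold one input record (key, score) into the accumulator
        override def add(value: (String, Int), accumulator: (Int, Int)): (Int, Int) =
          (accumulator._1 + value._2, accumulator._2 + 1)

        // Combine the accumulators of two partitions
        override def merge(a: (Int, Int), b: (Int, Int)): (Int, Int) = (a._1 + b._1, a._2 + b._2)

        // Return the result; convert to Double first to avoid integer division
        override def getResult(accumulator: (Int, Int)): Double =
          accumulator._1.toDouble / accumulator._2
      })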

In the code above, the input records are of type (String, Int). The String can be treated as the key and the Int as a score.

How aggregate executes

The walkthrough below uses the demo above as the example.

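  1. Start from the initial accumulator (0, 0). The first element of the tuple holds the sum of the scores and the second holds the number of records.
  2. For each input record, take its score and add it to the first element of the accumulator, and increment the second element (the record count) by 1.
  3. Once every partition has finished accumulating, the per-partition accumulators are merged into the final accumulator.
  4. The final accumulator is processed to produce the output (here, the sum divided by the count).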
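The four steps can be traced by hand with a small plain-Java sketch. It represents the accumulator tuple as an `int[] {sum, count}` and uses made-up scores (80, 90, 100) split across two partial accumulators. The class `AverageTrace` and its methods are illustrative stand-ins for the demo's logic and do not call Flink.

```java
class AverageTrace {

    // Step 1: initial accumulator {sum of scores, record count} = {0, 0}
    static int[] createAccumulator() {
        return new int[] {0, 0};
    }

    // Step 2: add the score to the sum and increment the count
    static int[] add(int score, int[] acc) {
        return new int[] {acc[0] + score, acc[1] + 1};
    }

    // Step 3: merge two partial accumulators element by element
    static int[] merge(int[] a, int[] b) {
        return new int[] {a[0] + b[0], a[1] + b[1]};
    }

    // Step 4: turn the final accumulator into the average
    static double getResult(int[] acc) {
        return acc[0] / (double) acc[1];
    }

    static double run() {
        int[] first = createAccumulator();              // {0, 0}
        first = add(80, first);                         // {80, 1}
        first = add(90, first);                         // {170, 2}
        int[] second = add(100, createAccumulator());   // {100, 1}
        int[] merged = merge(first, second);            // {270, 3}
        return getResult(merged);                       // 270 / 3.0
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 90.0
    }
}
```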

Reposted from: https://my.oschina.net/u/1396185/blog/3068390
