Batch Streaming
OVER aggregates compute an aggregated value for every input row over a range of ordered rows. In contrast to GROUP BY aggregates, OVER aggregates do not reduce the number of result rows to a single row for every group. Instead, OVER aggregates produce an aggregated value for every input row.
The following query computes for every order the sum of amounts of all orders for the same product that were received within one hour before the current order.
SELECT order_id, order_time, amount,
  SUM(amount) OVER (
    PARTITION BY product
    ORDER BY order_time
    RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW
  ) AS one_hour_prod_amount_sum
FROM Orders
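To make the result of this query concrete, the following Python sketch emulates its semantics on a handful of in-memory rows. It is an illustration only and not how Flink evaluates the query: the sample orders and the function name `one_hour_prod_amount_sum` are hypothetical, and both window boundaries are assumed to be inclusive.

```python
from datetime import datetime, timedelta

# Hypothetical sample rows: (order_id, product, order_time, amount)
orders = [
    (1, "apple", datetime(2024, 1, 1, 10, 0), 10),
    (2, "apple", datetime(2024, 1, 1, 10, 30), 20),
    (3, "pear", datetime(2024, 1, 1, 10, 45), 5),
    (4, "apple", datetime(2024, 1, 1, 11, 15), 30),
    (5, "apple", datetime(2024, 1, 1, 12, 30), 40),
]


def one_hour_prod_amount_sum(rows, width=timedelta(hours=1)):
    """Emulate SUM(amount) OVER (PARTITION BY product ORDER BY order_time
    RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW)."""
    result = []
    for order_id, product, order_time, _ in rows:
        # Same partition (product) and order_time within the last hour,
        # assuming both boundaries are inclusive.
        total = sum(
            a
            for _, p, t, a in rows
            if p == product and order_time - width <= t <= order_time
        )
        # One output row per input row: nothing is collapsed into groups.
        result.append((order_id, total))
    return result


for order_id, total in one_hour_prod_amount_sum(orders):
    print(order_id, total)
```

Every input order yields exactly one output row. Order 4, for example, sums only the amounts of orders 2 and 4, because order 1 is more than one hour older and order 3 belongs to a different product.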
The syntax for an OVER window is summarized below.
SELECT
  agg_func(agg_col) OVER (
    [PARTITION BY col1[, col2, ...]]
    ORDER BY time_col
    range_definition),
  ...
FROM ...
You can define multiple OVER window aggregates in a SELECT clause. However, for streaming queries, the OVER windows for all aggregates must be identical due to a current limitation.
OVER windows are defined on an ordered sequence of rows. Since tables do not have an inherent order, the ORDER BY clause is mandatory. For streaming queries, Flink currently supports only OVER windows that are ordered by a time attribute in ascending order. Other orderings are not supported.
OVER windows can be defined on a partitioned table. In the presence of a PARTITION BY clause, the aggregate for each input row is computed only over the rows of its partition.
The range definition specifies how many rows are included in the aggregate. The range is defined with a BETWEEN clause that defines a lower and an upper boundary. All rows between these boundaries are included in the aggregate. Flink only supports CURRENT ROW as the upper boundary.
There are two options for defining the range: ROWS intervals and RANGE intervals.
A RANGE interval is defined on the values of the ORDER BY column, which in Flink is always a time attribute. The following RANGE interval specifies that all rows whose time attribute is at most 30 minutes earlier than that of the current row are included in the aggregate.
RANGE BETWEEN INTERVAL '30' MINUTE PRECEDING AND CURRENT ROW
A ROWS interval is a count-based interval. It defines exactly how many rows are included in the aggregate. The following ROWS interval defines that the 10 rows preceding the current row and the current row (so 11 rows in total) are included in the aggregate.
ROWS BETWEEN 10 PRECEDING AND CURRENT ROW
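The count-based behavior of a ROWS interval can be sketched in a few lines of Python. This is a minimal emulation under stated assumptions, not Flink code: the helper `rows_window_sum` and the sample amounts are hypothetical, the input list is assumed to be already sorted by the ORDER BY column, and a smaller bound of 2 preceding rows is used to keep the output short.

```python
def rows_window_sum(amounts, preceding):
    """Emulate SUM(x) OVER (ORDER BY ...
    ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW)
    for a list that is already sorted by the ORDER BY column."""
    sums = []
    for i in range(len(amounts)):
        # The window starts `preceding` rows back, clamped at the first row.
        start = max(0, i - preceding)
        sums.append(sum(amounts[start:i + 1]))
    return sums


# Hypothetical amounts, in order_time order.
amounts = [10, 20, 30, 40, 50]
print(rows_window_sum(amounts, 2))  # [10, 30, 60, 90, 120]
```

Unlike a RANGE interval, the window here depends only on row positions and not on how far apart the rows are in time. Note that the first rows see fewer than `preceding + 1` rows, simply because no earlier rows exist.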
WINDOW
The WINDOW clause can be used to define an OVER window outside of the SELECT clause. It can make queries more readable and also allows us to reuse the window definition for multiple aggregates.
SELECT order_id, order_time, amount,
  SUM(amount) OVER w AS sum_amount,
  AVG(amount) OVER w AS avg_amount
FROM Orders
WINDOW w AS (
  PARTITION BY product
  ORDER BY order_time
  RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW)
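One way to read the named window is as a single definition of "which rows belong to the current row" that several aggregates then share. The Python sketch below mirrors that reading. It is not a description of Flink's execution: the sample rows and the helpers `window_w` and `sum_and_avg` are hypothetical, and the window boundaries are assumed to be inclusive.

```python
from datetime import datetime, timedelta

# Hypothetical sample rows: (order_id, product, order_time, amount)
orders = [
    (1, "apple", datetime(2024, 1, 1, 10, 0), 10),
    (2, "apple", datetime(2024, 1, 1, 10, 30), 20),
    (3, "apple", datetime(2024, 1, 1, 11, 15), 30),
]


def window_w(rows, current, width=timedelta(hours=1)):
    """Amounts of the rows in window w for `current`: same product and
    order_time within [current time - width, current time]."""
    _, product, order_time, _ = current
    return [
        a
        for _, p, t, a in rows
        if p == product and order_time - width <= t <= order_time
    ]


def sum_and_avg(rows):
    """Emulate SUM(amount) OVER w and AVG(amount) OVER w."""
    result = []
    for row in rows:
        amounts = window_w(rows, row)  # one definition, two aggregates
        result.append((row[0], sum(amounts), sum(amounts) / len(amounts)))
    return result


for order_id, sum_amount, avg_amount in sum_and_avg(orders):
    print(order_id, sum_amount, avg_amount)
```

Because both aggregates are computed from the same `window_w` result, they are guaranteed to see identical rows, which corresponds to the requirement that streaming queries use the same OVER window for all aggregates.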