Batch Streaming
Window aggregations are defined in a GROUP BY clause that contains the "window_start" and "window_end" columns of the relation to which a windowing TVF has been applied. Just like queries with regular GROUP BY clauses, queries with a window aggregation compute a single result row per group.
SELECT ...
FROM <windowed_table> -- relation applied windowing TVF
GROUP BY window_start, window_end, ...
Unlike other aggregations on continuous tables, window aggregations do not emit intermediate results. They emit only a final result, the total aggregation at the end of the window. Moreover, window aggregations purge all intermediate state once it is no longer needed.
Flink supports TUMBLE, HOP and CUMULATE types of window aggregations. In streaming mode, the time attribute field of a window table-valued function must be either an event-time or a processing-time attribute. See Windowing TVF for more information on windowing functions. In batch mode, the time attribute field of a window table-valued function must be an attribute of type TIMESTAMP or TIMESTAMP_LTZ.
Here are some examples of TUMBLE, HOP and CUMULATE window aggregations.
-- tables must have time attribute, e.g. `bidtime` in this table
Flink SQL> desc Bid;
+-------------+------------------------+------+-----+--------+---------------------------------+
| name | type | null | key | extras | watermark |
+-------------+------------------------+------+-----+--------+---------------------------------+
| bidtime | TIMESTAMP(3) *ROWTIME* | true | | | `bidtime` - INTERVAL '1' SECOND |
| price | DECIMAL(10, 2) | true | | | |
| item | STRING | true | | | |
| supplier_id | STRING | true | | | |
+-------------+------------------------+------+-----+--------+---------------------------------+
Flink SQL> SELECT * FROM Bid;
+------------------+-------+------+-------------+
| bidtime | price | item | supplier_id |
+------------------+-------+------+-------------+
| 2020-04-15 08:05 | 4.00 | C | supplier1 |
| 2020-04-15 08:07 | 2.00 | A | supplier1 |
| 2020-04-15 08:09 | 5.00 | D | supplier2 |
| 2020-04-15 08:11 | 3.00 | B | supplier2 |
| 2020-04-15 08:13 | 1.00 | E | supplier1 |
| 2020-04-15 08:17 | 6.00 | F | supplier2 |
+------------------+-------+------+-------------+
-- tumbling window aggregation
Flink SQL> SELECT window_start, window_end, SUM(price) AS price
FROM TABLE(
TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
GROUP BY window_start, window_end;
+------------------+------------------+-------+
| window_start | window_end | price |
+------------------+------------------+-------+
| 2020-04-15 08:00 | 2020-04-15 08:10 | 11.00 |
| 2020-04-15 08:10 | 2020-04-15 08:20 | 10.00 |
+------------------+------------------+-------+
-- hopping window aggregation
Flink SQL> SELECT window_start, window_end, SUM(price) AS price
FROM TABLE(
HOP(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '5' MINUTES, INTERVAL '10' MINUTES))
GROUP BY window_start, window_end;
+------------------+------------------+-------+
| window_start | window_end | price |
+------------------+------------------+-------+
| 2020-04-15 08:00 | 2020-04-15 08:10 | 11.00 |
| 2020-04-15 08:05 | 2020-04-15 08:15 | 15.00 |
| 2020-04-15 08:10 | 2020-04-15 08:20 | 10.00 |
| 2020-04-15 08:15 | 2020-04-15 08:25 | 6.00 |
+------------------+------------------+-------+
-- cumulative window aggregation
Flink SQL> SELECT window_start, window_end, SUM(price) AS price
FROM TABLE(
CUMULATE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '2' MINUTES, INTERVAL '10' MINUTES))
GROUP BY window_start, window_end;
+------------------+------------------+-------+
| window_start | window_end | price |
+------------------+------------------+-------+
| 2020-04-15 08:00 | 2020-04-15 08:06 | 4.00 |
| 2020-04-15 08:00 | 2020-04-15 08:08 | 6.00 |
| 2020-04-15 08:00 | 2020-04-15 08:10 | 11.00 |
| 2020-04-15 08:10 | 2020-04-15 08:12 | 3.00 |
| 2020-04-15 08:10 | 2020-04-15 08:14 | 4.00 |
| 2020-04-15 08:10 | 2020-04-15 08:16 | 4.00 |
| 2020-04-15 08:10 | 2020-04-15 08:18 | 10.00 |
| 2020-04-15 08:10 | 2020-04-15 08:20 | 10.00 |
+------------------+------------------+-------+
Note: To make the windowing behavior easier to follow, timestamp values are shown without trailing zeros. For example, if the type is TIMESTAMP(3), the Flink SQL Client displays 2020-04-15 08:05 as 2020-04-15 08:05:00.000.
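The sums in the three result tables above can be checked by hand with a small simulation. The following Python sketch is not Flink code: it only mimics how each window type assigns a row to windows, assuming windows are aligned to the Unix epoch and that the CUMULATE step divides the maximum window size. The helper names (`floor_to`, `tumble`, `hop`, `cumulate`, `window_sums`) are made up for this illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from decimal import Decimal

EPOCH = datetime(1970, 1, 1)
MINUTE = timedelta(minutes=1)

# (bidtime, price) pairs copied from the Bid table above.
BIDS = [
    (datetime(2020, 4, 15, 8, 5), Decimal("4.00")),
    (datetime(2020, 4, 15, 8, 7), Decimal("2.00")),
    (datetime(2020, 4, 15, 8, 9), Decimal("5.00")),
    (datetime(2020, 4, 15, 8, 11), Decimal("3.00")),
    (datetime(2020, 4, 15, 8, 13), Decimal("1.00")),
    (datetime(2020, 4, 15, 8, 17), Decimal("6.00")),
]


def floor_to(ts, step):
    """Round ts down to a multiple of step, counted from the epoch."""
    return ts - (ts - EPOCH) % step


def tumble(ts, size):
    """A row belongs to exactly one window [start, start + size)."""
    start = floor_to(ts, size)
    return [(start, start + size)]


def hop(ts, slide, size):
    """A row belongs to every window that starts on a slide boundary and covers ts."""
    windows = []
    start = floor_to(ts, slide)
    while start + size > ts:
        windows.append((start, start + size))
        start -= slide
    return windows


def cumulate(ts, step, max_size):
    """Windows share one start; their ends grow by step up to max_size."""
    start = floor_to(ts, max_size)
    end = floor_to(ts, step) + step  # first window end strictly after ts
    windows = []
    while end <= start + max_size:
        windows.append((start, end))
        end += step
    return windows


def window_sums(assign):
    """SUM(price) ... GROUP BY window_start, window_end for one assigner."""
    sums = defaultdict(Decimal)
    for ts, price in BIDS:
        for window in assign(ts):
            sums[window] += price
    return dict(sorted(sums.items()))


if __name__ == "__main__":
    for (start, end), total in window_sums(lambda ts: hop(ts, 5 * MINUTE, 10 * MINUTE)).items():
        print(start, end, total)
```

Passing `lambda ts: tumble(ts, 10 * MINUTE)` or `lambda ts: cumulate(ts, 2 * MINUTE, 10 * MINUTE)` to `window_sums` gives the other two tables.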
Window aggregations also support the GROUPING SETS syntax. Grouping sets allow for more complex grouping operations than those describable by a standard GROUP BY. Rows are grouped separately by each specified grouping set, and aggregates are computed for each group just as for simple GROUP BY clauses.
Window aggregations with GROUPING SETS require that both the window_start and window_end columns appear in the GROUP BY clause, but not in the GROUPING SETS clause.
Flink SQL> SELECT window_start, window_end, supplier_id, SUM(price) as price
FROM TABLE(
TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
GROUP BY window_start, window_end, GROUPING SETS ((supplier_id), ());
+------------------+------------------+-------------+-------+
| window_start | window_end | supplier_id | price |
+------------------+------------------+-------------+-------+
| 2020-04-15 08:00 | 2020-04-15 08:10 | (NULL) | 11.00 |
| 2020-04-15 08:00 | 2020-04-15 08:10 | supplier2 | 5.00 |
| 2020-04-15 08:00 | 2020-04-15 08:10 | supplier1 | 6.00 |
| 2020-04-15 08:10 | 2020-04-15 08:20 | (NULL) | 10.00 |
| 2020-04-15 08:10 | 2020-04-15 08:20 | supplier2 | 9.00 |
| 2020-04-15 08:10 | 2020-04-15 08:20 | supplier1 | 1.00 |
+------------------+------------------+-------------+-------+
Each sublist of GROUPING SETS may specify zero or more columns or expressions and is interpreted the same way as if it were used directly in the GROUP BY clause. An empty grouping set means that all rows are aggregated down to a single group, which is output even if no input rows were present.
In result rows, references to grouping columns or expressions are replaced by null values for the grouping sets in which those columns do not appear.
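To see where the (NULL) rows in the result above come from, the following Python sketch simulates the GROUPING SETS query over the same Bid rows. It is only an illustration, not Flink code: each row contributes once per grouping set, and any grouping column missing from the current set is replaced by `None`. The function `window_grouping_sets` and its parameters are hypothetical names, and windows are assumed to be aligned to the Unix epoch.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from decimal import Decimal

EPOCH = datetime(1970, 1, 1)
MINUTE = timedelta(minutes=1)


def bid(minute, price, item, supplier_id):
    """Build one row of the Bid table above (all bids are on 2020-04-15, 08:xx)."""
    return {
        "bidtime": datetime(2020, 4, 15, 8, minute),
        "price": Decimal(price),
        "item": item,
        "supplier_id": supplier_id,
    }


BIDS = [
    bid(5, "4.00", "C", "supplier1"),
    bid(7, "2.00", "A", "supplier1"),
    bid(9, "5.00", "D", "supplier2"),
    bid(11, "3.00", "B", "supplier2"),
    bid(13, "1.00", "E", "supplier1"),
    bid(17, "6.00", "F", "supplier2"),
]


def window_grouping_sets(rows, size, columns, grouping_sets):
    """SUM(price) per tumbling window and per grouping set.

    columns lists the grouping columns in output order; a column that is not
    part of the current grouping set shows up as None in the result key.
    """
    sums = defaultdict(Decimal)
    for row in rows:
        start = row["bidtime"] - (row["bidtime"] - EPOCH) % size
        for grouping_set in grouping_sets:
            key = tuple(row[c] if c in grouping_set else None for c in columns)
            sums[(start, start + size) + key] += row["price"]
    return dict(sums)


if __name__ == "__main__":
    result = window_grouping_sets(
        BIDS, 10 * MINUTE, ["supplier_id"], [("supplier_id",), ()]
    )
    for key, total in sorted(result.items(), key=lambda kv: (kv[0][0], str(kv[0][2]))):
        print(*key, total)
```

The window bounds are always part of the key, which mirrors the rule that window_start and window_end belong in the GROUP BY clause rather than inside GROUPING SETS.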
ROLLUP is a shorthand notation for specifying a common type of grouping set. It represents the given list of expressions and all prefixes of the list, including the empty list.
Window aggregations with ROLLUP require that both the window_start and window_end columns appear in the GROUP BY clause, but not in the ROLLUP clause.
For example, the following query is equivalent to the GROUPING SETS query above.
SELECT window_start, window_end, supplier_id, SUM(price) as price
FROM TABLE(
TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
GROUP BY window_start, window_end, ROLLUP (supplier_id);
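The "all prefixes" rule can be spelled out mechanically. The following Python sketch (the function name `rollup` is made up for illustration and is not part of any Flink API) expands a ROLLUP column list into the grouping sets it stands for.

```python
def rollup(columns):
    """Expand ROLLUP(columns) into its grouping sets: every prefix of the
    list, from the full list down to the empty list."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


if __name__ == "__main__":
    print(rollup(["supplier_id"]))
    print(rollup(["supplier_id", "item"]))
```

For the single column `supplier_id` this yields the two sets `(supplier_id)` and `()`, which is why the ROLLUP query matches the earlier GROUPING SETS query.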
CUBE is a shorthand notation for specifying a common type of grouping set. It represents the given list and all of its possible subsets, that is, the power set.
Window aggregations with CUBE require that both the window_start and window_end columns appear in the GROUP BY clause, but not in the CUBE clause.
For example, the following two queries are equivalent.
SELECT window_start, window_end, item, supplier_id, SUM(price) as price
FROM TABLE(
TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
GROUP BY window_start, window_end, CUBE (supplier_id, item);
SELECT window_start, window_end, item, supplier_id, SUM(price) as price
FROM TABLE(
TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
GROUP BY window_start, window_end, GROUPING SETS (
(supplier_id, item),
(supplier_id),
(item),
()
);
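The power-set expansion behind CUBE can be sketched in a few lines of Python. The function name `cube` is an illustrative choice, not a Flink API; it lists every subset of the given columns, largest first.

```python
from itertools import combinations


def cube(columns):
    """Expand CUBE(columns) into its grouping sets: every subset of the
    column list, from the full list down to the empty set."""
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]


if __name__ == "__main__":
    for grouping_set in cube(["supplier_id", "item"]):
        print(grouping_set)
```

With two columns the result has four grouping sets, matching the explicit GROUPING SETS list in the second query. In general, n columns produce 2^n sets, so CUBE over many columns gets expensive quickly.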
The start and end timestamps of group windows can be selected with the grouped window_start and window_end columns.
The window_start and window_end columns are regular timestamp columns, not time attributes, so they can't be used as time attributes in subsequent time-based operations. To propagate a time attribute, additionally add the window_time column to the GROUP BY clause. window_time is the third column produced by windowing TVFs and is a time attribute of the assigned window. Adding window_time to the GROUP BY clause makes it a group key that can be selected. Subsequent queries can then use this column for time-based operations, such as cascading window aggregations and Window TopN.
The following shows a cascading window aggregation in which the first window aggregation propagates the time attribute to the second window aggregation.
-- tumbling 5 minutes for each supplier_id
CREATE VIEW window1 AS
-- Note: The window start and window end fields of inner Window TVF are optional in the select clause. However, if they appear in the clause, they need to be aliased to prevent name conflicting with the window start and window end of the outer Window TVF.
SELECT window_start as window_5mintumble_start, window_end as window_5mintumble_end, window_time as rowtime, SUM(price) as partial_price
FROM TABLE(
TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '5' MINUTES))
GROUP BY supplier_id, window_start, window_end, window_time;
-- tumbling 10 minutes on the first window
SELECT window_start, window_end, SUM(partial_price) as total_price
FROM TABLE(
TUMBLE(TABLE window1, DESCRIPTOR(rowtime), INTERVAL '10' MINUTES))
GROUP BY window_start, window_end;
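The cascade can be traced on the sample Bid rows with a rough Python simulation. It is not Flink code. It assumes epoch-aligned windows and that window_time equals window_end minus one millisecond, which is how the windowing TVFs define it for TIMESTAMP(3). The names `first_stage` and `second_stage` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from decimal import Decimal

EPOCH = datetime(1970, 1, 1)
MINUTE = timedelta(minutes=1)

# (bidtime, price, supplier_id) copied from the Bid table above.
BIDS = [
    (datetime(2020, 4, 15, 8, 5), Decimal("4.00"), "supplier1"),
    (datetime(2020, 4, 15, 8, 7), Decimal("2.00"), "supplier1"),
    (datetime(2020, 4, 15, 8, 9), Decimal("5.00"), "supplier2"),
    (datetime(2020, 4, 15, 8, 11), Decimal("3.00"), "supplier2"),
    (datetime(2020, 4, 15, 8, 13), Decimal("1.00"), "supplier1"),
    (datetime(2020, 4, 15, 8, 17), Decimal("6.00"), "supplier2"),
]


def tumble_start(ts, size):
    return ts - (ts - EPOCH) % size


def first_stage(bids, size):
    """View window1: partial sums per supplier and 5-minute window.

    Each output row carries rowtime = window_end - 1 ms (assumed window_time).
    """
    partial = defaultdict(Decimal)
    for ts, price, supplier in bids:
        window_end = tumble_start(ts, size) + size
        rowtime = window_end - timedelta(milliseconds=1)
        partial[(supplier, rowtime)] += price
    return [(rowtime, total) for (_, rowtime), total in partial.items()]


def second_stage(partials, size):
    """Outer query: tumble over rowtime and add up the partial sums."""
    totals = defaultdict(Decimal)
    for rowtime, partial_price in partials:
        start = tumble_start(rowtime, size)
        totals[(start, start + size)] += partial_price
    return dict(totals)


if __name__ == "__main__":
    result = second_stage(first_stage(BIDS, 5 * MINUTE), 10 * MINUTE)
    for (start, end), total in sorted(result.items()):
        print(start, end, total)
```

Because rowtime lies just inside the 5-minute window, each partial sum falls into the 10-minute window that contains its source rows, and the totals equal those of a direct 10-minute tumbling aggregation (11.00 and 10.00).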
Batch Streaming
Warning: Group Window Aggregation is deprecated. Use Window TVF Aggregation instead, which is more powerful and effective.
Compared to Group Window Aggregation, Window TVF Aggregation has several advantages, including support for the standard GROUPING SETS syntax and the ability to apply Window TopN to the result of a window aggregation.
Group Window Aggregations are defined in the GROUP BY clause of a SQL query. Just like queries with regular GROUP BY clauses, queries with a GROUP BY clause that includes a group window function compute a single result row per group. The following group window functions are supported for SQL on batch and streaming tables.
Group Window Function | Description |
---|---|
TUMBLE(time_attr, interval) | Defines a tumbling time window. A tumbling time window assigns rows to non-overlapping, continuous windows with a fixed duration (interval). For example, a tumbling window of 5 minutes groups rows in 5-minute intervals. Tumbling windows can be defined on event-time (stream + batch) or processing-time (stream). |
HOP(time_attr, interval, interval) | Defines a hopping time window (called sliding window in the Table API). A hopping time window has a fixed duration (second interval parameter) and hops by a specified hop interval (first interval parameter). If the hop interval is smaller than the window size, hopping windows overlap, so rows can be assigned to multiple windows. For example, a hopping window with a 15-minute size and a 5-minute hop interval assigns each row to 3 different windows of 15-minute size, which are evaluated every 5 minutes. Hopping windows can be defined on event-time (stream + batch) or processing-time (stream). |
SESSION(time_attr, interval) | Defines a session time window. Session time windows do not have a fixed duration; their bounds are defined by a time interval of inactivity, i.e., a session window is closed if no event appears for a defined gap period. For example, a session window with a 30-minute gap starts when a row is observed after 30 minutes of inactivity (otherwise the row would be added to an existing window) and is closed if no row is added within 30 minutes. Session windows can work on event-time (stream + batch) or processing-time (stream). |
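Session windows are the only group window type whose bounds depend on the data, so a short simulation may help. The following Python sketch groups timestamps into sessions for a given gap. It is an illustration only: `session_windows` is a made-up helper, and the treatment of a row that arrives exactly one gap after the previous row (here it opens a new session) is an assumption rather than documented Flink behavior.

```python
from datetime import datetime, timedelta


def session_windows(timestamps, gap):
    """Group timestamps into sessions separated by inactivity of at least gap.

    Returns (start, end) pairs, where end is the last timestamp plus gap.
    """
    windows = []
    for ts in sorted(timestamps):
        if windows and ts < windows[-1][1]:
            # Row arrived before the open session timed out: extend it.
            windows[-1][1] = ts + gap
        else:
            # No open session covers this row: start a new one.
            windows.append([ts, ts + gap])
    return [tuple(window) for window in windows]


if __name__ == "__main__":
    # Hypothetical event times and a hypothetical 3-minute gap.
    times = [datetime(2020, 4, 15, 8, m) for m in (5, 7, 9, 11, 13, 17)]
    for start, end in session_windows(times, timedelta(minutes=3)):
        print(start, end)
```

With these sample times, the first five rows are each within 3 minutes of the previous one and form a single session, while the row at 08:17 arrives 4 minutes after 08:13 and starts a second session.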
In streaming mode, the time_attr argument of the group window function must refer to a valid time attribute that specifies the processing time or event time of rows. See the documentation of time attributes to learn how to define time attributes.
In batch mode, the time_attr argument of the group window function must be an attribute of type TIMESTAMP.
The start and end timestamps of group windows, as well as time attributes, can be selected with the following auxiliary functions:
Auxiliary Function | Description |
---|---|
TUMBLE_START(time_attr, interval); HOP_START(time_attr, interval, interval); SESSION_START(time_attr, interval) | Returns the timestamp of the inclusive lower bound of the corresponding tumbling, hopping, or session window. |
TUMBLE_END(time_attr, interval); HOP_END(time_attr, interval, interval); SESSION_END(time_attr, interval) | Returns the timestamp of the exclusive upper bound of the corresponding tumbling, hopping, or session window. Note: The exclusive upper bound timestamp cannot be used as a rowtime attribute in subsequent time-based operations, such as interval joins and group window or over window aggregations. |
TUMBLE_ROWTIME(time_attr, interval); HOP_ROWTIME(time_attr, interval, interval); SESSION_ROWTIME(time_attr, interval) | Returns the timestamp of the inclusive upper bound of the corresponding tumbling, hopping, or session window. The resulting attribute is a rowtime attribute that can be used in subsequent time-based operations such as interval joins and group window or over window aggregations. |
TUMBLE_PROCTIME(time_attr, interval); HOP_PROCTIME(time_attr, interval, interval); SESSION_PROCTIME(time_attr, interval) | Returns a proctime attribute that can be used in subsequent time-based operations such as interval joins and group window or over window aggregations. |
Note: Auxiliary functions must be called with exactly the same arguments as the group window function in the GROUP BY clause.
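The difference between the inclusive lower bound, the exclusive upper bound, and the inclusive upper bound is easiest to see with concrete values. The following Python sketch computes all three for a tumbling window. It is not Flink code: `tumble_bounds` is a hypothetical helper, windows are assumed to be aligned to the Unix epoch in UTC, and the inclusive upper bound is assumed to be one millisecond before the window end, matching TIMESTAMP(3) precision.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def tumble_bounds(ts, size):
    """Return (start, end, rowtime) of the tumbling window containing ts.

    start   -> inclusive lower bound (what TUMBLE_START describes)
    end     -> exclusive upper bound (what TUMBLE_END describes)
    rowtime -> inclusive upper bound (what TUMBLE_ROWTIME describes),
               assumed here to be end minus one millisecond
    """
    start = ts - (ts - EPOCH) % size
    end = start + size
    rowtime = end - timedelta(milliseconds=1)
    return start, end, rowtime


if __name__ == "__main__":
    start, end, rowtime = tumble_bounds(datetime(2020, 4, 15, 8, 5), timedelta(days=1))
    print(start)    # 2020-04-15 00:00:00
    print(end)      # 2020-04-16 00:00:00
    print(rowtime)  # 2020-04-15 23:59:59.999000
```

The end timestamp already belongs to the next window, which is consistent with the note that it cannot serve as a rowtime attribute, whereas the rowtime value still lies inside the window it describes.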
The following examples show how to specify SQL queries with group windows on streaming tables.
CREATE TABLE Orders (
user BIGINT,
product STRING,
amount INT,
order_time TIMESTAMP(3),
WATERMARK FOR order_time AS order_time - INTERVAL '1' MINUTE
) WITH (...);
SELECT
user,
TUMBLE_START(order_time, INTERVAL '1' DAY) AS wStart,
SUM(amount)
FROM Orders
GROUP BY
TUMBLE(order_time, INTERVAL '1' DAY),
user