Flink SQL: Queries (Windowing TVF)

Windowing table-valued functions (Windowing TVFs)

Batch Streaming

Windows are at the heart of processing infinite streams. Windows split the stream into “buckets” of finite size, over which we can apply computations. This document focuses on how windowing is performed in Flink SQL and how programmers can get the most out of the functionality it offers.

Apache Flink provides several windowing table-valued functions (TVFs) to divide the elements of your table into windows, including:

  • Tumble Windows
  • Hop Windows
  • Cumulate Windows
  • Session Windows (will be supported soon)

Note that each element can logically belong to more than one window, depending on the windowing table-valued function you use. For example, HOP windowing creates overlapping windows in which a single element can be assigned to multiple windows.

Windowing TVFs are Flink-defined Polymorphic Table Functions (PTFs). PTFs are part of the SQL 2016 standard: a PTF is a special kind of table function that can take a table as a parameter. PTFs are a powerful feature for changing the shape of a table. Because PTFs are used semantically like tables, they are invoked in the FROM clause of a SELECT statement.

Windowing TVFs replace the legacy Grouped Window Functions. Windowing TVFs are more compliant with the SQL standard and more powerful: they support complex window-based computations such as Window TopN and Window Join, whereas Grouped Window Functions support only Window Aggregation.

See the following topics for how to apply further computations on top of windowing TVFs:

  • Window Aggregation
  • Window TopN
  • Window Join
  • Window Deduplication

Window Functions

Apache Flink provides three built-in windowing TVFs: TUMBLE, HOP and CUMULATE. The return value of a windowing TVF is a new relation that includes all columns of the original relation plus three additional columns named “window_start”, “window_end” and “window_time” that indicate the assigned window. In streaming mode, the “window_time” field is a time attribute of the window. In batch mode, the “window_time” field is an attribute of type TIMESTAMP or TIMESTAMP_LTZ, depending on the type of the input time field. The “window_time” field can be used in subsequent time-based operations, e.g. another windowing TVF, an interval join, or an over aggregation. The value of window_time is always equal to window_end - 1ms.

TUMBLE

The TUMBLE function assigns each element to a window of a specified size. Tumbling windows have a fixed size and do not overlap. For example, if you specify a tumbling window with a size of 5 minutes, Flink evaluates the current window and starts a new window every five minutes, as illustrated by the following figure.

[Figure 1: tumbling windows]

The TUMBLE function assigns a window to each row of a relation based on a time attribute field. In streaming mode, the time attribute field must be either an event-time or a processing-time attribute. In batch mode, the time attribute field of the window table function must be of type TIMESTAMP or TIMESTAMP_LTZ. The return value of TUMBLE is a new relation that includes all columns of the original relation plus three additional columns named “window_start”, “window_end” and “window_time” that indicate the assigned window. The original time attribute “timecol” becomes a regular timestamp column after the windowing TVF.

The TUMBLE function takes three required parameters and one optional parameter:

TUMBLE(TABLE data, DESCRIPTOR(timecol), size [, offset ])
  • data: a table parameter that can be any relation with a time attribute column.
  • timecol: a column descriptor indicating which time attribute column of data should be mapped to tumbling windows.
  • size: a duration specifying the width of the tumbling windows.
  • offset: an optional parameter specifying the offset by which the window start is shifted.
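
The assignment rule described by these parameters can be sketched outside Flink. The Python snippet below is only an illustration, not Flink's implementation: it assumes that windows are aligned to the Unix epoch shifted by offset, which is consistent with the outputs shown in this document, and the name tumble_window is made up for this example.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumed alignment origin for all windows


def tumble_window(ts, size, offset=timedelta(0)):
    """Return (window_start, window_end, window_time) for one timestamp."""
    # Distance between ts and the start of the window that contains it.
    remainder = (ts - EPOCH - offset) % size
    window_start = ts - remainder
    window_end = window_start + size
    # window_time is defined as window_end - 1ms.
    window_time = window_end - timedelta(milliseconds=1)
    return window_start, window_end, window_time


start, end, wtime = tumble_window(datetime(2020, 4, 15, 8, 5), timedelta(minutes=10))
print(start, end, wtime)
```

Each timestamp maps to exactly one window, because the windows form a non-overlapping grid of width size.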

Here is an example invocation on the Bid table:

-- tables must have time attribute, e.g. `bidtime` in this table
Flink SQL> desc Bid;
+-------------+------------------------+------+-----+--------+---------------------------------+
|        name |                   type | null | key | extras |                       watermark |
+-------------+------------------------+------+-----+--------+---------------------------------+
|     bidtime | TIMESTAMP(3) *ROWTIME* | true |     |        | `bidtime` - INTERVAL '1' SECOND |
|       price |         DECIMAL(10, 2) | true |     |        |                                 |
|        item |                 STRING | true |     |        |                                 |
+-------------+------------------------+------+-----+--------+---------------------------------+

Flink SQL> SELECT * FROM Bid;
+------------------+-------+------+
|          bidtime | price | item |
+------------------+-------+------+
| 2020-04-15 08:05 |  4.00 | C    |
| 2020-04-15 08:07 |  2.00 | A    |
| 2020-04-15 08:09 |  5.00 | D    |
| 2020-04-15 08:11 |  3.00 | B    |
| 2020-04-15 08:13 |  1.00 | E    |
| 2020-04-15 08:17 |  6.00 | F    |
+------------------+-------+------+

Flink SQL> SELECT * FROM TABLE(
   TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));
-- or with the named params
-- note: the DATA param must be the first
Flink SQL> SELECT * FROM TABLE(
   TUMBLE(
     DATA => TABLE Bid,
     TIMECOL => DESCRIPTOR(bidtime),
     SIZE => INTERVAL '10' MINUTES));
+------------------+-------+------+------------------+------------------+-------------------------+
|          bidtime | price | item |     window_start |       window_end |            window_time  |
+------------------+-------+------+------------------+------------------+-------------------------+
| 2020-04-15 08:05 |  4.00 | C    | 2020-04-15 08:00 | 2020-04-15 08:10 | 2020-04-15 08:09:59.999 |
| 2020-04-15 08:07 |  2.00 | A    | 2020-04-15 08:00 | 2020-04-15 08:10 | 2020-04-15 08:09:59.999 |
| 2020-04-15 08:09 |  5.00 | D    | 2020-04-15 08:00 | 2020-04-15 08:10 | 2020-04-15 08:09:59.999 |
| 2020-04-15 08:11 |  3.00 | B    | 2020-04-15 08:10 | 2020-04-15 08:20 | 2020-04-15 08:19:59.999 |
| 2020-04-15 08:13 |  1.00 | E    | 2020-04-15 08:10 | 2020-04-15 08:20 | 2020-04-15 08:19:59.999 |
| 2020-04-15 08:17 |  6.00 | F    | 2020-04-15 08:10 | 2020-04-15 08:20 | 2020-04-15 08:19:59.999 |
+------------------+-------+------+------------------+------------------+-------------------------+

-- apply aggregation on the tumbling windowed table
Flink SQL> SELECT window_start, window_end, SUM(price) AS price
  FROM TABLE(
    TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
  GROUP BY window_start, window_end;
+------------------+------------------+-------+
|     window_start |       window_end | price |
+------------------+------------------+-------+
| 2020-04-15 08:00 | 2020-04-15 08:10 | 11.00 |
| 2020-04-15 08:10 | 2020-04-15 08:20 | 10.00 |
+------------------+------------------+-------+

Note: to make the windowing behavior easier to follow, timestamp values are displayed here without trailing zeros. For example, if the type is TIMESTAMP(3), the value shown as 2020-04-15 08:05 is displayed as 2020-04-15 08:05:00.000 in the Flink SQL Client.

HOP

The HOP function assigns elements to windows of fixed length. As with the TUMBLE windowing function, the size of the windows is configured by the window size parameter. An additional window slide parameter controls how frequently a hopping window is started. Hence, hopping windows overlap if the slide is smaller than the window size, in which case elements are assigned to multiple windows. Hopping windows are also known as “sliding windows”.

For example, you could have windows of size 10 minutes that slide by 5 minutes. With this, every 5 minutes you get a window that contains the events that arrived during the last 10 minutes, as depicted by the following figure.

[Figure 2: hopping windows]

The HOP function assigns windows that cover rows within an interval of length size and that shift by slide, based on a time attribute field. In streaming mode, the time attribute field must be either an event-time or a processing-time attribute. In batch mode, the time attribute field of the window table function must be of type TIMESTAMP or TIMESTAMP_LTZ. The return value of HOP is a new relation that includes all columns of the original relation plus three additional columns named “window_start”, “window_end” and “window_time” that indicate the assigned window. The original time attribute “timecol” becomes a regular timestamp column after the windowing TVF.

HOP takes four required parameters and one optional parameter:

HOP(TABLE data, DESCRIPTOR(timecol), slide, size [, offset ])
  • data: a table parameter that can be any relation with a time attribute column.
  • timecol: a column descriptor indicating which time attribute column of data should be mapped to hopping windows.
  • slide: a duration specifying the interval between the starts of consecutive hopping windows.
  • size: a duration specifying the width of the hopping windows.
  • offset: an optional parameter specifying the offset by which the window start is shifted.
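
To make the overlap concrete, here is a minimal Python sketch of the assignment that slide and size imply. It is an illustration under assumptions, not Flink's code: window starts are assumed to lie on an epoch-aligned grid of width slide (shifted by offset), and hop_windows is a hypothetical helper name.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumed alignment origin for all windows


def hop_windows(ts, slide, size, offset=timedelta(0)):
    """Return every (window_start, window_end) pair that contains ts."""
    # Start of the latest window that begins at or before ts.
    start = ts - (ts - EPOCH - offset) % slide
    windows = []
    # Walk backwards one slide at a time while the window still covers ts.
    while start + size > ts:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


for window in hop_windows(datetime(2020, 4, 15, 8, 7),
                          timedelta(minutes=5), timedelta(minutes=10)):
    print(window[0], window[1])
```

With a slide of 5 minutes and a size of 10 minutes, every timestamp falls into size / slide = 2 windows, which is why each input row appears twice in the output of a HOP query with these settings.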

Here is an example invocation on the Bid table:

> SELECT * FROM TABLE(
    HOP(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '5' MINUTES, INTERVAL '10' MINUTES));
-- or with the named params
-- note: the DATA param must be the first
> SELECT * FROM TABLE(
    HOP(
      DATA => TABLE Bid,
      TIMECOL => DESCRIPTOR(bidtime),
      SLIDE => INTERVAL '5' MINUTES,
      SIZE => INTERVAL '10' MINUTES));
+------------------+-------+------+------------------+------------------+-------------------------+
|          bidtime | price | item |     window_start |       window_end |           window_time   |
+------------------+-------+------+------------------+------------------+-------------------------+
| 2020-04-15 08:05 |  4.00 | C    | 2020-04-15 08:00 | 2020-04-15 08:10 | 2020-04-15 08:09:59.999 |
| 2020-04-15 08:05 |  4.00 | C    | 2020-04-15 08:05 | 2020-04-15 08:15 | 2020-04-15 08:14:59.999 |
| 2020-04-15 08:07 |  2.00 | A    | 2020-04-15 08:00 | 2020-04-15 08:10 | 2020-04-15 08:09:59.999 |
| 2020-04-15 08:07 |  2.00 | A    | 2020-04-15 08:05 | 2020-04-15 08:15 | 2020-04-15 08:14:59.999 |
| 2020-04-15 08:09 |  5.00 | D    | 2020-04-15 08:00 | 2020-04-15 08:10 | 2020-04-15 08:09:59.999 |
| 2020-04-15 08:09 |  5.00 | D    | 2020-04-15 08:05 | 2020-04-15 08:15 | 2020-04-15 08:14:59.999 |
| 2020-04-15 08:11 |  3.00 | B    | 2020-04-15 08:05 | 2020-04-15 08:15 | 2020-04-15 08:14:59.999 |
| 2020-04-15 08:11 |  3.00 | B    | 2020-04-15 08:10 | 2020-04-15 08:20 | 2020-04-15 08:19:59.999 |
| 2020-04-15 08:13 |  1.00 | E    | 2020-04-15 08:05 | 2020-04-15 08:15 | 2020-04-15 08:14:59.999 |
| 2020-04-15 08:13 |  1.00 | E    | 2020-04-15 08:10 | 2020-04-15 08:20 | 2020-04-15 08:19:59.999 |
| 2020-04-15 08:17 |  6.00 | F    | 2020-04-15 08:10 | 2020-04-15 08:20 | 2020-04-15 08:19:59.999 |
| 2020-04-15 08:17 |  6.00 | F    | 2020-04-15 08:15 | 2020-04-15 08:25 | 2020-04-15 08:24:59.999 |
+------------------+-------+------+------------------+------------------+-------------------------+

-- apply aggregation on the hopping windowed table
> SELECT window_start, window_end, SUM(price) AS price
  FROM TABLE(
    HOP(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '5' MINUTES, INTERVAL '10' MINUTES))
  GROUP BY window_start, window_end;
+------------------+------------------+-------+
|     window_start |       window_end | price |
+------------------+------------------+-------+
| 2020-04-15 08:00 | 2020-04-15 08:10 | 11.00 |
| 2020-04-15 08:05 | 2020-04-15 08:15 | 15.00 |
| 2020-04-15 08:10 | 2020-04-15 08:20 | 10.00 |
| 2020-04-15 08:15 | 2020-04-15 08:25 |  6.00 |
+------------------+------------------+-------+

CUMULATE

Cumulating windows are very useful in some scenarios, such as tumbling windows with early firing at a fixed interval. For example, a daily dashboard draws cumulative UVs (unique visitors) from 00:00 up to every minute; the UV at 10:00 represents the total number of UVs from 00:00 to 10:00. This can be implemented easily and efficiently with CUMULATE windowing.

The CUMULATE function assigns elements to windows that initially cover an interval of one step size and then expand by one more step size at every step (keeping the window start fixed) until they reach the max window size. You can think of the CUMULATE function as first applying TUMBLE windowing with the max window size and then splitting each tumbling window into several windows that share the same window start and whose window ends differ by one step size. So cumulating windows do overlap and don't have a fixed size.

For example, with a cumulating window that has a 1-hour step and a 1-day max size, you get the following windows for every day: [00:00, 01:00), [00:00, 02:00), [00:00, 03:00), …, [00:00, 24:00).

[Figure 3: cumulating windows]

The CUMULATE function assigns windows based on a time attribute column. In streaming mode, the time attribute field must be either an event-time or a processing-time attribute. In batch mode, the time attribute field of the window table function must be of type TIMESTAMP or TIMESTAMP_LTZ. The return value of CUMULATE is a new relation that includes all columns of the original relation plus three additional columns named “window_start”, “window_end” and “window_time” that indicate the assigned window. The original time attribute “timecol” becomes a regular timestamp column after the windowing TVF.

CUMULATE takes four required parameters and one optional parameter:

CUMULATE(TABLE data, DESCRIPTOR(timecol), step, size [, offset ])
  • data: a table parameter that can be any relation with a time attribute column.
  • timecol: a column descriptor indicating which time attribute column of data should be mapped to cumulating windows.
  • step: a duration specifying by how much the window size increases between the ends of consecutive cumulating windows.
  • size: a duration specifying the max width of the cumulating windows. size must be an integral multiple of step.
  • offset: an optional parameter specifying the offset by which the window start is shifted.
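
The "tumble first, then split by step" view of CUMULATE can be written down as a short Python sketch. This is an illustration only, not Flink's implementation: it assumes the enclosing max-size window is aligned to the Unix epoch shifted by offset, and cumulate_windows is a hypothetical helper name.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumed alignment origin for all windows


def cumulate_windows(ts, step, size, offset=timedelta(0)):
    """Return every cumulating (window_start, window_end) that contains ts."""
    if size % step != timedelta(0):
        raise ValueError("size must be an integral multiple of step")
    # All cumulating windows share the start of the enclosing max-size window.
    window_start = ts - (ts - EPOCH - offset) % size
    windows = []
    end = window_start + step
    while end <= window_start + size:
        # Windows are half-open, so ts belongs only to windows ending after it.
        if end > ts:
            windows.append((window_start, end))
        end += step
    return windows


for window in cumulate_windows(datetime(2020, 4, 15, 8, 5),
                               timedelta(minutes=2), timedelta(minutes=10)):
    print(window[0], window[1])
```

The later a timestamp falls inside its max-size window, the fewer cumulating windows contain it, so rows near the start of the interval are repeated most often in the output.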

Here is an example invocation on the Bid table:

> SELECT * FROM TABLE(
    CUMULATE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '2' MINUTES, INTERVAL '10' MINUTES));
-- or with the named params
-- note: the DATA param must be the first
> SELECT * FROM TABLE(
    CUMULATE(
      DATA => TABLE Bid,
      TIMECOL => DESCRIPTOR(bidtime),
      STEP => INTERVAL '2' MINUTES,
      SIZE => INTERVAL '10' MINUTES));
+------------------+-------+------+------------------+------------------+-------------------------+
|          bidtime | price | item |     window_start |       window_end |            window_time  |
+------------------+-------+------+------------------+------------------+-------------------------+
| 2020-04-15 08:05 |  4.00 | C    | 2020-04-15 08:00 | 2020-04-15 08:06 | 2020-04-15 08:05:59.999 |
| 2020-04-15 08:05 |  4.00 | C    | 2020-04-15 08:00 | 2020-04-15 08:08 | 2020-04-15 08:07:59.999 |
| 2020-04-15 08:05 |  4.00 | C    | 2020-04-15 08:00 | 2020-04-15 08:10 | 2020-04-15 08:09:59.999 |
| 2020-04-15 08:07 |  2.00 | A    | 2020-04-15 08:00 | 2020-04-15 08:08 | 2020-04-15 08:07:59.999 |
| 2020-04-15 08:07 |  2.00 | A    | 2020-04-15 08:00 | 2020-04-15 08:10 | 2020-04-15 08:09:59.999 |
| 2020-04-15 08:09 |  5.00 | D    | 2020-04-15 08:00 | 2020-04-15 08:10 | 2020-04-15 08:09:59.999 |
| 2020-04-15 08:11 |  3.00 | B    | 2020-04-15 08:10 | 2020-04-15 08:12 | 2020-04-15 08:11:59.999 |
| 2020-04-15 08:11 |  3.00 | B    | 2020-04-15 08:10 | 2020-04-15 08:14 | 2020-04-15 08:13:59.999 |
| 2020-04-15 08:11 |  3.00 | B    | 2020-04-15 08:10 | 2020-04-15 08:16 | 2020-04-15 08:15:59.999 |
| 2020-04-15 08:11 |  3.00 | B    | 2020-04-15 08:10 | 2020-04-15 08:18 | 2020-04-15 08:17:59.999 |
| 2020-04-15 08:11 |  3.00 | B    | 2020-04-15 08:10 | 2020-04-15 08:20 | 2020-04-15 08:19:59.999 |
| 2020-04-15 08:13 |  1.00 | E    | 2020-04-15 08:10 | 2020-04-15 08:14 | 2020-04-15 08:13:59.999 |
| 2020-04-15 08:13 |  1.00 | E    | 2020-04-15 08:10 | 2020-04-15 08:16 | 2020-04-15 08:15:59.999 |
| 2020-04-15 08:13 |  1.00 | E    | 2020-04-15 08:10 | 2020-04-15 08:18 | 2020-04-15 08:17:59.999 |
| 2020-04-15 08:13 |  1.00 | E    | 2020-04-15 08:10 | 2020-04-15 08:20 | 2020-04-15 08:19:59.999 |
| 2020-04-15 08:17 |  6.00 | F    | 2020-04-15 08:10 | 2020-04-15 08:18 | 2020-04-15 08:17:59.999 |
| 2020-04-15 08:17 |  6.00 | F    | 2020-04-15 08:10 | 2020-04-15 08:20 | 2020-04-15 08:19:59.999 |
+------------------+-------+------+------------------+------------------+-------------------------+

-- apply aggregation on the cumulating windowed table
> SELECT window_start, window_end, SUM(price) AS price
  FROM TABLE(
    CUMULATE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '2' MINUTES, INTERVAL '10' MINUTES))
  GROUP BY window_start, window_end;
+------------------+------------------+-------+
|     window_start |       window_end | price |
+------------------+------------------+-------+
| 2020-04-15 08:00 | 2020-04-15 08:06 |  4.00 |
| 2020-04-15 08:00 | 2020-04-15 08:08 |  6.00 |
| 2020-04-15 08:00 | 2020-04-15 08:10 | 11.00 |
| 2020-04-15 08:10 | 2020-04-15 08:12 |  3.00 |
| 2020-04-15 08:10 | 2020-04-15 08:14 |  4.00 |
| 2020-04-15 08:10 | 2020-04-15 08:16 |  4.00 |
| 2020-04-15 08:10 | 2020-04-15 08:18 | 10.00 |
| 2020-04-15 08:10 | 2020-04-15 08:20 | 10.00 |
+------------------+------------------+-------+

Window Offset

Offset is an optional parameter that can be used to change the window assignment. It can be a positive or a negative duration. The default window offset is 0. The same record may be assigned to a different window if a different offset value is set.
For example, to which window is a record with timestamp 2021-06-30 00:00:04 assigned for a Tumble window with a size of 10 MINUTE?

  • If offset is -16 MINUTE, the record is assigned to window [2021-06-29 23:54:00, 2021-06-30 00:04:00).
  • If offset is -6 MINUTE, the record is assigned to window [2021-06-29 23:54:00, 2021-06-30 00:04:00).
  • If offset is -4 MINUTE, the record is assigned to window [2021-06-29 23:56:00, 2021-06-30 00:06:00).
  • If offset is 0, the record is assigned to window [2021-06-30 00:00:00, 2021-06-30 00:10:00).
  • If offset is 4 MINUTE, the record is assigned to window [2021-06-29 23:54:00, 2021-06-30 00:04:00).
  • If offset is 6 MINUTE, the record is assigned to window [2021-06-29 23:56:00, 2021-06-30 00:06:00).
  • If offset is 16 MINUTE, the record is assigned to window [2021-06-29 23:56:00, 2021-06-30 00:06:00).

As this shows, different window offset values can have the same effect on window assignment. In the case above, -16 MINUTE, -6 MINUTE and 4 MINUTE have the same effect for a Tumble window with a size of 10 MINUTE.
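
The offset cases listed above can be checked with a few lines of Python. This is a sketch under the assumption that tumbling windows are aligned to the Unix epoch shifted by offset; tumble_start is a hypothetical helper, not a Flink API. Because only the offset modulo the window size matters, offsets that differ by a multiple of 10 minutes give the same window.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumed alignment origin for all windows
SIZE = timedelta(minutes=10)
record = datetime(2021, 6, 30, 0, 0, 4)


def tumble_start(ts, size, offset):
    """Start of the tumbling window containing ts for the given offset."""
    # Python's % on timedelta is a floor modulo, so negative offsets work too.
    return ts - (ts - EPOCH - offset) % size


for minutes in (-16, -6, -4, 0, 4, 6, 16):
    start = tumble_start(record, SIZE, timedelta(minutes=minutes))
    print(f"offset {minutes:>3} MINUTE -> [{start}, {start + SIZE})")
```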

Note: The window offset only affects window assignment; it has no effect on watermarks.

The following SQL shows how to use offset with a Tumble window.

-- NOTE: Currently Flink doesn't support evaluating individual window table-valued function,
--  window table-valued function should be used with aggregate operation,
--  this example is just used for explaining the syntax and the data produced by table-valued function.
Flink SQL> SELECT * FROM TABLE(
   TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES, INTERVAL '1' MINUTES));
-- or with the named params
-- note: the DATA param must be the first
Flink SQL> SELECT * FROM TABLE(
   TUMBLE(
     DATA => TABLE Bid,
     TIMECOL => DESCRIPTOR(bidtime),
     SIZE => INTERVAL '10' MINUTES,
     OFFSET => INTERVAL '1' MINUTES));
+------------------+-------+------+------------------+------------------+-------------------------+
|          bidtime | price | item |     window_start |       window_end |            window_time  |
+------------------+-------+------+------------------+------------------+-------------------------+
| 2020-04-15 08:05 |  4.00 | C    | 2020-04-15 08:01 | 2020-04-15 08:11 | 2020-04-15 08:10:59.999 |
| 2020-04-15 08:07 |  2.00 | A    | 2020-04-15 08:01 | 2020-04-15 08:11 | 2020-04-15 08:10:59.999 |
| 2020-04-15 08:09 |  5.00 | D    | 2020-04-15 08:01 | 2020-04-15 08:11 | 2020-04-15 08:10:59.999 |
| 2020-04-15 08:11 |  3.00 | B    | 2020-04-15 08:11 | 2020-04-15 08:21 | 2020-04-15 08:20:59.999 |
| 2020-04-15 08:13 |  1.00 | E    | 2020-04-15 08:11 | 2020-04-15 08:21 | 2020-04-15 08:20:59.999 |
| 2020-04-15 08:17 |  6.00 | F    | 2020-04-15 08:11 | 2020-04-15 08:21 | 2020-04-15 08:20:59.999 |
+------------------+-------+------+------------------+------------------+-------------------------+

-- apply aggregation on the tumbling windowed table
Flink SQL> SELECT window_start, window_end, SUM(price) AS price
  FROM TABLE(
    TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES, INTERVAL '1' MINUTES))
  GROUP BY window_start, window_end;
+------------------+------------------+-------+
|     window_start |       window_end | price |
+------------------+------------------+-------+
| 2020-04-15 08:01 | 2020-04-15 08:11 | 11.00 |
| 2020-04-15 08:11 | 2020-04-15 08:21 | 10.00 |
+------------------+------------------+-------+

