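This article translates and summarizes the new features of Flink 1.7 (see the references for the original sources) and adds a few small limitations confirmed by demo experiments.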
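Temporal Tables: a view of a table's contents at a specific point in time.
Versions are tracked per primary key, much like a Map.
For a given input time, the version that was current at that time is returned, i.e. the value is the most recent one at or before that time. When the time attribute is event-time, all versions from the last watermark up to the present are retained.
Set a time attribute on the input table; the time argument passed in decides which version of the table is returned, and versions are tracked by time.
Specify the primary key by which the temporal table is updated.
Create a temporal table function on the input table, specifying its time attribute and primary key, then register it.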
SELECT * FROM Orders;
o_proctime amount currency
========== ====== =========
10:15 2 Euro
10:30 1 US Dollar
10:32 50 Yen
10:52 3 Euro
11:04 5 US Dollar
SELECT * FROM RatesHistory;
r_proctime currency rate
========== ======== ======
09:00 US Dollar 102
09:00 Euro 114
09:00 Yen 1
10:45 Euro 116
11:15 Euro 119
11:49 Pounds 108
Java:
Table orders = tEnv.fromDataStream(ordersStream, "amount, currency, o_proctime.proctime");
tEnv.registerTable("Orders", orders);
Table ratesHistory = tEnv.fromDataStream(ratesHistoryStream, "currency, rate, r_proctime.proctime");
tEnv.registerTable("RatesHistory", ratesHistory);
// The key column must match the schema declared above, which names it "currency".
TemporalTableFunction rates = ratesHistory.createTemporalTableFunction("r_proctime", "currency");
tEnv.registerFunction("Rates", rates);
SQL:
SELECT
o.o_proctime,
o.amount AS n_amount,
r.rate AS rate,
o.currency AS currency
FROM
Orders AS o,
LATERAL TABLE (Rates(o_proctime)) AS r
WHERE
r.currency = o.currency
------- JOIN ON form; the query above and the query below are equivalent
SELECT
o_proctime,
o.amount AS n_amount,
r.rate AS rate,
o.currency AS currency
FROM
Orders AS o
JOIN LATERAL TABLE (Rates(o_proctime)) AS r
ON
r.currency = o.currency
o_proctime amount rate currency
========== ====== ==== =========
10:15 2 114 Euro
10:52 3 116 Euro
10:30 1 102 US Dollar
11:04 5 102 US Dollar
10:32 50 1 Yen
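The lookup behind the result above can be sketched in plain Java. This is only an illustration of the "latest version at or before the given time" rule using the RatesHistory rows from this article. It is not how Flink implements temporal tables, and the class and method names (`TemporalLookupSketch`, `update`, `rateAt`) are made up for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical plain-Java model of a temporal table lookup; not Flink's implementation. */
public class TemporalLookupSketch {
    // primary key (currency) -> sorted version history (minutes since midnight -> rate)
    private final Map<String, TreeMap<Integer, Integer>> versions = new HashMap<>();

    /** Records a new version for the given key, valid from the given time onwards. */
    public void update(String time, String currency, int rate) {
        versions.computeIfAbsent(currency, k -> new TreeMap<>()).put(minutes(time), rate);
    }

    /** Returns the rate that was current at the given time, or null if no version existed yet. */
    public Integer rateAt(String time, String currency) {
        TreeMap<Integer, Integer> history = versions.get(currency);
        if (history == null) {
            return null;
        }
        // floorEntry picks the newest version whose start time is <= the requested time
        Map.Entry<Integer, Integer> version = history.floorEntry(minutes(time));
        return version == null ? null : version.getValue();
    }

    private static int minutes(String hhmm) {
        String[] parts = hhmm.split(":");
        return Integer.parseInt(parts[0]) * 60 + Integer.parseInt(parts[1]);
    }

    /** Loads the RatesHistory rows shown in this article. */
    public static TemporalLookupSketch ratesHistory() {
        TemporalLookupSketch t = new TemporalLookupSketch();
        t.update("09:00", "US Dollar", 102);
        t.update("09:00", "Euro", 114);
        t.update("09:00", "Yen", 1);
        t.update("10:45", "Euro", 116);
        t.update("11:15", "Euro", 119);
        t.update("11:49", "Pounds", 108);
        return t;
    }

    public static void main(String[] args) {
        TemporalLookupSketch rates = ratesHistory();
        System.out.println(rates.rateAt("10:15", "Euro"));      // 114
        System.out.println(rates.rateAt("10:52", "Euro"));      // 116
        System.out.println(rates.rateAt("11:04", "US Dollar")); // 102
    }
}
```

Keeping one sorted map per key mirrors the Map-like, per-key versioning described earlier; a key with no version at the requested time (Pounds before 11:49) yields no value, so such an order would not appear in the join result.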
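The MATCH_RECOGNIZE clause searches an event stream for matching patterns, using a powerful and expressive syntax similar to the widely used regular expression syntax.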
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-cep_2.11</artifactId>
  <version>1.7.0</version>
</dependency>
symbol rowtime price tax
====== ==================== ======= =======
'ACME' '01-Apr-11 10:00:00' 12 1
'ACME' '01-Apr-11 10:00:01' 17 2
'ACME' '01-Apr-11 10:00:02' 19 1
'ACME' '01-Apr-11 10:00:03' 21 3
'ACME' '01-Apr-11 10:00:04' 25 2
'ACME' '01-Apr-11 10:00:05' 18 1
'ACME' '01-Apr-11 10:00:06' 15 1
'ACME' '01-Apr-11 10:00:07' 14 2
'ACME' '01-Apr-11 10:00:08' 24 2
'ACME' '01-Apr-11 10:00:09' 25 2
'ACME' '01-Apr-11 10:00:10' 19 1
Find the time ranges during which the price of a given stock (Ticker) keeps falling:
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY rowtime
MEASURES
START_ROW.rowtime AS start_tstamp,
LAST(PRICE_DOWN.rowtime) AS bottom_tstamp,
LAST(PRICE_UP.rowtime) AS end_tstamp
ONE ROW PER MATCH
AFTER MATCH SKIP TO LAST PRICE_UP
PATTERN (START_ROW PRICE_DOWN+ PRICE_UP)
DEFINE
PRICE_DOWN AS
(LAST(PRICE_DOWN.price, 1) IS NULL AND PRICE_DOWN.price < START_ROW.price) OR
PRICE_DOWN.price < LAST(PRICE_DOWN.price, 1),
PRICE_UP AS
PRICE_UP.price > LAST(PRICE_DOWN.price, 1)
) MR;
symbol start_tstamp bottom_tstamp end_tstamp
========= ================== ================== ==================
ACME 01-APR-11 10:00:04 01-APR-11 10:00:07 01-APR-11 10:00:08
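To make the result above easier to check by hand, here is a minimal plain-Java sketch that replays the same pattern (START_ROW PRICE_DOWN+ PRICE_UP, resuming at the PRICE_UP row after each match) over the ACME prices. It is a hand-written scan for this one pattern, not Flink's matching engine, and the names `PriceDipSketch` and `findDips` are invented for the example. Row index i corresponds to the timestamp 10:00:0i in the table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical hand-written scan for the pattern START_ROW PRICE_DOWN+ PRICE_UP. */
public class PriceDipSketch {

    /** Returns {startIdx, bottomIdx, endIdx} for every match found in the price series. */
    public static List<int[]> findDips(int[] price) {
        List<int[]> matches = new ArrayList<>();
        int start = 0;
        while (start < price.length) {
            int last = start; // last row of the falling run (initially START_ROW itself)
            // PRICE_DOWN+: each row must be strictly cheaper than the row before it
            while (last + 1 < price.length && price[last + 1] < price[last]) {
                last++;
            }
            boolean hasDown = last > start;
            // PRICE_UP: the next row must be strictly above the last PRICE_DOWN row
            boolean hasUp = last + 1 < price.length && price[last + 1] > price[last];
            if (hasDown && hasUp) {
                matches.add(new int[] {start, last, last + 1});
                start = last + 1; // AFTER MATCH SKIP TO LAST PRICE_UP
            } else {
                start++;          // no match from this row, try the next one
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        int[] acme = {12, 17, 19, 21, 25, 18, 15, 14, 24, 25, 19};
        for (int[] m : findDips(acme)) {
            System.out.println(Arrays.toString(m)); // [4, 7, 8]
        }
    }
}
```

The single match starts at row 4 (price 25), bottoms out at row 7 (price 14) and ends at row 8 (price 24), which corresponds to the 10:00:04, 10:00:07 and 10:00:08 timestamps in the result table.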
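PARTITION BY: logical partitioning of the table, similar to GROUP BY.
ORDER BY: the columns by which the input is sorted. Patterns depend on an order, so this is essential.
The sort columns must include a time attribute; it must be ascending (the default) and must be the first sort column. E.g.:
ORDER BY rowtime ASC, price DESC
MEASURES: defines the output, similar to SELECT.
The partition keys (PARTITION BY) are automatically added as the first columns.
ONE ROW PER MATCH: the output mode, i.e. how many rows are emitted for each match.
AFTER MATCH SKIP: specifies where the next match starts, which also controls how many distinct matches a single event can belong to.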
symbol tax price rowtime
======== ===== ======= =====================
XYZ 1 7 2018-09-17 10:00:01
XYZ 2 9 2018-09-17 10:00:02
XYZ 1 10 2018-09-17 10:00:03
XYZ 2 5 2018-09-17 10:00:04
XYZ 2 17 2018-09-17 10:00:05
XYZ 2 14 2018-09-17 10:00:06
SELECT *
FROM Ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY rowtime
MEASURES
SUM(A.price) AS sumPrice,
FIRST(rowtime) AS startTime,
LAST(rowtime) AS endTime
ONE ROW PER MATCH
[AFTER MATCH STRATEGY]
PATTERN (A+ C)
DEFINE
A AS SUM(A.price) < 30
)
Note: DEFINE does not yet support aggregate functions such as SUM; it is used here for explanation only.
AFTER MATCH SKIP PAST LAST ROW
symbol sumPrice startTime endTime
======== ========== ===================== =====================
XYZ 26 2018-09-17 10:00:01 2018-09-17 10:00:04
XYZ 17 2018-09-17 10:00:05 2018-09-17 10:00:06
AFTER MATCH SKIP TO NEXT ROW
symbol sumPrice startTime endTime
======== ========== ===================== =====================
XYZ 26 2018-09-17 10:00:01 2018-09-17 10:00:04
XYZ 24 2018-09-17 10:00:02 2018-09-17 10:00:05
XYZ 15 2018-09-17 10:00:03 2018-09-17 10:00:05
XYZ 22 2018-09-17 10:00:04 2018-09-17 10:00:06
XYZ 17 2018-09-17 10:00:05 2018-09-17 10:00:06
AFTER MATCH SKIP TO LAST A
symbol sumPrice startTime endTime
======== ========== ===================== =====================
XYZ 26 2018-09-17 10:00:01 2018-09-17 10:00:04
XYZ 15 2018-09-17 10:00:03 2018-09-17 10:00:05
XYZ 22 2018-09-17 10:00:04 2018-09-17 10:00:06
XYZ 17 2018-09-17 10:00:05 2018-09-17 10:00:06
AFTER MATCH SKIP TO FIRST A
*** With SKIP TO FIRST/LAST variable, if no row is mapped to that variable, a runtime exception is thrown, because the standard requires a valid row to continue the matching.
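The result tables for SKIP PAST LAST ROW and SKIP TO NEXT ROW above can be reproduced with a small plain-Java scan. This is a minimal sketch for this one pattern, (A+ C) with A AS SUM(A.price) < 30 and an unconditional C, not a general matcher and not Flink code; `SkipStrategySketch`, `Skip` and `match` are names invented here. Row index i corresponds to the timestamp 10:00:0(i+1) in the XYZ table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical replay of PATTERN (A+ C), A AS SUM(A.price) < 30, under two skip strategies. */
public class SkipStrategySketch {

    public enum Skip { PAST_LAST_ROW, TO_NEXT_ROW }

    /** Returns {sumPrice, startIdx, endIdx} for every match. */
    public static List<int[]> match(int[] price, Skip skip) {
        List<int[]> out = new ArrayList<>();
        int start = 0;
        while (start < price.length) {
            int sum = 0;
            int next = start;
            // Greedy A+: keep adding rows while the running sum stays below 30,
            // always leaving one row for C (C has no condition, so any row qualifies).
            while (next < price.length - 1 && sum + price[next] < 30) {
                sum += price[next];
                next++;
            }
            if (next > start) {
                // at least one A row was mapped; the row at index `next` is C
                out.add(new int[] {sum, start, next});
                start = (skip == Skip.PAST_LAST_ROW) ? next + 1 : start + 1;
            } else {
                start++; // no match starting at this row
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] xyz = {7, 9, 10, 5, 17, 14};
        for (Skip skip : Skip.values()) {
            System.out.println(skip);
            for (int[] m : match(xyz, skip)) {
                System.out.println("  " + Arrays.toString(m));
            }
        }
    }
}
```

With SKIP PAST LAST ROW the scan resumes after the C row, so rows never overlap between matches and two matches are found; with SKIP TO NEXT ROW it resumes one row after the previous start, so the same row can take part in several matches and five are found.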
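PATTERN: builds the pattern to search for, using a regular-expression-like syntax.
Every pattern is built from basic building blocks called pattern variables. Operators (quantifiers and other modifiers) can be applied to pattern variables, and the variables themselves are defined with the DEFINE keyword. The whole pattern must be enclosed in parentheses. E.g.: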
PATTERN (A B+ C* D)
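Patterns can use the following operators: concatenation (e.g. A B) and the quantifiers *, +, ?, {n}, {n,}, {n,m} and {,m}.
Note: patterns that could produce an empty match are not supported, e.g. PATTERN (A*), PATTERN (A? B*), PATTERN (A{0,} B{0,} C*), etc.
Every quantifier is either greedy (the default) or reluctant (lazy). A greedy quantifier tries to match as many rows as possible, while a reluctant quantifier tries to match as few as possible. A reluctant quantifier is written by appending ? to the default quantifier.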
symbol tax price rowtime
======= ===== ======== =====================
XYZ 1 10 2018-09-17 10:00:02
XYZ 2 11 2018-09-17 10:00:03
XYZ 1 12 2018-09-17 10:00:04
XYZ 2 13 2018-09-17 10:00:05
XYZ 1 14 2018-09-17 10:00:06
XYZ 2 16 2018-09-17 10:00:07
SELECT *
FROM Ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY rowtime
MEASURES
C.price AS lastPrice
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (A B* C) -- greedy (1)
-- PATTERN (A B*? C) -- reluctant (2), use in place of (1)
DEFINE
A AS A.price > 10,
B AS B.price < 15,
C AS C.price > 12
)
++ (1) Greedy: B matches as many rows as possible
symbol lastPrice
======== ===========
XYZ 16
++ (2) Reluctant: B matches only one row
symbol lastPrice
======== ===========
XYZ 13
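The difference between the two results can be traced with a minimal plain-Java sketch. It assumes the condition for C is on C itself (C AS C.price > 12), which is what the two result tables imply, and it only looks for the first match. It is a brute-force illustration rather than Flink's matcher, and `QuantifierSketch` and `firstMatchLastPrice` are hypothetical names.

```java
/** Hypothetical brute-force search for the first match of A B* C versus A B*? C. */
public class QuantifierSketch {

    /** Returns C.price of the first match, or null if the pattern never matches. */
    public static Integer firstMatchLastPrice(int[] price, boolean greedy) {
        for (int a = 0; a < price.length; a++) {
            if (!(price[a] > 10)) {
                continue; // A AS A.price > 10
            }
            // Longest run of rows after A that could be mapped to B (B AS B.price < 15)
            int maxB = 0;
            while (a + 1 + maxB < price.length && price[a + 1 + maxB] < 15) {
                maxB++;
            }
            // Greedy tries the longest run of B first, reluctant tries the shortest first
            for (int i = 0; i <= maxB; i++) {
                int b = greedy ? maxB - i : i;
                int c = a + 1 + b;
                if (c < price.length && price[c] > 12) { // C AS C.price > 12
                    return price[c];
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        int[] xyz = {10, 11, 12, 13, 14, 16};
        System.out.println(firstMatchLastPrice(xyz, true));  // 16
        System.out.println(firstMatchLastPrice(xyz, false)); // 13
    }
}
```

In both cases A is the row with price 11, since 10 does not satisfy A.price > 10. The greedy variant maps 12, 13 and 14 to B and ends on 16; the reluctant variant first tries zero B rows (12 fails C.price > 12), then one B row, and ends on 13.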
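Note: a greedy quantifier cannot be used on the last variable of a pattern, so a pattern such as (A B*) is not allowed. This is easy to work around by introducing an artificial state whose condition is the negation of B's (C in the example below). E.g.: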
PATTERN (A B* C)
DEFINE
A AS condA(),
B AS condB(),
C AS NOT condB()
Note: the optional reluctant quantifier (A?? or A{0,1}?) is not yet supported.
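DEFINE: similar to WHERE. It specifies the conditions a row must satisfy in order to be classified under the corresponding pattern variable. If no condition is defined for a pattern variable, a default condition is used that evaluates to true for every row. Conditions are evaluated row by row.
A variable's condition may refer to previously matched variables, and aggregate functions can be applied either to the rows mapped to one variable or to all rows of the match so far. E.g.: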
PATTERN (A B+)
DEFINE
A AS A.price > 10,
B AS B.price > A.price AND SUM(price) < 100 AND SUM(B.price) < 80
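| Offset functions | Description |
|---|---|
| LAST(variable.field, n) | Returns the field value from the row mapped to the n-th last element of the variable, counting from the last mapped row (n = 0). Inside DEFINE the row being evaluated counts as the last one, so n = 1 refers to the previous row. |
| FIRST(variable.field, n) | Returns the field value from the row mapped to the n-th element of the variable, counting from the first mapped row (n = 0). Returns null if no such row has been mapped. |

LAST/FIRST may only reference a single pattern variable: an expression such as LAST(A.price * B.tax) is not allowed, whereas LAST(A.price * A.tax) is fine.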
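When writing MATCH_RECOGNIZE queries, memory consumption is an important consideration, because the space of potential matches is built breadth-first. With that in mind, make sure the pattern can finish, and preferably map a reasonable number of rows to each match, since they all have to fit in memory. The following example is inadvisable, because B has no condition and therefore matches every row: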
PATTERN (A B+ C)
DEFINE
A as A.price > 10,
C as C.price > 20
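This can be kept under control either by giving B the negation of the following condition or by using a reluctant quantifier: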
PATTERN (A B+ C)
DEFINE
A as A.price > 10,
B as B.price <= 20,
C as C.price > 20
PATTERN (A B+? C)
DEFINE
A as A.price > 10,
C as C.price > 20
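Note: a configured state retention time cannot currently be used to bound how long a pattern may stay open. The SQL standard has no corresponding construct either; the community is working on it.
The following features are not yet supported:
Pattern groups, e.g. PATTERN (A (B C)+)
Alternation, e.g. PATTERN ((A B | C D) E)
PERMUTE, e.g. PATTERN (PERMUTE (A, B, C)) = PATTERN (A B C | A C B | B A C | B C A | C A B | C B A)
Exclusions, e.g. PATTERN ({- A -} B)
Reluctant optional quantifier, e.g. PATTERN A?? or A{0,1}?
See reference [4] for more details.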
[1] Apache Flink 1.7.0 Release Announcement
[2] Temporal Tables
[3] Joins in Continuous Queries
[4] Detecting Patterns in Tables Beta