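JOIN in Realtime Compute has the same semantics as JOIN in traditional batch processing: both relate two tables. The difference is that Realtime Compute joins two dynamic tables, and the join result is also updated dynamically so that the final result matches the batch result.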
tableReference [, tableReference ]* | tableexpression
[ LEFT ] JOIN tableexpression [ joinCondition ];
Notes
- Only equi-joins are supported; non-equi joins are not supported.
- Only two join types are supported: INNER JOIN and LEFT OUTER JOIN.
Table 1. Orders
rowtime | productId | orderId | units |
---|---|---|---|
10:17:00 | 30 | 5 | 4 |
10:17:05 | 10 | 6 | 1 |
10:18:05 | 20 | 7 | 2 |
10:18:07 | 30 | 8 | 20 |
11:02:00 | 10 | 9 | 6 |
11:04:00 | 10 | 10 | 1 |
11:09:30 | 40 | 11 | 12 |
11:24:11 | 10 | 12 | 4 |
Table 2. Products
productId | name | unitPrice |
---|---|---|
30 | Cheese | 17 |
10 | Beer | 0.25 |
20 | Wine | 6 |
40 | Bread | 100 |
SELECT o.rowtime, o.productId, o.orderId, o.units, p.name, p.unitPrice
FROM Orders AS o
JOIN Products AS p
ON o.productId = p.productId;
o.rowtime | o.productId | o.orderId | o.units | p.name | p.unitPrice |
---|---|---|---|---|---|
10:17:00 | 30 | 5 | 4 | Cheese | 17 |
10:17:05 | 10 | 6 | 1 | Beer | 0.25 |
10:18:05 | 20 | 7 | 2 | Wine | 6 |
10:18:07 | 30 | 8 | 20 | Cheese | 17 |
11:02:00 | 10 | 9 | 6 | Beer | 0.25 |
11:04:00 | 10 | 10 | 1 | Beer | 0.25 |
11:09:30 | 40 | 11 | 12 | Bread | 100 |
11:24:11 | 10 | 12 | 4 | Beer | 0.25 |
Table 3. datahub_stream1
a(BIGINT) | b(BIGINT) | c(VARCHAR) |
---|---|---|
0 | 10 | test11 |
1 | 10 | test21 |
Table 4. datahub_stream2
a(BIGINT) | b(BIGINT) | c(VARCHAR) |
---|---|---|
0 | 10 | test11 |
1 | 10 | test21 |
0 | 10 | test31 |
1 | 10 | test41 |
SELECT s1.c, s2.c
FROM datahub_stream1 AS s1
JOIN datahub_stream2 AS s2
ON s1.a = s2.a
WHERE s1.a = 0;
s1.c(VARCHAR) | s2.c(VARCHAR) |
---|---|
test11 | test11 |
test11 | test31 |
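The dynamic behavior of a two-stream JOIN can be illustrated with a small Python simulation. This is only a sketch of the general idea (both sides keep their rows in state, and each arriving row is matched against the rows already seen on the other side); it is not the engine's actual implementation. The function name `streaming_inner_join` and the arrival order of the events are assumptions made for illustration.

```python
from collections import defaultdict

def streaming_inner_join(events, key):
    """Incrementally join two streams.

    `events` is a list of (side, row) pairs in arrival order, where side
    is "left" or "right". Each new row is stored in its side's state and
    joined with all rows already stored on the other side.
    """
    state = {"left": defaultdict(list), "right": defaultdict(list)}
    out = []
    for side, row in events:
        k = key(row)
        state[side][k].append(row)
        other = "right" if side == "left" else "left"
        for match in state[other][k]:
            left, right = (row, match) if side == "left" else (match, row)
            out.append((left, right))
    return out

# Rows of datahub_stream1 (left) and datahub_stream2 (right) as (a, b, c).
# The interleaving below is an assumed arrival order.
events = [
    ("left", (0, 10, "test11")),
    ("right", (0, 10, "test11")),
    ("left", (1, 10, "test21")),
    ("right", (1, 10, "test21")),
    ("right", (0, 10, "test31")),
    ("right", (1, 10, "test41")),
]

joined = streaming_inner_join(events, key=lambda r: r[0])   # ON s1.a = s2.a
result = [(l[2], r[2]) for l, r in joined if l[0] == 0]     # WHERE s1.a = 0
print(result)
```

With this arrival order the filtered output contains the same two rows as the result table above. A different arrival order changes when each row is emitted, but not the final set of rows.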
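For each streaming record, you can join an external dimension table data source, which provides data lookup for Realtime Compute for Apache Flink.
Note: A dimension table changes continuously, so a dimension table JOIN must specify the point in time of the snapshot that each record is joined against. Currently, a dimension table JOIN can only use the snapshot at the current time; joining against the snapshot that corresponds to the rowtime of the left table will be supported in the future.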
SELECT column-names
FROM table1 [AS <alias1>]
[LEFT] JOIN table2 FOR SYSTEM_TIME AS OF PROCTIME() [AS <alias2>]
ON table1.column-name1 = table2.key-name1;
The following example joins an event stream with a whitelist dimension table.
SELECT e.*, w.*
FROM event AS e
JOIN white_list FOR SYSTEM_TIME AS OF PROCTIME() AS w
ON e.id = w.id;
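Note
- Dimension tables support INNER JOIN and LEFT JOIN, but not RIGHT JOIN or FULL JOIN.
- You must add FOR SYSTEM_TIME AS OF PROCTIME(), which means that each record is joined against the dimension table data visible at the current moment.
- Records that arrive later in the source table are joined only with the latest dimension table data at that time; that is, the JOIN happens only at processing time (Processing Time). If the dimension table data changes (insert, update, or delete) after the JOIN has happened, the already joined results are not updated accordingly.
- The ON condition must contain equality conditions on all PRIMARY KEY columns of the dimension table (consistent with the actual table definition). The ON condition may also contain other equality conditions.
- For one-to-many JOINs, specify the join key in the INDEX of the dimension table DDL.
- A dimension table cannot be joined with another dimension table.
- Type conversion functions such as CAST cannot be applied to dimension table columns in the ON condition. If you need a type conversion, apply it to the source table column.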
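The processing-time behavior described in the notes above can be sketched in Python. The simulation replays a timeline of dimension table changes and source events: each event is matched against the dimension snapshot that exists at the moment the event is processed, and results that were already emitted are never revised. The function `lookup_join`, the timeline format, the updated phone number `13900009999`, and the event with id 4 are hypothetical and exist only for this illustration.

```python
def lookup_join(timeline, left_outer=False):
    """Replay ("dim", key, value_or_None) and ("event", (id, name)) items.

    A "dim" item inserts/updates a dimension row, or deletes it when the
    value is None. An "event" item is joined against the current snapshot.
    """
    dim = {}   # current snapshot of the dimension table, keyed by primary key
    out = []
    for item in timeline:
        if item[0] == "dim":
            _, key, value = item
            if value is None:
                dim.pop(key, None)      # delete
            else:
                dim[key] = value        # insert or update
        else:
            _, (event_id, name) = item
            if name in dim:
                out.append((event_id, dim[name], name))
            elif left_outer:
                out.append((event_id, None, name))   # LEFT JOIN keeps the row
    return out

timeline = [
    ("dim", "lilei", 13900004444),
    ("event", (1, "lilei")),
    ("dim", "lilei", 13900009999),   # hypothetical later update
    ("event", (2, "hanmeimei")),     # no matching dimension row
    ("event", (4, "lilei")),         # hypothetical later event
]

inner = lookup_join(timeline)
outer = lookup_join(timeline, left_outer=True)
print(inner)
print(outer)
```

The first `lilei` event keeps the phone number that was current when it was processed, even though the dimension row changes afterwards; only the later event sees the new value.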
Table 1. datahub_input1
id(bigint) | name(varchar) | age(bigint) |
---|---|---|
1 | lilei | 22 |
2 | hanmeimei | 20 |
3 | libai | 28 |
Table 2. phoneNumber
name(varchar) | phoneNumber(bigint) |
---|---|
dufu | 13900001111 |
baijuyi | 13900002222 |
libai | 13900003333 |
lilei | 13900004444 |
CREATE TABLE datahub_input1 (
id BIGINT,
name VARCHAR,
age BIGINT
) WITH (
type='datahub'
);
create table phoneNumber(
name VARCHAR,
phoneNumber bigint,
primary key(name),
PERIOD FOR SYSTEM_TIME
)with(
type='rds'
);
CREATE table result_infor(
id bigint,
phoneNumber bigint,
name VARCHAR
)with(
type='rds'
);
INSERT INTO result_infor
SELECT
t.id,
w.phoneNumber,
t.name
FROM datahub_input1 as t
JOIN phoneNumber FOR SYSTEM_TIME AS OF PROCTIME() as w
ON t.name = w.name;
id(bigint) | phoneNumber(bigint) | name(varchar) |
---|---|---|
1 | 13900004444 | lilei |
3 | 13900003333 | libai |
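An IntervalJoin statement joins two streams so that each record in the left or right stream is matched only with records of the other stream that fall within the same time range. After the JOIN, the time columns of the input streams are retained, so you can continue with Event Time based operations.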
SELECT column-names
FROM table1 [AS <alias1>]
[INNER | LEFT | RIGHT |FULL ] JOIN table2
ON table1.column-name1 = table2.key-name1 AND TIMEBOUND_EXPRESSION
Note
- INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN are supported. A bare JOIN defaults to INNER JOIN.
- SEMI JOIN and ANTI JOIN are not yet supported.
- TIMEBOUND_EXPRESSION is an interval condition on the time attribute columns of the left and right streams. The following three forms are supported:
  - ltime = rtime
  - ltime >= rtime AND ltime < rtime + INTERVAL '10' MINUTE
  - ltime BETWEEN rtime - INTERVAL '10' SECOND AND rtime + INTERVAL '5' SECOND
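The three TIMEBOUND_EXPRESSION forms listed above all reduce to a lower and an upper bound on `ltime` relative to `rtime`. The following Python sketch expresses each form as such a bound check. The helper `within` and its parameters are hypothetical names used only here; note that SQL BETWEEN includes both endpoints, while the `<` form excludes the upper bound.

```python
from datetime import datetime, timedelta

def within(ltime, rtime, lower, upper, upper_inclusive=True):
    """Return True if rtime + lower <= ltime <= rtime + upper.

    With upper_inclusive=False the upper bound is excluded (ltime < ...).
    """
    lo, hi = rtime + lower, rtime + upper
    if upper_inclusive:
        return lo <= ltime <= hi
    return lo <= ltime < hi

r = datetime(2020, 4, 1, 10, 0, 0)
zero = timedelta(0)

# ltime = rtime
equal_case = within(r, r, zero, zero)

# ltime >= rtime AND ltime < rtime + INTERVAL '10' MINUTE
# A row exactly 10 minutes later falls outside the half-open range.
half_open_edge = within(r + timedelta(minutes=10), r, zero,
                        timedelta(minutes=10), upper_inclusive=False)

# ltime BETWEEN rtime - INTERVAL '10' SECOND AND rtime + INTERVAL '5' SECOND
# A row exactly 5 seconds later is still inside the closed range.
between_edge = within(r + timedelta(seconds=5), r,
                      timedelta(seconds=-10), timedelta(seconds=5))

print(equal_case, half_open_edge, between_edge)
```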
Collect the logistics information generated within 4 hours after an order is placed.
Orders table (orders)
id | productName | orderTime |
---|---|---|
1 | iphone | 2020-04-01 10:00:00.0 |
2 | mac | 2020-04-01 10:02:00.0 |
3 | huawei | 2020-04-01 10:03:00.0 |
4 | pad | 2020-04-01 10:05:00.0 |
Shipments table (shipments)
shipId | orderId | status | shiptime |
---|---|---|---|
0 | 1 | shipped | 2020-04-01 11:00:00.0 |
1 | 2 | delivered | 2020-04-01 17:00:00.0 |
2 | 3 | shipped | 2020-04-01 12:00:00.0 |
3 | 4 | shipped | 2020-04-01 11:30:00.0 |
CREATE TABLE Orders(
id BIGINT,
productName VARCHAR,
orderTime TIMESTAMP,
WATERMARK wk FOR orderTime as withOffset(orderTime, 2000) -- Define a watermark on the rowtime column.
) WITH (
type='datahub',
endpoint='' ,
accessId='' ,
accessKey='' ,
projectName='' ,
topic='' ,
project=''
);
CREATE TABLE Shipments(
shipId BIGINT,
orderId BIGINT,
status VARCHAR,
shiptime TIMESTAMP,
WATERMARK wk FOR shiptime as withOffset(shiptime, 2000) -- Define a watermark on the rowtime column.
) WITH (
type='datahub',
endpoint='' ,
accessId='' ,
accessKey='' ,
projectName='' ,
topic='' ,
project=''
);
-- Use RDS as the result table.
CREATE TABLE rds_output(
id BIGINT,
productName VARCHAR,
status VARCHAR
) WITH (
type='rds',
url='' ,
tableName='' ,
userName='' ,
password=''
);
INSERT INTO rds_output
SELECT id, productName, status
FROM Orders AS o
JOIN Shipments AS s ON o.id = s.orderId AND
o.orderTime BETWEEN s.shiptime - INTERVAL '4' HOUR AND s.shiptime;
id(bigint) | productName(varchar) | status(varchar) |
---|---|---|
1 | iphone | shipped |
3 | huawei | shipped |
4 | pad | shipped |
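The expected result above can be checked with a short Python sketch that applies the same key and time-range condition to the sample rows. This is a batch-style nested loop over the complete input, used here only to verify the semantics; a streaming engine would evaluate the condition incrementally and use watermarks to discard state that can no longer match. The function name `interval_join` is an assumption for this illustration.

```python
from datetime import datetime, timedelta

# Sample rows from the orders and shipments tables above.
orders = [
    (1, "iphone", datetime(2020, 4, 1, 10, 0)),
    (2, "mac", datetime(2020, 4, 1, 10, 2)),
    (3, "huawei", datetime(2020, 4, 1, 10, 3)),
    (4, "pad", datetime(2020, 4, 1, 10, 5)),
]
shipments = [
    (0, 1, "shipped", datetime(2020, 4, 1, 11, 0)),
    (1, 2, "delivered", datetime(2020, 4, 1, 17, 0)),
    (2, 3, "shipped", datetime(2020, 4, 1, 12, 0)),
    (3, 4, "shipped", datetime(2020, 4, 1, 11, 30)),
]

def interval_join(orders, shipments, window):
    """Emulate: o.id = s.orderId AND
    o.orderTime BETWEEN s.shiptime - window AND s.shiptime."""
    out = []
    for oid, product, order_time in orders:
        for _, order_id, status, ship_time in shipments:
            if oid == order_id and ship_time - window <= order_time <= ship_time:
                out.append((oid, product, status))
    return out

rows = interval_join(orders, shipments, timedelta(hours=4))
print(rows)
```

Order 2 is dropped because its shipment at 17:00 is almost seven hours after the order was placed at 10:02, which is outside the 4-hour range.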
datahub_stream1
k1 | v1 |
---|---|
1 | val1 |
2 | val2 |
3 | val3 |
datahub_stream2
k2 | v2 |
---|---|
1 | val1 |
2 | val2 |
3 | val3 |
CREATE TABLE datahub_stream1 (
k1 BIGINT,
v1 VARCHAR,
d AS PROCTIME()
) WITH (
type='datahub',
endpoint='' ,
accessId='' ,
accessKey='' ,
projectName='' ,
topic='' ,
project=''
);
CREATE TABLE datahub_stream2 (
k2 BIGINT,
v2 VARCHAR,
e AS PROCTIME()
) WITH (
type='datahub',
endpoint='' ,
accessId='' ,
accessKey='' ,
projectName='' ,
topic='' ,
project=''
);
-- Use RDS as the result table.
CREATE TABLE rds_output(
k1 BIGINT,
v1 VARCHAR,
v2 VARCHAR
) WITH (
type='rds',
url='' ,
tableName='' ,
userName='' ,
password=''
);
INSERT INTO rds_output
SELECT k1, v1, v2
FROM datahub_stream1 AS o
JOIN datahub_stream2 AS s on o.k1 = s.k2 AND
o.d BETWEEN s.e - INTERVAL '4' MINUTE AND s.e;
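Note: The result depends on the time at which each record of the two streams enters the system and is therefore nondeterministic, so no expected result is provided for this example.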