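The regular join (stream-stream join) is the most general join type and is supported in both Batch and Streaming mode. Any new record, or any change on either side of the join, is visible and affects the overall join result.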
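Characteristics:
Use cases: a Regular Join has to keep both inputs in state, so on unbounded streams its resource usage keeps growing and it is usually not sustainable. It is generally used only to join bounded data streams.
Summarizing the above, the basic syntax supported by a regular join is as follows: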
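One common way to bound the state a Regular Join accumulates on unbounded input is an idle-state retention time. The following is a minimal sketch, assuming the Flink SQL client (1.13 or later, where the quoted SET syntax is available). The one-day value is an arbitrary assumption, and rows whose state has expired can no longer be matched, so results may become incomplete.

```sql
-- Assumption: run in the Flink SQL client before submitting the join query.
-- table.exec.state.ttl controls how long idle state is retained;
-- '1 d' is an illustrative value, not a recommendation.
SET 'table.exec.state.ttl' = '1 d';
```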
SELECT * FROM Orders
[INNER|RIGHT|LEFT|FULL OUTER] JOIN Product
ON Orders.productId = Product.id
If either stream table receives an update, the join is triggered again and emits the latest result.
-- Users stream, read from Kafka as JSON messages
CREATE TABLE users (
    user_id STRING,
    name STRING,
    age INT,
    gmt_time TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'users',
    'properties.bootstrap.servers' = 'localhost:9092',
    'properties.group.id' = 'orders2ConsumerGroup',
    'scan.startup.mode' = 'latest-offset',
    'format' = 'json'
);
-- Address stream, read from Kafka as JSON messages
CREATE TABLE address (
    user_id STRING,
    address STRING,
    update_time TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'address',
    'properties.bootstrap.servers' = 'localhost:9092',
    'properties.group.id' = 'orders2ConsumerGroup',
    'scan.startup.mode' = 'latest-offset',
    'format' = 'json'
);
-- Regular LEFT JOIN: every user row is kept, with a NULL address until a match arrives
SELECT u.user_id, u.name, u.age, a.address
FROM users u
LEFT JOIN address a
ON u.user_id = a.user_id;
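To make the update behaviour concrete, here is a hedged illustration of the changelog this LEFT JOIN might produce. The two input records are invented for illustration, and the exact output format depends on the sink or client used.

```sql
-- Hypothetical input, in arrival order (values are made up for illustration):
--   users   <- {"user_id": "u1", "name": "Tom", "age": 20, "gmt_time": "2024-01-01 10:00:00"}
--   address <- {"user_id": "u1", "address": "Beijing", "update_time": "2024-01-01 10:00:05"}
--
-- Changelog emitted by the join, as a sketch:
--   +I (u1, Tom, 20, NULL)      -- user arrives, no matching address yet
--   -D (u1, Tom, 20, NULL)      -- address arrives, the NULL-padded row is retracted
--   +I (u1, Tom, 20, Beijing)   -- the joined row replaces it
SELECT u.user_id, u.name, u.age, a.address
FROM users u
LEFT JOIN address a
ON u.user_id = a.user_id;
```

Because the result is an updating stream rather than an append-only one, whatever consumes it has to be able to handle retractions.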
-- Bounded variant: users table backed by MySQL through the JDBC connector
CREATE TABLE users (
    `user_id` STRING,
    `name` STRING,
    `age` INT,
    `gmt_time` TIMESTAMP(3)
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://localhost:3306/flink',
    'table-name' = 'user',
    'username' = 'root',
    'password' = '123456'
);
-- Bounded variant: address table backed by MySQL through the JDBC connector
CREATE TABLE address (
    `user_id` STRING,
    `address` STRING,
    `gmt_time` TIMESTAMP(3)
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://localhost:3306/flink',
    'table-name' = 'user_address',
    'username' = 'root',
    'password' = '123456'
);
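-- Implicit inner join: a comma-separated FROM with a WHERE equality
-- is equivalent to INNER JOIN ... ON
SELECT users.name, users.user_id, users.age, address.address
FROM users, address
WHERE users.user_id = address.user_id;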
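Since both JDBC tables are bounded, the query above can also be run as a batch job. The following is a minimal sketch, assuming the Flink SQL client (1.13 or later). The setting is not part of the original example.

```sql
-- Assumption: executed in the Flink SQL client before the SELECT above.
-- In batch mode the join reads both bounded inputs to completion
-- and emits only the final result, with no intermediate updates.
SET 'execution.runtime-mode' = 'batch';
```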
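In real stream-processing projects, dimension tables are usually stored in external systems (MySQL, OceanBase, HBase, etc.). Each streaming record can be joined against an external dimension-table source, which gives real-time jobs a way to look up related data. Stream-to-dimension-table joins are therefore usually implemented as Temporal Joins, which will be covered later.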