Flink SQL的Sink表只支持全字段数据插入,不支持指定字段数据插入和更新操作,那后面结果表增加字段如何处理?
首先我们需要给Sink指定主键,如果输出存储是声明了主键(primary key)的数据库(例如,RDS/ES/HBASE等),数据流的输出结果有以下2种情况:
这里跟mysql数据库的for update效果一样,所以我们可以使用这个特性进行分批次插入。解决了两个问题:
这里简单举个例子,我们的源数据表如下:
CREATE TABLE `user_source` (
`user_id` string,
`user_name` STRING,
`mobile` STRING,
`create_time` timestamp,
`update_time` timestamp
) with (
'connector' = 'mysql-cdc',
'debezium.snapshot.locking.mode' = 'none',
'hostname' = 'localhost',
'port' = '3306',
'username' = 'root',
'password' = '123456',
'database-name' = 'source',
'table-name' = 'user'
)
Sink表结构如下,这里我们指定userId为主键:
CREATE TABLE `user_sink` (
userId STRING,
userName STRING,
mobile STRING,
createTime TIMESTAMP,
updateTime TIMESTAMP,
PRIMARY KEY (userId) NOT ENFORCED
) with (
'connector' = 'elasticsearch-7',
'hosts' = 'localhost:9200',
'index' = 'user',
'username' ='root',
'password' ='123456'
)
我们原本的Task任务应该是这样的:
INSERT INTO user_sink
SELECT
`user_id` as userId,
`user_name` as userName,
`mobile` as mobile,
`create_time` as createTime,
`update_time` as updateTime
FROM user_source;
这是大家经常普遍遇到的情况,假设我们这里的ES的Index不是5个字段,而是500个字段那会怎么样呢?如果我们将500个字段写在同一个SQL中,这个SQL得多大?
做拆分SQL的第一步就是将Sink表进行拆分:
CREATE TABLE `user_sink_1` (
userId STRING,
createTime TIMESTAMP,
updateTime TIMESTAMP,
PRIMARY KEY (userId) NOT ENFORCED
) with (
'connector' = 'elasticsearch-7',
'hosts' = 'localhost:9200',
'index' = 'user',
'username' ='root',
'password' ='123456'
)
CREATE TABLE `user_sink_2` (
userId STRING,
userName STRING,
mobile STRING
PRIMARY KEY (userId) NOT ENFORCED
) with (
'connector' = 'elasticsearch-7',
'hosts' = 'localhost:9200',
'index' = 'user',
'username' ='root',
'password' ='123456'
)
然后我们就可以将同步的SQL根据Sink进行拆分:
INSERT INTO user_sink_1
SELECT
`user_id` as userId
`create_time` as createTime,
`update_time` as updateTime
FROM user_source;
INSERT INTO user_sink_2
SELECT
`user_id` as userId,
`user_name` as userName,
`mobile` as mobile
FROM user_source;
插入数据到 user_sink_1 表同步到ES增加该数据,在插入数据到 user_sink_2 表同步数据到ES就会更加主键做 for update 操作,将 user_name 和 mobile 字段根据 user_id 进行更新。
结果表新增字段就跟上面的步骤差不多
需要注意的是每个 Task 执行都需要分配一定的CU(CPU+内存),所以如果可以跟之前的Task合并的话,会大幅减少Task数量并降低硬件要求。合并的过程差不多: