Under the hood, Flink CDC uses Debezium to capture data changes.
Key feature:
scan.incremental.snapshot.enabled
With incremental snapshot reading disabled, a change event is rendered in JSON format as shown below:
{
"before": {
"id": 111,
"name": "scooter",
"description": "Big 2-wheel scooter",
"weight": 5.18
},
"after": {
"id": 111,
"name": "scooter",
"description": "Big 2-wheel scooter",
"weight": 5.15
},
"source": {...},
"op": "u", // operation type; "u" means this is an update event
"ts_ms": 1589362330904, // the time at which the connector processed the event
"transaction": null
}
The meaning of each field is described in the Debezium documentation.
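To make the envelope concrete, here is a minimal Python sketch (not part of the connector; the helper name is illustrative) that classifies a change event by its "op" field and extracts the effective row:

```python
import json

# Debezium op codes: "c" = create, "u" = update, "d" = delete, "r" = snapshot read
def interpret_change_event(raw: str):
    event = json.loads(raw)
    op = event["op"]
    if op == "d":
        # For deletes, only "before" carries the removed row
        return ("delete", event["before"])
    # For creates, snapshot reads, and updates, "after" holds the current row state
    return ({"c": "insert", "r": "snapshot_read", "u": "update"}[op], event["after"])

raw = """{
  "before": {"id": 111, "name": "scooter", "weight": 5.18},
  "after":  {"id": 111, "name": "scooter", "weight": 5.15},
  "op": "u",
  "ts_ms": 1589362330904,
  "transaction": null
}"""
action, row = interpret_change_event(raw)
print(action, row["weight"])  # update 5.15
```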
In the DataStream API, you can use the constructor JsonDebeziumDeserializationSchema(true) to include the schema in each message, but this is not recommended.
JsonDebeziumDeserializationSchema also accepts custom JsonConverter configuration. The following example includes decimal data in the output as plain numbers:
Map<String, Object> customConverterConfigs = new HashMap<>();
customConverterConfigs.put(JsonConverterConfig.DECIMAL_FORMAT_CONFIG, "numeric");
JsonDebeziumDeserializationSchema schema =
    new JsonDebeziumDeserializationSchema(true, customConverterConfigs);
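For context on what this setting changes: with the default decimal.format (BASE64), the JSON converter serializes a DECIMAL as the base64-encoded big-endian two's-complement bytes of its unscaled value, with the scale carried in the schema; "numeric" emits a plain JSON number instead. A small Python sketch of decoding the default encoding (the sample value is illustrative):

```python
import base64
from decimal import Decimal

def decode_connect_decimal(b64: str, scale: int) -> Decimal:
    """Decode a Kafka Connect / Debezium base64-encoded decimal back into a number."""
    # The payload is the big-endian two's-complement bytes of the unscaled value
    unscaled = int.from_bytes(base64.b64decode(b64), byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

# weight 5.18 with scale 2 has unscaled value 518 -> bytes 0x02 0x06 -> "AgY="
print(decode_connect_decimal("AgY=", 2))  # 5.18
```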
The integration steps are as follows:
Add the following dependency to pom.xml:
<dependency>
    <groupId>com.ververica</groupId>
    <artifactId>flink-connector-mysql-cdc</artifactId>
    <version>2.2.0</version>
</dependency>
Create the MySQL table with the following statement:
CREATE TABLE `info_message` (
`id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT '主键',
`msg_title` varchar(100) DEFAULT NULL COMMENT '消息名称',
`msg_ctx` varchar(2048) DEFAULT NULL COMMENT '消息内容',
`msg_time` datetime DEFAULT NULL COMMENT '消息发送时间',
PRIMARY KEY (`id`)
)
A sample of the data is shown below:
mysql>
mysql> select * from d_general.info_message limit 3;
+--------------------+-----------+-------------------------------------------------------+---------------------+
| id | msg_title | msg_ctx | msg_time |
+--------------------+-----------+-------------------------------------------------------+---------------------+
| 1 | title1 | content1 | 2019-03-29 15:27:21 |
| 2 | title2 | content2 | 2019-03-29 15:38:36 |
| 3 | title3 | content3 | 2019-03-29 15:38:36 |
+--------------------+-----------+-------------------------------------------------------+---------------------+
3 rows in set (0.00 sec)
mysql>
Flink SQL> set 'execution.checkpointing.interval' = '10s';
[INFO] Session property has been set.
Flink SQL>
Flink SQL> create table mysql_source(
> database_name string metadata from 'database_name' virtual,
> table_name string metadata from 'table_name' virtual,
> id decimal(20,0) not null,
> msg_title string,
> msg_ctx string,
> msg_time timestamp(9),
> primary key (id) not enforced
> ) with (
> 'connector' = 'mysql-cdc',
> 'hostname' = '192.168.8.124',
> 'port' = '3306',
> 'username' = 'hnmqet',
> 'password' = 'hnmq123456',
> 'server-time-zone' = 'Asia/Shanghai',
> 'scan.startup.mode' = 'initial',
> 'database-name' = 'd_general',
> 'table-name' = 'info_message'
> );
[INFO] Execute statement succeed.
Flink SQL>
Next, create the Hudi sink table:
Flink SQL> create table hudi_sink(
> database_name string,
> table_name string,
> id decimal(20,0) not null,
> msg_title string,
> msg_ctx string,
> msg_time timestamp(6),
> primary key (database_name, table_name, id) not enforced
> ) with (
> 'connector' = 'hudi',
> 'path' = 'hdfs://nnha/user/hudi/warehouse/hudi_db/info_message',
> 'table.type' = 'MERGE_ON_READ',
> 'hoodie.datasource.write.recordkey.field' = 'database_name.table_name.id',
> 'write.precombine.field' = 'msg_time',
> 'write.rate.limit' = '2000',
> 'write.tasks' = '2',
> 'write.operation' = 'upsert',
> 'compaction.tasks' = '2',
> 'compaction.async.enabled' = 'true',
> 'compaction.trigger.strategy' = 'num_commits',
> 'compaction.delta_commits' = '5',
> 'read.tasks' = '2',
> 'changelog.enabled' = 'true'
> );
[INFO] Execute statement succeed.
Flink SQL>
Note: the job synchronizes the initial snapshot first, then the transaction log (binlog):
Flink SQL> insert into hudi_sink select database_name, table_name, id, msg_title, msg_ctx, msg_time from mysql_source /*+ OPTIONS('server-id'='5401') */ where msg_time is not null;
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: afa575f5451af65d1ee7d225d77888ac
Flink SQL>
If the source runs with parallelism greater than 1, specify a server-id range instead, e.g. 'server-id'='5401-5404', with at least as many ids in the range as there are parallel source tasks. This way the MySQL server can correctly maintain the network connections and binlog positions.
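The rule of thumb is that the server-id range must contain at least as many ids as the source parallelism, since each parallel reader registers with MySQL as a distinct replica. A hypothetical Python helper sketching that check (not part of Flink CDC):

```python
def validate_server_id_range(server_id: str, parallelism: int) -> bool:
    """Return True if the 'server-id' value covers the source parallelism.

    Accepts either a single id like "5401" or a range like "5401-5404".
    """
    if "-" in server_id:
        lo, hi = (int(part) for part in server_id.split("-"))
        # An inclusive range of ids must cover every parallel source task
        return hi - lo + 1 >= parallelism
    # A single id only suffices for a single-task source
    return parallelism <= 1

print(validate_server_id_range("5401-5404", 4))  # True
print(validate_server_id_range("5401", 2))       # False
```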