## Flink CDC
### Reference documents
- Introduction
https://developer.aliyun.com/article/777502
- Project wiki
https://github.com/ververica/flink-cdc-connectors/wiki
- In-depth analysis
https://blog.csdn.net/Baron_ND/article/details/115752972
### A closer look at the source code
1. A sample Flink job that uses the connector:
```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import com.alibaba.ververica.cdc.debezium.StringDebeziumDeserializationSchema;
import com.alibaba.ververica.cdc.connectors.mysql.MySQLSource;

public class MySqlBinlogSourceExample {
    public static void main(String[] args) throws Exception {
        // Build a SourceFunction that emits each change record as a String
        SourceFunction<String> sourceFunction = MySQLSource.<String>builder()
                .hostname("localhost")
                .port(3306)
                .databaseList("inventory") // monitor all tables under inventory database
                .username("flinkuser")
                .password("flinkpw")
                .deserializer(new StringDebeziumDeserializationSchema()) // converts SourceRecord to String
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env
                .addSource(sourceFunction)
                .print().setParallelism(1); // use parallelism 1 for sink to keep message ordering

        env.execute();
    }
}
```
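2. The UML class diagram for the MySQL reading path of the flink-cdc-connectors project:
![MySQLSource UML class diagram](../pic/flink_cdc_MySQLSource.png)
- The key class in the previous step is the MySQL source class, com.alibaba.ververica.cdc.connectors.mysql.MySQLSource.
It is a builder that collects the connection parameters and the startup mode, and finally creates the class that reads through Debezium:
com.alibaba.ververica.cdc.debezium.DebeziumSourceFunction. This is a Flink SourceFunction,
and it reads both the snapshot and the incremental binlog.
The class-level Javadoc describes it as follows: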
```java
/**
 * The {@link DebeziumSourceFunction} is a streaming data source that pulls captured change data
 * from databases into Flink.
 *
 * <p>There are two workers during the runtime. One worker periodically pulls records from the
 * database and pushes the records into the {@link Handover}. The other worker consumes the records
 * from the {@link Handover} and convert the records to the data in Flink style. The reason why
 * don't use one workers is because debezium has different behaviours in snapshot phase and
 * streaming phase.
 *
 * <p>Here we use the {@link Handover} as the buffer to submit data from the producer to the
 * consumer. Because the two threads don't communicate to each other directly, the error reporting
 * also relies on {@link Handover}. When the engine gets errors, the engine uses the {@link
 * DebeziumEngine.CompletionCallback} to report errors to the {@link Handover} and wakes up the
 * consumer to check the error. However, the source function just closes the engine and wakes up the
 * producer if the error is from the Flink side.
 *
 * <p>If the execution is canceled or finish(only snapshot phase), the exit logic is as same as the
 * logic in the error reporting.
 *
 * <p>The source function participates in checkpointing and guarantees that no data is lost during a
 * failure, and that the computation processes elements "exactly once".
 *
 * <p>Note: currently, the source function can't run in multiple parallel instances.
 *
 * <p>Please refer to Debezium's documentation for the available configuration properties:
 * https://debezium.io/documentation/reference/1.2/development/engine.html#engine-properties
 */
```
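
The Javadoc above describes a producer thread and a consumer thread that exchange records and errors through a Handover buffer. The following is a minimal sketch of that idea using only the JDK. `HandoverSketch` and its methods `produce`, `pollNext`, and `reportError` are hypothetical names chosen for illustration, not the project's actual `Handover` API, and records are plain strings rather than Debezium change events.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified, hypothetical stand-in for the Handover idea; not the project's class. */
public class HandoverSketch {
    private final Object lock = new Object();
    private List<String> batch;   // pending records, null when the buffer is empty
    private Throwable error;      // error reported by the producer side, if any

    /** Producer side: waits until the previous batch was taken, then hands over a new one. */
    public void produce(List<String> records) throws InterruptedException {
        synchronized (lock) {
            while (batch != null) {
                lock.wait();
            }
            batch = records;
            lock.notifyAll();
        }
    }

    /** Consumer side: waits for a batch or an error; pending records are delivered first. */
    public List<String> pollNext() throws InterruptedException {
        synchronized (lock) {
            while (batch == null && error == null) {
                lock.wait();
            }
            if (batch != null) {
                List<String> result = batch;
                batch = null;
                lock.notifyAll(); // wake the producer so it can hand over the next batch
                return result;
            }
            throw new IllegalStateException("producer failed", error);
        }
    }

    /** Producer side: records the failure and wakes the consumer so it can see it. */
    public void reportError(Throwable t) {
        synchronized (lock) {
            if (error == null) {
                error = t;
            }
            lock.notifyAll();
        }
    }

    /** Runs one producer thread against a consumer loop and returns what the consumer saw. */
    public static List<String> runDemo() throws InterruptedException {
        HandoverSketch handover = new HandoverSketch();
        Thread producer = new Thread(() -> {
            try {
                handover.produce(Arrays.asList("insert#1", "update#1"));
                handover.produce(Arrays.asList("delete#1"));
                handover.reportError(new RuntimeException("simulated engine failure"));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        List<String> seen = new ArrayList<>();
        try {
            while (true) {
                seen.addAll(handover.pollNext());
            }
        } catch (IllegalStateException e) {
            seen.add("error: " + e.getCause().getMessage());
        }
        producer.join();
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo());
    }
}
```

In this sketch the two threads never call each other directly: the consumer learns about the producer's failure only by polling the shared buffer, which is the arrangement the Javadoc describes for error reporting.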
3. Overall CDC call flow diagram
![Flink CDC call flow diagram](../pic/flink_mysql_cdc.png)
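## Preliminary conclusions
- Because data is synchronized through the Debezium engine embedded in the source function, only a parallelism of 1 is supported for now. According to the project's issues, multi-parallelism support is under development; this should be re-checked against a newer release.
- No separate Kafka or Debezium deployment is needed, so the overall architecture is simpler.
- If the existing architecture already has Kafka deployed and you want an intermediate buffer for decoupling, or you need multiple topics and partitions to raise parallelism, Kafka still has to be kept for now.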
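
To illustrate why multiple partitions can raise parallelism without breaking per-row ordering, here is a small JDK-only sketch that routes change events to partitions by primary key. Everything in it is an assumption for illustration: `KeyPartitionSketch`, `partitionFor`, and `route` are hypothetical names, and the `hashCode`-based scheme is not Kafka's own partitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: routes change events to partitions by primary key. */
public class KeyPartitionSketch {

    /** The same key always maps to the same partition (hypothetical hash scheme). */
    public static int partitionFor(String primaryKey, int numPartitions) {
        return Math.floorMod(primaryKey.hashCode(), numPartitions);
    }

    /** Each event is {primaryKey, operation}; returns one ordered list per partition. */
    public static List<List<String>> route(List<String[]> events, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String[] event : events) {
            int target = partitionFor(event[0], numPartitions);
            partitions.get(target).add(event[0] + ":" + event[1]);
        }
        return partitions;
    }

    /** Made-up change events for two rows, interleaved in binlog order. */
    public static List<String[]> sampleEvents() {
        return Arrays.asList(
                new String[] {"order-1", "insert"},
                new String[] {"order-2", "insert"},
                new String[] {"order-1", "update"},
                new String[] {"order-2", "update"},
                new String[] {"order-1", "delete"});
    }

    public static void main(String[] args) {
        List<List<String>> partitions = route(sampleEvents(), 3);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("partition " + i + ": " + partitions.get(i));
        }
    }
}
```

Since all events for one key land in one partition in their original order, each partition can be consumed by a separate worker while changes to any single row are still applied in sequence.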