## Flink CDC

### Reference documents

- Introduction: https://developer.aliyun.com/article/777502
- Project wiki: https://github.com/ververica/flink-cdc-connectors/wiki
- In-depth analysis: https://blog.csdn.net/Baron_ND/article/details/115752972

### Source-code deep dive

1. A sample Flink job that uses the connector:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import com.alibaba.ververica.cdc.connectors.mysql.MySQLSource;
import com.alibaba.ververica.cdc.debezium.StringDebeziumDeserializationSchema;

public class MySqlBinlogSourceExample {

  public static void main(String[] args) throws Exception {
    SourceFunction<String> sourceFunction = MySQLSource.<String>builder()
      .hostname("localhost")
      .port(3306)
      .databaseList("inventory") // monitor all tables under the inventory database
      .username("flinkuser")
      .password("flinkpw")
      .deserializer(new StringDebeziumDeserializationSchema()) // converts SourceRecord to String
      .build();

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    env
      .addSource(sourceFunction)
      .print().setParallelism(1); // use parallelism 1 for the sink to keep message ordering

    env.execute();
  }
}
```
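
The builder's `deserializer(...)` hook is worth a note: `StringDebeziumDeserializationSchema` essentially emits each Debezium `SourceRecord` as its `toString()`, so jobs that want a different output shape typically supply their own `DebeziumDeserializationSchema`. Below is a minimal sketch of such a schema; the class name and the `"topic: value"` output format are invented for illustration, and the two overridden methods are assumed to match the connector's `DebeziumDeserializationSchema` interface.

```java
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.util.Collector;

import com.alibaba.ververica.cdc.debezium.DebeziumDeserializationSchema;
import org.apache.kafka.connect.source.SourceRecord;

/** Hypothetical schema that emits "<topic>: <value>" strings instead of the raw toString(). */
public class TopicPrefixedDeserializationSchema implements DebeziumDeserializationSchema<String> {

  @Override
  public void deserialize(SourceRecord record, Collector<String> out) throws Exception {
    // record.topic() encodes server/database/table; record.value() holds the change envelope.
    out.collect(record.topic() + ": " + String.valueOf(record.value()));
  }

  @Override
  public TypeInformation<String> getProducedType() {
    return BasicTypeInfo.STRING_TYPE_INFO;
  }
}
```

It would be plugged in via `.deserializer(new TopicPrefixedDeserializationSchema())` in the builder chain above.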

2. The UML class diagram for the MySQL-reading part of the flink-cdc-connectors project, summarized below:

![MySQLSource UML class diagram](../pic/flink_cdc_MySQLSource.png)

- The key class in the step above is the MySQL source class, `com.alibaba.ververica.cdc.connectors.mysql.MySQLSource`. It is a builder that organizes the connection parameters and the startup mode, and finally creates the Debezium reader class `com.alibaba.ververica.cdc.debezium.DebeziumSourceFunction`, a Flink `SourceFunction` that performs both the snapshot read and the incremental binlog read. Its class-level Javadoc reads:

```java
/**
 * The {@link DebeziumSourceFunction} is a streaming data source that pulls captured change data
 * from databases into Flink.
 *
 * <p>There are two workers during the runtime. One worker periodically pulls records from the
 * database and pushes the records into the {@link Handover}. The other worker consumes the records
 * from the {@link Handover} and convert the records to the data in Flink style. The reason why
 * don't use one workers is because debezium has different behaviours in snapshot phase and
 * streaming phase.
 *
 * <p>Here we use the {@link Handover} as the buffer to submit data from the producer to the
 * consumer. Because the two threads don't communicate to each other directly, the error reporting
 * also relies on {@link Handover}. When the engine gets errors, the engine uses the {@link
 * DebeziumEngine.CompletionCallback} to report errors to the {@link Handover} and wakes up the
 * consumer to check the error. However, the source function just closes the engine and wakes up
 * the producer if the error is from the Flink side.
 *
 * <p>If the execution is canceled or finish(only snapshot phase), the exit logic is as same as the
 * logic in the error reporting.
 *
 * <p>The source function participates in checkpointing and guarantees that no data is lost during
 * a failure, and that the computation processes elements "exactly once".
 *
 * <p>Note: currently, the source function can't run in multiple parallel instances.
 *
 * <p>Please refer to Debezium's documentation for the available configuration properties:
 * https://debezium.io/documentation/reference/1.2/development/engine.html#engine-properties
 */
```
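
To make the two-worker hand-off described in this Javadoc concrete, here is a simplified sketch of the pattern. It is not the connector's actual `Handover` implementation (which also handles wake-up and close semantics); it only illustrates the assumed core idea of a one-slot buffer that carries both record batches and errors between the Debezium engine thread and the Flink source thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the connector's Handover: a one-slot buffer shared by two threads. */
public class SimpleHandover {

  private final Object lock = new Object();
  private List<String> pending;   // batch pushed by the producer (Debezium engine thread)
  private Throwable error;        // error reported by either side

  /** Producer side: blocks until the previous batch has been taken, then publishes a new one. */
  public void produce(List<String> records) throws InterruptedException {
    synchronized (lock) {
      while (pending != null && error == null) {
        lock.wait();              // wait for the consumer to drain the slot
      }
      pending = new ArrayList<>(records);
      lock.notifyAll();
    }
  }

  /** Consumer side: blocks until a batch or an error is available; rethrows reported errors. */
  public List<String> pollNext() throws Exception {
    synchronized (lock) {
      while (pending == null && error == null) {
        lock.wait();
      }
      if (error != null) {
        throw new Exception("engine failed", error); // surface the producer's failure in Flink
      }
      List<String> batch = pending;
      pending = null;
      lock.notifyAll();           // wake the producer so it can publish the next batch
      return batch;
    }
  }

  /** Either side can report an error and wake up whoever is currently waiting. */
  public void reportError(Throwable t) {
    synchronized (lock) {
      error = t;
      lock.notifyAll();
    }
  }
}
```

In this sketch the consumer side would be polled from the source function's run loop, so an error raised by the engine surfaces as a Flink job failure instead of being silently swallowed.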

3. Overall call-flow diagram for the CDC path:

![Flink CDC call-flow diagram](../pic/flink_mysql_cdc.png)

## Preliminary conclusions

- Because data synchronization currently goes through the embedded Debezium engine, the source only supports a parallelism of 1; according to feedback in the project issues, multi-parallelism support is under development and remains to be confirmed in a newer release.

- It removes the need to deploy Kafka and a standalone Debezium service, so the overall architecture is fairly simple.

- If the existing architecture already includes a Kafka deployment, and an intermediate buffer is wanted for decoupling, or multiple topics and partitions are needed to raise the degree of parallelism, then Kafka still has to be kept for now; a sketch of forwarding the CDC stream into Kafka follows below.
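
As a hedged illustration of the last point: if Kafka is kept as the decoupling buffer, the CDC stream can simply be forwarded into a topic with Flink's Kafka sink, roughly as below. The topic name `mysql-cdc-raw` and the broker address are placeholders, and the exact `FlinkKafkaProducer` constructor may differ between Flink versions.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

import com.alibaba.ververica.cdc.connectors.mysql.MySQLSource;
import com.alibaba.ververica.cdc.debezium.StringDebeziumDeserializationSchema;

public class MySqlCdcToKafkaExample {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Same MySQL CDC source as in the first example.
    SourceFunction<String> sourceFunction = MySQLSource.<String>builder()
        .hostname("localhost")
        .port(3306)
        .databaseList("inventory")
        .username("flinkuser")
        .password("flinkpw")
        .deserializer(new StringDebeziumDeserializationSchema())
        .build();

    DataStream<String> changes = env.addSource(sourceFunction);

    // Hand the change events to Kafka so downstream consumers stay decoupled from Flink.
    Properties producerProps = new Properties();
    producerProps.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
    changes.addSink(new FlinkKafkaProducer<>(
        "mysql-cdc-raw",            // placeholder topic name
        new SimpleStringSchema(),   // write each change event as a plain string
        producerProps));

    env.execute("mysql-cdc-to-kafka");
  }
}
```

This keeps the single-parallelism CDC source as the only producer while letting downstream consumers scale out on the Kafka side.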
