## Flink CDC

### Reference documents

- Introduction: https://developer.aliyun.com/article/777502
- Project wiki: https://github.com/ververica/flink-cdc-connectors/wiki
- In-depth analysis: https://blog.csdn.net/Baron_ND/article/details/115752972

### Source-code deep dive

1. A sample Flink job that uses the connector:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import com.alibaba.ververica.cdc.connectors.mysql.MySQLSource;
import com.alibaba.ververica.cdc.debezium.StringDebeziumDeserializationSchema;

public class MySqlBinlogSourceExample {

  public static void main(String[] args) throws Exception {
    SourceFunction<String> sourceFunction = MySQLSource.<String>builder()
      .hostname("localhost")
      .port(3306)
      .databaseList("inventory") // monitor all tables under the inventory database
      .username("flinkuser")
      .password("flinkpw")
      .deserializer(new StringDebeziumDeserializationSchema()) // converts SourceRecord to String
      .build();

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    env
      .addSource(sourceFunction)
      .print().setParallelism(1); // use parallelism 1 for the sink to keep message ordering

    env.execute();
  }
}
```
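
The builder's `deserializer(...)` hook is worth a note: `StringDebeziumDeserializationSchema` essentially emits each Debezium `SourceRecord` as its `toString()`, so jobs that want a different output shape typically supply their own `DebeziumDeserializationSchema`. Below is a minimal sketch of such a schema; the class name and the `"topic: value"` output format are invented for illustration, and the two overridden methods are assumed to match the connector's `DebeziumDeserializationSchema` interface.

```java
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.util.Collector;

import com.alibaba.ververica.cdc.debezium.DebeziumDeserializationSchema;
import org.apache.kafka.connect.source.SourceRecord;

/** Hypothetical schema that emits "<topic>: <value>" strings instead of the raw toString(). */
public class TopicPrefixedDeserializationSchema implements DebeziumDeserializationSchema<String> {

  @Override
  public void deserialize(SourceRecord record, Collector<String> out) throws Exception {
    // record.topic() encodes server/database/table; record.value() holds the change envelope.
    out.collect(record.topic() + ": " + String.valueOf(record.value()));
  }

  @Override
  public TypeInformation<String> getProducedType() {
    return BasicTypeInfo.STRING_TYPE_INFO;
  }
}
```

It would be plugged in via `.deserializer(new TopicPrefixedDeserializationSchema())` in the builder chain above.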

2. The UML class diagram for the MySQL-reading part of the flink-cdc-connectors project, summarized below:

![MySQLSource UML class diagram](../pic/flink_cdc_MySQLSource.png)

- The key class in the step above is the MySQL source class, `com.alibaba.ververica.cdc.connectors.mysql.MySQLSource`. It is a builder that organizes the connection parameters and the startup mode, and finally creates the Debezium reader class `com.alibaba.ververica.cdc.debezium.DebeziumSourceFunction`, a Flink `SourceFunction` that performs both the snapshot read and the incremental binlog read. Its class-level Javadoc reads:

```java
/**
 * The {@link DebeziumSourceFunction} is a streaming data source that pulls captured change data
 * from databases into Flink.
 *
 * <p>There are two workers during the runtime. One worker periodically pulls records from the
 * database and pushes the records into the {@link Handover}. The other worker consumes the records
 * from the {@link Handover} and convert the records to the data in Flink style. The reason why
 * don't use one workers is because debezium has different behaviours in snapshot phase and
 * streaming phase.
 *
 * <p>Here we use the {@link Handover} as the buffer to submit data from the producer to the
 * consumer. Because the two threads don't communicate to each other directly, the error reporting
 * also relies on {@link Handover}. When the engine gets errors, the engine uses the {@link
 * DebeziumEngine.CompletionCallback} to report errors to the {@link Handover} and wakes up the
 * consumer to check the error. However, the source function just closes the engine and wakes up
 * the producer if the error is from the Flink side.
 *
 * <p>If the execution is canceled or finish(only snapshot phase), the exit logic is as same as the
 * logic in the error reporting.
 *
 * <p>The source function participates in checkpointing and guarantees that no data is lost during
 * a failure, and that the computation processes elements "exactly once".
 *
 * <p>Note: currently, the source function can't run in multiple parallel instances.
 *
 * <p>Please refer to Debezium's documentation for the available configuration properties:
 * https://debezium.io/documentation/reference/1.2/development/engine.html#engine-properties
 */
```
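
To make the two-worker hand-off described in this Javadoc concrete, here is a simplified sketch of the pattern. It is not the connector's actual `Handover` implementation (which also handles wake-up and close semantics); it only illustrates the assumed core idea of a one-slot buffer that carries both record batches and errors between the Debezium engine thread and the Flink source thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the connector's Handover: a one-slot buffer shared by two threads. */
public class SimpleHandover {

  private final Object lock = new Object();
  private List<String> pending;   // batch pushed by the producer (Debezium engine thread)
  private Throwable error;        // error reported by either side

  /** Producer side: blocks until the previous batch has been taken, then publishes a new one. */
  public void produce(List<String> records) throws InterruptedException {
    synchronized (lock) {
      while (pending != null && error == null) {
        lock.wait();              // wait for the consumer to drain the slot
      }
      pending = new ArrayList<>(records);
      lock.notifyAll();
    }
  }

  /** Consumer side: blocks until a batch or an error is available; rethrows reported errors. */
  public List<String> pollNext() throws Exception {
    synchronized (lock) {
      while (pending == null && error == null) {
        lock.wait();
      }
      if (error != null) {
        throw new Exception("engine failed", error); // surface the producer's failure in Flink
      }
      List<String> batch = pending;
      pending = null;
      lock.notifyAll();           // wake the producer so it can publish the next batch
      return batch;
    }
  }

  /** Either side can report an error and wake up whoever is currently waiting. */
  public void reportError(Throwable t) {
    synchronized (lock) {
      error = t;
      lock.notifyAll();
    }
  }
}
```

In this sketch the consumer side would be polled from the source function's run loop, so an error raised by the engine surfaces as a Flink job failure instead of being silently swallowed.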

3. Overall call-flow diagram for the CDC path:

![Flink CDC call-flow diagram](../pic/flink_mysql_cdc.png)

## Preliminary conclusions

- Because data synchronization currently goes through the embedded Debezium engine, the source only supports a parallelism of 1; according to feedback in the project issues, multi-parallelism support is under development and remains to be confirmed in a newer release.

- It removes the need to deploy Kafka and a standalone Debezium service, so the overall architecture is fairly simple.

- If the existing architecture already includes a Kafka deployment, and an intermediate buffer is wanted for decoupling, or multiple topics and partitions are needed to raise the degree of parallelism, then Kafka still has to be kept for now; a sketch of forwarding the CDC stream into Kafka follows below.
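
As a hedged illustration of the last point: if Kafka is kept as the decoupling buffer, the CDC stream can simply be forwarded into a topic with Flink's Kafka sink, roughly as below. The topic name `mysql-cdc-raw` and the broker address are placeholders, and the exact `FlinkKafkaProducer` constructor may differ between Flink versions.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

import com.alibaba.ververica.cdc.connectors.mysql.MySQLSource;
import com.alibaba.ververica.cdc.debezium.StringDebeziumDeserializationSchema;

public class MySqlCdcToKafkaExample {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Same MySQL CDC source as in the first example.
    SourceFunction<String> sourceFunction = MySQLSource.<String>builder()
        .hostname("localhost")
        .port(3306)
        .databaseList("inventory")
        .username("flinkuser")
        .password("flinkpw")
        .deserializer(new StringDebeziumDeserializationSchema())
        .build();

    DataStream<String> changes = env.addSource(sourceFunction);

    // Hand the change events to Kafka so downstream consumers stay decoupled from Flink.
    Properties producerProps = new Properties();
    producerProps.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
    changes.addSink(new FlinkKafkaProducer<>(
        "mysql-cdc-raw",            // placeholder topic name
        new SimpleStringSchema(),   // write each change event as a plain string
        producerProps));

    env.execute("mysql-cdc-to-kafka");
  }
}
```

This keeps the single-parallelism CDC source as the only producer while letting downstream consumers scale out on the Kafka side.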
