Flink CDC Investigation

Flink CDC

Reference documents

  • Introduction
    https://developer.aliyun.com/article/777502

  • Project wiki
    https://github.com/ververica/flink-cdc-connectors/wiki

  • In-depth analysis reference
    https://blog.csdn.net/Baron_ND/article/details/115752972

Source-code deep dive

  1. A Flink DataStream API invocation example is shown below:
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import com.alibaba.ververica.cdc.debezium.StringDebeziumDeserializationSchema;
import com.alibaba.ververica.cdc.connectors.mysql.MySQLSource;

public class MySqlBinlogSourceExample {
  public static void main(String[] args) throws Exception {
    SourceFunction<String> sourceFunction = MySQLSource.<String>builder()
      .hostname("localhost")
      .port(3306)
      .databaseList("inventory") // monitor all tables under inventory database
      .username("flinkuser")
      .password("flinkpw")
      .deserializer(new StringDebeziumDeserializationSchema()) // converts SourceRecord to String
      .build();

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    env
      .addSource(sourceFunction)
      .print().setParallelism(1); // use parallelism 1 for sink to keep message ordering

    env.execute();
  }
}
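
Under the hood, MySQLSource.builder().build() essentially translates the options above into Debezium MySQL connector properties and wraps them in a DebeziumSourceFunction. The snippet below is a simplified, hypothetical sketch of that wiring, not the project's actual build() implementation; the property names follow Debezium 1.2's MySQL connector documentation, and the real builder also handles server id, time zone, extra Debezium properties, and a more elaborate constructor.

// Simplified sketch of MySQLSource.Builder#build(), for illustration only.
public DebeziumSourceFunction<T> build() {
  Properties props = new Properties();
  props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
  props.setProperty("database.hostname", hostname);
  props.setProperty("database.port", String.valueOf(port));
  props.setProperty("database.user", username);
  props.setProperty("database.password", password);
  // in Debezium 1.2 the database filter property is "database.whitelist"
  props.setProperty("database.whitelist", String.join(",", databaseList));
  // the configured startup mode maps to Debezium's snapshot.mode
  props.setProperty("snapshot.mode", "initial");
  return new DebeziumSourceFunction<>(deserializer, props);
}
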
  2. The UML class diagram for the MySQL reading path of the flink-cdc-connectors project is summarized below:


    (figure: flink_cdc_MySQLSource.png)
  • The key piece of the step above is the MySQL source class com.alibaba.ververica.cdc.connectors.mysql.MySQLSource;
    it is a builder that assembles the connection parameters and the startup mode, and finally creates the Debezium reader
    class com.alibaba.ververica.cdc.debezium.DebeziumSourceFunction, a Flink SourceFunction responsible for reading both
    the initial snapshot and the incremental binlog.
    The Javadoc of this class:
/**
 * The {@link DebeziumSourceFunction} is a streaming data source that pulls captured change data
 * from databases into Flink.
 *
 * <p>There are two workers during the runtime. One worker periodically pulls records from the
 * database and pushes the records into the {@link Handover}. The other worker consumes the records
 * from the {@link Handover} and convert the records to the data in Flink style. The reason why
 * don't use one workers is because debezium has different behaviours in snapshot phase and
 * streaming phase.
 *
 * <p>Here we use the {@link Handover} as the buffer to submit data from the producer to the
 * consumer. Because the two threads don't communicate to each other directly, the error reporting
 * also relies on {@link Handover}. When the engine gets errors, the engine uses the {@link
 * DebeziumEngine.CompletionCallback} to report errors to the {@link Handover} and wakes up the
 * consumer to check the error. However, the source function just closes the engine and wakes up
 * the producer if the error is from the Flink side.
 *
 * <p>If the execution is canceled or finish(only snapshot phase), the exit logic is as same as the
 * logic in the error reporting.
 *
 * <p>The source function participates in checkpointing and guarantees that no data is lost during a
 * failure, and that the computation processes elements "exactly once".
 *
 * <p>Note: currently, the source function can't run in multiple parallel instances.
 *
 * <p>Please refer to Debezium's documentation for the available configuration properties:
 * https://debezium.io/documentation/reference/1.2/development/engine.html#engine-properties
 */
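
The two-worker design described in the Javadoc can be summarized with a small, self-contained sketch: the producer thread (the Debezium engine) hands batches to the consumer thread (the Flink source) through a single-slot buffer, which is roughly the role Handover plays. The class below is an illustrative stand-in, not the connector's actual Handover implementation.

import java.util.List;

// Illustrative stand-in for the Handover pattern described above; NOT the
// connector's actual code. One thread produces batches of change records,
// the other consumes them, and errors travel through the same object.
public class HandoverSketch {
  private List<String> next;   // single-slot buffer guarded by the object monitor
  private Throwable error;     // error reported by the engine's completion callback

  // called by the producer (Debezium engine) thread
  public synchronized void produce(List<String> records) throws InterruptedException {
    while (next != null && error == null) {
      wait();                  // back-pressure: wait until the consumer drained the slot
    }
    next = records;
    notifyAll();               // wake up the consumer
  }

  // called by the consumer (Flink source) thread
  public synchronized List<String> pollNext() throws Exception {
    while (next == null && error == null) {
      wait();                  // wait until the producer published a batch
    }
    if (error != null) {
      throw new Exception("Debezium engine failed", error);
    }
    List<String> records = next;
    next = null;
    notifyAll();               // wake up the producer
    return records;
  }

  // called when the engine fails, so the waiting side can observe the error
  public synchronized void reportError(Throwable t) {
    error = t;
    notifyAll();               // wake up whichever thread is currently waiting
  }
}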

  3. Overall flow chart of the CDC call path:


    (figure: flink_mysql_cdc.png)

Preliminary conclusions

  • Because data synchronization currently relies on a single embedded Debezium engine, the source only supports a parallelism of 1; according to the project's issues, multi-parallelism support is under development and needs to be verified against a future release.
  • Deploying Kafka and a standalone Debezium service is no longer needed, so the overall architecture is simpler.
  • If the existing architecture already runs Kafka, and a buffer is wanted in the middle for decoupling, or multiple topics and partitions are needed to raise the degree of parallelism, then Kafka still has to be kept; a sketch of that setup follows below.
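
For the last point, the CDC stream can still be written into Kafka as a decoupling buffer. A minimal sketch, reusing env and sourceFunction from the example above and assuming the flink-connector-kafka dependency; the broker address and topic name ("inventory_cdc") are placeholders:

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

// Forward the CDC change stream into Kafka so downstream consumers stay decoupled.
Properties kafkaProps = new Properties();
kafkaProps.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker

env
  .addSource(sourceFunction)                 // the single-parallelism CDC source built earlier
  .addSink(new FlinkKafkaProducer<>(
      "inventory_cdc",                       // hypothetical target topic
      new SimpleStringSchema(),              // records are already deserialized to String
      kafkaProps))
  .setParallelism(1);                        // keep ordering consistent with the source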
