Table of Contents
I: What is CDC? What are its use cases?
II: What technologies are available today?
Query-based CDC:
Log-based CDC:
III: A Flink CDC demo: MySQL to MySQL
1- MySQL must have binlog enabled
2- Create a user with SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT privileges. RELOAD is required
3- Put the flink-cdc-connectors jar into the /lib directory
4- Add the Maven dependency (note: do not bundle it when packaging the jar, or the job will fail)
5.1- The SQL implementation
I: What is CDC? What are its use cases?
CDC stands for Change Data Capture. In data processing, CDC usually refers to techniques for capturing data changes made in a database.
Its main use cases are:
1- Data synchronization, for backup and disaster recovery
2- Data distribution, fanning one source out to multiple downstream consumers
3- Data extraction (the E in ETL), for data integration into data warehouses / data lakes
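To make "capturing changes" concrete: Flink CDC embeds Debezium under the hood, and each captured change is an envelope roughly like the following (an annotated, illustrative example, not real output):
{
  "before": null,                         // row image before the change; null for an INSERT
  "after": {"id": 1, "name": "abc"},      // row image after the change
  "source": {"db": "bigdata", "table": "person"},
  "op": "c",                              // c = create, u = update, d = delete, r = snapshot read
  "ts_ms": 1627884000000
}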
II: What technologies are available today?
By implementation mechanism, CDC techniques fall into two camps: query-based and log-based.
Query-based CDC:
Runs SELECT scans over the table and filters out the changed rows. In practice this means:
Offline scheduled query jobs, i.e. batch processing: to sync a table to another system, every run queries the table for its latest state;
No consistency guarantee: the data may have changed several times while the query was running;
No freshness guarantee: offline scheduling has inherent latency;
Extra query load on the database. A minimal sketch of this polling pattern follows below.
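A sketch of query-based polling, under the assumption that the table carries an indexed audit column (update_time below is hypothetical, not part of the demo schema later in this article):
-- Poll for rows changed since the previous run; the watermark comes from the last poll.
-- This still misses intermediate states between two polls and cannot see hard DELETEs.
SELECT id, name
FROM person
WHERE update_time > '2021-08-01 00:00:00';  -- watermark saved from the previous run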
Log-based CDC:
Continuously reads the database's operation log in real time, e.g. MySQL's binlog; this is the approach Flink CDC takes.
III: A Flink CDC demo: MySQL to MySQL
Because our business scenario requires near-real-time latency (minute level), we had to adopt a binlog-based technique (a Canal demo can be found in another of my articles). We also need to load the full existing data on initialization (MySQL to Kudu), for which Canal depends on additional components, and we must guarantee data completeness and consistency (no loss, no duplicates) while keeping the impact on the database small (the naive approach locks the table, loads the full data, then switches to incremental). All of that is cumbersome to operate, which is where Flink CDC shines (see the references for how it handles the full snapshot, the incremental phase, and exactly-once semantics).
1- MySQL must have binlog enabled
Prerequisite: MySQL must be 5.7 or 8.0.x. In my.cnf:
server-id = # any value, but it must be unique within the replication topology
log_bin = # any value; it becomes the prefix of the binlog file names
binlog_format = ROW # must be ROW
binlog_row_image = FULL # must be FULL
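After restarting MySQL, you can verify that the settings took effect with standard MySQL statements:
SHOW VARIABLES LIKE 'log_bin';          -- expect ON
SHOW VARIABLES LIKE 'binlog_format';    -- expect ROW
SHOW VARIABLES LIKE 'binlog_row_image'; -- expect FULL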
2- Create a user with SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT privileges. RELOAD is required
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO ''@'' IDENTIFIED BY '';
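Note that the GRANT ... IDENTIFIED BY form above only works on MySQL 5.7 and earlier; MySQL 8.0 removed it, so there you must create the user first. A sketch, where the user name flink_cdc, the host '%' and the password are placeholders of my own choosing:
CREATE USER 'flink_cdc'@'%' IDENTIFIED BY '<password>';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'flink_cdc'@'%';
FLUSH PRIVILEGES;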
3- Put the flink-cdc-connectors jar into the /lib directory
Flink must be 1.12 or later; if you use Flink CDC 2.0 with Flink SQL, it must be 1.13, on Java 8.
Download the flink-cdc-connectors release (it covers MySQL, Postgres and MongoDB) and place the jar in Flink's /lib directory.
If the jar is not in place, you may hit a "could not find factory" error:
org.apache.flink.table.api.ValidationException: Could not find any factory for identifier 'mysql-cdc' that implements 'org.apache.flink.table.factories.DynamicTableFactory' in the classpath.
4- Add the Maven dependency (note: do not bundle it when packaging the jar, or the job will fail)
<dependency>
    <groupId>com.ververica</groupId>
    <artifactId>flink-connector-mysql-cdc</artifactId>
    <version>2.0.0</version>
    <!-- provided: do not package this into the fat jar, or running it fails with the error below -->
    <scope>provided</scope>
</dependency>
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Unable to instantiate java compiler
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372)
at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)
at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:812)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:246)
at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1054)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: java.lang.IllegalStateException: Unable to instantiate java compiler
at org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.compile(JaninoRelMetadataProvider.java:428)
at org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.load3(JaninoRelMetadataProvider.java:374)
at org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.lambda$static$0(JaninoRelMetadataProvider.java:109)
at org.apache.flink.calcite.shaded.com.google.common.cache.CacheLoader$FunctionToCacheLoader.load(CacheLoader.java:165)
at org.apache.flink.calcite.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
at org.apache.flink.calcite.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
at org.apache.flink.calcite.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155)
at org.apache.flink.calcite.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
at org.apache.flink.calcite.shaded.com.google.common.cache.LocalCache.get(LocalCache.java:3951)
at org.apache.flink.calcite.shaded.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3974)
at org.apache.flink.calcite.shaded.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4958)
at org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.create(JaninoRelMetadataProvider.java:469)
at org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.revise(JaninoRelMetadataProvider.java:481)
at org.apache.calcite.rel.metadata.RelMetadataQueryBase.revise(RelMetadataQueryBase.java:95)
at org.apache.calcite.rel.metadata.RelMetadataQuery.getPulledUpPredicates(RelMetadataQuery.java:784)
at org.apache.calcite.rel.rules.ReduceExpressionsRule$ProjectReduceExpressionsRule.onMatch(ReduceExpressionsRule.java:303)
at org.apache.calcite.plan.AbstractRelOptPlanner.fireRule(AbstractRelOptPlanner.java:333)
at org.apache.calcite.plan.hep.HepPlanner.applyRule(HepPlanner.java:542)
...
Caused by: java.lang.ClassCastException: org.codehaus.janino.CompilerFactory cannot be cast to org.codehaus.commons.compiler.ICompilerFactory
at org.codehaus.commons.compiler.CompilerFactoryFactory.getCompilerFactory(CompilerFactoryFactory.java:129)
at org.codehaus.commons.compiler.CompilerFactoryFactory.getDefaultCompilerFactory(CompilerFactoryFactory.java:79)
at org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.compile(JaninoRelMetadataProvider.java:426)
The complete pom file (it covers the dependencies for both the DataStream implementation and the SQL implementation):
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>Flink</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <flink.version>1.13.0</flink.version>
        <scala.binary.version>2.12</scala.binary.version>
        <mysql.version>5.1.49</mysql.version>
        <flinkcdc.version>2.0.0</flinkcdc.version>
        <fastjson.version>1.2.75</fastjson.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner-blink_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.ververica</groupId>
            <artifactId>flink-connector-mysql-cdc</artifactId>
            <version>${flinkcdc.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-jdbc_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.version}</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/java/</sourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                    <optimize>true</optimize>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.8.1</version>
                <configuration>
                    <includes>
                        <include>**/*.java</include>
                    </includes>
                    <skipTests>true</skipTests>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Marking the Flink jars as provided is also what the official docs recommend.
Local test: read from bigdata.person and write to bigdata.person1.
Table DDL:
CREATE DATABASE bigdata;
use bigdata;
CREATE TABLE person(
id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(20) NOT NULL DEFAULT ""
) ENGINE=INNODB DEFAULT CHARSET=UTF8 ;
CREATE TABLE person1(
id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(20) NOT NULL DEFAULT ""
) ENGINE=INNODB DEFAULT CHARSET=UTF8 ;
Write some data into the database
# Function that builds a random string
# (with binlog enabled, MySQL may reject non-deterministic functions with error 1418 unless log_bin_trust_function_creators is set to 1)
DELIMITER $$
CREATE FUNCTION rand_string(n INT) RETURNS VARCHAR(255)
BEGIN
DECLARE chars_str VARCHAR(100) DEFAULT 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
DECLARE return_str VARCHAR(255) DEFAULT '';
DECLARE i INT DEFAULT 0;
WHILE i < n DO
SET return_str =CONCAT(return_str,SUBSTRING(chars_str,FLOOR(1+RAND()*52),1));
SET i = i + 1;
END WHILE;
RETURN return_str;
END $$
# Stored procedure that inserts rows into the table
DELIMITER $$
CREATE PROCEDURE insert_person(IN START INT(10),IN max_num INT(10))
BEGIN
DECLARE i INT DEFAULT 0;
# Set autocommit to 0 so all inserts are committed as one batch at the end
SET autocommit = 0;
REPEAT
SET i = i + 1;
INSERT INTO person (name ) VALUES (rand_string(6));
UNTIL i = max_num
END REPEAT;
COMMIT;
END $$
# Call the stored procedure to insert rows (the second argument is the row count; the first argument is not used in the body)
DELIMITER ;
CALL insert_person(100001,5000);
5.1- The SQL implementation
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class Mysql2MysqlLocal {
    public static void main(String[] args) throws Exception {
        EnvironmentSettings envSettings = EnvironmentSettings.newInstance()
                .useBlinkPlanner()
                .inStreamingMode()
                .build();
        TableEnvironment tableEnv = TableEnvironment.create(envSettings);

        // Source: captures the binlog of bigdata.person via the mysql-cdc connector
        String sourceDDL =
                "CREATE TABLE mysql_binlog (\n" +
                " id INT,\n" +
                " name STRING,\n" +
                " primary key (id) not enforced\n" +
                ") WITH (\n" +
                " 'connector' = 'mysql-cdc',\n" +
                " 'hostname' = '127.0.0.1',\n" +
                " 'port' = '3306',\n" +
                " 'username' = 'root',\n" +
                " 'password' = '123456',\n" +
                " 'database-name' = 'bigdata',\n" +
                " 'table-name' = 'person',\n" +
                " 'scan.startup.mode' = 'earliest-offset'\n" +
                ")";

        // Sink: writes into bigdata.person1 via the JDBC connector.
        // Note: com.mysql.cj.jdbc.Driver belongs to mysql-connector-java 8.x;
        // with the 5.1.49 driver from the pom, use com.mysql.jdbc.Driver instead.
        String sinkDDL =
                "CREATE TABLE test_cdc (" +
                " id INT," +
                " name STRING," +
                " primary key (id) not enforced" +
                ") WITH (" +
                " 'connector' = 'jdbc'," +
                " 'driver' = 'com.mysql.cj.jdbc.Driver'," +
                " 'url' = 'jdbc:mysql://127.0.0.1:3306/bigdata?serverTimezone=UTC&useSSL=false'," +
                " 'username' = 'root'," +
                " 'password' = '123456'," +
                " 'table-name' = 'person1'" +
                ")";

        // A plain insert-select from source to sink, no transformation
        String transformDmlSQL = "insert into test_cdc select * from mysql_binlog";

        tableEnv.executeSql(sourceDDL);
        tableEnv.executeSql(sinkDDL);
        // executeSql submits the job asynchronously; await() keeps the local process running
        tableEnv.executeSql(transformDmlSQL).await();
    }
}
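The pom above also pulls in what the DataStream implementation needs. As a minimal sketch, assuming the SourceFunction-style API that flink-connector-mysql-cdc 2.0 still ships (com.ververica.cdc.connectors.mysql.MySqlSource with the string Debezium deserializer); it prints the raw change events instead of writing to person1:
import com.ververica.cdc.connectors.mysql.MySqlSource;
import com.ververica.cdc.debezium.StringDebeziumDeserializationSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class Mysql2PrintDataStream {
    public static void main(String[] args) throws Exception {
        // Build a CDC source for bigdata.person; events arrive as Debezium-formatted strings
        SourceFunction<String> source = MySqlSource.<String>builder()
                .hostname("127.0.0.1")
                .port(3306)
                .databaseList("bigdata")
                .tableList("bigdata.person")
                .username("root")
                .password("123456")
                .deserializer(new StringDebeziumDeserializationSchema())
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(source).print(); // inspect the change stream on stdout
        env.execute("mysql-cdc-datastream-demo");
    }
}
The printed records follow the Debezium envelope shown in section I.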
Test steps:
case 1: incremental-only test
1- For a local test in IDEA, simply run the program
2- Call the stored procedure to write rows into person (single manual inserts work too)
3- Check that person1 contains the same number of rows
case 2: full + incremental test
1- Call the stored procedure to write 1,000,000 rows into person
2- Start the program in IDEA, then call the stored procedure to write another 1,000,000 rows into person
3- Check the row count of person1
case 3: incremental test with a locally running jar
1- Build the jar from IDEA (or your build tool)
2- Start the Flink cluster
3- Submit the job, e.g. bin/flink run -m 127.0.0.1:8081 -c Mysql2MysqlLocal <path-to-jar>
4- Write rows into person
5- Check the data in table person1
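For the "check the row count" steps, a single statement compares both tables, and a quick INSERT/UPDATE/DELETE by hand exercises all three change types (plain MySQL, using the demo's table names):
SELECT (SELECT COUNT(*) FROM person)  AS source_rows,
       (SELECT COUNT(*) FROM person1) AS sink_rows;
INSERT INTO person (name) VALUES ('cdc_test');                 -- should appear in person1
UPDATE person SET name = 'cdc_test2' WHERE name = 'cdc_test';  -- the update should propagate
DELETE FROM person WHERE name = 'cdc_test2';                   -- the delete should propagate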
That's all for now.
References: the Flink CDC GitHub repository
Flink Chinese Community | "Flink CDC 2.0 officially released: a deep dive into the core improvements"