Software versions
MySQL: 5.7
Hadoop: 3.1.3
Flink: 1.12.2
Hudi: 0.9.0
Hive: 2.3.7
1. Create the MySQL table and enable the binlog
create table users(
id bigint auto_increment primary key,
name varchar(20) null,
birthday timestamp default CURRENT_TIMESTAMP not null,
ts timestamp default CURRENT_TIMESTAMP not null
);
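The heading mentions enabling the binlog, but the settings themselves are not listed. The following is a minimal sketch, assuming a self-managed MySQL 5.7 whose configuration file you can edit. It writes the usual row-based binlog settings to a scratch file so they can be reviewed and merged into the [mysqld] section of my.cnf. The file name mysql-cdc.cnf, the server-id value, and the log base name mysql-bin are illustrative assumptions, not values from the original setup.

```shell
#!/bin/sh
# Write the binlog settings to a scratch file for review.
# File name, server-id and log base name are illustrative assumptions.
cnf_file="./mysql-cdc.cnf"

cat > "$cnf_file" <<'EOF'
[mysqld]
# Any ID that is unique among the servers in the replication topology.
server-id        = 1
# Enable the binary log.
log_bin          = mysql-bin
# Row-level events with full row images are what CDC readers consume.
binlog_format    = ROW
binlog_row_image = FULL
EOF

echo "wrote $cnf_file"
```

After merging the settings and restarting MySQL, `show variables like 'log_bin';` should report ON.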
2. Install Hadoop
(1) Extract the Hadoop package: tar -zxvf hadoop-3.1.3.tar.gz
(2) Configure the environment variables
export HADOOP_HOME=/Users/xxx/hadoop/hadoop-3.1.3
export HADOOP_COMMON_HOME=$HADOOP_HOME
export PATH=$HADOOP_HOME/bin:$PATH
# Add the Hadoop classpath
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
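3. Download and install Flink
(1) Download the Flink package from the official site: https://flink.apache.org/downloads.html
(2) Extract it: tar -zxvf flink-1.12.2-bin-scala_2.11.tgz
(3) Edit the Flink configuration (vim conf/flink-conf.yaml) to enable checkpointing. The Hudi writer commits data only when a checkpoint completes, so without checkpoints the CDC job never produces a Hudi commit.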
state.backend: filesystem
execution.checkpointing.interval: 10000
state.checkpoints.dir: file:///Users/xxx/flink/flink-1.12.2/hudi/flink-checkpoints
state.savepoints.dir: file:///Users/xxx/flink/flink-1.12.2/hudi/flink-savepoints
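(4) Edit the Flink configuration again (vim conf/flink-conf.yaml) to increase the number of task slots
taskmanager.numberOfTaskSlots: 4
Then edit conf/workers (vim conf/workers) so that it lists localhost four times, one TaskManager per line:
localhost
localhost
localhost
localhost
(5) Start Flink: bin/start-cluster.sh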
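The settings from steps (3) and (4) can be sanity-checked with a short script. This is a sketch only: check_flink_conf is a hypothetical helper, and its key list is simply the set of keys edited above.

```shell
#!/bin/sh
# check_flink_conf FILE: report which of the keys edited in this guide
# are missing from FILE. (Hypothetical helper for illustration.)
check_flink_conf() {
    conf="$1"
    missing=0
    for key in state.backend execution.checkpointing.interval \
               state.checkpoints.dir taskmanager.numberOfTaskSlots; do
        # A key counts as set when an uncommented "key:" line exists.
        if ! grep -q "^${key}:" "$conf"; then
            echo "missing: $key"
            missing=1
        fi
    done
    return "$missing"
}
```

From the Flink home directory, `check_flink_conf conf/flink-conf.yaml && echo ok` prints ok when all four keys are present. If a key is reported missing, fix the file and restart the cluster, since the configuration is read at startup.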
4. Build Hudi and copy the jar
(1) Clone the Hudi source: git clone https://github.com/apache/hudi.git
(2) Switch to the 0.9.0 release branch: git checkout release-0.9.0
(3) Build: mvn clean package -DskipTests
(4) When the build finishes, the jar hudi-flink-bundle_2.11-0.9.0.jar is generated under packaging/hudi-flink-bundle/target. Copy it into Flink's lib directory:
cp hudi-flink-bundle_2.11-0.9.0.jar ~/flink/lib
5. Copy the other required jars into flink/lib
(1) flink-sql-connector-mysql-cdc-1.2.0.jar: connects to MySQL
(2) aws-java-sdk-bundle-1.11.874.jar and hadoop-aws-3.1.3.jar: connect to AWS S3
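Before starting the SQL client, it can help to confirm that every jar from steps 4 and 5 is actually in the lib directory. The sketch below assumes the exact jar versions named in this guide; check_flink_lib is a hypothetical helper, so adjust the list if your versions differ.

```shell
#!/bin/sh
# check_flink_lib DIR: report which of the jars named in this guide are
# absent from DIR. (Hypothetical helper; adjust the list to your versions.)
check_flink_lib() {
    lib_dir="$1"
    missing=0
    for jar in hudi-flink-bundle_2.11-0.9.0.jar \
               flink-sql-connector-mysql-cdc-1.2.0.jar \
               aws-java-sdk-bundle-1.11.874.jar \
               hadoop-aws-3.1.3.jar; do
        if [ ! -f "$lib_dir/$jar" ]; then
            echo "missing: $jar"
            missing=1
        fi
    done
    return "$missing"
}
```

Run it as `check_flink_lib lib` from the Flink home directory. Jars in lib are loaded when the Flink processes start, so restart the cluster after adding any that were missing.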
6. Start the SQL client
(1) bin/sql-client.sh embedded
(2) Create the MySQL source table
create table mysql_users(
id bigint primary key not enforced,
name string,
birthday timestamp(3),
ts timestamp(3)
) with (
'connector' = 'mysql-cdc',
'hostname' = '127.0.0.1',
'port' = '3306',
'username' = 'root',
'password' = '123456',
'database-name' = 'test_cdc',
'table-name' = 'users'
);
(3) Create the Hudi sink table
create table hudi_users(
id bigint primary key not enforced,
name string,
birthday timestamp(3),
ts timestamp(3),
`partition` varchar(20)
) partitioned by (`partition`) with (
'connector' = 'hudi',
'table.type' = 'COPY_ON_WRITE',
'path' = 's3a://xxx/yyy/hudi_users',
'read.streaming.enabled' = 'true',
'read.streaming.check-interval' = '1'
);
(4) Submit the job
insert into hudi_users select *, date_format(birthday, 'yyyyMMdd') from mysql_users;
Check that data files have been written to S3.
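To know which directory to look in, it helps to work out the partition value that date_format(birthday, 'yyyyMMdd') produces for a given row. The sketch below mimics that expression in plain shell for a 'YYYY-MM-DD HH:MM:SS' string. It assumes Hudi's default (non-Hive-style) layout, in which the partition value is used directly as the directory name under the table path; partition_for and the sample timestamp are made up for illustration.

```shell
#!/bin/sh
# partition_for TIMESTAMP: mimic date_format(birthday, 'yyyyMMdd') for a
# 'YYYY-MM-DD HH:MM:SS' string by keeping the date part and dropping dashes.
partition_for() {
    date_part="${1%% *}"                     # '1990-05-17 08:30:00' -> '1990-05-17'
    printf '%s\n' "$date_part" | tr -d '-'   # '1990-05-17' -> '19900517'
}

table_path='s3a://xxx/yyy/hudi_users'        # path from the hudi_users DDL
echo "$table_path/$(partition_for '1990-05-17 08:30:00')"
# prints s3a://xxx/yyy/hudi_users/19900517
```

Once the first checkpoint has completed, listing that directory (for example with hadoop fs -ls) should show parquet files.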
7. Create an external table in Hive
(1) Connect to Hive with beeline
!connect jdbc:hive2://ELB-DEV-Presto-hs2-s0000e2c5-06a22927ec8bb2f6.elb.us-east-1.amazonaws.com:10000/default;auth=noSasl
CREATE EXTERNAL TABLE `hudi_user_mor`(
`_hoodie_commit_time` string,
`_hoodie_commit_seqno` string,
`_hoodie_record_key` string,
`_hoodie_partition_path` string,
`_hoodie_file_name` string,
`id` bigint,
`name` string,
`birthday` bigint,
`ts` bigint)
PARTITIONED BY (
`partition` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3a://xxx/yyy/hudi_users';
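Add a partition (the partition value and location must match a partition directory that the Flink job actually wrote):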
alter table hudi_user_mor add if not exists partition(`partition`='par1') location 's3a://fw-itf/DFMOD-c34db792/target_table/par1';
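Every partition written by the Flink job needs its own alter table statement, so generating them is less error-prone than typing them. A minimal sketch, assuming partition directories sit directly under the table LOCATION and are named after the partition value; add_partition_sql is a hypothetical helper and 19900517 is a made-up partition value.

```shell
#!/bin/sh
# add_partition_sql TABLE BASE_PATH PARTITION: print the Hive statement that
# registers one partition directory. (Hypothetical helper for illustration.)
add_partition_sql() {
    printf "alter table %s add if not exists partition(\`partition\`='%s') location '%s/%s';\n" \
        "$1" "$3" "$2" "$3"
}

add_partition_sql hudi_user_mor s3a://xxx/yyy/hudi_users 19900517
```

The printed statement can be pasted into beeline; calling the function once per partition value produces one statement per partition.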
8. Query the data with Presto
(1) Start the Presto CLI
./presto-cli-0.248-executable.jar --server ELB-DEV-Presto-master-s0000eca1-efaff1be86b6ffa3.elb.us-east-1.amazonaws.com:9106 --catalog db
(2) Query the data
select * from hudi_user_mor where partition = 'par1' limit 5;
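9. Test the synchronization
Run insert, update, and delete statements in MySQL, then query through Hive or Presto. The changes show up in near real time, once the next Flink checkpoint has committed them to Hudi.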
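A small set of statements for this test could look like the sketch below. It writes them to a file so they can be run one at a time; the file name sync_test.sql and the sample values are made up for illustration. Only name is supplied, because id is auto-incremented and birthday and ts default to the current timestamp.

```shell
#!/bin/sh
# Write a small set of test statements to a file. The file name and the
# sample values are illustrative assumptions.
cat > ./sync_test.sql <<'EOF'
use test_cdc;
insert into users (name) values ('alice');
update users set name = 'alice2' where name = 'alice';
delete from users where name = 'alice2';
EOF

echo "wrote ./sync_test.sql"
```

Running the statements individually in the mysql client, and waiting at least one checkpoint interval (10 seconds in the configuration above) before each query in Hive or Presto, makes each change visible on its own.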