Data Lake Hudi (8): Integrating Hudi with Flink - Getting Started

  • Getting Started with Hudi + Flink Integration
    • 1. Hudi and Flink version compatibility
    • 2. Flink environment setup
    • 3. Running jobs with the Flink SQL Client
      • 1. Modify the configuration
      • 2. Create a table and insert data
      • 3. Streaming insert
    • 4. Running jobs from IDEA (Java API)
      • 1. Environment setup
      • 2. Create a Maven project and write the code
      • 3. Package and submit
    • 5. Flink-to-Hudi type mapping

Getting Started with Hudi + Flink Integration

1. Hudi and Flink Version Compatibility

(Figure: Hudi and Flink version compatibility matrix)
Hudi 0.11.x is not recommended; if you must use it, use the patched branch: https://github.com/apache/hudi/pull/6182

2. Flink Environment Setup

1) Copy the compiled Hudi bundle jar into Flink's lib directory

cp /opt/software/hudi-0.12.0/packaging/hudi-flink-bundle/target/hudi-flink1.13-bundle_2.12-0.12.0.jar /opt/module/flink-1.13.6/lib/

2) Copy the guava jar to resolve a dependency conflict

cp /opt/module/hadoop-3.1.3/share/hadoop/common/lib/guava-27.0-jre.jar /opt/module/flink-1.13.6/lib/

3) Configure the Hadoop environment variables

sudo vim /etc/profile.d/my_env.sh

export HADOOP_CLASSPATH=`hadoop classpath`
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

source /etc/profile

3. Running Jobs with the Flink SQL Client

1. Modify the configuration

  • 1) Edit flink-conf.yaml
vim /opt/module/flink-1.13.6/conf/flink-conf.yaml

classloader.check-leaked-classloader: false
taskmanager.numberOfTaskSlots: 4 # Hudi's write tasks default to a parallelism of 4; if you do not lower it on the Hudi side, raise the slot count here (a per-table alternative is sketched after the table DDLs below)

state.backend: rocksdb
execution.checkpointing.interval: 30000
state.checkpoints.dir: hdfs://hadoop1:8020/ckps
state.backend.incremental: true
  • 2) yarn-session mode
    (1) Resolve a dependency issue
    Note:
    The dependency fix below addresses a problem that appears when Flink is integrated with Hudi: the running Flink job needs to perform compaction, but compaction keeps failing, and the error is not reported to the aggregated log server, so you have to open the log of the individual Flink task to see it (shown below). The required class is in fact bundled in the hudi-flink jar; the root cause is a dependency conflict.
    (Screenshot: error from the failed compaction task log)
cp /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.1.3.jar /opt/module/flink-1.13.6/lib/

(2) Start a yarn-session

/opt/module/flink-1.13.6/bin/yarn-session.sh -d

(3) Start the SQL client

/opt/module/flink-1.13.6/bin/sql-client.sh embedded -s yarn-session

2. Create a table and insert data

set sql-client.execution.result-mode=tableau;

-- create a Hudi table

CREATE TABLE t1(
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://hadoop1:8020/tmp/hudi_flink/t1',
  'table.type' = 'MERGE_ON_READ' -- the default is COPY_ON_WRITE
);
Or equivalently:
CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20),
  PRIMARY KEY(uuid) NOT ENFORCED
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://hadoop1:8020/tmp/hudi_flink/t1',
  'table.type' = 'MERGE_ON_READ'
);
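As noted in the flink-conf.yaml comment earlier, an alternative to raising taskmanager.numberOfTaskSlots is to lower Hudi's write parallelism on the table itself. A minimal sketch with a hypothetical table name, assuming the 'write.tasks' option (the Hudi Flink connector's write-task parallelism, default 4; verify the name against your Hudi 0.12 docs):

CREATE TABLE t1_low_parallelism(
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://hadoop1:8020/tmp/hudi_flink/t1_low_parallelism',
  'table.type' = 'MERGE_ON_READ',
  'write.tasks' = '2'  -- assumed option: lowers write parallelism so fewer task slots are needed
);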
  • Insert data
INSERT INTO t1 VALUES
  ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
  ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
  ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
  ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
  ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
  ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
  ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
  ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
  • Query the data
select * from t1;
  • Update the data
insert into t1 values ('id1','Danny',27,TIMESTAMP '1970-01-01 00:00:01','par1');

Note that the save mode is now Append. In general, always use append mode unless you are creating the table for the first time. Querying the data again will now show the updated record. Each write operation generates a new commit identified by a timestamp. Compare rows with the same _hoodie_record_key against the previous commit and observe the changes in _hoodie_commit_time and age.
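To see the update, simply re-run the earlier query; id1 should now show age 27 instead of 23, while the other rows are unchanged:

select * from t1;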

3. Streaming insert

  • 1) Create test tables
CREATE TABLE sourceT (
  uuid varchar(20),
  name varchar(10),
  age int,
  ts timestamp(3),
  `partition` varchar(20)
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '1'
);

create table t2(
  uuid varchar(20),
  name varchar(10),
  age int,
  ts timestamp(3),
  `partition` varchar(20)
)
with (
  'connector' = 'hudi',
  'path' = '/tmp/hudi_flink/t2',
  'table.type' = 'MERGE_ON_READ'
);
  • 2) Run the insert
insert into t2 select * from sourceT;
  • 3) Query the results
set sql-client.execution.result-mode=tableau;
select * from t2 limit 10;
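The query above is a one-shot batch read. Since sourceT keeps generating rows, it can also be useful to watch new commits arrive continuously via the connector's streaming read mode. A minimal sketch, assuming the Hudi Flink options 'read.streaming.enabled' and 'read.streaming.check-interval' (verify the exact names against your Hudi 0.12 documentation):

create table t2_streaming(
  uuid varchar(20),
  name varchar(10),
  age int,
  ts timestamp(3),
  `partition` varchar(20)
)
with (
  'connector' = 'hudi',
  'path' = '/tmp/hudi_flink/t2',         -- same storage path as t2
  'table.type' = 'MERGE_ON_READ',
  'read.streaming.enabled' = 'true',     -- assumed option: stream new commits instead of a one-shot read
  'read.streaming.check-interval' = '4'  -- assumed option: poll for new commits every 4 seconds
);

select * from t2_streaming;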

4. Running Jobs from IDEA (Java API)

1. Environment setup

  • 1. Manually install the dependency
    In the directory containing hudi-flink1.13-bundle-0.12.0.jar, open a terminal and run the command below. Then check whether the local repository shown under Settings > Maven in IDEA is the same directory the command installed into; if not, move the installed jar into the repository directory that IDEA actually uses.
mvn install:install-file -DgroupId=org.apache.hudi -DartifactId=hudi-flink_2.12 -Dversion=0.12.0 -Dpackaging=jar -Dfile=./hudi-flink1.13-bundle-0.12.0.jar
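After the install, the bundle can be referenced from the new project's pom.xml. A minimal sketch showing only the Hudi dependency (the coordinates mirror the install command above; the project additionally needs the usual Flink 1.13.6 / Scala 2.12 dependencies, which are not shown here):

<dependency>
    <groupId>org.apache.hudi</groupId>
    <artifactId>hudi-flink_2.12</artifactId>
    <version>0.12.0</version>
</dependency>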

2. Create a Maven project and write the code

The code is as follows:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

import java.util.concurrent.TimeUnit;

public class HudiDemo {
    public static void main(String[] args) {
        //when running in IDEA, use this variant to get the Flink web UI
//        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //set the state backend to RocksDB
        EmbeddedRocksDBStateBackend embeddedRocksDBStateBackend = new EmbeddedRocksDBStateBackend(true);
        //when running locally in IDEA, set the RocksDB storage path
//        embeddedRocksDBStateBackend.setDbStoragePath("file:///E:/rocksdb");
        embeddedRocksDBStateBackend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM);
        env.setStateBackend(embeddedRocksDBStateBackend);

        //checkpoint configuration
        env.enableCheckpointing(TimeUnit.SECONDS.toMillis(5), CheckpointingMode.EXACTLY_ONCE);
        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        checkpointConfig.setCheckpointStorage("hdfs://hadoop102:8020/ckps");
        checkpointConfig.setMinPauseBetweenCheckpoints(TimeUnit.SECONDS.toMillis(2));
        checkpointConfig.setTolerableCheckpointFailureNumber(5);
        checkpointConfig.setCheckpointTimeout(TimeUnit.MINUTES.toMillis(1));
        checkpointConfig.setExternalizedCheckpointCleanup(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        StreamTableEnvironment tableEnvironment = StreamTableEnvironment.create(env);
        tableEnvironment.executeSql("CREATE TABLE sourceT (\n" +
                "  uuid varchar(20),\n" +
                "  name varchar(10),\n" +
                "  age int,\n" +
                "  ts timestamp(3),\n" +
                "  `partition` varchar(20)\n" +
                ") WITH (\n" +
                "  'connector' = 'datagen',\n" +
                "  'rows-per-second' = '1'\n" +
                ")");
        tableEnvironment.executeSql("create table t2(\n" +
                "  uuid varchar(20),\n" +
                "  name varchar(10),\n" +
                "  age int,\n" +
                "  ts timestamp(3),\n" +
                "  `partition` varchar(20)\n" +
                ")\n" +
                "with (\n" +
                "  'connector' = 'hudi',\n" +
                "  'path' = 'hdfs://hadoop102:8020/tmp/hudi_flink/t2',\n" +
                "  'table.type' = 'MERGE_ON_READ'\n" +
                ")");

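        // executeSql on an INSERT submits the streaming job by itself, so no explicit env.execute() call is needed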
        tableEnvironment.executeSql("insert into t2 select * from sourceT");

    }
}

3. Package and submit

Package the code into a jar, upload it to the myjars directory, and run the submit command:

flink run -t yarn-per-job \
-c com.yang.hudi.flink.HudiDemo \
./myjars/flink-hudi-demo-1.0-SNAPSHOT.jar

5. Flink-to-Hudi Type Mapping

(Figure: Flink SQL to Hudi type mapping table)
