Flink connects to the "data lake" Hudi and writes the data into HDFS

Dependencies: the ones that really matter are the Hudi and Hadoop/Flink dependencies, but rather than picking them out one by one, the whole block is pasted here. (The snippet assumes a ${flink.version} property defined in the pom's properties section; 1.13.x matches the hard-coded versions below.)


    
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>3.1.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.1.3</version>
    </dependency>

    <!-- Hudi Flink bundle -->
    <dependency>
        <groupId>org.apache.hudi</groupId>
        <artifactId>hudi-flink-bundle_2.12</artifactId>
        <version>0.9.0</version>
    </dependency>

    <!-- MySQL CDC connector -->
    <dependency>
        <groupId>com.alibaba.ververica</groupId>
        <artifactId>flink-connector-mysql-cdc</artifactId>
        <version>1.2.0</version>
    </dependency>

    <!-- ClickHouse JDBC -->
    <dependency>
        <groupId>ru.yandex.clickhouse</groupId>
        <artifactId>clickhouse-jdbc</artifactId>
        <version>0.2</version>
    </dependency>

    <!-- HBase client -->
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>2.4.3</version>
    </dependency>

    <!-- Flink CEP -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-cep-scala_2.12</artifactId>
        <version>1.13.0</version>
    </dependency>

    <!-- Flink core -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-core</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.12</artifactId>
        <version>1.13.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.12</artifactId>
        <version>1.13.0</version>
    </dependency>

    <!-- Flink Table API / Blink planner -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-java-bridge_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-planner-blink_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-runtime-blink_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-common</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-csv</artifactId>
        <version>1.9.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-json</artifactId>
        <version>${flink.version}</version>
    </dependency>

    <!-- Connectors -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka_2.12</artifactId>
        <version>1.13.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.bahir</groupId>
        <artifactId>flink-connector-redis_2.11</artifactId>
        <version>1.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-jdbc_2.12</artifactId>
        <version>1.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-jdbc_2.12</artifactId>
        <version>1.13.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-elasticsearch7_2.12</artifactId>
        <version>1.10.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-filesystem_2.11</artifactId>
        <version>1.4.2</version>
    </dependency>

    <!-- Kafka client -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>2.5.0</version>
    </dependency>

    <!-- Test & logging -->
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.30</version>
    </dependency>

    <!-- Lombok -->
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.16.22</version>
    </dependency>

    <!-- MySQL driver & fastjson -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.11</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.68</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.4</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <artifactSet>
                            <excludes>
                                <exclude>com.google.code.findbugs:jsr305</exclude>
                                <exclude>org.slf4j:*</exclude>
                                <exclude>log4j:*</exclude>
                                <exclude>org.apache.hadoop:*</exclude>
                            </excludes>
                        </artifactSet>
                        <filters>
                            <filter>
                                <!-- do not copy the signature files from META-INF -->
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                        <transformers>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

There are also a few jars that, for some reason, I could not find in the Maven repository; adding them to the project directly as local libraries works just as well.

[Figure 1: the jars that were added to the project directly]

Reading data from Kafka with Flink SQL and writing it into Hudi

Here I take the metrics stored in Kafka (in JSON form) and write them into Hudi:

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class SinkHuDi {
    public static void main(String[] args) {
        // 1 - set up the stream and table execution environments
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().build();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, settings);

        // NOTE: data is written into the Hudi table incrementally, so Flink checkpointing must be enabled
        // 1.1 - enable checkpointing
        env.enableCheckpointing(5000L);
        env.getCheckpointConfig().setCheckpointTimeout(10000L);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        // keep the last checkpoint when the job is cancelled normally
        env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        // restart strategy
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 5000L));
        // state backend on HDFS
        env.setStateBackend(new FsStateBackend("hdfs://192.168.16.101:8020/HuDi_ck/ck1"));

        // user name used to access HDFS
        System.setProperty("HADOOP_USER_NAME", "root");

        // 2 - create the source table that consumes the data from Kafka
        tableEnv.executeSql(
                "CREATE TABLE order_kafka_source (\n" +
                        "  arity BIGINT,\n" +
                        "  f0 BIGINT,\n" +
                        "  f1 DOUBLE\n" +
                        ") WITH (\n" +
                        "  'connector' = 'kafka',\n" +
                        "  'topic' = 'dws_saleHot',\n" +
                        "  'properties.bootstrap.servers' = 'hadoop101:9092',\n" +
                        "  'properties.group.id' = 'gid-1002',\n" +
//                        "  'scan.startup.mode' = 'latest-offset',\n" +
                        "  'scan.startup.mode' = 'earliest-offset',\n" +
                        "  'format' = 'json',\n" +
                        "  'json.fail-on-missing-field' = 'false',\n" +
                        "  'json.ignore-parse-errors' = 'true'\n" +
                        ")"
        );


        // 3 - create the sink table mapped to the Hudi table: table name, storage path, fields, etc.
        tableEnv.executeSql(
                "CREATE TABLE saleHot (\n" +
                        "  arity BIGINT,\n" +
                        "  f0 BIGINT,\n" +
                        "  f1 DOUBLE\n" +
                        ")\n" +
                        "PARTITIONED BY (f0)\n" +
                        "WITH (\n" +
                        "    'connector' = 'hudi',\n" +
//                               "    'path' = 'file:///D:/flink_hudi_order',\n" +
                        "  'path' = 'hdfs://192.168.16.101:8020/hudi-warehouse/saleHot' ,\n" +
                        "    'table.type' = 'MERGE_ON_READ',\n" +
                        "    'write.operation' = 'upsert',\n" +
                        "    'hoodie.datasource.write.recordkey.field'= 'f0',\n" +
//                        "    'write.precombine.field' = 'ts',\n" +
                        "    'write.tasks'= '1'\n" +
                        ")"
        );

        // 4 - write the Kafka data into the Hudi table via a sub-query
        tableEnv.executeSql(
                "INSERT INTO saleHot " +
                        "SELECT arity, f0, f1 FROM order_kafka_source"
        );

        // optional checks: print the Hudi table or the Kafka source table
        //tableEnv.executeSql("select * from saleHot").print();
        //tableEnv.executeSql("select * from order_kafka_source").print();

    }
}

After the job has written the data, the result in HDFS looks like this:

[Figure 2: contents of hdfs://192.168.16.101:8020/hudi-warehouse/saleHot after the job has run]
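
Besides checking the NameNode web UI, the same directory can be listed programmatically with the Hadoop FileSystem API. Below is a minimal sketch (the class name ListHudiPath is arbitrary; the NameNode address, user and table path simply reuse the values from the SinkHuDi job above):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHudiPath {
    public static void main(String[] args) throws Exception {
        // same NameNode address and HDFS user as in SinkHuDi
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://192.168.16.101:8020"), new Configuration(), "root");

        // the table directory holds the .hoodie metadata folder plus one directory per f0 partition
        for (FileStatus status : fs.listStatus(new Path("/hudi-warehouse/saleHot"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}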

 
