Doris: Loading HDFS Data into a Doris Dynamic Partition Table

@羲凡——只为了更好的活着

Key points in this post
1. Creating the dynamic partition table
2. Reading the HDFS path as a partition value
3. Adding a count column for pre-aggregation: set (cnt = 1)
4. HDFS HA configuration for broker load

Prerequisites

1. Install Doris first (see the earlier post "Doris 编译安装(完整版)").
2. Doris basics:
Doris has three data models: Aggregate, Duplicate, and Unique;
Doris supports two table layouts: single partition and composite partition;
a single-partition table only specifies bucketing;
a composite-partition table specifies partitions first and then buckets within each partition.
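The two layouts above can be sketched as two minimal DDL statements (the table and column names here are hypothetical illustrations, not part of this post's example; a sketch only, not tuned for production):

```sql
-- Single partition: only bucketing is specified; the whole table is one partition.
CREATE TABLE test_db.single_part_demo
(
    id  VARCHAR(100),
    cnt BIGINT SUM
)
AGGREGATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10;

-- Composite partition: partition by range on dt first,
-- then bucket by id within each partition.
CREATE TABLE test_db.composite_part_demo
(
    dt  DATE,
    id  VARCHAR(100),
    cnt BIGINT SUM
)
AGGREGATE KEY(dt, id)
PARTITION BY RANGE (dt)
(
    PARTITION p1 VALUES LESS THAN ("2020-09-01")
)
DISTRIBUTED BY HASH(id) BUCKETS 10;
```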

Scenario: load a Hive partitioned table into a Doris dynamic partition table.

1. Creating the Doris table

Doris ships with very good built-in examples: after connecting to the FE, run help create table to see them.
CREATE TABLE test_db.doris_table_1
(
dt DATE,
type VARCHAR(20),
id VARCHAR(100),
campaignid VARCHAR(10),
spotid VARCHAR(10),
ts BIGINT MIN,
cnt BIGINT SUM
)
ENGINE=olap
AGGREGATE KEY(dt, type, id, campaignid, spotid)
PARTITION BY RANGE (dt)
(
PARTITION p1 VALUES LESS THAN ("2020-09-01"),
PARTITION p2 VALUES LESS THAN ("2020-12-01"),
PARTITION p3 VALUES LESS THAN ("2021-03-01"),
PARTITION p4 VALUES LESS THAN ("2021-06-01")
)
DISTRIBUTED BY HASH(id) BUCKETS 10
PROPERTIES (
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.end" = "3",
"dynamic_partition.prefix" = "p",
"dynamic_partition.buckets" = "10"
);
Key parameters:
ENGINE: defaults to olap; mysql, broker, es, etc. are also available, but those create external tables, so olap is recommended.
AGGREGATE KEY: rows are aggregated on the listed columns.
PARTITION BY: the actual partitioning column (omit it for a single-partition table).
DISTRIBUTED BY: the bucketing column.
dynamic_partition.enable: defaults to false; true enables dynamic partitioning.
dynamic_partition.time_unit: the partition granularity; one of HOUR, DAY, WEEK, MONTH.
dynamic_partition.end: the number of future partitions to create in advance; must be greater than 0.
dynamic_partition.prefix: the name prefix for automatically created partitions; the initial partitions above are named explicitly, so the prefix only applies to partitions the scheduler creates.
dynamic_partition.buckets: the number of buckets for automatically created partitions.
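Once the table exists, you can confirm that the dynamic partition scheduler is actually working. A quick check, assuming the table created above:

```sql
-- Show dynamic-partition settings and scheduler state for tables in test_db.
SHOW DYNAMIC PARTITION TABLES FROM test_db;

-- List all partitions of the table; with "dynamic_partition.end" = "3"
-- you should see three future daily partitions created ahead of time.
SHOW PARTITIONS FROM test_db.doris_table_1;
```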

2. Doris broker load

Again, the built-in examples are excellent: after connecting to the FE, run help broker load.
LOAD LABEL test_db.doris_table_1_001
(
DATA INFILE("hdfs://ns/user/test_db/hive_table_1/dt=*/*")
INTO TABLE `doris_table_1`
COLUMNS TERMINATED BY "|"
(type,id,idtype,campaignid,spotid,referrer,ts,pcos,browser,ip)
COLUMNS FROM PATH AS (dt)
set (cnt = 1)
)
WITH BROKER broker_name (
    "username"="hdfs_user",
    "password"="hdfs_password", 
    "dfs.nameservices" = "ns",
    "dfs.ha.namenodes.ns" = "namenode30,namenode55",
    "dfs.namenode.rpc-address.ns.namenode30" = "yc-nsg-h2:8020",
    "dfs.namenode.rpc-address.ns.namenode55" = "yc-nsg-h3:8020",
    "dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
PROPERTIES (
    "timeout" = "3600",
    "max_filter_ratio" = "0.1",
    "timezone" = "Asia/Shanghai");
Key parameters:
(type,id,...,ip): the source columns in Hive order; the names may differ from Hive's, but any column destined for Doris must match the Doris column name exactly to be mapped automatically.
columns from path as (dt): reads the value from the HDFS path (the dt=... directory) and passes it to Doris as the dt column.
set (cnt = 1): effectively adds a constant column to the source, so that Doris has a cnt value to pre-aggregate (SUM) at load time.
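Broker load is asynchronous, so after submitting the statement you poll the job state by its label (using the label from the statement above):

```sql
-- Check the state of the load job; State progresses
-- PENDING -> LOADING -> FINISHED (or CANCELLED on failure),
-- and ErrorMsg / URL point to details when rows are filtered.
SHOW LOAD FROM test_db WHERE LABEL = "doris_table_1_001";
```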


====================================================================


If you have any questions about this post, feel free to leave a comment.
