@羲凡: just to live a better life
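Doris has three main data models: Aggregate, Duplicate, and Unique.
Doris supports two ways of laying out a table: single partitioning and composite partitioning.
Single partitioning means the table only specifies bucketing.
Composite partitioning means the table specifies partitions first and then buckets.
Example requirement: load a partitioned Hive table into a Doris table that uses dynamic partitioning.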
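For contrast with the composite-partitioned table created later in this post, here is a minimal sketch of a single-partition table. The table name doris_table_single and its reduced column list are made up for illustration; the only point is that PARTITION BY is left out and DISTRIBUTED BY alone is specified.

```sql
-- Hypothetical single-partition table: no PARTITION BY clause, bucketing only.
CREATE TABLE test_db.doris_table_single
(
    dt DATE,
    id VARCHAR(100),
    cnt BIGINT SUM            -- value column, summed for rows with the same key
)
ENGINE=olap
AGGREGATE KEY(dt, id)
DISTRIBUTED BY HASH(id) BUCKETS 10;
```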
help create table
CREATE TABLE test_db.doris_table_1
(
dt DATE,
type varchar(20),
id varchar(100),
campaignid varchar(10),
spotid varchar(10),
ts bigint min,
cnt bigint sum
)
ENGINE=olap
AGGREGATE KEY(dt, type, id, campaignid, spotid)
PARTITION BY RANGE (dt)
(
PARTITION p1 VALUES LESS THAN ("2020-09-01"),
PARTITION p2 VALUES LESS THAN ("2020-12-01"),
PARTITION p3 VALUES LESS THAN ("2021-03-01"),
PARTITION p4 VALUES LESS THAN ("2021-06-01")
)
DISTRIBUTED BY HASH(id) BUCKETS 10
PROPERTIES (
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.end" = "3",
"dynamic_partition.prefix" = "p",
"dynamic_partition.buckets" = "10"
);
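ENGINE: defaults to olap. mysql, broker, es and others are also accepted, but those are for creating external tables; olap is recommended.
AGGREGATE KEY: rows that share the same values in these columns are merged into one row, and each remaining column is combined with its declared aggregate function (min for ts, sum for cnt).
PARTITION BY: the column the table is partitioned on (omit it if you are not using composite partitioning).
DISTRIBUTED BY: the column the data is bucketed on.
dynamic_partition.enable: defaults to false; true turns dynamic partitioning on.
dynamic_partition.time_unit: the time granularity of the dynamic partitions; one of HOUR, DAY, WEEK, MONTH.
dynamic_partition.end: how many partitions to create ahead of time; the value must be greater than 0.
dynamic_partition.prefix: the name prefix for automatically created partitions. The partitions p1 to p4 in this post are declared by hand with explicit names, so the prefix does not apply to them.
dynamic_partition.buckets: the number of buckets for automatically created partitions.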
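Once the table exists, the dynamic partition settings described above can be checked from the MySQL client. The statements below are a rough sketch against the test_db.doris_table_1 table from this post; the new value "7" in the last statement is only an illustrative choice, not something from the original setup.

```sql
-- List the tables in test_db that have dynamic partitioning configured,
-- together with their time unit, end offset, prefix and bucket count.
SHOW DYNAMIC PARTITION TABLES FROM test_db;

-- List every partition of the table: the hand-declared p1..p4
-- plus any partitions the scheduler has created automatically.
SHOW PARTITIONS FROM test_db.doris_table_1;

-- Illustrative change (assumed value): pre-create 7 partitions instead of 3.
ALTER TABLE test_db.doris_table_1 SET ("dynamic_partition.end" = "7");
```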
help broker load
LOAD LABEL doris_table_1_001
(
DATA INFILE("hdfs://ns/user/test_db/hive_table_1/dt=*/*")
INTO TABLE `doris_table_1`
COLUMNS TERMINATED BY "|"
(type,id,idtype,campaignid,spotid,referrer,ts,pcos,browser,ip)
COLUMNS FROM PATH AS (dt)
set (cnt = 1)
)
WITH BROKER broker_name (
"username"="hdfs_user",
"password"="hdfs_password",
"dfs.nameservices" = "ns",
"dfs.ha.namenodes.ns" = "namenode30, namenode55",
"dfs.namenode.rpc-address.ns.namenode30" = "yc-nsg-h2:8020",
"dfs.namenode.rpc-address.ns.namenode55" = "yc-nsg-h3:8020",
"dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
PROPERTIES (
"timeout" = "3600",
"max_filter_ratio" = "0.1",
"timezone" = "Asia/Shanghai");
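(type,id,...,ip): follows the column order of the Hive table. The names do not have to match Hive's, but every column you want loaded into Doris must use the same name as the Doris column so that it is mapped automatically.
columns from path as (dt): reads the dt value from the HDFS path (the dt=* directory level) and loads it into the dt column.
set (cnt = 1): effectively adds a constant column to the source data, so that after loading, cnt holds the pre-aggregated row count for each key.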
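Broker load runs asynchronously, so it helps to check the job and then look at the aggregated rows. The sketch below assumes the load was submitted inside test_db with the label used above; the date in the WHERE clause is a made-up value, so pick one that exists in your data.

```sql
-- Check the job state for the label (for example PENDING, LOADING, FINISHED or CANCELLED).
SHOW LOAD FROM test_db WHERE LABEL = "doris_table_1_001";

-- After the job is FINISHED: one row per key combination, where
-- cnt is the number of source rows for that key (sum of the constant 1)
-- and ts is the smallest ts among them (min).
SELECT dt, type, id, campaignid, spotid, ts, cnt
FROM test_db.doris_table_1
WHERE dt = "2020-08-20"   -- hypothetical date
LIMIT 10;
```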
If you have any questions about this post, feel free to leave a comment.