Hive Storage Optimization: Cluster By + Parquet

Scenario:

        Business workloads frequently run joins and GROUP BY operations, which shuffle the data and scatter rows with the same key across output files, preventing Parquet from reaching its maximum compression ratio. Using CLUSTER BY gathers and sorts rows with the same key together, which lets Parquet achieve its best compression.

Basics: a quick review of the concepts involved

DISTRIBUTE BY: in the reduce phase, hashes rows by the given key so that all rows with the same key go to the same reducer

SORT BY: sorts rows within each reducer (not a global sort)

CLUSTER BY = DISTRIBUTE BY + SORT BY on the same columns (ascending order only)

Parquet: columnar storage format with per-column compression and encoding
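The CLUSTER BY equivalence above can be written out directly; the table and column names below are placeholders reusing this article's tables:

```sql
-- These two queries are equivalent in Hive:
SELECT id, value
FROM tmp.source
CLUSTER BY id;

-- CLUSTER BY id is shorthand for:
SELECT id, value
FROM tmp.source
DISTRIBUTE BY id   -- rows with the same id go to the same reducer
SORT BY id;        -- rows are sorted by id within each reducer
```

Note that CLUSTER BY only supports ascending order; if you need a descending sort you must spell out DISTRIBUTE BY plus SORT BY ... DESC.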

Optimization examples:

CREATE TABLE IF NOT EXISTS tmp.test(
    id            string,
    feature       string,
    value         string
)
PARTITIONED BY (
    data_date     bigint COMMENT 'date partition'
)
STORED AS PARQUET;
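The Parquet compression codec can also be set explicitly, either per table or per session; SNAPPY below is just an illustrative choice, not something the original article specifies:

```sql
-- Option 1: per table, via the parquet.compression table property
ALTER TABLE tmp.test SET TBLPROPERTIES ('parquet.compression' = 'SNAPPY');

-- Option 2: per session, before running the INSERT
SET parquet.compression=SNAPPY;
```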

-- Case 1 (baseline): plain join, no redistribution; rows with the same id
-- end up scattered across the reducer output files
INSERT OVERWRITE TABLE tmp.test partition(data_date=001)
SELECT id, alias_name, value
FROM (
    SELECT alias_name, feature
    FROM tmp.mapping
    WHERE data_date = 20200618
) a
JOIN (
    SELECT id, feature, value
    FROM tmp.source
    WHERE data_date = 20200706
) b
ON a.feature = b.feature;



-- Case 2: DISTRIBUTE BY id sends all rows with the same id to one reducer,
-- but leaves them unsorted within it
INSERT OVERWRITE TABLE tmp.test partition(data_date=002)
SELECT id, alias_name, value
FROM (
    SELECT alias_name, feature
    FROM tmp.mapping
    WHERE data_date = 20200618
) a
JOIN (
    SELECT id, feature, value
    FROM tmp.source
    WHERE data_date = 20200706
) b
ON a.feature = b.feature
DISTRIBUTE BY id;



-- Case 3: CLUSTER BY id both gathers and sorts rows by id,
-- giving Parquet the best compression
INSERT OVERWRITE TABLE tmp.test partition(data_date=003)
SELECT id, alias_name, value
FROM (
    SELECT alias_name, feature
    FROM tmp.mapping
    WHERE data_date = 20200618
) a
JOIN (
    SELECT id, feature, value
    FROM tmp.source
    WHERE data_date = 20200706
) b
ON a.feature = b.feature
CLUSTER BY id;

Conclusion:

Using CLUSTER BY lets Parquet reach its maximum compression ratio: identical key values end up adjacent within each output file, so Parquet's run-length and dictionary encodings become far more effective.
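To compare the three cases, check each partition's on-disk size after loading; the totalSize entry in the output is a standard Hive table/partition statistic:

```sql
-- Inspect each partition's size (look for totalSize under the parameters)
DESCRIBE FORMATTED tmp.test PARTITION (data_date=001);
DESCRIBE FORMATTED tmp.test PARTITION (data_date=002);
DESCRIBE FORMATTED tmp.test PARTITION (data_date=003);
```

If the three inserts processed the same source data, the CLUSTER BY partition (data_date=003) should typically come out smallest.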
