hive 拉链表 实现全量数据 增量更新

背景:

数据表字段有IP,IP省份,IP城市,最新访问时间,假如目前一共有100W的IP我第一次初始化的时候完成初始化表插入,然后每天都有5W左右的IP活跃,然后我要更新这5W个IP的记录到初始化的表,他的IP省份如果有变更也要更新保留最新的数据,更新最新的访问时间等。

所有操作均在hive表上

1.需要2个表 数据拉链表,每日更新数据表

DDL语句如下

每日更新表
create table dm.day_tbl(
ip string comment'ip',
country string comment'国家',
province string comment'省份',
city string comment'城市',
day string comment'day',
)
comment '每日更新表'
row format delimited
fields terminated by '\u0001'
stored as parquet;
数据拉链表
create table dm.zipper_tbl(
ip string comment'ip',
country string comment'国家',
province string comment'省份',
city string comment'城市',
t_start_date string comment'start_date',
t_end_date string comment'end_date'
)
comment '数据拉链表'
row format delimited
fields terminated by '\u0001'
stored as parquet;

ETL语句如下

INSERT OVERWRITE TABLE dm.zipper_tbl
SELECT * FROM 
( 
    SELECT zipper_tbl.ip, 
           zipper_tbl.country, 
           zipper_tbl.province, 
           zipper_tbl.city, 
           zipper_tbl.t_start_date, 
           CASE 
                WHEN zipper_tbl.t_end_date = '9999-12-31' AND day_tbl.ip IS NOT NULL THEN date_add(day_tbl.day,-1)
                ELSE zipper_tbl.t_end_date 
           END AS t_end_date 
    FROM dm.zipper_tbl AS zipper_tbl 
    LEFT JOIN dm.day_tbl AS day_tbl 
    ON zipper_tbl.ip = day_tbl.ip 
UNION ALL
    SELECT union_day_tbl.ip,
           union_day_tbl.country, 
           union_day_tbl.province, 
           union_day_tbl.city, 
           union_day_tbl.day AS t_start_date, 
           '9999-12-31' AS t_end_date 
    FROM dm.day_tbl AS union_day_tbl 
) AS main_tbl WHERE main_tbl.t_end_date = "9999-12-31"

上面ETL是参考常见的拉链表,最后过滤一下时间,就可以得到全量最新的数据

参考:
https://www.cnblogs.com/lxbma...

你可能感兴趣的:(hive)