在数据仓库的数据模型设计过程中,经常会遇到下面这种表的设计:
拉链表可以解决上述数据缓慢化场景问题,既能保存历史状态,又能尽量节省空间资源,数仓中某些维度缓慢变化的应用场景可以使用缓慢变化维拉链表。
1)下图为业务后台数据库中每日的数据变化情况,每天只保留客户最新手机号码。
2)根据最后修改时间字段从业务后台数据库抽取最新被修改数据,形成分区表。
3)拉链表的形成过程(重点!!!)
首先根据上面的分析过程,对sql比较熟悉的小伙伴估计已经能够想出整套的解决方案了。一般在工作中,和业务人员沟通之后,自己如果能清楚的描述出业务过程,基本上需求已经实现一大半了,接下来就是写代码啦!
1)整体思路分析
2)进一步思考
1)算法思想
1、采集当日全量数据到ND(NewDay)表;
2、可从历史表中取出昨日全量数据存储到OD(OldDay)表;
3、(ND-OD)就是当日新增和变化的数据,也就是当天的增量,用W_I表示;
4、(OD-ND)为状态到此结束需要封链的数据,用W_U表示;
5、将W_I表的内容全部插入到历史表中,这些是新增记录,start_date为当天,而end_date为max值;
6、对历史表进行W_U部份的更新操作,start_date保持不变,而end_date改为当天,也就是关链操作
2)代码实现
1. ods层user表结构
CREATE EXTERNAL TABLE ods.user (
user_num STRING COMMENT '用户编号',
mobile STRING COMMENT '手机号码',
reg_date STRING COMMENT '注册日期',
modified_time STRING COMMENT '更新日期'
COMMENT '用户资料表'
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '/ods/user';
)
2. ods层user_update表结构
CREATE EXTERNAL TABLE ods.user_update (
user_num STRING COMMENT '用户编号',
mobile STRING COMMENT '手机号码',
reg_date STRING COMMENT '注册日期',
modified_time STRING COMMENT '更新日期'
COMMENT '每日用户资料更新表'
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '/ods/user_update';
)
3. 拉链表结构
CREATE EXTERNAL TABLE dws.user_his (
user_num STRING COMMENT '用户编号',
mobile STRING COMMENT '手机号码',
reg_date STRING COMMENT '注册日期',
modified_time STRING COMMENT '更新日期',
t_start_date STRING COMMENT '数据开始时间',
t_end_date STRING COMMENT '数据结束时间'
COMMENT '用户资料拉链表'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '/dws/user_his';
)
4. 拉链表从2017-01-01更新到2017-01-02
INSERT OVERWRITE TABLE dws.user_his
SELECT * FROM
(
SELECT A.user_num,
A.mobile,
A.reg_date,
A.modified_time
A.t_start_time,
CASE
WHEN A.t_end_time = '9999-12-31' AND B.user_num IS NOT NULL THEN '2017-01-01'
ELSE A.t_end_time
END AS t_end_time
FROM dws.user_his AS A
LEFT JOIN ods.user_update AS B
ON A.user_num = B.user_num
UNION
SELECT C.user_num,
C.mobile,
C.reg_date,
C.modified_time,
'2017-01-02' AS t_start_time,
'9999-12-31' AS t_end_time
FROM ods.user_update AS C
) AS T
5. 拉链表从2017-01-02更新到2017-01-03
INSERT OVERWRITE TABLE dws.user_his
SELECT * FROM
(
SELECT A.user_num,
A.mobile,
A.reg_date,
A.modified_time
A.t_start_time,
CASE
WHEN A.t_end_time = '9999-12-31' AND B.user_num IS NOT NULL THEN '2017-01-02'
ELSE A.t_end_time
END AS t_end_time
FROM dws.user_his AS A
LEFT JOIN ods.user_update AS B
ON A.user_num = B.user_num
UNION
SELECT C.user_num,
C.mobile,
C.reg_date,
C.modified_time,
'2017-01-03' AS t_start_time,
'9999-12-31' AS t_end_time
FROM ods.user_update AS C
) AS T
6. 替换日期为变量,比如今日是2017-01-03,数仓中可以更新2017-01-02数据,此时$date = 2017-01-02
INSERT OVERWRITE TABLE dws.user_his
SELECT * FROM
(
SELECT A.user_num,
A.mobile,
A.reg_date,
A.modified_time
A.t_start_time,
CASE
WHEN A.t_end_time = '9999-12-31' AND B.user_num IS NOT NULL THEN date_add($date,-1)
ELSE A.t_end_time
END AS t_end_time
FROM dws.user_his AS A
LEFT JOIN ods.user_update AS B
ON A.user_num = B.user_num
UNION
SELECT C.user_num,
C.mobile,
C.reg_date,
C.modified_time,
$date AS t_start_time,
'9999-12-31' AS t_end_time
FROM ods.user_update AS C
) AS T
1. 下图展示拉链表,分区表和全量表的结果
2.三种类型的表各自的对比如下,当然每种表有每种表各自适合的使用场景,具体问题要具体分析!
文中若有不正之处,望请指正!