Data Warehouse Practice Based on the Hadoop Ecosystem: Advanced Techniques (12)

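12. Indirect Data Sources
        This section discusses how to handle an indirect data source. An indirect data source has a different grain from the dimension table it feeds, so it cannot be loaded into the data warehouse directly. The example here modifies the campaign session source data from Advanced Techniques (8), "Multi-Path and Ragged Hierarchies", to show how such a source is handled. The original source data looks like this: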
CAMPAIGN SESSION,MONTH,YEAR
2016 First Campaign,1,2016
2016 First Campaign,2,2016
2016 First Campaign,3,2016
2016 First Campaign,4,2016
2016 Second Campaign,5,2016
2016 Second Campaign,6,2016
2016 Second Campaign,7,2016
2016 Third Campaign,8,2016
2016 Last Campaign,9,2016
2016 Last Campaign,10,2016
2016 Last Campaign,11,2016
2016 Last Campaign,12,2016
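        As shown above, the grain of the campaign session data source is the month, because every row carries a month element. A campaign session can also span several months: the first campaign session of 2016 lasts four months, so its information is repeated four times, in four rows. Suppose we want to simplify the preparation of this source data so that each campaign session, however long it runs, takes only one row. Each row then holds the session name, start month, start year, end month and end year. The new format is shown below and is stored in the file non_campaign_session.csv.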
2016 First Campaign,1,2016,4,2016
2016 Second Campaign,5,2016,7,2016
2016 Third Campaign,8,2016,8,2016
2016 Last Campaign,9,2016,12,2016
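        The relationship between the two formats can be sketched in a few lines of Python. This is only an illustration of the collapsing step, not part of the original load process. The function name `collapse_sessions` is hypothetical, and the sketch assumes each session covers a contiguous run of months (gaps are not detected).

```python
import csv
import io

# Month-grain source rows, copied from the first listing in this section.
MONTHLY_CSV = """CAMPAIGN SESSION,MONTH,YEAR
2016 First Campaign,1,2016
2016 First Campaign,2,2016
2016 First Campaign,3,2016
2016 First Campaign,4,2016
2016 Second Campaign,5,2016
2016 Second Campaign,6,2016
2016 Second Campaign,7,2016
2016 Third Campaign,8,2016
2016 Last Campaign,9,2016
2016 Last Campaign,10,2016
2016 Last Campaign,11,2016
2016 Last Campaign,12,2016
"""


def collapse_sessions(monthly_csv):
    """Collapse month-grain rows into one line per campaign session.

    Hypothetical helper: assumes every session covers a contiguous run
    of months, so only the earliest and latest month are kept.
    """
    reader = csv.DictReader(io.StringIO(monthly_csv))
    spans = {}  # session name -> [first (year, month), last (year, month)]
    for row in reader:
        key = (int(row["YEAR"]), int(row["MONTH"]))
        span = spans.setdefault(row["CAMPAIGN SESSION"], [key, key])
        span[0] = min(span[0], key)
        span[1] = max(span[1], key)
    # Output column order: session, start_month, start_year, end_month, end_year
    return [
        f"{name},{start[1]},{start[0]},{end[1]},{end[0]}"
        for name, (start, end) in spans.items()
    ]


collapsed = collapse_sessions(MONTHLY_CSV)
for line in collapsed:
    print(line)
```

        Run on the twelve month-grain rows, the sketch yields the same four lines listed above, which is a quick way to confirm that the two files describe the same campaign sessions.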
1. Modify the campaign session load script
        A different staging table is needed. Create it with the following script.
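USE rds;
-- Staging table: one row per campaign session, with its first and last month.
-- Months are declared INT because the source file holds month numbers (1-12).
CREATE TABLE non_straight_campaign (
    campaign_session CHAR(30),
    start_month INT,
    start_year INT,
    end_month INT,
    end_year INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;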
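        Note that the new staging table has both start year/month columns and end year/month columns. The modified campaign session load script is given below.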
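-- Load the one-row-per-session file into the staging table.
use rds;
load data local inpath '/root/non_campaign_session.csv' overwrite into table non_straight_campaign;

use dw;
drop table if exists tmp;
-- Expand each session into one row per month it covers.
-- The last two conditions force t1 and t2 to the same year and month, so a
-- month qualifies only when start_year = end_year; a session that crosses a
-- year boundary matches no rows.
create table tmp as
select t1.month_sk,
       t1.month,
       t1.month_name,
       t3.campaign_session,
       t1.quarter,
       t1.year
  from month_dim t1, month_dim t2, rds.non_straight_campaign t3
 where t1.year = t3.start_year
   and t1.month >= t3.start_month
   and t2.year = t3.end_year
   and t2.month <= t3.end_month
   and t1.year = t2.year
   and t1.month = t2.month;
-- Replace the matched months with their campaign-tagged versions.
-- "select *" relies on tmp having the same column order as month_dim.
delete from month_dim where month_dim.month_sk in (select month_sk from tmp);
insert into month_dim select * from tmp;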
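        To make the join condition easier to follow, the same filter can be written out in Python. This is a minimal sketch, assuming month_dim holds exactly one row per calendar month. The function name `expand_session` and the `STAGED_ROWS` list are illustrative stand-ins for the staging table, not part of the original scripts.

```python
def expand_session(session, start_month, start_year, end_month, end_year):
    """Return (year, month, session) for every month the session covers.

    Mirrors the WHERE clause of the load script: a month qualifies when its
    year equals both start_year and end_year and its month number lies
    between start_month and end_month. As in the SQL, a session that crosses
    a year boundary therefore matches nothing.
    """
    return [
        (year, month, session)
        for year in range(start_year, end_year + 1)
        for month in range(1, 13)
        if year == start_year
        and year == end_year
        and start_month <= month <= end_month
    ]


# Stand-in for rds.non_straight_campaign, in the staging table's column order.
STAGED_ROWS = [
    ("2016 First Campaign", 1, 2016, 4, 2016),
    ("2016 Second Campaign", 5, 2016, 7, 2016),
    ("2016 Third Campaign", 8, 2016, 8, 2016),
    ("2016 Last Campaign", 9, 2016, 12, 2016),
]

expanded = [t for row in STAGED_ROWS for t in expand_session(*row)]
for year, month, session in expanded:
    print(year, month, session)
```

        The four staged rows expand back to twelve month rows, one per month of 2016. If campaign sessions ever span two calendar years, both this sketch and the SQL filter would need a different range comparison.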
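2. Test
        Before running the modified campaign session load script, run the following commands to clear the campaign session data that has already been loaded.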
USE dw;
UPDATE month_dim SET campaign_session = NULL;
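        After running the modified campaign session load script, query the month_dim table to confirm that it was loaded correctly. The query is as follows.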
select month_sk m_sk, month_name, month m, campaign_session,quarter q
  from dw.month_dim
 where year = 2016;

        The query result is shown in the figure below.
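        Beyond inspecting the result by eye, one property worth checking is that each month appears only once. If two session ranges in the source file overlapped, the join in the load script would emit that month twice and both rows would be inserted. The following is a minimal Python sketch of such a check, assuming the query rows have already been fetched into a list of (year, month, campaign_session) tuples. The function name and the sample rows are hypothetical, and the rows are illustrative rather than captured query output.

```python
from collections import Counter


def months_loaded_more_than_once(rows):
    """Return the sorted (year, month) pairs that occur in more than one row."""
    counts = Counter((year, month) for year, month, _session in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Illustrative rows only (not captured query output).
clean = [
    (2016, 7, "2016 Second Campaign"),
    (2016, 8, "2016 Third Campaign"),
]
# Same rows plus a second session claiming August, as an overlap would produce.
overlapping = clean + [(2016, 8, "2016 Last Campaign")]

print(months_loaded_more_than_once(clean))
print(months_loaded_more_than_once(overlapping))
```

        An empty list means no month was loaded twice. Any pair that is returned points to overlapping ranges in non_campaign_session.csv.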

