基于hadoop生态圈的数据仓库实践 —— 进阶技术(十七)

十七、分段维度
        本节说明分段维度的实现技术。分段维度包含连续值的分段。例如,年度销售订单分段维度可能包含有叫做“低”、“中”、“高”的三档;各档定义分别为0.01到15000、15000.01到30000.00、30000.01到99999999.99。如果一个客户的年度销售订单金额为10000,则被归为“低”档。
        分段维度可以存储多个分段集合。例如,可能有一个用于促销分析的分段集合,另一个用于市场细分,可能还有一个用于销售区域计划。分段一般由用户定义,而且很少能从交易源数据直接获得。

1. 年度销售订单星型模式
        本小节说明如何实现一个年度订单分段维度。需要两个新的星型模式,如下图所示。星型模式的事实表使用(关联到)已有的customer_dim和一个新的year_dim表。年维度是日期维度的子集。annual_customer_segment_fact是唯一用到annual_order_segment_dim表的表。annual_order_segement_dim是分段维度表。
基于hadoop生态圈的数据仓库实践 —— 进阶技术(十七)_第1张图片

        annual_order_segment_dim表存储多个分段集合。在下面的例子里将两个分段集合“project alpha”和“grid”导入annual_order_segment_dim表。这两种分段集合都是按照用户的年度销售订单金额将其分类。project alpha分六段,grid分三段。下表显示了这个分段的例子。

Segment Name

Band Name

Start Value

End Value

 PROJECT ALPHA

 Bottom

 0.01

 2500.00

 PROJECT ALPHA

 Low

 2500.01

 3000.00

 PROJECT ALPHA

 Mid-low

 3000.01

 4000.00

 PROJECT ALPHA

 Mid

 4000.01

 5500.00

 PROJECT ALPHA

 Mid-high

 5500.01

 6500.00

 PROJECT ALPHA

 Top

 6500.01

 99999999.99

 Grid

 LOW

 0.01

 3000.00

 Grid

 MED

 3000.01

 6000.00

 Grid

 HIGH

 6000.01

 99999999.99


        每一分段有一个开始值和一个结束值。 分段的粒度就是本段和下段之间的间隙。粒度必须是度量的最小可能值,在销售订单金额的示例中是0.01。最后一个分段的结束值是销售订单金额可能的最大值。下面的脚本用于建立分段维度数据仓库模式。
use dw;  
create table annual_order_segment_dim (  
    segment_sk int,  
    segment_name varchar(30),  
    band_name varchar(50),  
    band_start_amount decimal(10,2),
    band_end_amount decimal(10,2),
    version int,
    effective_date date,  
    expiry_date date  
)
clustered by (segment_sk) into 8 buckets          
stored as orc tblproperties ('transactional'='true'); 

insert into annual_order_segment_dim values  (1, 'project alpha', 'bottom', 0.01, 2500.00, 1, '1900-01-01', '2200-01-01');
insert into annual_order_segment_dim values  (2, 'project alpha', 'low', 2500.01, 3000.00, 1, '1900-01-01', '2200-01-01');
insert into annual_order_segment_dim values  (3, 'project alpha', 'mid-low', 3000.01, 4000.00, 1, '1900-01-01', '2200-01-01');
insert into annual_order_segment_dim values  (4, 'project alpha', 'mid', 4000.01, 5500.00, 1, '1900-01-01', '2200-01-01');
insert into annual_order_segment_dim values  (5, 'project alpha', 'mid_high', 5500.01, 6500.00, 1, '1900-01-01', '2200-01-01');
insert into annual_order_segment_dim values  (6, 'project alpha', 'top', 6500.01, 99999999.99, 1, ' 1900-01-01', '2200-01-01');  
insert into annual_order_segment_dim values  (7, 'grid', 'low', 0.01, 3000, 1, '1900-01-01', '2200-01-01');  
insert into annual_order_segment_dim values  (8, 'grid', 'med', 3000.01, 6000.00, 1, ' 1900-01-01', '2200-01-01');
insert into annual_order_segment_dim values  (9, 'grid', 'high', 6000.01, 99999999.99, 1, '1900-01-01', '2200-01-01');  

create table year_dim (  
    year_sk int,  
    year int
);  
  
create table annual_sales_order_fact (  
    customer_sk int,  
    year_sk int,  
    annual_order_amount decimal(10, 2)  
); 
  
create table annual_customer_segment_fact (  
    segment_sk int,  
    customer_sk int,  
    year_sk int  
);

2. 初始装载
        本小节说明初始装载并进行测试。下面的初始装载脚本将order_date维度表(date_dim表的一个视图)里的数据导入year_dim表,将sales_order_fact表里的数据导入annual_sales_order_fact表,将annual_sales_order_fact表里的数据导入annual_customer_segment_fact表。此脚本装载所有历史数据。
use dw;  
  
insert into year_dim  
select row_number() over (order by t1.year) + t2.sk_max, year
  from (select distinct year year from order_date_dim) t1
 cross join (select coalesce(max(year_sk),0) sk_max from year_dim) t2; 

insert into annual_sales_order_fact  
select a.customer_sk, 
       year_sk, 
       sum(order_amount)  
  from sales_order_fact a,
       year_dim c,
       order_date_dim d  
 where a.order_date_sk = d.order_date_sk  
   and c.year = d.year  
   and d.year < 2017  
 group by a.customer_sk, c.year_sk;  
  
insert into annual_customer_segment_fact  
select d.segment_sk, 
       a.customer_sk, 
       a.year_sk  
  from annual_sales_order_fact a, 
       annual_order_segment_dim d  
 where annual_order_amount >= band_start_amount  
   and annual_order_amount <= band_end_amount;  
        执行初始装载脚本,查询annual_customer_segment_fact表确认初始装载是成功的。
select a.customer_sk csk,
       a.year_sk ysk,
       annual_order_amount amt,
       segment_name sn,
       band_name bn
  from annual_customer_segment_fact a,
       annual_order_segment_dim b,
       year_dim c,
       annual_sales_order_fact d
 where a.segment_sk = b.segment_sk
   and a.year_sk = c.year_sk
   and a.customer_sk = d.customer_sk
   and a.year_sk = d.year_sk
cluster by csk, ysk, sn, bn;
        查询结果如下图所示。

        注意,这里是按客户代理键customer_sk分组求和来判断分段,实际情况可能是以customer_number进行分组的,因为无论客户的scd属性如何变化,一般还是认为是一个客户。

3. 定期装载
        本小节说明定期装载脚本和如何测试它。除了无需装载year_dim表以外,定期装载与初始装载类似。annual_sales_order_fact表里的数据被导入annual_customer_segment_fact表。每年调度执行下面的定期装载脚本,此脚本装载前一年的销售数据。
use dw;  

insert into annual_sales_order_fact  
select a.customer_sk, 
       year_sk, 
       sum(order_amount)  
  from sales_order_fact a,
       year_dim c,
       order_date_dim d  
 where a.order_date_sk = d.order_date_sk  
   and c.year = d.year  
   and d.year = year(current_date) - 1  
 group by a.customer_sk, c.year_sk;  
  
insert into annual_customer_segment_fact  
select d.segment_sk, 
       a.customer_sk, 
       c.year_sk  
  from annual_sales_order_fact a, 
       year_dim c,
       annual_order_segment_dim d  
 where a.year_sk = c.year_sk
   and c.year = year(current_date) - 1
   and annual_order_amount >= band_start_amount  
   and annual_order_amount <= band_end_amount;


你可能感兴趣的:(基于hadoop生态圈的数据仓库实践 —— 进阶技术(十七))