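(5) Advanced Techniques
7. Multi-path and Ragged Hierarchies
This part covers multi-path hierarchies, which extend the single-path hierarchy. In the previous part, the month dimension of the data warehouse had only one hierarchy path: year-quarter-month. Here we add a new level, the campaign session, and with it a new year-campaign session-month path. The month dimension then has two hierarchy paths, which makes it a multi-path hierarchy. The second topic is the ragged hierarchy, a hierarchy that has no data at one or more of its levels.
Adding a Hierarchy
The script in Listing (5)-7-1 adds a new column named campaign_session to the month_dim table and creates the campaign_session_stg staging table. Figure (5)-7-1 shows the schema after the change.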
USE dw;
-- Add the campaign session column
ALTER TABLE month_dim ADD campaign_session CHAR (30) AFTER month;
-- Create the campaign session staging table
CREATE TABLE campaign_session_stg (
campaign_session CHAR(30),
month CHAR(9),
year INT(4)
);
Listing (5)-7-1
Figure (5)-7-1
To see how campaign sessions work, look at the sample campaign sessions in Table (5)-7-1.
Campaign Session     | Month
2005 First Campaign  | January-April
2005 Second Campaign | May-July
2005 Third Campaign  | August
2005 Last Campaign   | September-December
Table (5)-7-1
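Each campaign session covers one or more months. A campaign session does not necessarily coincide with a quarter, so the campaign session level cannot roll up to the quarter level, which would otherwise be the next level up. It can, however, roll up to the year level.
The 2014 campaign session data is shown below and is stored in the file /root/data-integration/campaign_session.csv.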
CAMPAIGN SESSION,MONTH,YEAR
2014 First Campaign,1,2014
2014 First Campaign,2,2014
2014 First Campaign,3,2014
2014 First Campaign,4,2014
2014 Second Campaign,5,2014
2014 Second Campaign,6,2014
2014 Second Campaign,7,2014
2014 Third Campaign,8,2014
2014 Last Campaign,9,2014
2014 Last Campaign,10,2014
2014 Last Campaign,11,2014
2014 Last Campaign,12,2014
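Data is normally not loaded straight from a text file into a data warehouse table; a staging table is used instead. Listing (5)-7-1 already includes the statement that creates the campaign_session_stg table.
Now run the script in Listing (5)-7-2 to load the 2014 campaign session data into the month dimension. The Kettle steps shown in Figures (5)-7-2 through (5)-7-6 perform the same load.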
USE dw;
TRUNCATE TABLE campaign_session_stg;
LOAD DATA INFILE '/root/data-integration/campaign_session.csv'
INTO TABLE campaign_session_stg
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(
campaign_session
, month
, year
);
UPDATE month_dim a,
campaign_session_stg b
SET
a.campaign_session = b.campaign_session
WHERE
a.month = b.month AND a.year = b.year;
COMMIT;
Listing (5)-7-2
Figure (5)-7-2
Figure (5)-7-3
Figure (5)-7-4
Figure (5)-7-5
Figure (5)-7-6
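The UPDATE in Listing (5)-7-2 silently skips any staging row whose month and year find no match in month_dim. As a minimal sketch, assuming only the tables and columns already used in the listings above, the following check lists such rows. An empty result suggests that every staged row was applied.

```sql
USE dw;

-- List staging rows that the UPDATE could not apply because
-- month_dim has no row with the same month and year.
SELECT
    b.campaign_session,
    b.month,
    b.year
FROM
    campaign_session_stg b
        LEFT JOIN
    month_dim a ON a.month = b.month AND a.year = b.year
WHERE
    a.month_sk IS NULL;
```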
Use the following query to confirm that the month_dim table was loaded correctly.
mysql> select month_sk, month_name, year, campaign_session
-> from month_dim
-> where year = 2014;
+----------+------------+------+----------------------+
| month_sk | month_name | year | campaign_session |
+----------+------------+------+----------------------+
| 169 | January | 2014 | 2014 First Campaign |
| 170 | February | 2014 | 2014 First Campaign |
| 171 | March | 2014 | 2014 First Campaign |
| 172 | April | 2014 | 2014 First Campaign |
| 173 | May | 2014 | 2014 Second Campaign |
| 174 | June | 2014 | 2014 Second Campaign |
| 175 | July | 2014 | 2014 Second Campaign |
| 176 | August | 2014 | 2014 Third Campaign |
| 177 | September | 2014 | 2014 Last Campaign |
| 178 | October | 2014 | 2014 Last Campaign |
| 179 | November | 2014 | 2014 Last Campaign |
| 180 | December | 2014 | 2014 Last Campaign |
+----------+------------+------+----------------------+
12 rows in set (0.36 sec)
Note that the campaign session CSV file should be loaded in January of each year, and it must be loaded before the month_end_sales_order_fact table is loaded.
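The claim that a campaign session rolls up to a year but not to a quarter can be checked against the loaded dimension. The query below is a sketch that assumes month_dim.quarter holds the calendar quarter, as the drill-down queries in this part imply. A session with quarters_spanned greater than 1 cannot roll up to a single quarter, while years_spanned should stay at 1 for every session.

```sql
USE dw;

-- For each loaded campaign session, count the months it covers and
-- the distinct quarters and years those months fall in.
SELECT
    campaign_session,
    COUNT(*) AS months_covered,
    COUNT(DISTINCT quarter) AS quarters_spanned,
    COUNT(DISTINCT year) AS years_spanned
FROM
    month_dim
WHERE
    campaign_session IS NOT NULL
GROUP BY campaign_session;
```

For instance, "2014 First Campaign" covers January through April, so under that assumption it falls in two quarters.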
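Adding the 2014 Data
In "(5) Advanced Techniques - 5. Snapshots", the February 2015 data was already loaded into the month_end_sales_order_fact table. Now run the script in Listing (5)-5-2, or the corresponding Kettle transformation, to add the data for all of 2014. The script must be run once per month. Before each run, remember to set the system date to the 1st of the month following the one being loaded. After all the new data is loaded (12 runs in total), query the month_end_sales_order_fact table with the SQL below to confirm that it was loaded correctly.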
mysql> select
-> month,
-> year,
-> product_name,
-> month_order_amount mo_amt,
-> month_order_quantity mo_qty
-> from
-> month_end_sales_order_fact a,
-> month_dim b,
-> product_dim c
-> where
-> a.order_month_sk = b.month_sk
-> and a.product_sk = c.product_sk
-> and year = 2014
-> order BY month , year , product_name;
+-------+------+-----------------+---------+--------+
| month | year | product_name | mo_amt | mo_qty |
+-------+------+-----------------+---------+--------+
| 1 | 2014 | LCD Panel | 1000.00 | NULL |
| 2 | 2014 | Hard Disk Drive | 1000.00 | NULL |
| 3 | 2014 | Floppy Drive | 2000.00 | NULL |
| 4 | 2014 | LCD Panel | 2500.00 | NULL |
| 5 | 2014 | Hard Disk Drive | 3000.00 | NULL |
| 6 | 2014 | Floppy Drive | 3500.00 | NULL |
| 7 | 2014 | LCD Panel | 4000.00 | NULL |
| 8 | 2014 | Hard Disk Drive | 4500.00 | NULL |
| 9 | 2014 | Floppy Drive | 1000.00 | NULL |
| 10 | 2014 | LCD Panel | 1000.00 | NULL |
+-------+------+-----------------+---------+--------+
10 rows in set (0.00 sec)
Note that there is no data for November and December.
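Months without fact rows do not appear in the inner-join query above, so they are easy to overlook. As a sketch using only the tables and columns already shown, an outer join from the month dimension lists them explicitly. With the data above it should return November and December.

```sql
USE dw;

-- 2014 months that have no row in the month-end fact table.
SELECT
    b.month,
    b.month_name
FROM
    month_dim b
        LEFT JOIN
    month_end_sales_order_fact a ON a.order_month_sk = b.month_sk
WHERE
    b.year = 2014
        AND a.order_month_sk IS NULL
ORDER BY b.month;
```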
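Hierarchy Queries
The two sample queries in this section cover the two hierarchy paths of the month dimension. The first, shown in Listing (5)-7-3, drills down the year-quarter-month path. It resembles the drill-down query in the previous part, "Dimension Hierarchies", except that it reads the month_end_sales_order_fact table whereas the earlier one read the sales_order_fact table. (The corresponding Kettle transformation steps are also similar to those in the previous part and are omitted here.) The result is shown in Figure (5)-7-7.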
USE dw;
SELECT
product_category, time, order_amount, order_quantity
FROM
((SELECT
product_category,
year,
1 month,
year time,
1 sequence,
SUM(month_order_amount) order_amount,
SUM(month_order_quantity) order_quantity
FROM
month_end_sales_order_fact a, product_dim b, month_dim c
WHERE
a.product_sk = b.product_sk
AND a.order_month_sk = c.month_sk
AND year = 2014
GROUP BY product_category , year) UNION ALL (SELECT
product_category,
year,
month,
quarter time,
2 sequence,
SUM(month_order_amount) order_amount,
SUM(month_order_quantity) order_quantity
FROM
month_end_sales_order_fact a, product_dim b, month_dim c
WHERE
a.product_sk = b.product_sk
AND a.order_month_sk = c.month_sk
AND year = 2014
GROUP BY product_category , year , quarter) UNION ALL (SELECT
product_category,
year,
month,
month_name time,
3 sequence,
SUM(month_order_amount) order_amount,
SUM(month_order_quantity) order_quantity
FROM
month_end_sales_order_fact a, product_dim b, month_dim c
WHERE
a.product_sk = b.product_sk
AND a.order_month_sk = c.month_sk
AND year = 2014
GROUP BY product_category , year , quarter , month)) x
ORDER BY product_category , year , month , sequence;
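Listing (5)-7-3
Figure (5)-7-7
The second query, shown in Listing (5)-7-4, drills down the year-campaign session-month hierarchy. It has the same structure as the first, except that it groups by campaign session instead of quarter. The result is shown in Figure (5)-7-8.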
USE dw;
SELECT
product_category, time, order_amount, order_quantity
FROM
((SELECT
product_category,
year,
1 month,
year time,
1 sequence,
SUM(month_order_amount) order_amount,
SUM(month_order_quantity) order_quantity
FROM
month_end_sales_order_fact a, product_dim b, month_dim c
WHERE
a.product_sk = b.product_sk
AND a.order_month_sk = c.month_sk
AND year = 2014
GROUP BY product_category , year) UNION ALL (SELECT
product_category,
year,
month,
campaign_session time,
2 sequence,
SUM(month_order_amount) order_amount,
SUM(month_order_quantity) order_quantity
FROM
month_end_sales_order_fact a, product_dim b, month_dim c
WHERE
a.product_sk = b.product_sk
AND a.order_month_sk = c.month_sk
AND year = 2014
GROUP BY product_category , year , campaign_session) UNION ALL (SELECT
product_category,
year,
month,
month_name time,
3 sequence,
SUM(month_order_amount) order_amount,
SUM(month_order_quantity) order_quantity
FROM
month_end_sales_order_fact a, product_dim b, month_dim c
WHERE
a.product_sk = b.product_sk
AND a.order_month_sk = c.month_sk
AND year = 2014
GROUP BY product_category , year , quarter , month)) x
ORDER BY product_category , year , month , sequence
;
Listing (5)-7-4
Figure (5)-7-8
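The three UNION ALL branches compute one aggregate per hierarchy level. MySQL's WITH ROLLUP modifier can produce comparable subtotals in a single pass. The query below is an alternative sketch, not the approach this article uses. It marks each subtotal row with NULL in the rolled-up columns instead of a time label, and it also emits a per-category row and a grand-total row that Listing (5)-7-4 does not.

```sql
USE dw;

-- One pass over the fact table; WITH ROLLUP adds subtotal rows for
-- campaign session, year, product category and the grand total.
-- In a subtotal row the rolled-up columns are NULL.
SELECT
    b.product_category,
    c.year,
    c.campaign_session,
    c.month,
    SUM(a.month_order_amount) AS order_amount,
    SUM(a.month_order_quantity) AS order_quantity
FROM
    month_end_sales_order_fact a
        JOIN
    product_dim b ON a.product_sk = b.product_sk
        JOIN
    month_dim c ON a.order_month_sk = c.month_sk
WHERE
    c.year = 2014
GROUP BY b.product_category , c.year , c.campaign_session , c.month WITH ROLLUP;
```

Rows with a non-NULL campaign_session and a NULL month correspond to the campaign-session level of the drill-down.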
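Ragged Hierarchies
A hierarchy that has no data at one or more of its levels is called a ragged hierarchy. For example, if some months have no campaign session, the month dimension has a ragged campaign session hierarchy. This section explains ragged hierarchies and how they apply to campaign sessions.
Below is an example of ragged campaign sessions (in the file /root/data-integration/ragged_campaign.csv): January, April, September, October, November, and December of 2014 have no campaign session.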
CAMPAIGN SESSION,MONTH,YEAR
NULL,1,2014
2014 Early Spring Campaign,2,2014
2014 Early Spring Campaign,3,2014
NULL,4,2014
2014 Spring Campaign,5,2014
2014 Spring Campaign,6,2014
2014 Last Campaign,7,2014
2014 Last Campaign,8,2014
NULL,9,2014
NULL,10,2014
NULL,11,2014
NULL,12,2014
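First use the statements below to set the campaign_session column to NULL, then run the script in Listing (5)-7-5 to load the campaign session data into the month_dim table.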
USE dw;
UPDATE month_dim SET campaign_session = NULL ;
COMMIT ;
USE dw;
TRUNCATE TABLE campaign_session_stg;
LOAD DATA INFILE '/root/data-integration/ragged_campaign.csv'
INTO TABLE campaign_session_stg
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(
campaign_session
, month
, year
);
UPDATE month_dim a,
campaign_session_stg b
SET
a.campaign_session = (case
when b.campaign_session IS NOT NULL then b.campaign_session
else a.month_name
end)
WHERE
a.month = b.month AND a.year = b.year;
COMMIT ;
Listing (5)-7-5
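The CASE expression in Listing (5)-7-5 only fills in the month name if the unquoted word NULL in the CSV file arrived in the staging table as a real SQL NULL. As far as the LOAD DATA documentation describes it, that happens when FIELDS ENCLOSED BY is non-empty, as it is in this listing. The sketch below checks this and previews the stored values; COALESCE is used as a shorter equivalent of the CASE expression. For the file above, the first query should count six rows.

```sql
USE dw;

-- Count staged rows whose campaign session was read as SQL NULL
-- rather than as the four-character string 'NULL'.
SELECT
    COUNT(*) AS null_sessions
FROM
    campaign_session_stg
WHERE
    campaign_session IS NULL;

-- Preview the value the UPDATE stores for each month: the staged
-- campaign session, or the month name when the session is NULL.
SELECT
    a.month,
    a.month_name,
    b.campaign_session AS staged_session,
    COALESCE(b.campaign_session, a.month_name) AS stored_session
FROM
    month_dim a
        JOIN
    campaign_session_stg b ON a.month = b.month AND a.year = b.year
ORDER BY a.month;
```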
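Figures (5)-7-9 through (5)-7-12 show the changes to the earlier Kettle steps that perform the same load as the script in Listing (5)-7-5.
Figure (5)-7-9
Figure (5)-7-10
Figure (5)-7-11
Figure (5)-7-12
Use the following SQL statement to query the month_dim table and confirm that the load is correct.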
mysql> select month_sk, month_name, year, campaign_session
-> from month_dim
-> where year = 2014;
+----------+------------+------+----------------------------+
| month_sk | month_name | year | campaign_session |
+----------+------------+------+----------------------------+
| 169 | January | 2014 | January |
| 170 | February | 2014 | 2014 Early Spring Campaign |
| 171 | March | 2014 | 2014 Early Spring Campaign |
| 172 | April | 2014 | April |
| 173 | May | 2014 | 2014 Spring Campaign |
| 174 | June | 2014 | 2014 Spring Campaign |
| 175 | July | 2014 | 2014 Last Campaign |
| 176 | August | 2014 | 2014 Last Campaign |
| 177 | September | 2014 | September |
| 178 | October | 2014 | October |
| 179 | November | 2014 | November |
| 180 | December | 2014 | December |
+----------+------------+------+----------------------------+
12 rows in set (0.01 sec)
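Because months without a campaign session now carry their own name in campaign_session, they can be picked out by comparing the two columns. This is a small sketch that assumes no real campaign session is ever named exactly like a month. For the rows shown above it returns January, April, September, October, November, and December.

```sql
USE dw;

-- 2014 months that stand in for a missing campaign session:
-- their campaign_session is simply the month's own name.
SELECT
    month,
    month_name
FROM
    month_dim
WHERE
    year = 2014
        AND campaign_session = month_name
ORDER BY month;
```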
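Run the script in Listing (5)-7-4 again; the result is shown in Figure (5)-7-13.
Figure (5)-7-13
The query result shows that, on paths where the months have a campaign session, the month-level rows add up to the campaign-session-level row. For months without a campaign session, the campaign-session-level row is identical to the month-level row; in other words, those months roll up to themselves. For example, January has no campaign session, so the output contains two January rows (rows 2 and 3). Row 3 is the month-level row, and row 2 stands in for the missing campaign session. For such months, the sales order amount (the order_amount column in the output) on the campaign session row equals the amount on the month row.
For Storage products, February and March belong to the same campaign session, "2014 Early Spring Campaign". Each of these months therefore has its own row and rolls up to the campaign session, whose amount is the sum of the two months' sales order amounts.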