- 基于kylin-3.0
背景
- 学习kylin,当搭建好kylin平台后,无论是单节点的还是集群模式的,接下来就是需要进行kylin官方提供的样例cube进行学习;
- 脚本位置:
${KYLIN_HOME}/bin/sample.sh
,直接执行即可; - 执行结束后,默认是在default库下,会生成5张表;暂命名为
销售模型
,具体如下: - 一定要耐着性子、仔细的理解一下这几张表,对后面的model、cube的创建的理解有很大的帮助;
# 用户账户表
kylin_account
# 日期维度
kylin_cal_dt
# 商品类别表
kylin_category_groupings
# 地理位置国家表
kylin_country
# Sales order table, fact table
kylin_sales
数仓模型
- 典型的星形模型(start model)。其中,事实表为:
kylin_sales
(此表为非分区表,这一点需要注意);维度表(或查找表Lookup table):kylin_account
,kylin_cal_dt
,kylin_category_groupings
,kylin_country
。 - 下面具体学习一下各个表的结构;
事实表 kylin_sales
- 表结构如下:
CREATE TABLE `kylin_sales`(
`trans_id` bigint,
`part_dt` date COMMENT 'Order Date',
`lstg_format_name` string COMMENT 'Order Transaction Type',
`leaf_categ_id` bigint COMMENT 'Category ID',
`lstg_site_id` int COMMENT 'Site ID',
`slr_segment_cd` smallint,
`price` decimal(19,4) COMMENT 'Order Price',
`item_count` bigint COMMENT 'Number of Purchased Goods',
`seller_id` bigint COMMENT 'Seller ID',
`buyer_id` bigint COMMENT 'Buyer ID',
`ops_user_id` string COMMENT 'System User ID',
`ops_region` string COMMENT 'System User Region')
COMMENT 'Sales order table, fact table'
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://bigdata/user/hive/warehouse/kylin_sales'
TBLPROPERTIES (
'transient_lastDdlTime'='1580892437')
- 在cube的设计中,采用字段
part_dt
作为时间字段; - 个人觉得事实表应该设计成以dt为分区字段的分区表;
维度表 kylin_account
- 表结构如下:
CREATE TABLE `kylin_account`(
`account_id` bigint,
`account_buyer_level` int COMMENT 'Account Buyer Level',
`account_seller_level` int COMMENT 'Account Seller Level',
`account_country` string COMMENT 'Account Country',
`account_contact` string COMMENT 'Account Contact Info')
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://bigdata/user/hive/warehouse/kylin_account'
TBLPROPERTIES (
'transient_lastDdlTime'='1580892437')
- 维度表为非分区表;
KYLIN_SALES.BUYER_ID = BUYER_ACCOUNT.ACCOUNT_ID
KYLIN_SALES.SELLER_ID = SELLER_ACCOUNT.ACCOUNT_ID
维度表 kylin_cal_dt
- 表结构如下:
CREATE TABLE `kylin_cal_dt`(
`cal_dt` date COMMENT 'Date, PK',
`year_beg_dt` date COMMENT 'YEAR Begin Date',
`qtr_beg_dt` date COMMENT 'Quarter Begin Date',
`month_beg_dt` date COMMENT 'Month Begin Date',
`week_beg_dt` date COMMENT 'Week Begin Date',
`age_for_year_id` smallint,
`age_for_qtr_id` smallint,
`age_for_month_id` smallint,
`age_for_week_id` smallint,
`age_for_dt_id` smallint,
`age_for_rtl_year_id` smallint,
`age_for_rtl_qtr_id` smallint,
`age_for_rtl_month_id` smallint,
`age_for_rtl_week_id` smallint,
`age_for_cs_week_id` smallint,
`day_of_cal_id` int,
`day_of_year_id` smallint,
`day_of_qtr_id` smallint,
`day_of_month_id` smallint,
`day_of_week_id` int,
`week_of_year_id` tinyint,
`week_of_cal_id` int,
`month_of_qtr_id` tinyint,
`month_of_year_id` tinyint,
`month_of_cal_id` smallint,
`qtr_of_year_id` tinyint,
`qtr_of_cal_id` smallint,
`year_of_cal_id` smallint,
`year_end_dt` string,
`qtr_end_dt` string,
`month_end_dt` string,
`week_end_dt` string,
`cal_dt_name` string,
`cal_dt_desc` string,
`cal_dt_short_name` string,
`ytd_yn_id` tinyint,
`qtd_yn_id` tinyint,
`mtd_yn_id` tinyint,
`wtd_yn_id` tinyint,
`season_beg_dt` string,
`day_in_year_count` smallint,
`day_in_qtr_count` tinyint,
`day_in_month_count` tinyint,
`day_in_week_count` tinyint,
`rtl_year_beg_dt` string,
`rtl_qtr_beg_dt` string,
`rtl_month_beg_dt` string,
`rtl_week_beg_dt` string,
`cs_week_beg_dt` string,
`cal_date` string,
`day_of_week` string,
`month_id` string,
`prd_desc` string,
`prd_flag` string,
`prd_id` string,
`prd_ind` string,
`qtr_desc` string,
`qtr_id` string,
`qtr_ind` string,
`retail_week` string,
`retail_year` string,
`retail_start_date` string,
`retail_wk_end_date` string,
`week_ind` string,
`week_num_desc` string,
`week_beg_date` string,
`week_end_date` string,
`week_in_year_id` string,
`week_id` string,
`week_beg_end_desc_mdy` string,
`week_beg_end_desc_md` string,
`year_id` string,
`year_ind` string,
`cal_dt_mns_1year_dt` string,
`cal_dt_mns_2year_dt` string,
`cal_dt_mns_1qtr_dt` string,
`cal_dt_mns_2qtr_dt` string,
`cal_dt_mns_1month_dt` string,
`cal_dt_mns_2month_dt` string,
`cal_dt_mns_1week_dt` string,
`cal_dt_mns_2week_dt` string,
`curr_cal_dt_mns_1year_yn_id` tinyint,
`curr_cal_dt_mns_2year_yn_id` tinyint,
`curr_cal_dt_mns_1qtr_yn_id` tinyint,
`curr_cal_dt_mns_2qtr_yn_id` tinyint,
`curr_cal_dt_mns_1month_yn_id` tinyint,
`curr_cal_dt_mns_2month_yn_id` tinyint,
`curr_cal_dt_mns_1week_yn_ind` tinyint,
`curr_cal_dt_mns_2week_yn_ind` tinyint,
`rtl_month_of_rtl_year_id` string,
`rtl_qtr_of_rtl_year_id` tinyint,
`rtl_week_of_rtl_year_id` tinyint,
`season_of_year_id` tinyint,
`ytm_yn_id` tinyint,
`ytq_yn_id` tinyint,
`ytw_yn_id` tinyint,
`kylin_cal_dt_cre_date` string,
`kylin_cal_dt_cre_user` string,
`kylin_cal_dt_upd_date` string,
`kylin_cal_dt_upd_user` string)
COMMENT 'Date Dimension Table'
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://bigdata/user/hive/warehouse/kylin_cal_dt'
TBLPROPERTIES (
'transient_lastDdlTime'='1580892438')
- 这个日期维度的表结构,可以适配绝大多数的日期维度;
KYLIN_SALES.PART_DT = KYLIN_CAL_DT.CAL_DT
维度表 kylin_category_groupings
- 表结构
CREATE TABLE `kylin_category_groupings`(
`leaf_categ_id` bigint COMMENT 'Category ID, PK',
`leaf_categ_name` string,
`site_id` int COMMENT 'Site ID, PK',
`categ_busn_mgr` string,
`categ_busn_unit` string,
`regn_categ` string,
`user_defined_field1` string COMMENT 'User Defined Field1',
`user_defined_field3` string COMMENT 'User Defined Field3',
`kylin_groupings_cre_date` string,
`kylin_groupings_upd_date` string COMMENT 'Last Updated Date',
`kylin_groupings_cre_user` string,
`kylin_groupings_upd_user` string COMMENT 'Last Updated User',
`meta_categ_id` decimal(10,0),
`meta_categ_name` string COMMENT 'Level1 Category',
`categ_lvl2_id` decimal(10,0),
`categ_lvl3_id` decimal(10,0),
`categ_lvl4_id` decimal(10,0),
`categ_lvl5_id` decimal(10,0),
`categ_lvl6_id` decimal(10,0),
`categ_lvl7_id` decimal(10,0),
`categ_lvl2_name` string COMMENT 'Level2 Category',
`categ_lvl3_name` string COMMENT 'Level3 Category',
`categ_lvl4_name` string,
`categ_lvl5_name` string,
`categ_lvl6_name` string,
`categ_lvl7_name` string,
`categ_flags` decimal(10,0),
`adult_categ_yn` string,
`domain_id` decimal(10,0),
`user_defined_field5` string,
`vcs_id` decimal(10,0),
`gcs_id` decimal(10,0),
`move_to` decimal(10,0),
`sap_category_id` decimal(10,0),
`src_id` tinyint,
`bsns_vrtcl_name` string)
COMMENT 'Detail category inforamtion, Dimension Table'
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://bigdata/user/hive/warehouse/kylin_category_groupings'
TBLPROPERTIES (
'transient_lastDdlTime'='1580892439')
- 表结构 也不短;
KYLIN_SALES.LEAF_CATEG_ID = KYLIN_CATEGORY_GROUPINGS.LEAF_CATEG_ID
KYLIN_SALES.LSTG_SITE_ID = KYLIN_CATEGORY_GROUPINGS.SITE_ID
维度表 kylin_country
- 表结构
CREATE TABLE `kylin_country`(
`country` string,
`latitude` double,
`longitude` double,
`name` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://bigdata/user/hive/warehouse/kylin_country'
TBLPROPERTIES (
'transient_lastDdlTime'='1580892438')
- 源码中设计这个国家级别,有点牵强。
- 在实际中,我们需要考虑到地区、城市、城市分区等信息;
BUYER_ACCOUNT.ACCOUNT_COUNTRY = BUYER_COUNTRY.COUNTRY
SELLER_ACCOUNT.ACCOUNT_COUNTRY = SELLER_COUNTRY.COUNTRY
模型关系图
-
在官网给定的learn_kylin中,model:kylin_sales_model中表结构关系如下: