kylin-learn_kylin表结构解读

  • 基于kylin-3.0

背景

  • 学习kylin,当搭建好kylin平台后,无论是单节点的还是集群模式的,接下来就是需要进行kylin官方提供的样例cube进行学习;
  • 脚本位置:${KYLIN_HOME}/bin/sample.sh,直接执行即可;
  • 执行结束后,默认是在default库下,会生成5张表;暂命名为销售模型,具体如下:
  • 一定要耐着性子、仔细的理解一下这几张表,对后面的model、cube的创建的理解有很大的帮助;
# 用户账户表
kylin_account
# 日期维度
kylin_cal_dt
# 商品类别表
kylin_category_groupings
# 地理位置国家表
kylin_country
# Sales order table, fact table
kylin_sales

数仓模型

  • 典型的星形模型(start model)。其中,事实表为:kylin_sales(此表为非分区表,这一点需要注意);维度表(或查找表Lookup table):kylin_account , kylin_cal_dtkylin_category_groupingskylin_country
  • 下面具体学习一下各个表的结构;

事实表 kylin_sales

  • 表结构如下:
CREATE TABLE `kylin_sales`(
  `trans_id` bigint, 
  `part_dt` date COMMENT 'Order Date', 
  `lstg_format_name` string COMMENT 'Order Transaction Type', 
  `leaf_categ_id` bigint COMMENT 'Category ID', 
  `lstg_site_id` int COMMENT 'Site ID', 
  `slr_segment_cd` smallint, 
  `price` decimal(19,4) COMMENT 'Order Price', 
  `item_count` bigint COMMENT 'Number of Purchased Goods', 
  `seller_id` bigint COMMENT 'Seller ID', 
  `buyer_id` bigint COMMENT 'Buyer ID', 
  `ops_user_id` string COMMENT 'System User ID', 
  `ops_region` string COMMENT 'System User Region')
COMMENT 'Sales order table, fact table'
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'=',', 
  'serialization.format'=',') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://bigdata/user/hive/warehouse/kylin_sales'
TBLPROPERTIES (
  'transient_lastDdlTime'='1580892437')
  • 在cube的设计中,采用字段part_dt作为时间字段;
  • 个人觉得事实表应该设计成以dt为分区字段的分区表;

维度表 kylin_account

  • 表结构如下:
CREATE TABLE `kylin_account`(
  `account_id` bigint, 
  `account_buyer_level` int COMMENT 'Account Buyer Level', 
  `account_seller_level` int COMMENT 'Account Seller Level', 
  `account_country` string COMMENT 'Account Country', 
  `account_contact` string COMMENT 'Account Contact Info')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'=',', 
  'serialization.format'=',') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://bigdata/user/hive/warehouse/kylin_account'
TBLPROPERTIES (
  'transient_lastDdlTime'='1580892437')
  • 维度表为非分区表;
  • KYLIN_SALES.BUYER_ID = BUYER_ACCOUNT.ACCOUNT_ID
  • KYLIN_SALES.SELLER_ID = SELLER_ACCOUNT.ACCOUNT_ID

维度表 kylin_cal_dt

  • 表结构如下:
CREATE TABLE `kylin_cal_dt`(
  `cal_dt` date COMMENT 'Date, PK', 
  `year_beg_dt` date COMMENT 'YEAR Begin Date', 
  `qtr_beg_dt` date COMMENT 'Quarter Begin Date', 
  `month_beg_dt` date COMMENT 'Month Begin Date', 
  `week_beg_dt` date COMMENT 'Week Begin Date', 
  `age_for_year_id` smallint, 
  `age_for_qtr_id` smallint, 
  `age_for_month_id` smallint, 
  `age_for_week_id` smallint, 
  `age_for_dt_id` smallint, 
  `age_for_rtl_year_id` smallint, 
  `age_for_rtl_qtr_id` smallint, 
  `age_for_rtl_month_id` smallint, 
  `age_for_rtl_week_id` smallint, 
  `age_for_cs_week_id` smallint, 
  `day_of_cal_id` int, 
  `day_of_year_id` smallint, 
  `day_of_qtr_id` smallint, 
  `day_of_month_id` smallint, 
  `day_of_week_id` int, 
  `week_of_year_id` tinyint, 
  `week_of_cal_id` int, 
  `month_of_qtr_id` tinyint, 
  `month_of_year_id` tinyint, 
  `month_of_cal_id` smallint, 
  `qtr_of_year_id` tinyint, 
  `qtr_of_cal_id` smallint, 
  `year_of_cal_id` smallint, 
  `year_end_dt` string, 
  `qtr_end_dt` string, 
  `month_end_dt` string, 
  `week_end_dt` string, 
  `cal_dt_name` string, 
  `cal_dt_desc` string, 
  `cal_dt_short_name` string, 
  `ytd_yn_id` tinyint, 
  `qtd_yn_id` tinyint, 
  `mtd_yn_id` tinyint, 
  `wtd_yn_id` tinyint, 
  `season_beg_dt` string, 
  `day_in_year_count` smallint, 
  `day_in_qtr_count` tinyint, 
  `day_in_month_count` tinyint, 
  `day_in_week_count` tinyint, 
  `rtl_year_beg_dt` string, 
  `rtl_qtr_beg_dt` string, 
  `rtl_month_beg_dt` string, 
  `rtl_week_beg_dt` string, 
  `cs_week_beg_dt` string, 
  `cal_date` string, 
  `day_of_week` string, 
  `month_id` string, 
  `prd_desc` string, 
  `prd_flag` string, 
  `prd_id` string, 
  `prd_ind` string, 
  `qtr_desc` string, 
  `qtr_id` string, 
  `qtr_ind` string, 
  `retail_week` string, 
  `retail_year` string, 
  `retail_start_date` string, 
  `retail_wk_end_date` string, 
  `week_ind` string, 
  `week_num_desc` string, 
  `week_beg_date` string, 
  `week_end_date` string, 
  `week_in_year_id` string, 
  `week_id` string, 
  `week_beg_end_desc_mdy` string, 
  `week_beg_end_desc_md` string, 
  `year_id` string, 
  `year_ind` string, 
  `cal_dt_mns_1year_dt` string, 
  `cal_dt_mns_2year_dt` string, 
  `cal_dt_mns_1qtr_dt` string, 
  `cal_dt_mns_2qtr_dt` string, 
  `cal_dt_mns_1month_dt` string, 
  `cal_dt_mns_2month_dt` string, 
  `cal_dt_mns_1week_dt` string, 
  `cal_dt_mns_2week_dt` string, 
  `curr_cal_dt_mns_1year_yn_id` tinyint, 
  `curr_cal_dt_mns_2year_yn_id` tinyint, 
  `curr_cal_dt_mns_1qtr_yn_id` tinyint, 
  `curr_cal_dt_mns_2qtr_yn_id` tinyint, 
  `curr_cal_dt_mns_1month_yn_id` tinyint, 
  `curr_cal_dt_mns_2month_yn_id` tinyint, 
  `curr_cal_dt_mns_1week_yn_ind` tinyint, 
  `curr_cal_dt_mns_2week_yn_ind` tinyint, 
  `rtl_month_of_rtl_year_id` string, 
  `rtl_qtr_of_rtl_year_id` tinyint, 
  `rtl_week_of_rtl_year_id` tinyint, 
  `season_of_year_id` tinyint, 
  `ytm_yn_id` tinyint, 
  `ytq_yn_id` tinyint, 
  `ytw_yn_id` tinyint, 
  `kylin_cal_dt_cre_date` string, 
  `kylin_cal_dt_cre_user` string, 
  `kylin_cal_dt_upd_date` string, 
  `kylin_cal_dt_upd_user` string)
COMMENT 'Date Dimension Table'
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'=',', 
  'serialization.format'=',') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://bigdata/user/hive/warehouse/kylin_cal_dt'
TBLPROPERTIES (
  'transient_lastDdlTime'='1580892438')
  • 这个日期维度的表结构,可以适配绝大多数的日期维度;
  • KYLIN_SALES.PART_DT = KYLIN_CAL_DT.CAL_DT

维度表 kylin_category_groupings

  • 表结构
CREATE TABLE `kylin_category_groupings`(
  `leaf_categ_id` bigint COMMENT 'Category ID, PK', 
  `leaf_categ_name` string, 
  `site_id` int COMMENT 'Site ID, PK', 
  `categ_busn_mgr` string, 
  `categ_busn_unit` string, 
  `regn_categ` string, 
  `user_defined_field1` string COMMENT 'User Defined Field1', 
  `user_defined_field3` string COMMENT 'User Defined Field3', 
  `kylin_groupings_cre_date` string, 
  `kylin_groupings_upd_date` string COMMENT 'Last Updated Date', 
  `kylin_groupings_cre_user` string, 
  `kylin_groupings_upd_user` string COMMENT 'Last Updated User', 
  `meta_categ_id` decimal(10,0), 
  `meta_categ_name` string COMMENT 'Level1 Category', 
  `categ_lvl2_id` decimal(10,0), 
  `categ_lvl3_id` decimal(10,0), 
  `categ_lvl4_id` decimal(10,0), 
  `categ_lvl5_id` decimal(10,0), 
  `categ_lvl6_id` decimal(10,0), 
  `categ_lvl7_id` decimal(10,0), 
  `categ_lvl2_name` string COMMENT 'Level2 Category', 
  `categ_lvl3_name` string COMMENT 'Level3 Category', 
  `categ_lvl4_name` string, 
  `categ_lvl5_name` string, 
  `categ_lvl6_name` string, 
  `categ_lvl7_name` string, 
  `categ_flags` decimal(10,0), 
  `adult_categ_yn` string, 
  `domain_id` decimal(10,0), 
  `user_defined_field5` string, 
  `vcs_id` decimal(10,0), 
  `gcs_id` decimal(10,0), 
  `move_to` decimal(10,0), 
  `sap_category_id` decimal(10,0), 
  `src_id` tinyint, 
  `bsns_vrtcl_name` string)
COMMENT 'Detail category inforamtion, Dimension Table'
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'=',', 
  'serialization.format'=',') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://bigdata/user/hive/warehouse/kylin_category_groupings'
TBLPROPERTIES (
  'transient_lastDdlTime'='1580892439')
  • 表结构 也不短;
  • KYLIN_SALES.LEAF_CATEG_ID = KYLIN_CATEGORY_GROUPINGS.LEAF_CATEG_ID
  • KYLIN_SALES.LSTG_SITE_ID = KYLIN_CATEGORY_GROUPINGS.SITE_ID

维度表 kylin_country

  • 表结构
CREATE TABLE `kylin_country`(
  `country` string, 
  `latitude` double, 
  `longitude` double, 
  `name` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'=',', 
  'serialization.format'=',') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://bigdata/user/hive/warehouse/kylin_country'
TBLPROPERTIES (
  'transient_lastDdlTime'='1580892438')
  • 源码中设计这个国家级别,有点牵强。
  • 在实际中,我们需要考虑到地区、城市、城市分区等信息;
  • BUYER_ACCOUNT.ACCOUNT_COUNTRY = BUYER_COUNTRY.COUNTRY
  • SELLER_ACCOUNT.ACCOUNT_COUNTRY = SELLER_COUNTRY.COUNTRY

模型关系图

  • 在官网给定的learn_kylin中,model:kylin_sales_model中表结构关系如下:


    kylin-learn_kylin表结构解读_第1张图片
    kylin_sales_model

你可能感兴趣的:(kylin-learn_kylin表结构解读)