The goal of this hands-on project is to see the big picture in miniature: by running business analysis on a Hive data warehouse built over a small dataset, you get familiar with how a Hive warehouse solves real problems under production-like conditions. This article is long. It covers everything from preparation through data analysis: designs, code, problems, and solutions. The data files being analyzed and the Zeppelin source files are linked at the top of the article; please download them and set up the Zeppelin and Hive environment before reading on. Read carefully and follow along, and you are sure to get something out of it!
Open Zeppelin (see the Zeppelin installation tutorial for installation and configuration details) and create a notebook. A custom Hive interpreter must be defined first, as follows:
# Configure the hive interpreter in the web UI
# top right: anonymous => Interpreter => top right: Create =>
Interpreter Name
hive
Interpreter group
jdbc
# => set properties
default.driver org.apache.hive.jdbc.HiveDriver
default.url jdbc:hive2://single:10000
default.user root
# => bottom left: Save
# In the web UI, create a note
# Start each paragraph with %hive on the first line
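For example, the first paragraph of a note can be a quick smoke test (assuming HiveServer2 is reachable at single:10000 as configured above):
%hive
-- if the interpreter is wired up correctly, this lists the existing databases
show databases;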
Some basics for working with Zeppelin:
1. set hive.server2.logging.operation.level=NONE; -- suppress operation-log output
2. Every paragraph should start with a line like `%hive`, declaring which interpreter the paragraph uses.
3. When the data volume is small, Zeppelin may not render some charts (showing Data Available instead); configure the chart in settings first and it will render.
Each part below gets its own new, renamed paragraph under the notebook.
(The statements shown in comments are the ones that need to be executed.)
Download the bundled e-commerce data resources and upload them to the VM. Inspect the files, write down each file's field list and row count for later use, and check each dataset for likely problems, such as garbled characters in files or missing review scores.
Tip: having each table's field list ready makes building the ODS-layer external tables painless, and having the row counts ready makes it quick to verify each table right after creation.
- sum(price), count(transaction_id)
- sum, avg(count)
- sum, avg(count)
- sum, count(distinct)
- count(distinct)
- count(transaction_id), count(review_score), sum(price)
ODS staging layer (`external tables`)
ODS tables usually come in two kinds: one holds the data currently being loaded, the other holds processed historical data. (Historical data is typically kept for 3-6 months and then purged.)
Data enters this layer through ETL and stays close to the source data.
DWD detail layer (`internal tables`)
Tables are merged (columns combined); rows stay unchanged.
ODS tables should be merged as much as possible, useless fields dropped, and dimensions widened on the way into DWD:
- time dimension table + order table + order detail table => pro_dwd.order_detail_by_date
- province table + city table + shop table + order table + order detail table => pro_dwd.order_detail_by_add_shop
The `degenerate dimension` technique is applied: when a dimension carries nothing the warehouse needs beyond its key, it can be degenerated, folded into the fact table, which cuts down joins between the fact table and dimension tables.
DWT light-aggregation layer
1. Perform the aggregations: statistical summaries of multiple indicators (sales amount, user counts, and so on) across multiple dimensions (time, region, product category, and so on).
2. Factor out what the aggregation requirements have in common.
DWS summary layer
Close to the result data behind each indicator; ideally usable as soon as it is pulled out.
Mostly subject-oriented wide tables spanning many fields.
DM data mart
Involves permission management and permission requests: different people see data within different permissions, and access to a specific dataset can be applied for.
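Where SQL-standard authorization is enabled for HiveServer2, such permissions can be granted with plain SQL (a minimal sketch; the role and user names are illustrative):
-- assumes SQL-standard based authorization is turned on
create role analyst;
grant select on table rsda_dws.transaction_dim_date to role analyst;
grant role analyst to user alice;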
Data transitivity: when creating tables, the next layer's DDL can be adapted directly from the layer above, and the data fields loaded into each layer likewise build on the layer above. Get into the habit of checking data with select * from TABLE_NAME limit N after every load. ODS data may still be consumed by other applications, whereas the transformed, cleaned, and integrated data stored in the DWD layer belongs to Hive alone.
All staging-layer tables are `external tables`; create them by combining the field lists prepared earlier with the HDFS directories the data files were uploaded to.
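A minimal DDL sketch for one staging table, assuming the transaction CSV was uploaded to /rsda/transaction and the column list matches the field preparation above (the exact names are illustrative; note that OpenCSVSerde reads every column as a string, so types are cast downstream):
create database if not exists rsda_ods;
-- external: dropping the table removes only metadata, the HDFS files survive
create external table if not exists rsda_ods.transaction(
    transaction_id string,
    customer_id    string,
    store_id       string,
    price          string,
    product        string,
    tran_date      string,
    tran_time      string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties ('separatorChar'=',', 'quoteChar'='"')
location '/rsda/transaction';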
Data cleaning (column pruning, row filtering): the time dimension table merges dimensions that would be unnecessary on their own.
Masking: in the customer table, the user's sensitive information (name, email, address, credit card number) must be masked, choosing the masking method and encryption function according to how sensitive each field is. For example, for the email, md5 the part before the @ and concatenate it with the untouched remainder. (The code simply tries each masking method once; in practice the more sensitive fields call for asymmetric encryption.)
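The email example might look like this (a sketch; the email column name is assumed):
select
    -- md5 the local part, keep the domain readable
    concat(md5(split(email, '@')[0]), '@', split(email, '@')[1]) as email_masked
from rsda_ods.customer;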
Problem solving: the garbled values in `language` need fixing. First filter out the garbled language values, look up the correct spelling of the two affected languages, then replace them.
-- filter out rows whose language column contains garbled characters
select language from rsda_ods.customer where language rlike '.*[^a-zA-Z ].*';
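The replacement step can then be a simple remapping (a sketch; the two garbled values and their correct spellings are placeholders for whatever the lookup actually finds):
select
    case language
        when 'garbled_value_1' then 'CorrectLanguage1'  -- placeholder mapping
        when 'garbled_value_2' then 'CorrectLanguage2'  -- placeholder mapping
        else language
    end as language
from rsda_ods.customer;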
Add an `original` field marking whether `review_score` is non-null.
To keep the overall distribution of the scores intact and avoid the skew that missing data can introduce, we replace missing values with the mean at the cleaning stage.
Since a mean is usually fractional, averaging into an int column would lose precision and defeat that goal, so `review_score`'s type is changed to decimal(3,2).
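Put together, the cleaning step might read as follows (a sketch, assuming the ODS review table is named rsda_ods.store_review):
select
    r.transaction_id,
    r.store_id,
    -- original = 1 when the customer actually left a score, 0 when it was imputed
    if(r.review_score is null, 0, 1) as original,
    cast(coalesce(r.review_score, a.avg_score) as decimal(3,2)) as review_score
from rsda_ods.store_review r
cross join (
    select avg(review_score) as avg_score
    from rsda_ods.store_review
) a;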
How do we fix the store_id mapping errors?
The store_id in the transaction table is the correct one.
Join on transaction_id and compare the two store_id mappings: if they agree, either value works; if they disagree, only the standard store_id from the transaction table can be used.
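As a sketch (table names follow the ODS naming used above; rsda_ods.store_review is assumed):
select
    r.transaction_id,
    -- when the mappings disagree the transaction table wins;
    -- when they agree either value is the same anyway
    if(r.store_id = t.store_id, r.store_id, t.store_id) as store_id,
    r.review_score
from rsda_ods.store_review r
join rsda_ods.transaction t on r.transaction_id = t.transaction_id;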
Why doesn't the shop table appear in the detail layer?
The staging-layer shop table is already fit for purpose: no further column pruning or row filtering is needed, and it has no cleaning work, so it has no reason to reappear in the detail layer.
Takeaway: not every table must appear in every layer of the warehouse; what really decides whether a table belongs in a layer is the granularity of its data.
Partitioned table (note: dynamic partitioning must be configured). Partition columns do not belong in the column list, so `tran_year` and `tran_month` must not be declared as ordinary columns in the DDL. `OpenCSVSerde` is needed because the `product` column contains values like "Soup - Campbells, Minestrone"; a plain `row format delimited fields terminated by ','` would mis-parse the extra comma inside the field, hence `OpenCSVSerde`. `tran_date` and `tran_time` can be merged into one complete time dimension.
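Loading such a table could look like the following (a sketch; rsda_dwd.transaction and the exact column list are illustrative, and tran_time is assumed to be HH:mm:ss):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table rsda_dwd.transaction partition(tran_year, tran_month)
select
    transaction_id,
    customer_id,
    store_id,
    price,
    product,
    -- merge tran_date + tran_time into one full time dimension
    cast(concat(to_date(replace(tran_date, '/', '-')), ' ', tran_time) as timestamp) as tran_datetime,
    -- dynamic partition columns must come last in the select list
    year(to_date(replace(tran_date, '/', '-')))  as tran_year,
    month(to_date(replace(tran_date, '/', '-'))) as tran_month
from rsda_ods.transaction;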
Give every row with a duplicated transaction_id a new, unused transaction_id. The approach: first open a global row_number() window grouped by transaction_id and pull the rows with rn >= 2 (the duplicates) into a helper table B; over B, open another global row_number() window so every duplicate receives a new sequence number all_rn; meanwhile compute max_id, the largest transaction_id in the original transaction table, and assign all_rn + max_id as each duplicate's new transaction_id.
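A sketch of that procedure (column names follow the ODS sketch above; transaction_id is assumed to be numeric):
with ranked as (
    select *,
           row_number() over(partition by transaction_id order by tran_time) as rn
    from rsda_ods.transaction
), dup as (
    -- table B: the duplicates, each given a global sequence number
    select *, row_number() over(order by transaction_id) as all_rn
    from ranked
    where rn >= 2
), mx as (
    select max(cast(transaction_id as bigint)) as max_id
    from rsda_ods.transaction
)
select
    d.all_rn + m.max_id as transaction_id,  -- new, guaranteed-unused id
    d.customer_id, d.store_id, d.price, d.product, d.tran_date, d.tran_time
from dup d
cross join mx m;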
Analyzing time with only the dates already present in the transaction table risks gaps in the time dimension. To guarantee a complete time dimension, a dedicated time dimension table is built separately.
Before building the time dimension table, do a global search for the earliest and latest times in the transaction table to pin down the boundaries of the time dimension.
Note: extreme values are generally found via the partial-aggregation-then-global-aggregation optimization.
set mapreduce.job.reduces = 3;
with temp_tran as(
select store_id,to_date(replace(tran_date,'/','-')) as tran_date
from rsda_ods.transaction
),max_date_by_store as(
select min(tran_date) as min_date,max(tran_date) as max_date
from temp_tran
group by store_id
)
select min(min_date) as min_date,max(max_date) as max_date
from max_date_by_store;
-----------------------------------------------------------------------------------------
min_date max_date
2018-01-01 2018-12-31
Before operating on any date data, first convert it to the standard YYYY-MM-DD format: `tran_date` arrives in a form like 2018/1/31, so replace the separator first, then let to_date() pad the missing leading zeros.
#!/bin/bash
# Generate one row per hour between the two boundary dates (inclusive),
# appending year,quarter,month,week-of-year,day,hour to dim_date.csv.
startDate=$1
endDate=$2
startTime=`date -d "$startDate" +%s`        # epoch seconds at the start boundary
endTime=`date -d "$endDate +1 day" +%s`     # exclusive end: one day past the end boundary
while((startTime<endTime));do
    year=`date -d "@$startTime" +%Y`
    month=`date -d "@$startTime" +%m`
    quarter=$(((10#$month-1)/3+1))          # 10# strips the leading zero before the arithmetic
    yearweek=`date -d "@$startTime" +%W`    # Monday-based week of year
    day=`date -d "@$startTime" +%d`
    hour=`date -d "@$startTime" +%H`
    echo "$year,$quarter,$month,$yearweek,$day,$hour" >> dim_date.csv
    startDate=`date -d "@$startTime" +"%F %T"`   # current instant as "YYYY-MM-DD HH:MM:SS"
    startTime=`date -d "$startDate 1 hour" +%s`  # step forward one hour
done
The script's two arguments are the start boundary and the end boundary:
chmod u+x dim_date_create.sh
./dim_date_create.sh 2018-01-01 2018-12-31
hdfs dfs -put dim_date.csv /rsda/dim_date
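An external table over that directory then exposes the file to Hive (a sketch; the table name and column names mirror the tran_* naming used by the DWS queries below):
create external table if not exists rsda_ods.dim_date(
    tran_year      int,
    tran_season    int,
    tran_month     int,
    tran_year_week int,
    tran_day       int,
    tran_hour      int
)
row format delimited fields terminated by ','
location '/rsda/dim_date';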
Build the light-aggregation tables from the basic data granularity and effective columns established in Part 2. One question is how "best review" should be defined. It means the highest score a user gives, but if a user only ever hands out low scores, those hardly fit "best" either. That is a contradiction in the business definition, and facing such contradictions we usually work with the account manager to define more elaborate indicators. Here the task defines two indicators, the count of 4-point reviews and the count of 5-point reviews, to stand in for "best review" (not collapsing them into a single count of reviews >= 4 points makes the cases with no 5-point reviews easy to split out).
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=3;
select tran_month,sum(sum_amount) as sum_amount
from rsda_dws.transaction_dim_date
group by tran_month;
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
select tran_season,sum(sum_amount) as sum_amount
from rsda_dws.transaction_dim_date
group by tran_season;
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
select
sum(sum_amount)
from(
select sum(sum_amount) as sum_amount
from rsda_dws.transaction_dim_date
group by tran_season
)A;
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
select
sum(sum_amount) as sum_amount
from(
select sum(sum_amount) as sum_amount
from rsda_dws.transaction_dim_date
where dayofweek(concat(tran_year,'-',tran_month,'-',tran_day))<6 -- Hive's dayofweek: 1 = Sunday ... 7 = Saturday
group by tran_season
)A;
concat_ws cannot be used here because its arguments must be string or array<string>; int values are not accepted as concat_ws arguments, while concat casts them to string implicitly.
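A quick way to see the difference:
select concat(2018, '-', 1, '-', 5);          -- ok: concat casts the ints to string
-- concat_ws('-', 2018, 1, 5) would fail: arguments must be string or array<string>
select concat_ws('-', cast(2018 as string), cast(1 as string), cast(5 as string));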
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
select
floor((tran_hour/3)+1) as time_range,
sum(sum_amount) as sum_amount
from rsda_dws.transaction_dim_date
group by floor((tran_hour/3)+1);
Here every 3 hours form one time slot: floor((tran_hour/3)+1) maps hours 0-2 to slot 1, hours 3-5 to slot 2, and so on up to hours 21-23 in slot 8.
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
select
floor((tran_hour/3)+1) as time_range,
cast(sum(sum_amount)/sum(count_tran) as decimal(10,2)) as avg_amount
from rsda_dws.transaction_dim_date
group by floor((tran_hour/3)+1);
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
select
cast(sum(sum_amount)/sum(count_tran) as decimal(10,2)) as avg_amount
from(
select sum(sum_amount) as sum_amount,sum(count_tran) as count_tran
from rsda_dws.transaction_dim_date
where dayofweek(concat(tran_year,'-',tran_month,'-',tran_day))<6 -- Hive's dayofweek: 1 = Sunday ... 7 = Saturday
group by tran_season
)A;
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
select
tran_year,tran_month,tran_day,
sum(count_tran) as tran_count
from rsda_dws.transaction_dim_date
group by tran_year,tran_month,tran_day;
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
with customer_count_tran as(
select customer_id,sum(count_tran) as count_tran
from rsda_dws.tran_dim_date_customer
group by customer_id
),customer_count_tran_rank as(
select customer_id,count_tran,dense_rank() over(order by count_tran desc) as rnk
from customer_count_tran
)
select customer_id,count_tran,rnk
from customer_count_tran_rank
where rnk<=10;
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
with customer_sum_amount as(
select customer_id,sum(sum_amount) as sum_amount
from rsda_dws.tran_dim_date_customer
group by customer_id
),customer_sum_amount_rank as(
select customer_id,sum_amount,dense_rank() over(order by sum_amount desc) as rnk
from customer_sum_amount
)
select customer_id,sum_amount,rnk
from customer_sum_amount_rank
where rnk<=10;
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
with customer_sum_amount as(
select customer_id,sum(sum_amount) as sum_amount
from rsda_dws.tran_dim_date_customer
group by customer_id
)
select customer_id,sum_amount
from customer_sum_amount
where sum_amount = (
select min(sum_amount) as sum_amount
from customer_sum_amount
);
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
SELECT tran_year, tran_season, COUNT(*) AS unique_customers
FROM (
SELECT tran_year, tran_season, customer_id
FROM rsda_dws.tran_dim_date_customer
GROUP BY tran_year, tran_season, customer_id
) AS temp
GROUP BY tran_year, tran_season
ORDER BY tran_year, tran_season;
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
SELECT tran_year_week, COUNT(*) AS unique_customers
FROM (
SELECT tran_year_week, customer_id
FROM rsda_dws.tran_dim_date_customer
GROUP BY tran_year_week, customer_id
) AS temp
GROUP BY tran_year_week
ORDER BY tran_year_week;
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
with customer_sum_amount_count_tran as(
select customer_id,sum(sum_amount) as sum_amount,sum(count_tran) as count_tran
from rsda_dws.tran_dim_date_customer
group by customer_id
),customer_avg_amount as(
select customer_id,cast(sum_amount/count_tran as decimal(10,2)) as avg_amount
from customer_sum_amount_count_tran
)
select max(avg_amount) as max_amount
from customer_avg_amount;
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
WITH MonthlySpending AS (
SELECT tran_month, customer_id, SUM(sum_amount) AS total_spending,
DENSE_RANK() OVER (PARTITION BY tran_month ORDER BY SUM(sum_amount) DESC) AS rank
FROM rsda_dws.tran_dim_date_customer
GROUP BY tran_month, customer_id
)
SELECT tran_month, customer_id, total_spending
FROM MonthlySpending
WHERE rank = 1;
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
WITH MonthlyVisits AS (
SELECT tran_month, customer_id, SUM(count_tran) AS total_visits,
DENSE_RANK() OVER (PARTITION BY tran_month ORDER BY SUM(count_tran) DESC) AS rank
FROM rsda_dws.tran_dim_date_customer
GROUP BY tran_month, customer_id
)
SELECT tran_month, customer_id, total_visits
FROM MonthlyVisits
WHERE rank = 1;
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
SELECT product, SUM(sum_amount) AS total_sales_amount
FROM rsda_dws.tran_product_customer_month
GROUP BY product
ORDER BY total_sales_amount DESC
LIMIT 5;
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
SELECT product, SUM(count_tran) AS total_transactions
FROM rsda_dws.tran_product_customer_month
GROUP BY product
ORDER BY total_transactions DESC
LIMIT 5;
set hive.server2.logging.operation.level=NONE;
set mapreduce.job.reduces=4;
SELECT product, COUNT(DISTINCT customer_id) AS customer_count
FROM rsda_dws.tran_product_customer_month
GROUP BY product
ORDER BY customer_count DESC
LIMIT 5;
SET hive.server2.logging.operation.level=NONE;
SET mapreduce.job.reduces=4;
select store_id,count(distinct customer_id) as customer_count
from rsda_dws.tran_store_customer_year_month
group by store_id
order by customer_count desc
limit 1;
SET hive.server2.logging.operation.level=NONE;
SET mapreduce.job.reduces=4;
SELECT store_id, SUM(sum_amount) AS total_spending
FROM rsda_dws.tran_store_customer_year_month
GROUP BY store_id
ORDER BY total_spending DESC
LIMIT 1;
SET hive.server2.logging.operation.level=NONE;
SET mapreduce.job.reduces=4;
SELECT store_id, SUM(count_tran) AS total_transactions
FROM rsda_dws.tran_store_customer_year_month
GROUP BY store_id
ORDER BY total_transactions DESC
LIMIT 1;
SET hive.server2.logging.operation.level=NONE;
SET mapreduce.job.reduces=4;
SELECT store_id, customer_id, product, sum_amount AS max_spent
FROM (
    SELECT store_id,
           customer_id,
           product,
           SUM(sum_amount) AS sum_amount,
           ROW_NUMBER() OVER (PARTITION BY store_id, customer_id
                              ORDER BY SUM(sum_amount) DESC) AS rn
    FROM rsda_dws.tran_store_product_customer
    GROUP BY store_id, customer_id, product
) AS customer_product_sales
WHERE rn = 1;
SET hive.server2.logging.operation.level=NONE;
SET mapreduce.job.reduces=4;
SELECT store_id, tran_year, tran_month, SUM(sum_amount) AS monthly_revenue
FROM rsda_dws.tran_store_customer_year_month
GROUP BY store_id, tran_year, tran_month;
SET hive.server2.logging.operation.level=NONE;
SET mapreduce.job.reduces=4;
SELECT store_id, SUM(sum_amount) AS total_revenue
FROM rsda_dws.tran_store_customer_year_month
GROUP BY store_id;
The indicator's source data is missing; skipped.
I define a loyal customer as one who visited the store at least 10 times over the year.
SET hive.server2.logging.operation.level=NONE;
SET mapreduce.job.reduces=4;
SELECT customer_id, COUNT(*) AS months_visited
FROM rsda_dws.tran_store_customer_year_month
GROUP BY customer_id
HAVING COUNT(*) >= 10;
Definition of a conflicting mapping: one transaction ID corresponds to multiple different store_ids in store_review.
SET hive.server2.logging.operation.level=NONE;
SET mapreduce.job.reduces=4;
SELECT
transaction_id
FROM
ext_store_review
GROUP BY
transaction_id
HAVING
COUNT(DISTINCT store_id) > 1;
Definition of coverage: the ratio of transactions where the customer submitted a review to the customer's total transactions.
SET hive.server2.logging.operation.level=NONE;
SET mapreduce.job.reduces=4;
SELECT
store_id,
customer_id,
(review_tran_count / total_tran_count) * 100 AS review_coverage
FROM
rsda_dws.store_review_customer;
Definition: use each customer's mean score to see how customers are distributed across the score bands.
SET hive.server2.logging.operation.level=NONE;
SET mapreduce.job.reduces=4;
SELECT
average_score,
COUNT(*) AS customer_count
FROM
(SELECT
customer_id,
(sum_review / review_tran_count) AS average_score
FROM
rsda_dws.store_review_customer
WHERE
review_tran_count > 0) AS customer_scores
GROUP BY
average_score;
SET hive.server2.logging.operation.level=NONE;
SET mapreduce.job.reduces=4;
SELECT
total_tran_count,
COUNT(*) AS customer_count
FROM
rsda_dws.store_review_customer
GROUP BY
total_tran_count;
Problem restated: select the customer IDs whose best reviews are not always given to the same store.
SET hive.server2.logging.operation.level=NONE;
SET mapreduce.job.reduces=4;
SELECT
customer_id
FROM
rsda_dws.store_review_customer
WHERE
four_count > 0 OR five_count > 0
GROUP BY
    customer_id
HAVING
    COUNT(DISTINCT store_id) > 1;