User: identified by device. In mobile analytics every distinct device counts as one user — Android devices are identified by IMEI, iOS devices by OpenUDID — so one phone equals one user.
New user: a user whose device connects and opens the app for the first time. Reinstalling on the same device does not count as another new user. New users are broken down into daily, weekly and monthly new users.
Active user: any user who opens the app. Usage depth is not considered; a device that opens the app several times in one day still counts as one daily active user.
Weekly (monthly) active user: a user who launches the app within a calendar week (month); multiple launches in that week (month) count as one active user.
Monthly active rate: monthly active users divided by the cumulative total of users up to that month.
Silent user: a user who launches the app only on the install day (or the following day) and never again. This metric reflects the quality of new users and how well they match the app.
Version distribution: new users, active users and launch counts for each day of the week, broken down by app version. Useful for comparing versions and understanding user behavior.
Returning user (this week): a user who did not launch the app last week but did launch it this week.
Continuously active for n weeks: launched the app at least once in each of n consecutive weeks.
Loyal user: continuously active for 5 weeks or more.
Continuously active user: active for 2 or more consecutive weeks.
Recently churned user: a user who has not launched the app for n consecutive weeks (2 <= n <= 4).
Retained user / retention rate: of the users added in a given period, those still using the app after some time are retained users; their share of that period's new users is the retention rate.
User freshness: the daily ratio of new to returning users, i.e. new users / active users.
Session length: how long the app is used per launch.
Daily usage time: total usage time accumulated within one day.
Launch counting: on iOS, sending the app to the background ends a launch, so the next foreground counts as a separate launch. On Android we count two launches less than 30 seconds apart as a single launch: if the user leaves the app for under 30 seconds (e.g. to answer a call or a text message) and comes back, the two actions are a continuation rather than independent sessions, so they count as one launch. 30 seconds is the common industry threshold, but the interval can be customized (see the sketch below).
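A minimal HiveQL sketch of the 30-second rule, assuming a hypothetical table app_foreground_events(mid_id, ts) with one row per time the app comes to the foreground and ts in epoch seconds (this table is not part of the warehouse described here):
select mid_id,
       sum(is_new_launch) launch_count
from
(select mid_id,
        -- a foreground event starts a new launch if it is the device's first event,
        -- or if 30 seconds or more have passed since the previous foreground event
        if(lag(ts) over(partition by mid_id order by ts) is null
           or ts - lag(ts) over(partition by mid_id order by ts) >= 30, 1, 0) is_new_launch
 from app_foreground_events) t
group by mid_id;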
NVL function (a null-handling function)
Basic syntax:
NVL(expr1, expr2)
If expr1 is NULL, NVL returns expr2; otherwise it returns expr1. Its purpose is to turn a NULL into an actual value. The expressions may be numeric, character or date types, but expr1 and expr2 must be of the same type.
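A few quick examples in Hive:
select nvl(null, 0);             -- first argument is NULL, returns 0
select nvl(5, 0);                -- first argument is not NULL, returns 5
select nvl(null, 'unknown');     -- works for strings too, returns 'unknown'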
1) date_format (format a date)
select date_format('2020-10-10','yyyy-MM');
2) date_add (add or subtract days)
select date_add('2020-10-10',-1);
select date_add('2020-10-10',1);
3) next_day
a) the next Monday after a given day
select next_day('2020-10-10','MO');
Note: weekdays can be written in English (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday) or abbreviated (e.g. 'MO').
b) the Monday of the current week
select date_add(next_day('2020-10-10','MO'),-7);
4) last_day (last day of the month)
select last_day('2020-10-10');
1. Follow the requirements closely: provide exactly the fields the requirements ask for.
2. Group the requirements by subject: device, member (user), product, activity and region — five subject areas.
3. All tables are wide tables (combining fields from DWD fact tables and DWD-layer dimension tables), and all are partitioned by day.
Daily device behavior
'From the start log' (one record per launch per device per user):
take the device id, phone brand and phone model,
and the active count (group by device id; the row count per group is that device's active count).
'From the page log':
take the page-view statistics {
page id,
page-view count (group by device id and page id; the row count per group is the view count)
}
Both are grouped by mid_id.
drop table if exists dws_uv_detail_daycount;
create external table dws_uv_detail_daycount
(
--从dwd_start_log取
`mid_id` string COMMENT '设备id',
`brand` string COMMENT '手机品牌',
`model` string COMMENT '手机型号',
`login_count` bigint COMMENT '活跃次数',
--从dwd_page_log取
-- collect_set ,collect_list : 将N列多行转 N列1行(array)
-- named_struct 生成struct
`page_stats` array<struct<page_id:string,page_count:bigint>> COMMENT '页面访问统计'
) COMMENT '每日设备行为表'
partitioned by(dt string)
stored as parquet
location '/warehouse/gmall/dws/dws_uv_detail_daycount'
tblproperties ("parquet.compression"="lzo");
insert overwrite table dws_uv_detail_daycount partition(dt='2020-10-10')
SELECT
t1.mid_id, brand, model, login_count, page_stats
from
(SELECT
mid_id,
concat_ws('|',collect_set(brand)) brand,
concat_ws('|',collect_set(model)) model,
count(*) login_count
-- dwd_start_log粒度 1个用户启动一次的信息记录
from dwd_start_log
where dt='2020-10-10'
GROUP by mid_id ) t1
join
(SELECT
mid_id,
COLLECT_list(named_struct('page_id',page_id,'page_count',page_count)) page_stats
from
(SELECT
mid_id, page_id,count(*) page_count
-- dwd_page_log粒度 1个用户访问一个页面一次的信息记录
from dwd_page_log
where dt='2020-10-10'
GROUP by mid_id, page_id) tmp
GROUP by mid_id) t2
on t1.mid_id=t2.mid_id
Per member per day
'From the start log:'
--【granularity: one record per user per login】
take the user id and login count (group by user; the row count per group is the login count).
'From the action log:'
--【for each user: which device, which pages were visited, dwell time, which product was viewed, its type, whether it was added to the cart or to favorites, and other actions】
take the add-to-cart count (group by user; count that day's add-to-cart actions).
'From the order fact table:'
--【one row per order: the creation time is when the user placed the order, plus payment/cancel/complete/refund/refund-complete times, destination province, activities the order joined, original amount, discount amount, final amount and freight】
take the order count (group by user; the row count per group)
and the order amount (the sum of the original amounts).
'From the payment fact table:'
--【one row per paid order per user: when it was paid, how much, and which province it ships to】
take the payment count (group by user; the row count per group)
and the payment amount (group by user; the sum of each order's payment amount).
'From the order-detail fact table:'
--【one row per product per order: product quantity and name, the order's creation time, destination, source type and number, and the per-product split of original price, purchase price, freight and coupon discount】
take the order-detail stats【the details of one product within one of the member's orders】
(for each product:
sku id,
quantity, [the product's quantity across the member's orders]
order count, [
group by user and sku;
the row count per group is the order count,
because the grain is one product per order,
so the number of rows for a sku is the number of orders containing it
]
order amount:
[
the sum of that sku's original-price splits;
summing is needed because
a user may have several rows for the same sku (across orders),
and summing their original-price splits per user and sku
gives the amount attributable to that sku
]
)
drop table if exists dws_user_action_daycount;
create external table dws_user_action_daycount
(
--dwd_start_log
user_id string comment '用户 id',
login_count bigint comment '登录次数',
--无法求出,如需要求此数据,需要加入购物车记录表
--dwd_action_log
cart_count bigint comment '加入购物车次数',
--dwd_fact_order_info
order_count bigint comment '下单次数',
order_amount decimal(16,2) comment '下单金额',
--dwd_fact_payment_info
payment_count bigint comment '支付次数',
payment_amount decimal(16,2) comment '支付金额',
--dwd_fact_order_detail
--struct中 order_count 为某用户某商品的下单次数,order_amount 为该商品原始价格分摊的求和
order_detail_stats
array<struct<sku_id:string,
sku_num:bigint,
order_count:bigint,
order_amount:decimal(20,2)>> comment '下单明细统计'
) COMMENT '每日会员行为'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_user_action_daycount/'
tblproperties ("parquet.compression"="lzo");
with
t1 as
(SELECT
user_id,count(*) login_count
from dwd_start_log
where dt='2020-10-10' and user_id is not null
GROUP by user_id),
t2 as
(
SELECT
user_id,count(*) cart_count
from dwd_action_log
where dt='2020-10-10' and action_id='cart_add'
GROUP by user_id),
t3 as
(SELECT
user_id,count(*) order_count,sum(original_total_amount) order_amount
from dwd_fact_order_info
where dt='2020-10-10'
GROUP by user_id),
t4 as
(select
user_id,count(*) payment_count,SUM(payment_amount) payment_amount
from dwd_fact_payment_info
where dt='2020-10-10'
GROUP by user_id ),
t5 as
(SELECT
user_id ,COLLECT_list(sku_id_struct) order_detail_stats
from
(SELECT
user_id,
named_struct('sku_id',sku_id,'sku_num',sum(sku_num),'order_count',count(*),
'order_amount',cast(sum(original_amount_d) as decimal(20,2))) sku_id_struct
--粒度: 一笔订单的一件商品 是一行
from dwd_fact_order_detail
where dt='2020-10-10'
GROUP by user_id,sku_id ) t
GROUP by user_id)
insert overwrite table dws_user_action_daycount partition(dt='2020-10-10')
select
t1.user_id,
nvl(login_count,0),
nvl(cart_count,0),
nvl(order_count,0),
nvl(order_amount,0.00),
nvl(payment_count,0),
nvl(payment_amount,0.00) ,
order_detail_stats
from t1
left join t2 on t1.user_id=t2.user_id
left join t3 on t1.user_id=t3.user_id
left join t4 on t1.user_id=t4.user_id
left join t5 on t1.user_id=t5.user_id
Per product (sku) per day
'From the order-detail fact table:' one row per product per order per user
--【
--take the sku ids of ordered products;
--group by sku id: the row count per group is the number of times the sku was ordered,
--the sum of quantities is the number of units ordered,
--and the sum of the original-price splits is the ordered amount
--】
times ordered, units ordered, amount ordered.
'From the payment fact table:' (payment may happen on a later day than the order)
--【
--because of cross-day payment,
--join the payment fact table's order id
--with the order-detail fact table's order id,
--then group by sku id: the row count per group is the number of times the sku was paid;
--take order id, sku id, quantity and final price from the order-detail table:
--the sum of quantities is the number of units paid,
--and the sum of final prices is the paid amount
--】
times paid, units paid, amount paid.
'From the refund fact table:'
--【per sku per day: refund count, refunded units and refunded amount】
times refunded, units, amount.
'From the action log:'
add-to-cart count,
favorite count.
'From the comment (appraise) fact table:'
good, neutral, bad and default review counts.
The tables are joined and grouped by sku id;
the row count per group gives the corresponding count.
drop table if exists dws_sku_action_daycount;
create external table dws_sku_action_daycount
(
--dwd_fact_order_detail
sku_id string comment 'sku_id',
order_count bigint comment '被下单次数',
order_num bigint comment '被下单件数',
order_amount decimal(16,2) comment '被下单金额',
--dwd_fact_payment_info 存在跨天支付,取dwd_fact_order_detail当天和前一天,进行Join
payment_count bigint comment '被支付次数',
payment_num bigint comment '被支付件数',
payment_amount decimal(16,2) comment '被支付金额',
-- dwd_fact_order_refund_info
refund_count bigint comment '被退款次数',
refund_num bigint comment '被退款件数',
refund_amount decimal(16,2) comment '被退款金额',
--dwd_action_log
cart_count bigint comment '被加入购物车次数',
favor_count bigint comment '被收藏次数',
--dwd_fact_comment_info
appraise_good_count bigint comment '好评数',
appraise_mid_count bigint comment '中评数',
appraise_bad_count bigint comment '差评数',
appraise_default_count bigint comment '默认评价数'
) COMMENT '每日商品行为'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_sku_action_daycount/'
tblproperties ("parquet.compression"="lzo");
with
t1 as
(SELECT
sku_id,
sum(if(appraise='1201',1,0)) appraise_good_count,
sum(if(appraise='1202',1,0)) appraise_mid_count,
sum(if(appraise='1203',1,0)) appraise_bad_count,
sum(if(appraise='1204',1,0)) appraise_default_count
from dwd_fact_comment_info
where dt='2020-10-10'
GROUP by sku_id ),
t2 as
(SELECT
item sku_id,
sum(if(action_id='cart_add',1,0)) cart_count,
sum(if(action_id='favor_add',1,0)) favor_count
from dwd_action_log
where dt='2020-10-10' and item_type='sku_id'
GROUP by item),
t3 as
(SELECT
sku_id ,
count(*) refund_count,
sum(refund_num ) refund_num,
sum(refund_amount) refund_amount
--粒度: 一笔订单的一种商品是一条
from dwd_fact_order_refund_info
where dt='2020-10-10'
GROUP by sku_id ),
t4 as
(--2020-10-10所有被支付的商品的信息
SELECT
sku_id,
count(*) payment_count,
sum(sku_num) payment_num,
sum(final_amount_d) payment_amount
from
(SELECT
order_id
from dwd_fact_payment_info
where dt='2020-10-10') p
join
(select
order_id,sku_id,sku_num,final_amount_d
from dwd_fact_order_detail
where dt='2020-10-10' or dt=date_sub('2020-10-10',1)
) o
on p.order_id=o.order_id
GROUP by sku_id),
t5 as
(SELECT
sku_id,
count(*) order_count,
sum(sku_num ) order_num,
sum(original_amount_d) order_amount
from dwd_fact_order_detail
where dt='2020-10-10'
GROUP by sku_id )
select
nvl(nvl(nvl(nvl(t1.sku_id,t2.sku_id),t3.sku_id),t4.sku_id),t5.sku_id) id,
nvl(order_count,0),
nvl(order_num,0),
nvl(order_amount,0.00),
nvl(payment_count,0),
nvl(payment_num,0),
nvl(payment_amount,0.00),
nvl(refund_count,0),
nvl(refund_num,0),
nvl(refund_amount,0.00),
nvl(cart_count,0),
nvl(favor_count,0),
nvl(appraise_good_count,0),
nvl(appraise_mid_count,0),
nvl(appraise_bad_count,0),
nvl(appraise_default_count,0)
from t1
full join t2 on t1.sku_id=t2.sku_id
full join t3 on t1.sku_id=t3.sku_id or t2.sku_id=t3.sku_id
full join t4 on t1.sku_id=t4.sku_id or t2.sku_id=t4.sku_id or t3.sku_id=t4.sku_id
full join t5 on t1.sku_id=t5.sku_id or t2.sku_id=t5.sku_id or t3.sku_id=t5.sku_id or t4.sku_id=t5.sku_id
Replace the joins above with: xxx union all xxx, then group by xxx
insert overwrite table dws_sku_action_daycount partition(dt='2020-10-10')
SELECT
sku_id ,
sum(order_count ),
sum(order_num ),
sum(order_amount ),
sum(payment_count ),
sum(payment_num ),
sum(payment_amount ),
sum(refund_count ),
sum(refund_num ),
sum(refund_amount ),
sum(cart_count ),
sum(favor_count ),
sum(appraise_good_count ),
sum(appraise_mid_count ),
sum(appraise_bad_count ),
sum(appraise_default_count )
from
(SELECT
sku_id,0 order_count, 0 order_num, 0 order_amount,
0 payment_count,0 payment_num,0 payment_amount,
0 refund_count, 0 refund_num, 0 refund_amount,
0 cart_count,0 favor_count,
sum(if(appraise='1201',1,0)) appraise_good_count,
sum(if(appraise='1202',1,0)) appraise_mid_count,
sum(if(appraise='1203',1,0)) appraise_bad_count,
sum(if(appraise='1204',1,0)) appraise_default_count
from dwd_fact_comment_info
where dt='2020-10-10'
GROUP by sku_id
union all
SELECT
item sku_id,
0 order_count, 0 order_num, 0 order_amount,
0 payment_count,0 payment_num,0 payment_amount,
0 refund_count, 0 refund_num, 0 refund_amount,
sum(if(action_id='cart_add',1,0)) cart_count,
sum(if(action_id='favor_add',1,0)) favor_count,
0 appraise_good_count,0 appraise_mid_count,
0 appraise_bad_count,0 appraise_default_count
from dwd_action_log
where dt='2020-10-10' and item_type='sku_id'
GROUP by item
union all
SELECT
sku_id ,
0 order_count, 0 order_num, 0 order_amount,
0 payment_count,0 payment_num,0 payment_amount,
count(*) refund_count,
sum(refund_num ) refund_num,
sum(refund_amount) refund_amount,
0 cart_count,0 favor_count,
0 appraise_good_count,0 appraise_mid_count,
0 appraise_bad_count,0 appraise_default_count
--粒度: 一笔订单的一种商品是一条
from dwd_fact_order_refund_info
where dt='2020-10-10'
GROUP by sku_id
union all
--2020-10-10所有被支付的商品的信息
SELECT
sku_id,
0 order_count, 0 order_num, 0 order_amount,
count(*) payment_count,
sum(sku_num) payment_num,
sum(final_amount_d) payment_amount,
0 refund_count, 0 refund_num, 0 refund_amount,
0 cart_count,0 favor_count,
0 appraise_good_count,0 appraise_mid_count,
0 appraise_bad_count,0 appraise_default_count
from
(SELECT
order_id
from dwd_fact_payment_info
where dt='2020-10-10') p
join
(select
order_id,sku_id,sku_num,final_amount_d
from dwd_fact_order_detail
where dt='2020-10-10' or dt=date_sub('2020-10-10',1)
) o
on p.order_id=o.order_id
GROUP by sku_id
union all
SELECT
sku_id,
count(*) order_count,
sum(sku_num ) order_num,
sum(original_amount_d) order_amount,
0 payment_count,0 payment_num,0 payment_amount,
0 refund_count, 0 refund_num, 0 refund_amount,
0 cart_count,0 favor_count,
0 appraise_good_count,0 appraise_mid_count,
0 appraise_bad_count,0 appraise_default_count
from dwd_fact_order_detail
where dt='2020-10-10'
GROUP by sku_id) tmp
GROUP by sku_id
For each activity from the activity dimension table, compute the daily exposure, order and payment statistics.
'From the activity dimension table:'
--【one row per activity: name, type, start time, end time, and the row's creation time】
take all fields.
'From the display (exposure) log:'
--【product exposure on pages: the device id and member id that viewed the page, and the id of the exposed item】
take the exposure count (group by the exposed item id; the row count per group is the exposure count).
'From the order fact table:'
--【one row per order: creation time, payment/cancel/complete/refund times, destination province, source type and number, original amount, discount, final amount, freight, and the activity id the order participated in】
take the order count (group by activity; the number of orders per group is the activity's order count)
and the order amount (sum the order amounts within each group).
'From the payment fact table:'
--【one row per paid order per user: when it was paid, how much, and which province it ships to】
take the payment count (group by activity; the row count per group)
and the payment amount (group by activity; the sum of the payment amounts).
drop table if exists dws_activity_info_daycount;
create external table dws_activity_info_daycount(
-- 从dwd_dim_activity_info取
-- 活动得在有效期内 end_time >= 今天
`id` string COMMENT '编号',
`activity_name` string COMMENT '活动名称',
`activity_type` string COMMENT '活动类型',
`start_time` string COMMENT '开始时间',
`end_time` string COMMENT '结束时间',
`create_time` string COMMENT '创建时间',
--dwd_display_log
`display_count` bigint COMMENT '曝光次数',
--dwd_fact_order_info
`order_count` bigint COMMENT '下单次数',
`order_amount` decimal(20,2) COMMENT '下单金额',
--dwd_fact_payment_info 存在跨天支付 取前一天和今天的订单
`payment_count` bigint COMMENT '支付次数',
`payment_amount` decimal(20,2) COMMENT '支付金额'
) COMMENT '每日活动统计'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_activity_info_daycount/'
tblproperties ("parquet.compression"="lzo");
with
t1 as
(SELECT
activity_id,
count(*) payment_count,
sum(payment_amount) payment_amount
from
-- 求昨天和今天下单的参与活动的,在今天进行支付的订单有哪些
(SELECT
order_id,payment_amount
from dwd_fact_payment_info
where dt='2020-10-10') p
join
(select
DISTINCT id,activity_id
from dwd_fact_order_info
where (dt='2020-10-10' or dt=date_sub('2020-10-10',1))
and activity_id is not NULL
) o
on p.order_id=o.id
GROUP by activity_id),
t2 as
(SELECT
activity_id,
count(*) order_count,
sum(original_total_amount ) order_amount
from dwd_fact_order_info
where dt='2020-10-10' and activity_id is not null
GROUP by activity_id ),
t3 as
(SELECT
item activity_id,
count(*) display_count
FROM dwd_display_log
where dt='2020-10-10' and item_type ='activity_id'
GROUP by item),
t4 as
(SELECT
*
from dwd_dim_activity_info
where dt='2020-10-10')
insert overwrite table dws_activity_info_daycount partition(dt='2020-10-10')
SELECT
t4.id, activity_name, activity_type,
start_time, end_time, create_time,
nvl(display_count,0),
nvl(order_count,0),
nvl(order_amount,0.0),
nvl(payment_count,0),
nvl(payment_amount,0.0)
from t4
left join t3 on t4.id=t3.activity_id
left join t2 on t4.id=t2.activity_id
left join t1 on t4.id=t1.activity_id
Per region per day
'From the region (province) dimension table:'
take the id, province name, area code, ISO code, region id and region name.
'From the start log:'
take the number of active devices (distinct device ids).
'From the order fact table:'
--【one row per order: creation time, payment/cancel/complete/refund times, destination province, source type and number, original amount, discount, final amount, freight, and the activity id】
take the order count (group by province; the number of orders per group)
and the order amount (sum the order amounts within each group).
'From the payment fact table:'
--【one row per paid order per user: when it was paid, how much, and which province it ships to】
take the payment count (group by province; the row count per group)
and the payment amount (group by province; the sum of the payment amounts).
drop table if exists dws_area_stats_daycount;
create external table dws_area_stats_daycount(
--dwd_dim_base_province
`id` bigint COMMENT '编号',
`province_name` string COMMENT '省份名称',
`area_code` string COMMENT '地区编码',
`iso_code` string COMMENT 'iso编码',
`region_id` string COMMENT '地区ID',
`region_name` string COMMENT '地区名称',
--dwd_start_log
`login_count` string COMMENT '活跃设备数',
--从dwd_fact_order_info
`order_count` bigint COMMENT '下单次数',
`order_amount` decimal(20,2) COMMENT '下单金额',
----dwd_fact_payment_info 存在跨天支付 取前一天和今天的订单
`payment_count` bigint COMMENT '支付次数',
`payment_amount` decimal(20,2) COMMENT '支付金额'
) COMMENT '每日地区统计表'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_area_stats_daycount/'
tblproperties ("parquet.compression"="lzo");
insert overwrite table dws_area_stats_daycount partition(dt='2020-10-10')
SELECT
id, province_name, t1.area_code, iso_code,
region_id, region_name,
nvl(login_count,0),
nvl(order_count,0),
nvl(order_amount,0.0),
nvl(payment_count,0),
nvl(payment_amount,0.0)
from dwd_dim_base_province t1
left join
(SELECT
province_id,
count(*) order_count,
sum(original_total_amount ) order_amount
from dwd_fact_order_info
where dt='2020-10-10'
GROUP by province_id ) t2
on t1.id = t2.province_id
left join
(SELECT
province_id,count(*) payment_count,
sum(payment_amount ) payment_amount
from dwd_fact_payment_info
where dt='2020-10-10'
GROUP by province_id ) t3
on t1.id= t3.province_id
left join
(select
area_code,count(*) login_count
from
(select
area_code ,mid_id
from dwd_start_log
where dt='2020-10-10'
GROUP by area_code ,mid_id ) tmp
GROUP by area_code) t4
on t1.area_code=t4.area_code
1. The DWT layer is also a data service layer.
The difference from DWS is that each DWT table is an accumulating wide table (built from several fact and dimension tables).
2. DWT tables are not partitioned; each is a full (snapshot) table.
3. A full table is refreshed by: old (the DWT table) full join new (the data loaded into DWS today); the merged result overwrites the original table.
4. DWT reads from DWS: DWS aggregates one subject's data by day, DWT accumulates the same subject's data over time.
UV: unique visitors — the number of distinct devices/users that visit
PV: page views — the number of page views
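As a rough illustration (assuming, as in the DWS query earlier, that dwd_page_log holds one row per page view with the device id mid_id):
select
    count(distinct mid_id) uv,   -- unique visitors: distinct devices
    count(*)               pv    -- page views: total rows
from dwd_page_log
where dt='2020-10-10';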
drop table if exists dwt_uv_topic;
create external table dwt_uv_topic
(
`mid_id` string comment '设备id',
`brand` string comment '手机品牌',
`model` string comment '手机型号',
-- 第一次活跃的时间
`login_date_first` string comment '首次活跃时间',
--最后一次活跃的时间
`login_date_last` string comment '末次活跃时间',
`login_day_count` bigint comment '当日活跃次数',
-- dwt累积的活跃天数+ if(当天活跃,1,0)
`login_count` bigint comment '累积活跃天数'
) COMMENT '设备主题宽表'
stored as parquet
location '/warehouse/gmall/dwt/dwt_uv_topic'
tblproperties ("parquet.compression"="lzo");
old full join new on old.id = new.id
old users active today: where old.id is not null and new.id is not null
new users: where old.id is null
old users not active today: where new.id is null
old users: where old.id is not null
(a sketch of these cases follows below)
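A sketch of the classification against the tables in this chapter (dwt_uv_topic as old, today's partition of dws_uv_detail_daycount as new) — an illustration only, not part of the actual load:
select
    sum(if(old.mid_id is not null and new.mid_id is not null,1,0)) active_old_devices,  -- old devices active today
    sum(if(old.mid_id is null,1,0)) new_devices,                                        -- devices first seen today
    sum(if(new.mid_id is null,1,0)) inactive_old_devices                                -- old devices not active today
from dwt_uv_topic old
full join
(select mid_id from dws_uv_detail_daycount where dt='2020-10-10') new
on old.mid_id = new.mid_id;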
After copying a custom UDF jar into Hive's auxlib directory, register it from the Hive client:
hive> use gmall;
hive> create function <function name> as '<fully-qualified class name of the UDF>';
INSERT overwrite table dwt_uv_topic
SELECT
nvl(old.mid_id,new.mid_id ) mid_id,
if(old.brand is null ,new.brand,IF(new.brand is null,old.brand,compactstr(old.brand ,new.brand)) ) brand,
if(old.model is null ,new.model,IF(new.model is null,old.model,compactstr(old.model ,new.model)) ) model,
--首次活跃时间
--老的设备是空的,就是使用新设备的使用时间为首次活跃时间,否则老的设备首次活跃时间不变
if(old.mid_id is null,new.dt,old.login_date_first ),
--末次活跃时间
--新的设备是空的,就是使用老的设备最后一次使用时间为,末次活跃时间,否则就是本次时间为该型设备末次活跃时间
if(new.mid_id is null,old.login_date_last,new.dt ),
--当日活跃次数:当天没有新数据(未活跃)的话,记0
nvl(new.login_count,0),
--累积活跃天数为老设备的活跃天数加上新设备的活跃天数(如果新设备不为空,则加上)
nvl(old.login_count,0)+if(new.mid_id is null,0,1)
from
dwt_uv_topic old
full join
(select
mid_id,brand ,model, login_count ,dt
from dws_uv_detail_daycount
where dt='2020-10-10') new
on old.mid_id = new.mid_id
'The old side comes from the member topic wide table:'
user id:
if there is a new member, record the new id; otherwise keep the old one --nvl(old.user_id ,new.user_id ) user_id,
first login time:
if the old first-login time is null, use today's date; otherwise keep the old one --nvl(old.login_date_first,new.dt) login_date_first,
last login time:
if there is no new record, keep the old last-active time; if the member was active today, use today's date --nvl(new.dt,old.login_date_last) login_date_last,
-- if old.order_date_first is null, the member has never ordered; then check whether today's order count is 0
first order time: nvl(old.order_date_first ,IF (new.order_count = 0,null,new.dt)) order_date_first,
last order time: IF (new.order_count > 0,new.dt,old.order_date_last) order_date_last,
first payment time: nvl(old.payment_date_first ,IF (new.payment_count = 0,null,new.dt)) payment_date_first,
last payment time: IF (new.payment_count > 0,new.dt,old.payment_date_last) payment_date_last,
--old + new
cumulative login days:
0 if nothing accumulated yet; add 1 if the member logged in today --nvl(old.login_count,0 )+ if(new.login_count is null,0,1) login_count,
cumulative order count:
0 if nothing accumulated yet, plus today's order count --nvl(old.order_count,0) + nvl(new.order_count,0) order_count,
cumulative order amount:
0 if nothing accumulated yet, plus today's order amount --nvl(old.order_amount,0.0) + nvl(new.order_amount,0.0) order_amount,
cumulative payment count:
0 if nothing accumulated yet, plus today's payment count --nvl(old.payment_count,0) + nvl(new.payment_count,0) payment_count,
cumulative payment amount:
0 if nothing accumulated yet, plus today's payment amount --nvl(old.payment_amount,0.0) + nvl(new.payment_amount,0.0) payment_amount
--the new side takes today's rows from the daily member behavior table:
member id, login count, order count, payment count, order amount, payment amount and dt.
The two sides are combined with a full join.
The last-30-day figures also come from the daily member behavior table:
where dt BETWEEN date_sub('2020-10-10',29) and '2020-10-10'
drop table if exists dwt_user_topic;
create external table dwt_user_topic
(
user_id string comment '用户id',
login_date_first string comment '首次登录时间',
login_date_last string comment '末次登录时间',
order_date_first string comment '首次下单时间',
order_date_last string comment '末次下单时间',
payment_date_first string comment '首次支付时间',
payment_date_last string comment '末次支付时间',
login_count bigint comment '累积登录天数',
order_count bigint comment '累积下单次数',
order_amount decimal(16,2) comment '累积下单金额',
payment_count decimal(16,2) comment '累积支付次数',
payment_amount decimal(16,2) comment '累积支付金额',
login_last_30d_count bigint comment '最近30日登录天数',
order_last_30d_count bigint comment '最近30日下单次数',
order_last_30d_amount bigint comment '最近30日下单金额',
payment_last_30d_count decimal(16,2) comment '最近30日支付次数',
payment_last_30d_amount decimal(16,2) comment '最近30日支付金额'
)COMMENT '会员主题宽表'
stored as parquet
location '/warehouse/gmall/dwt/dwt_user_topic/'
tblproperties ("parquet.compression"="lzo");
insert overwrite table dwt_user_topic
SELECT
t1.user_id, login_date_first, login_date_last,
login_count, login_last_30d_count, order_date_first,
order_date_last, order_count, order_amount, order_last_30d_count,
order_last_30d_amount, payment_date_first, payment_date_last,
payment_count, payment_amount, payment_last_30d_count,
payment_last_30d_amount
from
(SELECT
nvl(old.user_id ,new.user_id ) user_id,
nvl(old.login_date_first,new.dt) login_date_first,
nvl(new.dt,old.login_date_last) login_date_last,
-- old.order_date_first为null,说明此人之前一次单也没下,此时判断今天下单的数量是否为0,
nvl(old.order_date_first ,IF (new.order_count = 0,null,new.dt)) order_date_first,
IF (new.order_count > 0,new.dt,old.order_date_last) order_date_last,
nvl(old.payment_date_first ,IF (new.payment_count = 0,null,new.dt)) payment_date_first,
IF (new.payment_count > 0,new.dt,old.payment_date_last) payment_date_last,
--old + new
nvl(old.login_count,0 )+ if(new.login_count is null,0,1) login_count,
nvl(old.order_count,0) + nvl(new.order_count,0) order_count,
nvl(old.order_amount,0.0) + nvl(new.order_amount,0.0) order_amount,
nvl(old.payment_count,0) + nvl(new.payment_count,0) payment_count,
nvl(old.payment_amount,0.0) + nvl(new.payment_amount,0.0) payment_amount
from dwt_user_topic old
full join
(select
--login_count 不可能为0,从启动日志查询
--order_count ,payment_count 有可能为0,如果为0代表今天并没有下单或支付
user_id,login_count ,order_count ,payment_count
,order_amount,payment_amount,dt
from dws_user_action_daycount
where dt='2020-10-10') new
on old.user_id =new.user_id) t1
left join
(SELECT
user_id,
count(*) login_last_30d_count,
sum(order_count ) order_last_30d_count,
sum(order_amount ) order_last_30d_amount,
sum(payment_count ) payment_last_30d_count,
sum(payment_amount ) payment_last_30d_amount
from
dws_user_action_daycount
where dt BETWEEN date_sub('2020-10-10',29) and '2020-10-10' --【累积30天的数据】
GROUP by user_id ) t2
on t1.user_id = t2.user_id
drop table if exists dwt_sku_topic;
create external table dwt_sku_topic
(
sku_id string comment 'sku_id',
spu_id string comment 'spu_id',
order_last_30d_count bigint comment '最近30日被下单次数',
order_last_30d_num bigint comment '最近30日被下单件数',
order_last_30d_amount decimal(16,2) comment '最近30日被下单金额',
payment_last_30d_count bigint comment '最近30日被支付次数',
payment_last_30d_num bigint comment '最近30日被支付件数',
payment_last_30d_amount decimal(16,2) comment '最近30日被支付金额',
refund_last_30d_count bigint comment '最近三十日退款次数',
refund_last_30d_num bigint comment '最近三十日退款件数',
refund_last_30d_amount decimal(16,2) comment '最近三十日退款金额',
cart_last_30d_count bigint comment '最近30日被加入购物车次数',
favor_last_30d_count bigint comment '最近30日被收藏次数',
appraise_last_30d_good_count bigint comment '最近30日好评数',
appraise_last_30d_mid_count bigint comment '最近30日中评数',
appraise_last_30d_bad_count bigint comment '最近30日差评数',
appraise_last_30d_default_count bigint comment '最近30日默认评价数',
order_count bigint comment '累积被下单次数',
order_num bigint comment '累积被下单件数',
order_amount decimal(16,2) comment '累积被下单金额',
payment_count bigint comment '累积被支付次数',
payment_num bigint comment '累积被支付件数',
payment_amount decimal(16,2) comment '累积被支付金额',
refund_count bigint comment '累积退款次数',
refund_num bigint comment '累积退款件数',
refund_amount decimal(16,2) comment '累积退款金额',
cart_count bigint comment '累积被加入购物车次数',
favor_count bigint comment '累积被收藏次数',
appraise_good_count bigint comment '累积好评数',
appraise_mid_count bigint comment '累积中评数',
appraise_bad_count bigint comment '累积差评数',
appraise_default_count bigint comment '累积默认评价数'
)COMMENT '商品主题宽表'
stored as parquet
location '/warehouse/gmall/dwt/dwt_sku_topic/'
tblproperties ("parquet.compression"="lzo");
with
t3 as
(SELECT
id ,spu_id
from dwd_dim_sku_info
where dt='2020-10-10'),
t2 as
(SELECT
sku_id,
sum(order_count ) order_last_30d_count,
sum(order_num ) order_last_30d_num,
sum(order_amount ) order_last_30d_amount,
sum(payment_count ) payment_last_30d_count,
sum(payment_num ) payment_last_30d_num,
sum(payment_amount ) payment_last_30d_amount,
sum(refund_count ) refund_last_30d_count,
sum(refund_num ) refund_last_30d_num,
sum(refund_amount ) refund_last_30d_amount,
sum(cart_count ) cart_last_30d_count,
sum(favor_count ) favor_last_30d_count,
sum(appraise_good_count ) appraise_last_30d_good_count,
sum(appraise_mid_count ) appraise_last_30d_mid_count,
sum(appraise_bad_count ) appraise_last_30d_bad_count,
sum(appraise_default_count ) appraise_last_30d_default_count
from dws_sku_action_daycount
where dt BETWEEN date_sub('2020-10-10',29) and '2020-10-10'
GROUP by sku_id ),
-- dwt_sku_topic old + dws_sku_daycount new
t1 as
(SELECT
nvl(old.sku_id,new.sku_id ) sku_id,
nvl(old.order_count,0 ) + nvl(new.order_count,0) order_count,
nvl(old.order_num,0 ) + nvl(new.order_num,0) order_num,
nvl(old.order_amount,0.0 ) + nvl(new.order_amount,0.0) order_amount,
nvl(old.payment_count,0 ) + nvl(new.payment_count,0) payment_count,
nvl(old.payment_num,0 ) + nvl(new.payment_num,0) payment_num,
nvl(old.payment_amount,0.0 ) + nvl(new.payment_amount,0.0) payment_amount,
nvl(old.refund_count,0 ) + nvl(new.refund_count,0) refund_count,
nvl(old.refund_num,0 ) + nvl(new.refund_num,0) refund_num,
nvl(old.refund_amount,0.0 ) + nvl(new.refund_amount,0.0) refund_amount,
nvl(old.cart_count,0 ) + nvl(new.cart_count,0) cart_count,
nvl(old.favor_count,0 ) + nvl(new.favor_count,0) favor_count,
nvl(old.appraise_good_count,0 ) + nvl(new.appraise_good_count,0) appraise_good_count,
nvl(old.appraise_mid_count,0 ) + nvl(new.appraise_mid_count,0) appraise_mid_count,
nvl(old.appraise_bad_count,0 ) + nvl(new.appraise_bad_count,0) appraise_bad_count,
nvl(old.appraise_default_count,0 ) + nvl(new.appraise_default_count,0) appraise_default_count
from dwt_sku_topic old
full join
(select
*
from dws_sku_action_daycount
where dt='2020-10-10') new
on old.sku_id = new.sku_id )
insert overwrite table dwt_sku_topic
select
t1.sku_id,
spu_id,
nvl(order_last_30d_count,0),
nvl(order_last_30d_num,0),
nvl(order_last_30d_amount,0),
order_count, order_num, order_amount,
nvl(payment_last_30d_count,0),
nvl(payment_last_30d_num,0),
nvl(payment_last_30d_amount,0.0), payment_count, payment_num,
payment_amount,
nvl(refund_last_30d_count,0),
nvl(refund_last_30d_num,0),
nvl(refund_last_30d_amount,0.0), refund_count, refund_num, refund_amount,
nvl(cart_last_30d_count,0), cart_count,
nvl(favor_last_30d_count,0),
favor_count,
nvl(appraise_last_30d_good_count,0),
nvl(appraise_last_30d_mid_count,0),
nvl(appraise_last_30d_bad_count,0),
nvl(appraise_last_30d_default_count,0), appraise_good_count,
appraise_mid_count, appraise_bad_count, appraise_default_count
from t1 left join t2 on t1.sku_id=t2.sku_id
left join t3 on t1.sku_id = t3.id
drop table if exists dwt_activity_topic;
create external table dwt_activity_topic(
`id` string COMMENT '编号',
`activity_name` string COMMENT '活动名称',
`activity_type` string COMMENT '活动类型',
`start_time` string COMMENT '开始时间',
`end_time` string COMMENT '结束时间',
`create_time` string COMMENT '创建时间',
`display_day_count` bigint COMMENT '当日曝光次数',
`order_day_count` bigint COMMENT '当日下单次数',
`order_day_amount` decimal(20,2) COMMENT '当日下单金额',
`payment_day_count` bigint COMMENT '当日支付次数',
`payment_day_amount` decimal(20,2) COMMENT '当日支付金额',
`display_count` bigint COMMENT '累积曝光次数',
`order_count` bigint COMMENT '累积下单次数',
`order_amount` decimal(20,2) COMMENT '累积下单金额',
`payment_count` bigint COMMENT '累积支付次数',
`payment_amount` decimal(20,2) COMMENT '累积支付金额'
) COMMENT '活动主题宽表'
stored as parquet
location '/warehouse/gmall/dwt/dwt_activity_topic/'
tblproperties ("parquet.compression"="lzo");
insert overwrite table dwt_activity_topic
SELECT
nvl(old.id , new.id) id,
nvl(old.activity_name , new.activity_name) activity_name,
nvl(old.activity_type , new.activity_type) activity_type,
nvl(old.start_time , new.start_time) start_time,
nvl(old.end_time , new.end_time) end_time,
nvl(old.create_time , new.create_time) create_time,
nvl(new.display_count,0) display_day_count,
nvl(new.order_count,0) order_day_count,
nvl(new.order_amount,0.0) order_day_amount,
nvl(new.payment_count,0) payment_day_count,
nvl(new.payment_amount,0.0) payment_day_amount,
nvl(old.display_count,0) + nvl(new.display_count,0) display_count,
nvl(old.order_count,0) + nvl(new.order_count,0) order_count,
nvl(old.order_amount,0.0) + nvl(new.order_amount,0.0) order_amount,
nvl(old.payment_count,0) + nvl(new.payment_count,0) payment_count,
nvl(old.payment_amount,0.0) + nvl(new.payment_amount,0.0) payment_amount
from dwt_activity_topic old
full join
(select
*
from dws_activity_info_daycount
where dt='2020-10-10') new
on old.id =new.id
drop table if exists dwt_area_topic;
create external table dwt_area_topic(
`id` bigint COMMENT '编号',
`province_name` string COMMENT '省份名称',
`area_code` string COMMENT '地区编码',
`iso_code` string COMMENT 'iso编码',
`region_id` string COMMENT '地区ID',
`region_name` string COMMENT '地区名称',
`login_day_count` string COMMENT '当天活跃设备数',
`payment_day_count` bigint COMMENT '当天支付次数',
`payment_day_amount` decimal(16,2) COMMENT '当天支付金额',
`order_day_count` bigint COMMENT '当天下单次数',
`order_day_amount` decimal(16,2) COMMENT '当天下单金额',
`login_last_30d_count` string COMMENT '最近30天活跃设备数',
`order_last_30d_count` bigint COMMENT '最近30天下单次数',
`order_last_30d_amount` decimal(16,2) COMMENT '最近30天下单金额',
`payment_last_30d_count` bigint COMMENT '最近30天支付次数',
`payment_last_30d_amount` decimal(16,2) COMMENT '最近30天支付金额'
) COMMENT '地区主题宽表'
stored as parquet
location '/warehouse/gmall/dwt/dwt_area_topic/'
tblproperties ("parquet.compression"="lzo");
insert overwrite table dwt_area_topic
SELECT
t1.id, province_name, area_code, iso_code, region_id,
region_name, login_day_count,
nvl(login_last_30d_count,0), order_day_count,
order_day_amount,
nvl(order_last_30d_count,0),
nvl(order_last_30d_amount,0.0), payment_day_count,
payment_day_amount,
nvl(payment_last_30d_count,0),
nvl(payment_last_30d_amount,0)
from
(SELECT
nvl(old.id ,new.id) id,
nvl(old.province_name ,new.province_name) province_name,
nvl(old.area_code ,new.area_code) area_code,
nvl(old.iso_code ,new.iso_code) iso_code,
nvl(old.region_id ,new.region_id) region_id,
nvl(old.region_name ,new.region_name) region_name,
nvl(new.login_count,0) login_day_count,
nvl(new.payment_count,0) payment_day_count,
nvl(new.payment_amount,0.0) payment_day_amount,
nvl(new.order_count,0) order_day_count,
nvl(new.order_amount,0.0) order_day_amount
from dwt_area_topic old
full join
(
select
*
from dws_area_stats_daycount
where dt='2020-10-10'
)new
on old.id =new.id) t1
left join
(SELECT
id,
sum(login_count) login_last_30d_count,
sum(order_count) order_last_30d_count,
sum(order_amount) order_last_30d_amount,
sum(payment_count) payment_last_30d_count,
sum(payment_amount) payment_last_30d_amount
from dws_area_stats_daycount
where dt BETWEEN date_sub('2020-10-10',29) and '2020-10-10'
GROUP by id) t2
on t1.id=t2.id
--How the data is loaded:
dwt_xxx (old) full join dws_xxx (dt = 'latest day') (new).
Old and new data are merged — in some cases with extra logic — and the result overwrites the original table.
--Notes:
① cumulative xxx figures
old + new
last-30-day figures:
take the last 30 days of data from the DWS layer, group, and sum.
current cumulative figures:
add the newest data to the previously accumulated result
(
--nvl(old value,0) + nvl(new value,0)
meaning:
if nothing has been accumulated yet, today's data becomes the accumulated value;
otherwise the previous accumulation plus today's data becomes the new accumulated value
)
--note: every field needs an nvl null check
The ADS layer maps directly to requirements — usually one table per requirement (or family of requirements).
ADS tables could be partitioned, but the data volume is small, so partitioning is unnecessary; they are normally full tables.
Every ADS statistic must carry a corresponding date attribute.
ADS data is loaded from DWS and DWT: query DWS for data covering some N days, and DWT for cumulative data up to now.
Daily active: devices active that day
Weekly active: devices active that week
Monthly active: devices active that month
Active: launched the app at least once within the given date range.
'From the device topic wide table:'
statistics date: the date the statistics are for;
daily active count: last-active date is today;
weekly active count: devices whose last-active date falls between this Monday and this Sunday — it is enough to require last-active date >= this Monday, which already means active this week;
--this Monday: date_sub(next_day('2020-10-10','Mon'),7) — next Monday minus 7 days
monthly active count: devices whose last-active date falls in the same month as the statistics date;
--last day of a date's month: last_day(date)
--e.g. last_day(login_date_last) = last_day('2020-10-10')
is_weekend (Y/N): whether the statistics date is the end of the week, used to pick the final weekly figure;
--end of this week: date_sub(next_day('date','Mon'),1)
is_monthend (Y/N): whether the statistics date is the end of the month, used to pick the final monthly figure;
--end of the month: last_day('date')
drop table if exists ads_uv_count;
create external table ads_uv_count(
`dt` string COMMENT '统计日期',
`day_count` bigint COMMENT '当日用户数量',
`wk_count` bigint COMMENT '当周用户数量',
`mn_count` bigint COMMENT '当月用户数量',
`is_weekend` string COMMENT 'Y,N是否是周末,用于得到本周最终结果',
`is_monthend` string COMMENT 'Y,N是否是月末,用于得到本月最终结果'
) COMMENT '活跃设备数'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_uv_count/';
insert into table ads_uv_count
SELECT
'2020-10-10' dt,
-- 如果 login_date_last = 今天,说明今日活跃
sum(if(login_date_last ='2020-10-10',1,0)) day_count,
-- login_date_last 介于本周一和周日之间,说明本周活跃
sum(if(login_date_last >= date_sub(next_day('2020-10-10','Mon'),7),1,0)) wk_count,
-- login_date_last 在本月,说明本月活跃过
sum(if(last_day(login_date_last) = last_day('2020-10-10'),1,0 )) mn_count,
if(date_sub(next_day('2020-10-10','Mon'),1)='2020-10-10','Y','N') is_weekend,
if(last_day('2020-10-10') = '2020-10-10','Y','N' ) is_monthend
from dwt_uv_topic
'From the device topic wide table:'
creation date,
new device count:
--new devices are those whose first-active date is today;
--filter them out and count
create external table ads_new_mid_count
(
`create_date` string comment '创建时间' ,
`new_mid_count` BIGINT comment '新增设备数量'
) COMMENT '每日新增设备数量'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_new_mid_count/';
insert into table ads_new_mid_count
select
'2020-10-10',
count(*)
from dwt_uv_topic
where login_date_first = '2020-10-10'
Dates involved: the date devices were added, the number of retention days, and the active date N days later.
add date + retention days = the active date being checked.
Example: 1-day retention for 2020-10-10:
add date: 2020-10-10; devices added that day: x
retention days: 1
active date checked: 2020-10-11; of x, the devices active on 2020-10-11: y
devices added on 2020-10-10 and retained for 1 day: y
1-day retention rate for 2020-10-10: y / x
① statistics date: chosen by us
② device add date
③ retention days as of the statistics date: e.g. 2 — set this to however many retention days you are computing
'From the DWT device topic wide table:'
④ retained count: of the devices added N days before, those active on the statistics date
--e.g. devices added on 2020-10-08 and active on 2020-10-10: filter first-active date = statistics date minus N days (N = retention days) and last-active date = statistics date, then count(*)
【
where login_date_first = date_sub('2020-10-10',2【retention days】)
and
login_date_last = '2020-10-10'
】
'From the ADS daily new-device table:'
⑤ new device count: the devices added on the add date
--create_date = statistics date minus N days
【
where create_date=date_sub('2020-10-10',2)
】
'Join the two results:'
⑥ retention rate: retained count / new device count
-- cast(retention_count / new_mid_count * 100 as decimal(16,2)) retention_ratio
Compute the 1-day, 2-day and 3-day retention rates separately and combine them with union all.
drop table if exists ads_user_retention_day_rate;
create external table ads_user_retention_day_rate
(
`stat_date` string comment '统计数据的日期',
`create_date` string comment '设备新增日期',
`retention_day` int comment '截止当前日期留存天数',
`retention_count` bigint comment '留存数量',
`new_mid_count` bigint comment '设备新增数量',
`retention_ratio` decimal(16,2) comment '留存率'
) COMMENT '留存率'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_retention_day_rate/';
insert into table ads_user_retention_day_rate
select
'2020-10-10' stat_date,
date_sub('2020-10-10',1) create_date,
1 retention_day,
retention_count,
new_mid_count,
cast(retention_count / new_mid_count * 100 as decimal(16,2)) retention_ratio
from
(select
-- 已经有2020-10-10的日活,
'2020-10-10' dt,
count(*) retention_count
from dwt_uv_topic
-- 2020-10-9日新增的,在2020-10-10活跃的设备
where login_date_first = date_sub('2020-10-10',1)
and
login_date_last = '2020-10-10' ) t1
join
(select
'2020-10-10' dt,
new_mid_count
from ads_new_mid_count
where create_date=date_sub('2020-10-10',1) ) t2
on t1.dt = t2.dt
union all
select
'2020-10-10' stat_date,
date_sub('2020-10-10',2) create_date,
2 retention_day,
retention_count,
new_mid_count,
cast(retention_count / new_mid_count * 100 as decimal(16,2)) retention_ratio
from
(select
-- 已经有2020-10-10的日活,
'2020-10-10' dt,
count(*) retention_count
from dwt_uv_topic
-- 2020-10-8日新增的,在2020-10-10活跃的设备
where login_date_first = date_sub('2020-10-10',2)
and
login_date_last = '2020-10-10' ) t1
join
(select
'2020-10-10' dt,
new_mid_count
from ads_new_mid_count
where create_date=date_sub('2020-10-10',2) ) t2
on t1.dt = t2.dt
union all
select
'2020-10-10' stat_date,
date_sub('2020-10-10',3) create_date,
3 retention_day,
retention_count,
new_mid_count,
cast(retention_count / new_mid_count * 100 as decimal(16,2)) retention_ratio
from
(select
-- 已经有2020-10-10的日活,
'2020-10-10' dt,
count(*) retention_count
from dwt_uv_topic
-- 2020-10-7日新增的,在2020-10-10活跃的设备
where login_date_first = date_sub('2020-10-10',3)
and
login_date_last = '2020-10-10' ) t1
join
(select
'2020-10-10' dt,
new_mid_count
from ads_new_mid_count
where create_date=date_sub('2020-10-10',3) ) t2
on t1.dt = t2.dt
Silent users: devices that launched the app only on the install day, and the install day is at least 7 days ago.
Launched only on the install day: in the DWT layer, login_date_last = login_date_first = install day,
and login_date_first <= date_sub(current date, 7).
'From the device topic wide table:'
statistics date,
silent device count: first-active date equals last-active date,
and first-active date is no later than the statistics date minus 7 days; filter, then count(*).
drop table if exists ads_silent_count;
create external table ads_silent_count(
`dt` string COMMENT '统计日期',
`silent_count` bigint COMMENT '沉默设备数'
) COMMENT '沉默用户数'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_silent_count';
insert into table ads_silent_count
select
'2020-10-10',
count(*)
from dwt_uv_topic
where login_date_first = login_date_last
and
login_date_first <= date_sub('2020-10-10',7)
Returning users this week: old users active this week who were not active last week.
Returning users this week = old users active this week minus (set difference) old users active last week.
Set difference in SQL: a left join b on xxx where b.id is null.
'From the DWT device topic wide table:'
returning users this week = old users active this week - old users active last week
old users active this week:
--active this week means last-active date >= this Monday
--an old user means first-active date < this Monday
【
login_date_last >= date_sub(next_day('2020-10-10','MO'),7)
AND
login_date_first < date_sub(next_day('2020-10-10','MO'),7)
】this yields the old users active this week
'From the DWS daily device behavior table:'
users active last week:
--the daily device table records the devices active on each day,
--so the devices active last week are the device ids whose dt falls within last week
【
where dt between date_sub(next_day('2020-10-10','MO'),14)
and
date_sub(next_day('2020-10-10','MO'),8)
】
'From the left join of the two results:'
statistics date,
the week the statistics date falls in:
--concat(date_sub(next_day('2020-10-10','Mon'),7) , '_' , date_sub(next_day('2020-10-10','Mon'),1))
returning device count:
--count(*) over the rows where the right-hand (last week's) device id is null
drop table if exists ads_back_count;
create external table ads_back_count(
`dt` string COMMENT '统计日期',
`wk_dt` string COMMENT '统计日期所在周',
`wastage_count` bigint COMMENT '回流设备数'
) COMMENT '本周回流用户数'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_back_count';
insert into table ads_back_count
select
'2020-10-10' dt,
concat(date_sub(next_day('2020-10-10','MO'),7),'_',date_sub(next_day('2020-10-10','MO'),1)) wk_dt,
count(*) wastage_count
from
-- 本周活跃的老用户 可以从dwt层取
(select
mid_id
from dwt_uv_topic
-- 本周活跃 : login_date_last >= 本周一
-- 本周老用户 : login_date_first < 本周一
where login_date_last >= date_sub(next_day('2020-10-10','MO'),7)
AND
login_date_first < date_sub(next_day('2020-10-10','MO'),7)) t1
left join
-- 上周活跃的老用户
-- 举例: 有用户 a ,首次登录时间为2020-1-1,一定是老用户,假设上周2020-10-1日登录
-- 2020-10-10 又登录了,此时 login_date_last更新为 2020-10-10
-- 无法确定a,是否在上周活跃,无法从dwt层获取此数据,只能从dws层取,取上周所有的数据
(SELECT
DISTINCT mid_id
from dws_uv_detail_daycount
where dt BETWEEN date_sub(next_day('2020-10-10','MO'),14)
and date_sub(next_day('2020-10-10','MO'),8) ) t2
on t1.mid_id = t2.mid_id
where t2.mid_id is null
Churned users: devices with no activity in the last 7 days —
the last login was at least 7 days ago.
From the DWT layer: login_date_last <= date_sub(current date, 7).
'From the DWT device topic wide table:'
churned device count:
--the last login is already 7 days in the past,
--i.e. last-active date <= statistics date minus 7 days
【
where login_date_last <= date_sub('2020-10-10',7)
】
drop table if exists ads_wastage_count;
create external table ads_wastage_count(
`dt` string COMMENT '统计日期',
`wastage_count` bigint COMMENT '流失设备数'
) COMMENT '流失用户数'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_wastage_count';
insert into table ads_wastage_count
select
'2020-10-10',
count(*)
from dwt_uv_topic
where login_date_last <= date_sub('2020-10-10',7)
Active for three consecutive weeks: active at least once in each of the last three weeks.
Take this week's weekly active users: t1
Take last week's weekly active users: t2
Take the week before last's weekly active users: t3
Deduplicate t1, t2 and t3,
then t1 union all t2 union all t3,
group by mid_id,
and keep the groups where count(*) = 3.
'From the DWS daily device behavior table:'
this week's weekly active users
--where dt between date_sub(next_day('2020-10-10','Mon'),7) and date_sub(next_day('2020-10-10','Mon'),1)
last week's weekly active users
--where dt between date_sub(next_day('2020-10-10','Mon'),14) and date_sub(next_day('2020-10-10','Mon'),8)
the week before last's weekly active users
--where dt between date_sub(next_day('2020-10-10','Mon'),21) and date_sub(next_day('2020-10-10','Mon'),15)
'From the union all of the three result sets:'
group by device,
and keep the devices that appear exactly 3 times.
'From the filtered result:'
statistics date,
time span:
--concat(date_sub(next_day('2020-10-10','Mon'),21), '_' ,'2020-10-10')
devices active for three consecutive weeks:
--the number of rows in the filtered result, count(*)
drop table if exists ads_continuity_wk_count;
create external table ads_continuity_wk_count(
`dt` string COMMENT '统计日期,一般用结束周周日日期,如果每天计算一次,可用当天日期',
`wk_dt` string COMMENT '持续时间',
`continuity_count` bigint COMMENT '活跃次数'
) COMMENT '最近连续三周活跃用户数'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_continuity_wk_count';
insert into table ads_continuity_wk_count
select
'2020-10-10',
concat(date_sub(next_day('2020-10-10','MO'),21),'_','2020-10-10'),
count(*)
from
(select
mid_id
from
(--求本周的周活用户
select
DISTINCT mid_id
from dws_uv_detail_daycount
where dt BETWEEN date_sub(next_day('2020-10-10','MO'),7)
and date_sub(next_day('2020-10-10','MO'),1)
UNION all
--求上周的周活用户
select
DISTINCT mid_id
from dws_uv_detail_daycount
where dt BETWEEN date_sub(next_day('2020-10-10','MO'),14)
and date_sub(next_day('2020-10-10','MO'),8)
UNION all
--求上上周的周活用户
select
DISTINCT mid_id
from dws_uv_detail_daycount
where dt BETWEEN date_sub(next_day('2020-10-10','MO'),21)
and date_sub(next_day('2020-10-10','MO'),15) ) tmp
GROUP by mid_id
having COUNT(*) = 3 ) tmp2
Three consecutive days
Consecutive data has the following property:
column A starts at a and increases by x on every row;
column B starts at b and increases by y on every row;
then the difference A-B changes by a constant (x-y) between adjacent rows.
A | B | A-B | change of A-B between adjacent rows |
---|---|---|---|
a | b | a-b | x-y |
a+x | b+y | a-b+(x-y) | x-y |
a+2x | b+2y | a-b+2(x-y) | … |
If A and B grow by the same step (x = y), A-B stays constant from row to row.
Using this rule, three consecutive days can be found as follows:
ID | date A | reference column B (row_number) | A-B |
---|---|---|---|
1 | 2020-08-10 | 1 | 2020-08-09 |
1 | 2020-08-11 | 2 | 2020-08-09 |
1 | 2020-08-12 | 3 | 2020-08-09 |
We only need a reference column that increases at the same rate as column A,
take the difference,
group by ID and the difference, and any group with >= 3 rows is a run of at least 3 consecutive days.
'From the DWS daily device behavior table:'
take the device id and date,
apply a window function partitioned by device id and ordered by date,
--filtering to the last seven days【dt > date_sub('2020-10-10',7)】→ t1
'From t1:'
take the device id, the date, the window result, and the date minus the window result → t2
'From t2:'
group by device id and (date minus window result),
take the device id → t3
'From t3:'
statistics date;
the last-7-day range;
the number of users active on three consecutive days
--count(*) over the final result
drop table if exists ads_continuity_uv_count;
create external table ads_continuity_uv_count(
`dt` string COMMENT '统计日期',
`wk_dt` string COMMENT '最近7天日期',
`continuity_count` bigint
) COMMENT '最近七天内连续三天活跃用户数'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_continuity_uv_count';
insert into table ads_continuity_uv_count
select
'2020-10-10' ,
concat(date_sub('2020-10-10',7),'_','2020-10-10') ,
COUNT(DISTINCT mid_id )
from
(select
mid_id
from
(select
mid_id,dt,rn,
--差值
date_sub(dt,rn) datediff
from
(select
mid_id,dt , ROW_NUMBER() over(PARTITION by mid_id order by dt) rn
from dws_uv_detail_daycount
where dt > date_sub('2020-10-10',7) ) t1 ) t2
GROUP by mid_id , datediff
having COUNT(*) >=3 ) t3
Member: identified by user_id.
Member activity rate: active members / total members.
Member payment rate: total paying members / total members.
Member freshness: new members / active members.
'From the member topic wide table:'
statistics date;
active members;
--mark 1 when the last login date equals the statistics date, then sum
new members;
--mark 1 when the first login date equals the statistics date, then sum
new paying members;
--mark 1 when the first payment date equals the statistics date, then sum
total paying members;
--mark 1 when the last payment date is not null, then sum
total members;
--count(*)
member activity rate;
--active members / total members
member payment rate;
--total paying members / total members
member freshness;
--new members / active members
drop table if exists ads_user_topic;
create external table ads_user_topic(
`dt` string COMMENT '统计日期',
`day_users` string COMMENT '活跃会员数',
`day_new_users` string COMMENT '新增会员数',
`day_new_payment_users` string COMMENT '新增消费会员数',
`payment_users` string COMMENT '总付费会员数',
`users` string COMMENT '总会员数',
`day_users2users` decimal(16,2) COMMENT '会员活跃率',
`payment_users2users` decimal(16,2) COMMENT '会员付费率',
`day_new_users2users` decimal(16,2) COMMENT '会员新鲜度'
) COMMENT '会员信息表'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_topic';
insert into table ads_user_topic
select
'2020-10-10',
sum(if(login_date_last = '2020-10-10',1,0)) day_users,
sum(if(login_date_first = '2020-10-10',1,0)) day_new_users,
sum(if(payment_date_first = '2020-10-10',1,0)) day_new_payment_users,
sum(if(payment_date_last is not null,1,0)) payment_users,
count(*) users,
--会员活跃率: 活跃会员数 / 会员总数
cast(sum(IF (login_date_last='2020-10-10',1,0)) / count(*) * 100 as decimal(16,2)) day_users2users,
--会员付费率: 总付费会员数 / 会员总数
cast(sum(if(payment_date_last is not null,1,0)) / count(*) * 100 as decimal(16,2)) payment_users2users,
--会员新鲜度: 新增用户数 / 活跃用户数
cast(sum(if(login_date_first = '2020-10-10',1,0)) / sum(if(login_date_last = '2020-10-10',1,0)) * 100 as decimal(16,2)) day_new_users2users
from dwt_user_topic
Funnel analysis: used to analyze conversion rates.
'From the DWD page log:'
home-page visitors;
--filter page_id = 'home', deduplicate user_id, then count: that is the number of home-page visitors
product-detail-page visitors;
--filter page_id = 'good_detail', deduplicate user_id, then count: that is the number of product-detail visitors
'From the DWS daily member behavior table:'
users who added to cart;
--mark 1 when the add-to-cart count > 0, else 0, then sum
users who placed an order;
--mark 1 when the order count > 0, else 0, then sum
users who paid;
--mark 1 when the payment count > 0, else 0, then sum
'Join the two results:'
home → product-detail conversion rate;
--product-detail visitors / home-page visitors
product-detail → cart conversion rate;
--users who added to cart / product-detail visitors
cart → order conversion rate;
--users who ordered / users who added to cart
order → payment conversion rate;
--users who paid / users who ordered
drop table if exists ads_user_action_convert_day;
create external table ads_user_action_convert_day(
`dt` string COMMENT '统计日期',
`home_count` bigint COMMENT '浏览首页人数',
`good_detail_count` bigint COMMENT '浏览商品详情页人数',
`home2good_detail_convert_ratio` decimal(16,2) COMMENT '首页到商品详情转化率',
`cart_count` bigint COMMENT '加入购物车的人数',
`good_detail2cart_convert_ratio` decimal(16,2) COMMENT '商品详情页到加入购物车转化率',
`order_count` bigint COMMENT '下单人数',
`cart2order_convert_ratio` decimal(16,2) COMMENT '加入购物车到下单转化率',
`payment_amount` bigint COMMENT '支付人数',
`order2payment_convert_ratio` decimal(16,2) COMMENT '下单到支付的转化率'
) COMMENT '漏斗分析'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_action_convert_day/';
insert into table ads_user_action_convert_day
select
t1.dt,
home_count,good_detail_count,
cast(good_detail_count / home_count * 100 as decimal(16,2)) home2good_detail_convert_ratio,
cart_count,
cast(cart_count /good_detail_count * 100 as decimal(16,2)) good_detail2cart_convert_ratio,
order_count,
cast(order_count /cart_count * 100 as decimal(16,2)) cart2order_convert_ratio,
payment_amount,
cast(payment_amount / order_count * 100 as decimal(16,2)) order2payment_convert_ratio
from
(select
'2020-10-10' dt,
count(DISTINCT user_id ) home_count
from dwd_page_log
where dt='2020-10-10' and page_id ='home'
and user_id is not NULL ) t1
join
(select
'2020-10-10' dt,
count(DISTINCT user_id ) good_detail_count
from dwd_page_log
where dt='2020-10-10' and page_id ='good_detail'
and user_id is not NULL ) t2
on t1.dt = t2.dt
join
(SELECT
'2020-10-10' dt,
sum(if(cart_count > 0 ,1 ,0)) cart_count,
sum(if(order_count > 0 ,1 ,0)) order_count,
sum(if(payment_count > 0 ,1 ,0)) payment_amount
from dws_user_action_daycount
where dt='2020-10-10') t3
on t1.dt=t3.dt
'From the DWD sku dimension table:'
for the statistics date, count the number of sku ids and the number of distinct spu ids.
drop table if exists ads_product_info;
create external table ads_product_info(
`dt` string COMMENT '统计日期',
`sku_num` string COMMENT 'sku个数',
`spu_num` string COMMENT 'spu个数'
) COMMENT '商品个数信息'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_product_info';
insert into table ads_product_info
select
'2020-10-10',
count(*),
count(distinct spu_id)
from dwd_dim_sku_info
where dt = '2020-10-10'
'From the DWT sku topic wide table:'
take the cumulative number of units paid and rank by it.
drop table if exists ads_product_sale_topN;
create external table ads_product_sale_topN(
`dt` string COMMENT '统计日期',
`sku_id` string COMMENT '商品ID',
`payment_amount` bigint COMMENT '销量'
) COMMENT '商品销量排名'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_product_sale_topN';
insert into table ads_product_sale_topN
select
'2020-10-10',
sku_id,
payment_num
from dwt_sku_topic
where payment_num > 0
order by payment_num desc
limit 10
'From the DWT sku topic wide table:'
take the cumulative favorite count and rank by it.
drop table if exists ads_product_favor_topN;
create external table ads_product_favor_topN(
`dt` string COMMENT '统计日期',
`sku_id` string COMMENT '商品ID',
`favor_count` bigint COMMENT '收藏量'
) COMMENT '商品收藏排名'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_product_favor_topN';
insert into table ads_product_favor_topN
select
'2020-10-10',
sku_id ,
favor_count
from dwt_sku_topic
where favor_count > 0
order by favor_count desc
limit 10
'From the DWT sku topic wide table:'
take the cumulative add-to-cart count and rank by it.
drop table if exists ads_product_cart_topN;
create external table ads_product_cart_topN(
`dt` string COMMENT '统计日期',
`sku_id` string COMMENT '商品ID',
`cart_count` bigint COMMENT '加入购物车次数'
) COMMENT '商品加入购物车排名'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_product_cart_topN';
insert into table ads_product_cart_topN
select
'2020-10-10',
sku_id ,
cart_count
from dwt_sku_topic
where cart_count > 0
order by cart_count desc
limit 10
'From the DWT sku topic wide table:'
refund rate;
--last-30-day refund count / last-30-day payment count
filter skus whose last-30-day refund count and last-30-day payment count are both greater than 0,
and rank by refund rate.
drop table if exists ads_product_refund_topN;
create external table ads_product_refund_topN(
`dt` string COMMENT '统计日期',
`sku_id` string COMMENT '商品ID',
`refund_ratio` decimal(16,2) COMMENT '退款率'
) COMMENT '商品退款率排名'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_product_refund_topN';
insert into table ads_product_refund_topN
select
'2020-10-10',
sku_id ,
cast( refund_last_30d_count / payment_last_30d_count * 100 as decimal(16,2)) refund_ratio
from dwt_sku_topic
where payment_last_30d_count > 0 and refund_last_30d_count > 0
order by refund_ratio desc
limit 10
'From the DWT sku topic wide table:'
bad-review rate;
--cumulative bad reviews / (cumulative bad + good + neutral + default reviews)
filter skus whose cumulative bad-review count is greater than 0,
and rank by bad-review rate.
drop table if exists ads_appraise_bad_topN;
create external table ads_appraise_bad_topN(
`dt` string COMMENT '统计日期',
`sku_id` string COMMENT '商品ID',
`appraise_bad_ratio` decimal(16,2) COMMENT '差评率'
) COMMENT '商品差评率'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_appraise_bad_topN';
insert into table ads_appraise_bad_topN
select
'2020-10-10',
sku_id ,
cast( appraise_bad_count / (appraise_bad_count+ appraise_good_count+appraise_mid_count+appraise_default_count) * 100 as decimal(16,2)) appraise_bad_ratio
from dwt_sku_topic
where appraise_bad_count > 0
order by appraise_bad_ratio desc
limit 10
'From the DWS daily member behavior table:'
daily order count:
--sum of each member's order count
daily order amount:
--sum of each member's order amount
daily ordering users:
--the number of members in the result
filter members with an order count greater than 0.
drop table if exists ads_order_daycount;
create external table ads_order_daycount(
dt string comment '统计日期',
order_count bigint comment '单日下单笔数',
order_amount bigint comment '单日下单金额',
order_users bigint comment '单日下单用户数'
) comment '下单数目统计'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_order_daycount';
insert into table ads_order_daycount
select
'2020-10-10',
sum(order_count ) ,
sum(order_amount),
count(*)
from dws_user_action_daycount
where dt='2020-10-10'
and order_count > 0
'From the DWS daily member behavior table:'
daily payment count;
--sum the payment counts of the members who paid on the statistics date
daily payment amount;
--sum each paying member's payment amount for that date
daily paying users;
--count the rows with a payment count greater than 0
'From the DWS daily sku behavior table:'
daily number of products paid for;
--sum of the units paid
'From the DWD order fact table:'
average time from order to payment, in minutes
--(payment time in seconds - order-creation time in seconds) / number of orders / 60
unix_timestamp('time', format): returns the timestamp (seconds since 1970-01-01)
【
cast(sum(unix_timestamp(payment_time ) - unix_timestamp(create_time )) / count(*) / 60 as decimal(16,2)) payment_avg_time
】
drop table if exists ads_payment_daycount;
create external table ads_payment_daycount(
dt string comment '统计日期',
order_count bigint comment '单日支付笔数',
order_amount bigint comment '单日支付金额',
payment_sku_count bigint comment '单日支付商品数',
payment_user_count bigint comment '单日支付人数',
payment_avg_time decimal(16,2) comment '下单到支付的平均时长,取分钟数'
) comment '支付信息统计'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_payment_daycount';
insert into table ads_payment_daycount
select
t1.dt,
order_count,order_amount,payment_sku_count,payment_user_count,payment_avg_time
from
(select
'2020-10-10' dt,
sum(payment_count ) order_count,
sum(payment_amount ) order_amount,
count(*) payment_user_count
from dws_user_action_daycount
where dt='2020-10-10'
and payment_count > 0) t1
join
(SELECT
'2020-10-10' dt,
sum(payment_num) payment_sku_count
from dws_sku_action_daycount
where dt='2020-10-10') t2
on t1.dt = t2.dt
join
(select
'2020-10-10' dt,
cast(sum(unix_timestamp(payment_time ) - unix_timestamp(create_time )) / count(*) / 60 as decimal(16,2)) payment_avg_time
from dwd_fact_order_info
where date_format(payment_time,'yyyy-MM-dd') = '2020-10-10' ) t3
on t1.dt = t3.dt
This computes each brand's repurchase rate for the current month.
Repurchase rate: how often products are bought repeatedly.
Single repurchase rate: buyers with two or more purchases / all buyers.
Multiple repurchase rate: buyers with three or more purchases / all buyers.
'From the DWD order-detail fact table:' --one row per product per order
order id, user id, sku id 【t1】
'From the DWD payment fact table:' --one row per order
order id 【t2】
'From the DWD sku dimension table:'
sku id, brand id 【t3】
Join the three tables, group by user id and brand id, then aggregate by brand.
drop table ads_sale_tm_category1_stat_mn;
create external table ads_sale_tm_category1_stat_mn
(
tm_id string comment '品牌id',
category1_id string comment '1级品类id ',
category1_name string comment '1级品类名称 ',
buycount bigint comment '购买人数',
buy_twice_last bigint comment '两次以上购买人数',
buy_twice_last_ratio decimal(16,2) comment '单次复购率',
buy_3times_last bigint comment '三次以上购买人数',
buy_3times_last_ratio decimal(16,2) comment '多次复购率',
stat_mn string comment '统计月份',
stat_date string comment '统计日期'
) COMMENT '品牌复购率统计'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_sale_tm_category1_stat_mn/';
insert into table ads_sale_tm_category1_stat_mn
select
t4.tm_id,
category1_id, category1_name, buycount,
buy_twice_last, buy_twice_last_ratio,
buy_3times_last, buy_3times_last_ratio,
date_format('2020-10-10','yyyy-MM') stat_mn, '2020-10-10' stat_date
from
(select
tm_id,
sum(if(buytimes > 0,1,0)) buycount,
sum(if(buytimes > 1,1,0)) buy_twice_last,
cast(sum(if(buytimes > 1,1,0)) / sum(if(buytimes > 0,1,0)) * 100 as decimal(16,2)) buy_twice_last_ratio,
sum(if(buytimes > 2,1,0)) buy_3times_last,
cast(sum(if(buytimes > 2,1,0)) / sum(if(buytimes > 0,1,0)) * 100 as decimal(16,2)) buy_3times_last_ratio
from
--求出,每个用户购买每个品牌的记录,进行分组
(select
--求出当前月,所有用户购买的商品的数据
user_id, tm_id,
count(*) buytimes
from
-- 取dwd层 dwd_fact_order_detail join dwd 层 dwd_fact_payment_info
-- 先求出这个月中,一共支付的所有订单的详情数据
(select
order_id ,user_id ,sku_id
-- 一笔订单的一个商品是一条
from dwd_fact_order_detail
where last_day(dt) = last_day('2020-10-10') ) t1
join
(select
order_id
--一笔订单是一条
from dwd_fact_payment_info
where last_day(dt) = last_day('2020-10-10')) t2
on t1.order_id = t2.order_id
join
(select
id,tm_id
from dwd_dim_sku_info
where dt='2020-10-10') t3
on t1.sku_id = t3.id
GROUP by t1.user_id,t3.tm_id ) tmp
group by tm_id) t4
join
(SELECT
tm_id ,
concat_ws('|',collect_set(category1_id)) category1_id,
concat_ws('|',collect_set(category1_name)) category1_name
from dwd_dim_sku_info
where dt='2020-10-10'
group by tm_id ) t5
on t4.tm_id = t5.tm_id
Take the region metrics for the statistics date (all the fields needed from the daily region table).
drop table if exists ads_area_topic;
create external table ads_area_topic(
`dt` string COMMENT '统计日期',
`id` bigint COMMENT '编号',
`province_name` string COMMENT '省份名称',
`area_code` string COMMENT '地区编码',
`iso_code` string COMMENT 'iso编码',
`region_id` string COMMENT '地区ID',
`region_name` string COMMENT '地区名称',
`login_day_count` bigint COMMENT '当天活跃设备数',
`order_day_count` bigint COMMENT '当天下单次数',
`order_day_amount` decimal(16,2) COMMENT '当天下单金额',
`payment_day_count` bigint COMMENT '当天支付次数',
`payment_day_amount` decimal(16,2) COMMENT '当天支付金额'
) COMMENT '地区主题信息'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_area_topic/';
insert into table ads_area_topic
select
dt,
id,province_name ,area_code ,iso_code ,region_id ,region_name ,
login_count login_day_count,
order_count ,
order_amount ,
payment_count ,
payment_amount
from dws_area_stats_daycount
where dt='2020-10-10'
CREATE DATABASE `gmall_report` CHARACTER SET 'utf8' COLLATE 'utf8_general_ci';
(1) Create the user topic table
DROP TABLE IF EXISTS `ads_user_topic`;
CREATE TABLE `ads_user_topic` (
`dt` date NOT NULL,
`day_users` bigint(255) NULL DEFAULT NULL,
`day_new_users` bigint(255) NULL DEFAULT NULL,
`day_new_payment_users` bigint(255) NULL DEFAULT NULL,
`payment_users` bigint(255) NULL DEFAULT NULL,
`users` bigint(255) NULL DEFAULT NULL,
`day_users2users` double(255, 2) NULL DEFAULT NULL,
`payment_users2users` double(255, 2) NULL DEFAULT NULL,
`day_new_users2users` double(255, 2) NULL DEFAULT NULL,
PRIMARY KEY (`dt`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact;
(2) Create the region topic table
DROP TABLE IF EXISTS `ads_area_topic`;
CREATE TABLE `ads_area_topic` (
`dt` date NOT NULL,
`id` int(11) NULL DEFAULT NULL,
`province_name` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
`area_code` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
`iso_code` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`region_id` int(11) NULL DEFAULT NULL,
`region_name` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
`login_day_count` bigint(255) NULL DEFAULT NULL,
`order_day_count` bigint(255) NULL DEFAULT NULL,
`order_day_amount` double(255, 2) NULL DEFAULT NULL,
`payment_day_count` bigint(255) NULL DEFAULT NULL,
`payment_day_amount` double(255, 2) NULL DEFAULT NULL,
PRIMARY KEY (`dt`, `iso_code`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact;
(1) Create the script hdfs_to_mysql.sh in /home/atguigu/bin
[atguigu@hadoop102 bin]$ vim hdfs_to_mysql.sh
and fill it with the following content:
#!/bin/bash
hive_db_name=gmall
mysql_db_name=gmall_report
export_data() {
/opt/module/sqoop/bin/sqoop export \
-Dmapreduce.job.queuename=hive \
--connect "jdbc:mysql://hadoop102:3306/${mysql_db_name}?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 000000 \
--table $1 \
--num-mappers 1 \
--export-dir /warehouse/$hive_db_name/ads/$1 \
--input-fields-terminated-by "\t" \
--update-mode allowinsert \
--update-key $2 \
--input-null-string '\\N' \
--input-null-non-string '\\N'
}
case $1 in
"ads_uv_count")
export_data "ads_uv_count" "dt"
;;
"ads_user_action_convert_day")
export_data "ads_user_action_convert_day" "dt"
;;
"ads_user_topic")
export_data "ads_user_topic" "dt"
;;
"ads_area_topic")
export_data "ads_area_topic" "dt,iso_code"
;;
"all")
export_data "ads_user_topic" "dt"
export_data "ads_area_topic" "dt,iso_code"
#其余表省略未写
;;
esac
Notes:
On whether the export updates or inserts:
--update-mode:
updateonly — only updates existing rows, cannot insert new ones
allowinsert — also allows inserts
--update-key: [usually the primary key]
when updates are allowed, specifies which fields identify two rows as the same record (update instead of insert); separate multiple fields with commas.
--input-null-string and --input-null-non-string:
specify how NULL values in string and non-string columns are handled on export.
Hive stores NULL on disk as "\N", while a MySQL NULL is a real NULL; to keep the two sides consistent,
use --input-null-string and --input-null-non-string when exporting,
and --null-string and --null-non-string when importing.
(2) Run it:
hdfs_to_mysql.sh all
1) Write the azkaban.project file with the following content:
azkaban-flow-version: 2.0
2) Write the gmall.flow file with the following content:
nodes:
  - name: hdfs_to_ods_log
    type: command
    config:
      command: /home/atguigu/bin/hdfs_to_ods_log.sh ${dt}
  - name: hdfs_to_ods_db
    type: command
    config:
      command: /home/atguigu/bin/hdfs_to_ods_db.sh all ${dt}
  - name: ods_to_dwd_log
    type: command
    dependsOn:
      - hdfs_to_ods_log
    config:
      command: /home/atguigu/bin/ods_to_dwd_log.sh ${dt}
  - name: ods_to_dwd_db
    type: command
    dependsOn:
      - hdfs_to_ods_db
    config:
      command: /home/atguigu/bin/ods_to_dwd_db.sh ${dt}
  - name: dwd_to_dws
    type: command
    dependsOn:
      - ods_to_dwd_log
      - ods_to_dwd_db
    config:
      command: /home/atguigu/bin/dwd_to_dws.sh ${dt}
  - name: dws_to_dwt
    type: command
    dependsOn:
      - dwd_to_dws
    config:
      command: /home/atguigu/bin/dws_to_dwt.sh ${dt}
  - name: dwt_to_ads
    type: command
    dependsOn:
      - dws_to_dwt
    config:
      command: /home/atguigu/bin/dwt_to_ads.sh ${dt}
  - name: hdfs_to_mysql
    type: command
    dependsOn:
      - dwt_to_ads
    config:
      command: /home/atguigu/bin/hdfs_to_mysql.sh ads_user_topic
--Note: every time an executor is restarted, its id in the MySQL executors table changes.
In Azkaban multi-executor mode, Executors are deployed on several nodes of the cluster, and the Azkaban web server picks one of them to run each task according to its strategy.
Because the scripts handed to Azkaban, and the Hive and Sqoop installations they rely on, are deployed only on hadoop102, we must use one of the two approaches below to make sure the tasks run there (approach 2 is recommended).
1) In the Azkaban database in MySQL, look up the id of the Executor running on hadoop102 in the executors table:
mysql> use azkaban;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> select * from executors;
+----+-----------+-------+--------+
| id | host | port | active |
+----+-----------+-------+--------+
| 1 | hadoop103 | 35985 | 1 |
| 2 | hadoop104 | 36363 | 1 |
| 3 | hadoop102 | 12321 | 1 |
+----+-----------+-------+--------+
3 rows in set (0.00 sec)
2) When executing the flow, add the useExecutor flow parameter set to hadoop102's executor id.
1) Distribute the scripts, Hive and Sqoop to the other nodes:
[atguigu@hadoop102 ~]$ xsync /home/atguigu/bin/
[atguigu@hadoop102 ~]$ xsync /opt/module/hive
[atguigu@hadoop102 ~]$ xsync /opt/module/sqoop