This article is a hands-on exercise in Spark operations on a big data set, running a few simple analyses. A follow-up stage will tune the workload on the three-VM setup to get the best possible performance.
Dataset: Taobao User Behavior Dataset, Alibaba Cloud Tianchi (aliyun.com)
Downloaded to: /opt/module/taobao/UserBehavior.csv
Category dimension data source: bigdata_analyse/category.sql at main · TurboWay/bigdata_analyse (github.com)
Downloaded to: /opt/module/taobao/category.sql
Field | Description | Notes |
---|---|---|
user_id | User ID | Integer; serialized user ID |
sku_id | Item ID | Integer; serialized item ID |
category_id | Item category ID | Integer; serialized ID of the category the item belongs to |
behavior | Behavior type | String enum: 'pv', 'buy', 'cart', 'fav' |
create_time | Behavior timestamp | String; time at which the behavior occurred |
dt | Partition key | String |
Machine configuration:
Three virtual machines were created, configured as follows:
 | hadoop1102 | hadoop1103 | hadoop1104 |
---|---|---|---|
CPU | | | |
Memory | 8GB | 4GB | 3GB |
Disk | 50GB | 50GB | 50GB |
HDFS | DataNode, NameNode | DataNode | DataNode, SecondaryNameNode |
YARN | NodeManager | NodeManager, ResourceManager | NodeManager |
In addition, ZooKeeper (zk), spark-hive, and other components were configured. Software versions are set as follows…
For the detailed setup, see the Atguigu (尚硅谷) big data course.
Run the category.sql file (for example, through the mysql command-line client) to load the item category dimension data into the MySQL database.
Create mysql_to_hdfs.sh under /opt/module/taobao and run it to import the MySQL data into HDFS.
#!/bin/bash
APP=taobao
sqoop=/opt/module/sqoop-1.4.6/bin/sqoop
import_data(){
$sqoop import \
--connect jdbc:mysql://hadoop1102:3306/$APP \
--username root \
--password XXXX \
--target-dir /origin_data/$APP/db/category \
--delete-target-dir \
--query "$1 and \$CONDITIONS" \
--num-mappers 1 \
--fields-terminated-by '\t' \
--null-string '\\N' \
--null-non-string '\\N'
}
import_data "select * from category where 1=1"
From the shell command line, create the taobao database:
hive
hive(default)>create database taobao;
hive(default)>use taobao;
hive(taobao)>create external table ods_category (
`sub_category_id` bigint comment 'sub-category id',
`parent_category_name` string comment 'category name'
) comment 'item category dimension table'
row format delimited fields terminated by '\t'
LOCATION '/warehouse/taobao/ods/ods_category/';
hive(taobao)>load data inpath '/origin_data/taobao/db/category' into table taobao.ods_category;
Go to /opt/module/spark-yarn/ and run the following command to start the Spark SQL thriftserver:
sbin/start-thriftserver.sh
SQL can then be written from the Database plugin in IDEA; the connection setup is sketched below.
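A minimal connection sketch, assuming the thriftserver keeps its default HiveServer2-compatible port 10000 on hadoop1102 (host, port, and credentials here are assumptions to adapt to your cluster):
-- In the IDEA Database plugin, add an Apache Hive data source with a JDBC URL such as
--   jdbc:hive2://hadoop1102:10000/taobao
-- Once connected, a quick check that the metastore is visible:
show databases;
show tables in taobao;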
drop table if exists taobao.ods_user_behavior;
create external table ods_user_behavior (
`user_id` bigint,
`sku_id` bigint,
`category_id` bigint,
`behavior` string,
`timestamp` bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/warehouse/taobao/ods/ods_user_behavior/';
load data local inpath '/opt/module/taobao/UserBehavior.csv' into table taobao.ods_user_behavior;
drop table if exists taobao.dwd_user_behavior;
create external table dwd_user_behavior (
`user_id` bigint,
`sku_id` bigint,
`category_id` bigint,
`category_name` string,
`behavior` string,
`create_time` string
)
partitioned by (`dt` string)
row format delimited fields terminated by '\t'
LOCATION '/warehouse/taobao/dwd/dwd_user_behavior/';
insert overwrite table dwd_user_behavior
select user_id,
sku_id,
category_id,
parent_category_name category_name,
behavior,
from_unixtime(`timestamp`, 'yyyy-MM-dd HH:mm:ss') as create_time,
from_unixtime(`timestamp`, 'yyyy-MM-dd') as dt
from ods_user_behavior oub
join ods_category oc on oub.category_id = oc.sub_category_id
where from_unixtime(`timestamp`, 'yyyy-MM-dd') >= '2017-11-25' and from_unixtime(`timestamp`, 'yyyy-MM-dd') <= '2017-12-03';
Here dt is used as the partition key, i.e. the data is partitioned by date, which has a very handy effect: any query that filters on dt only reads the matching date partitions (partition pruning) instead of scanning the whole table.
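As a quick illustration (a hypothetical check, not part of the original pipeline), partition pruning can be confirmed by listing the partitions or inspecting the query plan:
show partitions dwd_user_behavior;
explain select count(*) from dwd_user_behavior where dt = '2017-12-02';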
At this point, roughly 10 million purchase records have been stored in Hive, and the fun part, the analysis, can finally begin!
All of the analysis below runs on Spark.
drop table if exists ads_date_hour_behavior;
create table ads_date_hour_behavior(
`date` string,
`hour` string,
pv bigint,
cart bigint,
fav bigint,
buy bigint
)
row format delimited fields terminated by '\t'
LOCATION '/warehouse/taobao/ads/ads_date_hour_behavior/';
insert into ads_date_hour_behavior
select date_format(create_time, 'yyyy-MM-dd'), hour(create_time),
sum(`if`(behavior='pv', 1, 0)) pv,
sum(`if`(behavior='cart', 1, 0)) cart,
sum(`if`(behavior='fav', 1, 0)) fav,
sum(`if`(behavior='buy', 1, 0)) buy
from dwd_user_behavior
group by date_format(create_time, 'yyyy-MM-dd'), hour(create_time)
order by date_format(create_time, 'yyyy-MM-dd'), hour(create_time);
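For example, one day's hourly curve can be spot-checked like this (a hypothetical query; the chosen date is just one value within the dataset's range):
select * from ads_date_hour_behavior
where `date` = '2017-12-02'
order by cast(hour as int);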
We can see that the four behavior types follow similar distributions over the day, with very little activity in the early-morning hours.
drop table if exists ads_pv_uv_puv;
create table ads_pv_uv_puv(
dates string,
pv bigint,
uv bigint,
puv decimal(16,2)
)
row format delimited fields terminated by '\t'
LOCATION '/warehouse/taobao/ads/ads_pv_uv_puv/';
insert overwrite table ads_pv_uv_puv
select date_format(create_time, 'yyyy-MM-dd') dates,
count(behavior) pv,
count(distinct user_id) uv,
cast(count(behavior)/count(distinct user_id) as decimal(16, 2)) as puv
from dwd_user_behavior
where behavior = 'pv'
group by dates;
drop table if exists ads_category_sell_rank;
create table ads_category_sell_rank (
category_name string,
sku_id bigint,
sell_num bigint,
rank bigint
)
row format delimited fields terminated by '\t'
location '/warehouse/taobao/ads/ads_category_sell_rank';
-- Sales ranking of items within each category
insert overwrite table ads_category_sell_rank
select category_name, sku_id, sell_num, rank
from(
select distinct category_name, sku_id, cnt sell_num, dense_rank() over (partition by category_name order by cnt desc) rank
from
(
select sku_id, category_name, count(sku_id) over (partition by sku_id) cnt
from dwd_user_behavior
where behavior='buy'
) t1
) t2
where rank <= 10;
We can see that the item with sku_id 1115492 sold best in the clothing category, taking the sales crown.
Electronics were the hottest sellers over these days; I wonder what the item actually is. Could it be the Xiaomi Mi 6? (Mi fans rejoice)
-- Count users with only a single behavior record
select count(*)
from
(
select user_id from dwd_user_behavior
group by user_id
having count(behavior)=1
) a;
There are 105 such users, i.e. users with only a single behavior record.
drop table if exists dws_user_behavior;
create table dws_user_behavior (
user_id bigint,
all_count bigint,
pv_count bigint,
fav_count bigint,
cart_count bigint,
buy_count bigint
)
row format delimited fields terminated by '\t'
location '/warehouse/taobao/dws/dws_user_behavior';
-- Per-user behavior summary
insert overwrite table dws_user_behavior
select user_id, count(behavior) all_count,
sum(`if`(behavior='pv', 1, 0)) pv_count,
sum(`if`(behavior='fav', 1, 0)) fav_count,
sum(`if`(behavior='cart', 1, 0)) cart_count,
sum(`if`(behavior='buy', 1, 0)) buy_count
from dwd_user_behavior
group by user_id;
-- Repurchase rate: users who bought more than once, divided by users who bought at least once
select concat(round(buy_two/buy_one*100, 2), '%') as `复购率`
from
(select sum(case when buy_count > 1 then 1 else 0 end) buy_two,
sum(case when buy_count > 0 then 1 else 0 end) buy_one
from dws_user_behavior) t1;
The result is 65.9%.
Since the dataset covers 2017-11-25 through 2017-12-03, 2017-12-04 is taken as the reference date. For each customer we compute how many days lie between their most recent purchase and 2017-12-04, and rank customers by that gap: the fewer the days, the more valuable the customer and the higher the rank.
Then we compute each user's purchase frequency and rank by it: the higher the frequency, the more valuable the customer and the higher the rank.
select user_id, recent, dense_rank() over (order by recent asc) as recent_rank, frequency, dense_rank() over (order by frequency desc) as frequency_rank
from (
select user_id, datediff('2017-12-04', max(create_time)) as recent, count(behavior) frequency
from dwd_user_behavior
where behavior='buy'
group by user_id
) t1;
Users with purchase behavior are then split into four groups by rank: the top quarter scores 4 points, the second quarter 3 points, the third quarter 2 points, and the rest 1 point. This rule is applied separately to the recency rank and the frequency rank, and the two scores are added together as the user's final score (for example, a user in the top quarter on recency and the second quarter on frequency scores 4 + 3 = 7).
-- Count the number of users who made at least one purchase
select count(distinct user_id) as user_buy_num
from dwd_user_behavior
where behavior = 'buy';
The result is 671043, so the quartile cutoffs used below are roughly 167761, 335522, and 503282.
drop table if exists ads_rfm_score;
create table ads_rfm_score(
user_id bigint,
score int
)
row format delimited fields terminated by '\t'
location '/warehouse/taobao/ads/ads_rfm_score';
insert overwrite table ads_rfm_score
select user_id,
case when recent_rank < 671043 * 1 / 4 then 4
when recent_rank < 671043 * 2 / 4 then 3
when recent_rank < 671043 * 3 / 4 then 2
else 1 end +
case when frequency_rank < 671043 * 1 / 4 then 4
when frequency_rank < 671043 * 2 / 4 then 3
when frequency_rank < 671043 * 3 / 4 then 2
else 1 end
as score
from
(
select user_id, recent, dense_rank() over (order by recent asc) as recent_rank, frequency, dense_rank() over (order by frequency desc) as frequency_rank
from (
select user_id, datediff('2017-12-04', max(create_time)) as recent, count(behavior) frequency
from dwd_user_behavior
where behavior='buy'
group by user_id
) t1
) rfm;
drop table if exists ads_score_summary ;
create table ads_score_summary(
score bigint,
cnt bigint,
buy_user_num bigint,
rate string
)
row format delimited fields terminated by '\t'
location '/warehouse/taobao/ads/ads_score_summary';
with all_score as (
select 0 score, 0 cnt, 0 buy_user_num, 0 rate
union all
select 1 score, 0 cnt, 0 buy_user_num, 0 rate
union all
select 2 score, 0 cnt, 0 buy_user_num, 0 rate
union all
select 3 score, 0 cnt, 0 buy_user_num, 0 rate
union all
select 4 score, 0 cnt, 0 buy_user_num, 0 rate
union all
select 5 score, 0 cnt, 0 buy_user_num, 0 rate
union all
select 6 score, 0 cnt, 0 buy_user_num, 0 rate
union all
select 7 score, 0 cnt, 0 buy_user_num, 0 rate
union all
select 8 score, 0 cnt, 0 buy_user_num, 0 rate
),
now_score as (
select score, count(1) cnt, 671043 `buy_user_num`, concat(round(count(1) / 671043 * 100, 2), '%') rate
from ads_rfm_score
group by score
order by score asc
)
insert overwrite table ads_score_summary
select nvl(now_score.score, all_score.score),
nvl(now_score.cnt, 0),
nvl(now_score.buy_user_num, 0),
nvl(now_score.rate, '0%')
from now_score
right join all_score on now_score.score = all_score.score;
A user may buy the same item more than once. Here we compute, per item, a repurchase rate across users: the share of an item's buyers who bought it more than once. A high value points to items that are bought again and again, such as consumables like tissues or food.
drop table if exists ads_sku_rebuy;
create table ads_sku_rebuy(
sku_id bigint,
`rebuy_rate` string
)
row format delimited fields terminated by '\t'
location '/warehouse/taobao/ads/ads_sku_rebuy';
with t1 as (
select sku_id, user_id, count(*) as buy_count
from dwd_user_behavior
where behavior='buy'
group by sku_id, user_id
),
t2 as (
select sku_id, user_id, sum(`if`(buy_count > 0, 1, 0)) buy_one,
sum(`if`(buy_count > 1, 1, 0)) buy_two
from t1
group by sku_id, user_id
)
insert overwrite table ads_sku_rebuy
select sku_id, concat(round(sum(buy_two)/sum(buy_one)*100, 2), '%') rebuy_rate
from t2
group by sku_id;
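To eyeball which items have the highest repurchase rates, the percentage string can be cast back to a number for sorting (a hypothetical follow-up query; replace() is available in Spark SQL):
select sku_id, rebuy_rate
from ads_sku_rebuy
order by cast(replace(rebuy_rate, '%', '') as decimal(16, 2)) desc
limit 10;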
That wraps up the visual analysis of Taobao user shopping behavior! Next, I will move on to Spark tuning and try to squeeze the most performance out of the virtual machine cluster. Stay tuned!