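Goal: understand the business process for visits and consultations
Approach
Implementation
Summary
Goal: understand the business requirements for visits and consultations
Implementation
Total visiting users per day
Metric: number of visiting users
Calculation: count(distinct userid)
Dimension: time (year, quarter, month, day, hour)
Fields: year [2021], quarter [1], month [03], day [15], hour [12], user id
SQL
select
year,
month,
day,
count(distinct userid)
from table
group by year, month, day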
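Visiting users per area per day
Metric: number of visiting users
Calculation: count(distinct userid)
Dimensions: day, area
Fields: year [2021], quarter [1], month [03], day [15], hour [12], area, user id
SQL
select
year,
month,
day,
area,
count(distinct userid)
from table
group by year, month, day, area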
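Hourly trend of the visitor-to-consultation rate
Metric: consultation rate
Calculation: consultation rate per hour = consulting users per hour / visiting users per hour
Dimension: hour
Fields: hour, consulting users per hour, visiting users per hour
SQL
select
hour,
consult_users / visit_users
from table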
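The ratio above can be sketched in Python to make the edge case explicit: an hour with no visitors has no defined rate. This is a minimal illustration, not part of the original pipeline; the function name `consult_rate` and the hourly counts are made-up assumptions.

```python
def consult_rate(consult_users, visit_users):
    """Return consulting users / visiting users, or None when nobody visited."""
    if visit_users == 0:
        return None  # avoid dividing by zero for an empty hour
    return consult_users / visit_users


# Hypothetical counts per hour: (consulting users, visiting users)
hourly = {"12": (30, 120), "13": (0, 0)}

rates = {hour: consult_rate(c, v) for hour, (c, v) in hourly.items()}
print(rates)
```

Returning `None` rather than 0 keeps "no traffic" distinguishable from "traffic but no consultations" when the trend is charted.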
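Share of visiting users per source channel per day
What is a source channel?
Purpose: analyze where users come from so that marketing spend can be targeted precisely
Metric: number of visiting users
Dimensions: day, source channel
Share of visiting users per search source per day
What is a search source?
Metric: number of visiting users
Dimensions: day, search source
Top 10 source pages per day by visiting users
What is a source page?
Metric: number of visiting users
Dimensions: day, source page
Summary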
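Goal: understand the content of the raw visit and consultation data
Approach
Implementation
Data storage
- The data comes from user visits: each time a user opens a page, one log record is written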
```
id  userid   sessionId   ip              create_time          url   referer_url
1   userid1  sessionid1  192.168.111.11  2020-11-11 12:30:30  url1  www.baidu.com
2   userid1  sessionid1  192.168.111.11  2020-11-11 12:30:31  url2  url1
3   userid1  sessionid2  192.168.111.11  2020-11-11 14:30:31  url3  www.sougou.com
4   userid2  sessionid3  192.168.111.12  2020-11-11 14:30:31  url3  www.baidu.com
```
- UV (distinct userid): 2
- Sessions (distinct sessionId): 3
- PV (one per log record): 4
- IP (distinct ip): 2
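The four numbers above can be reproduced from the sample log with a few set operations. The sketch below copies the four sample records into tuples; the function name `traffic_metrics` is an assumption made for illustration.

```python
# Rows mirror the sample log: (id, userid, sessionId, ip, create_time, url, referring url)
LOG_ROWS = [
    (1, "userid1", "sessionid1", "192.168.111.11", "2020-11-11 12:30:30", "url1", "www.baidu.com"),
    (2, "userid1", "sessionid1", "192.168.111.11", "2020-11-11 12:30:31", "url2", "url1"),
    (3, "userid1", "sessionid2", "192.168.111.11", "2020-11-11 14:30:31", "url3", "www.sougou.com"),
    (4, "userid2", "sessionid3", "192.168.111.12", "2020-11-11 14:30:31", "url3", "www.baidu.com"),
]


def traffic_metrics(rows):
    """Count distinct users, sessions and IPs, plus total page views."""
    return {
        "UV": len({row[1] for row in rows}),       # distinct userid
        "Session": len({row[2] for row in rows}),  # distinct sessionId
        "PV": len(rows),                           # every record is one page view
        "IP": len({row[3] for row in rows}),       # distinct ip
    }


metrics = traffic_metrics(LOG_ROWS)
print(metrics)
```

The set comprehensions play the role that `count(distinct ...)` plays in the SQL sketches.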
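Tables and core fields
The two tables below are logically one table. Because it has many columns and some of them hold large values, it is split into two tables for storage.
web_chat_ems_2019_07
id INT comment 'primary key',
create_date_time STRING comment 'record creation time',
session_id STRING comment 'sessionId',
sid STRING comment 'visitor id',
create_time STRING comment 'session creation time',
seo_source STRING comment 'search source',
seo_keywords STRING comment 'keywords',
ip STRING comment 'IP address',
area STRING comment 'area',
country STRING comment 'country',
province STRING comment 'province',
city STRING comment 'city',
origin_channel STRING comment 'marketing channel',
user_match STRING comment 'assigned agent',
manual_time STRING comment 'start time of human service',
begin_time STRING comment 'time the agent picked up the session',
end_time STRING comment 'session end time',
last_customer_msg_time_stamp STRING comment 'time of the customer''s last message',
last_agent_msg_time_stamp STRING comment 'time of the agent''s last reply',
reply_msg_count INT comment 'number of messages sent by the agent',
msg_count INT comment 'number of messages sent by the customer',
browser_name STRING comment 'browser name',
os_info STRING comment 'operating system name'
id: joins this table to the companion text table
session_id: metric, count of sessions
sid: metric, count of visitors
create_time: time dimension
seo_source: search source dimension
ip: metric, count of IPs
area: area dimension
origin_channel: source channel dimension
msg_count: distinguishes consultation rows from plain visits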
web_chat_text_ems_2019_07
id INT COMMENT 'primary key, from MySQL',
referrer STRING comment 'upstream referrer page',
from_url STRING comment 'page the session came from',
landing_page_url STRING comment 'visitor landing page',
url_title STRING comment 'title of the consultation page',
platform_description STRING comment 'customer platform information',
other_params STRING comment 'data in the extension field',
history STRING comment 'visit history'
Summary
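Goal: understand the analysis behind the data warehouse design
Approach
Implementation
Warehouse requirements
Result: fact table
time  area      seo_source  origin_channel  from_url  UV    Session  IP
2020  -1        -1          -1              -1        1000  2000     100
2020  Shanghai  -1          -1              -1        1000  2000     100
This is the fact table format we would normally expect
Question 1: if the metrics for every dimension combination go into one fact table, what do we do with dimension columns that have no value for a given combination?
Dimension combinations in the requirements
Time
time  UV  Session  IP
Time + area
time  area  UV  Session  IP
Time + source channel
time  origin_channel  UV  Session  IP
When a result is not grouped by a dimension, set that dimension's value to -1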
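Question 2: with the results for all dimension combinations in one table, how do we pick out the rows for one particular combination?
Dimension flags
time  area      seo_source  origin_channel  from_url  UV    Session  IP   flag1  flag2
2020  -1        -1          -1              -1        1000  2000     100  5      1
2020  Shanghai  -1          -1              -1        1000  2000     100  5      2
- flag1: marks the time granularity
  - 1: hour
  - 2: day
  - 3: month
  - 4: quarter
  - 5: year
- flag2: marks the dimension combination (base time dimension + one other dimension)
  - 1: time
  - 2: time + area
  - 3: time + source channel
  - 4: time + search source
  - 5: time + source page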
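To show how the two flags are meant to be used, here is a small Python sketch that selects rows from an in-memory copy of the fact table. The two rows come from the example above; the lookup tables `TIME_GRAIN` and `DIM_GROUP` and the function `select_rows` are illustrative names, not part of the original design.

```python
# flag1 and flag2 codes as listed above
TIME_GRAIN = {"hour": 1, "day": 2, "month": 3, "quarter": 4, "year": 5}
DIM_GROUP = {
    "time": 1,
    "time+area": 2,
    "time+origin_channel": 3,
    "time+seo_source": 4,
    "time+from_url": 5,
}

# The two example rows; "-1" means "not grouped by this dimension"
FACT_ROWS = [
    {"time": "2020", "area": "-1", "seo_source": "-1", "origin_channel": "-1",
     "from_url": "-1", "uv": 1000, "session": 2000, "ip": 100, "flag1": 5, "flag2": 1},
    {"time": "2020", "area": "Shanghai", "seo_source": "-1", "origin_channel": "-1",
     "from_url": "-1", "uv": 1000, "session": 2000, "ip": 100, "flag1": 5, "flag2": 2},
]


def select_rows(rows, grain, group):
    """Keep only the rows computed for one time grain and one dimension combination."""
    return [
        row for row in rows
        if row["flag1"] == TIME_GRAIN[grain] and row["flag2"] == DIM_GROUP[group]
    ]


yearly_by_area = select_rows(FACT_ROWS, "year", "time+area")
print(yearly_by_area)
```

In SQL the same selection is a `where flag1 = 5 and flag2 = 2` predicate, so a report never has to inspect which columns hold -1.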
ODS design
DWD design
select
a.create_time,
substr(a.create_time, 1, 4) as yearinfo,
ceil(substr(a.create_time, 6, 2) / 3) as quarterinfo,
substr(a.create_time, 6, 2) as monthinfo,
substr(a.create_time, 9, 2) as dayinfo,
substr(a.create_time, 12, 2) as hourinfo,
a.origin_channel,
a.seo_source,
a.area,
b.from_url,
a.sid,
a.session_id,
a.ip,
a.msg_count
from web_chat_ems a join web_chat_text_ems b on a.id = b.id
-- keep only yesterday's rows
where substr(a.create_time, 1, 10) = cast(date_sub(current_date, 1) as string);
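The time dimensions are all cut out of the `create_time` string by position. The Python sketch below mirrors that slicing so the offsets can be checked by hand; it assumes `create_time` looks like `yyyy-MM-dd HH:mm:ss`, and the function name `time_dimensions` is made up for this example.

```python
import math


def time_dimensions(create_time):
    """Derive year, quarter, month, day and hour from a 'yyyy-MM-dd HH:mm:ss' string."""
    # Python slices are 0-based and end-exclusive; Hive substr positions are 1-based.
    month = create_time[5:7]  # same characters as substr(create_time, 6, 2)
    return {
        "yearinfo": create_time[0:4],                   # substr(create_time, 1, 4)
        "quarterinfo": str(math.ceil(int(month) / 3)),  # months 1-3 -> 1, 4-6 -> 2, ...
        "monthinfo": month,
        "dayinfo": create_time[8:10],                   # substr(create_time, 9, 2)
        "hourinfo": create_time[11:13],                 # substr(create_time, 12, 2)
    }


dims = time_dimensions("2019-07-01 14:30:31")
print(dims)
```

Dividing the month by 3 and rounding up is what maps July (07) to quarter 3.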
DWS/APP design
Group and aggregate the DWD data by dimension to count the daily UV, sessions, and IPs
Daily UV, Session, and IP counts
select
yearinfo,
monthinfo,
dayinfo,
'-1' as hourinfo,
'-1' as origin_channel,
'-1' as seo_source,
'-1' as area,
'-1' as from_url,
count(distinct sid) as uv,
count(distinct session_id) as sessionId,
count(distinct ip) as ip,
'2' as flag1,
'1' as flag2
from dwd
group by yearinfo,monthinfo,dayinfo;
Daily UV, Session, and IP counts per area
select
yearinfo,
monthinfo,
dayinfo,
'-1' as hourinfo,
'-1' as origin_channel,
'-1' as seo_source,
area,
'-1' as from_url,
count(distinct sid) as uv,
count(distinct session_id) as sessionId,
count(distinct ip) as ip,
'2' as flag1,
'2' as flag2
from dwd
group by yearinfo,monthinfo,dayinfo,area;
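The two queries differ only in whether `area` is a grouping key or the placeholder '-1'. A compact Python sketch of the same idea follows; the detail rows and the function name `aggregate` are hypothetical and exist only to make the pattern concrete.

```python
from collections import defaultdict

# Hypothetical DWD rows: (yearinfo, monthinfo, dayinfo, area, sid, session_id, ip)
DWD_ROWS = [
    ("2019", "07", "01", "Beijing", "u1", "s1", "10.0.0.1"),
    ("2019", "07", "01", "Beijing", "u1", "s2", "10.0.0.1"),
    ("2019", "07", "01", "Shanghai", "u2", "s3", "10.0.0.2"),
]


def aggregate(rows, by_area):
    """Count distinct sid/session_id/ip per day, optionally also per area."""
    groups = defaultdict(lambda: (set(), set(), set()))
    for year, month, day, area, sid, session_id, ip in rows:
        # A dimension that is not grouped on is replaced by the placeholder "-1"
        key = (year, month, day, area if by_area else "-1")
        sids, sessions, ips = groups[key]
        sids.add(sid)
        sessions.add(session_id)
        ips.add(ip)
    result = []
    for key in sorted(groups):
        sids, sessions, ips = groups[key]
        result.append({
            "yearinfo": key[0], "monthinfo": key[1], "dayinfo": key[2], "area": key[3],
            "uv": len(sids), "session": len(sessions), "ip": len(ips),
            "flag1": "2",                        # 2 = day
            "flag2": "2" if by_area else "1",    # 1 = time, 2 = time + area
        })
    return result


daily = aggregate(DWD_ROWS, by_area=False)
daily_by_area = aggregate(DWD_ROWS, by_area=True)
print(daily)
print(daily_by_area)
```

Because both result sets share one column layout, they can be written into the same fact table and told apart by the flags.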
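Summary
Goal: configure Hive to support Chinese text in comments
Implementation
step1: change the comment columns of the metastore tables to UTF8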
alter table COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;
alter table TABLE_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;
alter table PARTITION_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8 ;
alter table PARTITION_KEYS modify column PKEY_COMMENT varchar(4000) character set utf8;
alter table INDEX_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;
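step2: update the connection configuration
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://node3:3306/hivemetadata?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>
</property>
step3: restart the metastore and hiveserver, then recreate the Hive tables; Chinese comments now display correctly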
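One detail of step2 is easy to get wrong: inside an XML file, the `&` that separates JDBC URL parameters has to be written as `&amp;`. The sketch below uses Python's standard XML parser, purely as an illustration, to show that the unescaped form is rejected while the escaped form parses back to the intended URL. The helper name `parses` is an assumption.

```python
import xml.etree.ElementTree as ET

# The same <value> element, first with a bare "&", then with it escaped
RAW = ("<value>jdbc:mysql://node3:3306/hivemetadata"
       "?createDatabaseIfNotExist=true&characterEncoding=UTF-8</value>")
ESCAPED = RAW.replace("&", "&amp;")


def parses(xml_text):
    """Return True if the text is well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(parses(RAW))       # the bare "&" is not well-formed XML
print(parses(ESCAPED))   # the escaped form is accepted

# After parsing, the application sees a plain "&" again
url = ET.fromstring(ESCAPED).text
print(url)
```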
Summary
Goal: ingest the visit data
Approach
Implementation
Ingestion requirements
Hive table creation
itcast_ods.web_chat_ems
-- make compression take effect on write
use itcast_ods;
set hive.exec.orc.compression.strategy=COMPRESSION;
CREATE EXTERNAL TABLE IF NOT EXISTS itcast_ods.web_chat_ems (
id INT comment '主键',
create_date_time STRING comment '数据创建时间',
session_id STRING comment 'sessionId',
sid STRING comment '访客id',
create_time STRING comment '会话创建时间',
seo_source STRING comment '搜索来源',
seo_keywords STRING comment '关键字',
ip STRING comment 'IP地址',
area STRING comment '地域',
country STRING comment '所在国家',
province STRING comment '省',
city STRING comment '城市',
origin_channel STRING comment '投放渠道',
user_match STRING comment '所属坐席',
manual_time STRING comment '人工开始时间',
begin_time STRING comment '坐席领取时间 ',
end_time STRING comment '会话结束时间',
last_customer_msg_time_stamp STRING comment '客户最后一条消息的时间',
last_agent_msg_time_stamp STRING comment '坐席最后一下回复的时间',
reply_msg_count INT comment '客服回复消息数',
msg_count INT comment '客户发送消息数',
browser_name STRING comment '浏览器名称',
os_info STRING comment '系统名称')
comment '访问会话信息表'
PARTITIONED BY(starts_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
location '/user/hive/warehouse/itcast_ods.db/web_chat_ems_ods'
TBLPROPERTIES ('orc.compress'='ZLIB');
itcast_ods.web_chat_text_ems
CREATE EXTERNAL TABLE IF NOT EXISTS itcast_ods.web_chat_text_ems (
id INT COMMENT '主键来自MySQL',
referrer STRING comment '上级来源页面',
from_url STRING comment '会话来源页面',
landing_page_url STRING comment '访客着陆页面',
url_title STRING comment '咨询页面title',
platform_description STRING comment '客户平台信息',
other_params STRING comment '扩展字段中数据',
history STRING comment '历史访问记录'
) comment 'EMS-PV测试表'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
location '/user/hive/warehouse/itcast_ods.db/web_chat_text_ems_ods'
TBLPROPERTIES ('orc.compress'='ZLIB');
Full import
web_chat_ems
sqoop import \
--connect jdbc:mysql://node3:3306/nev \
--username root \
--password 123456 \
--driver com.mysql.jdbc.Driver \
--query 'select id, create_date_time, session_id, sid, create_time, seo_source, seo_keywords, ip, area, country, province, city, origin_channel, user as user_match, manual_time, begin_time, end_time, last_customer_msg_time_stamp, last_agent_msg_time_stamp, reply_msg_count, msg_count, browser_name, os_info, "2019-07-01" as starts_time from web_chat_ems_2019_07 where $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_ems \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="ZLIB")' \
-m 2 \
--split-by id
- web_chat_text_ems
```shell
sqoop import \
--connect jdbc:mysql://node3:3306/nev \
--username root \
--password 123456 \
--driver com.mysql.jdbc.Driver \
--query 'select id,referrer,from_url,landing_page_url,url_title,platform_description,other_params,history, "2019-07-01" as start_time from web_chat_text_ems_2019_07 where $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_text_ems \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="ZLIB")' \
-m 2 \
--split-by id
```
Incremental import
append: incremental import using Sqoop's append mode
sqoop import \
--connect jdbc:mysql://node3:3306/nev \
--username root \
--password 123456 \
--driver com.mysql.jdbc.Driver \
--query 'select id, create_date_time, session_id, sid, create_time, seo_source, seo_keywords, ip, area, country, province, city, origin_channel, user as user_match, manual_time, begin_time, end_time, last_customer_msg_time_stamp, last_agent_msg_time_stamp, reply_msg_count, msg_count, browser_name, os_info, date_format(date_sub(curdate(), interval 1 day), "%Y-%m-%d") as starts_time from web_chat_ems_2019_07 where $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_ems \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="ZLIB")' \
--incremental append \
--check-column id \
--last-value 100 \
-m 2 \
--split-by id
Direct filtering
sqoop import \
--connect jdbc:mysql://node3:3306/nev \
--username root \
--password 123456 \
--driver com.mysql.jdbc.Driver \
--query 'select id, create_date_time, session_id, sid, create_time, seo_source, seo_keywords, ip, area, country, province, city, origin_channel, user as user_match, manual_time, begin_time, end_time, last_customer_msg_time_stamp, last_agent_msg_time_stamp, reply_msg_count, msg_count, browser_name, os_info, date_format(date_sub(curdate(), interval 1 day), "%Y-%m-%d") as starts_time from web_chat_ems_2019_07 where substr(create_time, 1, 10) = date_format(date_sub(curdate(), interval 1 day), "%Y-%m-%d") and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_ems \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="ZLIB")' \
-m 2 \
--split-by id
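Both incremental variants hinge on "yesterday's date" as a `yyyy-MM-dd` string. If the scheduler that launches Sqoop computes that value itself instead of leaving it to the database, the logic might look like the Python sketch below. The function names are hypothetical, and the generated text is only the filter fragment, not a complete Sqoop command.

```python
from datetime import date, timedelta


def yesterday_str(today):
    """Return the day before `today` formatted as yyyy-MM-dd."""
    return (today - timedelta(days=1)).strftime("%Y-%m-%d")


def incremental_where(today):
    """Build the filter for the direct-filtering import; $CONDITIONS stays for Sqoop."""
    return f'substr(create_time, 1, 10) = "{yesterday_str(today)}" and $CONDITIONS'


print(yesterday_str(date(2019, 7, 2)))
print(incremental_where(date(2019, 7, 2)))
```

Passing `today` in as an argument, rather than calling `date.today()` inside, makes it straightforward to re-run the import for a past day.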
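Summary
Goal: build the DWD layer for visit analysis
Approach
Implementation
Analysis
step1: join the two tables into one detail table
step2: filter out unneeded data: invalid rows and unused columns
step3: construct the dimensions
-- sketch only: the column list is abbreviated and does not match the table definition column for column
insert into table itcast_dwd.visit_consult_dwd partition (yearinfo,monthinfo,dayinfo)
select
a.create_time,
ceil(substr(a.create_time,6,2) / 3) as quarterinfo,
substr(a.create_time,12,2) as hourinfo,
a.origin_channel,
a.seo_source,
a.area,
b.from_url,
a.sid,
a.session_id,
a.ip,
a.msg_count,
substr(a.create_time,1,4) as yearinfo,
substr(a.create_time,6,2) as monthinfo,
substr(a.create_time,9,2) as dayinfo
from web_chat_ems a join web_chat_text_ems b on a.id = b.id
where substr(a.create_time,1,10) = cast(date_sub(current_date, 1) as string);
Table creation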
create table if not exists itcast_dwd.visit_consult_dwd(
session_id STRING comment '七陌sessionId',
sid STRING comment '访客id',
create_time bigint comment '会话创建时间',
seo_source STRING comment '搜索来源',
ip STRING comment 'IP地址',
area STRING comment '地域',
msg_count int comment '客户发送消息数',
origin_channel STRING COMMENT '来源渠道',
referrer STRING comment '上级来源页面',
from_url STRING comment '会话来源页面',
landing_page_url STRING comment '访客着陆页面',
url_title STRING comment '咨询页面title',
platform_description STRING comment '客户平台信息',
other_params STRING comment '扩展字段中数据',
history STRING comment '历史访问记录',
hourinfo string comment '小时',
quarterinfo string comment '季度'
)
comment '访问咨询DWD表'
partitioned by(yearinfo String, monthinfo String, dayinfo string)
row format delimited fields terminated by '\t'
stored as orc
location '/user/hive/warehouse/itcast_dwd.db/visit_consult_dwd'
tblproperties ('orc.compress'='SNAPPY');
Build
-- local mode
set hive.exec.mode.local.auto=true;
-- dynamic partition settings
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.created.files=150000;
-- Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- make compression take effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;
insert into table itcast_dwd.visit_consult_dwd partition (yearinfo, monthinfo, dayinfo)
select
wce.session_id,
wce.sid,
unix_timestamp(wce.create_time, 'yyyy-MM-dd HH:mm:ss.SSS') as create_time,
wce.seo_source,
wce.ip,
wce.area,
cast(if(wce.msg_count is null, 0, wce.msg_count) as int) as msg_count,
wce.origin_channel,
wcte.referrer,
wcte.from_url,
wcte.landing_page_url,
wcte.url_title,
wcte.platform_description,
wcte.other_params,
wcte.history,
substr(wce.create_time, 12, 2) as hourinfo,
ceil(substr(wce.create_time, 6, 2) / 3.0) as quarterinfo,
substr(wce.create_time, 1, 4) as yearinfo,
substr(wce.create_time, 6, 2) as monthinfo,
substr(wce.create_time, 9, 2) as dayinfo
from itcast_ods.web_chat_ems wce inner join itcast_ods.web_chat_text_ems wcte
on wce.id = wcte.id;
if syntax
if(condition, result when true, result when false)
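In the build statement, `if` replaces a NULL `msg_count` with 0 before the cast to int. For readers more at home in Python, the equivalent conditional expression is sketched below; the function name `normalize_msg_count` is made up for this example.

```python
def normalize_msg_count(msg_count):
    """Mirror cast(if(msg_count is null, 0, msg_count) as int)."""
    # Python's conditional expression puts the condition in the middle:
    # value_if_true if condition else value_if_false
    return 0 if msg_count is None else int(msg_count)


print(normalize_msg_count(None))
print(normalize_msg_count("3"))
```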
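Summary
Goal: build the DWS layer
Approach
Implementation
Analysis
Visiting users under different dimensions
DWD
time dimensions  origin_channel  seo_source  area  from_url  userid  sessionid  ip
DWS
time  area  seo_source  origin_channel  from_url  UV  Session  IP  flag1  flag2
Group and aggregate
Table creation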
CREATE TABLE IF NOT EXISTS itcast_dws.visit_dws (
sid_total INT COMMENT 'distinct count of sid',
sessionid_total INT COMMENT 'distinct count of sessionid',
ip_total INT COMMENT 'distinct count of IP',
area STRING COMMENT 'area',
seo_source STRING COMMENT 'search source',
origin_channel STRING COMMENT 'source channel',
hourinfo STRING COMMENT 'creation time, down to the hour',
quarterinfo STRING COMMENT 'quarter',
time_str STRING COMMENT 'time detail',
from_url STRING comment 'page the session came from',
groupType STRING COMMENT 'dimension group type: 1. area; 2. search source; 3. source channel; 4. session source page; 5. total visits',
time_type STRING COMMENT 'time aggregation type: 1. by hour; 2. by day; 3. by month; 4. by quarter; 5. by year')
comment 'EMS visitor log DWS table'
PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
location '/user/hive/warehouse/itcast_dws.db/visit_dws'
TBLPROPERTIES ('orc.compress'='SNAPPY');
Build
Total users, sessions, and IPs per hour
-- local mode
set hive.exec.mode.local.auto=true;
-- dynamic partition settings
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.created.files=150000;
-- Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- make compression take effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;
insert into table itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'1' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where concat(yearinfo,monthinfo,dayinfo)='20190701'
group by yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;
Total users, sessions, and IPs per day
insert into table itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'2' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where concat(yearinfo,monthinfo,dayinfo)='20190701'
group by yearinfo, quarterinfo, monthinfo, dayinfo;
Total users, sessions, and IPs per month
insert into table itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'3' as time_type,
yearinfo, monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where concat(yearinfo,monthinfo,dayinfo)='20190701'
group by yearinfo, quarterinfo, monthinfo;
Total users, sessions, and IPs per quarter
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-Q',quarterinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'4' as time_type,
yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where concat(yearinfo,monthinfo,dayinfo)='20190701'
group by yearinfo, quarterinfo;
Total users, sessions, and IPs per year
INSERT INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo as time_str,
'-1' as from_url,
'5' as groupType,
'5' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat(yearinfo,monthinfo,dayinfo)='20190701'
group by wce.yearinfo;
Total users, sessions, and IPs per area per hour
-- partitioning
set hive.exec.mode.local.auto=true;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
-- Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- make compression take effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;
insert into table itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as seo_source,
'-1' as origin_channel,
hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'-1' as from_url,
'1' as grouptype,
'1' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where concat(yearinfo,monthinfo,dayinfo)='20190701' and instr(area,"中国") > 0
group by area, yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;
Total users, sessions, and IPs per area per day
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'-1' as from_url,
'1' as grouptype,
'2' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where concat(yearinfo,monthinfo,dayinfo)='20190701' and instr(area,"中国") > 0
group by area, yearinfo, quarterinfo, monthinfo, dayinfo;
Total users, sessions, and IPs per area per month
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'-1' as from_url,
'1' as grouptype,
'3' as time_type,
yearinfo, monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where concat(yearinfo,monthinfo,dayinfo)='20190701' and instr(area,"中国") > 0
group by area, yearinfo, quarterinfo, monthinfo;
Total users, sessions, and IPs per area per quarter
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-Q',quarterinfo) as time_str,
'-1' as from_url,
'1' as grouptype,
'4' as time_type,
yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where concat(yearinfo,monthinfo,dayinfo)='20190701' and instr(area,"中国") > 0
group by area, yearinfo, quarterinfo;
Total users, sessions, and IPs per area per year
INSERT INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
wce.area as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo as time_str,
'-1' as from_url,
'1' as groupType,
'5' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat(yearinfo,monthinfo,dayinfo)='20190701' and instr(area,"中国") > 0
group by wce.area,wce.yearinfo;
Total users, sessions, and IPs per search source per hour
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
seo_source,
'-1' as origin_channel,
hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'-1' as from_url,
'2' as grouptype,
'1' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where concat(yearinfo,monthinfo,dayinfo)='20190701' and length(seo_source) > 0
group by seo_source, yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;
Total users, sessions, and IPs per search source per day
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'-1' as from_url,
'2' as grouptype,
'2' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where concat(yearinfo,monthinfo,dayinfo)='20190701' and length(seo_source) > 0
group by seo_source, yearinfo, quarterinfo, monthinfo, dayinfo;
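Total users, sessions, and IPs per search source per month
Total users, sessions, and IPs per search source per quarter
Total users, sessions, and IPs per search source per year
Total users, sessions, and IPs per source channel per hour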
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
origin_channel,
hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'-1' as from_url,
'3' as grouptype,
'1' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where concat(yearinfo,monthinfo,dayinfo)='20190701'
group by origin_channel, yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;
Total users, sessions, and IPs per source channel per day
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'-1' as from_url,
'3' as grouptype,
'2' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where concat(yearinfo,monthinfo,dayinfo)='20190701'
group by origin_channel, yearinfo, quarterinfo, monthinfo, dayinfo;
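Total users, sessions, and IPs per source channel per month
Total users, sessions, and IPs per source channel per quarter
Total users, sessions, and IPs per source channel per year
Total users, sessions, and IPs per source page per hour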
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
from_url,
'4' as grouptype,
'1' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where concat(yearinfo,monthinfo,dayinfo)='20190701' and length(from_url) > 0
group by from_url, yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;
Total users, sessions, and IPs per source page per day
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
from_url,
'4' as grouptype,
'2' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where concat(yearinfo,monthinfo,dayinfo)='20190701' and length(from_url) > 0
group by from_url, yearinfo, quarterinfo, monthinfo, dayinfo;
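Total users, sessions, and IPs per source page per month
Total users, sessions, and IPs per source page per quarter
Total users, sessions, and IPs per source page per year
Summary
Goal: build the APP layer
Approach
Implementation
Analysis
Table creation: MySQL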
create database if not exists scrm_bi;
use scrm_bi;
drop table if exists itcast_visit;
CREATE TABLE `itcast_visit` (
sid_total int(11) COMMENT '根据sid去重求count',
sessionid_total int(11) COMMENT '根据sessionid去重求count',
ip_total int(11) COMMENT '根据IP去重求count',
area varchar(32) COMMENT '区域信息',
seo_source varchar(32) COMMENT '搜索来源',
origin_channel varchar(32) COMMENT '来源渠道',
hourinfo varchar(32) COMMENT '小时信息',
quarterinfo varchar(32) COMMENT '季度',
time_str varchar(32) COMMENT '时间明细',
from_url varchar(2083) comment '会话来源页面',
groupType varchar(32) COMMENT '产品属性类型:1.地区;2.搜索来源;3.来源渠道;4.会话来源页面;5.不考虑',
time_type varchar(32) COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合;',
yearinfo varchar(32) COMMENT '年信息',
monthinfo varchar(32) COMMENT '月信息',
dayinfo varchar(32) COMMENT '日信息'
)ENGINE=InnoDB AUTO_INCREMENT=22 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
Build
sqoop export \
--connect "jdbc:mysql://node3:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password '123456' \
--driver com.mysql.jdbc.Driver \
--table itcast_visit \
--hcatalog-database itcast_dws \
--hcatalog-table visit_dws \
-m 1
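Summary
Goal: understand the consultation business requirements
Approach
Implementation
Requirement: compute consultation metrics across different dimensions
Data
The customer service system records information for every visit, and those records include the consultation data
How does this differ from visit analysis?
The visit data contains the consultation data
A user who consulted has necessarily visited
A user who visited has not necessarily consulted
Consultation analysis processes only the consultation rows
How do we tell whether a user consulted?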
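The field notes earlier mark `msg_count` as the column that separates consultation rows from plain visits, and the consultation DWS queries filter on `msg_count >= 1`. A minimal Python sketch of that rule follows; the session records and the function name `is_consult` are invented for illustration.

```python
# Hypothetical session records; msg_count is the number of messages the customer sent
SESSIONS = [
    {"sid": "u1", "msg_count": 3},
    {"sid": "u2", "msg_count": 0},
    {"sid": "u3", "msg_count": None},  # a missing count is treated as 0
    {"sid": "u1", "msg_count": 1},
]


def is_consult(row):
    """A session counts as a consultation when the customer sent at least one message."""
    return (row["msg_count"] or 0) >= 1


visit_uv = len({row["sid"] for row in SESSIONS})
consult_uv = len({row["sid"] for row in SESSIONS if is_consult(row)})
print(visit_uv, consult_uv)
```

Every consulting user is also a visiting user, so the consulting count can never exceed the visiting count.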
ODS and DWD build
Summary
Goal: build the consultation DWS layer
Approach
Implementation
Analysis
Table creation
set hive.exec.orc.compression.strategy=COMPRESSION;
CREATE TABLE IF NOT EXISTS itcast_dws.consult_dws
(
sid_total INT COMMENT '根据sid去重求count',
sessionid_total INT COMMENT '根据sessionid去重求count',
ip_total INT COMMENT '根据IP去重求count',
area STRING COMMENT '区域信息',
origin_channel STRING COMMENT '来源渠道',
hourinfo STRING COMMENT '创建时间,统计至小时',
quarterinfo STRING COMMENT '季度',
time_str STRING COMMENT '时间明细',
groupType STRING COMMENT '产品属性类型:1.地区;2.搜索来源;3.来源渠道;4.会话来源页面;5.总访问量',
time_type STRING COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合'
)
COMMENT '咨询量DWS宽表'
PARTITIONED BY (yearinfo string, monthinfo STRING, dayinfo string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS ORC
LOCATION '/user/hive/warehouse/itcast_dws.db/consult_dws'
TBLPROPERTIES ('orc.compress'='SNAPPY');
Build
Consulting-user metrics per hour
-- dynamic partition settings
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
-- Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- make compression take effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;
-- hour
insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as origin_channel,
hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'5',
'1',
yearinfo,monthinfo,dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1 and concat(yearinfo,monthinfo,dayinfo)='20190701'
group by yearinfo, quarterinfo, monthinfo, dayinfo,hourinfo;
Consultation metrics per day
insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat_ws('-',yearinfo,monthinfo,dayinfo) as time_str,
'5',
'2',
yearinfo,monthinfo,dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1 and concat(yearinfo,monthinfo,dayinfo)='20190701'
group by yearinfo, quarterinfo, monthinfo, dayinfo;
Consultation metrics per area per day
insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat_ws('-',yearinfo,monthinfo,dayinfo) as time_str,
'1',
'2',
yearinfo,monthinfo,dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1 and concat(yearinfo,monthinfo,dayinfo)='20190701' and instr(area,"中国") > 0
group by area,yearinfo, quarterinfo, monthinfo, dayinfo;
Consulting-user metrics per source channel per day
insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
origin_channel,
'-1' as hourinfo,
quarterinfo,
concat_ws('-',yearinfo,monthinfo,dayinfo) as time_str,
'3',
'2',
yearinfo,monthinfo,dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1 and concat(yearinfo,monthinfo,dayinfo)='20190701'
group by origin_channel,yearinfo, quarterinfo, monthinfo, dayinfo;
Summary
Goal: build the consultation APP layer
Approach
Implementation
Analysis
Table creation
use scrm_bi;
drop table if exists itcast_consult;
CREATE TABLE `itcast_consult` (
sid_total int(11) COMMENT '根据sid去重求count',
sessionid_total int(11) COMMENT '根据sessionid去重求count',
ip_total int(11) COMMENT '根据IP去重求count',
area varchar(32) COMMENT '区域信息',
origin_channel varchar(32) COMMENT '来源渠道',
hourinfo varchar(32) COMMENT '小时信息',
quarterinfo varchar(32) COMMENT '季度',
time_str varchar(32) COMMENT '时间明细',
groupType varchar(32) COMMENT '产品属性类型:1.地区;2.搜索来源;3.来源渠道;4.会话来源页面;5.不考虑',
time_type varchar(32) COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合;',
yearinfo varchar(32) COMMENT '年信息',
monthinfo varchar(32) COMMENT '月信息',
dayinfo varchar(32) COMMENT '日信息'
)ENGINE=InnoDB AUTO_INCREMENT=22 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
Build
sqoop export \
--connect "jdbc:mysql://node3:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 123456 \
--driver com.mysql.jdbc.Driver \
--table itcast_consult \
--hcatalog-database itcast_dws \
--hcatalog-table consult_dws \
-m 1
Summary