E-commerce Data Warehouse
Chapter 1 Data Warehouse Layering Concepts
1.1 Why Layer the Warehouse
1.2 Data Warehouse Layers
1.3 Data Marts vs. Data Warehouses
1.4 Warehouse Naming Conventions (illustrated with a short sketch below)
- The ODS layer is named ods
- The DWD layer is named dwd
- The DWS layer is named dws
- The ADS layer is named ads
- Databases for temporary tables are named xxx_tmp
- Databases for backup data are named xxx_bak
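As a quick sketch of how these conventions combine with the gmall business database used later in this document (the gmall_tmp and gmall_bak database names below are illustrative and are not created anywhere in this tutorial):
hive (default)>
create database if not exists gmall;        -- main warehouse database (created in Chapter 3)
create database if not exists gmall_tmp;    -- temporary tables would live in an xxx_tmp database
create database if not exists gmall_bak;    -- backup data would live in an xxx_bak database
-- Tables carry the layer prefix, e.g. gmall.ods_start_log, gmall.dwd_start_log, gmall.dws_uv_detail_day.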
Chapter 2 Environment Preparation for Building the Warehouse
Cluster planning

Service | hadoop102 | hadoop103 | hadoop104
Hive    | Hive      |           |
MySQL   | MySQL     |           |
2.1 Installing Hive & MySQL
2.1.1 Installing Hive & MySQL
See: 尚硅谷大数据技术之Hive (the Hive installation document).
2.1.2 Modify hive-site.xml
1) Disable metastore schema verification
[atguigu@hadoop102 conf]$ pwd
/opt/module/hive/conf
[atguigu@hadoop102 conf]$ vim hive-site.xml
Add the following configuration:
<property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
</property>
2.2 Hive Execution Engine: Tez
Tez is an execution engine for Hive that performs better than MR. Why? See the figure below.
When Hive generates MR jobs directly, suppose there are four MR jobs that depend on one another. In the figure, the green boxes are Reduce Tasks, and the cloud shapes mark the points where intermediate results have to be persisted to HDFS.
Tez can merge several dependent jobs into a single job, so HDFS is written only once and there are fewer intermediate stages, which greatly improves the job's performance.
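A simple way to see the difference, once Tez has been configured as described in 2.2.2 and 2.2.3 below, is to run the same query under both engines in one session. This is only a sketch; any existing table works, and ods_start_log used here is not created until Chapter 3:
hive (gmall)>
set hive.execution.engine=mr;    -- MapReduce: every intermediate stage is written to HDFS
select count(*) from ods_start_log;
set hive.execution.engine=tez;   -- Tez: the dependent stages run as one DAG with fewer HDFS writes
select count(*) from ods_start_log;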
2.2.1 Preparing the Package
1) Download the Tez package: http://tez.apache.org
2) Copy apache-tez-0.9.1-bin.tar.gz to the /opt/module directory on hadoop102
[atguigu@hadoop102 module]$ ls
apache-tez-0.9.1-bin.tar.gz
3) Extract apache-tez-0.9.1-bin.tar.gz
[atguigu@hadoop102 module]$ tar -zxvf apache-tez-0.9.1-bin.tar.gz
4) Rename the extracted directory
[atguigu@hadoop102 module]$ mv apache-tez-0.9.1-bin/ tez-0.9.1
2.2.2 Configure Tez in Hive
1) Go to Hive's configuration directory: /opt/module/hive/conf
[atguigu@hadoop102 conf]$ pwd
/opt/module/hive/conf
2) In hive-env.sh, add the Tez environment variable and the dependency-jar settings
[atguigu@hadoop102 conf]$ vim hive-env.sh
Add the following configuration:
export HADOOP_HOME=/opt/module/hadoop-2.7.2
export HIVE_CONF_DIR=/opt/module/hive/conf
export TEZ_HOME=/opt/module/tez-0.9.1    # the directory where Tez was extracted
export TEZ_JARS=""
for jar in `ls $TEZ_HOME | grep jar`; do
    export TEZ_JARS=$TEZ_JARS:$TEZ_HOME/$jar
done
for jar in `ls $TEZ_HOME/lib`; do
    export TEZ_JARS=$TEZ_JARS:$TEZ_HOME/lib/$jar
done
export HIVE_AUX_JARS_PATH=/opt/module/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20.jar$TEZ_JARS
3) Add the following to hive-site.xml to change the Hive execution engine
<property>
    <name>hive.execution.engine</name>
    <value>tez</value>
</property>
2.2.3 Configure Tez
1) Create a tez-site.xml file under Hive's /opt/module/hive/conf directory
[atguigu@hadoop102 conf]$ pwd
/opt/module/hive/conf
[atguigu@hadoop102 conf]$ vim tez-site.xml
Add the following content:
Chapter 3 Building the Warehouse: ODS Layer
3.1 Create the Database
1) Create the gmall database
hive (default)> create database gmall;
Note: if the database already exists and contains data, force-drop it with: drop database gmall cascade;
2) Use the gmall database
hive (default)> use gmall;
3.2 ODS Layer
The raw data layer stores raw data: raw logs and data are loaded directly and kept as-is, without any processing.
3.2.1 Create the Start-up Log Table ods_start_log
1) Create a partitioned table whose input format is LZO, whose output is text, and that supports JSON parsing
hive (gmall)>
drop table if exists ods_start_log;
CREATE EXTERNAL TABLE ods_start_log (line string)
PARTITIONED BY (dt string)
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_start_log';
For Hive's LZO compression, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO
2) Load the data
hive (gmall)>
load data inpath '/origin_data/gmall/log/topic_start/2019-02-10' into table gmall.ods_start_log partition(dt='2019-02-10');
Note: dates are always formatted as yyyy-MM-dd, the format Hive supports by default.
3) Check that the load succeeded
hive (gmall)> select * from ods_start_log limit 2;
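Besides sampling a couple of rows, it can help to confirm that the partition was actually registered. The two statements below are an optional check, not part of the original steps:
hive (gmall)> show partitions ods_start_log;
hive (gmall)> describe formatted ods_start_log;   -- also shows the external LOCATION and input format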
3.2.2 Create the Event Log Table ods_event_log
1) Create a partitioned table whose input format is LZO, whose output is text, and that supports JSON parsing
hive (gmall)>
drop table if exists ods_event_log;
CREATE EXTERNAL TABLE ods_event_log (line string)
PARTITIONED BY (dt string)
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_event_log';
2) Load the data
hive (gmall)>
load data inpath '/origin_data/gmall/log/topic_event/2019-02-10' into table gmall.ods_event_log partition(dt='2019-02-10');
Note: dates are always formatted as yyyy-MM-dd, the format Hive supports by default.
3) Check that the load succeeded
hive (gmall)> select * from ods_event_log limit 2;
3.2.3 Single Quotes vs. Double Quotes in the Shell
1) Create a test.sh file in /home/atguigu/bin
[atguigu@hadoop102 bin]$ vim test.sh
Add the following to the file:
#!/bin/bash
do_date=$1
echo '$do_date'
echo "$do_date"
echo "'$do_date'"
echo '"$do_date"'
echo `date`
2) Check the output
[atguigu@hadoop102 bin]$ test.sh 2019-02-10
$do_date
2019-02-10
'2019-02-10'
"$do_date"
2019年 05月 02日 星期四 21:02:08 CST
3) Summary:
(1) Single quotes do not expand variables
(2) Double quotes expand variables
(3) Backticks ` execute the command they enclose
(4) Single quotes nested inside double quotes: the variable is expanded
(5) Double quotes nested inside single quotes: the variable is not expanded
3.2.4 ODS Layer Data Loading Script
1) Create the script in the /home/atguigu/bin directory on hadoop102
[atguigu@hadoop102 bin]$ vim ods_log.sh
Write the following into the script:
#!/bin/bash
APP=gmall
hive=/opt/module/hive/bin/hive
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi
echo "=== log date: $do_date ==="
sql="
load data inpath '/origin_data/gmall/log/topic_start/$do_date' into table "$APP".ods_start_log partition(dt='$do_date');
load data inpath '/origin_data/gmall/log/topic_event/$do_date' into table "$APP".ods_event_log partition(dt='$do_date');
"
$hive -e "$sql"
Note 1:
[ -n str ] tests whether the string is non-empty:
- non-empty: returns true
- empty: returns false
Note 2:
To see how the date command works: [atguigu@hadoop102 ~]$ date --help
2) Make the script executable
[atguigu@hadoop102 bin]$ chmod 777 ods_log.sh
3) Run the script
[atguigu@hadoop102 module]$ ods_log.sh 2019-02-11
4) Check the imported data
hive (gmall)>
select * from ods_start_log where dt='2019-02-11' limit 2;
select * from ods_event_log where dt='2019-02-11' limit 2;
5) When the script runs
In production it is generally run between 00:30 and 01:00 each morning.
Chapter 4 Building the Warehouse: DWD Layer
Clean the ODS-layer data: remove nulls, dirty records, and values outside reasonable ranges; switch from row-oriented to columnar storage; and change the compression format.
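The actual cleaning in this chapter happens while parsing the JSON in the statements below; the query here is only a rough, illustrative sketch of the idea, and its filter conditions are examples rather than the exact rules used later:
hive (gmall)>
select line
from ods_start_log
where dt='2019-02-10'
  and line is not null and line <> ''                   -- drop empty payloads
  and get_json_object(line,'$.mid') is not null;        -- drop records without a device id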
4.1 Parsing the DWD Start-up Table Data
4.1.1 Create the Start-up Table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_start_log;
CREATE EXTERNAL TABLE dwd_start_log(
    mid_id string,
    user_id string,
    version_code string,
    version_name string,
    lang string,
    source string,
    os string,
    area string,
    model string,
    brand string,
    sdk_version string,
    gmail string,
    height_width string,
    app_time string,
    network string,
    lng string,
    lat string,
    entry string,
    open_ad_type string,
    action string,
    loading_time string,
    detail string,
    extend1 string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_start_log/';
4.1.2 Load Data into the Start-up Table
hive (gmall)>
insert overwrite table dwd_start_log
PARTITION (dt='2019-02-10')
select
get_json_object(line,'$.mid') mid_id,
get_json_object(line,'$.uid') user_id,
get_json_object(line,'$.vc') version_code,
get_json_object(line,'$.vn') version_name,
get_json_object(line,'$.l') lang,
get_json_object(line,'$.sr') source,
get_json_object(line,'$.os') os,
get_json_object(line,'$.ar') area,
get_json_object(line,'$.md') model,
get_json_object(line,'$.ba') brand,
get_json_object(line,'$.sv') sdk_version,
get_json_object(line,'$.g') gmail,
get_json_object(line,'$.hw') height_width,
get_json_object(line,'$.t') app_time,
get_json_object(line,'$.nw') network,
get_json_object(line,'$.ln') lng,
get_json_object(line,'$.la') lat,
get_json_object(line,'$.entry') entry,
get_json_object(line,'$.open_ad_type') open_ad_type,
get_json_object(line,'$.action') action,
get_json_object(line,'$.loading_time') loading_time,
get_json_object(line,'$.detail') detail,
get_json_object(line,'$.extend1') extend1
from ods_start_log
where dt='2019-02-10';
3) Test
hive (gmall)> select * from dwd_start_log limit 2;
4.1.3 DWD Start-up Table Data Loading Script
1) Create the script in the /home/atguigu/bin directory on hadoop102
[atguigu@hadoop102 bin]$ vim dwd_start_log.sh
Write the following into the script:
#!/bin/bash
APP=gmall
hive=/opt/module/hive/bin/hive
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table "$APP".dwd_start_log
PARTITION (dt='$do_date')
select
get_json_object(line,'$.mid') mid_id,
get_json_object(line,'$.uid') user_id,
get_json_object(line,'$.vc') version_code,
get_json_object(line,'$.vn') version_name,
get_json_object(line,'$.l') lang,
get_json_object(line,'$.sr') source,
get_json_object(line,'$.os') os,
get_json_object(line,'$.ar') area,
get_json_object(line,'$.md') model,
get_json_object(line,'$.ba') brand,
get_json_object(line,'$.sv') sdk_version,
get_json_object(line,'$.g') gmail,
get_json_object(line,'$.hw') height_width,
get_json_object(line,'$.t') app_time,
get_json_object(line,'$.nw') network,
get_json_object(line,'$.ln') lng,
get_json_object(line,'$.la') lat,
get_json_object(line,'$.entry') entry,
get_json_object(line,'$.open_ad_type') open_ad_type,
get_json_object(line,'$.action') action,
get_json_object(line,'$.loading_time') loading_time,
get_json_object(line,'$.detail') detail,
get_json_object(line,'$.extend1') extend1
from "$APP".ods_start_log
where dt='$do_date';
"
$hive -e "$sql"
2) Make the script executable
[atguigu@hadoop102 bin]$ chmod 777 dwd_start_log.sh
3) Run the script
[atguigu@hadoop102 module]$ dwd_start_log.sh 2019-02-11
4) Check the imported results
hive (gmall)>
select * from dwd_start_log where dt='2019-02-11' limit 2;
5) When the script runs
In production it is generally run between 00:30 and 01:00 each morning.
4.2 Parsing the DWD Event Table Data
4.2.1 Create the Base Detail Table
The base detail table stores the detail data converted from the ODS-layer raw tables.
1) Create the event-log base detail table
hive (gmall)>
drop table if exists dwd_base_event_log;
CREATE EXTERNAL TABLE dwd_base_event_log(
    mid_id string,
    user_id string,
    version_code string,
    version_name string,
    lang string,
    source string,
    os string,
    area string,
    model string,
    brand string,
    sdk_version string,
    gmail string,
    height_width string,
    app_time string,
    network string,
    lng string,
    lat string,
    event_name string,
    event_json string,
    server_time string)
PARTITIONED BY (dt string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_base_event_log/';
2) Note: event_name and event_json hold the event name and the whole event, respectively. This is where the one-to-many structure of a raw log line is split apart; to do that, the raw log has to be flattened, which requires a custom UDF and UDTF.
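Conceptually the flattening looks like the sketch below: the UDF pulls out the common fields plus the raw 'et' event array, and the UDTF turns that array into one row per event through LATERAL VIEW. This is only an outline of the full statement in 4.2.4, and it assumes the base_analizer and flat_analizer functions registered in 4.2.3:
hive (gmall)>
select mid_id, event_name, event_json
from (
    select
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[0]  as mid_id,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[17] as ops   -- the raw 'et' event array
    from ods_event_log
    where dt='2019-02-10'
) sdk_log lateral view flat_analizer(ops) tmp_k as event_name, event_json;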
4.2.2 Custom UDF (Parses the Common Fields)
1) Create a Maven project: hivefunction
2) Create the package: com.atguigu.udf
3) Add the required dependencies to pom.xml
4) Create the BaseFieldUDF class in the com.atguigu.udf package:
package com.atguigu.udf;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONException;
import org.json.JSONObject;
public class BaseFieldUDF extends UDF {
public String evaluate(String line, String jsonkeysString) {
// 0 Prepare a StringBuilder for the result
StringBuilder sb = new StringBuilder();
// 1 Split jsonkeys: mid uid vc vn l sr os ar md ...
String[] jsonkeys = jsonkeysString.split(",");
// 2 Split the line into server time | json
String[] logContents = line.split("\\|");
// 3 Validity check
if (logContents.length != 2 || StringUtils.isBlank(logContents[1])) {
return "";
}
// 4 Parse the json
try {
JSONObject jsonObject = new JSONObject(logContents[1]);
// Get the 'cm' object
JSONObject base = jsonObject.getJSONObject("cm");
// Loop over the keys and pull out each value
for (int i = 0; i < jsonkeys.length; i++) {
String filedName = jsonkeys[i].trim();
if (base.has(filedName)) {
sb.append(base.getString(filedName)).append("\t");
} else {
sb.append("\t");
}
}
sb.append(jsonObject.getString("et")).append("\t");
sb.append(logContents[0]).append("\t");
} catch (JSONException e) {
e.printStackTrace();
}
return sb.toString();
}
public static void main(String[] args) {
String line = "1541217850324|{\"cm\":{\"mid\":\"m7856\",\"uid\":\"u8739\",\"ln\":\"-74.8\",\"sv\":\"V2.2.2\",\"os\":\"8.1.3\",\"g\":\"[email protected]\",\"nw\":\"3G\",\"l\":\"es\",\"vc\":\"6\",\"hw\":\"640*960\",\"ar\":\"MX\",\"t\":\"1541204134250\",\"la\":\"-31.7\",\"md\":\"huawei-17\",\"vn\":\"1.1.2\",\"sr\":\"O\",\"ba\":\"Huawei\"},\"ap\":\"weather\",\"et\":[{\"ett\":\"1541146624055\",\"en\":\"display\",\"kv\":{\"goodsid\":\"n4195\",\"copyright\":\"ESPN\",\"content_provider\":\"CNN\",\"extend2\":\"5\",\"action\":\"2\",\"extend1\":\"2\",\"place\":\"3\",\"showtype\":\"2\",\"category\":\"72\",\"newstype\":\"5\"}},{\"ett\":\"1541213331817\",\"en\":\"loading\",\"kv\":{\"extend2\":\"\",\"loading_time\":\"15\",\"action\":\"3\",\"extend1\":\"\",\"type1\":\"\",\"type\":\"3\",\"loading_way\":\"1\"}},{\"ett\":\"1541126195645\",\"en\":\"ad\",\"kv\":{\"entry\":\"3\",\"show_style\":\"0\",\"action\":\"2\",\"detail\":\"325\",\"source\":\"4\",\"behavior\":\"2\",\"content\":\"1\",\"newstype\":\"5\"}},{\"ett\":\"1541202678812\",\"en\":\"notification\",\"kv\":{\"ap_time\":\"1541184614380\",\"action\":\"3\",\"type\":\"4\",\"content\":\"\"}},{\"ett\":\"1541194686688\",\"en\":\"active_background\",\"kv\":{\"active_source\":\"3\"}}]}";
String x = new BaseFieldUDF().evaluate(line, "mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,nw,ln,la,t");
System.out.println(x);
}
}
Note: the main method is only used for testing with mock data.
4.2.3 Custom UDTF (Parses the Event-specific Fields)
1) Create the package: com.atguigu.udtf
2) Create a class named EventJsonUDTF in the com.atguigu.udtf package
3) It is used to flatten the business (event) fields
package com.atguigu.udtf;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.json.JSONArray;
import org.json.JSONException;
import java.util.ArrayList;
public class EventJsonUDTF extends GenericUDTF {
// In this method we declare the names and types of the output columns:
@Override
public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
ArrayList<String> fieldNames = new ArrayList<String>();
ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
fieldNames.add("event_name");
fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
fieldNames.add("event_json");
fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
}
// One input record may produce several output rows
@Override
public void process(Object[] objects) throws HiveException {
// Get the incoming 'et' array
String input = objects[0].toString();
// If the input is blank, return immediately and drop the record
if (StringUtils.isBlank(input)) {
return;
} else {
try {
// Find out how many events there are (ad/favorites/...)
JSONArray ja = new JSONArray(input);
if (ja == null)
return;
// Loop over every event
for (int i = 0; i < ja.length(); i++) {
String[] result = new String[2];
try {
// Take each event's name (ad/favorites/...)
result[0] = ja.getJSONObject(i).getString("en");
// Take the whole event object
result[1] = ja.getString(i);
} catch (JSONException e) {
continue;
}
// Emit the result row
forward(result);
}
} catch (JSONException e) {
e.printStackTrace();
}
}
}
// Called when there are no more records to process; used for cleanup or to produce extra output
@Override
public void close() throws HiveException {
}
}
2) Build the jar
3) Upload hivefunction-1.0-SNAPSHOT.jar to /opt/module/hive/ on hadoop102
4) Add the jar to Hive's classpath
hive (gmall)> add jar /opt/module/hive/hivefunction-1.0-SNAPSHOT.jar;
5) Create temporary functions linked to the Java classes just written
hive (gmall)>
create temporary function base_analizer as 'com.atguigu.udf.BaseFieldUDF';
create temporary function flat_analizer as 'com.atguigu.udtf.EventJsonUDTF';
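Before running the full parse in 4.2.4, it can be worth checking that the UDF is wired up correctly. This is an optional sanity check, not part of the original steps:
hive (gmall)>
select base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la')
from ods_event_log
where dt='2019-02-10'
limit 1;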
4.2.4 Parse into the Event-log Base Detail Table
1) Parse the event log into the base detail table
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_base_event_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
event_name,
event_json,
server_time
from
(
select
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[0] as mid_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[1] as user_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[2] as version_code,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[3] as version_name,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[4] as lang,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[5] as source,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[6] as os,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[7] as area,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[8] as model,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[9] as brand,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[10] as sdk_version,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[11] as gmail,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[12] as height_width,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[13] as app_time,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[14] as network,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[15] as lng,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[16] as lat,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[17] as ops,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[18] as server_time
from ods_event_log where dt='2019-02-10' and base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la')<>''
) sdk_log lateral view flat_analizer(ops) tmp_k as event_name, event_json;
2) Test
hive (gmall)> select * from dwd_base_event_log limit 2;
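Since every raw line expands into one row per event, a per-event count is another useful check (optional, not part of the original steps):
hive (gmall)>
select event_name, count(*) as cnt
from dwd_base_event_log
where dt='2019-02-10'
group by event_name;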
4.2.5 DWD Layer Data Parsing Script
1) Create the script in the /home/atguigu/bin directory on hadoop102
[atguigu@hadoop102 bin]$ vim dwd_base_log.sh
Write the following into the script:
#!/bin/bash
APP=gmall
hive=/opt/module/hive/bin/hive
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi
sql="
add jar /opt/module/hive/hivefunction-1.0-SNAPSHOT.jar;
create temporary function base_analizer as 'com.atguigu.udf.BaseFieldUDF';
create temporary function flat_analizer as 'com.atguigu.udtf.EventJsonUDTF';
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table "$APP".dwd_base_event_log PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source ,
os ,
area ,
model ,
brand ,
sdk_version ,
gmail ,
height_width ,
network ,
lng ,
lat ,
app_time ,
event_name ,
event_json ,
server_time
from
(
select
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[0] as mid_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[1] as user_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[2] as version_code,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[3] as version_name,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[4] as lang,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[5] as source,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[6] as os,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[7] as area,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[8] as model,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[9] as brand,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[10] as sdk_version,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[11] as gmail,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[12] as height_width,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[13] as app_time,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[14] as network,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[15] as lng,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[16] as lat,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[17] as ops,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[18] as server_time
from "$APP".ods_event_log where dt='$do_date' and base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la')<>''
) sdk_log lateral view flat_analizer(ops) tmp_k as event_name, event_json;
"
$hive -e "$sql"
2) Make the script executable
[atguigu@hadoop102 bin]$ chmod 777 dwd_base_log.sh
3) Run the script
[atguigu@hadoop102 module]$ dwd_base_log.sh 2019-02-11
4) Check the imported results
hive (gmall)>
select * from dwd_base_event_log where dt='2019-02-11' limit 2;
5) When the script runs
In production it is generally run between 00:30 and 01:00 each morning.
4.3 Building the DWD Event Tables
4.3.1 Product Click Table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_display_log;
CREATE EXTERNAL TABLE dwd_display_log(
mid_id
string,
user_id
string,
version_code
string,
version_name
string,
lang
string,
source
string,
os
string,
area
string,
model
string,
brand
string,
sdk_version
string,
gmail
string,
height_width
string,
app_time
string,
network
string,
lng
string,
lat
string,
action
string,
goodsid
string,
place
string,
extend1
string,
category
string,
server_time
string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_display_log/';
2) Load the data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_display_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.goodsid') goodsid,
get_json_object(event_json,'$.kv.place') place,
get_json_object(event_json,'$.kv.extend1') extend1,
get_json_object(event_json,'$.kv.category') category,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='display';
3) Test
hive (gmall)> select * from dwd_display_log limit 2;
4.3.2 Product Detail Page Table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_newsdetail_log;
CREATE EXTERNAL TABLE dwd_newsdetail_log(
mid_id
string,
user_id
string,
version_code
string,
version_name
string,
lang
string,
source
string,
os
string,
area
string,
model
string,
brand
string,
sdk_version
string,
gmail
string,
height_width
string,
app_time
string,
network
string,
lng
string,
lat
string,
entry
string,
action
string,
goodsid
string,
showtype
string,
news_staytime
string,
loading_time
string,
type1
string,
category
string,
server_time
string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_newsdetail_log/';
2) Load the data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_newsdetail_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.entry') entry,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.goodsid') goodsid,
get_json_object(event_json,'$.kv.showtype') showtype,
get_json_object(event_json,'$.kv.news_staytime') news_staytime,
get_json_object(event_json,'$.kv.loading_time') loading_time,
get_json_object(event_json,'$.kv.type1') type1,
get_json_object(event_json,'$.kv.category') category,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='newsdetail';
3) Test
hive (gmall)> select * from dwd_newsdetail_log limit 2;
4.3.3 Product List Page Table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_loading_log;
CREATE EXTERNAL TABLE dwd_loading_log(
mid_id
string,
user_id
string,
version_code
string,
version_name
string,
lang
string,
source
string,
os
string,
area
string,
model
string,
brand
string,
sdk_version
string,
gmail
string,
height_width
string,
app_time
string,
network
string,
lng
string,
lat
string,
action
string,
loading_time
string,
loading_way
string,
extend1
string,
extend2
string,
type
string,
type1
string,
server_time
string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_loading_log/';
2) Load the data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_loading_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.loading_time') loading_time,
get_json_object(event_json,'$.kv.loading_way') loading_way,
get_json_object(event_json,'$.kv.extend1') extend1,
get_json_object(event_json,'$.kv.extend2') extend2,
get_json_object(event_json,'$.kv.type') type,
get_json_object(event_json,'$.kv.type1') type1,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='loading';
3) Test
hive (gmall)> select * from dwd_loading_log limit 2;
4.3.4 Advertisement Table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_ad_log;
CREATE EXTERNAL TABLE dwd_ad_log(
mid_id
string,
user_id
string,
version_code
string,
version_name
string,
lang
string,
source
string,
os
string,
area
string,
model
string,
brand
string,
sdk_version
string,
gmail
string,
height_width
string,
app_time
string,
network
string,
lng
string,
lat
string,
entry
string,
action
string,
content
string,
detail
string,
ad_source
string,
behavior
string,
newstype
string,
show_style
string,
server_time
string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_ad_log/';
2) Load the data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_ad_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.entry') entry,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.content') content,
get_json_object(event_json,'$.kv.detail') detail,
get_json_object(event_json,'$.kv.source') ad_source,
get_json_object(event_json,'$.kv.behavior') behavior,
get_json_object(event_json,'$.kv.newstype') newstype,
get_json_object(event_json,'$.kv.show_style') show_style,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='ad';
3) Test
hive (gmall)> select * from dwd_ad_log limit 2;
4.3.5 Message Notification Table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_notification_log;
CREATE EXTERNAL TABLE dwd_notification_log(
mid_id
string,
user_id
string,
version_code
string,
version_name
string,
lang
string,
source
string,
os
string,
area
string,
model
string,
brand
string,
sdk_version
string,
gmail
string,
height_width
string,
app_time
string,
network
string,
lng
string,
lat
string,
action
string,
noti_type
string,
ap_time
string,
content
string,
server_time
string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_notification_log/';
2) Load the data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_notification_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.noti_type') noti_type,
get_json_object(event_json,'$.kv.ap_time') ap_time,
get_json_object(event_json,'$.kv.content') content,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='notification';
3) Test
hive (gmall)> select * from dwd_notification_log limit 2;
4.3.6 User Foreground-Active Table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_active_foreground_log;
CREATE EXTERNAL TABLE dwd_active_foreground_log(
mid_id
string,
user_id
string,
version_code
string,
version_name
string,
lang
string,
source
string,
os
string,
area
string,
model
string,
brand
string,
sdk_version
string,
gmail
string,
height_width
string,
app_time
string,
network
string,
lng
string,
lat
string,
push_id
string,
access
string,
server_time
string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_foreground_log/';
2) Load the data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_active_foreground_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.push_id') push_id,
get_json_object(event_json,'$.kv.access') access,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='active_foreground';
3) Test
hive (gmall)> select * from dwd_active_foreground_log limit 2;
4.3.7 User Background-Active Table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_active_background_log;
CREATE EXTERNAL TABLE dwd_active_background_log(
mid_id
string,
user_id
string,
version_code
string,
version_name
string,
lang
string,
source
string,
os
string,
area
string,
model
string,
brand
string,
sdk_version
string,
gmail
string,
height_width
string,
app_time
string,
network
string,
lng
string,
lat
string,
active_source
string,
server_time
string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_background_log/';
2) Load the data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_active_background_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.active_source') active_source,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='active_background';
3) Test
hive (gmall)> select * from dwd_active_background_log limit 2;
4.3.8 Comment Table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_comment_log;
CREATE EXTERNAL TABLE dwd_comment_log(
mid_id
string,
user_id
string,
version_code
string,
version_name
string,
lang
string,
source
string,
os
string,
area
string,
model
string,
brand
string,
sdk_version
string,
gmail
string,
height_width
string,
app_time
string,
network
string,
lng
string,
lat
string,
comment_id
int,
userid
int,
p_comment_id
int,
content
string,
addtime
string,
other_id
int,
praise_count
int,
reply_count
int,
server_time
string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_comment_log/';
2) Load the data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_comment_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.comment_id') comment_id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.p_comment_id') p_comment_id,
get_json_object(event_json,'$.kv.content') content,
get_json_object(event_json,'$.kv.addtime') addtime,
get_json_object(event_json,'$.kv.other_id') other_id,
get_json_object(event_json,'$.kv.praise_count') praise_count,
get_json_object(event_json,'$.kv.reply_count') reply_count,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='comment';
3) Test
hive (gmall)> select * from dwd_comment_log limit 2;
4.3.9 Favorites Table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_favorites_log;
CREATE EXTERNAL TABLE dwd_favorites_log(
mid_id
string,
user_id
string,
version_code
string,
version_name
string,
lang
string,
source
string,
os
string,
area
string,
model
string,
brand
string,
sdk_version
string,
gmail
string,
height_width
string,
app_time
string,
network
string,
lng
string,
lat
string,
id
int,
course_id
int,
userid
int,
add_time
string,
server_time
string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_favorites_log/';
2) Load the data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_favorites_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.id') id,
get_json_object(event_json,'$.kv.course_id') course_id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.add_time') add_time,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='favorites';
3) Test
hive (gmall)> select * from dwd_favorites_log limit 2;
4.3.10 Like (Praise) Table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_praise_log;
CREATE EXTERNAL TABLE dwd_praise_log(
mid_id
string,
user_id
string,
version_code
string,
version_name
string,
lang
string,
source
string,
os
string,
area
string,
model
string,
brand
string,
sdk_version
string,
gmail
string,
height_width
string,
app_time
string,
network
string,
lng
string,
lat
string,
id
string,
userid
string,
target_id
string,
type
string,
add_time
string,
server_time
string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_praise_log/';
2) Load the data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_praise_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.id') id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.target_id') target_id,
get_json_object(event_json,'$.kv.type') type,
get_json_object(event_json,'$.kv.add_time') add_time,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='praise';
3) Test
hive (gmall)> select * from dwd_praise_log limit 2;
4.3.11 Error Log Table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_error_log;
CREATE EXTERNAL TABLE dwd_error_log(
mid_id
string,
user_id
string,
version_code
string,
version_name
string,
lang
string,
source
string,
os
string,
area
string,
model
string,
brand
string,
sdk_version
string,
gmail
string,
height_width
string,
app_time
string,
network
string,
lng
string,
lat
string,
errorBrief
string,
errorDetail
string,
server_time
string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_error_log/';
2) Load the data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_error_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.errorBrief') errorBrief,
get_json_object(event_json,'$.kv.errorDetail') errorDetail,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='error';
3) Test
hive (gmall)> select * from dwd_error_log limit 2;
4.3.12 DWD Event-Table Data Loading Script
1) Create the script in the /home/atguigu/bin directory on hadoop102
[atguigu@hadoop102 bin]$ vim dwd_event_log.sh
Write the following into the script:
#!/bin/bash
APP=gmall
hive=/opt/module/hive/bin/hive
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table "$APP".dwd_display_log PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.goodsid') goodsid,
get_json_object(event_json,'$.kv.place') place,
get_json_object(event_json,'$.kv.extend1') extend1,
get_json_object(event_json,'$.kv.category') category,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='display';
insert overwrite table "$APP".dwd_newsdetail_log PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.entry') entry,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.goodsid') goodsid,
get_json_object(event_json,'$.kv.showtype') showtype,
get_json_object(event_json,'$.kv.news_staytime') news_staytime,
get_json_object(event_json,'$.kv.loading_time') loading_time,
get_json_object(event_json,'$.kv.type1') type1,
get_json_object(event_json,'$.kv.category') category,
server_time
from "$APP".dwd_base_event_log where dt='$do_date' and event_name='newsdetail';

insert overwrite table "$APP".dwd_loading_log PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.loading_time') loading_time,
get_json_object(event_json,'$.kv.loading_way') loading_way,
get_json_object(event_json,'$.kv.extend1') extend1,
get_json_object(event_json,'$.kv.extend2') extend2,
get_json_object(event_json,'$.kv.type') type,
get_json_object(event_json,'$.kv.type1') type1,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='loading';

insert overwrite table "$APP".dwd_ad_log PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.entry') entry,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.content') content,
get_json_object(event_json,'$.kv.detail') detail,
get_json_object(event_json,'$.kv.source') ad_source,
get_json_object(event_json,'$.kv.behavior') behavior,
get_json_object(event_json,'$.kv.newstype') newstype,
get_json_object(event_json,'$.kv.show_style') show_style,
server_time
from "$APP".dwd_base_event_log where dt='$do_date' and event_name='ad';

insert overwrite table "$APP".dwd_notification_log PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.noti_type') noti_type,
get_json_object(event_json,'$.kv.ap_time') ap_time,
get_json_object(event_json,'$.kv.content') content,
server_time
from "$APP".dwd_base_event_log where dt='$do_date' and event_name='notification';

insert overwrite table "$APP".dwd_active_foreground_log PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.push_id') push_id,
get_json_object(event_json,'$.kv.access') access,
server_time
from "$APP".dwd_base_event_log where dt='$do_date' and event_name='active_foreground';

insert overwrite table "$APP".dwd_active_background_log PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.active_source') active_source,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='active_background';

insert overwrite table "$APP".dwd_comment_log PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.comment_id') comment_id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.p_comment_id') p_comment_id,
get_json_object(event_json,'$.kv.content') content,
get_json_object(event_json,'$.kv.addtime') addtime,
get_json_object(event_json,'$.kv.other_id') other_id,
get_json_object(event_json,'$.kv.praise_count') praise_count,
get_json_object(event_json,'$.kv.reply_count') reply_count,
server_time
from "$APP".dwd_base_event_log where dt='$do_date' and event_name='comment';

insert overwrite table "$APP".dwd_favorites_log PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.id') id,
get_json_object(event_json,'$.kv.course_id') course_id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.add_time') add_time,
server_time
from "$APP".dwd_base_event_log where dt='$do_date' and event_name='favorites';

insert overwrite table "$APP".dwd_praise_log PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.id') id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.target_id') target_id,
get_json_object(event_json,'$.kv.type') type,
get_json_object(event_json,'$.kv.add_time') add_time,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='praise';

insert overwrite table "$APP".dwd_error_log PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.errorBrief') errorBrief,
get_json_object(event_json,'$.kv.errorDetail') errorDetail,
server_time
from "$APP".dwd_base_event_log where dt='$do_date' and event_name='error';
"
$hive -e "$sql"
2) Make the script executable
[atguigu@hadoop102 bin]$ chmod 777 dwd_event_log.sh
3) Run the script
[atguigu@hadoop102 module]$ dwd_event_log.sh 2019-02-11
4) Check the imported results
hive (gmall)>
select * from dwd_comment_log where dt='2019-02-11' limit 2;
5) When the script runs
In production it is generally run between 00:30 and 01:00 each morning.
Chapter 5 Business Knowledge Preparation
5.1 Business Terminology
Chapter 6 Building the Warehouse: DWS Layer
6.1.1 Daily Active Device Detail
1) Table creation statement
hive (gmall)>
drop table if exists dws_uv_detail_day;
create external table dws_uv_detail_day
(
    mid_id string COMMENT 'unique device identifier',
    user_id string COMMENT 'user identifier',
    version_code string COMMENT 'app version code',
    version_name string COMMENT 'app version name',
    lang string COMMENT 'system language',
    source string COMMENT 'channel',
    os string COMMENT 'Android OS version',
    area string COMMENT 'region',
    model string COMMENT 'phone model',
    brand string COMMENT 'phone brand',
    sdk_version string COMMENT 'sdkVersion',
    gmail string COMMENT 'gmail',
    height_width string COMMENT 'screen width and height',
    app_time string COMMENT 'time when the client log was generated',
    network string COMMENT 'network mode',
    lng string COMMENT 'longitude',
    lat string COMMENT 'latitude'
)
partitioned by(dt string)
stored as parquet
location '/warehouse/gmall/dws/dws_uv_detail_day'
;
2) Load the data
Aggregate with the device's single-day visits as the key: if a device used two operating systems, two system versions, or several regions, or logged into different accounts during one day, only one row is kept for it (the distinct values are concatenated with '|').
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dws_uv_detail_day
partition(dt='2019-02-10')
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat
from dwd_start_log
where dt='2019-02-10'
group by mid_id;
3) Check the imported results
hive (gmall)> select * from dws_uv_detail_day limit 1;
hive (gmall)> select count(*) from dws_uv_detail_day;
4) Question to think about: how would you count daily active devices per acquisition channel? (One possible answer is sketched below.)
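One possible answer, sketched here: the daily detail table keeps the channel in the source column, so daily actives per channel come from a simple group-by. Devices that reported more than one channel in a day carry the values joined with '|', which this sketch does not split apart:
hive (gmall)>
select source, count(*) dau
from dws_uv_detail_day
where dt='2019-02-10'
group by source;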
6.1.2 Weekly Active Device Detail
Derive the weekly visit detail from the daily visit detail.
1) Table creation statement
hive (gmall)>
drop table if exists dws_uv_detail_wk;
create external table dws_uv_detail_wk(
    mid_id string COMMENT 'unique device identifier',
    user_id string COMMENT 'user identifier',
    version_code string COMMENT 'app version code',
    version_name string COMMENT 'app version name',
    lang string COMMENT 'system language',
    source string COMMENT 'channel',
    os string COMMENT 'Android OS version',
    area string COMMENT 'region',
    model string COMMENT 'phone model',
    brand string COMMENT 'phone brand',
    sdk_version string COMMENT 'sdkVersion',
    gmail string COMMENT 'gmail',
    height_width string COMMENT 'screen width and height',
    app_time string COMMENT 'time when the client log was generated',
    network string COMMENT 'network mode',
    lng string COMMENT 'longitude',
    lat string COMMENT 'latitude',
    monday_date string COMMENT 'Monday of the week',
    sunday_date string COMMENT 'Sunday of the week'
) COMMENT 'weekly active-user detail'
PARTITIONED BY (wk_dt string)
stored as parquet
location '/warehouse/gmall/dws/dws_uv_detail_wk/'
;
2) Load the data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dws_uv_detail_wk partition(wk_dt)
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat,
date_add(next_day('2019-02-10','MO'),-7),
date_add(next_day('2019-02-10','MO'),-1),
concat(date_add(next_day('2019-02-10','MO'),-7), '_', date_add(next_day('2019-02-10','MO'),-1))
from dws_uv_detail_day
where dt>=date_add(next_day('2019-02-10','MO'),-7) and dt<=date_add(next_day('2019-02-10','MO'),-1)
group by mid_id;
3) Verify the loaded data
hive (gmall)> select * from dws_uv_detail_wk limit 1;
hive (gmall)> select count(*) from dws_uv_detail_wk;
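A note on the week-boundary arithmetic used above (added illustration): Hive's next_day(d,'MO') returns the first Monday strictly after d. Since 2019-02-10 is a Sunday, the expressions evaluate as follows:
hive (gmall)>
select
next_day('2019-02-10','MO'),              -- 2019-02-11, the following Monday
date_add(next_day('2019-02-10','MO'),-7), -- 2019-02-04, Monday of the week containing 2019-02-10
date_add(next_day('2019-02-10','MO'),-1); -- 2019-02-10, Sunday of that week
so wk_dt for this week becomes '2019-02-04_2019-02-10'.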
6.1.3 Monthly Active Device Detail
1) Create the table
hive (gmall)>
drop table if exists dws_uv_detail_mn;
create external table dws_uv_detail_mn(
mid_id string COMMENT '设备唯一标识',
user_id string COMMENT '用户标识',
version_code string COMMENT '程序版本号',
version_name string COMMENT '程序版本名',
lang string COMMENT '系统语言',
source string COMMENT '渠道号',
os string COMMENT '安卓系统版本',
area string COMMENT '区域',
model string COMMENT '手机型号',
brand string COMMENT '手机品牌',
sdk_version string COMMENT 'sdkVersion',
gmail string COMMENT 'gmail',
height_width string COMMENT '屏幕宽高',
app_time string COMMENT '客户端日志产生时的时间',
network string COMMENT '网络模式',
lng string COMMENT '经度',
lat string COMMENT '纬度'
) COMMENT '活跃用户按月明细'
PARTITIONED BY (mn string)
stored as parquet
location '/warehouse/gmall/dws/dws_uv_detail_mn/'
;
2) Load the data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dws_uv_detail_mn partition(mn)
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat,
date_format('2019-02-10','yyyy-MM')
from dws_uv_detail_day
where date_format(dt,'yyyy-MM') = date_format('2019-02-10','yyyy-MM')
group by mid_id;
3) Verify the loaded data
hive (gmall)> select * from dws_uv_detail_mn limit 1;
hive (gmall)> select count(*) from dws_uv_detail_mn ;
6.1.4 DWS-Layer Data Loading Script
1) Create the script in /home/atguigu/bin on hadoop102
[atguigu@hadoop102 bin]$ vim dws_uv_log.sh
Add the following content to the script:
#!/bin/bash

# variables kept at the top for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive

# use the date passed as $1; default to yesterday
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

sql="
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table "$APP".dws_uv_detail_day partition(dt='$do_date')
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat
from "$APP".dwd_start_log
where dt='$do_date'
group by mid_id;

insert overwrite table "$APP".dws_uv_detail_wk partition(wk_dt)
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat,
date_add(next_day('$do_date','MO'),-7),
date_add(next_day('$do_date','MO'),-1),
concat(date_add(next_day('$do_date','MO'),-7), '_', date_add(next_day('$do_date','MO'),-1))
from "$APP".dws_uv_detail_day
where dt>=date_add(next_day('$do_date','MO'),-7) and dt<=date_add(next_day('$do_date','MO'),-1)
group by mid_id;

insert overwrite table "$APP".dws_uv_detail_mn partition(mn)
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat,
date_format('$do_date','yyyy-MM')
from "$APP".dws_uv_detail_day
where date_format(dt,'yyyy-MM')=date_format('$do_date','yyyy-MM')
group by mid_id;
"

$hive -e "$sql"
2) Grant execute permission on the script
[atguigu@hadoop102 bin]$ chmod 777 dws_uv_log.sh
3) Run the script
[atguigu@hadoop102 module]$ dws_uv_log.sh 2019-02-11
4) Check the results
hive (gmall)> select count(*) from dws_uv_detail_day where dt='2019-02-11';
hive (gmall)> select count(*) from dws_uv_detail_wk;
hive (gmall)> select count(*) from dws_uv_detail_mn;
5) When to schedule the script
In production this job normally runs once a day, between 00:30 and 01:00.
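For reference only (not from the original material): if plain cron is used as the scheduler, an entry like the one below would launch the job at 00:30 each day; since no date argument is passed, the script defaults to the previous day. The log path is hypothetical, and in practice a workflow scheduler such as Azkaban or Oozie is more common.
30 0 * * * /home/atguigu/bin/dws_uv_log.sh >> /tmp/dws_uv_log.cron.log 2>&1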
6.2 ADS Layer
Goal: the number of active devices for the day, the week and the month.
6.2.1 Active Device Counts
1) Create the table
hive (gmall)>
drop table if exists ads_uv_count;
create external table ads_uv_count(
dt string COMMENT '统计日期',
day_count bigint COMMENT '当日用户数量',
wk_count bigint COMMENT '当周用户数量',
mn_count bigint COMMENT '当月用户数量',
is_weekend string COMMENT 'Y,N是否是周末,用于得到本周最终结果',
is_monthend string COMMENT 'Y,N是否是月末,用于得到本月最终结果'
) COMMENT '活跃设备数'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_uv_count/'
;
2) Load the data
hive (gmall)>
insert into table ads_uv_count
select
'2019-02-10' dt,
daycount.ct,
wkcount.ct,
mncount.ct,
if(date_add(next_day('2019-02-10','MO'),-1)='2019-02-10','Y','N'),
if(last_day('2019-02-10')='2019-02-10','Y','N')
from
(
select
'2019-02-10' dt,
count(*) ct
from dws_uv_detail_day
where dt='2019-02-10'
)daycount join
(
select
'2019-02-10' dt,
count(*) ct
from dws_uv_detail_wk
where wk_dt=concat(date_add(next_day('2019-02-10','MO'),-7),'_',date_add(next_day('2019-02-10','MO'),-1))
) wkcount on daycount.dt=wkcount.dt
join
(
select
'2019-02-10' dt,
count(*) ct
from dws_uv_detail_mn
where mn=date_format('2019-02-10','yyyy-MM')
)mncount on daycount.dt=mncount.dt
;
3) Verify the loaded data
hive (gmall)> select * from ads_uv_count;
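Usage note (the two queries below are an added sketch, not part of the original): a row is appended for every run day, so the is_weekend and is_monthend flags let you pick out the final figure for a completed week or month, which is what the column comments above refer to:
hive (gmall)> select dt, wk_count from ads_uv_count where is_weekend='Y';
hive (gmall)> select dt, mn_count from ads_uv_count where is_monthend='Y';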
6.2.2 ADS-Layer Data Loading Script
1) Create the script in /home/atguigu/bin on hadoop102
[atguigu@hadoop102 bin]$ vim ads_uv_log.sh
Add the following content to the script:
#!/bin/bash

APP=gmall
hive=/opt/module/hive/bin/hive

# use the date passed as $1; default to yesterday
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

sql="
set hive.exec.dynamic.partition.mode=nonstrict;

insert into table "$APP".ads_uv_count
select
'$do_date' dt,
daycount.ct,
wkcount.ct,
mncount.ct,
if(date_add(next_day('$do_date','MO'),-1)='$do_date','Y','N'),
if(last_day('$do_date')='$do_date','Y','N')
from
(
select
'$do_date' dt,
count(*) ct
from "$APP".dws_uv_detail_day
where dt='$do_date'
)daycount join
(
select
'$do_date' dt,
count(*) ct
from "$APP".dws_uv_detail_wk
where wk_dt=concat(date_add(next_day('$do_date','MO'),-7),'_',date_add(next_day('$do_date','MO'),-1))
) wkcount on daycount.dt=wkcount.dt
join
(
select
'$do_date' dt,
count(*) ct
from "$APP".dws_uv_detail_mn
where mn=date_format('$do_date','yyyy-MM')
)mncount on daycount.dt=mncount.dt;
"

$hive -e "$sql"
2) Grant execute permission on the script
[atguigu@hadoop102 bin]$ chmod 777 ads_uv_log.sh
3) Run the script
[atguigu@hadoop102 module]$ ads_uv_log.sh 2019-02-11
4) When to schedule the script
In production this job normally runs once a day, between 00:30 and 01:00.
5) Verify the loaded data
hive (gmall)> select * from ads_uv_count;
Chapter 7  Requirement 2: New Users
A new user is a device that connects and uses the app for the first time. If a device opens the app for the first time, it is counted as a new user; a device that uninstalls and reinstalls the app is not counted as new again. New-user metrics include daily, weekly and monthly new users.
7.1 DWS Layer (Daily New Device Detail Table)
1) Create the table
hive (gmall)>
drop table if exists dws_new_mid_day;
create external table dws_new_mid_day
(
mid_id string COMMENT '设备唯一标识',
user_id string COMMENT '用户标识',
version_code string COMMENT '程序版本号',
version_name string COMMENT '程序版本名',
lang string COMMENT '系统语言',
source string COMMENT '渠道号',
os string COMMENT '安卓系统版本',
area string COMMENT '区域',
model string COMMENT '手机型号',
brand string COMMENT '手机品牌',
sdk_version string COMMENT 'sdkVersion',
gmail string COMMENT 'gmail',
height_width string COMMENT '屏幕宽高',
app_time string COMMENT '客户端日志产生时的时间',
network string COMMENT '网络模式',
lng string COMMENT '经度',
lat string COMMENT '纬度',
create_date string comment '创建时间'
) COMMENT '每日新增设备信息'
stored as parquet
location '/warehouse/gmall/dws/dws_new_mid_day/';
2) Load the data
Left join the daily active device table with the daily new-device table on mid_id. A device that is new today has no matching row in dws_new_mid_day yet, so the right-hand mid_id is null; those rows are the day's new devices.
hive (gmall)>
insert into table dws_new_mid_day
select
ud.mid_id,
ud.user_id,
ud.version_code,
ud.version_name,
ud.lang,
ud.source,
ud.os,
ud.area,
ud.model,
ud.brand,
ud.sdk_version,
ud.gmail,
ud.height_width,
ud.app_time,
ud.network,
ud.lng,
ud.lat,
'2019-02-10'
from dws_uv_detail_day ud left join dws_new_mid_day nm on ud.mid_id=nm.mid_id
where ud.dt='2019-02-10' and nm.mid_id is null;
3) Verify the loaded data
hive (gmall)> select count(*) from dws_new_mid_day;
7.2 ADS Layer (Daily New Device Count Table)
1) Create the table
hive (gmall)>
drop table if exists ads_new_mid_count;
create external table ads_new_mid_count
(
create_date string comment '创建时间',
new_mid_count BIGINT comment '新增设备数量'
) COMMENT '每日新增设备信息数量'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_new_mid_count/';
2) Load the data
hive (gmall)>
insert into table ads_new_mid_count
select
create_date,
count(*)
from dws_new_mid_day
where create_date='2019-02-10'
group by create_date;
3) Verify the loaded data
hive (gmall)> select * from ads_new_mid_count;
Chapter 8  Requirement 3: User Retention
8.1 Goal
8.1.1 The Concept of User Retention
8.1.2 Requirement Description
8.2 DWS Layer
8.2.1 DWS Layer (Daily Retained User Detail Table)
1) Create the table
hive (gmall)>
drop table if exists dws_user_retention_day;
create external table dws_user_retention_day
(
mid_id string COMMENT '设备唯一标识',
user_id string COMMENT '用户标识',
version_code string COMMENT '程序版本号',
version_name string COMMENT '程序版本名',
lang string COMMENT '系统语言',
source string COMMENT '渠道号',
os string COMMENT '安卓系统版本',
area string COMMENT '区域',
model string COMMENT '手机型号',
brand string COMMENT '手机品牌',
sdk_version string COMMENT 'sdkVersion',
gmail string COMMENT 'gmail',
height_width string COMMENT '屏幕宽高',
app_time string COMMENT '客户端日志产生时的时间',
network string COMMENT '网络模式',
lng string COMMENT '经度',
lat string COMMENT '纬度',
create_date string comment '设备新增时间',
retention_day int comment '截止当前日期留存天数'
) COMMENT '每日用户留存情况'
PARTITIONED BY (dt string)
stored as parquet
location '/warehouse/gmall/dws/dws_user_retention_day/'
;
2) Load the data (each day, compute the 1-day retention detail for users added the previous day)
hive (gmall)>
insert overwrite table dws_user_retention_day
partition(dt="2019-02-11")
select
nm.mid_id,
nm.user_id,
nm.version_code,
nm.version_name,
nm.lang,
nm.source,
nm.os,
nm.area,
nm.model,
nm.brand,
nm.sdk_version,
nm.gmail,
nm.height_width,
nm.app_time,
nm.network,
nm.lng,
nm.lat,
nm.create_date,
1 retention_day
from dws_uv_detail_day ud join dws_new_mid_day nm on ud.mid_id=nm.mid_id
where ud.dt='2019-02-11' and nm.create_date=date_add('2019-02-11',-1);
3) Verify the loaded data (the 1-day retention detail computed for the previous day's new users)
hive (gmall)> select count(*) from dws_user_retention_day;
8.2.2 DWS Layer (1-, 2-, 3-, n-Day Retained User Detail)
1) Load the data (each day, compute the retention detail for users added 1, 2, 3, ..., n days earlier)
hive (gmall)>
insert overwrite table dws_user_retention_day
partition(dt="2019-02-11")
select
nm.mid_id,
nm.user_id,
nm.version_code,
nm.version_name,
nm.lang,
nm.source,
nm.os,
nm.area,
nm.model,
nm.brand,
nm.sdk_version,
nm.gmail,
nm.height_width,
nm.app_time,
nm.network,
nm.lng,
nm.lat,
nm.create_date,
1 retention_day
from dws_uv_detail_day ud join dws_new_mid_day nm on ud.mid_id =nm.mid_id
where ud.dt='2019-02-11' and nm.create_date=date_add('2019-02-11',-1)
union all
select
nm.mid_id,
nm.user_id ,
nm.version_code ,
nm.version_name ,
nm.lang ,
nm.source,
nm.os,
nm.area,
nm.model,
nm.brand,
nm.sdk_version,
nm.gmail,
nm.height_width,
nm.app_time,
nm.network,
nm.lng,
nm.lat,
nm.create_date,
2 retention_day
from dws_uv_detail_day ud join dws_new_mid_day nm on ud.mid_id =nm.mid_id
where ud.dt='2019-02-11' and nm.create_date=date_add('2019-02-11',-2)
union all
select
nm.mid_id,
nm.user_id ,
nm.version_code ,
nm.version_name ,
nm.lang ,
nm.source,
nm.os,
nm.area,
nm.model,
nm.brand,
nm.sdk_version,
nm.gmail,
nm.height_width,
nm.app_time,
nm.network,
nm.lng,
nm.lat,
nm.create_date,
3 retention_day
from dws_uv_detail_day ud join dws_new_mid_day nm on ud.mid_id =nm.mid_id
where ud.dt='2019-02-11' and nm.create_date=date_add('2019-02-11',-3);
2) Verify the loaded data (the retention detail computed for users added 1, 2 and 3 days earlier)
hive (gmall)> select retention_day, count(*) from dws_user_retention_day group by retention_day;
8.2.3 Union vs. Union All
1) Prepare two tables
tableA
id  name  score
1   a     80
2   b     79
3   c     68

tableB
id  name  score
1   d     48
2   e     23
3   c     86
2) Query with union
select name from tableA
union
select name from tableB
Query result
name
a
d
b
e
c
3) Query with union all
select name from tableA
union all
select name from tableB
Query result
name
a
b
c
d
e
c
4) Summary
(1) union de-duplicates the combined result set, so it is slower than union all.
(2) union all does not de-duplicate the result set, so it is faster.
8.3 ADS Layer
8.3.1 Retained User Counts
1) Create the table
hive (gmall)>
drop table if exists ads_user_retention_day_count;
create external table ads_user_retention_day_count
(
create_date string comment '设备新增日期',
retention_day int comment '截止当前日期留存天数',
retention_count bigint comment '留存数量'
) COMMENT '每日用户留存情况'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_retention_day_count/';
2) Load the data
hive (gmall)>
insert into table ads_user_retention_day_count
select
create_date,
retention_day,
count(*) retention_count
from dws_user_retention_day
where dt='2019-02-11'
group by create_date,retention_day;
3) Verify the loaded data
hive (gmall)> select * from ads_user_retention_day_count;
8.3.2 Retention Rate
1) Create the table
hive (gmall)>
drop table if exists ads_user_retention_day_rate;
create external table ads_user_retention_day_rate
(
stat_date string comment '统计日期',
create_date string comment '设备新增日期',
retention_day int comment '截止当前日期留存天数',
retention_count bigint comment '留存数量',
new_mid_count bigint comment '当日设备新增数量',
retention_ratio decimal(10,2) comment '留存率'
) COMMENT '每日用户留存情况'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_retention_day_rate/';
2) Load the data
hive (gmall)>
insert into table ads_user_retention_day_rate
select
'2019-02-11',
ur.create_date,
ur.retention_day,
ur.retention_count,
nc.new_mid_count,
ur.retention_count/nc.new_mid_count*100
from
(
select
create_date,
retention_day,
count(*) retention_count
from dws_user_retention_day
where dt='2019-02-11'
group by create_date,retention_day
) ur join ads_new_mid_count nc on nc.create_date=ur.create_date;
3) Verify the loaded data
hive (gmall)> select * from ads_user_retention_day_rate;
Chapter 9  Preparing Additional Data
To analyze silent users, weekly returning users, churned users, users active for the last three consecutive weeks, and users active on three consecutive days within the last seven days, we need data for 2019-02-12 and 2019-02-20.
1) Prepare the 2019-02-12 data
(1) Change the log date
[atguigu@hadoop102 ~]$ dt.sh 2019-02-12
(2) Start the cluster
[atguigu@hadoop102 ~]$ cluster.sh start
(3) Generate log data
[atguigu@hadoop102 ~]$ lg.sh
(4) Load the HDFS data into the ODS layer
[atguigu@hadoop102 ~]$ ods_log.sh 2019-02-12
(5) Load the ODS data into the DWD layer
[atguigu@hadoop102 ~]$ dwd_start_log.sh 2019-02-12
[atguigu@hadoop102 ~]$ dwd_base_log.sh 2019-02-12
[atguigu@hadoop102 ~]$ dwd_event_log.sh 2019-02-12
(6) Load the DWD data into the DWS layer
[atguigu@hadoop102 ~]$ dws_uv_log.sh 2019-02-12
(7) Verify
hive (gmall)> select * from dws_uv_detail_day where dt='2019-02-12' limit 2;
2) Prepare the 2019-02-20 data
(1) Change the log date
[atguigu@hadoop102 ~]$ dt.sh 2019-02-20
(2) Start the cluster
[atguigu@hadoop102 ~]$ cluster.sh start
(3) Generate log data
[atguigu@hadoop102 ~]$ lg.sh
(4) Load the HDFS data into the ODS layer
[atguigu@hadoop102 ~]$ ods_log.sh 2019-02-20
(5) Load the ODS data into the DWD layer
[atguigu@hadoop102 ~]$ dwd_start_log.sh 2019-02-20
[atguigu@hadoop102 ~]$ dwd_base_log.sh 2019-02-20
[atguigu@hadoop102 ~]$ dwd_event_log.sh 2019-02-20
(6) Load the DWD data into the DWS layer
[atguigu@hadoop102 ~]$ dws_uv_log.sh 2019-02-20
(7) Verify
hive (gmall)> select * from dws_uv_detail_day where dt='2019-02-20' limit 2;
Chapter 10  Requirement 4: Silent User Count
Silent users: devices that started the app only on the day it was installed, and that single start was more than a week ago.
10.1 DWS Layer
Use the daily active detail table dws_uv_detail_day as the DWS-layer source.
10.2 ADS Layer
1) Create the table
hive (gmall)>
drop table if exists ads_slient_count;
create external table ads_slient_count(
dt string COMMENT '统计日期',
slient_count bigint COMMENT '沉默设备数'
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_slient_count';
2) Load the 2019-02-20 data
hive (gmall)>
insert into table ads_slient_count
select
'2019-02-20' dt,
count(*) slient_count
from
(
select mid_id
from dws_uv_detail_day
where dt<='2019-02-20'
group by mid_id
having count(*)=1 and min(dt)<=date_add('2019-02-20',-7)
)t1;
3) Verify the loaded data
hive (gmall)> select * from ads_slient_count;
10.3 Write the Script
1) Create the script in /home/atguigu/bin on hadoop102
[atguigu@hadoop102 bin]$ vim ads_slient_log.sh
Add the following content to the script:
#!/bin/bash

hive=/opt/module/hive/bin/hive
APP=gmall

# use the date passed as $1; default to yesterday
if [ -n "$1" ];then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

echo "-----------导入日期$do_date-----------"

sql="
insert into table "$APP".ads_slient_count
select
'$do_date' dt,
count(*) slient_count
from
(
select mid_id
from "$APP".dws_uv_detail_day
where dt<='$do_date'
group by mid_id
having count(*)=1 and min(dt)<=date_add('$do_date',-7)
)t1;"

$hive -e "$sql"
2) Grant execute permission on the script
[atguigu@hadoop102 bin]$ chmod 777 ads_slient_log.sh
3) Run the script
[atguigu@hadoop102 module]$ ads_slient_log.sh 2019-02-20
4) Check the results
hive (gmall)> select * from ads_slient_count;
5) When to schedule the script
In production this job normally runs once a day, between 00:30 and 01:00.
Chapter 11  Requirement 5: Weekly Returning User Count
Returning this week = active this week - new this week - active last week
11.1 DWS Layer
Use the daily active detail table dws_uv_detail_day as the DWS-layer source.
11.2 ADS Layer
1) Create the table
hive (gmall)>
drop table if exists ads_back_count;
create external table ads_back_count(
dt string COMMENT '统计日期',
wk_dt string COMMENT '统计日期所在周',
wastage_count bigint COMMENT '回流设备数'
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_back_count';
2) Load the data:
hive (gmall)>
insert into table ads_back_count
select
'2019-02-20' dt,
concat(date_add(next_day('2019-02-20','MO'),-7),'_',date_add(next_day('2019-02-20','MO'),-1)) wk_dt,
count(*)
from
(
select t1.mid_id
from
(
select mid_id
from dws_uv_detail_wk
where wk_dt=concat(date_add(next_day('2019-02-20','MO'),-7),'_',date_add(next_day('2019-02-20','MO'),-1))
)t1
left join
(
select mid_id
from dws_new_mid_day
where create_date<=date_add(next_day('2019-02-20','MO'),-1) and create_date>=date_add(next_day('2019-02-20','MO'),-7)
)t2
on t1.mid_id=t2.mid_id
left join
(
select mid_id
from dws_uv_detail_wk
where wk_dt=concat(date_add(next_day('2019-02-20','MO'),-7*2),'_',date_add(next_day('2019-02-20','MO'),-7-1))
)t3
on t1.mid_id=t3.mid_id
where t2.mid_id is null and t3.mid_id is null
)t4;
3) Check the results
hive (gmall)> select * from ads_back_count;
11.3 Write the Script
1) Create the script in /home/atguigu/bin on hadoop102
[atguigu@hadoop102 bin]$ vim ads_back_log.sh
Add the following content to the script:
#!/bin/bash

# use the date passed as $1; default to yesterday
if [ -n "$1" ];then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

hive=/opt/module/hive/bin/hive
APP=gmall

echo "-----------导入日期$do_date-----------"

sql="
insert into table "$APP".ads_back_count
select
'$do_date' dt,
concat(date_add(next_day('$do_date','MO'),-7),'_',date_add(next_day('$do_date','MO'),-1)) wk_dt,
count(*)
from
(
select t1.mid_id
from
(
select mid_id
from "$APP".dws_uv_detail_wk
where wk_dt=concat(date_add(next_day('$do_date','MO'),-7),'_',date_add(next_day('$do_date','MO'),-1))
)t1
left join
(
select mid_id
from "$APP".dws_new_mid_day
where create_date<=date_add(next_day('$do_date','MO'),-1) and create_date>=date_add(next_day('$do_date','MO'),-7)
)t2
on t1.mid_id=t2.mid_id
left join
(
select mid_id
from "$APP".dws_uv_detail_wk
where wk_dt=concat(date_add(next_day('$do_date','MO'),-7*2),'_',date_add(next_day('$do_date','MO'),-7-1))
)t3
on t1.mid_id=t3.mid_id
where t2.mid_id is null and t3.mid_id is null
)t4;
"

$hive -e "$sql"
2) Grant execute permission on the script
[atguigu@hadoop102 bin]$ chmod 777 ads_back_log.sh
3) Run the script
[atguigu@hadoop102 module]$ ads_back_log.sh 2019-02-20
4) Check the results
hive (gmall)> select * from ads_back_count;
5) When to schedule the script
In production this job normally runs once a week, early Monday morning between 00:30 and 01:00.
Chapter 12  Requirement 6: Churned User Count
Churned users: devices with no activity in the last 7 days.
12.1 DWS Layer
Use the daily active detail table dws_uv_detail_day as the DWS-layer source.
12.2 ADS Layer
1) Create the table
hive (gmall)>
drop table if exists ads_wastage_count;
create external table ads_wastage_count(
dt string COMMENT '统计日期',
wastage_count bigint COMMENT '流失设备数'
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_wastage_count';
2) Load the 2019-02-20 data
hive (gmall)>
insert into table ads_wastage_count
select
'2019-02-20',
count(*)
from
(
select mid_id
from dws_uv_detail_day
group by mid_id
having max(dt)<=date_add('2019-02-20',-7)
)t1;
12.3 Write the Script
1) Create the script in /home/atguigu/bin on hadoop102
[atguigu@hadoop102 bin]$ vim ads_wastage_log.sh
Add the following content to the script:
#!/bin/bash

# use the date passed as $1; default to yesterday
if [ -n "$1" ];then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

hive=/opt/module/hive/bin/hive
APP=gmall

echo "-----------导入日期$do_date-----------"

sql="
insert into table "$APP".ads_wastage_count
select
'$do_date',
count(*)
from
(
select mid_id
from "$APP".dws_uv_detail_day
group by mid_id
having max(dt)<=date_add('$do_date',-7)
)t1;
"

$hive -e "$sql"
2) Grant execute permission on the script
[atguigu@hadoop102 bin]$ chmod 777 ads_wastage_log.sh
3) Run the script
[atguigu@hadoop102 module]$ ads_wastage_log.sh 2019-02-20
4) Check the results
hive (gmall)> select * from ads_wastage_count;
5) When to schedule the script
In production this job normally runs once a day, between 00:30 and 01:00.
Chapter 13  Requirement 7: Users Active in the Last 3 Consecutive Weeks
Users active in each of the last 3 weeks: usually computed every Monday over the previous 3 weeks, i.e. once a week.
13.1 DWS Layer
Use the weekly active detail table dws_uv_detail_wk as the DWS-layer source.
13.2 ADS Layer
1) Create the table
hive (gmall)>
drop table if exists ads_continuity_wk_count;
create external table ads_continuity_wk_count(
dt string COMMENT '统计日期,一般用结束周周日日期,如果每天计算一次,可用当天日期',
wk_dt string COMMENT '持续时间',
continuity_count bigint
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_continuity_wk_count';
2) Load the data for the week containing 2019-02-20
hive (gmall)>
insert into table ads_continuity_wk_count
select
'2019-02-20',
concat(date_add(next_day('2019-02-20','MO'),-7*3),'_',date_add(next_day('2019-02-20','MO'),-1)),
count(*)
from
(
select mid_id
from dws_uv_detail_wk
where wk_dt>=concat(date_add(next_day('2019-02-20','MO'),-7*3),'_',date_add(next_day('2019-02-20','MO'),-7*2-1))
and wk_dt<=concat(date_add(next_day('2019-02-20','MO'),-7),'_',date_add(next_day('2019-02-20','MO'),-1))
group by mid_id
having count(*)=3
)t1;
3) Check the results
hive (gmall)> select * from ads_continuity_wk_count;
13.3 Write the Script
1) Create the script in /home/atguigu/bin on hadoop102
[atguigu@hadoop102 bin]$ vim ads_continuity_wk_log.sh
Add the following content to the script:
#!/bin/bash

# use the date passed as $1; default to yesterday
if [ -n "$1" ];then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

hive=/opt/module/hive/bin/hive
APP=gmall

echo "-----------导入日期$do_date-----------"

sql="
insert into table "$APP".ads_continuity_wk_count
select
'$do_date',
concat(date_add(next_day('$do_date','MO'),-7*3),'_',date_add(next_day('$do_date','MO'),-1)),
count(*)
from
(
select mid_id
from "$APP".dws_uv_detail_wk
where wk_dt>=concat(date_add(next_day('$do_date','MO'),-7*3),'_',date_add(next_day('$do_date','MO'),-7*2-1))
and wk_dt<=concat(date_add(next_day('$do_date','MO'),-7),'_',date_add(next_day('$do_date','MO'),-1))
group by mid_id
having count(*)=3
)t1;"

$hive -e "$sql"
2) Grant execute permission on the script
[atguigu@hadoop102 bin]$ chmod 777 ads_continuity_wk_log.sh
3) Run the script
[atguigu@hadoop102 module]$ ads_continuity_wk_log.sh 2019-02-20
4) Check the results
hive (gmall)> select * from ads_continuity_wk_count;
5) When to schedule the script
In production this job normally runs once a week, early Monday morning between 00:30 and 01:00.
Chapter 14  Requirement 8: Users Active on 3 Consecutive Days Within the Last 7 Days
Note: count the devices that were active on at least 3 consecutive days within the last 7 days.
14.1 DWS Layer
Use the daily active detail table dws_uv_detail_day as the DWS-layer source.
14.2 ADS Layer
1) Create the table
hive (gmall)>
drop table if exists ads_continuity_uv_count;
create external table ads_continuity_uv_count(
dt string COMMENT '统计日期',
wk_dt string COMMENT '最近7天日期',
continuity_count bigint
) COMMENT '连续活跃设备数'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_continuity_uv_count';
2) Write the SQL to load the data
hive (gmall)>
insert into table ads_continuity_uv_count
select
'2019-02-12',
concat(date_add('2019-02-12',-6),'_','2019-02-12'),
count(*)
from
(
select mid_id
from
(
select mid_id
from
(
select
mid_id,
date_sub(dt,rank) date_diff
from
(
select
mid_id,
dt,
rank() over(partition by mid_id order by dt) rank
from dws_uv_detail_day
where dt>=date_add('2019-02-12',-6) and dt<='2019-02-12'
)t1
)t2
group by mid_id,date_diff
having count(*)>=3
)t3
group by mid_id
)t4;
3) Check the results
hive (gmall)> select * from ads_continuity_uv_count;
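Why the date_sub(dt, rank) trick works (illustration added for clarity; the dates below are hypothetical for a single device): within one device, rank increases by exactly 1 per active day, so on consecutive days dt minus rank stays constant. Grouping by (mid_id, date_diff) therefore gathers each unbroken run of days, and having count(*)>=3 keeps only runs of three days or more.
dt           rank   date_sub(dt,rank)
2019-02-06   1      2019-02-05
2019-02-07   2      2019-02-05
2019-02-08   3      2019-02-05   <- three rows share one value: a 3-day streak
2019-02-11   4      2019-02-07
2019-02-12   5      2019-02-07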
14.3 Write the Script
1) Create the script in /home/atguigu/bin on hadoop102
[atguigu@hadoop102 bin]$ vim ads_continuity_log.sh
Add the following content to the script:
#!/bin/bash

# use the date passed as $1; default to yesterday
if [ -n "$1" ];then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

hive=/opt/module/hive/bin/hive
APP=gmall

echo "-----------导入日期$do_date-----------"

sql="
insert into table "$APP".ads_continuity_uv_count
select
'$do_date',
concat(date_add('$do_date',-6),'_','$do_date') dt,
count(*)
from
(
select mid_id
from
(
select mid_id
from
(
select
mid_id,
date_sub(dt,rank) date_diff
from
(
select
mid_id,
dt,
rank() over(partition by mid_id order by dt) rank
from "$APP".dws_uv_detail_day
where dt>=date_add('$do_date',-6) and dt<='$do_date'
)t1
)t2
group by mid_id,date_diff
having count(*)>=3
)t3
group by mid_id
)t4;
"

$hive -e "$sql"
2) Grant execute permission on the script
[atguigu@hadoop102 bin]$ chmod 777 ads_continuity_log.sh
3) Run the script
[atguigu@hadoop102 module]$