2020-03-23

E-commerce Data Warehouse

Chapter 1 Data Warehouse Layering Concepts
1.1 Why layer the warehouse
1.2 Warehouse layers
1.3 Data marts vs. data warehouses
1.4 Naming conventions
- The ODS layer is named ods
- The DWD layer is named dwd
- The DWS layer is named dws
- The ADS layer is named ads
- Temporary-table databases are named xxx_tmp
- Backup databases are named xxx_bak
Chapter 2 Preparing the Environment for Building the Warehouse
Cluster plan

          Server hadoop102    Server hadoop103    Server hadoop104
Hive      Hive
MySQL     MySQL

2.1 Installing Hive & MySQL
2.1.1 Installing Hive & MySQL
See the Hive installation guide: 尚硅谷大数据技术之Hive
2.1.2 Modify hive-site.xml
1) Disable metastore schema verification
[atguigu@hadoop102 conf]$ pwd
/opt/module/hive/conf
[atguigu@hadoop102 conf]$ vim hive-site.xml
Add the following configuration:

<property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
</property>
2.2 Tez, an Execution Engine for Hive
Tez is an execution engine for Hive that performs better than MR. Why? See the figure below.

With Hive generating plain MR jobs, suppose there are four MR jobs that depend on each other. In the figure, the green boxes are Reduce Tasks and the cloud shapes are write barriers, where intermediate results must be persisted to HDFS.
Tez can merge several dependent jobs into a single job, so HDFS is written only once and there are fewer intermediate stages, which greatly improves job performance.
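Once Tez is installed and configured (steps below), the engine can also be toggled per session for a quick comparison. A minimal sketch, assuming a table such as the student table created later in this chapter:
#!/bin/bash
# Illustrative only: run the same query on Tez and on MapReduce to compare runtimes.
hive -e "
set hive.execution.engine=tez;
select count(*) from student;

set hive.execution.engine=mr;
select count(*) from student;
"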
2.2.1 Prepare the installation package
1) Download the tez package: http://tez.apache.org
2) Copy apache-tez-0.9.1-bin.tar.gz to /opt/module on hadoop102
[atguigu@hadoop102 module]$ ls
apache-tez-0.9.1-bin.tar.gz
3) Unpack apache-tez-0.9.1-bin.tar.gz
[atguigu@hadoop102 module]$ tar -zxvf apache-tez-0.9.1-bin.tar.gz
4) Rename the directory
[atguigu@hadoop102 module]$ mv apache-tez-0.9.1-bin/ tez-0.9.1
2.2.2 Configure Tez in Hive
1) Go to Hive's configuration directory: /opt/module/hive/conf
[atguigu@hadoop102 conf]$ pwd
/opt/module/hive/conf
2) Add the tez environment variables and dependency jars to hive-env.sh
[atguigu@hadoop102 conf]$ vim hive-env.sh
Add the following configuration:

# Set HADOOP_HOME to point to a specific hadoop install directory

export HADOOP_HOME=/opt/module/hadoop-2.7.2

# Hive Configuration Directory can be controlled by:

export HIVE_CONF_DIR=/opt/module/hive/conf

# Folder containing extra libraries required for hive compilation/execution can be controlled by:

export TEZ_HOME=/opt/module/tez-0.9.1    # the directory where tez was unpacked
export TEZ_JARS=""
for jar in `ls $TEZ_HOME | grep jar`; do
    export TEZ_JARS=$TEZ_JARS:$TEZ_HOME/$jar
done
for jar in `ls $TEZ_HOME/lib`; do
    export TEZ_JARS=$TEZ_JARS:$TEZ_HOME/lib/$jar
done

export HIVE_AUX_JARS_PATH=/opt/module/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20.jar$TEZ_JARS
3) Add the following to hive-site.xml to switch the Hive execution engine:
<property>
    <name>hive.execution.engine</name>
    <value>tez</value>
</property>
2.2.3 Configure Tez
1) Create a tez-site.xml file under Hive's /opt/module/hive/conf directory
[atguigu@hadoop102 conf]$ pwd
/opt/module/hive/conf
[atguigu@hadoop102 conf]$ vim tez-site.xml
Add the following content:

<property>
    <name>tez.lib.uris</name>
    <value>${fs.defaultFS}/tez/tez-0.9.1,${fs.defaultFS}/tez/tez-0.9.1/lib</value>
</property>
<property>
    <name>tez.lib.uris.classpath</name>
    <value>${fs.defaultFS}/tez/tez-0.9.1,${fs.defaultFS}/tez/tez-0.9.1/lib</value>
</property>
<property>
    <name>tez.use.cluster.hadoop-libs</name>
    <value>true</value>
</property>
<property>
    <name>tez.history.logging.service.class</name>
    <value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
</property>
2.2.4 Upload Tez to the cluster
1) Upload /opt/module/tez-0.9.1 to the /tez path on HDFS
[atguigu@hadoop102 conf]$ hadoop fs -mkdir /tez
[atguigu@hadoop102 conf]$ hadoop fs -put /opt/module/tez-0.9.1/ /tez
[atguigu@hadoop102 conf]$ hadoop fs -ls /tez
/tez/tez-0.9.1
2.2.5 Test
1) Start Hive
[atguigu@hadoop102 hive]$ bin/hive
2) Create an LZO table
hive (default)> create table student(
id int,
name string);
3) Insert data into the table
hive (default)> insert into student values(1,"zhangsan");
4) If no error is reported, the setup works
hive (default)> select * from student;
1       zhangsan
2.2.6 Summary
1) When running on Tez, the process may be killed by the NodeManager for using too much memory:
Caused by: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. Application application_1546781144082_0005 failed 2 times due to AM Container for appattempt_1546781144082_0005_000002 exited with exitCode: -103
For more detailed output, check application tracking page:http://hadoop103:8088/cluster/app/application_1546781144082_0005 Then, click on links to logs of each attempt.
Diagnostics: Container [pid=11116,containerID=container_1546781144082_0005_02_000001] is running beyond virtual memory limits. Current usage: 216.3 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used. Killing container.
A Container running on a worker node tried to use more memory than allowed and was killed by the NodeManager.
[Excerpt] The NodeManager is killing your container. It sounds like you are trying to use hadoop streaming which is running as a child process of the map-reduce task. The NodeManager monitors the entire process tree of the task and if it eats up more memory than the maximum set in mapreduce.map.memory.mb or mapreduce.reduce.memory.mb respectively, we would expect the Nodemanager to kill the task, otherwise your task is stealing memory belonging to other containers, which you don't want.
Solutions:
Option 1 (the one used here): turn off the virtual-memory check. Modify yarn-site.xml:
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
Option 2: set the Map and Reduce task memory in mapred-site.xml (adjust the actual values to your machine's memory and workload):
<property>
    <name>mapreduce.map.memory.mb</name>
    <value>1536</value>
</property>
<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024M</value>
</property>
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>3072</value>
</property>
<property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx2560M</value>
</property>
2.3 Project experience: metadata backup
Back up the metastore metadata (critical: if the metadata is damaged, the whole warehouse may stop working; keep at least two copies on other servers, backed up daily after midnight).
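A minimal backup sketch, assuming the metastore lives in a MySQL database named metastore on hadoop102 and that /opt/backup exists on the target hosts; the host names, credentials and paths here are placeholders, not the project's actual values:
#!/bin/bash
# metastore_backup.sh -- illustrative only; adjust host, user, password and paths.
backup_date=$(date +%F)
dump_file=/tmp/metastore_${backup_date}.sql

# Dump the Hive metastore database (assumed to be called "metastore")
mysqldump -uroot -p'your_password' -h hadoop102 metastore > ${dump_file}

# Keep at least two copies on other servers
scp ${dump_file} atguigu@hadoop103:/opt/backup/
scp ${dump_file} atguigu@hadoop104:/opt/backup/

# Schedule with cron to run after midnight, e.g.:
# 30 0 * * * /home/atguigu/bin/metastore_backup.sh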

Chapter 3 Building the ODS Layer

3.1 Create the database
1) Create the gmall database
hive (default)> create database gmall;
Note: if the database already exists and contains data and must be removed, force-drop it with: drop database gmall cascade;
2) Use the gmall database
hive (default)> use gmall;
3.2 ODS layer
The operational data store: it holds the raw data, with the original logs and data loaded as-is and kept untouched.
3.2.1 Create the start-up log table ods_start_log

1) Create a partitioned table whose input format is LZO, whose output is text, and which supports JSON parsing
hive (gmall)>
drop table if exists ods_start_log;
CREATE EXTERNAL TABLE ods_start_log (line string)
PARTITIONED BY (dt string)
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_start_log';
On Hive's LZO compression, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO
2) Load data
hive (gmall)>
load data inpath '/origin_data/gmall/log/topic_start/2019-02-10' into table gmall.ods_start_log partition(dt='2019-02-10');
Note: always use the yyyy-MM-dd date format, which Hive supports by default.
3) Check that the load succeeded
hive (gmall)> select * from ods_start_log limit 2;
3.2.2 Create the event log table ods_event_log

1) Create a partitioned table whose input format is LZO, whose output is text, and which supports JSON parsing
hive (gmall)>
drop table if exists ods_event_log;
CREATE EXTERNAL TABLE ods_event_log(line string)
PARTITIONED BY (dt string)
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_event_log';
2) Load data
hive (gmall)>
load data inpath '/origin_data/gmall/log/topic_event/2019-02-10' into table gmall.ods_event_log partition(dt='2019-02-10');
Note: always use the yyyy-MM-dd date format, which Hive supports by default.
3) Check that the load succeeded
hive (gmall)> select * from ods_event_log limit 2;
3.2.3 Single quotes vs. double quotes in the shell
1) Create a test.sh file in /home/atguigu/bin
[atguigu@hadoop102 bin]$ vim test.sh
Add the following to the file:
#!/bin/bash
do_date=$1

echo '$do_date'
echo "$do_date"
echo "'$do_date'"
echo '"$do_date"'
echo `date`
2) Check the output
[atguigu@hadoop102 bin]$ test.sh 2019-02-10
$do_date
2019-02-10
'2019-02-10'
"$do_date"
2019年 05月 02日 星期四 21:02:08 CST
3) Summary (recapped in the short sketch after this list):
(1) Single quotes do not expand variables
(2) Double quotes expand variables
(3) Backticks ` execute the quoted command
(4) Single quotes nested inside double quotes: the variable is expanded
(5) Double quotes nested inside single quotes: the variable is not expanded
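A minimal sketch replaying the five rules above, with the expected output in comments (the file name demo_quotes.sh is only an illustration, not part of the project):
#!/bin/bash
# demo_quotes.sh -- illustrative only; mirrors the quoting rules summarized above
do_date=2019-02-10

echo '$do_date'       # rule (1): prints the literal text $do_date
echo "$do_date"       # rule (2): prints 2019-02-10
echo `date +%F`       # rule (3): prints today's date, e.g. 2020-03-23
echo "'$do_date'"     # rule (4): prints '2019-02-10'
echo '"$do_date"'     # rule (5): prints "$do_date"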
3.2.4 ODS load script
1) Create the script in /home/atguigu/bin on hadoop102
[atguigu@hadoop102 bin]$ vim ods_log.sh
Write the following into the script:
#!/bin/bash

# Define variables for easy modification

APP=gmall
hive=/opt/module/hive/bin/hive

# If a date argument is given, use it; otherwise use the day before today

if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi

echo "=== loading logs for $do_date ==="
sql="
load data inpath '/origin_data/gmall/log/topic_start/$do_date' into table "$APP".ods_start_log partition(dt='$do_date');

load data inpath '/origin_data/gmall/log/topic_event/$do_date' into table "$APP".ods_event_log partition(dt='$do_date');
"

$hive -e "$sql"
Note 1:
[ -n string ] tests whether the string is non-empty:
-- non-empty value: returns true
-- empty value: returns false
Note 2:
For the options of the date command, run [atguigu@hadoop102 ~]$ date --help
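A standalone sketch of the two notes above (the file name check_date.sh is only an illustration, not one of the project scripts):
#!/bin/bash
# check_date.sh -- illustrative only
if [ -n "$1" ]; then                  # "$1" non-empty -> true, use the argument
do_date=$1
else
do_date=`date -d "-1 day" +%F`        # %F prints yyyy-MM-dd; see `date --help`
fi
echo "do_date=$do_date"

# ./check_date.sh 2019-02-11   -> do_date=2019-02-11
# ./check_date.sh              -> do_date=<yesterday's date>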
2) Make the script executable
[atguigu@hadoop102 bin]$ chmod 777 ods_log.sh
3) Run the script
[atguigu@hadoop102 module]$ ods_log.sh 2019-02-11
4) Check the loaded data
hive (gmall)>
select * from ods_start_log where dt='2019-02-11' limit 2;
select * from ods_event_log where dt='2019-02-11' limit 2;
5) When to run the script
In production this typically runs between 00:30 and 01:00 each day.
Chapter 4 Building the DWD Layer
Clean the ODS-layer data: remove nulls, dirty records and out-of-range values, convert row-oriented storage to columnar storage, and change the compression format.
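Purely as an illustration of that sentence (the project's actual DDL and load statements follow in this chapter), the cleaning plus the storage and compression switch can look roughly like this; the table names cleaned_demo/raw_demo and the SNAPPY codec are assumptions for the sketch, not the project's settings:
#!/bin/bash
# Illustrative sketch: columnar storage + compression + basic cleaning.
# Table names and the SNAPPY codec are placeholders, not the project's real settings.
hive -e "
create table if not exists cleaned_demo (line string)
partitioned by (dt string)
stored as parquet                                  -- row -> columnar storage
tblproperties ('parquet.compression'='SNAPPY');    -- change the compression format

insert overwrite table cleaned_demo partition (dt='2019-02-10')
select line
from raw_demo
where dt='2019-02-10'
  and line is not null and line <> '';             -- drop null / empty rows
"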
4.1 Parsing the start-up logs into the DWD layer
4.1.1 Create the start-up table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_start_log;
CREATE EXTERNAL TABLE dwd_start_log(
mid_id string,
user_id string,
version_code string,
version_name string,
lang string,
source string,
os string,
area string,
model string,
brand string,
sdk_version string,
gmail string,
height_width string,
app_time string,
network string,
lng string,
lat string,
entry string,
open_ad_type string,
action string,
loading_time string,
detail string,
extend1 string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_start_log/';
4.1.2 Load data into the start-up table
hive (gmall)>
insert overwrite table dwd_start_log
PARTITION (dt='2019-02-10')
select
get_json_object(line,'$.mid') mid_id,
get_json_object(line,'$.uid') user_id,
get_json_object(line,'$.vc') version_code,
get_json_object(line,'$.vn') version_name,
get_json_object(line,'$.l') lang,
get_json_object(line,'$.sr') source,
get_json_object(line,'$.os') os,
get_json_object(line,'$.ar') area,
get_json_object(line,'$.md') model,
get_json_object(line,'$.ba') brand,
get_json_object(line,'$.sv') sdk_version,
get_json_object(line,'$.g') gmail,
get_json_object(line,'$.hw') height_width,
get_json_object(line,'$.t') app_time,
get_json_object(line,'$.nw') network,
get_json_object(line,'$.ln') lng,
get_json_object(line,'$.la') lat,
get_json_object(line,'$.entry') entry,
get_json_object(line,'$.open_ad_type') open_ad_type,
get_json_object(line,'$.action') action,
get_json_object(line,'$.loading_time') loading_time,
get_json_object(line,'$.detail') detail,
get_json_object(line,'$.extend1') extend1
from ods_start_log
where dt='2019-02-10';
3) Test
hive (gmall)> select * from dwd_start_log limit 2;
4.1.3 DWD start-up table load script
1) Create the script in /home/atguigu/bin on hadoop102
[atguigu@hadoop102 bin]$ vim dwd_start_log.sh
Write the following into the script:
#!/bin/bash

# Define variables for easy modification

APP=gmall
hive=/opt/module/hive/bin/hive

# If a date argument is given, use it; otherwise use the day before today

if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi

sql="
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table "$APP".dwd_start_log
PARTITION (dt='$do_date')
select
get_json_object(line,'$.mid') mid_id,
get_json_object(line,'$.uid') user_id,
get_json_object(line,'$.vc') version_code,
get_json_object(line,'$.vn') version_name,
get_json_object(line,'$.l') lang,
get_json_object(line,'$.sr') source,
get_json_object(line,'$.os') os,
get_json_object(line,'$.ar') area,
get_json_object(line,'$.md') model,
get_json_object(line,'$.ba') brand,
get_json_object(line,'$.sv') sdk_version,
get_json_object(line,'$.g') gmail,
get_json_object(line,'$.hw') height_width,
get_json_object(line,'$.t') app_time,
get_json_object(line,'$.nw') network,
get_json_object(line,'$.ln') lng,
get_json_object(line,'$.la') lat,
get_json_object(line,'$.entry') entry,
get_json_object(line,'$.open_ad_type') open_ad_type,
get_json_object(line,'$.action') action,
get_json_object(line,'$.loading_time') loading_time,
get_json_object(line,'$.detail') detail,
get_json_object(line,'$.extend1') extend1
from "$APP".ods_start_log
where dt='$do_date';
"

$hive -e "$sql"
2) Make the script executable
[atguigu@hadoop102 bin]$ chmod 777 dwd_start_log.sh
3) Run the script
[atguigu@hadoop102 module]$ dwd_start_log.sh 2019-02-11
4) Check the loaded data
hive (gmall)>
select * from dwd_start_log where dt='2019-02-11' limit 2;
5) When to run the script
In production this typically runs between 00:30 and 01:00 each day.
4.2 Parsing the event logs into the DWD layer
4.2.1 Create the base detail table
The detail table stores the detail records converted from the ODS-layer raw table.

1) Create the base event-log detail table
hive (gmall)>
drop table if exists dwd_base_event_log;
CREATE EXTERNAL TABLE dwd_base_event_log(
mid_id string,
user_id string,
version_code string,
version_name string,
lang string,
source string,
os string,
area string,
model string,
brand string,
sdk_version string,
gmail string,
height_width string,
app_time string,
network string,
lng string,
lat string,
event_name string,
event_json string,
server_time string)
PARTITIONED BY (dt string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_base_event_log/';
2) Note: event_name and event_json hold the event name and the whole event JSON. One raw log line contains many events (a 1-to-N relationship), so the raw log has to be flattened here, which requires a custom UDF and UDTF.
4.2.2 Custom UDF (parses the common fields)

1) Create a maven project: hivefunction
2) Create the package: com.atguigu.udf
3) Add the following to pom.xml
<properties>
    <project.build.sourceEncoding>UTF8</project.build.sourceEncoding>
    <hive.version>1.2.1</hive.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>${hive.version}</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
4) The UDF parses the common (cm) fields
package com.atguigu.udf;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONException;
import org.json.JSONObject;

public class BaseFieldUDF extends UDF {

public String evaluate(String line, String jsonkeysString) {
    
    // 0. Prepare a StringBuilder
    StringBuilder sb = new StringBuilder();

    // 1. Split jsonkeys: mid uid vc vn l sr os ar md ...
    String[] jsonkeys = jsonkeysString.split(",");

    // 2. Split line: server time | json
    String[] logContents = line.split("\\|");

    // 3. Validity check
    if (logContents.length != 2 || StringUtils.isBlank(logContents[1])) {
        return "";
    }

    // 4. Parse the json
    try {
        JSONObject jsonObject = new JSONObject(logContents[1]);

        // Get the cm object
        JSONObject base = jsonObject.getJSONObject("cm");

        // Iterate over the keys and collect the values
        for (int i = 0; i < jsonkeys.length; i++) {
            String filedName = jsonkeys[i].trim();

            if (base.has(filedName)) {
                sb.append(base.getString(filedName)).append("\t");
            } else {
                sb.append("\t");
            }
        }

        sb.append(jsonObject.getString("et")).append("\t");
        sb.append(logContents[0]).append("\t");
    } catch (JSONException e) {
        e.printStackTrace();
    }

    return sb.toString();
}

public static void main(String[] args) {

    String line = "1541217850324|{\"cm\":{\"mid\":\"m7856\",\"uid\":\"u8739\",\"ln\":\"-74.8\",\"sv\":\"V2.2.2\",\"os\":\"8.1.3\",\"g\":\"[email protected]\",\"nw\":\"3G\",\"l\":\"es\",\"vc\":\"6\",\"hw\":\"640*960\",\"ar\":\"MX\",\"t\":\"1541204134250\",\"la\":\"-31.7\",\"md\":\"huawei-17\",\"vn\":\"1.1.2\",\"sr\":\"O\",\"ba\":\"Huawei\"},\"ap\":\"weather\",\"et\":[{\"ett\":\"1541146624055\",\"en\":\"display\",\"kv\":{\"goodsid\":\"n4195\",\"copyright\":\"ESPN\",\"content_provider\":\"CNN\",\"extend2\":\"5\",\"action\":\"2\",\"extend1\":\"2\",\"place\":\"3\",\"showtype\":\"2\",\"category\":\"72\",\"newstype\":\"5\"}},{\"ett\":\"1541213331817\",\"en\":\"loading\",\"kv\":{\"extend2\":\"\",\"loading_time\":\"15\",\"action\":\"3\",\"extend1\":\"\",\"type1\":\"\",\"type\":\"3\",\"loading_way\":\"1\"}},{\"ett\":\"1541126195645\",\"en\":\"ad\",\"kv\":{\"entry\":\"3\",\"show_style\":\"0\",\"action\":\"2\",\"detail\":\"325\",\"source\":\"4\",\"behavior\":\"2\",\"content\":\"1\",\"newstype\":\"5\"}},{\"ett\":\"1541202678812\",\"en\":\"notification\",\"kv\":{\"ap_time\":\"1541184614380\",\"action\":\"3\",\"type\":\"4\",\"content\":\"\"}},{\"ett\":\"1541194686688\",\"en\":\"active_background\",\"kv\":{\"active_source\":\"3\"}}]}";
    String x = new BaseFieldUDF().evaluate(line, "mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,nw,ln,la,t");
    System.out.println(x);
}

}
Note: the main method is only used to test the UDF with mock data.
4.2.3 Custom UDTF (parses the per-event fields)

1) Create the package: com.atguigu.udtf
2) In the com.atguigu.udtf package, create the class EventJsonUDTF
3) The UDTF flattens the event-specific fields
package com.atguigu.udtf;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.json.JSONArray;
import org.json.JSONException;

import java.util.ArrayList;

public class EventJsonUDTF extends GenericUDTF {

//Here we declare the names and types of the output columns:
@Override
public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {

    ArrayList<String> fieldNames = new ArrayList<String>();
    ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();

    fieldNames.add("event_name");
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    fieldNames.add("event_json");
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

    return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
}

//One input record may produce several output rows
@Override
public void process(Object[] objects) throws HiveException {

    // Get the incoming et array
    String input = objects[0].toString();

    // If the input is blank, return immediately and drop the record
    if (StringUtils.isBlank(input)) {
        return;
    } else {

        try {
            // Parse the array of events (ad/favorites/...)
            JSONArray ja = new JSONArray(input);

            if (ja == null)
                return;

            // Iterate over every event
            for (int i = 0; i < ja.length(); i++) {
                String[] result = new String[2];

                try {
                    // Take the event name (ad/favorites/...)
                    result[0] = ja.getJSONObject(i).getString("en");

                    // Take the whole event JSON
                    result[1] = ja.getString(i);
                } catch (JSONException e) {
                    continue;
                }

                // Emit the row
                forward(result);
            }
        } catch (JSONException e) {
            e.printStackTrace();
        }
    }
}

//Called when there are no more records to process; used for cleanup or extra output
@Override
public void close() throws HiveException {

}

}
4) Package the project

5) Upload hivefunction-1.0-SNAPSHOT.jar to /opt/module/hive/ on hadoop102
6) Add the jar to Hive's classpath
hive (gmall)> add jar /opt/module/hive/hivefunction-1.0-SNAPSHOT.jar;
7) Create temporary functions bound to the Java classes
hive (gmall)>
create temporary function base_analizer as 'com.atguigu.udf.BaseFieldUDF';

create temporary function flat_analizer as 'com.atguigu.udtf.EventJsonUDTF';
4.2.4 Parse into the base event-log detail table
1) Parse the event logs into the base detail table
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dwd_base_event_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
event_name,
event_json,
server_time
from
(
select
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[0] as mid_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[1] as user_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[2] as version_code,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[3] as version_name,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[4] as lang,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[5] as source,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[6] as os,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[7] as area,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[8] as model,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[9] as brand,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[10] as sdk_version,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[11] as gmail,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[12] as height_width,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[13] as app_time,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[14] as network,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[15] as lng,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[16] as lat,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[17] as ops,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[18] as server_time
from ods_event_log where dt='2019-02-10' and base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la')<>''
) sdk_log lateral view flat_analizer(ops) tmp_k as event_name, event_json;
2) Test
hive (gmall)> select * from dwd_base_event_log limit 2;
4.2.5 DWD base event-log load script
1) Create the script in /home/atguigu/bin on hadoop102
[atguigu@hadoop102 bin]$ vim dwd_base_log.sh
Write the following into the script:
#!/bin/bash

# Define variables for easy modification

APP=gmall
hive=/opt/module/hive/bin/hive

# If a date argument is given, use it; otherwise use the day before today

if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi

sql="
add jar /opt/module/hive/hivefunction-1.0-SNAPSHOT.jar;

create temporary function base_analizer as 'com.atguigu.udf.BaseFieldUDF';
create temporary function flat_analizer as 'com.atguigu.udtf.EventJsonUDTF';

set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table "$APP".dwd_base_event_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
event_name,
event_json,
server_time
from
(
select
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[0] as mid_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[1] as user_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[2] as version_code,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[3] as version_name,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[4] as lang,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[5] as source,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[6] as os,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[7] as area,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[8] as model,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[9] as brand,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[10] as sdk_version,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[11] as gmail,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[12] as height_width,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[13] as app_time,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[14] as network,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[15] as lng,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[16] as lat,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[17] as ops,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[18] as server_time
from "$APP".ods_event_log where dt='$do_date' and base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la')<>''
) sdk_log lateral view flat_analizer(ops) tmp_k as event_name, event_json;
"

$hive -e "$sql"
2) Make the script executable
[atguigu@hadoop102 bin]$ chmod 777 dwd_base_log.sh
3) Run the script
[atguigu@hadoop102 module]$ dwd_base_log.sh 2019-02-11
4) Check the loaded data
hive (gmall)>
select * from dwd_base_event_log where dt='2019-02-11' limit 2;
5) When to run the script
In production this typically runs between 00:30 and 01:00 each day.
4.3 Populating the DWD Event Tables

4.3.1 Product display (click) table

1) Table creation statement
hive (gmall)>
drop table if exists dwd_display_log;
CREATE EXTERNAL TABLE dwd_display_log(
mid_id string,
user_id string,
version_code string,
version_name string,
lang string,
source string,
os string,
area string,
model string,
brand string,
sdk_version string,
gmail string,
height_width string,
app_time string,
network string,
lng string,
lat string,
action string,
goodsid string,
place string,
extend1 string,
category string,
server_time string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_display_log/';
2) Load data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dwd_display_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.goodsid') goodsid,
get_json_object(event_json,'$.kv.place') place,
get_json_object(event_json,'$.kv.extend1') extend1,
get_json_object(event_json,'$.kv.category') category,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='display';
3) Test
hive (gmall)> select * from dwd_display_log limit 2;
4.3.2 Product detail page table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_newsdetail_log;
CREATE EXTERNAL TABLE dwd_newsdetail_log(
mid_id string,
user_id string,
version_code string,
version_name string,
lang string,
source string,
os string,
area string,
model string,
brand string,
sdk_version string,
gmail string,
height_width string,
app_time string,
network string,
lng string,
lat string,
entry string,
action string,
goodsid string,
showtype string,
news_staytime string,
loading_time string,
type1 string,
category string,
server_time string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_newsdetail_log/';
2) Load data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dwd_newsdetail_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.entry') entry,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.goodsid') goodsid,
get_json_object(event_json,'$.kv.showtype') showtype,
get_json_object(event_json,'$.kv.news_staytime') news_staytime,
get_json_object(event_json,'$.kv.loading_time') loading_time,
get_json_object(event_json,'$.kv.type1') type1,
get_json_object(event_json,'$.kv.category') category,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='newsdetail';
3) Test
hive (gmall)> select * from dwd_newsdetail_log limit 2;
4.3.3 Product list page table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_loading_log;
CREATE EXTERNAL TABLE dwd_loading_log(
mid_id string,
user_id string,
version_code string,
version_name string,
lang string,
source string,
os string,
area string,
model string,
brand string,
sdk_version string,
gmail string,
height_width string,
app_time string,
network string,
lng string,
lat string,
action string,
loading_time string,
loading_way string,
extend1 string,
extend2 string,
type string,
type1 string,
server_time string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_loading_log/';
2) Load data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dwd_loading_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.loading_time') loading_time,
get_json_object(event_json,'$.kv.loading_way') loading_way,
get_json_object(event_json,'$.kv.extend1') extend1,
get_json_object(event_json,'$.kv.extend2') extend2,
get_json_object(event_json,'$.kv.type') type,
get_json_object(event_json,'$.kv.type1') type1,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='loading';
3) Test
hive (gmall)> select * from dwd_loading_log limit 2;
4.3.4 Ad table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_ad_log;
CREATE EXTERNAL TABLE dwd_ad_log(
mid_id string,
user_id string,
version_code string,
version_name string,
lang string,
source string,
os string,
area string,
model string,
brand string,
sdk_version string,
gmail string,
height_width string,
app_time string,
network string,
lng string,
lat string,
entry string,
action string,
content string,
detail string,
ad_source string,
behavior string,
newstype string,
show_style string,
server_time string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_ad_log/';
2) Load data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dwd_ad_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.entry') entry,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.content') content,
get_json_object(event_json,'$.kv.detail') detail,
get_json_object(event_json,'$.kv.source') ad_source,
get_json_object(event_json,'$.kv.behavior') behavior,
get_json_object(event_json,'$.kv.newstype') newstype,
get_json_object(event_json,'$.kv.show_style') show_style,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='ad';
3) Test
hive (gmall)> select * from dwd_ad_log limit 2;
4.3.5 Notification table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_notification_log;
CREATE EXTERNAL TABLE dwd_notification_log(
mid_id string,
user_id string,
version_code string,
version_name string,
lang string,
source string,
os string,
area string,
model string,
brand string,
sdk_version string,
gmail string,
height_width string,
app_time string,
network string,
lng string,
lat string,
action string,
noti_type string,
ap_time string,
content string,
server_time string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_notification_log/';
2) Load data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dwd_notification_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.noti_type') noti_type,
get_json_object(event_json,'$.kv.ap_time') ap_time,
get_json_object(event_json,'$.kv.content') content,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='notification';
3) Test
hive (gmall)> select * from dwd_notification_log limit 2;
4.3.6 Foreground-active user table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_active_foreground_log;
CREATE EXTERNAL TABLE dwd_active_foreground_log(
mid_id string,
user_id string,
version_code string,
version_name string,
lang string,
source string,
os string,
area string,
model string,
brand string,
sdk_version string,
gmail string,
height_width string,
app_time string,
network string,
lng string,
lat string,
push_id string,
access string,
server_time string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_foreground_log/';
2) Load data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dwd_active_foreground_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.push_id') push_id,
get_json_object(event_json,'$.kv.access') access,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='active_foreground';
3) Test
hive (gmall)> select * from dwd_active_foreground_log limit 2;
4.3.7 Background-active user table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_active_background_log;
CREATE EXTERNAL TABLE dwd_active_background_log(
mid_id string,
user_id string,
version_code string,
version_name string,
lang string,
source string,
os string,
area string,
model string,
brand string,
sdk_version string,
gmail string,
height_width string,
app_time string,
network string,
lng string,
lat string,
active_source string,
server_time string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_background_log/';
2) Load data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dwd_active_background_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.active_source') active_source,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='active_background';
3) Test
hive (gmall)> select * from dwd_active_background_log limit 2;
4.3.8 Comment table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_comment_log;
CREATE EXTERNAL TABLE dwd_comment_log(
mid_id string,
user_id string,
version_code string,
version_name string,
lang string,
source string,
os string,
area string,
model string,
brand string,
sdk_version string,
gmail string,
height_width string,
app_time string,
network string,
lng string,
lat string,
comment_id int,
userid int,
p_comment_id int,
content string,
addtime string,
other_id int,
praise_count int,
reply_count int,
server_time string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_comment_log/';
2) Load data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dwd_comment_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.comment_id') comment_id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.p_comment_id') p_comment_id,
get_json_object(event_json,'$.kv.content') content,
get_json_object(event_json,'$.kv.addtime') addtime,
get_json_object(event_json,'$.kv.other_id') other_id,
get_json_object(event_json,'$.kv.praise_count') praise_count,
get_json_object(event_json,'$.kv.reply_count') reply_count,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='comment';
3) Test
hive (gmall)> select * from dwd_comment_log limit 2;
4.3.9 Favorites table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_favorites_log;
CREATE EXTERNAL TABLE dwd_favorites_log(
mid_id string,
user_id string,
version_code string,
version_name string,
lang string,
source string,
os string,
area string,
model string,
brand string,
sdk_version string,
gmail string,
height_width string,
app_time string,
network string,
lng string,
lat string,
id int,
course_id int,
userid int,
add_time string,
server_time string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_favorites_log/';
2) Load data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dwd_favorites_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.id') id,
get_json_object(event_json,'$.kv.course_id') course_id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.add_time') add_time,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='favorites';
3) Test
hive (gmall)> select * from dwd_favorites_log limit 2;
4.3.10 Praise (like) table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_praise_log;
CREATE EXTERNAL TABLE dwd_praise_log(
mid_id string,
user_id string,
version_code string,
version_name string,
lang string,
source string,
os string,
area string,
model string,
brand string,
sdk_version string,
gmail string,
height_width string,
app_time string,
network string,
lng string,
lat string,
id string,
userid string,
target_id string,
type string,
add_time string,
server_time string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_praise_log/';
2) Load data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dwd_praise_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.id') id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.target_id') target_id,
get_json_object(event_json,'$.kv.type') type,
get_json_object(event_json,'$.kv.add_time') add_time,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='praise';
3) Test
hive (gmall)> select * from dwd_praise_log limit 2;
4.3.11 Error log table
1) Table creation statement
hive (gmall)>
drop table if exists dwd_error_log;
CREATE EXTERNAL TABLE dwd_error_log(
mid_id string,
user_id string,
version_code string,
version_name string,
lang string,
source string,
os string,
area string,
model string,
brand string,
sdk_version string,
gmail string,
height_width string,
app_time string,
network string,
lng string,
lat string,
errorBrief string,
errorDetail string,
server_time string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_error_log/';
2) Load data
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dwd_error_log
PARTITION (dt='2019-02-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.errorBrief') errorBrief,
get_json_object(event_json,'$.kv.errorDetail') errorDetail,
server_time
from dwd_base_event_log
where dt='2019-02-10' and event_name='error';
3) Test
hive (gmall)> select * from dwd_error_log limit 2;
4.3.12 DWD event-table load script
1) Create the script in /home/atguigu/bin on hadoop102
[atguigu@hadoop102 bin]$ vim dwd_event_log.sh
Write the following into the script:
#!/bin/bash

# Define variables for easy modification

APP=gmall
hive=/opt/module/hive/bin/hive

# If a date argument is given, use it; otherwise use the day before today

if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi

sql="
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table "$APP".dwd_display_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.goodsid') goodsid,
get_json_object(event_json,'$.kv.place') place,
get_json_object(event_json,'$.kv.extend1') extend1,
get_json_object(event_json,'$.kv.category') category,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='display';

insert overwrite table "$APP".dwd_newsdetail_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.entry') entry,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.goodsid') goodsid,
get_json_object(event_json,'$.kv.showtype') showtype,
get_json_object(event_json,'$.kv.news_staytime') news_staytime,
get_json_object(event_json,'$.kv.loading_time') loading_time,
get_json_object(event_json,'$.kv.type1') type1,
get_json_object(event_json,'$.kv.category') category,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='newsdetail';

insert overwrite table "$APP".dwd_loading_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.loading_time') loading_time,
get_json_object(event_json,'$.kv.loading_way') loading_way,
get_json_object(event_json,'$.kv.extend1') extend1,
get_json_object(event_json,'$.kv.extend2') extend2,
get_json_object(event_json,'$.kv.type') type,
get_json_object(event_json,'$.kv.type1') type1,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='loading';

insert overwrite table "$APP".dwd_ad_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.entry') entry,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.content') content,
get_json_object(event_json,'$.kv.detail') detail,
get_json_object(event_json,'$.kv.source') ad_source,
get_json_object(event_json,'$.kv.behavior') behavior,
get_json_object(event_json,'$.kv.newstype') newstype,
get_json_object(event_json,'$.kv.show_style') show_style,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='ad';

insert overwrite table "$APP".dwd_notification_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.noti_type') noti_type,
get_json_object(event_json,'$.kv.ap_time') ap_time,
get_json_object(event_json,'$.kv.content') content,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='notification';

insert overwrite table "$APP".dwd_active_foreground_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.push_id') push_id,
get_json_object(event_json,'$.kv.access') access,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='active_foreground';

insert overwrite table "$APP".dwd_active_background_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.active_source') active_source,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='active_background';

insert overwrite table "$APP".dwd_comment_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.comment_id') comment_id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.p_comment_id') p_comment_id,
get_json_object(event_json,'$.kv.content') content,
get_json_object(event_json,'$.kv.addtime') addtime,
get_json_object(event_json,'$.kv.other_id') other_id,
get_json_object(event_json,'$.kv.praise_count') praise_count,
get_json_object(event_json,'$.kv.reply_count') reply_count,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='comment';

insert overwrite table "$APP".dwd_favorites_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.id') id,
get_json_object(event_json,'$.kv.course_id') course_id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.add_time') add_time,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='favorites';

insert overwrite table "$APP".dwd_praise_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.id') id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.target_id') target_id,
get_json_object(event_json,'$.kv.type') type,
get_json_object(event_json,'$.kv.add_time') add_time,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='praise';

insert overwrite table "$APP".dwd_error_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.errorBrief') errorBrief,
get_json_object(event_json,'$.kv.errorDetail') errorDetail,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='error';
"

$hive -e "$sql"
2) Make the script executable
[atguigu@hadoop102 bin]$ chmod 777 dwd_event_log.sh
3) Run the script
[atguigu@hadoop102 module]$ dwd_event_log.sh 2019-02-11
4) Check the loaded data
hive (gmall)>
select * from dwd_comment_log where dt='2019-02-11' limit 2;
5) When to run the script
In production this typically runs between 00:30 and 01:00 each day.
第5章 业务知识准备
5.1 业务术语

  1. 用户
    用户以设备为判断标准,在移动统计中,每个独立设备认为是一个独立用户。Android系统根据IMEI号,IOS系统根据OpenUDID来标识一个独立用户,每部手机一个用户。
  2. 新增用户
    首次联网使用应用的用户。如果一个用户首次打开某APP,那这个用户定义为新增用户;卸载再安装的设备,不会被算作一次新增。新增用户包括日新增用户、周新增用户、月新增用户。
  3. 活跃用户
    打开应用的用户即为活跃用户,不考虑用户的使用情况。每天一台设备打开多次会被计为一个活跃用户。
  4. 周(月)活跃用户
    某个自然周(月)内启动过应用的用户,该周(月)内的多次启动只记一个活跃用户。
  5. 月活跃率
    月活跃用户与截止到该月累计的用户总和之间的比例。
  6. 沉默用户
    用户仅在安装当天(次日)启动一次,后续时间无再启动行为。该指标可以反映新增用户质量和用户与APP的匹配程度。
  7. 版本分布
    不同版本的周内各天新增用户数,活跃用户数和启动次数。利于判断APP各个版本之间的优劣和用户行为习惯。
  8. 本周回流用户
    上周未启动过应用,本周启动了应用的用户。
  9. 连续n周活跃用户
    连续n周,每周至少启动一次。
  10. 忠诚用户
    连续活跃5周以上的用户
  11. 连续活跃用户
    连续2周及以上活跃的用户
  12. 近期流失用户
    连续n(2<= n <= 4)周没有启动应用的用户。(第n+1周没有启动过)
  13. 留存用户
    某段时间内的新增用户,经过一段时间后,仍然使用应用的被认作是留存用户;这部分用户占当时新增用户的比例即是留存率。
    例如,5月份新增用户200,这200人在6月份启动过应用的有100人,7月份启动过应用的有80人,8月份启动过应用的有50人;则5月份新增用户一个月后的留存率是50%,二个月后的留存率是40%,三个月后的留存率是25%。
  14. 用户新鲜度
    每天启动应用的新老用户比例,即新增用户数占活跃用户数的比例。
  15. 单次使用时长
    每次启动使用的时间长度。
  16. 日使用时长
    累计一天内的使用时间长度。
  17. 启动次数计算标准
    IOS平台应用退到后台就算一次独立的启动;Android平台我们规定,两次启动之间的间隔小于30秒,被计算一次启动。用户在使用过程中,若因收发短信或接电话等退出应用30秒又再次返回应用中,那这两次行为应该是延续而非独立的,所以可以被算作一次使用行为,即一次启动。业内大多使用30秒这个标准,但用户还是可以自定义此时间间隔。
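下面给出一个按 30 秒间隔口径统计启动次数的 HiveQL 思路示例(仅为示意:假设以 dwd_start_log 中的 app_time 字段近似表示每次进入前台的时间,且其为毫秒时间戳,实际字段含义以真实表结构为准):
hive (gmall)>
select
    mid_id,
    count(*) start_count
from
(
    select
        mid_id,
        cast(app_time as bigint) ts,
        -- 取同一设备上一条记录的时间,首条记录默认取0
        lag(cast(app_time as bigint),1,0) over(partition by mid_id order by cast(app_time as bigint)) last_ts
    from dwd_start_log
    where dt='2019-02-10'
) t1
where ts - last_ts > 30*1000   -- 与上一次间隔超过30秒才算一次新的启动
group by mid_id;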
5.2 系统函数
5.2.1 collect_set函数
1)创建原数据表
hive (gmall)>
drop table if exists stud;
create table stud (name string, area string, course string, score int);
2)向原数据表中插入数据
hive (gmall)>
insert into table stud values('zhang3','bj','math',88);
insert into table stud values('li4','bj','math',99);
insert into table stud values('wang5','sh','chinese',92);
insert into table stud values('zhao6','sh','chinese',54);
insert into table stud values('tian7','bj','chinese',91);
3)查询表中数据
hive (gmall)> select * from stud;
stud.name stud.area stud.course stud.score
zhang3 bj math 88
li4 bj math 99
wang5 sh chinese 92
zhao6 sh chinese 54
tian7 bj chinese 91
4)把同一分组的不同行的数据聚合成一个集合
hive (gmall)> select course, collect_set(area), avg(score) from stud group by course;
chinese ["sh","bj"] 79.0
math ["bj"] 93.5
5)用下标可以取集合中的某一个元素
hive (gmall)> select course, collect_set(area)[0], avg(score) from stud group by course;
chinese sh 79.0
math bj 93.5
5.2.2 日期处理函数
1)date_format函数(根据格式整理日期)
hive (gmall)> select date_format('2019-02-10','yyyy-MM');
2019-02
2)date_add函数(加减日期)
hive (gmall)> select date_add('2019-02-10',-1);
2019-02-09
hive (gmall)> select date_add('2019-02-10',1);
2019-02-11
3)next_day函数
(1)取当前天的下一个周一
hive (gmall)> select next_day('2019-02-12','MO');
2019-02-18
说明:第二个参数为星期几的英文全称或缩写,如周一可写成 Monday、Mon 或 MO。
(2)取当前周的周一
hive (gmall)> select date_add(next_day('2019-02-12','MO'),-7);
2019-02-11
4)last_day函数(求当月最后一天日期)
hive (gmall)> select last_day('2019-02-10');
2019-02-28
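后面按周统计时会频繁用到"本周周一、本周周日"这样的边界日期,这里补充一个组合使用 next_day 和 date_add 的小示例(仅为示意,结果按 2019-02-12 为周二推算):
hive (gmall)>
select
    date_add(next_day('2019-02-12','MO'),-7) monday_date,
    date_add(next_day('2019-02-12','MO'),-1) sunday_date;
monday_date sunday_date
2019-02-11  2019-02-17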
第6章 需求一:用户活跃主题
6.1 DWS层
目标:统计当日、当周、当月活动的每个设备明细
6.1.1 每日活跃设备明细

1)建表语句
hive (gmall)>
drop table if exists dws_uv_detail_day;
create external table dws_uv_detail_day
(
mid_id string COMMENT ‘设备唯一标识’,
user_id string COMMENT ‘用户标识’,
version_code string COMMENT ‘程序版本号’,
version_name string COMMENT ‘程序版本名’,
lang string COMMENT ‘系统语言’,
source string COMMENT ‘渠道号’,
os string COMMENT ‘安卓系统版本’,
area string COMMENT ‘区域’,
model string COMMENT ‘手机型号’,
brand string COMMENT ‘手机品牌’,
sdk_version string COMMENT ‘sdkVersion’,
gmail string COMMENT ‘gmail’,
height_width string COMMENT ‘屏幕宽高’,
app_time string COMMENT ‘客户端日志产生时的时间’,
network string COMMENT ‘网络模式’,
lng string COMMENT ‘经度’,
lat string COMMENT ‘纬度’
)
partitioned by(dt string)
stored as parquet
location ‘/warehouse/gmall/dws/dws_uv_detail_day’
;
2)数据导入
以设备 mid_id 为 key 按天聚合:如果某个设备在一天中出现了多种操作系统、多个系统版本、多个地区或登录了不同账号,则用 collect_set 去重后以 '|' 拼接保留,保证每个设备每天只有一行记录。
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dws_uv_detail_day
partition(dt=‘2019-02-10’)
select
mid_id,
concat_ws(’|’, collect_set(user_id)) user_id,
concat_ws(’|’, collect_set(version_code)) version_code,
concat_ws(’|’, collect_set(version_name)) version_name,
concat_ws(’|’, collect_set(lang))lang,
concat_ws(’|’, collect_set(source)) source,
concat_ws(’|’, collect_set(os)) os,
concat_ws(’|’, collect_set(area)) area,
concat_ws(’|’, collect_set(model)) model,
concat_ws(’|’, collect_set(brand)) brand,
concat_ws(’|’, collect_set(sdk_version)) sdk_version,
concat_ws(’|’, collect_set(gmail)) gmail,
concat_ws(’|’, collect_set(height_width)) height_width,
concat_ws(’|’, collect_set(app_time)) app_time,
concat_ws(’|’, collect_set(network)) network,
concat_ws(’|’, collect_set(lng)) lng,
concat_ws(’|’, collect_set(lat)) lat
from dwd_start_log
where dt=‘2019-02-10’
group by mid_id;
3)查询导入结果
hive (gmall)> select * from dws_uv_detail_day limit 1;
hive (gmall)> select count(*) from dws_uv_detail_day;
4)思考:不同渠道来源的每日活跃数统计怎么计算?
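一种参考思路(仅为示意):由于 dws_uv_detail_day 中的 source 字段是把同一设备的多个渠道用 '|' 拼接后的结果,分渠道统计时可以直接回到启动日志明细,按渠道对设备去重计数:
hive (gmall)>
select
    source,
    count(distinct mid_id) dau
from dwd_start_log
where dt='2019-02-10'
group by source;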
6.1.2 每周活跃设备明细

根据日用户访问明细,获得周用户访问明细。
1)建表语句
hive (gmall)>
drop table if exists dws_uv_detail_wk;
create external table dws_uv_detail_wk(
mid_id string COMMENT ‘设备唯一标识’,
user_id string COMMENT ‘用户标识’,
version_code string COMMENT ‘程序版本号’,
version_name string COMMENT ‘程序版本名’,
lang string COMMENT ‘系统语言’,
source string COMMENT ‘渠道号’,
os string COMMENT ‘安卓系统版本’,
area string COMMENT ‘区域’,
model string COMMENT ‘手机型号’,
brand string COMMENT ‘手机品牌’,
sdk_version string COMMENT ‘sdkVersion’,
gmail string COMMENT ‘gmail’,
height_width string COMMENT ‘屏幕宽高’,
app_time string COMMENT ‘客户端日志产生时的时间’,
network string COMMENT ‘网络模式’,
lng string COMMENT ‘经度’,
lat string COMMENT ‘纬度’,
monday_date string COMMENT ‘周一日期’,
sunday_date string COMMENT ‘周日日期’
) COMMENT ‘活跃用户按周明细’
PARTITIONED BY (wk_dt string)
stored as parquet
location ‘/warehouse/gmall/dws/dws_uv_detail_wk/’
;
2)数据导入
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dws_uv_detail_wk partition(wk_dt)
select
mid_id,
concat_ws(’|’, collect_set(user_id)) user_id,
concat_ws(’|’, collect_set(version_code)) version_code,
concat_ws(’|’, collect_set(version_name)) version_name,
concat_ws(’|’, collect_set(lang)) lang,
concat_ws(’|’, collect_set(source)) source,
concat_ws(’|’, collect_set(os)) os,
concat_ws(’|’, collect_set(area)) area,
concat_ws(’|’, collect_set(model)) model,
concat_ws(’|’, collect_set(brand)) brand,
concat_ws(’|’, collect_set(sdk_version)) sdk_version,
concat_ws(’|’, collect_set(gmail)) gmail,
concat_ws(’|’, collect_set(height_width)) height_width,
concat_ws(’|’, collect_set(app_time)) app_time,
concat_ws(’|’, collect_set(network)) network,
concat_ws(’|’, collect_set(lng)) lng,
concat_ws(’|’, collect_set(lat)) lat,
date_add(next_day(‘2019-02-10’,‘MO’),-7),
date_add(next_day(‘2019-02-10’,‘MO’),-1),
concat(date_add( next_day(‘2019-02-10’,‘MO’),-7), ‘_’ , date_add(next_day(‘2019-02-10’,‘MO’),-1)
)
from dws_uv_detail_day
where dt>=date_add(next_day(‘2019-02-10’,‘MO’),-7) and dt<=date_add(next_day(‘2019-02-10’,‘MO’),-1)
group by mid_id;
3)查询导入结果
hive (gmall)> select * from dws_uv_detail_wk limit 1;
hive (gmall)> select count(*) from dws_uv_detail_wk;
6.1.3 每月活跃设备明细

1)建表语句
hive (gmall)>
drop table if exists dws_uv_detail_mn;

create external table dws_uv_detail_mn(
mid_id string COMMENT ‘设备唯一标识’,
user_id string COMMENT ‘用户标识’,
version_code string COMMENT ‘程序版本号’,
version_name string COMMENT ‘程序版本名’,
lang string COMMENT ‘系统语言’,
source string COMMENT ‘渠道号’,
os string COMMENT ‘安卓系统版本’,
area string COMMENT ‘区域’,
model string COMMENT ‘手机型号’,
brand string COMMENT ‘手机品牌’,
sdk_version string COMMENT ‘sdkVersion’,
gmail string COMMENT ‘gmail’,
height_width string COMMENT ‘屏幕宽高’,
app_time string COMMENT ‘客户端日志产生时的时间’,
network string COMMENT ‘网络模式’,
lng string COMMENT ‘经度’,
lat string COMMENT ‘纬度’
) COMMENT ‘活跃用户按月明细’
PARTITIONED BY (mn string)
stored as parquet
location ‘/warehouse/gmall/dws/dws_uv_detail_mn/’
;
2)数据导入
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dws_uv_detail_mn partition(mn)
select
mid_id,
concat_ws(’|’, collect_set(user_id)) user_id,
concat_ws(’|’, collect_set(version_code)) version_code,
concat_ws(’|’, collect_set(version_name)) version_name,
concat_ws(’|’, collect_set(lang)) lang,
concat_ws(’|’, collect_set(source)) source,
concat_ws(’|’, collect_set(os)) os,
concat_ws(’|’, collect_set(area)) area,
concat_ws(’|’, collect_set(model)) model,
concat_ws(’|’, collect_set(brand)) brand,
concat_ws(’|’, collect_set(sdk_version)) sdk_version,
concat_ws(’|’, collect_set(gmail)) gmail,
concat_ws(’|’, collect_set(height_width)) height_width,
concat_ws(’|’, collect_set(app_time)) app_time,
concat_ws(’|’, collect_set(network)) network,
concat_ws(’|’, collect_set(lng)) lng,
concat_ws(’|’, collect_set(lat)) lat,
date_format(‘2019-02-10’,‘yyyy-MM’)
from dws_uv_detail_day
where date_format(dt,‘yyyy-MM’) = date_format(‘2019-02-10’,‘yyyy-MM’)
group by mid_id;
3)查询导入结果
hive (gmall)> select * from dws_uv_detail_mn limit 1;
hive (gmall)> select count(*) from dws_uv_detail_mn ;
6.1.4 DWS层加载数据脚本
1)在hadoop102的/home/atguigu/bin目录下创建脚本
[atguigu@hadoop102 bin]$ vim dws_uv_log.sh
在脚本中编写如下内容
#!/bin/bash

# 定义变量方便修改
APP=gmall
hive=/opt/module/hive/bin/hive

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi

sql="
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table "$APP".dws_uv_detail_day partition(dt='$do_date')
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat
from "$APP".dwd_start_log
where dt='$do_date'
group by mid_id;

insert overwrite table "$APP".dws_uv_detail_wk partition(wk_dt)
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat,
date_add(next_day('$do_date','MO'),-7),
date_add(next_day('$do_date','MO'),-1),
concat(date_add(next_day('$do_date','MO'),-7),'_',date_add(next_day('$do_date','MO'),-1))
from "$APP".dws_uv_detail_day
where dt>=date_add(next_day('$do_date','MO'),-7) and dt<=date_add(next_day('$do_date','MO'),-1)
group by mid_id;

insert overwrite table "$APP".dws_uv_detail_mn partition(mn)
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat,
date_format('$do_date','yyyy-MM')
from "$APP".dws_uv_detail_day
where date_format(dt,'yyyy-MM') = date_format('$do_date','yyyy-MM')
group by mid_id;
"

$hive -e "$sql"
2)增加脚本执行权限
[atguigu@hadoop102 bin]$ chmod 777 dws_uv_log.sh
3)脚本使用
[atguigu@hadoop102 module]$ dws_uv_log.sh 2019-02-11
4)查询结果
hive (gmall)> select count(*) from dws_uv_detail_day where dt='2019-02-11';
hive (gmall)> select count(*) from dws_uv_detail_wk;
hive (gmall)> select count(*) from dws_uv_detail_mn ;
5)脚本执行时间
企业开发中一般在每日凌晨30分~1点
6.2 ADS层
目标:当日、当周、当月活跃设备数
6.2.1 活跃设备数

1)建表语句
hive (gmall)>
drop table if exists ads_uv_count;
create external table ads_uv_count(
dt string COMMENT ‘统计日期’,
day_count bigint COMMENT ‘当日用户数量’,
wk_count bigint COMMENT ‘当周用户数量’,
mn_count bigint COMMENT ‘当月用户数量’,
is_weekend string COMMENT ‘Y,N是否是周末,用于得到本周最终结果’,
is_monthend string COMMENT ‘Y,N是否是月末,用于得到本月最终结果’
) COMMENT ‘活跃设备数’
row format delimited fields terminated by ‘\t’
location ‘/warehouse/gmall/ads/ads_uv_count/’
;
2)导入数据
hive (gmall)>
insert into table ads_uv_count
select
‘2019-02-10’ dt,
daycount.ct,
wkcount.ct,
mncount.ct,
if(date_add(next_day(‘2019-02-10’,‘MO’),-1)=‘2019-02-10’,‘Y’,‘N’) ,
if(last_day(‘2019-02-10’)=‘2019-02-10’,‘Y’,‘N’)
from
(
select
‘2019-02-10’ dt,
count(*) ct
from dws_uv_detail_day
where dt=‘2019-02-10’
)daycount join
(
select
‘2019-02-10’ dt,
count(*) ct
from dws_uv_detail_wk
where wk_dt=concat(date_add(next_day(‘2019-02-10’,‘MO’),-7),’_’ ,date_add(next_day(‘2019-02-10’,‘MO’),-1) )
) wkcount on daycount.dt=wkcount.dt
join
(
select
‘2019-02-10’ dt,
count (*) ct
from dws_uv_detail_mn
where mn=date_format(‘2019-02-10’,‘yyyy-MM’)
)mncount on daycount.dt=mncount.dt
;
3)查询导入结果
hive (gmall)> select * from ads_uv_count ;
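说明:建表时的 is_weekend、is_monthend 标记用于从多次计算结果中筛出周/月的最终值。例如取各周最终的周活跃设备数,可以按如下方式查询(仅为示意):
hive (gmall)>
select dt, wk_count from ads_uv_count where is_weekend='Y';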
6.2.2 ADS层加载数据脚本
1)在hadoop102的/home/atguigu/bin目录下创建脚本
[atguigu@hadoop102 bin]$ vim ads_uv_log.sh
在脚本中编写如下内容
#!/bin/bash

# 定义变量方便修改
APP=gmall
hive=/opt/module/hive/bin/hive

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi

sql="
set hive.exec.dynamic.partition.mode=nonstrict;

insert into table "$APP".ads_uv_count
select
'$do_date' dt,
daycount.ct,
wkcount.ct,
mncount.ct,
if(date_add(next_day('$do_date','MO'),-1)='$do_date','Y','N'),
if(last_day('$do_date')='$do_date','Y','N')
from
(
select
'$do_date' dt,
count(*) ct
from "$APP".dws_uv_detail_day
where dt='$do_date'
)daycount join
(
select
'$do_date' dt,
count(*) ct
from "$APP".dws_uv_detail_wk
where wk_dt=concat(date_add(next_day('$do_date','MO'),-7),'_',date_add(next_day('$do_date','MO'),-1))
)wkcount on daycount.dt=wkcount.dt
join
(
select
'$do_date' dt,
count(*) ct
from "$APP".dws_uv_detail_mn
where mn=date_format('$do_date','yyyy-MM')
)mncount on daycount.dt=mncount.dt;
"

$hive -e "$sql"
2)增加脚本执行权限
[atguigu@hadoop102 bin]$ chmod 777 ads_uv_log.sh
3)脚本使用
[atguigu@hadoop102 module]$ ads_uv_log.sh 2019-02-11
4)脚本执行时间
企业开发中一般在每日凌晨30分~1点
5)查询导入结果
hive (gmall)> select * from ads_uv_count;
第7章 需求二:用户新增主题
首次联网使用应用的用户。如果一个用户首次打开某APP,那这个用户定义为新增用户;卸载再安装的设备,不会被算作一次新增。新增用户包括日新增用户、周新增用户、月新增用户。
7.1 DWS层(每日新增设备明细表)

1)建表语句
hive (gmall)>
drop table if exists dws_new_mid_day;
create external table dws_new_mid_day
(
mid_id string COMMENT ‘设备唯一标识’,
user_id string COMMENT ‘用户标识’,
version_code string COMMENT ‘程序版本号’,
version_name string COMMENT ‘程序版本名’,
lang string COMMENT ‘系统语言’,
source string COMMENT ‘渠道号’,
os string COMMENT ‘安卓系统版本’,
area string COMMENT ‘区域’,
model string COMMENT ‘手机型号’,
brand string COMMENT ‘手机品牌’,
sdk_version string COMMENT ‘sdkVersion’,
gmail string COMMENT ‘gmail’,
height_width string COMMENT ‘屏幕宽高’,
app_time string COMMENT ‘客户端日志产生时的时间’,
network string COMMENT ‘网络模式’,
lng string COMMENT ‘经度’,
lat string COMMENT ‘纬度’,
create_date string comment ‘创建时间’
) COMMENT ‘每日新增设备信息’
stored as parquet
location ‘/warehouse/gmall/dws/dws_new_mid_day/’;
2)导入数据
用每日活跃用户表 left join 每日新增设备表,关联条件是 mid_id 相等。关联不上(即新增设备表一侧为 null)的设备,说明此前从未被记录过,即为当日新增设备。
hive (gmall)>
insert into table dws_new_mid_day
select
ud.mid_id,
ud.user_id ,
ud.version_code ,
ud.version_name ,
ud.lang ,
ud.source,
ud.os,
ud.area,
ud.model,
ud.brand,
ud.sdk_version,
ud.gmail,
ud.height_width,
ud.app_time,
ud.network,
ud.lng,
ud.lat,
‘2019-02-10’
from dws_uv_detail_day ud left join dws_new_mid_day nm on ud.mid_id=nm.mid_id
where ud.dt=‘2019-02-10’ and nm.mid_id is null;
3)查询导入数据
hive (gmall)> select count(*) from dws_new_mid_day ;
7.2 ADS层(每日新增设备表)

1)建表语句
hive (gmall)>
drop table if exists ads_new_mid_count;
create external table ads_new_mid_count
(
create_date string comment ‘创建时间’ ,
new_mid_count BIGINT comment ‘新增设备数量’
) COMMENT ‘每日新增设备信息数量’
row format delimited fields terminated by ‘\t’
location ‘/warehouse/gmall/ads/ads_new_mid_count/’;
2)导入数据
hive (gmall)>
insert into table ads_new_mid_count
select
create_date,
count(*)
from dws_new_mid_day
where create_date=‘2019-02-10’
group by create_date;
3)查询导入数据
hive (gmall)> select * from ads_new_mid_count;
第8章 需求三:用户留存主题
8.1 需求目标
8.1.1 用户留存概念

8.1.2 需求描述

8.2 DWS层
8.2.1 DWS层(每日留存用户明细表)

1)建表语句
hive (gmall)>
drop table if exists dws_user_retention_day;
create external table dws_user_retention_day
(
mid_id string COMMENT ‘设备唯一标识’,
user_id string COMMENT ‘用户标识’,
version_code string COMMENT ‘程序版本号’,
version_name string COMMENT ‘程序版本名’,
lang string COMMENT ‘系统语言’,
source string COMMENT ‘渠道号’,
os string COMMENT ‘安卓系统版本’,
area string COMMENT ‘区域’,
model string COMMENT ‘手机型号’,
brand string COMMENT ‘手机品牌’,
sdk_version string COMMENT ‘sdkVersion’,
gmail string COMMENT ‘gmail’,
height_width string COMMENT ‘屏幕宽高’,
app_time string COMMENT ‘客户端日志产生时的时间’,
network string COMMENT ‘网络模式’,
lng string COMMENT ‘经度’,
lat string COMMENT ‘纬度’,
create_date string comment ‘设备新增时间’,
retention_day int comment ‘截止当前日期留存天数’
) COMMENT ‘每日用户留存情况’
PARTITIONED BY (dt string)
stored as parquet
location ‘/warehouse/gmall/dws/dws_user_retention_day/’
;
2)导入数据(每天计算前1天的新用户访问留存明细)
hive (gmall)>
insert overwrite table dws_user_retention_day
partition(dt=“2019-02-11”)
select
nm.mid_id,
nm.user_id ,
nm.version_code ,
nm.version_name ,
nm.lang ,
nm.source,
nm.os,
nm.area,
nm.model,
nm.brand,
nm.sdk_version,
nm.gmail,
nm.height_width,
nm.app_time,
nm.network,
nm.lng,
nm.lat,
nm.create_date,
1 retention_day
from dws_uv_detail_day ud join dws_new_mid_day nm on ud.mid_id =nm.mid_id
where ud.dt=‘2019-02-11’ and nm.create_date=date_add(‘2019-02-11’,-1);
3)查询导入数据(每天计算前1天的新用户访问留存明细)
hive (gmall)> select count(*) from dws_user_retention_day;
8.2.2 DWS层(1,2,3,n天留存用户明细表)
1)导入数据(每天计算前1,2,3,n天的新用户访问留存明细)
hive (gmall)>
insert overwrite table dws_user_retention_day
partition(dt=“2019-02-11”)
select
nm.mid_id,
nm.user_id,
nm.version_code,
nm.version_name,
nm.lang,
nm.source,
nm.os,
nm.area,
nm.model,
nm.brand,
nm.sdk_version,
nm.gmail,
nm.height_width,
nm.app_time,
nm.network,
nm.lng,
nm.lat,
nm.create_date,
1 retention_day
from dws_uv_detail_day ud join dws_new_mid_day nm on ud.mid_id =nm.mid_id
where ud.dt=‘2019-02-11’ and nm.create_date=date_add(‘2019-02-11’,-1)

union all
select
nm.mid_id,
nm.user_id ,
nm.version_code ,
nm.version_name ,
nm.lang ,
nm.source,
nm.os,
nm.area,
nm.model,
nm.brand,
nm.sdk_version,
nm.gmail,
nm.height_width,
nm.app_time,
nm.network,
nm.lng,
nm.lat,
nm.create_date,
2 retention_day
from dws_uv_detail_day ud join dws_new_mid_day nm on ud.mid_id =nm.mid_id
where ud.dt=‘2019-02-11’ and nm.create_date=date_add(‘2019-02-11’,-2)

union all
select
nm.mid_id,
nm.user_id ,
nm.version_code ,
nm.version_name ,
nm.lang ,
nm.source,
nm.os,
nm.area,
nm.model,
nm.brand,
nm.sdk_version,
nm.gmail,
nm.height_width,
nm.app_time,
nm.network,
nm.lng,
nm.lat,
nm.create_date,
3 retention_day
from dws_uv_detail_day ud join dws_new_mid_day nm on ud.mid_id =nm.mid_id
where ud.dt=‘2019-02-11’ and nm.create_date=date_add(‘2019-02-11’,-3);
2)查询导入数据(每天计算前1,2,3天的新用户访问留存明细)
hive (gmall)> select retention_day , count(*) from dws_user_retention_day group by retention_day;
8.2.3 Union与Union all区别
1)准备两张表
tableA:
id  name  score
1   a     80
2   b     79
3   c     68

tableB:
id  name  score
1   d     48
2   e     23
3   c     86
2)采用union查询
select name from tableA             
union                        
select name from tableB             
查询结果
name
a
d
b
e
c
3)采用union all查询
select name from tableA
union all
select name from tableB
查询结果
name
a
b
c
d
e
c
4)总结
(1)union会将联合的结果集去重,效率较union all差
(2)union all不会对结果集去重,所以效率高
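补充:如果既想沿用 union all 的写法、又确实需要去重,也可以在外层自行去重,效果与 union 相当(仅为示意):
hive (gmall)>
select distinct name
from
(
select name from tableA
union all
select name from tableB
)t1;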
8.3 ADS层
8.3.1 留存用户数

1)建表语句
hive (gmall)>
drop table if exists ads_user_retention_day_count;
create external table ads_user_retention_day_count
(
create_date string comment ‘设备新增日期’,
retention_day int comment ‘截止当前日期留存天数’,
retention_count bigint comment ‘留存数量’
) COMMENT ‘每日用户留存情况’
row format delimited fields terminated by ‘\t’
location ‘/warehouse/gmall/ads/ads_user_retention_day_count/’;
2)导入数据
hive (gmall)>
insert into table ads_user_retention_day_count
select
create_date,
retention_day,
count(*) retention_count
from dws_user_retention_day
where dt=‘2019-02-11’
group by create_date,retention_day;
3)查询导入数据
hive (gmall)> select * from ads_user_retention_day_count;
8.3.2 留存用户比率

1)建表语句
hive (gmall)>
drop table if exists ads_user_retention_day_rate;
create external table ads_user_retention_day_rate
(
stat_date string comment ‘统计日期’,
create_date string comment ‘设备新增日期’,
retention_day int comment ‘截止当前日期留存天数’,
retention_count bigint comment ‘留存数量’,
new_mid_count bigint comment ‘当日设备新增数量’,
retention_ratio decimal(10,2) comment ‘留存率’
) COMMENT ‘每日用户留存情况’
row format delimited fields terminated by ‘\t’
location ‘/warehouse/gmall/ads/ads_user_retention_day_rate/’;
2)导入数据
hive (gmall)>
insert into table ads_user_retention_day_rate
select
‘2019-02-11’,
ur.create_date,
ur.retention_day,
ur.retention_count,
nc.new_mid_count,
ur.retention_count/nc.new_mid_count*100
from
(
select
create_date,
retention_day,
count(*) retention_count
from dws_user_retention_day
where dt=‘2019-02-11’
group by create_date,retention_day
) ur join ads_new_mid_count nc on nc.create_date=ur.create_date;
3)查询导入数据
hive (gmall)>select * from ads_user_retention_day_rate;
第9章 新数据准备
为了分析沉默用户、本周回流用户数、流失用户、最近连续3周活跃用户、最近七天内连续三天活跃用户数,需要准备2019-02-12、2019-02-20日的数据。
1)2019-02-12数据准备
(1)修改日志时间
[atguigu@hadoop102 ~]$ dt.sh 2019-02-12
(2)启动集群
[atguigu@hadoop102 ~]$ cluster.sh start
(3)生成日志数据
[atguigu@hadoop102 ~]$ lg.sh
(4)将HDFS数据导入到ODS层
[atguigu@hadoop102 ~]$ ods_log.sh 2019-02-12
(5)将ODS数据导入到DWD层
[atguigu@hadoop102 ~]$ dwd_start_log.sh 2019-02-12
[atguigu@hadoop102 ~]$ dwd_base_log.sh 2019-02-12
[atguigu@hadoop102 ~]$ dwd_event_log.sh 2019-02-12
(6)将DWD数据导入到DWS层
[atguigu@hadoop102 ~]$ dws_uv_log.sh 2019-02-12
(7)验证
hive (gmall)> select * from dws_uv_detail_day where dt=‘2019-02-12’ limit 2;
2)2019-02-20数据准备
(1)修改日志时间
[atguigu@hadoop102 ~]$ dt.sh 2019-02-20
(2)启动集群
[atguigu@hadoop102 ~]$ cluster.sh start
(3)生成日志数据
[atguigu@hadoop102 ~]$ lg.sh
(4)将HDFS数据导入到ODS层
[atguigu@hadoop102 ~]$ ods_log.sh 2019-02-20
(5)将ODS数据导入到DWD层
[atguigu@hadoop102 ~]$ dwd_start_log.sh 2019-02-20
[atguigu@hadoop102 ~]$ dwd_base_log.sh 2019-02-20
[atguigu@hadoop102 ~]$ dwd_event_log.sh 2019-02-20
(6)将DWD数据导入到DWS层
[atguigu@hadoop102 ~]$ dws_uv_log.sh 2019-02-20
(7)验证
hive (gmall)> select * from dws_uv_detail_day where dt=‘2019-02-20’ limit 2;
第10章 需求四:沉默用户数
沉默用户:指的是只在安装当天启动过,且启动时间是在一周前
10.1 DWS层
使用日活明细表dws_uv_detail_day作为DWS层数据
10.2 ADS层

1)建表语句
hive (gmall)>
drop table if exists ads_slient_count;
create external table ads_slient_count(
dt string COMMENT ‘统计日期’,
slient_count bigint COMMENT ‘沉默设备数’
)
row format delimited fields terminated by ‘\t’
location ‘/warehouse/gmall/ads/ads_slient_count’;
2)导入2019-02-20数据
hive (gmall)>
insert into table ads_slient_count
select
‘2019-02-20’ dt,
count(*) slient_count
from
(
select mid_id
from dws_uv_detail_day
where dt<=‘2019-02-20’
group by mid_id
having count(*)=1 and min(dt)<=date_add('2019-02-20',-7)
)t1;
3)查询导入数据
hive (gmall)> select * from ads_slient_count;
10.3 编写脚本
1)在hadoop102的/home/atguigu/bin目录下创建脚本
[atguigu@hadoop102 bin]$ vim ads_slient_log.sh
在脚本中编写如下内容
#!/bin/bash

hive=/opt/module/hive/bin/hive
APP=gmall

if [ -n "$1" ];then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi

echo "-----------导入日期$do_date-----------"

sql="
insert into table "$APP".ads_slient_count
select
'$do_date' dt,
count(*) slient_count
from
(
select
mid_id
from "$APP".dws_uv_detail_day
where dt<='$do_date'
group by mid_id
having count(*)=1 and min(dt)<=date_add('$do_date',-7)
)t1;
"

$hive -e "$sql"
2)增加脚本执行权限
[atguigu@hadoop102 bin]$ chmod 777 ads_slient_log.sh
3)脚本使用
[atguigu@hadoop102 module]$ ads_slient_log.sh 2019-02-20
4)查询结果
hive (gmall)> select * from ads_slient_count;
5)脚本执行时间
企业开发中一般在每日凌晨30分~1点
第11章 需求五:本周回流用户数
本周回流=本周活跃-本周新增-上周活跃
11.1 DWS层
使用日活明细表dws_uv_detail_day作为DWS层数据
11.2 ADS层

1)建表语句
hive (gmall)>
drop table if exists ads_back_count;
create external table ads_back_count(
dt string COMMENT ‘统计日期’,
wk_dt string COMMENT ‘统计日期所在周’,
wastage_count bigint COMMENT ‘回流设备数’
)
row format delimited fields terminated by ‘\t’
location ‘/warehouse/gmall/ads/ads_back_count’;
2)导入数据:
hive (gmall)>
insert into table ads_back_count
select
‘2019-02-20’ dt,
concat(date_add(next_day('2019-02-20','MO'),-7),'_',date_add(next_day('2019-02-20','MO'),-1)) wk_dt,
count(*)
from
(
select t1.mid_id
from
(
select mid_id
from dws_uv_detail_wk
where wk_dt=concat(date_add(next_day('2019-02-20','MO'),-7),'_',date_add(next_day('2019-02-20','MO'),-1))
)t1
left join
(
select mid_id
from dws_new_mid_day
where create_date<=date_add(next_day(‘2019-02-20’,‘MO’),-1) and create_date>=date_add(next_day(‘2019-02-20’,‘MO’),-7)
)t2
on t1.mid_id=t2.mid_id
left join
(
select mid_id
from dws_uv_detail_wk
where wk_dt=concat(date_add(next_day(‘2019-02-20’,‘MO’),-7*2),’_’,date_add(next_day(‘2019-02-20’,‘MO’),-7-1))
)t3
on t1.mid_id=t3.mid_id
where t2.mid_id is null and t3.mid_id is null
)t4;
3)查询结果
hive (gmall)> select * from ads_back_count;
11.3 编写脚本
1)在hadoop102的/home/atguigu/bin目录下创建脚本
[atguigu@hadoop102 bin]$ vim ads_back_log.sh
在脚本中编写如下内容
#!/bin/bash

if [ -n "$1" ];then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi

hive=/opt/module/hive/bin/hive
APP=gmall

echo "-----------导入日期$do_date-----------"

sql="
insert into table "$APP".ads_back_count
select
'$do_date' dt,
concat(date_add(next_day('$do_date','MO'),-7),'_',date_add(next_day('$do_date','MO'),-1)) wk_dt,
count(*)
from
(
select t1.mid_id
from
(
select mid_id
from "$APP".dws_uv_detail_wk
where wk_dt=concat(date_add(next_day('$do_date','MO'),-7),'_',date_add(next_day('$do_date','MO'),-1))
)t1
left join
(
select mid_id
from "$APP".dws_new_mid_day
where create_date<=date_add(next_day('$do_date','MO'),-1) and create_date>=date_add(next_day('$do_date','MO'),-7)
)t2
on t1.mid_id=t2.mid_id
left join
(
select mid_id
from "$APP".dws_uv_detail_wk
where wk_dt=concat(date_add(next_day('$do_date','MO'),-7*2),'_',date_add(next_day('$do_date','MO'),-7-1))
)t3
on t1.mid_id=t3.mid_id
where t2.mid_id is null and t3.mid_id is null
)t4;
"

$hive -e "$sql"
2)增加脚本执行权限
[atguigu@hadoop102 bin]$ chmod 777 ads_back_log.sh
3)脚本使用
[atguigu@hadoop102 module]$ ads_back_log.sh 2019-02-20
4)查询结果
hive (gmall)> select * from ads_back_count;
5)脚本执行时间
企业开发中一般在每周一凌晨30分~1点
第12章 需求六:流失用户数
流失用户:最近7天未登录我们称之为流失用户
12.1 DWS层
使用日活明细表dws_uv_detail_day作为DWS层数据
12.2 ADS层

1)建表语句
hive (gmall)>
drop table if exists ads_wastage_count;
create external table ads_wastage_count(
dt string COMMENT ‘统计日期’,
wastage_count bigint COMMENT ‘流失设备数’
)
row format delimited fields terminated by ‘\t’
location ‘/warehouse/gmall/ads/ads_wastage_count’;
2)导入2019-02-20数据
hive (gmall)>
insert into table ads_wastage_count
select
‘2019-02-20’,
count(*)
from
(
select mid_id
from dws_uv_detail_day
group by mid_id
having max(dt)<=date_add(‘2019-02-20’,-7)
)t1;
12.3 编写脚本
1)在hadoop102的/home/atguigu/bin目录下创建脚本
[atguigu@hadoop102 bin]$ vim ads_wastage_log.sh
在脚本中编写如下内容
#!/bin/bash

if [ -n "$1" ];then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi

hive=/opt/module/hive/bin/hive
APP=gmall

echo "-----------导入日期$do_date-----------"

sql="
insert into table "$APP".ads_wastage_count
select
'$do_date',
count(*)
from
(
select mid_id
from "$APP".dws_uv_detail_day
group by mid_id
having max(dt)<=date_add('$do_date',-7)
)t1;
"

$hive -e "$sql"
2)增加脚本执行权限
[atguigu@hadoop102 bin]$ chmod 777 ads_wastage_log.sh
3)脚本使用
[atguigu@hadoop102 module]$ ads_wastage_log.sh 2019-02-20
4)查询结果
hive (gmall)> select * from ads_wastage_count;
5)脚本执行时间
企业开发中一般在每日凌晨30分~1点
第13章 需求七:最近连续3周活跃用户数
最近3周连续活跃的用户:通常是周一对前3周的数据做统计,该数据一周计算一次。
13.1 DWS层
使用周活明细表dws_uv_detail_wk作为DWS层数据
13.2 ADS层

1)建表语句
hive (gmall)>
drop table if exists ads_continuity_wk_count;
create external table ads_continuity_wk_count(
dt string COMMENT ‘统计日期,一般用结束周周日日期,如果每天计算一次,可用当天日期’,
wk_dt string COMMENT ‘持续时间’,
continuity_count bigint
)
row format delimited fields terminated by ‘\t’
location ‘/warehouse/gmall/ads/ads_continuity_wk_count’;
2)导入2019-02-20所在周的数据
hive (gmall)>
insert into table ads_continuity_wk_count
select
‘2019-02-20’,
concat(date_add(next_day('2019-02-20','MO'),-7*3),'_',date_add(next_day('2019-02-20','MO'),-1)),
count(*)
from
(
select mid_id
from dws_uv_detail_wk
where wk_dt>=concat(date_add(next_day('2019-02-20','MO'),-7*3),'_',date_add(next_day('2019-02-20','MO'),-7*2-1))
and wk_dt<=concat(date_add(next_day('2019-02-20','MO'),-7),'_',date_add(next_day('2019-02-20','MO'),-1))
group by mid_id
having count(*)=3
)t1;
3)查询
hive (gmall)> select * from ads_continuity_wk_count;
13.3 编写脚本
1)在hadoop102的/home/atguigu/bin目录下创建脚本
[atguigu@hadoop102 bin]$ vim ads_continuity_wk_log.sh
在脚本中编写如下内容
#!/bin/bash

if [ -n "$1" ];then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi

hive=/opt/module/hive/bin/hive
APP=gmall

echo "-----------导入日期$do_date-----------"

sql="
insert into table "$APP".ads_continuity_wk_count
select
'$do_date',
concat(date_add(next_day('$do_date','MO'),-7*3),'_',date_add(next_day('$do_date','MO'),-1)),
count(*)
from
(
select mid_id
from "$APP".dws_uv_detail_wk
where wk_dt>=concat(date_add(next_day('$do_date','MO'),-7*3),'_',date_add(next_day('$do_date','MO'),-7*2-1))
and wk_dt<=concat(date_add(next_day('$do_date','MO'),-7),'_',date_add(next_day('$do_date','MO'),-1))
group by mid_id
having count(*)=3
)t1;
"

$hive -e "$sql"
2)增加脚本执行权限
[atguigu@hadoop102 bin]$ chmod 777 ads_continuity_wk_log.sh
3)脚本使用
[atguigu@hadoop102 module]$ ads_continuity_wk_log.sh 2019-02-20
4)查询结果
hive (gmall)> select * from ads_continuity_wk_count;
5)脚本执行时间
企业开发中一般在每周一凌晨30分~1点
第14章 需求八:最近七天内连续三天活跃用户数
说明:最近7天内连续3天活跃用户数
14.1 DWS层
使用日活明细表dws_uv_detail_day作为DWS层数据
14.2 ADS层

1)建表语句
hive (gmall)>
drop table if exists ads_continuity_uv_count;
create external table ads_continuity_uv_count(
dt string COMMENT ‘统计日期’,
wk_dt string COMMENT ‘最近7天日期’,
continuity_count bigint
) COMMENT ‘连续活跃设备数’
row format delimited fields terminated by ‘\t’
location ‘/warehouse/gmall/ads/ads_continuity_uv_count’;
2)写出导入数据的SQL语句
hive (gmall)>
insert into table ads_continuity_uv_count
select
‘2019-02-12’,
concat(date_add(‘2019-02-12’,-6),’_’,‘2019-02-12’),
count(*)
from
(
select mid_id
from
(
select mid_id
from
(
select
mid_id,
date_sub(dt,rank) date_dif
from
(
select
mid_id,
dt,
rank() over(partition by mid_id order by dt) rank
from dws_uv_detail_day
where dt>=date_add(‘2019-02-12’,-6) and dt<=‘2019-02-12’
)t1
)t2
group by mid_id,date_dif
having count(*)>=3
)t3
group by mid_id
)t4;
3)查询
hive (gmall)> select * from ads_continuity_uv_count;
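上面 SQL 的核心技巧是 date_sub(dt,rank):先按设备对活跃日期排名,再用日期减去排名,连续的日期会得到同一个差值,按"设备+差值"分组计数,计数大于等于 3 即为最近 7 天内连续 3 天活跃。以某一设备的示例数据说明(仅为示意):
dt           rank   date_sub(dt,rank)
2019-02-08   1      2019-02-07
2019-02-09   2      2019-02-07
2019-02-10   3      2019-02-07
2019-02-12   4      2019-02-08
前 3 行差值相同且计数为 3,该设备满足"最近 7 天内连续 3 天活跃"。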
14.3 编写脚本
1)在hadoop102的/home/atguigu/bin目录下创建脚本
[atguigu@hadoop102 bin]$ vim ads_continuity_log.sh
在脚本中编写如下内容
#!/bin/bash

if [ -n "$1" ];then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi

hive=/opt/module/hive/bin/hive
APP=gmall

echo "-----------导入日期$do_date-----------"

sql="
insert into table "$APP".ads_continuity_uv_count
select
'$do_date',
concat(date_add('$do_date',-6),'_','$do_date') dt,
count(*)
from
(
select mid_id
from
(
select mid_id
from
(
select
mid_id,
date_sub(dt,rank) date_diff
from
(
select
mid_id,
dt,
rank() over(partition by mid_id order by dt) rank
from "$APP".dws_uv_detail_day
where dt>=date_add('$do_date',-6) and dt<='$do_date'
)t1
)t2
group by mid_id,date_diff
having count(*)>=3
)t3
group by mid_id
)t4;
"

$hive -e "$sql"
2)增加脚本执行权限
[atguigu@hadoop102 bin]$ chmod 777 ads_continuity_log.sh
3)脚本使用
[atguigu@hadoop102 module]$ ads_continuity_log.sh 2019-02-12
