Big Data Series: A Data Warehouse Case Study

Offline Data Warehouse

A data warehouse (Data Warehouse) is a strategic collection of data from all of an enterprise's systems, assembled to support every stage of decision making.

Analyzing the data in the warehouse helps an enterprise improve business processes, control costs, raise product quality, and more.

The warehouse is not the data's final destination; it prepares data for that destination: cleansing, transformation, classification, reorganization, merging, splitting, aggregation, and so on.

(Figure 1)

1 Project overview

1.1 Requirements

  1. Build a user-behavior data collection platform
  2. Build a business data collection platform
  3. Dimensional modeling for the data warehouse
  4. Analysis topics: users, traffic, members, products, sales, regions, activities, and more
  5. Ad-hoc query tools so metrics can be analyzed at any time
  6. Cluster performance monitoring with alerts on anomalies
  7. Metadata management
  8. Data quality monitoring

1.2 Technology selection

  • Data collection frameworks

| Framework | Main function |
| --- | --- |
| Sqoop | Import/export between big-data platforms and relational databases |
| DataX | Import/export between big-data platforms and relational databases |
| Flume | Good at collecting and parsing log data |
| Logstash | Good at collecting and parsing log data |
| Maxwell | Commonly used to parse MySQL binlogs in real time |
| Canal | Commonly used to parse MySQL binlogs in real time |
| WaterDrop | Data import/export tool |

  • Message middleware

| Open-source MQ | Overview |
| --- | --- |
| RabbitMQ | By LShift, written in Erlang; multi-protocol, broker architecture, heavyweight |
| ZeroMQ | By iMatix, the original designers of AMQP; lightweight messaging kernel, broker-less design, C++ |
| Kafka | By LinkedIn, written in Scala; supports parallel loading into Hadoop |
| ActiveMQ | Apache's JMS implementation; broker and p2p deployment, multi-protocol, Java |
| Redis | Key-value NoSQL database with MQ features |
| MemcacheQ | Message queue built on the memcache protocol, C/C++ |

  • Persistent storage frameworks

| Framework | Main use |
| --- | --- |
| HDFS | Distributed file system |
| HBase | Key-value NoSQL database |
| Kudu | Cloudera's open-source store, similar to HBase |
| Hive | MapReduce-based data warehouse tool |

  • Offline compute frameworks (Hive execution engines)

| Framework | Description |
| --- | --- |
| MapReduce | The earliest distributed computing framework |
| Spark | One-stop batch and stream processing on Spark |
| Flink | One-stop batch and stream processing on Flink |
  • Analytical database selection

| Criterion | Druid | Kylin | Presto | Impala | ES |
| --- | --- | --- | --- | --- | --- |
| Sub-second response | √ | √ | × | × | × |
| 10-billion-row datasets | √ | √ | √ | √ | √ |
| SQL support | √ (via plugin) | √ | √ | √ | √ |
| Offline | √ | √ | √ | √ | √ |
| Real-time | √ | × | × | × | √ |
| Exact distinct count | × | √ | √ | √ | × |
| Multi-table join | × | √ | √ | √ | × |
| JDBC for BI | × | √ | √ | √ | × |
  • Other choices
    • Task scheduling: DolphinScheduler
    • Cluster monitoring: CM + CDH
    • Metadata management: Atlas
    • BI tools: Zeppelin, Superset

1.3 Architecture

1.4 Cluster resource planning

  • Sizing the cluster (assuming 8 TB of disk and 128 GB of RAM per server)
    1. 1,000,000 daily active users, about 100 log entries each: 1,000,000 × 100 = 100 million entries per day
    2. At roughly 1 KB per entry: 100,000,000 KB / 1024 / 1024 ≈ 100 GB per day
    3. With no server expansion for half a year: 100 GB × 180 days ≈ 18 TB
    4. Keeping 3 replicas: 18 TB × 3 = 54 TB
    5. Reserving 20%–30% headroom: 54 TB / 0.7 ≈ 77 TB
    6. Conclusion: about 10 servers

Given limited resources, three servers are used here.

| Service | Sub-services (across cdh01.cm, cdh02.cm, cdh03.cm) |
| --- | --- |
| HDFS | NameNode, DataNode, SecondaryNameNode |
| Yarn | NodeManager, ResourceManager |
| Zookeeper | Zookeeper Server |
| Flume | Flume (collection); Flume (Kafka consumer) |
| Kafka | Kafka |
| Hive | Hive |
| MySQL | MySQL |
| Sqoop | Sqoop |
| Presto | Coordinator, Worker |
| DolphinScheduler | DolphinScheduler |
| Druid | Druid |
| Kylin | Kylin |
| HBase | HMaster, HRegionServer |
| Superset | Superset |
| Atlas | Atlas |
| Solr | Solr |

2 Data generation module

This module covers the collection of user-behavior data. Why collect it at all?

For an enterprise, users are revenue. User habits and related data are collected so they can be analyzed in derived big-data products such as user-profile and tagging systems. User data is usually analyzed offline; the results can later be stored in an inverted-index ecosystem such as ES, matched against user habits in real time for personalized recommendation, and, going a step further, fed to deep-learning models to recommend across similar users.

2.1 Event log format

  • Common fields: fields that essentially every Android device reports

  • Business fields: fields reported by instrumentation, specific to a business event type

{
"ap":"xxxxx", //data source of the project: app, pc
"cm": { //common fields
	"mid": "", // (String) unique device id
	"uid": "", // (String) user id
	"vc": "1", // (String) versionCode, app version number
	"vn": "1.0", // (String) versionName, app version name
	"l": "zh", // (String) language, system language
	"sr": "", // (String) channel the app was installed from
	"os": "7.1.1", // (String) Android version
	"ar": "CN", // (String) area
	"md": "BBB100-1", // (String) model, phone model
	"ba": "blackberry", // (String) brand, phone brand
	"sv": "V2.2.1", // (String) sdkVersion
	"g": "", // (String) gmail
	"hw": "1620x1080", // (String) height x width, screen size
	"t": "1506047606608", // (String) client time when the log was produced
	"nw": "WIFI", // (String) network mode
	"ln": 0, // (double) lng, longitude
	"la": 0 // (double) lat, latitude
},
"et": [ //events
	{
	"ett": "1506047605364", //client time when the event occurred
	"en": "display", //event name
	"kv": { //event payload, free-form key-value pairs
		"goodsid": "236",
		"action": "1",
		"extend1": "1",
		"place": "2",
		"category": "75"
		}
	}
]
}
  • Sample log (server timestamp | log); the timestamp can be used to gauge network transfer time:
1540934156385| {
	"ap": "gmall",	//warehouse name
	"cm": {
	"uid": "1234",
	"vc": "2",
	"vn": "1.0",
	"l": "EN",
	"sr": "",
	"os": "7.1.1",
	"ar": "CN",
	"md": "BBB100-1",
	"ba": "blackberry",
	"sv": "V2.2.1",
	"g": "[email protected]",
	"hw": "1620x1080",
	"t": "1506047606608",
	"nw": "WIFI",
	"ln": 0,
        "la": 0
},
"et": [
	{
	"ett": "1506047605364", //client time when the event occurred
	"en": "display", //event name
	"kv": { //event payload, free-form key-value pairs
		"goodsid": "236",
		"action": "1",
		"extend1": "1",
		"place": "2",
		"category": "75"
		}
	},{
	"ett": "1552352626835",
	"en": "active_background",
	"kv": {
		"active_source": "1"
		}
	}
	]
}

2.2 Instrumented event log data

2.2.1 Product list page

  • Event name: loading

| Tag | Meaning |
| --- | --- |
| action | Action: start loading = 1, load succeeded = 2, load failed = 3 |
| loading_time | Load duration: from pull-down start until the API returns data (report 0 at load start; report the duration only on success or failure) |
| loading_way | Load type: 1 = read from cache, 2 = pull fresh data from the API (reported only on success) |
| extend1 | Extension field 1 |
| extend2 | Extension field 2 |
| type | Load trigger: auto = 1, user pull-down = 2, bottom load = 3 (tap the bottom hint bar / tap back-to-top to load) |
| type1 | Load failure code: the failure status code (empty means success, no failure) |

2.2.2 Product click

  • Event tag: display

| Tag | Meaning |
| --- | --- |
| action | Action: product impression = 1, product click = 2 |
| goodsid | Product ID (issued by the server) |
| place | Position (0 for the first product, 1 for the second, and so on) |
| extend1 | Impression type: 1 = first impression, 2 = repeat impression |
| category | Category ID (defined by the server) |

2.2.3 Product detail page

(Figure 2)

  • Event tag: newsdetail

| Tag | Meaning |
| --- | --- |
| entry | Page entry point: app home = 1, push = 2, related recommendation on detail page = 3 |
| action | Action: start loading = 1, load succeeded = 2 (pv), load failed = 3, leave page = 4 |
| goodsid | Product ID (issued by the server) |
| show_style | Product style: 0 no image, 1 one large image, 2 two images, 3 three small images, 4 one small image, 5 one large and two small images |
| news_staytime | Page dwell time: counted from the start of loading until the user closes the page. If the user jumps to another page, pause and resume on return; if the user is away for more than 10 minutes, discard the measurement and do not report. If the page never loaded successfully before exit, report empty. |
| loading_time | Load duration: from page load start until the API returns data (report 0 at load start; report the duration only on success or failure) |
| type1 | Load failure code: the failure status code (empty means success, no failure) |
| category | Category ID (defined by the server) |

2.2.4 Ads

(Figure 3)

  • Event name: ad

| Tag | Meaning |
| --- | --- |
| entry | Entry point: product list page = 1, app home = 2, product detail page = 3 |
| action | Action: ad shown = 1, ad clicked = 2 |
| contentType | Type: 1 product, 2 marketing activity |
| displayMills | Display duration in milliseconds |
| itemId | Product id |
| activityId | Marketing activity id |

2.2.5 Notifications

  • Event tag: notification

| Tag | Meaning |
| --- | --- |
| action | Action: notification created = 1, popped up = 2, clicked = 3, persistent notification shown (reported at most once per day) = 4 |
| type | Notification id: alert = 1, weather forecast (morning = 2, evening = 3), persistent = 4 |
| ap_time | Client pop-up time |
| content | Reserved field |

2.2.6 Background activity

  • Event tag: active_background

| Tag | Meaning |
| --- | --- |
| active_source | 1 = upgrade, 2 = download, 3 = plugin_upgrade |

2.2.7 Comments

  • Description: comment table (comment)

| # | Field | Description | Type | Length | Nullable | Default |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | comment_id | comment id | int | 10,0 | | |
| 2 | userid | user id | int | 10,0 | | 0 |
| 3 | p_comment_id | parent comment id (0 = top-level comment, non-zero = reply) | int | 10,0 | | |
| 4 | content | comment body | string | 1000 | | |
| 5 | addtime | creation time | string | | | |
| 6 | other_id | id of the commented object | int | 10,0 | | |
| 7 | praise_count | like count | int | 10,0 | | 0 |
| 8 | reply_count | reply count | int | 10,0 | | 0 |

2.2.8 Favorites

  • Description: favorites table (favorites)

| # | Field | Description | Type | Length | Nullable | Default |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | id | primary key | int | 10,0 | | |
| 2 | course_id | product id | int | 10,0 | | 0 |
| 3 | userid | user id | int | 10,0 | | 0 |
| 4 | add_time | creation time | string | | | |

2.2.9 Likes

  • Description: likes table (praise)

| # | Field | Description | Type | Length | Nullable | Default |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | id | primary key id | int | 10,0 | | |
| 2 | userid | user id | int | 10,0 | | |
| 3 | target_id | id of the liked object | int | 10,0 | | |
| 4 | type | like type: 1 Q&A like, 2 Q&A comment like, 3 article like, 4 comment like | int | 10,0 | | |
| 5 | add_time | creation time | string | | | |

2.2.10 Error log

| Tag | Meaning |
| --- | --- |
| errorBrief | Error summary |
| errorDetail | Error detail |

2.3 Instrumented start-up log data

{
	"action":"1",
	"ar":"MX",
	"ba":"HTC",
	"detail":"",
	"en":"start",
	"entry":"2",
	"extend1":"",
	"g":"[email protected]",
	"hw":"640*960",
	"l":"en",
	"la":"20.4",
	"ln":"-99.3",
	"loading_time":"2",
	"md":"HTC-2",
	"mid":"995",
	"nw":"4G",
	"open_ad_type":"2",
	"os":"8.1.2",
	"sr":"B",
	"sv":"V2.0.6",
	"t":"1561472502444",
	"uid":"995",
	"vc":"10",
	"vn":"1.3.4"
}
  • Event tag: start

| Tag | Meaning |
| --- | --- |
| entry | Entry point: push = 1, widget = 2, icon = 3, notification = 4, lockscreen_widget = 5 |
| open_ad_type | Splash ad type: native splash ad = 1, interstitial splash ad = 2 |
| action | Status: success = 1, failure = 2 |
| loading_time | Load duration: from pull-down start until the API returns data (report 0 at load start; report the duration only on success or failure) |
| detail | Failure code (empty if none) |
| extend1 | Failure message (empty if none) |
| en | Log type: start |

2.4 Data generation scripts

The example below skips the log-generation pipeline highlighted by the red box in the figure and instead builds the logFile directly with a Java program.

2.4.1 Generated data format

  • Start-up log

(Figure 4)

{"action":"1","ar":"MX","ba":"Sumsung","detail":"201","en":"start","entry":"4","extend1":"","g":"[email protected]","hw":"1080*1920","l":"pt","la":"-11.0","ln":"-70.0","loading_time":"9","md":"sumsung-5","mid":"244","nw":"3G","open_ad_type":"1","os":"8.2.3","sr":"D","sv":"V2.1.3","t":"1589612165914","uid":"244","vc":"16","vn":"1.2.1"}

  • Event log (due to a conversion issue, the figure is missing the leading "timestamp|")

1589695383284|{"cm":{"ln":"-79.4","sv":"V2.5.3","os":"8.0.6","g":"[email protected]","mid":"245","nw":"WIFI","l":"pt","vc":"6","hw":"1080*1920","ar":"MX","uid":"245","t":"1589627025851","la":"-39.6","md":"HTC-7","vn":"1.3.5","ba":"HTC","sr":"N"},"ap":"app","et":[{"ett":"1589650631883","en":"display","kv":{"goodsid":"53","action":"2","extend1":"2","place":"3","category":"50"}},{"ett":"1589690866312","en":"newsdetail","kv":{"entry":"3","goodsid":"54","news_staytime":"1","loading_time":"6","action":"4","showtype":"0","category":"78","type1":""}},{"ett":"1589641734037","en":"loading","kv":{"extend2":"","loading_time":"0","action":"1","extend1":"","type":"2","type1":"201","loading_way":"2"}},{"ett":"1589687684878","en":"ad","kv":{"activityId":"1","displayMills":"92030","entry":"3","action":"5","contentType":"0"}},{"ett":"1589632980772","en":"active_background","kv":{"active_source":"1"}},{"ett":"1589682030324","en":"error","kv":{"errorDetail":"java.lang.NullPointerException\n at cn.lift.appIn.web.AbstractBaseController.validInbound(AbstractBaseController.java:72)\n at cn.lift.dfdf.web.AbstractBaseController.validInbound","errorBrief":"at cn.lift.dfdf.web.AbstractBaseController.validInbound(AbstractBaseController.java:72)"}},{"ett":"1589675065650","en":"comment","kv":{"p_comment_id":2,"addtime":"1589624299628","praise_count":509,"other_id":6,"comment_id":7,"reply_count":35,"userid":3,"content":"关色芦候佰间纶珊斑禁尹赞涤仇彭企呵姜毅"}},{"ett":"1589631359459","en":"favorites","kv":{"course_id":7,"id":0,"add_time":"1589681240066","userid":7}},{"ett":"1589616574187","en":"praise","kv":{"target_id":1,"id":7,"type":3,"add_time":"1589642497314","userid":8}}]}

2.4.2 Creating the Maven projects

(Figure 5)

(Figure 6)

  • data-producer:pom.xml
    <properties>
        <!-- assumed property names; 1.7.20 is presumably the slf4j version -->
        <slf4j.version>1.7.20</slf4j.version>
        <logback.version>1.0.7</logback.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.51</version>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-core</artifactId>
            <version>${logback.version}</version>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
            <version>${logback.version}</version>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.10</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.heaton.bigdata.datawarehouse.app.App</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
  • data-producer:logback.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- log directory; /root/logs matches the tail -F path used later (assumed value) -->
    <property name="LOG_HOME" value="/root/logs" />

    <!-- console output -->
    <appender name="console" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{50} - %msg%n</pattern>
        </encoder>
    </appender>

    <!-- daily rolling file, kept for 30 days, rolled when a file reaches 10MB -->
    <appender name="rollingFile" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${LOG_HOME}/app-%d{yyyy-MM-dd}.log</fileNamePattern>
            <maxHistory>30</maxHistory>
        </rollingPolicy>
        <encoder>
            <pattern>%msg%n</pattern>
        </encoder>
        <triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
            <maxFileSize>10MB</maxFileSize>
        </triggeringPolicy>
    </appender>

    <!-- async wrapper: no discarding, queue of 512 -->
    <appender name="async" class="ch.qos.logback.classic.AsyncAppender">
        <discardingThreshold>0</discardingThreshold>
        <queueSize>512</queueSize>
        <appender-ref ref="rollingFile" />
    </appender>

    <root level="INFO">
        <appender-ref ref="console" />
        <appender-ref ref="async" />
    </root>
</configuration>
  • data-flume:pom.xml
    <dependencies>
        <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-core</artifactId>
            <version>1.9.0</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
  • hive-function:pom.xml
    <dependencies>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>2.1.1</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

2.4.3 Event beans

These classes belong to the data-producer project.

2.4.3.1 Common log fields class
import lombok.Data;
/** 
* @author Heaton
* @email [email protected]
* @date 2020/4/25 14:54 
* @describe 公共日志类
*/
@Data
public class AppBase {
    private String mid; // (String) 设备唯一
    private String uid; // (String) 用户 uid
    private String vc; // (String) versionCode,程序版本号
    private String vn; // (String) versionName,程序版本名
    private String l; // (String) 系统语言
    private String sr; // (String) 渠道号,应用从哪个渠道来的。
    private String os; // (String) Android 系统版本
    private String ar; // (String) 区域
    private String md; // (String) 手机型号
    private String ba; // (String) 手机品牌
    private String sv; // (String) sdkVersion
    private String g; // (String) gmail
    private String hw; // (String) heightXwidth,屏幕宽高
    private String t; // (String) 客户端日志产生时的时间
    private String nw; // (String) 网络模式
    private String ln; // (double) lng 经度
    private String la; // (double) lat 纬度
}
2.4.3.2 Start-up log class
import lombok.Data;
/**
 * @author Heaton
 * @email [email protected]
 * @date 2020/4/25 14:54
 * @describe 启动日志类
 */
@Data
public class AppStart extends AppBase {
    private String entry;//入口: push=1,widget=2,icon=3,notification=4, lockscreen_widget
    private String open_ad_type;//开屏广告类型: 开屏原生广告=1, 开屏插屏广告=2
    private String action;//状态:成功=1 失败=2
    private String loading_time;//加载时长:计算下拉开始到接口返回数据的时间,(开始加载报 0,加载成功或加载失败才上报时间)
    private String detail;//失败码(没有则上报空)
    private String extend1;//失败的 message(没有则上报空)
    private String en;//启动日志类型标记
}
2.4.3.3 Error log class
import lombok.Data;
/**
 * @author Heaton
 * @email [email protected]
 * @date 2020/4/25 14:54
 * @describe 错误日志类
 */
@Data
public class AppErrorLog {
    private String errorBrief; //错误摘要
    private String errorDetail; //错误详情
}
2.4.3.4 Product click log class
import lombok.Data;
/**
 * @author Heaton
 * @email [email protected]
 * @date 2020/4/25 14:54
 * @describe 商品点击日志类
 */
@Data
public class AppDisplay {
    private String action;//动作:曝光商品=1,点击商品=2
    private String goodsid;//商品 ID(服务端下发的 ID)
    private String place;//顺序(第几条商品,第一条为 0,第二条为 1,如此类推)
    private String extend1;//曝光类型:1 - 首次曝光 2-重复曝光(没有使用)
    private String category;//分类 ID(服务端定义的分类 ID)
}
2.4.3.5 Product detail class
import lombok.Data;
/**
 * @author Heaton
 * @email [email protected]
 * @date 2020/4/25 14:54
 * @describe 商品详情类
 */
@Data
public class AppNewsDetail {
    private String entry;//页面入口来源:应用首页=1、push=2、详情页相关推荐
    private String action;//动作:开始加载=1,加载成功=2(pv),加载失败=3, 退出页面=4
    private String goodsid;//商品 ID(服务端下发的 ID)
    private String showtype;//商品样式:0、无图 1、一张大图 2、两张图 3、三张小图 4、一张小 图 5、一张大图两张小图 来源于详情页相关推荐的商品,上报样式都为 0(因为都是左文右图)
    private String news_staytime;//页面停留时长:从商品开始加载时开始计算,到用户关闭页面 所用的时间。若中途用跳转到其它页面了,则暂停计时,待回到详情页时恢复计时。或中途划出的时间超 过 10 分钟,则本次计时作废,不上报本次数据。如未加载成功退出,则报空。
    private String loading_time;//加载时长:计算页面开始加载到接口返回数据的时间 (开始加 载报 0,加载成功或加载失败才上报时间)
    private String type1;//加载失败码:把加载失败状态码报回来(报空为加载成功,没有失败)
    private String category;//分类 ID(服务端定义的分类 ID)
}
2.4.3.6 Product list class
import lombok.Data;
/**
 * @author Heaton
 * @email [email protected]
 * @date 2020/4/25 14:54
 * @describe 商品列表类
 */
@Data
public class AppLoading {
    private String action;//动作:开始加载=1,加载成功=2,加载失败
    private String loading_time;//加载时长:计算下拉开始到接口返回数据的时间,(开始加载报 0, 加载成功或加载失败才上报时间)
    private String loading_way;//加载类型:1-读取缓存,2-从接口拉新数据 (加载成功才上报加 载类型)
    private String extend1;//扩展字段 Extend1
    private String extend2;//扩展字段 Extend2
    private String type;//加载类型:自动加载=1,用户下拽加载=2,底部加载=3(底部条触发点击底 部提示条/点击返回顶部加载)
    private String type1;//加载失败码:把加载失败状态码报回来(报空为加载成功,没有失败)
}
2.4.3.7 Ad class
import lombok.Data;
/**
 * @author Heaton
 * @email [email protected]
 * @date 2020/4/25 14:54
 * @describe 广告类
 */
@Data
public class AppAd {
    private String entry;//入口:商品列表页=1  应用首页=2 商品详情页=3
    private String action;//动作: 广告展示=1 广告点击=2
    private String contentType;//Type: 1 商品 2 营销活动
    private String displayMills;//展示时长 毫秒数
    private String itemId; //商品id
    private String activityId; //营销活动id
}
2.4.3.8 Notification log class
import lombok.Data;
/**
 * @author Heaton
 * @email [email protected]
 * @date 2020/4/25 14:54
 * @describe 消息通知日志类
 */
@Data
public class AppNotification {
    private String action;//动作:通知产生=1,通知弹出=2,通知点击=3,常驻通知展示(不重复上 报,一天之内只报一次)
    private String type;//通知 id:预警通知=1,天气预报(早=2,晚=3),常驻=4
    private String ap_time;//客户端弹出时间
    private String content;//备用字段
}
2.4.3.9 Background activity class
import lombok.Data;
/**
 * @author Heaton
 * @email [email protected]
 * @date 2020/4/25 14:54
 * @describe 用户后台活跃类
 */
@Data
public class AppActive {
    private String active_source;//1=upgrade,2=download(下载),3=plugin_upgrade
}
2.4.3.10 User comment class
import lombok.Data;
/**
 * @author Heaton
 * @email [email protected]
 * @date 2020/4/25 14:54
 * @describe 用户评论类
 */
@Data
public class AppComment {
    private int comment_id;//评论表
    private int userid;//用户 id
    private int p_comment_id;//父级评论 id(为 0 则是一级评论,不为 0 则是回复)
    private String content;//评论内容
    private String addtime;//创建时间
    private int other_id;//评论的相关 id
    private int praise_count;//点赞数量
    private int reply_count;//回复数量
}
2.4.3.11 User favorites class
import lombok.Data;
/**
 * @author Heaton
 * @email [email protected]
 * @date 2020/4/25 14:54
 * @describe 用户收藏类
 */
@Data
public class AppFavorites {
    private int id;//主键
    private int course_id;//商品 id
    private int userid;//用户 ID
    private String add_time;//创建时间
}
2.4.3.12 User like class
import lombok.Data;
/**
 * @author Heaton
 * @email [email protected]
 * @date 2020/4/25 14:54
 * @describe 用户点赞类
 */
@Data
public class AppPraise {
    private int id; //主键 id
    private int userid;//用户 id
    private int target_id;//点赞的对象 id
    private int type;//点赞类型 1 问答点赞 2 问答评论点赞 3 文章点赞数 4 评论点赞
    private String add_time;//添加时间
}

2.4.4 Main class (App)

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.UnsupportedEncodingException;
import java.util.Random;
/**
 * @author Heaton
 * @email [email protected]
 * @date 2020/4/25 14:54
 * @describe 启动类
 */
public class App {

    private final static Logger logger = LoggerFactory.getLogger(App.class);
    private static Random rand = new Random();
    // 设备id
    private static int s_mid = 0;

    // 用户id
    private static int s_uid = 0;

    // 商品id
    private static int s_goodsid = 0;

    public static void main(String[] args) {

        // 参数一:控制发送每条的延时时间,默认是0
        Long delay = args.length > 0 ? Long.parseLong(args[0]) : 0L;

        // 参数二:循环遍历次数
        int loop_len = args.length > 1 ? Integer.parseInt(args[1]) : 1000;

        // 生成数据
        generateLog(delay, loop_len);
    }

    private static void generateLog(Long delay, int loop_len) {

        for (int i = 0; i < loop_len; i++) {

            int flag = rand.nextInt(2);

            switch (flag) {
                case (0):
                    //应用启动
                    AppStart appStart = generateStart();
                    String jsonString = JSON.toJSONString(appStart);

                    //控制台打印
                    logger.info(jsonString);
                    break;

                case (1):

                    JSONObject json = new JSONObject();

                    json.put("ap", "app");
                    json.put("cm", generateComFields());

                    JSONArray eventsArray = new JSONArray();

                    // 事件日志
                    // 商品点击,展示
                    if (rand.nextBoolean()) {
                        eventsArray.add(generateDisplay());
                        json.put("et", eventsArray);
                    }

                    // 商品详情页
                    if (rand.nextBoolean()) {
                        eventsArray.add(generateNewsDetail());
                        json.put("et", eventsArray);
                    }

                    // 商品列表页
                    if (rand.nextBoolean()) {
                        eventsArray.add(generateNewList());
                        json.put("et", eventsArray);
                    }

                    // 广告
                    if (rand.nextBoolean()) {
                        eventsArray.add(generateAd());
                        json.put("et", eventsArray);
                    }

                    // 消息通知
                    if (rand.nextBoolean()) {
                        eventsArray.add(generateNotification());
                        json.put("et", eventsArray);
                    }

                    // 用户后台活跃
                    if (rand.nextBoolean()) {
                        eventsArray.add(generateBackground());
                        json.put("et", eventsArray);
                    }

                    //故障日志
                    if (rand.nextBoolean()) {
                        eventsArray.add(generateError());
                        json.put("et", eventsArray);
                    }

                    // 用户评论
                    if (rand.nextBoolean()) {
                        eventsArray.add(generateComment());
                        json.put("et", eventsArray);
                    }

                    // 用户收藏
                    if (rand.nextBoolean()) {
                        eventsArray.add(generateFavorites());
                        json.put("et", eventsArray);
                    }

                    // 用户点赞
                    if (rand.nextBoolean()) {
                        eventsArray.add(generatePraise());
                        json.put("et", eventsArray);
                    }

                    //时间
                    long millis = System.currentTimeMillis();

                    //控制台打印
                    logger.info(millis + "|" + json.toJSONString());
                    break;
            }

            // 延迟
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * 公共字段设置
     */
    private static JSONObject generateComFields() {

        AppBase appBase = new AppBase();

        //设备id
        appBase.setMid(s_mid + "");
        s_mid++;

        // 用户id
        appBase.setUid(s_uid + "");
        s_uid++;

        // 程序版本号 5,6等
        appBase.setVc("" + rand.nextInt(20));

        //程序版本名 v1.1.1
        appBase.setVn("1." + rand.nextInt(4) + "." + rand.nextInt(10));

        // 安卓系统版本
        appBase.setOs("8." + rand.nextInt(3) + "." + rand.nextInt(10));

        // 语言  es,en,pt
        int flag = rand.nextInt(3);
        switch (flag) {
            case (0):
                appBase.setL("es");
                break;
            case (1):
                appBase.setL("en");
                break;
            case (2):
                appBase.setL("pt");
                break;
        }

        // 渠道号   从哪个渠道来的
        appBase.setSr(getRandomChar(1));

        // area
        flag = rand.nextInt(2);
        switch (flag) {
            case 0:
                appBase.setAr("BR");
                break;
            case 1:
                appBase.setAr("MX");
                break;
        }

        // 手机品牌 ba ,手机型号 md,就取2位数字了
        flag = rand.nextInt(3);
        switch (flag) {
            case 0:
                appBase.setBa("Sumsung");
                appBase.setMd("sumsung-" + rand.nextInt(20));
                break;
            case 1:
                appBase.setBa("Huawei");
                appBase.setMd("Huawei-" + rand.nextInt(20));
                break;
            case 2:
                appBase.setBa("HTC");
                appBase.setMd("HTC-" + rand.nextInt(20));
                break;
        }

        // 嵌入sdk的版本
        appBase.setSv("V2." + rand.nextInt(10) + "." + rand.nextInt(10));
        // gmail
        appBase.setG(getRandomCharAndNumr(8) + "@gmail.com");

        // 屏幕宽高 hw
        flag = rand.nextInt(4);
        switch (flag) {
            case 0:
                appBase.setHw("640*960");
                break;
            case 1:
                appBase.setHw("640*1136");
                break;
            case 2:
                appBase.setHw("750*1134");
                break;
            case 3:
                appBase.setHw("1080*1920");
                break;
        }

        // 客户端产生日志时间
        long millis = System.currentTimeMillis();
        appBase.setT("" + (millis - rand.nextInt(99999999)));

        // 手机网络模式 3G,4G,WIFI
        flag = rand.nextInt(3);
        switch (flag) {
            case 0:
                appBase.setNw("3G");
                break;
            case 1:
                appBase.setNw("4G");
                break;
            case 2:
                appBase.setNw("WIFI");
                break;
        }

        // 拉丁美洲 西经34°46′至西经117°09;北纬32°42′至南纬53°54′
        // 经度
        appBase.setLn((-34 - rand.nextInt(83) - rand.nextInt(60) / 10.0) + "");
        // 纬度
        appBase.setLa((32 - rand.nextInt(85) - rand.nextInt(60) / 10.0) + "");

        return (JSONObject) JSON.toJSON(appBase);
    }

    /**
     * 商品展示事件
     */
    private static JSONObject generateDisplay() {

        AppDisplay appDisplay = new AppDisplay();

        boolean boolFlag = rand.nextInt(10) < 7;

        // 动作:曝光商品=1,点击商品=2,
        if (boolFlag) {
            appDisplay.setAction("1");
        } else {
            appDisplay.setAction("2");
        }

        // 商品id
        String goodsId = s_goodsid + "";
        s_goodsid++;

        appDisplay.setGoodsid(goodsId);

        // 顺序  设置成6条吧
        int flag = rand.nextInt(6);
        appDisplay.setPlace("" + flag);

        // 曝光类型
        flag = 1 + rand.nextInt(2);
        appDisplay.setExtend1("" + flag);

        // 分类
        flag = 1 + rand.nextInt(100);
        appDisplay.setCategory("" + flag);

        JSONObject jsonObject = (JSONObject) JSON.toJSON(appDisplay);

        return packEventJson("display", jsonObject);
    }

    /**
     * 商品详情页
     */
    private static JSONObject generateNewsDetail() {

        AppNewsDetail appNewsDetail = new AppNewsDetail();

        // 页面入口来源
        int flag = 1 + rand.nextInt(3);
        appNewsDetail.setEntry(flag + "");

        // 动作
        appNewsDetail.setAction("" + (rand.nextInt(4) + 1));

        // 商品id
        appNewsDetail.setGoodsid(s_goodsid + "");

        // product style (the bean has a single showtype field, so it is set once)
        flag = rand.nextInt(6);
        appNewsDetail.setShowtype("" + flag);

        // 页面停留时长
        flag = rand.nextInt(10) * rand.nextInt(7);
        appNewsDetail.setNews_staytime(flag + "");

        // 加载时长
        flag = rand.nextInt(10) * rand.nextInt(7);
        appNewsDetail.setLoading_time(flag + "");

        // 加载失败码
        flag = rand.nextInt(10);
        switch (flag) {
            case 1:
                appNewsDetail.setType1("102");
                break;
            case 2:
                appNewsDetail.setType1("201");
                break;
            case 3:
                appNewsDetail.setType1("325");
                break;
            case 4:
                appNewsDetail.setType1("433");
                break;
            case 5:
                appNewsDetail.setType1("542");
                break;
            default:
                appNewsDetail.setType1("");
                break;
        }

        // 分类
        flag = 1 + rand.nextInt(100);
        appNewsDetail.setCategory("" + flag);

        JSONObject eventJson = (JSONObject) JSON.toJSON(appNewsDetail);

        return packEventJson("newsdetail", eventJson);
    }

    /**
     * 商品列表
     */
    private static JSONObject generateNewList() {

        AppLoading appLoading = new AppLoading();

        // 动作
        int flag = rand.nextInt(3) + 1;
        appLoading.setAction(flag + "");

        // 加载时长
        flag = rand.nextInt(10) * rand.nextInt(7);
        appLoading.setLoading_time(flag + "");

        // 失败码
        flag = rand.nextInt(10);
        switch (flag) {
            case 1:
                appLoading.setType1("102");
                break;
            case 2:
                appLoading.setType1("201");
                break;
            case 3:
                appLoading.setType1("325");
                break;
            case 4:
                appLoading.setType1("433");
                break;
            case 5:
                appLoading.setType1("542");
                break;
            default:
                appLoading.setType1("");
                break;
        }

        // 页面  加载类型
        flag = 1 + rand.nextInt(2);
        appLoading.setLoading_way("" + flag);

        // 扩展字段1
        appLoading.setExtend1("");

        // 扩展字段2
        appLoading.setExtend2("");

        // 用户加载类型
        flag = 1 + rand.nextInt(3);
        appLoading.setType("" + flag);

        JSONObject jsonObject = (JSONObject) JSON.toJSON(appLoading);

        return packEventJson("loading", jsonObject);
    }

    /**
     * 广告相关字段
     */
    private static JSONObject generateAd() {

        AppAd appAd = new AppAd();

        // 入口
        int flag = rand.nextInt(3) + 1;
        appAd.setEntry(flag + "");

        // 动作
        flag = rand.nextInt(5) + 1;
        appAd.setAction(flag + "");

        // content type: 1 = product (carries itemId), 2 = marketing activity (carries activityId)
        flag = rand.nextInt(2) + 1;
        appAd.setContentType(flag + "");

        // display duration in milliseconds
        appAd.setDisplayMills((rand.nextInt(120000) + 1000) + "");

        if (flag == 1) {
            appAd.setItemId(rand.nextInt(6) + "");
        } else {
            appAd.setActivityId((rand.nextInt(2) + 1) + "");
        }

        JSONObject jsonObject = (JSONObject) JSON.toJSON(appAd);

        return packEventJson("ad", jsonObject);
    }

    /**
     * 启动日志
     */
    private static AppStart generateStart() {

        AppStart appStart = new AppStart();

        //设备id
        appStart.setMid(s_mid + "");
        s_mid++;

        // 用户id
        appStart.setUid(s_uid + "");
        s_uid++;

        // 程序版本号 5,6等
        appStart.setVc("" + rand.nextInt(20));

        //程序版本名 v1.1.1
        appStart.setVn("1." + rand.nextInt(4) + "." + rand.nextInt(10));

        // 安卓系统版本
        appStart.setOs("8." + rand.nextInt(3) + "." + rand.nextInt(10));

        //设置日志类型
        appStart.setEn("start");

        //    语言  es,en,pt
        int flag = rand.nextInt(3);
        switch (flag) {
            case (0):
                appStart.setL("es");
                break;
            case (1):
                appStart.setL("en");
                break;
            case (2):
                appStart.setL("pt");
                break;
        }

        // 渠道号   从哪个渠道来的
        appStart.setSr(getRandomChar(1));

        // area
        flag = rand.nextInt(2);
        switch (flag) {
            case 0:
                appStart.setAr("BR");
                break;
            case 1:
                appStart.setAr("MX");
                break;
        }

        // 手机品牌 ba ,手机型号 md,就取2位数字了
        flag = rand.nextInt(3);
        switch (flag) {
            case 0:
                appStart.setBa("Sumsung");
                appStart.setMd("sumsung-" + rand.nextInt(20));
                break;
            case 1:
                appStart.setBa("Huawei");
                appStart.setMd("Huawei-" + rand.nextInt(20));
                break;
            case 2:
                appStart.setBa("HTC");
                appStart.setMd("HTC-" + rand.nextInt(20));
                break;
        }

        // 嵌入sdk的版本
        appStart.setSv("V2." + rand.nextInt(10) + "." + rand.nextInt(10));
        // gmail
        appStart.setG(getRandomCharAndNumr(8) + "@gmail.com");

        // 屏幕宽高 hw
        flag = rand.nextInt(4);
        switch (flag) {
            case 0:
                appStart.setHw("640*960");
                break;
            case 1:
                appStart.setHw("640*1136");
                break;
            case 2:
                appStart.setHw("750*1134");
                break;
            case 3:
                appStart.setHw("1080*1920");
                break;
        }

        // 客户端产生日志时间
        long millis = System.currentTimeMillis();
        appStart.setT("" + (millis - rand.nextInt(99999999)));

        // 手机网络模式 3G,4G,WIFI
        flag = rand.nextInt(3);
        switch (flag) {
            case 0:
                appStart.setNw("3G");
                break;
            case 1:
                appStart.setNw("4G");
                break;
            case 2:
                appStart.setNw("WIFI");
                break;
        }

        // 拉丁美洲 西经34°46′至西经117°09;北纬32°42′至南纬53°54′
        // 经度
        appStart.setLn((-34 - rand.nextInt(83) - rand.nextInt(60) / 10.0) + "");
        // 纬度
        appStart.setLa((32 - rand.nextInt(85) - rand.nextInt(60) / 10.0) + "");

        // 入口
        flag = rand.nextInt(5) + 1;
        appStart.setEntry(flag + "");

        // 开屏广告类型
        flag = rand.nextInt(2) + 1;
        appStart.setOpen_ad_type(flag + "");

        // 状态
        flag = rand.nextInt(10) > 8 ? 2 : 1;
        appStart.setAction(flag + "");

        // 加载时长
        appStart.setLoading_time(rand.nextInt(20) + "");

        // 失败码
        flag = rand.nextInt(10);
        switch (flag) {
            case 1:
                appStart.setDetail("102");
                break;
            case 2:
                appStart.setDetail("201");
                break;
            case 3:
                appStart.setDetail("325");
                break;
            case 4:
                appStart.setDetail("433");
                break;
            case 5:
                appStart.setDetail("542");
                break;
            default:
                appStart.setDetail("");
                break;
        }

        // 扩展字段
        appStart.setExtend1("");

        return appStart;
    }

    /**
     * 消息通知
     */
    private static JSONObject generateNotification() {

        AppNotification appNotification = new AppNotification();

        int flag = rand.nextInt(4) + 1;

        // 动作
        appNotification.setAction(flag + "");

        // 通知id
        flag = rand.nextInt(4) + 1;
        appNotification.setType(flag + "");

        // 客户端弹时间
        appNotification.setAp_time((System.currentTimeMillis() - rand.nextInt(99999999)) + "");

        // 备用字段
        appNotification.setContent("");

        JSONObject jsonObject = (JSONObject) JSON.toJSON(appNotification);

        return packEventJson("notification", jsonObject);
    }

    /**
     * 后台活跃
     */
    private static JSONObject generateBackground() {

        AppActive appActive_background = new AppActive();

        // 启动源
        int flag = rand.nextInt(3) + 1;
        appActive_background.setActive_source(flag + "");

        JSONObject jsonObject = (JSONObject) JSON.toJSON(appActive_background);

        return packEventJson("active_background", jsonObject);
    }

    /**
     * 错误日志数据
     */
    private static JSONObject generateError() {

        AppErrorLog appErrorLog = new AppErrorLog();

        String[] errorBriefs = {"at cn.lift.dfdf.web.AbstractBaseController.validInbound(AbstractBaseController.java:72)", "at cn.lift.appIn.control.CommandUtil.getInfo(CommandUtil.java:67)"};        //错误摘要
        String[] errorDetails = {"java.lang.NullPointerException\\n    " + "at cn.lift.appIn.web.AbstractBaseController.validInbound(AbstractBaseController.java:72)\\n " + "at cn.lift.dfdf.web.AbstractBaseController.validInbound", "at cn.lift.dfdfdf.control.CommandUtil.getInfo(CommandUtil.java:67)\\n " + "at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\\n" + " at java.lang.reflect.Method.invoke(Method.java:606)\\n"};        //错误详情

        //错误摘要
        appErrorLog.setErrorBrief(errorBriefs[rand.nextInt(errorBriefs.length)]);
        //错误详情
        appErrorLog.setErrorDetail(errorDetails[rand.nextInt(errorDetails.length)]);

        JSONObject jsonObject = (JSONObject) JSON.toJSON(appErrorLog);

        return packEventJson("error", jsonObject);
    }

    /**
     * 为各个事件类型的公共字段(时间、事件类型、Json数据)拼接
     */
    private static JSONObject packEventJson(String eventName, JSONObject jsonObject) {

        JSONObject eventJson = new JSONObject();

        eventJson.put("ett", (System.currentTimeMillis() - rand.nextInt(99999999)) + "");
        eventJson.put("en", eventName);
        eventJson.put("kv", jsonObject);

        return eventJson;
    }

    /**
     * 获取随机字母组合
     *
     * @param length 字符串长度
     */
    private static String getRandomChar(Integer length) {

        StringBuilder str = new StringBuilder();
        Random random = new Random();

        for (int i = 0; i < length; i++) {
            // 字符串
            str.append((char) (65 + random.nextInt(26)));// 取得大写字母
        }

        return str.toString();
    }

    /**
     * 获取随机字母数字组合
     *
     * @param length 字符串长度
     */
    private static String getRandomCharAndNumr(Integer length) {

        StringBuilder str = new StringBuilder();
        Random random = new Random();

        for (int i = 0; i < length; i++) {

            boolean b = random.nextBoolean();

            if (b) { // 字符串
                // int choice = random.nextBoolean() ? 65 : 97; 取得65大写字母还是97小写字母
                str.append((char) (65 + random.nextInt(26)));// 取得大写字母
            } else { // 数字
                str.append(String.valueOf(random.nextInt(10)));
            }
        }

        return str.toString();
    }

    /**
     * 收藏
     */
    private static JSONObject generateFavorites() {

        AppFavorites favorites = new AppFavorites();

        favorites.setCourse_id(rand.nextInt(10));
        favorites.setUserid(rand.nextInt(10));
        favorites.setAdd_time((System.currentTimeMillis() - rand.nextInt(99999999)) + "");

        JSONObject jsonObject = (JSONObject) JSON.toJSON(favorites);

        return packEventJson("favorites", jsonObject);
    }

    /**
     * 点赞
     */
    private static JSONObject generatePraise() {

        AppPraise praise = new AppPraise();

        praise.setId(rand.nextInt(10));
        praise.setUserid(rand.nextInt(10));
        praise.setTarget_id(rand.nextInt(10));
        praise.setType(rand.nextInt(4) + 1);
        praise.setAdd_time((System.currentTimeMillis() - rand.nextInt(99999999)) + "");

        JSONObject jsonObject = (JSONObject) JSON.toJSON(praise);

        return packEventJson("praise", jsonObject);
    }

    /**
     * 评论
     */
    private static JSONObject generateComment() {

        AppComment comment = new AppComment();

        comment.setComment_id(rand.nextInt(10));
        comment.setUserid(rand.nextInt(10));
        comment.setP_comment_id(rand.nextInt(5));

        comment.setContent(getCONTENT());
        comment.setAddtime((System.currentTimeMillis() - rand.nextInt(99999999)) + "");

        comment.setOther_id(rand.nextInt(10));
        comment.setPraise_count(rand.nextInt(1000));
        comment.setReply_count(rand.nextInt(200));

        JSONObject jsonObject = (JSONObject) JSON.toJSON(comment);

        return packEventJson("comment", jsonObject);
    }

    /**
     * 生成单个汉字
     */
    private static char getRandomChar() {
        String str = "";
        int hightPos; //
        int lowPos;

        Random random = new Random();

        //随机生成汉子的两个字节
        hightPos = (176 + Math.abs(random.nextInt(39)));
        lowPos = (161 + Math.abs(random.nextInt(93)));

        byte[] b = new byte[2];
        b[0] = (Integer.valueOf(hightPos)).byteValue();
        b[1] = (Integer.valueOf(lowPos)).byteValue();

        try {
            str = new String(b, "GBK");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
            System.out.println("错误");
        }
        return str.charAt(0);
    }

    /**
     * 拼接成多个汉字
     */
    private static String getCONTENT() {

        StringBuilder str = new StringBuilder();

        for (int i = 0; i < rand.nextInt(100); i++) {
            str.append(getRandomChar());
        }
        return str.toString();
    }
}

2.4.5 Running a test

Note: deploy the log simulator on two servers. Each simulated log stream contains both start-up logs and event logs, so a Flume interceptor must split them by type, and accordingly two flume-ng agents are needed.

Package and upload the jar to the two server nodes to generate data for the tests that follow; here it lives under the user's test directory.

Two arguments control the generation rate and volume (below: one entry every 2 seconds, 1000 entries):

#control the interval and the count
nohup java -jar data-producer-1.0-SNAPSHOT-jar-with-dependencies.jar 2000 1000 &
#watch the logs
tail -F /root/logs/*.log

The data format can be inspected at www.json.cn
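A quick command-line alternative (a sketch; the exact file name depends on the current date):

# strip the "timestamp|" prefix of an event log line and pretty-print the JSON
tail -n 1 /root/logs/app-2020-05-20.log | sed 's/^[0-9]*|//' | python3 -m json.tool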

3 Creating the Kafka topics

  • Start-up log topic: topic_start
  • Event log topic: topic_event
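A minimal sketch with the stock Kafka CLI (the partition and replication numbers are assumptions, not from the original):

kafka-topics --create --zookeeper cdh01.cm:2181 --topic topic_start --partitions 3 --replication-factor 2
kafka-topics --create --zookeeper cdh01.cm:2181 --topic topic_event --partitions 3 --replication-factor 2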

4 Flume setup

There are two groups of Flume agents.

Group 1 collects the server logs and sends them to the two Kafka topics through Kafka Channels, with interceptors splitting start-up logs from event logs.

Group 2 consumes the Kafka data, buffers it in File Channels, and finally writes it to HDFS for storage.

4.1 Flume: File -> Kafka configuration

  • vim /root/test/file-flume-kafka.conf
#1 define the components
a1.sources=r1
a1.channels=c1 c2

# 2 source config: type; positionFile tracks read offsets; filegroups lists the directories to read; app.+ matches file names; channels are the destinations
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /root/test/flume/log_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /root/logs/app.+
a1.sources.r1.fileHeader = true
a1.sources.r1.channels = c1 c2

#3 interceptors: the two custom interceptors below; a multiplexing selector routes by type via the topic header and the mapping rules
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = com.heaton.bigdata.flume.LogETLInterceptor$Builder
a1.sources.r1.interceptors.i2.type = com.heaton.bigdata.flume.LogTypeInterceptor$Builder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = topic
a1.sources.r1.selector.mapping.topic_start = c1
a1.sources.r1.selector.mapping.topic_event = c2

#4 channel config: Kafka Channels
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = cdh01.cm:9092,cdh02.cm:9092,cdh03.cm:9092
a1.channels.c1.kafka.topic = topic_start
a1.channels.c1.parseAsFlumeEvent = false
a1.channels.c1.kafka.consumer.group.id = flume-consumer

a1.channels.c2.type =org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c2.kafka.bootstrap.servers = cdh01.cm:9092,cdh02.cm:9092,cdh03.cm:9092
a1.channels.c2.kafka.topic = topic_event
a1.channels.c2.parseAsFlumeEvent = false
a1.channels.c2.kafka.consumer.group.id = flume-consumer

Create this Flume configuration file on both log-producing server nodes.

LogETLInterceptor and LogTypeInterceptor are the custom interceptors.

4.2 Custom interceptors

These classes belong to the data-flume project.

  • LogUtils
import org.apache.commons.lang.math.NumberUtils;

public class LogUtils {
    public static boolean validateEvent(String log) {
        /** 服务器时间 | json
         1588319303710|{
         "cm":{
         "ln":"-51.5","sv":"V2.0.7","os":"8.0.8","g":"[email protected]","mid":"13",
         "nw":"4G","l":"en","vc":"7","hw":"640*960","ar":"MX","uid":"13","t":"1588291826938",
         "la":"-38.2","md":"Huawei-14","vn":"1.3.6","ba":"Huawei","sr":"Y"
         },
         "ap":"app",
         "et":[{
                "ett":"1588228193191","en":"ad","kv":{"activityId":"1","displayMills":"113201","entry":"3","action":"5","contentType":"0"}
                },{
                "ett":"1588300304713","en":"notification","kv":{"ap_time":"1588277440794","action":"2","type":"3","content":""}
                },{
                "ett":"1588249203743","en":"active_background","kv":{"active_source":"3"}
                },{
                "ett":"1588254200122","en":"favorites","kv":{"course_id":5,"id":0,"add_time":"1588264138625","userid":0}
                },{
                "ett":"1588281152824","en":"praise","kv":{"target_id":4,"id":3,"type":3,"add_time":"1588307696417","userid":8}
                }]
         }
         */
        // 1 切割
        String[] logContents = log.split("\\|");
        // 2 校验
        if (logContents.length != 2) {
            return false;
        }
        //3 校验服务器时间
        if (logContents[0].length() != 13 || !NumberUtils.isDigits(logContents[0])) {
            return false;
        }
        // 4 校验 json
        if (!logContents[1].trim().startsWith("{")
                || !logContents[1].trim().endsWith("}")) {
            return false;
        }
        return true;
    }

    public static boolean validateStart(String log) {
        /**
         {
         "action":"1","ar":"MX","ba":"HTC","detail":"201","en":"start","entry":"4","extend1":"",
         "g":"[email protected]","hw":"750*1134","l":"pt","la":"-29.7","ln":"-48.1","loading_time":"0",
         "md":"HTC-18","mid":"14","nw":"3G","open_ad_type":"2","os":"8.0.8","sr":"D","sv":"V2.8.2",
         "t":"1588251833523","uid":"14","vc":"15","vn":"1.2.9"
         }
        */
        if (log == null) {
            return false;
        }
        // 校验 json
        if (!log.trim().startsWith("{") || !log.trim().endsWith("}")) {
            return false;
        }
        return true;
    }
}
  • LogETLInterceptor
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class LogETLInterceptor implements Interceptor {
    @Override
    public void initialize() {
    //初始化
    }

    @Override
    public Event intercept(Event event) {
        // 1 获取数据
        byte[] body = event.getBody();
        String log = new String(body, Charset.forName("UTF-8"));
        // 2 判断数据类型并向 Header 中赋值
        if (log.contains("start")) {
            if (LogUtils.validateStart(log)) {
                return event;
            }
        } else {
            if (LogUtils.validateEvent(log)) {
                return event;
            }
        }
        // 3 返回校验结果
        return null;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        ArrayList<Event> interceptors = new ArrayList<>();
        for (Event event : events) {
            Event intercept1 = intercept(event);
            if (intercept1 != null) {
                interceptors.add(intercept1);
            }
        }
        return interceptors;
    }

    @Override
    public void close() {
    //关闭
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new LogETLInterceptor();
        }

        @Override
        public void configure(Context context) {

        }
    }
}
  • LogTypeInterceptor
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LogTypeInterceptor implements Interceptor {
    @Override
    public void initialize() {

    }

    @Override
    public Event intercept(Event event) {
        // 区分日志类型: body header
        // 1 获取 body 数据
        byte[] body = event.getBody();
        String log = new String(body, Charset.forName("UTF-8"));
        // 2 获取 header
        Map<String, String> headers = event.getHeaders();
        // 3 判断数据类型并向 Header 中赋值
        if (log.contains("start")) {
            headers.put("topic", "topic_start");
        } else {
            headers.put("topic", "topic_event");
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        ArrayList<Event> interceptors = new ArrayList<>();
        for (Event event : events) {
            Event intercept1 = intercept(event);
            interceptors.add(intercept1);
        }
        return interceptors;
    }

    @Override
    public void close() {

    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new LogTypeInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}

Package the project and place the jar in Flume's lib directory (on all nodes):

CDH reference path: /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/flume-ng/lib
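For example (a sketch; the jar name depends on your build):

for i in cdh01.cm cdh02.cm cdh03.cm
do
	scp data-flume-1.0-SNAPSHOT.jar $i:/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/flume-ng/lib/
done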

4.3 Flume start/stop script

  • vim /root/log-kafka-flume.sh
#! /bin/bash

case $1 in
"start"){
	for i in cdh02.cm cdh03.cm
	do
		echo " --------启动 $i 消费 flume-------"
		ssh $i "nohup flume-ng agent --conf-file /root/test/file-flume-kafka.conf --name a1 -Dflume.root.logger=INFO,LOGFILE >/root/test/file-flume-kafka.log 2>&1 &"
	done
};;
"stop"){
	for i in cdh02.cm cdh03.cm
	do
		echo " --------停止 $i 消费 flume-------"
		ssh $i "ps -ef | grep file-flume-kafka | grep -v grep |awk '{print \$2}' | xargs kill"
	done
};;
esac
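Once the collection agents are running, you can sanity-check that events are reaching Kafka (assuming the default broker port):

kafka-console-consumer --bootstrap-server cdh01.cm:9092 --topic topic_start --from-beginning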


4.4 Flume: Kafka -> HDFS configuration

Prepare this on the third server.

  • vim /root/test/kafka-flume-hdfs.conf
## components
a1.sources=r1 r2
a1.channels=c1 c2 
a1.sinks=k1 k2
    
## Kafka-source1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers= cdh01.cm:9092,cdh02.cm:9092,cdh03.cm:9092
a1.sources.r1.kafka.topics = topic_start
## Kafka-source2
a1.sources.r2.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r2.batchSize = 5000
a1.sources.r2.batchDurationMillis = 2000
a1.sources.r2.kafka.bootstrap.servers = cdh01.cm:9092,cdh02.cm:9092,cdh03.cm:9092
a1.sources.r2.kafka.topics = topic_event
    
## channel1
a1.channels.c1.type = file
##checkpoint path
a1.channels.c1.checkpointDir=/root/test/flume/checkpoint/behavior1
##data persistence path
a1.channels.c1.dataDirs = /root/test/flume/data/behavior1/
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1000000
a1.channels.c1.keep-alive = 6
## channel2
a1.channels.c2.type = file
##checkpoint path
a1.channels.c2.checkpointDir=/root/test/flume/checkpoint/behavior2
##data persistence path
a1.channels.c2.dataDirs = /root/test/flume/data/behavior2/
a1.channels.c2.maxFileSize = 2146435071
a1.channels.c2.capacity = 1000000
a1.channels.c2.keep-alive = 6
    
## HDFS-sink1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=/origin_data/gmall/log/topic_start/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = logstart-
## HDFS-sink2       
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = /origin_data/gmall/log/topic_event/%Y-%m-%d
a1.sinks.k2.hdfs.filePrefix = logevent-
    
## avoid generating lots of small files
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k2.hdfs.rollInterval = 50
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollCount = 0
## output files are compressed streams (snappy)
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k2.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = snappy
a1.sinks.k2.hdfs.codeC = snappy
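## addition (not in the original config): if the Kafka records carry no
## timestamp header, the %Y-%m-%d escapes in hdfs.path cannot resolve;
## using the sink host's clock is a common safeguard
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k2.hdfs.useLocalTimeStamp = true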
    
## wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel= c1
a1.sources.r2.channels = c2
a1.sinks.k2.channel= c2

4.5 Flume start/stop script

Prepare this on the third server.

  • vim /root/test/kafka-hdfs-flume.sh
#! /bin/bash

case $1 in
"start"){
	for i in cdh01.cm
	do
		echo " --------启动 $i 消费 flume-------"
		ssh $i "nohup flume-ng agent --conf-file /root/test/kafka-flume-hdfs.conf --name a1 -Dflume.root.logger=INFO,LOGFILE >/root/test/kafka-flume-hdfs.log 2>&1 &"
	done
};;
"stop"){
	for i in cdh01.cm
	do
		echo " --------停止 $i 消费 flume-------"
		ssh $i "ps -ef | grep kafka-flume-hdfs | grep -v grep |awk '{print \$2}' | xargs kill"
	done
};;
esac
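After both Flume groups have been running for a while, verify that files are landing in HDFS:

hdfs dfs -ls /origin_data/gmall/log/topic_start/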


5 Business data

This module supports enterprise reporting and decision making: it provides the data foundation for producing reports quickly over large data volumes and for quickly answering ad-hoc business questions. Separating offline from real-time workloads lets the offline pipeline present data in a straightforward, manageable way and lays a solid foundation for a real-time solution.

5.1 E-commerce business flow

(Figure 7)

5.2 SKU and SPU

(Figure 8)

  • SKU (Stock Keeping Unit): the basic unit of inventory, now commonly used as shorthand for a product's unified code; every product variant has a unique SKU.
  • SPU (Standard Product Unit): the smallest unit of aggregated product information, a reusable, searchable set of standardized attributes.
  • In short: the Black Shark 3 phone is an SPU; an armor-gray one with 256 GB of storage is a SKU.

5.3 Business table schemas

5.3.1 Order table (order_info)

5.3.2 Order detail table (order_detail)

(Figure 9)

5.3.3 SKU product table (sku_info)

(Figure 10)

5.3.4 User table (user_info)

5.3.5 Level-1 category table (base_category1)

5.3.6 Level-2 category table (base_category2)

5.3.7 Level-3 category table (base_category3)

5.3.8 Payment flow table (payment_info)

5.3.9 Province table (base_province)

5.3.10 Region table (base_region)

5.3.11 Brand table (base_trademark)

5.3.12 Order status table (order_status_log)

5.3.13 SPU product table (spu_info)

5.3.14 Product review table (comment_info)

5.3.15 Order refund table (order_refund_info)

5.3.16 Shopping cart table (cart_info)

5.3.17 Product favorites table (favor_info)

(Figure 11)

5.3.18 Coupon usage table (coupon_use)

(Figure 12)

5.3.19 Coupon table (coupon_info)

5.3.20 Activity table (activity_info)

5.3.21 Activity-order link table (activity_order)

5.3.22 Discount rule table (activity_rule)

5.3.23 Code dictionary table (base_dic)

5.3.24 Activity SKU table (activity_sku)

5.4 Time table schemas

5.4.1 Date table (date_info)

5.4.2 Holiday table (holiday_info)

5.4.3 Holiday year table (holiday_year)

6 Sync strategies and warehouse layers

Data sync strategies come in several types: full, incremental, new-and-changed, and special.

  • Full table: one partition per day, holding a complete snapshot.

  • Incremental table: one partition per day, holding only the rows added that day.

  • New-and-changed table: one partition per day, holding the rows added or modified that day.

  • Special table: no partitions; stored once.

6.1 Full strategy

A complete copy of the table is stored every day, one partition per day.

Suitable for: tables with modest volume that see inserts or updates.

Examples: brand table, code dictionary, product category tables, discount rule table, activity table, product tables, cart table, favorites table, SKU/SPU tables.

6.2 Incremental strategy

Only that day's new rows are stored, one partition per day.

Suitable for: large tables that only ever receive inserts.

Examples: refund table, order status table, payment flow table, order detail table, activity-order link table, product review table.

6.3 New-and-changed strategy

Rows whose creation time or modification time falls on the current day are stored, one partition per day.

Suitable for: large tables with both inserts and updates.

Examples: user table, order table, coupon usage table.

6.4 Special strategy

Some special dimension tables need not follow the strategies above; they are synced into the warehouse once and not refreshed unless the data changes.

Suitable for: tables whose data almost never changes.

1. Static real-world dimensions: unchanging dimensions of the physical world (gender, region, ethnicity, political affiliation, shoe sizes) can be stored as a single fixed copy.

2. Date dimension: date data can be imported once for a year or several years.

3. Geographic dimension: province table, region table.

6.5 Choosing strategies for the business tables

Since even the "special" tables can change slowly (borders shift, so the region table may change), this case also uses the partitioned full-sync strategy for them.

6.6 Warehouse layers

  • Why layer the warehouse:
    • Simplicity: break a complex job into layers, each handling its own task, which makes problems easier to localize.
    • Less duplicated work: standardized layers and intermediate data greatly reduce recomputation and increase reuse of results.
    • Isolation: decouple the raw data from the statistics, whether for data anomalies or data sensitivity.
    • Dimensional modeling is generally done at the DWD layer.
  • ODS: operational data store; the raw data as loaded.
  • DWD: data warehouse detail; ODS data cleansed (nulls and dirty data removed, types converted, etc.), dimensions degenerated, sensitive fields masked.
  • DWS: data warehouse service; DWD aggregated by day.
  • DWT: data warehouse topic; DWS aggregated by subject area.
  • ADS: application data store; feeds the various analytical reports.

7 Syncing data with Sqoop

Sqoop caveats:

Hive stores NULL as "\N" on disk, while MySQL stores NULL simply as NULL; the two must be reconciled to keep both ends consistent.

  • When exporting data, use --input-null-string and --input-null-non-string
  • When importing data, use --null-string and --null-non-string

The approach in this case: Sqoop extracts the MySQL data and lands it on HDFS as parquet files; Hive ODS tables are then created over the corresponding data.

DolphinScheduler schedules the scripts. An import sketch follows the type-mapping table below.

  • Type mapping between MySQL and Hive for Sqoop loads

| MySQL type | Hive ODS type | Hive DWD-ADS type |
| --- | --- | --- |
| tinyint | tinyint | tinyint |
| int | int | int |
| bigint | bigint | bigint |
| varchar | string | string |
| datetime | bigint | string |
| bit | boolean | int |
| double | double | double |
| decimal | decimal | decimal |
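A sketch of a daily new-and-changed import (the database name, credentials, and $do_date variable are assumptions, and exact flag behavior can vary by Sqoop version; the null-handling flags are the ones discussed above):

sqoop import \
  --connect jdbc:mysql://cdh01.cm:3306/mall \
  --username root \
  --password root \
  --query "select * from order_info where (date_format(create_time,'%Y-%m-%d')='$do_date' or date_format(operate_time,'%Y-%m-%d')='$do_date') and \$CONDITIONS" \
  --target-dir /origin_data/mall/db/order_info/$do_date \
  --delete-target-dir \
  --num-mappers 1 \
  --as-parquetfile \
  --null-string '\\N' \
  --null-non-string '\\N'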

8 Building the ODS layer

8.1 Creating the ODS tables

Create the ods database in Hive and a data source in DolphinScheduler; the Hive database must be selected when building the DAG.

While at it, create the dwd, dws, dwt, and ads databases as well, as sketched below.

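A minimal sketch of the one-time database creation:

hive -e "
create database if not exists ods;
create database if not exists dwd;
create database if not exists dws;
create database if not exists dwt;
create database if not exists ads;"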
  1. base_dic
drop table if exists ods.mall__base_dic;

CREATE EXTERNAL TABLE `ods.mall__base_dic`(
  `dic_code` string COMMENT '编号',
  `dic_name` string COMMENT '编码名称',
  `parent_code`  string COMMENT '父编号',
  `create_time` bigint COMMENT '创建日期',
  `operate_time` bigint COMMENT '修改日期'
  ) COMMENT '编码字典表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/base_dic/'
tblproperties ("parquet.compression"="snappy") 
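After Sqoop lands a day's files, the data is moved under the table and the partition registered; a sketch, assuming the target-directory convention from the section 7 example and a hypothetical date:

hive -e "load data inpath '/origin_data/mall/db/base_dic/2020-05-20' overwrite into table ods.mall__base_dic partition(dt='2020-05-20');"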
  2. base_trademark
drop table if exists ods.mall__base_trademark;

CREATE EXTERNAL TABLE `ods.mall__base_trademark`(
  `tm_id` string COMMENT '品牌id',
  `tm_name` string COMMENT '品牌名称'
  ) COMMENT '品牌表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/base_trademark/'
tblproperties ("parquet.compression"="snappy")
  3. base_category3
drop table if exists ods.mall__base_category3;

CREATE EXTERNAL TABLE `ods.mall__base_category3`(
  `id` bigint COMMENT '编号',
  `name` string COMMENT '三级分类名称',
  `category2_id` bigint COMMENT '二级分类编号'
  ) COMMENT '三级分类表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/base_category3/'
tblproperties ("parquet.compression"="snappy") 

  4. base_category2
drop table if exists ods.mall__base_category2;

CREATE EXTERNAL TABLE `ods.mall__base_category2`(
  `id` bigint COMMENT '编号',
  `name` string COMMENT '二级分类名称',
  `category1_id` bigint COMMENT '一级分类编号'
  ) COMMENT '二级分类表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/base_category2/'
tblproperties ("parquet.compression"="snappy") 
  5. base_category1
drop table if exists ods.mall__base_category1;

CREATE EXTERNAL TABLE `ods.mall__base_category1`(
  `id` bigint COMMENT '编号',
  `name` string COMMENT '分类名称'
  ) COMMENT '一级分类表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/base_category1/'
tblproperties ("parquet.compression"="snappy") 
  6. activity_rule
drop table if exists ods.mall__activity_rule;

CREATE EXTERNAL TABLE `ods.mall__activity_rule`(
  `id` int COMMENT '编号',
  `activity_id` int COMMENT '类型',
  `condition_amount` decimal(16,2) COMMENT '满减金额',
  `condition_num` bigint COMMENT '满减件数',
  `benefit_amount` decimal(16,2) COMMENT '优惠金额',
  `benefit_discount` bigint COMMENT '优惠折扣',
  `benefit_level` bigint COMMENT '优惠级别'
  ) COMMENT '优惠规则'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/activity_rule/'
tblproperties ("parquet.compression"="snappy") 
  7. activity_info
drop table if exists ods.mall__activity_info;

CREATE EXTERNAL TABLE `ods.mall__activity_info`(
  `id` bigint COMMENT '活动id',
  `activity_name` string COMMENT '活动名称',
  `activity_type` string COMMENT '活动类型',
  `start_time` bigint COMMENT '开始时间',
  `end_time` bigint COMMENT '结束时间',
  `create_time` bigint COMMENT '创建时间'
  ) COMMENT '活动表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/activity_info/'
tblproperties ("parquet.compression"="snappy") 
  8. activity_sku
drop table if exists ods.mall__activity_sku;

CREATE EXTERNAL TABLE `ods.mall__activity_sku`(
  `id` bigint COMMENT '编号',
  `activity_id` bigint COMMENT '活动id',
  `sku_id` bigint COMMENT 'sku_id',
  `create_time` bigint COMMENT '创建时间'
  ) COMMENT '活动参与商品'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/activity_sku/'
tblproperties ("parquet.compression"="snappy") 
  9. cart_info
drop table if exists ods.mall__cart_info;

CREATE EXTERNAL TABLE `ods.mall__cart_info`(
  `id` bigint COMMENT '编号',
  `user_id` bigint COMMENT '用户id',
  `sku_id` bigint COMMENT 'sku_id',
  `cart_price` decimal(10,2) COMMENT '放入购物车时价格',
  `sku_num` bigint COMMENT '数量',
  `sku_name` string COMMENT 'sku名称',
  `create_time` bigint COMMENT '创建时间',
  `operate_time` bigint COMMENT '修改时间',
  `is_ordered` bigint COMMENT '是否已经下单',
  `order_time` bigint COMMENT '下单时间'
  ) COMMENT '购物车表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/cart_info/'
tblproperties ("parquet.compression"="snappy") 
  1. favor_info
drop table if exists ods.mall__favor_info

CREATE EXTERNAL TABLE `ods.mall__favor_info`(
  `id` bigint COMMENT '编号',
  `user_id` bigint COMMENT '用户id',
  `sku_id` bigint COMMENT 'sku_id',
  `spu_id` bigint COMMENT '商品id',
  `is_cancel` string COMMENT '是否已取消 0 正常 1 已取消',
  `create_time` bigint COMMENT '创建时间',
  `cancel_time` bigint COMMENT '修改时间'
  ) COMMENT '商品收藏表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/favor_info/'
tblproperties ("parquet.compression"="snappy") 
  1. coupon_info
drop table if exists ods.mall__coupon_info

CREATE EXTERNAL TABLE `ods.mall__coupon_info`(
  `id` bigint COMMENT '购物券编号',
  `coupon_name` string COMMENT '购物券名称',
  `coupon_type` string COMMENT '购物券类型 1 现金券 2 折扣券 3 满减券 4 满件打折券',
  `condition_amount` decimal(10,2) COMMENT '满额数',
  `condition_num` bigint COMMENT '满件数',
  `activity_id` bigint COMMENT '活动编号',
  `benefit_amount` decimal(16,2) COMMENT '减金额',
  `benefit_discount` bigint COMMENT '折扣',
  `create_time` bigint COMMENT '创建时间',
  `range_type` string COMMENT '范围类型 1、商品 2、品类 3、品牌',
  `spu_id` bigint COMMENT '商品id',
  `tm_id` bigint COMMENT '品牌id',
  `category3_id` bigint COMMENT '品类id',
  `limit_num` int COMMENT '最多领用次数',
  `operate_time` bigint COMMENT '修改时间',
  `expire_time` bigint COMMENT '过期时间'
  ) COMMENT '优惠券表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/coupon_info/'
tblproperties ("parquet.compression"="snappy") 
  1. sku_info
drop table if exists ods.mall__sku_info

CREATE EXTERNAL TABLE `ods.mall__sku_info`(
  `id` bigint COMMENT 'skuid',
  `spu_id` bigint COMMENT 'spuid',
  `price` decimal(10,0) COMMENT '价格',
  `sku_name` string COMMENT 'sku名称',
  `sku_desc` string COMMENT '商品规格描述',
  `weight` decimal(10,2) COMMENT '重量',
  `tm_id` bigint COMMENT '品牌',
  `category3_id` bigint COMMENT '三级分类id',
  `create_time` bigint COMMENT '创建时间'
  ) COMMENT '库存单元表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/sku_info/'
tblproperties ("parquet.compression"="snappy") 
  1. spu_info
drop table if exists ods.mall__spu_info

CREATE EXTERNAL TABLE `ods.mall__spu_info`(
  `id` bigint COMMENT '商品id',
  `spu_name` string COMMENT '商品名称',
  `category3_id` bigint COMMENT '三级分类id',
  `tm_id` bigint COMMENT '品牌id'
  ) COMMENT '商品表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/spu_info/'
tblproperties ("parquet.compression"="snappy") 
  1. base_province
drop table if exists ods.mall__base_province

CREATE EXTERNAL TABLE `ods.mall__base_province`(
  `id` bigint COMMENT 'id',
  `name` string COMMENT '省名称',
  `region_id` string COMMENT '大区id',
  `area_code` string COMMENT '行政区位码',
  `iso_code` string COMMENT '国际编码'
  ) COMMENT '省份表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/base_province/'
tblproperties ("parquet.compression"="snappy") 
  1. base_region
drop table if exists ods.mall__base_region

CREATE EXTERNAL TABLE `ods.mall__base_region`(
  `id` string COMMENT '大区id',
  `region_name` string COMMENT '大区名称'
  ) COMMENT '地区表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/base_region/'
tblproperties ("parquet.compression"="snappy") 
  1. refund_info
drop table if exists ods.mall__order_refund_info

CREATE EXTERNAL TABLE `ods.mall__order_refund_info`(
  `id` bigint COMMENT '编号',
  `user_id` bigint COMMENT '用户id',
  `order_id` bigint COMMENT '订单编号',
  `sku_id` bigint COMMENT 'skuid',
  `refund_type` string COMMENT '退款类型',
  `refund_num` bigint COMMENT '退货件数',
  `refund_amount` decimal(16,2) COMMENT '退款金额',
  `refund_reason_type` string COMMENT '原因类型',
  `create_time` bigint COMMENT '创建时间'
  ) COMMENT '退单表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/order_refund_info/'
tblproperties ("parquet.compression"="snappy") 
  1. order_status_log
drop table if exists ods.mall__order_status_log

CREATE EXTERNAL TABLE `ods.mall__order_status_log`(
  `id` bigint COMMENT '编号',
  `order_id` bigint COMMENT '订单编号',
  `order_status` string COMMENT '订单状态',
  `operate_time` bigint COMMENT '操作时间'
  ) COMMENT '订单状态表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/order_status_log/'
tblproperties ("parquet.compression"="snappy") 
  1. payment_info
drop table if exists ods.mall__payment_info

CREATE EXTERNAL TABLE `ods.mall__payment_info`(
  `id` bigint COMMENT '编号',
  `out_trade_no` string COMMENT '对外业务编号',
  `order_id` bigint COMMENT '订单编号',
  `user_id` bigint COMMENT '用户编号',
  `alipay_trade_no` string COMMENT '支付宝交易流水编号',
  `total_amount` decimal(16,2) COMMENT '支付金额',
  `subject` string COMMENT '交易内容',
  `payment_type` string COMMENT '支付方式',
  `payment_time` bigint COMMENT '支付时间'
  ) COMMENT '支付流水表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/payment_info/'
tblproperties ("parquet.compression"="snappy") 
  1. order_detail
drop table if exists ods.mall__order_detail

CREATE EXTERNAL TABLE `ods.mall__order_detail`(
  `id` bigint COMMENT '编号',
  `order_id` bigint COMMENT '订单编号',
  `user_id` bigint COMMENT '用户id',
  `sku_id` bigint COMMENT 'sku_id',
  `sku_name` string COMMENT 'sku名称',
  `order_price` decimal(10,2) COMMENT '购买价格(下单时sku价格)',
  `sku_num` string COMMENT '购买个数',
  `create_time` bigint COMMENT '创建时间'
  ) COMMENT '订单明细表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/order_detail/'
tblproperties ("parquet.compression"="snappy") 
  1. activity_order
drop table if exists ods.mall__activity_order

CREATE EXTERNAL TABLE `ods.mall__activity_order`(
  `id` bigint COMMENT '编号',
  `activity_id` bigint COMMENT '活动id',
  `order_id` bigint COMMENT '订单编号',
  `create_time` bigint COMMENT '发生日期'
  ) COMMENT '活动与订单关联表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/activity_order/'
tblproperties ("parquet.compression"="snappy") 
  1. comment_info
drop table if exists ods.mall__comment_info

CREATE EXTERNAL TABLE `ods.mall__comment_info`(
  `id` bigint COMMENT '编号',
  `user_id` bigint COMMENT '用户名称',
  `sku_id` bigint COMMENT 'skuid',
  `spu_id` bigint COMMENT '商品id',
  `order_id` bigint COMMENT '订单编号',
  `appraise` string COMMENT '评价 1 好评 2 中评 3 差评',
  `comment_txt` string COMMENT '评价内容',
  `create_time` bigint COMMENT '创建时间'
  ) COMMENT '商品评论表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/comment_info/'
tblproperties ("parquet.compression"="snappy") 
  1. coupon_use
drop table if exists ods.mall__coupon_use

CREATE EXTERNAL TABLE `ods.mall__coupon_use`(
  `id` bigint COMMENT '编号',
  `coupon_id` bigint COMMENT '购物券ID',
  `user_id` bigint COMMENT '用户ID',
  `order_id` bigint COMMENT '订单ID',
  `coupon_status` string COMMENT '购物券状态',
  `get_time` bigint COMMENT '领券时间',
  `using_time` bigint COMMENT '使用时间',
  `used_time` bigint COMMENT '过期时间'
  ) COMMENT '优惠券领用表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/coupon_use/'
tblproperties ("parquet.compression"="snappy")
  1. user_info
drop table if exists ods.mall__user_info

CREATE EXTERNAL TABLE `ods.mall__user_info`(
  `id` bigint COMMENT '编号',
  `name` string COMMENT '用户姓名',
  `email` string COMMENT '邮箱',
  `user_level` string COMMENT '用户级别',
  `birthday` bigint COMMENT '用户生日',
  `gender` string COMMENT '性别 M男,F女',
  `create_time` bigint COMMENT '创建时间',
  `operate_time` bigint COMMENT '修改时间'
  ) COMMENT '用户表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/user_info/'
tblproperties ("parquet.compression"="snappy") 
  1. order_info
drop table if exists ods.mall__order_info

CREATE EXTERNAL TABLE `ods.mall__order_info`(
  `id` bigint COMMENT '编号',
  `final_total_amount` decimal(16,2) COMMENT '总金额',
  `order_status` string COMMENT '订单状态',
  `user_id` bigint COMMENT '用户id',
  `out_trade_no` string COMMENT '订单交易编号(第三方支付用)',
  `create_time` bigint COMMENT '创建时间',
  `operate_time` bigint COMMENT '操作时间',
  `province_id` int COMMENT '地区',
  `benefit_reduce_amount` decimal(16,2) COMMENT '优惠金额',
  `original_total_amount` decimal(16,2) COMMENT '原价金额',
  `feight_fee` decimal(16,2) COMMENT '运费'
  ) COMMENT '订单表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/order_info/'
tblproperties ("parquet.compression"="snappy")
  1. start_log

This is the app-launch tracking log table.

drop table if exists ods.mall__start_log

CREATE EXTERNAL TABLE `ods.mall__start_log`(
  `line` string COMMENT '启动日志'
  ) COMMENT '启动日志表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
location '/warehouse/ods/mall/start_log/'
  1. event_log

This is the event tracking log table.

drop table if exists ods.mall__event_log

CREATE EXTERNAL TABLE `ods.mall__event_log`(
  `line` string COMMENT '事件日志'
  ) COMMENT '事件日志表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
location '/warehouse/ods/mall/event_log/'
  1. date_info

This is the date table.

drop table if exists ods.mall__date_info

CREATE EXTERNAL TABLE `ods.mall__date_info`(
  `date_id` int COMMENT '日',
  `week_id` int COMMENT '周',
  `week_day` int COMMENT '周的第几天',
  `day` int COMMENT '每月的第几天',
  `month` int COMMENT '第几月',
  `quarter` int COMMENT '第几季度',
  `year` int COMMENT '年',
  `is_workday` int COMMENT '是否工作日',
  `holiday_id` int COMMENT '节假日id'
  ) COMMENT '时间维度表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ods/mall/date_info/'
tblproperties ("parquet.compression"="snappy") 
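After creating the ODS tables, a quick sanity check confirms the DDL took effect and, once data is loaded, that partitions were registered. A minimal sketch (base_trademark is just an example table):

describe formatted ods.mall__base_trademark;
show partitions ods.mall__base_trademark;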

8.2 MySQL data extraction

  • Base Sqoop extraction script
#!/bin/bash

db_date=${date}
mysql_db_name=${db_name}
mysql_db_addr=${db_addr}
mysql_db_user=${db_user}
mysql_db_password=${db_password}

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

echo "日期:"$db_date
echo "mysql库名:"$mysql_db_name


import_data() {
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/sqoop import \
--connect jdbc:mysql://$mysql_db_addr:3306/$mysql_db_name?tinyInt1isBit=false \
--username $mysql_db_user \
--password $mysql_db_password \
--target-dir /origin_data/$mysql_db_name/$1/$db_date \
--delete-target-dir \
--num-mappers 1 \
--null-string '' \
--null-non-string '\\n' \
--fields-terminated-by "\t" \
--query "$2"' and $CONDITIONS;' \
--as-parquetfile 
}
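To make the quoting explicit: the shell appends ' and $CONDITIONS;' to the per-table snippet passed in $2, and Sqoop substitutes its own split predicate for the literal $CONDITIONS token. Roughly, for the base_trademark snippet shown below, the statement Sqoop runs against MySQL looks like this (with --num-mappers 1, $CONDITIONS becomes (1 = 0) for the metadata probe and (1 = 1) for the actual pull):

-- sketch of the composed query Sqoop executes for the base_trademark full load
select
tm_id,
tm_name
from base_trademark
where 1=1 and (1 = 1);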
  • DolphinScheduler global parameters

大数据篇:数据仓库案例_第13张图片

date: defaults to yesterday if not passed
db_name: source database name
db_addr: database IP address
db_user: database username
db_password: database password

In the source data, records begin on 2020-03-15.

The per-table import snippets below are appended to the base script above and executed.

  • Full-load table snippets
import_data "base_dic" "select
dic_code,
dic_name,
parent_code,
create_time,
operate_time
from base_dic
where 1=1"

import_data "base_trademark" "select
tm_id,
tm_name
from base_trademark
where 1=1"

import_data "base_category3" "select
id,
name,
category2_id
from base_category3 where 1=1"

import_data "base_category2" "select
id,
name,
category1_id
from base_category2 where 1=1"

import_data "base_category1" "select
id,
name
from base_category1 where 1=1"

import_data "activity_rule" "select
id,
activity_id,
condition_amount,
condition_num,
benefit_amount,
benefit_discount,
benefit_level
from activity_rule
where 1=1"

import_data "activity_info" "select
id,
activity_name,
activity_type,
start_time,
end_time,
create_time
from activity_info
where 1=1"

import_data "activity_sku" "select
id,
activity_id,
sku_id,
create_time
FROM
activity_sku
where 1=1"

import_data "cart_info" "select
id,
user_id,
sku_id,
cart_price,
sku_num,
sku_name,
create_time,
operate_time,
is_ordered,
order_time
from cart_info
where 1=1"

import_data "favor_info" "select
id,
user_id,
sku_id,
spu_id,
is_cancel,
create_time,
cancel_time
from favor_info
where 1=1"

import_data "coupon_info" "select
id,
coupon_name,
coupon_type,
condition_amount,
condition_num,
activity_id,
benefit_amount,
benefit_discount,
create_time,
range_type,
spu_id,
tm_id,
category3_id,
limit_num,
operate_time,
expire_time
from coupon_info
where 1=1"

import_data "sku_info" "select
id,
spu_id,
price,
sku_name,
sku_desc,
weight,
tm_id,
category3_id,
create_time
from sku_info where 1=1"

import_data "spu_info" "select
id,
spu_name,
category3_id,
tm_id
from spu_info
where 1=1"
  • Special table snippets (imported once)
import_data "base_province" "select
id,
name,
region_id,
area_code,
iso_code
from base_province
where 1=1"

import_data "base_region" "select
id,
region_name
from base_region
where 1=1"

import_data "date_info" "select
date_id,
week_id,
week_day,
day,
month,
quarter,
year,
is_workday,
holiday_id
from date_info
where 1=1"
  • Incremental table snippets
import_data "order_refund_info" "select
id,
user_id,
order_id,
sku_id,
refund_type,
refund_num,
refund_amount,
refund_reason_type,
create_time
from order_refund_info
where
date_format(create_time,'%Y-%m-%d')='$db_date'"

import_data "order_status_log" "select
id,
order_id,
order_status,
operate_time
from order_status_log
where
date_format(operate_time,'%Y-%m-%d')='$db_date'"

import_data "payment_info" "select
id,
out_trade_no,
order_id,
user_id,
alipay_trade_no,
total_amount,
subject,
payment_type,
payment_time
from payment_info
where
DATE_FORMAT(payment_time,'%Y-%m-%d')='$db_date'"

import_data "order_detail" "select
od.id,
od.order_id,
oi.user_id,
od.sku_id,
od.sku_name,
od.order_price,
od.sku_num,
od.create_time
from order_detail od
join order_info oi
on od.order_id=oi.id
where
DATE_FORMAT(od.create_time,'%Y-%m-%d')='$db_date'"

import_data "activity_order" "select
id,
activity_id,
order_id,
create_time
from activity_order
where
date_format(create_time,'%Y-%m-%d')='$db_date'"

import_data "comment_info" "select
id,
user_id,
sku_id,
spu_id,
order_id,
appraise,
comment_txt,
create_time
from comment_info
where date_format(create_time,'%Y-%m-%d')='$db_date'"
  • Incremental-and-changed table snippets
import_data "coupon_use" "select
id,
coupon_id,
user_id,
order_id,
coupon_status,
get_time,
using_time,
used_time
from coupon_use
where (date_format(get_time,'%Y-%m-%d')='$db_date'
or date_format(using_time,'%Y-%m-%d')='$db_date'
or date_format(used_time,'%Y-%m-%d')='$db_date')"

import_data "user_info" "select
id,
name,
birthday,
gender,
email,
user_level,
create_time,
operate_time
from user_info
where (DATE_FORMAT(create_time,'%Y-%m-%d')='$db_date'
or DATE_FORMAT(operate_time,'%Y-%m-%d')='$db_date')"

import_data "order_info" "select
id,
final_total_amount,
order_status,
user_id,
out_trade_no,
create_time,
operate_time,
province_id,
benefit_reduce_amount,
original_total_amount,
feight_fee
from order_info
where (date_format(create_time,'%Y-%m-%d')='$db_date'
or date_format(operate_time,'%Y-%m-%d')='$db_date')"
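Note that these date filters are evaluated by MySQL at extraction time, which is why they use MySQL's date_format pattern '%Y-%m-%d' rather than Hive's 'yyyy-MM-dd'. A quick way to confirm the pattern on the source side:

-- run in MySQL, not Hive: both columns print the current date in yyyy-MM-dd form
select date_format(now(), '%Y-%m-%d'), curdate();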

8.3 Loading the ODS layer

  • Reuse the script below by changing $table_name

Note that the two tracking log tables are exported to different directories.

#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ods
table_name=base_dic
hive_table_name=$APP2.mall__$table_name

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
load data inpath '/origin_data/$APP1/$table_name/$db_date' OVERWRITE into table $hive_table_name partition(dt='$db_date');
"
$hive -e "$sql"
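After a run, a minimal verification that the partition landed with data (base_dic and the 2020-03-15 initial date are examples):

show partitions ods.mall__base_dic;
select count(*) from ods.mall__base_dic where dt='2020-03-15';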

9 Building the DWD layer

9.1 DWD layer (start and event logs)

大数据篇:数据仓库案例_第14张图片
大数据篇:数据仓库案例_第15张图片

9.1.1 Start log table

大数据篇:数据仓库案例_第16张图片

  • Create the table
drop table if exists dwd.mall__start_log

CREATE EXTERNAL TABLE `dwd.mall__start_log`(
  `mid_id` string COMMENT '设备唯一标识',
  `user_id` string COMMENT '用户标识',
  `version_code` string COMMENT '程序版本号',
  `version_name` string COMMENT '程序版本名',
  `lang` string COMMENT '系统语言',
  `source` string COMMENT '渠道号',
  `os` string COMMENT '系统版本',
  `area` string COMMENT '区域',
  `model` string COMMENT '手机型号',
  `brand` string COMMENT '手机品牌',
  `sdk_version` string COMMENT 'sdkVersion',
  `gmail` string COMMENT 'gmail',
  `height_width` string COMMENT '屏幕宽高',
  `app_time` string COMMENT '客户端日志产生时的时间',
  `network` string COMMENT '网络模式',
  `lng` string COMMENT '经度',
  `lat` string COMMENT '纬度',
  `entry` string COMMENT '入口: push=1,widget=2,icon=3,notification=4,lockscreen_widget=5',
  `open_ad_type` string COMMENT '开屏广告类型: 开屏原生广告=1, 开屏插屏广告=2',
  `action` string COMMENT '状态:成功=1 失败=2',
  `loading_time` string COMMENT '加载时长',
  `detail` string COMMENT '失败码',
  `extend1` string COMMENT '失败的 message'
  ) COMMENT '启动日志表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/start_log/'
tblproperties ("parquet.compression"="snappy") 
  • Load the data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
APP3=ods
table_name=start_log
hive_table_name=$APP2.mall__$table_name
hive_origin_table_name=$APP3.mall__$table_name

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	get_json_object(line,'$.mid') mid_id,
	get_json_object(line,'$.uid') user_id,
	get_json_object(line,'$.vc') version_code,
	get_json_object(line,'$.vn') version_name,
	get_json_object(line,'$.l') lang,
	get_json_object(line,'$.sr') source,
	get_json_object(line,'$.os') os,
	get_json_object(line,'$.ar') area,
	get_json_object(line,'$.md') model,
	get_json_object(line,'$.ba') brand,
	get_json_object(line,'$.sv') sdk_version,
	get_json_object(line,'$.g') gmail,
	get_json_object(line,'$.hw') height_width,
	get_json_object(line,'$.t') app_time,
	get_json_object(line,'$.nw') network,
	get_json_object(line,'$.ln') lng,
	get_json_object(line,'$.la') lat,
	get_json_object(line,'$.entry') entry,
	get_json_object(line,'$.open_ad_type') open_ad_type,
	get_json_object(line,'$.action') action,
	get_json_object(line,'$.loading_time') loading_time,
	get_json_object(line,'$.detail') detail,
	get_json_object(line,'$.extend1') extend1
from $hive_origin_table_name
where dt='$db_date';
"
$hive -e "$sql"
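Each get_json_object call extracts one field from the JSON line via a JSONPath-style expression. A quick illustration with an inline literal (values taken from the sample log):

select
get_json_object('{"mid":"13","uid":"13","nw":"4G"}', '$.mid') as mid_id,   -- 13
get_json_object('{"mid":"13","uid":"13","nw":"4G"}', '$.nw')  as network;  -- 4G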

9.1.2 Event log table

大数据篇:数据仓库案例_第17张图片

大数据篇:数据仓库案例_第18张图片

  • Create the table
drop table if exists dwd.mall__event_log

CREATE EXTERNAL TABLE `dwd.mall__event_log`(
  `mid_id` string COMMENT '设备唯一标识',
  `user_id` string COMMENT '用户标识',
  `version_code` string COMMENT '程序版本号',
  `version_name` string COMMENT '程序版本名',
  `lang` string COMMENT '系统语言',
  `source` string COMMENT '渠道号',
  `os` string COMMENT '系统版本',
  `area` string COMMENT '区域',
  `model` string COMMENT '手机型号',
  `brand` string COMMENT '手机品牌',
  `sdk_version` string COMMENT 'sdkVersion',
  `gmail` string COMMENT 'gmail',
  `height_width` string COMMENT '屏幕宽高',
  `app_time` string COMMENT '客户端日志产生时的时间',
  `network` string COMMENT '网络模式',
  `lng` string COMMENT '经度',
  `lat` string COMMENT '纬度',
  `event_name` string COMMENT '事件名称',
  `event_json` string COMMENT '事件详情',
  `server_time` string COMMENT '服务器时间'
  ) COMMENT '事件日志表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/event_log/'
tblproperties ("parquet.compression"="snappy") 
9.1.2.1 Building the UDF and UDTF
  • UDF


import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONException;
import org.json.JSONObject;

public class BaseFieldUDF extends UDF {
    public String evaluate(String line, String key) throws JSONException {
        String[] log = line.split("\\|");
        if (log.length != 2 || StringUtils.isBlank(log[1])) {
            return "";
        }
        JSONObject baseJson = new JSONObject(log[1].trim());
        String result = "";
        // 获取服务器时间
        if ("st".equals(key)) {
            result = log[0].trim();
        } else if ("et".equals(key)) {
        // 获取事件数组
            if (baseJson.has("et")) {
                result = baseJson.getString("et");
            }
        } else {
            JSONObject cm = baseJson.getJSONObject("cm");
        // 获取 key 对应公共字段的 value
            if (cm.has(key)) {
                result = cm.getString(key);
            }
        }
        return result;
    }

    public static void main(String[] args) throws JSONException {
        String line = "         1588319303710|{\n" +
                "        \"cm\":{\n" +
                "            \"ln\":\"-51.5\",\"sv\":\"V2.0.7\",\"os\":\"8.0.8\",\"g\":\"[email protected]\",\"mid\":\"13\",\n" +
                "                    \"nw\":\"4G\",\"l\":\"en\",\"vc\":\"7\",\"hw\":\"640*960\",\"ar\":\"MX\",\"uid\":\"13\",\"t\":\"1588291826938\",\n" +
                "                    \"la\":\"-38.2\",\"md\":\"Huawei-14\",\"vn\":\"1.3.6\",\"ba\":\"Huawei\",\"sr\":\"Y\"\n" +
                "        },\n" +
                "        \"ap\":\"app\",\n" +
                "                \"et\":[{\n" +
                "            \"ett\":\"1588228193191\",\"en\":\"ad\",\"kv\":{\"activityId\":\"1\",\"displayMills\":\"113201\",\"entry\":\"3\",\"action\":\"5\",\"contentType\":\"0\"}\n" +
                "        },{\n" +
                "            \"ett\":\"1588300304713\",\"en\":\"notification\",\"kv\":{\"ap_time\":\"1588277440794\",\"action\":\"2\",\"type\":\"3\",\"content\":\"\"}\n" +
                "        },{\n" +
                "            \"ett\":\"1588249203743\",\"en\":\"active_background\",\"kv\":{\"active_source\":\"3\"}\n" +
                "        },{\n" +
                "            \"ett\":\"1588225856101\",\"en\":\"comment\",\"kv\":{\"p_comment_id\":0,\"addtime\":\"1588263895040\",\"praise_count\":231,\"other_id\":5,\"comment_id\":5,\"reply_count\":62,\"userid\":7,\"content\":\"骸汞\"}\n" +
                "        },{\n" +
                "            \"ett\":\"1588254200122\",\"en\":\"favorites\",\"kv\":{\"course_id\":5,\"id\":0,\"add_time\":\"1588264138625\",\"userid\":0}\n" +
                "        },{\n" +
                "            \"ett\":\"1588281152824\",\"en\":\"praise\",\"kv\":{\"target_id\":4,\"id\":3,\"type\":3,\"add_time\":\"1588307696417\",\"userid\":8}\n" +
                "        }]\n" +
                "    }";
        String s = new BaseFieldUDF().evaluate(line, "mid");
        String ss = new BaseFieldUDF().evaluate(line, "st");
        String sss = new BaseFieldUDF().evaluate(line, "et");
        System.out.println(s);
        System.out.println(ss);
        System.out.println(sss);
    }
}

Output:

13
1588319303710
[{"ett":"1588228193191","en":"ad","kv":{"activityId":"1","displayMills":"113201","entry":"3","action":"5","contentType":"0"}},{"ett":"1588300304713","en":"notification","kv":{"ap_time":"1588277440794","action":"2","type":"3","content":""}},{"ett":"1588249203743","en":"active_background","kv":{"active_source":"3"}},{"ett":"1588225856101","en":"comment","kv":{"p_comment_id":0,"addtime":"1588263895040","praise_count":231,"other_id":5,"comment_id":5,"reply_count":62,"userid":7,"content":"骸汞"}},{"ett":"1588254200122","en":"favorites","kv":{"course_id":5,"id":0,"add_time":"1588264138625","userid":0}},{"ett":"1588281152824","en":"praise","kv":{"target_id":4,"id":3,"type":3,"add_time":"1588307696417","userid":8}}]

  • UDTF

大数据篇:数据仓库案例_第19张图片

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.json.JSONArray;
import org.json.JSONException;
import java.util.ArrayList;

public class EventJsonUDTF extends GenericUDTF {
    //该方法中,我们将指定输出参数的名称和参数类型:
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        ArrayList<String> fieldNames = new ArrayList<>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<>();
        fieldNames.add("event_name");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("event_json");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames,
                fieldOIs);
    }

    //输入 1 条记录,输出若干条结果
    @Override
    public void process(Object[] objects) throws HiveException {
        // 获取传入的 et
        String input = objects[0].toString();
        // 如果传进来的数据为空,直接返回过滤掉该数据
        if (StringUtils.isBlank(input)) {
            return;
        } else {
            try {
                // 获取一共有几个事件(ad/favorites)
                JSONArray ja = new JSONArray(input);
                if (ja == null)
                    return;
                // 循环遍历每一个事件
                for (int i = 0; i < ja.length(); i++) {
                    String[] result = new String[2];
                    try {
                        // 取出每个的事件名称(ad/favorites)
                        result[0] = ja.getJSONObject(i).getString("en");
                        // 取出每一个事件整体
                        result[1] = ja.getString(i);
                    } catch (JSONException e) {
                        continue;
                    }
                    // 将结果返回
                    forward(result);
                }
            } catch (JSONException e) {
                e.printStackTrace();
            }
        }
    }

    //当没有记录处理的时候该方法会被调用,用来清理代码或者产生额外的输出
    @Override
    public void close() throws HiveException {
    }
}
9.1.2.2 Registering the UDFs permanently
  • Upload the UDF jar

Upload the hive-function-1.0-SNAPSHOT package to /user/hive/jars on HDFS:

hdfs dfs -mkdir -p /user/hive/jars
hdfs dfs -put hive-function-1.0-SNAPSHOT.jar /user/hive/jars/hive-function-1.0-SNAPSHOT.jar

Create the permanent UDFs in Hive:

create function base_analizer as 'com.heaton.bigdata.udf.BaseFieldUDF' using jar 'hdfs://cdh01.cm:8020/user/hive/jars/hive-function-1.0-SNAPSHOT.jar';

create function flat_analizer as 'com.heaton.bigdata.udtf.EventJsonUDTF' using jar 'hdfs://cdh01.cm:8020/user/hive/jars/hive-function-1.0-SNAPSHOT.jar';
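Once registered, the functions can be smoke-tested directly in Hive; a minimal sketch against one raw event-log line (the partition date is an example):

select
base_analizer(line, 'mid') as mid_id,
base_analizer(line, 'st')  as server_time
from ods.mall__event_log
where dt='2020-03-15'
limit 1;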
9.1.2.3 Using the UDFs from DolphinScheduler

In the DAG editor's SQL task you can simply select the corresponding UDF; however, in DolphinScheduler 1.2.0 saving the function association currently does not take effect.

As a workaround, upload the jar to HDFS through the UDF management feature and register temporary functions from the task script, which accomplishes the same thing.

Temporary function statements:

create temporary function base_analizer as 'com.heaton.bigdata.udf.BaseFieldUDF' using jar 'hdfs://cdh01.cm:8020/dolphinscheduler/dolphinscheduler/udfs/hive-function-1.0-SNAPSHOT.jar';
create temporary function flat_analizer as 'com.heaton.bigdata.udtf.EventJsonUDTF' using jar 'hdfs://cdh01.cm:8020/dolphinscheduler/dolphinscheduler/udfs/hive-function-1.0-SNAPSHOT.jar';
9.1.2.4 Load the data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
APP3=ods
table_name=event_log
hive_table_name=$APP2.mall__$table_name
hive_origin_table_name=$APP3.mall__$table_name

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	base_analizer(line,'mid') as mid_id,
	base_analizer(line,'uid') as user_id,
	base_analizer(line,'vc') as version_code,
	base_analizer(line,'vn') as version_name,
	base_analizer(line,'l') as lang,
	base_analizer(line,'sr') as source,
	base_analizer(line,'os') as os,
	base_analizer(line,'ar') as area,
	base_analizer(line,'md') as model,
	base_analizer(line,'ba') as brand,
	base_analizer(line,'sv') as sdk_version,
	base_analizer(line,'g') as gmail,
	base_analizer(line,'hw') as height_width,
	base_analizer(line,'t') as app_time,
	base_analizer(line,'nw') as network,
	base_analizer(line,'ln') as lng,
	base_analizer(line,'la') as lat,
	event_name,
	event_json,
	base_analizer(line,'st') as server_time
from $hive_origin_table_name lateral view flat_analizer(base_analizer(line,'et')) tmp_flat as event_name,event_json
where dt='$db_date' and base_analizer(line,'et')<>'';
"
$hive -e "$sql"
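The lateral view is what fans one raw log line out into one row per event: base_analizer(line,'et') extracts the event array, the UDTF flat_analizer emits one (event_name, event_json) pair per element, and the lateral view joins each pair back to the common fields of its source row. In isolation (the date is an example):

select event_name, event_json
from ods.mall__event_log
lateral view flat_analizer(base_analizer(line, 'et')) tmp_flat as event_name, event_json
where dt='2020-03-15'
limit 10;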

9.1.3 Product click table

  • Create the table
drop table if exists dwd.mall__display_log

CREATE EXTERNAL TABLE `dwd.mall__display_log`(
	`mid_id` string,
	`user_id` string,
	`version_code` string,
	`version_name` string,
	`lang` string,
	`source` string,
	`os` string,
	`area` string,
	`model` string,
	`brand` string,
	`sdk_version` string,
	`gmail` string,
	`height_width` string,
	`app_time` string,
	`network` string,
	`lng` string,
	`lat` string,
	`action` string,
	`goodsid` string,
	`place` string,
	`extend1` string,
	`category` string,
	`server_time` string
  ) COMMENT '商品点击表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/display_log/'
tblproperties ("parquet.compression"="snappy") 
  • Load the data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
APP3=ods
table_name=display_log
hive_table_name=$APP2.mall__$table_name
hive_origin_table_name=$APP3.mall__$table_name

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
	select
	mid_id,
	user_id,
	version_code,
	version_name,
	lang,
	source,
	os,
	area,
	model,
	brand,
	sdk_version,
	gmail,
	height_width,
	app_time,
	network,
	lng,
	lat,
	get_json_object(event_json,'$.kv.action') action,
	get_json_object(event_json,'$.kv.goodsid') goodsid,
	get_json_object(event_json,'$.kv.place') place,
	get_json_object(event_json,'$.kv.extend1') extend1,
	get_json_object(event_json,'$.kv.category') category,
	server_time
from dwd.mall__event_log
where dt='$db_date' and event_name='display';
"
$hive -e "$sql"

9.1.4 Product list table

  • Create the table
drop table if exists dwd.mall__loading_log

CREATE EXTERNAL TABLE `dwd.mall__loading_log`(
	`mid_id` string,
	`user_id` string,
	`version_code` string,
	`version_name` string,
	`lang` string,
	`source` string,
	`os` string,
	`area` string,
	`model` string,
	`brand` string,
	`sdk_version` string,
	`gmail` string,
	`height_width` string,
	`app_time` string,
	`network` string,
	`lng` string,
	`lat` string,
	`action` string,
	`loading_time` string,
	`loading_way` string,
	`extend1` string,
	`extend2` string,
	`type` string,
	`type1` string,
	`server_time` string
  ) COMMENT '商品列表表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/loading_log/'
tblproperties ("parquet.compression"="snappy") 
  • Load the data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
APP3=ods
table_name=loading_log
hive_table_name=$APP2.mall__$table_name
hive_origin_table_name=$APP3.mall__$table_name

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	mid_id,
	user_id,
	version_code,
	version_name,
	lang,
	source,
	os,
	area,
	model,
	brand,
	sdk_version,
	gmail,
	height_width,
	app_time,
	network,
	lng,
	lat,
	get_json_object(event_json,'$.kv.action') action,
	get_json_object(event_json,'$.kv.loading_time') loading_time,
	get_json_object(event_json,'$.kv.loading_way') loading_way,
	get_json_object(event_json,'$.kv.extend1') extend1,
	get_json_object(event_json,'$.kv.extend2') extend2,
	get_json_object(event_json,'$.kv.type') type,
	get_json_object(event_json,'$.kv.type1') type1,
	server_time
from dwd.mall__event_log
where dt='$db_date' and event_name='loading';
"
$hive -e "$sql"

9.1.5 Ad table

  • Create the table
drop table if exists dwd.mall__ad_log

CREATE EXTERNAL TABLE `dwd.mall__ad_log`(
	`mid_id` string,
	`user_id` string,
	`version_code` string,
	`version_name` string,
	`lang` string,
	`source` string,
	`os` string,
	`area` string,
	`model` string,
	`brand` string,
	`sdk_version` string,
	`gmail` string,
	`height_width` string,
	`app_time` string,
	`network` string,
	`lng` string,
	`lat` string,
	`entry` string,
	`action` string,
	`contentType` string,
	`displayMills` string,
	`itemId` string,
	`activityId` string,
	`server_time` string
  ) COMMENT '广告表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/ad_log/'
tblproperties ("parquet.compression"="snappy") 
  • Load the data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
APP3=ods
table_name=ad_log
hive_table_name=$APP2.mall__$table_name
hive_origin_table_name=$APP3.mall__$table_name

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	mid_id,
	user_id,
	version_code,
	version_name,
	lang,
	source,
	os,
	area,
	model,
	brand,
	sdk_version,
	gmail,
	height_width,
	app_time,
	network,
	lng,
	lat,
	get_json_object(event_json,'$.kv.entry') entry,
	get_json_object(event_json,'$.kv.action') action,
	get_json_object(event_json,'$.kv.contentType') contentType,
	get_json_object(event_json,'$.kv.displayMills') displayMills,
	get_json_object(event_json,'$.kv.itemId') itemId,
	get_json_object(event_json,'$.kv.activityId') activityId,
	server_time
from dwd.mall__event_log
where dt='$db_date' and event_name='ad';
"
$hive -e "$sql"

9.1.6 Notification table

  • Create the table
drop table if exists dwd.mall__notification_log

CREATE EXTERNAL TABLE `dwd.mall__notification_log`(
	`mid_id` string,
	`user_id` string,
	`version_code` string,
	`version_name` string,
	`lang` string,
	`source` string,
	`os` string,
	`area` string,
	`model` string,
	`brand` string,
	`sdk_version` string,
	`gmail` string,
	`height_width` string,
	`app_time` string,
	`network` string,
	`lng` string,
	`lat` string,
	`action` string,
	`noti_type` string,
	`ap_time` string,
	`content` string,
	`server_time` string
  ) COMMENT '消息通知表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/notification_log/'
tblproperties ("parquet.compression"="snappy") 
  • Load the data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
APP3=ods
table_name=notification_log
hive_table_name=$APP2.mall__$table_name
hive_origin_table_name=$APP3.mall__$table_name

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	mid_id,
	user_id,
	version_code,
	version_name,
	lang,
	source,
	os,
	area,
	model,
	brand,
	sdk_version,
	gmail,
	height_width,
	app_time,
	network,
	lng,
	lat,
	get_json_object(event_json,'$.kv.action') action,
	get_json_object(event_json,'$.kv.noti_type') noti_type,
	get_json_object(event_json,'$.kv.ap_time') ap_time,
	get_json_object(event_json,'$.kv.content') content,
	server_time
from dwd.mall__event_log
where dt='$db_date' and event_name='notification';
"
$hive -e "$sql"

9.1.7 Background-active user table

  • Create the table
drop table if exists dwd.mall__active_background_log

CREATE EXTERNAL TABLE `dwd.mall__active_background_log`(
	`mid_id` string,
	`user_id` string,
	`version_code` string,
	`version_name` string,
	`lang` string,
	`source` string,
	`os` string,
	`area` string,
	`model` string,
	`brand` string,
	`sdk_version` string,
	`gmail` string,
	`height_width` string,
	`app_time` string,
	`network` string,
	`lng` string,
	`lat` string,
	`active_source` string,
	`server_time` string
  ) COMMENT '用户后台活跃表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/active_background_log/'
tblproperties ("parquet.compression"="snappy") 
  • Load the data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
APP3=ods
table_name=active_background_log
hive_table_name=$APP2.mall__$table_name
hive_origin_table_name=$APP3.mall__$table_name

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	mid_id,
	user_id,
	version_code,
	version_name,
	lang,
	source,
	os,
	area,
	model,
	brand,
	sdk_version,
	gmail,
	height_width,
	app_time,
	network,
	lng,
	lat,
	get_json_object(event_json,'$.kv.active_source') active_source,
	server_time
from dwd.mall__event_log
where dt='$db_date' and event_name='active_background';
"
$hive -e "$sql"

9.1.8 Comment table

  • Create the table
drop table if exists dwd.mall__comment_log

CREATE EXTERNAL TABLE `dwd.mall__comment_log`(
	`mid_id` string,
	`user_id` string,
	`version_code` string,
	`version_name` string,
	`lang` string,
	`source` string,
	`os` string,
	`area` string,
	`model` string,
	`brand` string,
	`sdk_version` string,
	`gmail` string,
	`height_width` string,
	`app_time` string,
	`network` string,
	`lng` string,
	`lat` string,
	`comment_id` int,
	`userid` int,
	`p_comment_id` int,
	`content` string,
	`addtime` string,
	`other_id` int,
	`praise_count` int,
	`reply_count` int,
	`server_time` string
  ) COMMENT '评论表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/comment_log/'
tblproperties ("parquet.compression"="snappy")  
  • Load the data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
APP3=ods
table_name=comment_log
hive_table_name=$APP2.mall__$table_name
hive_origin_table_name=$APP3.mall__$table_name

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	mid_id,
	user_id,
	version_code,
	version_name,
	lang,
	source,
	os,
	area,
	model,
	brand,
	sdk_version,
	gmail,
	height_width,
	app_time,
	network,
	lng,
	lat,
	get_json_object(event_json,'$.kv.comment_id') comment_id,
	get_json_object(event_json,'$.kv.userid') userid,
	get_json_object(event_json,'$.kv.p_comment_id') p_comment_id,
	get_json_object(event_json,'$.kv.content') content,
	get_json_object(event_json,'$.kv.addtime') addtime,
	get_json_object(event_json,'$.kv.other_id') other_id,
	get_json_object(event_json,'$.kv.praise_count') praise_count,
	get_json_object(event_json,'$.kv.reply_count') reply_count,
	server_time
from dwd.mall__event_log
where dt='$db_date' and event_name='comment';
"
$hive -e "$sql"

9.1.9 Favorites table

  • Create the table
drop table if exists dwd.mall__favorites_log

CREATE EXTERNAL TABLE `dwd.mall__favorites_log`(
	`mid_id` string,
	`user_id` string,
	`version_code` string,
	`version_name` string,
	`lang` string,
	`source` string,
	`os` string,
	`area` string,
	`model` string,
	`brand` string,
	`sdk_version` string,
	`gmail` string,
	`height_width` string,
	`app_time` string,
	`network` string,
	`lng` string,
	`lat` string,
	`id` int,
	`course_id` int,
	`userid` int,
	`add_time` string,
	`server_time` string
  ) COMMENT '收藏表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/favorites_log/'
tblproperties ("parquet.compression"="snappy") 
  • Load the data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
APP3=ods
table_name=favorites_log
hive_table_name=$APP2.mall__$table_name
hive_origin_table_name=$APP3.mall__$table_name

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	mid_id,
	user_id,
	version_code,
	version_name,
	lang,
	source,
	os,
	area,
	model,
	brand,
	sdk_version,
	gmail,
	height_width,
	app_time,
	network,
	lng,
	lat,
	get_json_object(event_json,'$.kv.id') id,
	get_json_object(event_json,'$.kv.course_id') course_id,
	get_json_object(event_json,'$.kv.userid') userid,
	get_json_object(event_json,'$.kv.add_time') add_time,
	server_time
from dwd.mall__event_log
where dt='$db_date' and event_name='favorites';
"
$hive -e "$sql"

9.1.10 Praise (like) table

  • Create the table
drop table if exists dwd.mall__praise_log

CREATE EXTERNAL TABLE `dwd.mall__praise_log`(
	`mid_id` string,
	`user_id` string,
	`version_code` string,
	`version_name` string,
	`lang` string,
	`source` string,
	`os` string,
	`area` string,
	`model` string,
	`brand` string,
	`sdk_version` string,
	`gmail` string,
	`height_width` string,
	`app_time` string,
	`network` string,
	`lng` string,
	`lat` string,
	`id` string,
	`userid` string,
	`target_id` string,
	`type` string,
	`add_time` string,
	`server_time` string
  ) COMMENT '点赞表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/praise_log/'
tblproperties ("parquet.compression"="snappy") 
  • Load the data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
APP3=ods
table_name=praise_log
hive_table_name=$APP2.mall__$table_name
hive_origin_table_name=$APP3.mall__$table_name

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	mid_id,
	user_id,
	version_code,
	version_name,
	lang,
	source,
	os,
	area,
	model,
	brand,
	sdk_version,
	gmail,
	height_width,
	app_time,
	network,
	lng,
	lat,
	get_json_object(event_json,'$.kv.id') id,
	get_json_object(event_json,'$.kv.userid') userid,
	get_json_object(event_json,'$.kv.target_id') target_id,
	get_json_object(event_json,'$.kv.type') type,
	get_json_object(event_json,'$.kv.add_time') add_time,
	server_time
from dwd.mall__event_log
where dt='$db_date' and event_name='praise';
"
$hive -e "$sql"

9.1.11 Error log table

  • Create the table
drop table if exists dwd.mall__error_log

CREATE EXTERNAL TABLE `dwd.mall__error_log`(
	`mid_id` string,
	`user_id` string,
	`version_code` string,
	`version_name` string,
	`lang` string,
	`source` string,
	`os` string,
	`area` string,
	`model` string,
	`brand` string,
	`sdk_version` string,
	`gmail` string,
	`height_width` string,
	`app_time` string,
	`network` string,
	`lng` string,
	`lat` string,
	`errorBrief` string,
	`errorDetail` string,
	`server_time` string
  ) COMMENT '错误日志表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/error_log/'
tblproperties ("parquet.compression"="snappy") 
  • Load the data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
APP3=ods
table_name=error_log
hive_table_name=$APP2.mall__$table_name
hive_origin_table_name=$APP3.mall__$table_name

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	mid_id,
	user_id,
	version_code,
	version_name,
	lang,
	source,
	os,
	area,
	model,
	brand,
	sdk_version,
	gmail,
	height_width,
	app_time,
	network,
	lng,
	lat,
	get_json_object(event_json,'$.kv.errorBrief') errorBrief,
	get_json_object(event_json,'$.kv.errorDetail') errorDetail,
	server_time
from dwd.mall__event_log
where dt='$db_date' and event_name='error';
"
$hive -e "$sql"

9.2 DWD layer (business database)

When this layer is first built, the incremental tables need dynamic partitioning by time so that history lands in the correct partitions.

| Fact \ Dimension | Time | User | Region | Product | Coupon | Activity | Code | Measure |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Order | √ | √ | √ | | | √ | | items / amount |
| Order detail | √ | √ | √ | √ | | | | items / amount |
| Payment | √ | √ | √ | | | | | count / amount |
| Add to cart | √ | √ | | √ | | | | items / amount |
| Favorites | √ | √ | | √ | | | | count |
| Review | √ | √ | | √ | | | √ | count |
| Refund | √ | √ | | √ | | | | items / amount |
| Coupon use | √ | √ | | | √ | | | count |

大数据篇:数据仓库案例_第20张图片

9.2.1 SKU (product) dimension table (full load)

  • Create the table
drop table if exists dwd.mall__dim_sku_info
    
CREATE EXTERNAL TABLE `dwd.mall__dim_sku_info`(
`id` string COMMENT '商品 id',
`spu_id` string COMMENT 'spuid',
`price` double COMMENT '商品价格',
`sku_name` string COMMENT '商品名称',
`sku_desc` string COMMENT '商品描述',
`weight` double COMMENT '重量',
`tm_id` string COMMENT '品牌 id',
`tm_name` string COMMENT '品牌名称',
`category3_id` string COMMENT '三级分类 id',
`category2_id` string COMMENT '二级分类 id',
`category1_id` string COMMENT '一级分类 id',
`category3_name` string COMMENT '三级分类名称',
`category2_name` string COMMENT '二级分类名称',
`category1_name` string COMMENT '一级分类名称',
`spu_name` string COMMENT 'spu 名称',
`create_time` string COMMENT '创建时间'
  ) COMMENT '商品维度表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/dim_sku_info/'
tblproperties ("parquet.compression"="snappy") 
  • Load the data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
table_name=dim_sku_info
hive_table_name=$APP2.mall__$table_name


# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	sku.id,
	sku.spu_id,
	sku.price,
	sku.sku_name,
	sku.sku_desc,
	sku.weight,
	sku.tm_id,
	ob.tm_name,
	sku.category3_id,
	c2.id category2_id,
	c1.id category1_id,
	c3.name category3_name,
	c2.name category2_name,
	c1.name category1_name,
	spu.spu_name,
	from_unixtime(cast(sku.create_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') create_time
from
(
select * from ods.mall__sku_info where dt='$db_date'
)sku
join
(
select * from ods.mall__base_trademark where dt='$db_date'
)ob on sku.tm_id=ob.tm_id
join
(
select * from ods.mall__spu_info where dt='$db_date'
)spu on spu.id = sku.spu_id
join
(
select * from ods.mall__base_category3 where dt='$db_date'
)c3 on sku.category3_id=c3.id
join
(
select * from ods.mall__base_category2 where dt='$db_date'
)c2 on c3.category2_id=c2.id
join
(
select * from ods.mall__base_category1 where dt='$db_date'
)c1 on c2.category1_id=c1.id;
"
$hive -e "$sql"
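All source timestamps are epoch milliseconds, hence the cast(.../1000 as bigint) before from_unixtime throughout these loads. A worked example using a timestamp from the sample logs:

select from_unixtime(cast(1588319303710/1000 as bigint), 'yyyy-MM-dd HH:mm:ss');
-- returns 2020-05-01 15:48:23 when the session time zone is UTC+8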

9.2.2 Coupon dimension table (full load)

  • Create the table
drop table if exists dwd.mall__dim_coupon_info
    
CREATE EXTERNAL TABLE `dwd.mall__dim_coupon_info`(
`id` string COMMENT '购物券编号',
`coupon_name` string COMMENT '购物券名称',
`coupon_type` string COMMENT '购物券类型 1 现金券 2 折扣券 3 满减券 4 满件打折券',
`condition_amount` string COMMENT '满额数',
`condition_num` string COMMENT '满件数',
`activity_id` string COMMENT '活动编号',
`benefit_amount` string COMMENT '减金额',
`benefit_discount` string COMMENT '折扣',
`create_time` string COMMENT '创建时间',
`range_type` string COMMENT '范围类型 1、商品 2、品类 3、品牌',
`spu_id` string COMMENT '商品 id',
`tm_id` string COMMENT '品牌 id',
`category3_id` string COMMENT '品类 id',
`limit_num` string COMMENT '最多领用次数',
`operate_time` string COMMENT '修改时间',
`expire_time` string COMMENT '过期时间'
  ) COMMENT '优惠券信息维度表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/dim_coupon_info/'
tblproperties ("parquet.compression"="snappy") 
  • Load the data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
table_name=dim_coupon_info
hive_table_name=$APP2.mall__$table_name


# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	id,
	coupon_name,
	coupon_type,
	condition_amount,
	condition_num,
	activity_id,
	benefit_amount,
	benefit_discount,
	from_unixtime(cast(create_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') create_time,
	range_type,
	spu_id,
	tm_id,
	category3_id,
	limit_num,
	from_unixtime(cast(operate_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') operate_time,
	from_unixtime(cast(expire_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') expire_time
from ods.mall__coupon_info
where dt='$db_date';
"
$hive -e "$sql"

9.2.3 Activity dimension table (full load)

  • Create the table
drop table if exists dwd.mall__dim_activity_info
    
CREATE EXTERNAL TABLE `dwd.mall__dim_activity_info`(
`id` string COMMENT '编号',
`activity_name` string COMMENT '活动名称',
`activity_type` string COMMENT '活动类型',
`condition_amount` string COMMENT '满减金额',
`condition_num` string COMMENT '满减件数',
`benefit_amount` string COMMENT '优惠金额',
`benefit_discount` string COMMENT '优惠折扣',
`benefit_level` string COMMENT '优惠级别',
`start_time` string COMMENT '开始时间',
`end_time` string COMMENT '结束时间',
`create_time` string COMMENT '创建时间'
  ) COMMENT '活动维度表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/dim_activity_info/'
tblproperties ("parquet.compression"="snappy") 
  • Load the data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
table_name=dim_activity_info
hive_table_name=$APP2.mall__$table_name


# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	info.id,
	info.activity_name,
	info.activity_type,
	rule.condition_amount,
	rule.condition_num,
	rule.benefit_amount,
	rule.benefit_discount,
	rule.benefit_level,
	from_unixtime(cast(info.start_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') start_time,
	from_unixtime(cast(info.end_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') end_time,
	from_unixtime(cast(info.create_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') create_time
from
(
select * from ods.mall__activity_info where dt='$db_date'
)info
left join
(
select * from ods.mall__activity_rule where dt='$db_date'
)rule on info.id = rule.activity_id;
"
$hive -e "$sql"

9.2.4 Region dimension table (special)

  • Create the table
drop table if exists dwd.mall__dim_base_province
    
CREATE EXTERNAL TABLE `dwd.mall__dim_base_province`(
`id` string COMMENT 'id',
`province_name` string COMMENT '省市名称',
`area_code` string COMMENT '地区编码',
`iso_code` string COMMENT 'ISO 编码',
`region_id` string COMMENT '地区 id',
`region_name` string COMMENT '地区名称'
  ) COMMENT '地区维度表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/dim_base_province/'
tblproperties ("parquet.compression"="snappy")
  • Load the data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
table_name=dim_base_province
hive_table_name=$APP2.mall__$table_name


# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	bp.id,
	bp.name,
	bp.area_code,
	bp.iso_code,
	bp.region_id,
	br.region_name
from ods.mall__base_province bp
join ods.mall__base_region br
on bp.region_id=br.id;
"
$hive -e "$sql"

9.2.5 Date dimension table (special)

  • Create the table
drop table if exists dwd.mall__dim_date_info
    
CREATE EXTERNAL TABLE `dwd.mall__dim_date_info`(
`date_id` string COMMENT '日',
`week_id` int COMMENT '周',
`week_day` int COMMENT '周的第几天',
`day` int COMMENT '每月的第几天',
`month` int COMMENT '第几月',
`quarter` int COMMENT '第几季度',
`year` int COMMENT '年',
`is_workday` int COMMENT '是否工作日',
`holiday_id` int COMMENT '节假日id'
  ) COMMENT '时间维度表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/dim_date_info/'
tblproperties ("parquet.compression"="snappy") 
  • Load the data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
table_name=dim_date_info
hive_table_name=$APP2.mall__$table_name


# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	date_id,
	week_id,
	week_day,
	day,
	month,
	quarter,
	year,
	is_workday,
	holiday_id
from ods.mall__date_info
"
$hive -e "$sql"

9.2.6 User dimension table (new and changed; slowly changing dimension; zipper table)

9.2.6.1 Zipper tables explained

A zipper table records the lifecycle of each piece of information: when a record's lifecycle ends, a new record is opened and the current date becomes its effective start date.

If a record is still valid today, its effective end date holds a sentinel maximum (e.g. 9999-99-99). The table below traces the changes to user 张三's phone number:

| User ID | Name | Phone | Start date | End date |
| --- | --- | --- | --- | --- |
| 1 | 张三 | 134XXXX5050 | 2019-01-01 | 2019-01-02 |
| 1 | 张三 | 139XXXX3232 | 2019-01-03 | 2020-01-01 |
| 1 | 张三 | 137XXXX7676 | 2020-01-02 | 9999-99-99 |
  • Suitable scenario: the data changes, but most of it stays the same (i.e., a slowly changing dimension).

For example, user profiles do change, but only a small fraction changes on any given day, so a full daily snapshot would be inefficient.

  • How to query a zipper table: filter with effective start date <= some date and effective end date >= some date to obtain a complete snapshot of the data as of that date; see the sketch below.
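A minimal snapshot query, assuming the user zipper table built in 9.2.6.2 and an example date:

-- full user snapshot as of 2020-03-15
select *
from dwd.mall__dim_user_info_his
where start_date <= '2020-03-15'
  and end_date >= '2020-03-15';

The sentinel '9999-99-99' compares greater than any real date as a string, so currently-valid rows are always included.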

  • How a zipper table takes shape

大数据篇:数据仓库案例_第21张图片

  • Build process

Concatenate the current day's full user data with the rows that changed in MySQL that day, forming a new temporary zipper table.

Then overwrite the old zipper table with the temporary one.

This works around Hive's inability to update data in place.

9.2.6.2 User dimension table

User records may be both inserted and updated on any given day, making this a slowly changing dimension, so the user dimension is stored as a zipper table. Note that, unlike the other tables, it is not partitioned, because the whole table is rewritten on each run.

  • Create the table
drop table if exists dwd.mall__dim_user_info_his
    
CREATE EXTERNAL TABLE `dwd.mall__dim_user_info_his`(
`id` string COMMENT '用户 id',
`name` string COMMENT '姓名',
`birthday` string COMMENT '生日',
`gender` string COMMENT '性别',
`email` string COMMENT '邮箱',
`user_level` string COMMENT '用户等级',
`create_time` string COMMENT '创建时间',
`operate_time` string COMMENT '操作时间',
`start_date` string COMMENT '有效开始日期',
`end_date` string COMMENT '有效结束日期'
  ) COMMENT '用户拉链表'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/dim_user_info_his/'
tblproperties ("parquet.compression"="snappy") 
  • Create the temporary table (same structure as the main table)
drop table if exists dwd.mall__dim_user_info_his_tmp
    
CREATE EXTERNAL TABLE `dwd.mall__dim_user_info_his_tmp`(
`id` string COMMENT '用户 id',
`name` string COMMENT '姓名',
`birthday` string COMMENT '生日',
`gender` string COMMENT '性别',
`email` string COMMENT '邮箱',
`user_level` string COMMENT '用户等级',
`create_time` string COMMENT '创建时间',
`operate_time` string COMMENT '操作时间',
`start_date` string COMMENT '有效开始日期',
`end_date` string COMMENT '有效结束日期'
  ) COMMENT '用户拉链表'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/dim_user_info_his_tmp/'
tblproperties ("parquet.compression"="snappy") 
  • Initialize the main table first (run only once)
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
table_name=dim_user_info_his
hive_table_name=$APP2.mall__$table_name


# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
select
	id,
	name,
	from_unixtime(cast(birthday/1000 as bigint),'yyyy-MM-dd HH:mm:ss') birthday,
	gender,
	email,
	user_level,
	from_unixtime(cast(create_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') create_time,
	from_unixtime(cast(operate_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') operate_time,
	'$db_date',
	'9999-99-99'
from ods.mall__user_info oi
where oi.dt='$db_date';
"
$hive -e "$sql"
  • Compute and load the temporary table (run after the main-table initialization)
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
table_name=dim_user_info_his_tmp
hive_table_name=$APP2.mall__$table_name


# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name

select
* 
from
(      --查询当前时间的所有信息
	select
		cast(id as string) id,
		name,
		from_unixtime(cast(birthday/1000 as bigint),'yyyy-MM-dd HH:mm:ss') birthday,
		gender,
		email,
		user_level,
		from_unixtime(cast(create_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') create_time,
		from_unixtime(cast(operate_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') operate_time,
		'$db_date' start_date,
		'9999-99-99' end_date
	from ods.mall__user_info where dt='$db_date'
	union all
	 -- records that changed today: close out their end_date
	select
		uh.id,
		uh.name,
		from_unixtime(cast(uh.birthday/1000 as bigint),'yyyy-MM-dd HH:mm:ss') birthday,
		uh.gender,
		uh.email,
		uh.user_level,
		from_unixtime(cast(uh.create_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') create_time,
		from_unixtime(cast(uh.operate_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') operate_time,
		uh.start_date,
		if(ui.id is not null and uh.end_date='9999-99-99', date_add(ui.dt,-1),uh.end_date) end_date
	from dwd.mall__dim_user_info_his uh left join
	(
        -- today's records, used to detect changes
	select
		cast(id as string) id,
		name,
		from_unixtime(cast(birthday/1000 as bigint),'yyyy-MM-dd HH:mm:ss') birthday,
		gender,
		email,
		user_level,
		from_unixtime(cast(create_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') create_time,
		from_unixtime(cast(operate_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') operate_time,
		dt
	from ods.mall__user_info
	where dt='$db_date'
	) ui on uh.id=ui.id
)his
order by his.id, start_date;
"
$hive -e "$sql"
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
table_name=dim_user_info_his
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
select * from dwd.mall__dim_user_info_his_tmp;
"
$hive -e "$sql"

9.2.7 Order detail fact table (transactional snapshot fact table - new records)

  • Create table
drop table if exists dwd.mall__fact_order_detail
    
CREATE EXTERNAL TABLE `dwd.mall__fact_order_detail`(
  `id` bigint COMMENT '编号',
  `order_id` bigint COMMENT '订单编号',
  `user_id` bigint COMMENT '用户id',
  `sku_id` bigint COMMENT 'sku_id',
  `sku_name` string COMMENT 'sku名称',
  `order_price` decimal(10,2) COMMENT '购买价格(下单时sku价格)',
  `sku_num` string COMMENT '购买个数',
  `create_time` bigint COMMENT '创建时间',
  `province_id` string COMMENT '省份ID',
  `total_amount` decimal(20,2) COMMENT '订单总金额'
  ) COMMENT '订单明细表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/fact_order_detail/'
tblproperties ("parquet.compression"="snappy") 
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
table_name=fact_order_detail
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	od.id, 
	od.order_id, 
	od.user_id, 
	od.sku_id, 
	od.sku_name, 
	od.order_price, 
	od.sku_num, 
	od.create_time, 
	oi.province_id, 
	od.order_price*od.sku_num 
from (select * from ods.mall__order_detail where dt='$db_date' ) od 
join (select * from ods.mall__order_info where dt='$db_date' ) oi 
on od.order_id=oi.id;
"
$hive -e "$sql"

9.2.8 Payment fact table (transactional snapshot fact table - new records)

  • Create table
drop table if exists dwd.mall__fact_payment_info
    
CREATE EXTERNAL TABLE `dwd.mall__fact_payment_info`(
`id` string COMMENT '',
`out_trade_no` string COMMENT '对外业务编号',
`order_id` string COMMENT '订单编号',
`user_id` string COMMENT '用户编号',
`alipay_trade_no` string COMMENT '支付宝交易流水编号',
`payment_amount` decimal(16,2) COMMENT '支付金额',
`subject` string COMMENT '交易内容',
`payment_type` string COMMENT '支付类型',
`payment_time` string COMMENT '支付时间',
`province_id` string COMMENT '省份 ID'
  ) COMMENT '支付事实表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/fact_payment_info/'
tblproperties ("parquet.compression"="snappy") 
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
table_name=fact_payment_info
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	pi.id,
	pi.out_trade_no,
	pi.order_id,
	pi.user_id,
	pi.alipay_trade_no,
	pi.total_amount,
	pi.subject,
	pi.payment_type,
	from_unixtime(cast(pi.payment_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') payment_time,
	oi.province_id
from
(
select * from ods.mall__payment_info where dt='$db_date'
)pi
join
(
select id, province_id from ods.mall__order_info where dt='$db_date'
)oi
on pi.order_id = oi.id;
"
$hive -e "$sql"

9.2.9 Refund fact table (transactional snapshot fact table - new records)

  • Create table
drop table if exists dwd.mall__fact_order_refund_info
    
CREATE EXTERNAL TABLE `dwd.mall__fact_order_refund_info`(
`id` string COMMENT '编号',
`user_id` string COMMENT '用户 ID',
`order_id` string COMMENT '订单 ID',
`sku_id` string COMMENT '商品 ID',
`refund_type` string COMMENT '退款类型',
`refund_num` bigint COMMENT '退款件数',
`refund_amount` decimal(16,2) COMMENT '退款金额',
`refund_reason_type` string COMMENT '退款原因类型',
`create_time` string COMMENT '退款时间'
  ) COMMENT '退款事实表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/fact_order_refund_info/'
tblproperties ("parquet.compression"="snappy") 
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
table_name=fact_order_refund_info
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	id,
	user_id,
	order_id,
	sku_id,
	refund_type,
	refund_num,
	refund_amount,
	refund_reason_type,
	from_unixtime(cast(create_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') create_time
from ods.mall__order_refund_info
where dt='$db_date';
"
$hive -e "$sql"

9.2.10 Review fact table (transactional snapshot fact table - new records)

  • Create table
drop table if exists dwd.mall__fact_comment_info
    
CREATE EXTERNAL TABLE `dwd.mall__fact_comment_info`(
`id` string COMMENT '编号',
`user_id` string COMMENT '用户 ID',
`sku_id` string COMMENT '商品 sku',
`spu_id` string COMMENT '商品 spu',
`order_id` string COMMENT '订单 ID',
`appraise` string COMMENT '评价',
`create_time` string COMMENT '评价时间'
  ) COMMENT '评价事实表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/fact_comment_info/'
tblproperties ("parquet.compression"="snappy") 
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
table_name=fact_comment_info
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	id,
	user_id,
	sku_id,
	spu_id,
	order_id,
	appraise,
	from_unixtime(cast(create_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') create_time
from ods.mall__comment_info
where dt='$db_date';
"
$hive -e "$sql"

9.2.11 Cart fact table (periodic snapshot fact table - full load)

  • Create table
drop table if exists dwd.mall__fact_cart_info
    
CREATE EXTERNAL TABLE `dwd.mall__fact_cart_info`(
`id` string COMMENT '编号',
`user_id` string COMMENT '用户 id',
`sku_id` string COMMENT 'skuid',
`cart_price` string COMMENT '放入购物车时价格',
`sku_num` string COMMENT '数量',
`sku_name` string COMMENT 'sku 名称 (冗余)',
`create_time` string COMMENT '创建时间',
`operate_time` string COMMENT '修改时间',
`is_ordered` string COMMENT '是否已经下单。1 为已下单;0 为未下单',
`order_time` string COMMENT '下单时间'
  ) COMMENT '加购事实表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/fact_cart_info/'
tblproperties ("parquet.compression"="snappy") 
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
table_name=fact_cart_info
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	id,
	user_id,
	sku_id,
	cart_price,
	sku_num,
	sku_name,
	from_unixtime(cast(create_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') create_time,
	from_unixtime(cast(operate_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') operate_time,
	is_ordered,
	from_unixtime(cast(order_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') order_time
from ods.mall__cart_info
where dt='$db_date';
"
$hive -e "$sql"

9.2.12 Favorites fact table (periodic snapshot fact table - full load)

  • Create table
drop table if exists dwd.mall__fact_favor_info
    
CREATE EXTERNAL TABLE `dwd.mall__fact_favor_info`(
`id` string COMMENT '编号',
`user_id` string COMMENT '用户 id',
`sku_id` string COMMENT 'skuid',
`spu_id` string COMMENT 'spuid',
`is_cancel` string COMMENT '是否取消',
`create_time` string COMMENT '收藏时间',
`cancel_time` string COMMENT '取消时间'
  ) COMMENT '收藏事实表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/fact_favor_info/'
tblproperties ("parquet.compression"="snappy") 
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
table_name=fact_favor_info
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	id,
	user_id,
	sku_id,
	spu_id,
	is_cancel,
	from_unixtime(cast(create_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') create_time,
	from_unixtime(cast(cancel_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss') cancel_time
from ods.mall__favor_info
where dt='$db_date';
"
$hive -e "$sql"

9.2.13 Coupon usage fact table (accumulating snapshot fact table - new and changed records)

  • Create table
drop table if exists dwd.mall__fact_coupon_use
    
CREATE EXTERNAL TABLE `dwd.mall__fact_coupon_use`(
`id` string COMMENT '编号',
`coupon_id` string COMMENT '优惠券 ID',
`user_id` string COMMENT 'userid',
`order_id` string COMMENT '订单 id',
`coupon_status` string COMMENT '优惠券状态',
`get_time` string COMMENT '领取时间',
`using_time` string COMMENT '使用时间(下单)',
`used_time` string COMMENT '使用时间(支付)'
  ) COMMENT '优惠券领用事实表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/fact_coupon_use/'
tblproperties ("parquet.compression"="snappy") 

The dt partition is keyed on the coupon claim time, get_time.

get_time is the claim time; once a coupon is claimed its row must persist, and the order time and the payment time are layered onto it as those updates arrive.

[Figure 22: coupon usage fact table lifecycle]
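Note that the load below enables dynamic partitioning and that its select list ends with one extra get_time-derived column: Hive routes each row into the partition named by that final column, so an updated coupon is rewritten back into the partition of its original claim date. The mechanism in miniature (table names hypothetical):

set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table coupon_use_demo partition(dt)
select
	id,
	coupon_status,
	get_date,
	get_date as dt   -- the last column picks the target partition per row
from coupon_use_src;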

  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
table_name=fact_coupon_use
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table $hive_table_name
PARTITION (dt)
select
	if(new.id is null,old.id,new.id) id,
	if(new.coupon_id is null,old.coupon_id,new.coupon_id) coupon_id,
	if(new.user_id is null,old.user_id,new.user_id) user_id,
	if(new.order_id is null,old.order_id,new.order_id) order_id,
	if(new.coupon_status is null,old.coupon_status,new.coupon_status) coupon_status,
	from_unixtime(cast(if(new.get_time is null,old.get_time,new.get_time)/1000 as bigint),'yyyy-MM-dd') get_time,
	from_unixtime(cast(if(new.using_time is null,old.using_time,new.using_time)/1000 as bigint),'yyyy-MM-dd') using_time,
	from_unixtime(cast(if(new.used_time is null,old.used_time,new.used_time)/1000 as bigint),'yyyy-MM-dd'),
	from_unixtime(cast(if(new.get_time is null,old.get_time,new.get_time)/1000 as bigint),'yyyy-MM-dd') 
from
(
	select
		id,
		coupon_id,
		user_id,
		order_id,
		coupon_status,
		get_time,
		using_time,
		used_time
	from dwd.mall__fact_coupon_use
	where dt in
	(
	select
		from_unixtime(cast(get_time/1000 as bigint),'yyyy-MM-dd')
	from ods.mall__coupon_use
	where dt='$db_date'
	)
)old
full outer join
(
	select
		id,
		coupon_id,
		user_id,
		order_id,
		coupon_status,
		get_time,
		using_time,
		used_time
	from ods.mall__coupon_use
	where dt='$db_date'
)new
on old.id=new.id;
"
$hive -e "$sql"

9.2.14 Order fact table (accumulating snapshot fact table - new and changed records)

  • Create table
drop table if exists dwd.mall__fact_order_info
    
CREATE EXTERNAL TABLE `dwd.mall__fact_order_info`(
`id` string COMMENT '订单编号',
`order_status` string COMMENT '订单状态',
`user_id` string COMMENT '用户 id',
`out_trade_no` string COMMENT '支付流水号',
`create_time` string COMMENT '创建时间(未支付状态)',
`payment_time` string COMMENT '支付时间(已支付状态)',
`cancel_time` string COMMENT '取消时间(已取消状态)',
`finish_time` string COMMENT '完成时间(已完成状态)',
`refund_time` string COMMENT '退款时间(退款中状态)',
`refund_finish_time` string COMMENT '退款完成时间(退款完成状态)',
`province_id` string COMMENT '省份 ID',
`activity_id` string COMMENT '活动 ID',
`original_total_amount` string COMMENT '原价金额',
`benefit_reduce_amount` string COMMENT '优惠金额',
`feight_fee` string COMMENT '运费',
`final_total_amount` decimal(10,2) COMMENT '订单金额'
  ) COMMENT '订单事实表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwd/mall/fact_order_info/'
tblproperties ("parquet.compression"="snappy") 

[Figure 23: order fact table status timeline]

  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwd
table_name=fact_order_info
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table $hive_table_name
PARTITION (dt)
select
	if(new.id is null,old.id,new.id),
	if(new.order_status is null,old.order_status,new.order_status),
	if(new.user_id is null,old.user_id,new.user_id),
	if(new.out_trade_no is null,old.out_trade_no,new.out_trade_no),
	if(new.tms['1001'] is null,from_unixtime(cast(old.create_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss'),new.tms['1001']),-- 1001 = unpaid status
	if(new.tms['1002'] is null,from_unixtime(cast(old.payment_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss'),new.tms['1002']),
	if(new.tms['1003'] is null,from_unixtime(cast(old.cancel_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss'),new.tms['1003']),
	if(new.tms['1004'] is null,from_unixtime(cast(old.finish_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss'),new.tms['1004']),
	if(new.tms['1005'] is null,from_unixtime(cast(old.refund_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss'),new.tms['1005']),
	if(new.tms['1006'] is null,from_unixtime(cast(old.refund_finish_time/1000 as bigint),'yyyy-MM-dd HH:mm:ss'),new.tms['1006']),
	if(new.province_id is null,old.province_id,new.province_id),
	if(new.activity_id is null,old.activity_id,new.activity_id),
	if(new.original_total_amount is null,old.original_total_amount,new.original_total_amount),
	if(new.benefit_reduce_amount is null,old.benefit_reduce_amount,new.benefit_reduce_amount),
	if(new.feight_fee is null,old.feight_fee,new.feight_fee),
	if(new.final_total_amount is null,old.final_total_amount,new.final_total_amount),
	date_format(if(new.tms['1001'] is null,old.create_time,new.tms['1001']),'yyyy-MM-dd') -- dynamic partition column: order creation date
from
(
	select
		id,
		order_status,
		user_id,
		out_trade_no,
		create_time,
		payment_time,
		cancel_time,
		finish_time,
		refund_time,
		refund_finish_time,
		province_id,
		activity_id,
		original_total_amount,
		benefit_reduce_amount,
		feight_fee,
		final_total_amount
	from dwd.mall__fact_order_info
	where dt in 
	(
	select
		from_unixtime(cast(create_time/1000 as bigint),'yyyy-MM-dd')
	from ods.mall__order_info
	where dt='$db_date'
	)
)old
full outer join
(
	select
		info.id,
		info.order_status,
		info.user_id,
		info.out_trade_no,
		info.province_id,
		act.activity_id,
		log.tms,
		info.original_total_amount,
		info.benefit_reduce_amount,
		info.feight_fee,
		info.final_total_amount
	from
	(
	select
		order_id,
		str_to_map(concat_ws(',',collect_set(concat(order_status,'=',from_unixtime(cast(operate_time/1000 as bigint),'yyyy-MM-dd')))),',','=')
		tms
	from ods.mall__order_status_log
	where dt='$db_date'
	group by order_id
	)log
	join
	(
	select * from ods.mall__order_info where dt='$db_date'
	)info
	on log.order_id=info.id
	left join
	(
	select * from ods.mall__activity_order where dt='$db_date'
	)act
	on log.order_id=act.order_id
)new
on old.id=new.id;
"
$hive -e "$sql"

10 Building the DWS layer

No compression is applied at this layer: compression saves disk but costs CPU, and DWS tables are queried frequently, so compute efficiency takes priority. This layer mainly aggregates each day's behavior by subject.

10.1 Daily device behavior (user behavior logs)

  • Create table
drop table if exists dws.mall__uv_detail_daycount

CREATE EXTERNAL TABLE `dws.mall__uv_detail_daycount`(
`mid_id` string COMMENT '设备唯一标识',
`user_id` string COMMENT '用户标识',
`version_code` string COMMENT '程序版本号',
`version_name` string COMMENT '程序版本名',
`lang` string COMMENT '系统语言',
`source` string COMMENT '渠道号',
`os` string COMMENT '安卓系统版本',
`area` string COMMENT '区域',
`model` string COMMENT '手机型号',
`brand` string COMMENT '手机品牌',
`sdk_version` string COMMENT 'sdkVersion',
`gmail` string COMMENT 'gmail',
`height_width` string COMMENT '屏幕宽高',
`app_time` string COMMENT '客户端日志产生时的时间',
`network` string COMMENT '网络模式',
`lng` string COMMENT '经度',
`lat` string COMMENT '纬度',
`login_count` bigint COMMENT '活跃次数'
  ) COMMENT '每日设备行为表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dws/mall/uv_detail_daycount/'
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dws
table_name=uv_detail_daycount
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
PARTITION (dt='$db_date')
select
	mid_id,
	concat_ws('|', collect_set(user_id)) user_id,
	concat_ws('|', collect_set(version_code)) version_code,
	concat_ws('|', collect_set(version_name)) version_name,
	concat_ws('|', collect_set(lang))lang,
	concat_ws('|', collect_set(source)) source,
	concat_ws('|', collect_set(os)) os,
	concat_ws('|', collect_set(area)) area,
	concat_ws('|', collect_set(model)) model,
	concat_ws('|', collect_set(brand)) brand,
	concat_ws('|', collect_set(sdk_version)) sdk_version,
	concat_ws('|', collect_set(gmail)) gmail,
	concat_ws('|', collect_set(height_width)) height_width,
	concat_ws('|', collect_set(app_time)) app_time,
	concat_ws('|', collect_set(network)) network,
	concat_ws('|', collect_set(lng)) lng,
	concat_ws('|', collect_set(lat)) lat,
	count(*) login_count
from dwd.mall__start_log
where dt='$db_date'
group by mid_id;
"
$hive -e "$sql"
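The concat_ws('|', collect_set(...)) pattern used above collapses a device's possibly-varying attribute values for the day into a single deduplicated, '|'-separated string. In miniature (inline data, hypothetical; order inside collect_set is not guaranteed):

select concat_ws('|', collect_set(vc)) versions
from (select '1' vc union all select '2' union all select '1') t;
-- returns: 1|2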

10.2 Daily member behavior (business data)

  • Create table
drop table if exists dws.mall__user_action_daycount

CREATE EXTERNAL TABLE `dws.mall__user_action_daycount`(
user_id string comment '用户 id',
login_count bigint comment '登录次数',
cart_count bigint comment '加入购物车次数',
cart_amount double comment '加入购物车金额',
order_count bigint comment '下单次数',
order_amount decimal(16,2) comment '下单金额',
payment_count bigint comment '支付次数',
payment_amount decimal(16,2) comment '支付金额'
  ) COMMENT '每日会员行为表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dws/mall/user_action_daycount/'
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dws
table_name=user_action_daycount
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
with
tmp_login as
(
	select
		user_id,
		count(*) login_count
	from dwd.mall__start_log
	where dt='$db_date'
	and user_id is not null
	group by user_id
),
tmp_cart as
(
	select
		user_id,
		count(*) cart_count,
		sum(cart_price*sku_num) cart_amount
	from dwd.mall__fact_cart_info
	where dt='$db_date'
	and user_id is not null
	and date_format(create_time,'yyyy-MM-dd')='$db_date'
	group by user_id
),
tmp_order as
(
	select
		user_id,
		count(*) order_count,
		sum(final_total_amount) order_amount
	from dwd.mall__fact_order_info
	where dt='$db_date'
	group by user_id
) ,
tmp_payment as
(
	select
		user_id,
		count(*) payment_count,
		sum(payment_amount) payment_amount
	from dwd.mall__fact_payment_info
	where dt='$db_date'
	group by user_id
)
insert overwrite table $hive_table_name partition(dt='$db_date')
select
	user_actions.user_id,
	sum(user_actions.login_count),
	sum(user_actions.cart_count),
	sum(user_actions.cart_amount),
	sum(user_actions.order_count),
	sum(user_actions.order_amount),
	sum(user_actions.payment_count),
	sum(user_actions.payment_amount)
from
(
	select
		user_id,
		login_count,
		0 cart_count,
		0 cart_amount,
		0 order_count,
		0 order_amount,
		0 payment_count,
		0 payment_amount
	from
	tmp_login
union all
	select
		user_id,
		0 login_count,
		cart_count,
		cart_amount,
		0 order_count,
		0 order_amount,
		0 payment_count,
		0 payment_amount
	from
	tmp_cart
union all
	select
		user_id,
		0 login_count,
		0 cart_count,
		0 cart_amount,
		order_count,
		order_amount,
		0 payment_count,
		0 payment_amount
	from tmp_order
union all
	select
		user_id,
		0 login_count,
		0 cart_count,
		0 cart_amount,
		0 order_count,
		0 order_amount,
		payment_count,
		payment_amount
	from tmp_payment
) user_actions
group by user_id;
"
$hive -e "$sql"

10.3 Daily SKU behavior (business data)

  • Create table
drop table if exists dws.mall__sku_action_daycount

CREATE EXTERNAL TABLE `dws.mall__sku_action_daycount`(
sku_id string comment 'sku_id',
order_count bigint comment '被下单次数',
order_num bigint comment '被下单件数',
order_amount decimal(16,2) comment '被下单金额',
payment_count bigint comment '被支付次数',
payment_num bigint comment '被支付件数',
payment_amount decimal(16,2) comment '被支付金额',
refund_count bigint comment '被退款次数',
refund_num bigint comment '被退款件数',
refund_amount decimal(16,2) comment '被退款金额',
cart_count bigint comment '被加入购物车次数',
cart_num bigint comment '被加入购物车件数',
favor_count bigint comment '被收藏次数',
appraise_good_count bigint comment '好评数',
appraise_mid_count bigint comment '中评数',
appraise_bad_count bigint comment '差评数',
appraise_default_count bigint comment '默认评价数'
  ) COMMENT '每日商品行为表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dws/mall/sku_action_daycount/'
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dws
table_name=sku_action_daycount
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
with
tmp_order as
(
	select
		cast(sku_id as string) sku_id,
		count(*) order_count,
		sum(sku_num) order_num,
		sum(total_amount) order_amount
	from dwd.mall__fact_order_detail
	where dt='$db_date'
	group by sku_id
),
tmp_payment as
(
	select
		cast(sku_id as string) sku_id,
		count(*) payment_count,
		sum(sku_num) payment_num,
		sum(total_amount) payment_amount
	from dwd.mall__fact_order_detail
	where dt='$db_date'
	and order_id in
	(
	select
		id
	from dwd.mall__fact_order_info
	where (dt='$db_date' or dt=date_add('$db_date',-1))
	and date_format(payment_time,'yyyy-MM-dd')='$db_date'
	)
	group by sku_id
),
tmp_refund as
(
	select
		cast(sku_id as string) sku_id,
		count(*) refund_count,
		sum(refund_num) refund_num,
		sum(refund_amount) refund_amount
	from dwd.mall__fact_order_refund_info
	where dt='$db_date'
	group by sku_id
),
tmp_cart as
(
	select
		cast(sku_id as string) sku_id,
		count(*) cart_count,
		sum(sku_num) cart_num
	from dwd.mall__fact_cart_info
	where dt='$db_date'
	and date_format(create_time,'yyyy-MM-dd')='$db_date'
	group by sku_id
),
tmp_favor as
(
	select
		cast(sku_id as string) sku_id,
		count(*) favor_count
	from dwd.mall__fact_favor_info
	where dt='$db_date'
	and date_format(create_time,'yyyy-MM-dd')='$db_date'
	group by sku_id
),
tmp_appraise as
(
	select
		cast(sku_id as string) sku_id,
		sum(if(appraise='1201',1,0)) appraise_good_count,
		sum(if(appraise='1202',1,0)) appraise_mid_count,
		sum(if(appraise='1203',1,0)) appraise_bad_count,
		sum(if(appraise='1204',1,0)) appraise_default_count
	from dwd.mall__fact_comment_info
	where dt='$db_date'
	group by sku_id
)
insert overwrite table $hive_table_name partition(dt='$db_date')
select
	sku_id,
	sum(order_count),
	sum(order_num),
	sum(order_amount),
	sum(payment_count),
	sum(payment_num),
	sum(payment_amount),
	sum(refund_count),
	sum(refund_num),
	sum(refund_amount),
	sum(cart_count),
	sum(cart_num),
	sum(favor_count),
	sum(appraise_good_count),
	sum(appraise_mid_count),
	sum(appraise_bad_count),
	sum(appraise_default_count)
from
(
	select
		sku_id,
		order_count,
		order_num,
		order_amount,
		0 payment_count,
		0 payment_num,
		0 payment_amount,
		0 refund_count,
		0 refund_num,
		0 refund_amount,
		0 cart_count,
		0 cart_num,
		0 favor_count,
		0 appraise_good_count,
		0 appraise_mid_count,
		0 appraise_bad_count,
		0 appraise_default_count
	from tmp_order
union all
	select
		sku_id,
		0 order_count,
		0 order_num,
		0 order_amount,
		payment_count,
		payment_num,
		payment_amount,
		0 refund_count,
		0 refund_num,
		0 refund_amount,
		0 cart_count,
		0 cart_num,
		0 favor_count,
		0 appraise_good_count,
		0 appraise_mid_count,
		0 appraise_bad_count,
		0 appraise_default_count
	from tmp_payment
union all
	select
		sku_id,
		0 order_count,
		0 order_num,
		0 order_amount,
		0 payment_count,
		0 payment_num,
		0 payment_amount,
		refund_count,
		refund_num,
		refund_amount,
		0 cart_count,
		0 cart_num,
		0 favor_count,
		0 appraise_good_count,
		0 appraise_mid_count,
		0 appraise_bad_count,
		0 appraise_default_count
	from tmp_refund
union all
	select
		sku_id,
		0 order_count,
		0 order_num,
		0 order_amount,
		0 payment_count,
		0 payment_num,
		0 payment_amount,
		0 refund_count,
		0 refund_num,
		0 refund_amount,
		cart_count,
		cart_num,
		0 favor_count,
		0 appraise_good_count,
		0 appraise_mid_count,
		0 appraise_bad_count,
		0 appraise_default_count
	from tmp_cart
union all
	select
		sku_id,
		0 order_count,
		0 order_num,
		0 order_amount,
		0 payment_count,
		0 payment_num,
		0 payment_amount,
		0 refund_count,
		0 refund_num,
		0 refund_amount,
		0 cart_count,
		0 cart_num,
		favor_count,
		0 appraise_good_count,
		0 appraise_mid_count,
		0 appraise_bad_count,
		0 appraise_default_count
	from tmp_favor
union all
	select
		sku_id,
		0 order_count,
		0 order_num,
		0 order_amount,
		0 payment_count,
		0 payment_num,
		0 payment_amount,
		0 refund_count,
		0 refund_num,
		0 refund_amount,
		0 cart_count,
		0 cart_num,
		0 favor_count,
		appraise_good_count,
		appraise_mid_count,
		appraise_bad_count,
		appraise_default_count
	from tmp_appraise
)tmp
group by sku_id;
"
$hive -e "$sql"

10.4 Daily coupon statistics (business data)

  • Create table
drop table if exists dws.mall__coupon_use_daycount

CREATE EXTERNAL TABLE `dws.mall__coupon_use_daycount`(
`coupon_id` string COMMENT '优惠券 ID',
`coupon_name` string COMMENT '购物券名称',
`coupon_type` string COMMENT '购物券类型 1 现金券 2 折扣券 3 满减券 4 满件打折券',
`condition_amount` string COMMENT '满额数',
`condition_num` string COMMENT '满件数',
`activity_id` string COMMENT '活动编号',
`benefit_amount` string COMMENT '减金额',
`benefit_discount` string COMMENT '折扣',
`create_time` string COMMENT '创建时间',
`range_type` string COMMENT '范围类型 1、商品 2、品类 3、品牌',
`spu_id` string COMMENT '商品 id',
`tm_id` string COMMENT '品牌 id',
`category3_id` string COMMENT '品类 id',
`limit_num` string COMMENT '最多领用次数',
`get_count` bigint COMMENT '领用次数',
`using_count` bigint COMMENT '使用(下单)次数',
`used_count` bigint COMMENT '使用(支付)次数'
  ) COMMENT '每日优惠券统计表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dws/mall/coupon_use_daycount/'
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dws
table_name=coupon_use_daycount
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name partition(dt='$db_date')
select
	cu.coupon_id,
	ci.coupon_name,
	ci.coupon_type,
	ci.condition_amount,
	ci.condition_num,
	ci.activity_id,
	ci.benefit_amount,
	ci.benefit_discount,
	ci.create_time,
	ci.range_type,
	ci.spu_id,
	ci.tm_id,
	ci.category3_id,
	ci.limit_num,
	cu.get_count,
	cu.using_count,
	cu.used_count
from
(
	select
		coupon_id,
		sum(if(date_format(get_time,'yyyy-MM-dd')='$db_date',1,0))
		get_count,
		sum(if(date_format(using_time,'yyyy-MM-dd')='$db_date',1,0))
		using_count,
		sum(if(date_format(used_time,'yyyy-MM-dd')='$db_date',1,0))
		used_count
	from dwd.mall__fact_coupon_use
	where dt='$db_date'
	group by coupon_id
)cu
left join
(
	select
		*
	from dwd.mall__dim_coupon_info
	where dt='$db_date'
)ci on cu.coupon_id=ci.id;
"
$hive -e "$sql"

10.5 Daily activity statistics (business data)

  • Create table
drop table if exists dws.mall__activity_info_daycount

CREATE EXTERNAL TABLE `dws.mall__activity_info_daycount`(
`id` string COMMENT '编号',
`activity_name` string COMMENT '活动名称',
`activity_type` string COMMENT '活动类型',
`start_time` string COMMENT '开始时间',
`end_time` string COMMENT '结束时间',
`create_time` string COMMENT '创建时间',
`order_count` bigint COMMENT '下单次数',
`payment_count` bigint COMMENT '支付次数'
  ) COMMENT '每日活动统计表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dws/mall/activity_info_daycount/'

  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dws
table_name=activity_info_daycount
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name partition(dt='$db_date')
select
	oi.activity_id,
	ai.activity_name,
	ai.activity_type,
	ai.start_time,
	ai.end_time,
	ai.create_time,
	oi.order_count,
	oi.payment_count
from
(
	select
		activity_id,
		sum(if(date_format(create_time,'yyyy-MM-dd')='$db_date',1,0))
		order_count,
		sum(if(date_format(payment_time,'yyyy-MM-dd')='$db_date',1,0))
		payment_count
	from dwd.mall__fact_order_info
	where (dt='$db_date' or dt=date_add('$db_date',-1))
	and activity_id is not null
	group by activity_id
)oi
join
(
	select
		*
	from dwd.mall__dim_activity_info
	where dt='$db_date'
)ai
on oi.activity_id=ai.id;
"
$hive -e "$sql"

10.6 Daily purchase behavior (business data)

  • Create table
drop table if exists dws.mall__sale_detail_daycount

CREATE EXTERNAL TABLE `dws.mall__sale_detail_daycount`(
user_id string comment '用户 id',
sku_id string comment '商品 id',
user_gender string comment '用户性别',
user_age string comment '用户年龄',
user_level string comment '用户等级',
order_price decimal(10,2) comment '商品价格',
sku_name string comment '商品名称',
sku_tm_id string comment '品牌 id',
sku_category3_id string comment '商品三级品类 id',
sku_category2_id string comment '商品二级品类 id',
sku_category1_id string comment '商品一级品类 id',
sku_category3_name string comment '商品三级品类名称',
sku_category2_name string comment '商品二级品类名称',
sku_category1_name string comment '商品一级品类名称',
spu_id string comment '商品 spu',
sku_num int comment '购买个数',
order_count bigint comment '当日下单单数',
order_amount decimal(16,2) comment '当日下单金额'
  ) COMMENT '每日购买行为表'
PARTITIONED BY (
  `dt` String COMMENT 'partition'
)
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dws/mall/sale_detail_daycount/'
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dws
table_name=sale_detail_daycount
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name partition(dt='$db_date')
select
	op.user_id,
	op.sku_id,
	ui.gender,
	months_between('$db_date', ui.birthday)/12 age,
	ui.user_level,
	si.price,
	si.sku_name,
	si.tm_id,
	si.category3_id,
	si.category2_id,
	si.category1_id,
	si.category3_name,
	si.category2_name,
	si.category1_name,
	si.spu_id,
	op.sku_num,
	op.order_count,
	op.order_amount
from
(
	select
		user_id,
		sku_id,
		sum(sku_num) sku_num,
		count(*) order_count,
		sum(total_amount) order_amount
	from dwd.mall__fact_order_detail
	where dt='$db_date'
	group by user_id, sku_id
)op
join
(
	select
		*
	from dwd.mall__dim_user_info_his
	where end_date='9999-99-99'
)ui on op.user_id = ui.id
join
(
	select
		*
	from dwd.mall__dim_sku_info
	where dt='$db_date'
)si on op.sku_id = si.id;
"
$hive -e "$sql"

11 Building the DWT layer

This layer rolls the DWS daily data up into cumulative subject-wide tables: no partitions, no compression, and the data is fully overwritten every day.
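Every DWT load below has the same shape: full outer join yesterday's cumulative table (old) against today's daycount (new), keep rows seen on either side, and add today's counts onto the running totals. A minimal sketch of the pattern (old_topic and daycount are hypothetical tables):

select
	nvl(new.key, old.key) key,                 -- keep keys present on either side
	nvl(old.total, 0) + nvl(new.cnt, 0) total  -- accumulate today's count
from old_topic old
full outer join
(select key, cnt from daycount where dt='2020-03-10') new
on old.key = new.key;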

11.1 Device subject wide table

  • Create table
drop table if exists dwt.mall__uv_topic

CREATE EXTERNAL TABLE `dwt.mall__uv_topic`(
`mid_id` string COMMENT '设备唯一标识',
`user_id` string COMMENT '用户标识',
`version_code` string COMMENT '程序版本号',
`version_name` string COMMENT '程序版本名',
`lang` string COMMENT '系统语言',
`source` string COMMENT '渠道号',
`os` string COMMENT '安卓系统版本',
`area` string COMMENT '区域',
`model` string COMMENT '手机型号',
`brand` string COMMENT '手机品牌',
`sdk_version` string COMMENT 'sdkVersion',
`gmail` string COMMENT 'gmail',
`height_width` string COMMENT '屏幕宽高',
`app_time` string COMMENT '客户端日志产生时的时间',
`network` string  COMMENT '网络模式',
`lng` string COMMENT '经度',
`lat` string COMMENT '纬度',
`login_date_first` string comment '首次活跃时间',
`login_date_last` string comment '末次活跃时间',
`login_day_count` bigint comment '当日活跃次数',
`login_count` bigint comment '累积活跃天数'
  ) COMMENT '设备主题宽表'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwt/mall/uv_topic/'

  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwt
table_name=uv_topic
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
select
	nvl(new.mid_id,old.mid_id),
	nvl(new.user_id,old.user_id),
	nvl(new.version_code,old.version_code),
	nvl(new.version_name,old.version_name),
	nvl(new.lang,old.lang),
	nvl(new.source,old.source),
	nvl(new.os,old.os),
	nvl(new.area,old.area),
	nvl(new.model,old.model),
	nvl(new.brand,old.brand),
	nvl(new.sdk_version,old.sdk_version),
	nvl(new.gmail,old.gmail),
	nvl(new.height_width,old.height_width),
	nvl(new.app_time,old.app_time),
	nvl(new.network,old.network),
	nvl(new.lng,old.lng),
	nvl(new.lat,old.lat),
	if(old.mid_id is null,'$db_date',old.login_date_first),
	if(new.mid_id is not null,'$db_date',old.login_date_last),
	if(new.mid_id is not null, new.login_count,0),
	nvl(old.login_count,0)+if(new.login_count>0,1,0)
from
(
	select
		*
	from dwt.mall__uv_topic
)old
full outer join
(
	select
		*
	from dws.mall__uv_detail_daycount
	where dt='$db_date'
)new
on old.mid_id=new.mid_id;
"
$hive -e "$sql"

11.2 Member subject wide table

  • Create table
drop table if exists dwt.mall__user_topic

CREATE EXTERNAL TABLE `dwt.mall__user_topic`(
user_id string comment '用户 id',
login_date_first string comment '首次登录时间',
login_date_last string comment '末次登录时间',
login_count bigint comment '累积登录天数',
login_last_30d_count bigint comment '最近 30 日登录天数',
order_date_first string comment '首次下单时间',
order_date_last string comment '末次下单时间',
order_count bigint comment '累积下单次数',
order_amount decimal(16,2) comment '累积下单金额',
order_last_30d_count bigint comment '最近 30 日下单次数',
order_last_30d_amount bigint comment '最近 30 日下单金额',
payment_date_first string comment '首次支付时间',
payment_date_last string comment '末次支付时间',
payment_count decimal(16,2) comment '累积支付次数',
payment_amount decimal(16,2) comment '累积支付金额',
payment_last_30d_count decimal(16,2) comment '最近 30 日支付次数',
payment_last_30d_amount decimal(16,2) comment '最近 30 日支付金额'
  ) COMMENT '会员主题宽表'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwt/mall/user_topic/'

  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwt
table_name=user_topic
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
select
	nvl(new.user_id,old.user_id),
	if(old.login_date_first is null and
	new.login_count>0,'$db_date',old.login_date_first),
	if(new.login_count>0,'$db_date',old.login_date_last),
	nvl(old.login_count,0)+if(new.login_count>0,1,0),
	nvl(new.login_last_30d_count,0),
	if(old.order_date_first is null and
	new.order_count>0,'$db_date',old.order_date_first),
	if(new.order_count>0,'$db_date',old.order_date_last),
	nvl(old.order_count,0)+nvl(new.order_count,0),
	nvl(old.order_amount,0)+nvl(new.order_amount,0),
	nvl(new.order_last_30d_count,0),
	nvl(new.order_last_30d_amount,0),
	if(old.payment_date_first is null and
	new.payment_count>0,'$db_date',old.payment_date_first),
	if(new.payment_count>0,'$db_date',old.payment_date_last),
	nvl(old.payment_count,0)+nvl(new.payment_count,0),
	nvl(old.payment_amount,0)+nvl(new.payment_amount,0),
	nvl(new.payment_last_30d_count,0),
	nvl(new.payment_last_30d_amount,0)
from
dwt.mall__user_topic old
full outer join
(
	select
		user_id,
		sum(if(dt='$db_date',login_count,0)) login_count,
		sum(if(dt='$db_date',order_count,0)) order_count,
		sum(if(dt='$db_date',order_amount,0)) order_amount,
		sum(if(dt='$db_date',payment_count,0)) payment_count,
		sum(if(dt='$db_date',payment_amount,0)) payment_amount,
		sum(if(login_count>0,1,0)) login_last_30d_count,
		sum(order_count) order_last_30d_count,
		sum(order_amount) order_last_30d_amount,
		sum(payment_count) payment_last_30d_count,
		sum(payment_amount) payment_last_30d_amount
	from dws.mall__user_action_daycount
	where dt>=date_add( '$db_date',-30)
	group by user_id
)new
on old.user_id=new.user_id;
"
$hive -e "$sql"
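Note that the new subquery above scans the last 30 days only once, deriving both the current-day values (sum(if(dt='$db_date',...))) and the 30-day rollups (plain sum) from that single pass. The trick in miniature (table t hypothetical):

select
	key,
	sum(if(dt='2020-03-10', cnt, 0)) day_cnt,  -- today's slice only
	sum(cnt) last_30d_cnt                      -- the whole 30-day window
from t
where dt >= date_add('2020-03-10', -30)
group by key;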

11.3 SKU subject wide table

  • Create table
drop table if exists dwt.mall__sku_topic

CREATE EXTERNAL TABLE `dwt.mall__sku_topic`(
sku_id string comment 'sku_id',
spu_id string comment 'spu_id',
order_last_30d_count bigint comment '最近 30 日被下单次数',
order_last_30d_num bigint comment '最近 30 日被下单件数',
order_last_30d_amount decimal(16,2) comment '最近 30 日被下单金额',
order_count bigint comment '累积被下单次数',
order_num bigint comment '累积被下单件数',
order_amount decimal(16,2) comment '累积被下单金额',
payment_last_30d_count bigint comment '最近 30 日被支付次数',
payment_last_30d_num bigint comment '最近 30 日被支付件数',
payment_last_30d_amount decimal(16,2) comment '最近 30 日被支付金额',
payment_count bigint comment '累积被支付次数',
payment_num bigint comment '累积被支付件数',
payment_amount decimal(16,2) comment '累积被支付金额',
refund_last_30d_count bigint comment '最近三十日退款次数',
refund_last_30d_num bigint comment '最近三十日退款件数',
refund_last_30d_amount decimal(10,2) comment '最近三十日退款金额',
refund_count bigint comment '累积退款次数',
refund_num bigint comment '累积退款件数',
refund_amount decimal(10,2) comment '累积退款金额',
cart_last_30d_count bigint comment '最近 30 日被加入购物车次数',
cart_last_30d_num bigint comment '最近 30 日被加入购物车件数',
cart_count bigint comment '累积被加入购物车次数',
cart_num bigint comment '累积被加入购物车件数',
favor_last_30d_count bigint comment '最近 30 日被收藏次数',
favor_count bigint comment '累积被收藏次数',
appraise_last_30d_good_count bigint comment '最近 30 日好评数',
appraise_last_30d_mid_count bigint comment '最近 30 日中评数',
appraise_last_30d_bad_count bigint comment '最近 30 日差评数',
appraise_last_30d_default_count bigint comment '最近 30 日默认评价数',
appraise_good_count bigint comment '累积好评数',
appraise_mid_count bigint comment '累积中评数',
appraise_bad_count bigint comment '累积差评数',
appraise_default_count bigint comment '累积默认评价数'
  ) COMMENT '商品主题宽表'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwt/mall/sku_topic/'

  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwt
table_name=sku_topic
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
select
	nvl(new.sku_id,old.sku_id), sku_info.spu_id,
	nvl(new.order_count30,0),
	nvl(new.order_num30,0),
	nvl(new.order_amount30,0),
	nvl(old.order_count,0) + nvl(new.order_count,0),
	nvl(old.order_num,0) + nvl(new.order_num,0),
	nvl(old.order_amount,0) + nvl(new.order_amount,0),
	nvl(new.payment_count30,0),
	nvl(new.payment_num30,0),
	nvl(new.payment_amount30,0),
	nvl(old.payment_count,0) + nvl(new.payment_count,0),
	nvl(old.payment_num,0) + nvl(new.payment_num,0),
	nvl(old.payment_amount,0) + nvl(new.payment_amount,0),
	nvl(new.refund_count30,0),
	nvl(new.refund_num30,0),
	nvl(new.refund_amount30,0),
	nvl(old.refund_count,0) + nvl(new.refund_count,0),
	nvl(old.refund_num,0) + nvl(new.refund_num,0),
	nvl(old.refund_amount,0) + nvl(new.refund_amount,0),
	nvl(new.cart_count30,0),
	nvl(new.cart_num30,0),
	nvl(old.cart_count,0) + nvl(new.cart_count,0),
	nvl(old.cart_num,0) + nvl(new.cart_num,0),
	nvl(new.favor_count30,0),
	nvl(old.favor_count,0) + nvl(new.favor_count,0),
	nvl(new.appraise_good_count30,0),
	nvl(new.appraise_mid_count30,0),
	nvl(new.appraise_bad_count30,0),
	nvl(new.appraise_default_count30,0) ,
	nvl(old.appraise_good_count,0) + nvl(new.appraise_good_count,0),
	nvl(old.appraise_mid_count,0) + nvl(new.appraise_mid_count,0),
	nvl(old.appraise_bad_count,0) + nvl(new.appraise_bad_count,0),
	nvl(old.appraise_default_count,0) + nvl(new.appraise_default_count,0)
from
(
	select
		sku_id,
		spu_id,
		order_last_30d_count,
		order_last_30d_num,
		order_last_30d_amount,
		order_count,
		order_num,
		order_amount ,
		payment_last_30d_count,
		payment_last_30d_num,
		payment_last_30d_amount,
		payment_count,
		payment_num,
		payment_amount,
		refund_last_30d_count,
		refund_last_30d_num,
		refund_last_30d_amount,
		refund_count,
		refund_num,
		refund_amount,
		cart_last_30d_count,
		cart_last_30d_num,
		cart_count,
		cart_num,
		favor_last_30d_count,
		favor_count,
		appraise_last_30d_good_count,
		appraise_last_30d_mid_count,
		appraise_last_30d_bad_count,
		appraise_last_30d_default_count,
		appraise_good_count,
		appraise_mid_count,
		appraise_bad_count,
		appraise_default_count
	from dwt.mall__sku_topic
)old
full outer join
(
	select
		sku_id,
		sum(if(dt='$db_date', order_count,0 )) order_count,
		sum(if(dt='$db_date',order_num ,0 )) order_num,
		sum(if(dt='$db_date',order_amount,0 )) order_amount ,
		sum(if(dt='$db_date',payment_count,0 )) payment_count,
		sum(if(dt='$db_date',payment_num,0 )) payment_num,
		sum(if(dt='$db_date',payment_amount,0 )) payment_amount,
		sum(if(dt='$db_date',refund_count,0 )) refund_count,
		sum(if(dt='$db_date',refund_num,0 )) refund_num,
		sum(if(dt='$db_date',refund_amount,0 )) refund_amount,
		sum(if(dt='$db_date',cart_count,0 )) cart_count,
		sum(if(dt='$db_date',cart_num,0 )) cart_num,
		sum(if(dt='$db_date',favor_count,0 )) favor_count,
		sum(if(dt='$db_date',appraise_good_count,0 )) appraise_good_count,
		sum(if(dt='$db_date',appraise_mid_count,0 ) ) appraise_mid_count ,
		sum(if(dt='$db_date',appraise_bad_count,0 )) appraise_bad_count,
		sum(if(dt='$db_date',appraise_default_count,0 )) appraise_default_count,
		sum(order_count) order_count30 ,
		sum(order_num) order_num30,
		sum(order_amount) order_amount30,
		sum(payment_count) payment_count30,
		sum(payment_num) payment_num30,
		sum(payment_amount) payment_amount30,
		sum(refund_count) refund_count30,
		sum(refund_num) refund_num30,
		sum(refund_amount) refund_amount30,
		sum(cart_count) cart_count30,
		sum(cart_num) cart_num30,
		sum(favor_count) favor_count30,
		sum(appraise_good_count) appraise_good_count30,
		sum(appraise_mid_count) appraise_mid_count30,
		sum(appraise_bad_count) appraise_bad_count30,
		sum(appraise_default_count) appraise_default_count30
	from dws.mall__sku_action_daycount
	where dt >= date_add ('$db_date', -30)
	group by sku_id
)new
on new.sku_id = old.sku_id
left join
(
	select * from dwd.mall__dim_sku_info where dt='$db_date'
) sku_info
on nvl(new.sku_id,old.sku_id)= sku_info.id;
"
$hive -e "$sql"

11.4 Coupon subject wide table

  • Create table
drop table if exists dwt.mall__coupon_topic

CREATE EXTERNAL TABLE `dwt.mall__coupon_topic`(
`coupon_id` string COMMENT '优惠券 ID',
`get_day_count` bigint COMMENT '当日领用次数',
`using_day_count` bigint COMMENT '当日使用(下单)次数',
`used_day_count` bigint COMMENT '当日使用(支付)次数',
`get_count` bigint COMMENT '累积领用次数',
`using_count` bigint COMMENT '累积使用(下单)次数',
`used_count` bigint COMMENT '累积使用(支付)次数'
  ) COMMENT '优惠券主题宽表'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwt/mall/coupon_topic/'

  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwt
table_name=coupon_topic
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
select
	nvl(new.coupon_id,old.coupon_id),
	nvl(new.get_count,0),
	nvl(new.using_count,0),
	nvl(new.used_count,0),
	nvl(old.get_count,0)+nvl(new.get_count,0),
	nvl(old.using_count,0)+nvl(new.using_count,0),
	nvl(old.used_count,0)+nvl(new.used_count,0)
from
(
	select
		*
	from dwt.mall__coupon_topic
)old
full outer join
(
	select
		coupon_id,
		get_count,
		using_count,
		used_count
	from dws.mall__coupon_use_daycount
	where dt='$db_date'
)new
on old.coupon_id=new.coupon_id;
"
$hive -e "$sql"

11.5 Activity subject wide table

  • Create table
drop table if exists dwt.mall__activity_topic

CREATE EXTERNAL TABLE `dwt.mall__activity_topic`(
`id` string COMMENT '活动 id',
`activity_name` string COMMENT '活动名称',
`order_day_count` bigint COMMENT '当日日下单次数',
`payment_day_count` bigint COMMENT '当日支付次数',
`order_count` bigint COMMENT '累积下单次数',
`payment_count` bigint COMMENT '累积支付次数'
  ) COMMENT '活动主题宽表'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/dwt/mall/activity_topic/'

  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=dwt
table_name=activity_topic
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert overwrite table $hive_table_name
select
	nvl(new.id,old.id),
	nvl(new.activity_name,old.activity_name),
	nvl(new.order_count,0),
	nvl(new.payment_count,0),
	nvl(old.order_count,0)+nvl(new.order_count,0),
	nvl(old.payment_count,0)+nvl(new.payment_count,0)
from
(
	select
		*
	from dwt.mall__activity_topic
)old
full outer join
(
	select
		id,
		activity_name,
		order_count,
		payment_count
	from dws.mall__activity_info_daycount
where dt='$db_date'
)new
on old.id=new.id;
"
$hive -e "$sql"

12 Building the ADS layer

This layer serves the final data requirements. Whether to compress depends on how the data will be exported and on its volume; no partitioning is needed, and results are written once a day.

12.1 Device subject

12.1.1 Active devices (daily, weekly, monthly)

Daily active: devices active on the given day

Weekly active: devices active during the given week

Monthly active: devices active during the given month
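The weekly and monthly windows below are all derived from '$db_date' with next_day/last_day arithmetic. As a worked example, if $db_date is 2020-03-10 (a Tuesday), next_day('2020-03-10','MO') is 2020-03-16, so:

select
	date_add(next_day('2020-03-10','MO'),-7) monday,   -- 2020-03-09, this week's Monday
	date_add(next_day('2020-03-10','MO'),-1) sunday,   -- 2020-03-15, this week's Sunday
	last_day('2020-03-10') month_end;                  -- 2020-03-31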

  • Create table
drop table if exists ads.mall__uv_count

CREATE EXTERNAL TABLE `ads.mall__uv_count`(
`dt` string COMMENT '统计日期',
`day_count` bigint COMMENT '当日用户数量',
`wk_count` bigint COMMENT '当周用户数量',
`mn_count` bigint COMMENT '当月用户数量',
`is_weekend` string COMMENT 'Y,N 是否是周末,用于得到本周最终结果',
`is_monthend` string COMMENT 'Y,N 是否是月末,用于得到本月最终结果'
  ) COMMENT '活跃设备数表'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/uv_count/'
tblproperties ("parquet.compression"="snappy") 
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=uv_count
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	'$db_date' dt,
	daycount.ct,
	wkcount.ct,
	mncount.ct,
	if(date_add(next_day('$db_date','MO'),-1)='$db_date','Y','N') ,
	if(last_day('$db_date')='$db_date','Y','N')
from
(
	select
		'$db_date' dt,
		count(*) ct
	from dwt.mall__uv_topic
	where login_date_last='$db_date'
)daycount join
(
	select
	'$db_date' dt,
	count (*) ct
	from dwt.mall__uv_topic
	where login_date_last>=date_add(next_day('$db_date','MO'),-7)
	and login_date_last<= date_add(next_day('$db_date','MO'),-1)
) wkcount on daycount.dt=wkcount.dt
join
(
	select
		'$db_date' dt,
		count (*) ct
	from dwt.mall__uv_topic
	where
	date_format(login_date_last,'yyyy-MM')=date_format('$db_date','yyyy-MM')
)mncount on daycount.dt=mncount.dt;
"
$hive -e "$sql"

12.1.2 New devices per day

  • Create table
drop table if exists ads.mall__new_mid_count

CREATE EXTERNAL TABLE `ads.mall__new_mid_count`(
`create_date` string comment '创建时间' ,
`new_mid_count` bigint comment '新增设备数量'
  ) COMMENT '每日新增设备表'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/new_mid_count/'
tblproperties ("parquet.compression"="snappy") 
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=new_mid_count
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	login_date_first,
	count(*)
from dwt.mall__uv_topic
where login_date_first='$db_date'
group by login_date_first;
"
$hive -e "$sql"

12.1.3 Silent user count

Silent users: devices that launched the app only on the day it was installed, and that day was at least 7 days ago

  • Create table
drop table if exists ads.mall__silent_count

CREATE EXTERNAL TABLE `ads.mall__silent_count`(
`dt` string COMMENT '统计日期',
`silent_count` bigint COMMENT '沉默设备数'
  ) COMMENT '沉默用户数表'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/silent_count/'
tblproperties ("parquet.compression"="snappy") 
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=silent_count
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	'$db_date',
	count(*)
from dwt.mall__uv_topic
where login_date_first=login_date_last
and login_date_last<=date_add('$db_date',-7);
"
$hive -e "$sql"

12.1.4 Weekly returning user count

Returning users this week: devices inactive last week but active this week, excluding devices new this week

  • Create table
drop table if exists ads.mall__back_count

CREATE EXTERNAL TABLE `ads.mall__back_count`(
`wk_dt` string COMMENT '统计日期所在周',
`wastage_count` bigint COMMENT '回流设备数'
  ) COMMENT '本周回流用户数表'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/back_count/'
tblproperties ("parquet.compression"="snappy") 
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=back_count
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	concat(date_add(next_day('$db_date','MO'),-7),'_',date_add(next_day('$db_date','MO'),-1)),
	count(*)
from
(
	-- active this week, but not first seen this week
	select
		mid_id
	from dwt.mall__uv_topic
	where login_date_last>=date_add(next_day('$db_date','MO'),-7)
	and login_date_last<= date_add(next_day('$db_date','MO'),-1)
	and login_date_first<date_add(next_day('$db_date','MO'),-7)
)current_wk
left join
(
	-- active at some point last week
	select
		mid_id
	from dws.mall__uv_detail_daycount
	where dt>=date_add(next_day('$db_date','MO'),-7*2)
	and dt<= date_add(next_day('$db_date','MO'),-7-1)
	group by mid_id
)last_wk
on current_wk.mid_id=last_wk.mid_id
where last_wk.mid_id is null;
"
$hive -e "$sql"

12.1.5 Churned user count

Churned users: devices with no activity in the last 7 days

  • Create table
drop table if exists ads.mall__wastage_count

CREATE EXTERNAL TABLE `ads.mall__wastage_count`(
`dt` string COMMENT '统计日期',
`wastage_count` bigint COMMENT '流失设备数'
  ) COMMENT '流失用户数表'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/wastage_count/'
tblproperties ("parquet.compression"="snappy") 
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=wastage_count
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	'$db_date',
	count(*)
from
(
	select
		mid_id
	from dwt.mall__uv_topic
	where login_date_last<=date_add('$db_date',-7)
	group by mid_id
)t1;
"
$hive -e "$sql"

12.1.6 Retention rate

  • Create table
drop table if exists ads.mall__user_retention_day_rate

CREATE EXTERNAL TABLE `ads.mall__user_retention_day_rate`(
`stat_date` string comment '统计日期',
`create_date` string comment '设备新增日期',
`retention_day` int comment '截止当前日期留存天数',
`retention_count` bigint comment '留存数量',
`new_mid_count` bigint comment '设备新增数量',
`retention_ratio` decimal(10,2) comment '留存率'
  ) COMMENT '留存率表'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/user_retention_day_rate/'
tblproperties ("parquet.compression"="snappy") 
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=user_retention_day_rate
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	'$db_date',--statistics date
	date_add('$db_date',-1),--new-device date
	1,--retention days
	sum(if(login_date_first=date_add('$db_date',-1) and
	login_date_last='$db_date',1,0)),--1-day retention count for devices new on (db_date - 1)
	sum(if(login_date_first=date_add('$db_date',-1),1,0)),--devices new on (db_date - 1)
	sum(if(login_date_first=date_add('$db_date',-1) and
	login_date_last='$db_date',1,0))/sum(if(login_date_first=date_add('$db_date',-1),1,0))*100
from dwt.mall__uv_topic
union all
select
	'$db_date',--statistics date
	date_add('$db_date',-2),--new-device date
	2,--retention days
	sum(if(login_date_first=date_add('$db_date',-2) and
	login_date_last='$db_date',1,0)),--2-day retention count for devices new on (db_date - 2)
	sum(if(login_date_first=date_add('$db_date',-2),1,0)),--devices new on (db_date - 2)
	sum(if(login_date_first=date_add('$db_date',-2) and
	login_date_last='$db_date',1,0))/sum(if(login_date_first=date_add('$db_date',-2),1,0))*100
from dwt.mall__uv_topic
union all
select
	'$db_date',--statistics date
	date_add('$db_date',-3),--new-device date
	3,--retention days
	sum(if(login_date_first=date_add('$db_date',-3) and
	login_date_last='$db_date',1,0)),--3-day retention count for devices new on (db_date - 3)
	sum(if(login_date_first=date_add('$db_date',-3),1,0)),--devices new on (db_date - 3)
	sum(if(login_date_first=date_add('$db_date',-3) and
	login_date_last='$db_date',1,0))/sum(if(login_date_first=date_add('$db_date',-3),1,0))*100
from dwt.mall__uv_topic;
"
$hive -e "$sql"

12.1.7 Users active in each of the last three weeks

  • Create table
drop table if exists ads.mall__continuity_wk_count

CREATE EXTERNAL TABLE `ads.mall__continuity_wk_count`(
`dt` string COMMENT '统计日期,一般用结束周周日日期,如果每天计算一次,可用当天日期',
`wk_dt` string COMMENT '持续时间',
`continuity_count` bigint COMMENT '活跃次数'
  ) COMMENT '最近连续三周活跃用户数表'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/continuity_wk_count/'
tblproperties ("parquet.compression"="snappy") 
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=continuity_wk_count
hive_table_name=$APP2.mall__$table_name


# Use the input date if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else 
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	'$db_date',
	concat(date_add(next_day('$db_date','MO'),-7*3),'_',date_add(next_day('$db_date','MO'),-1)),
	count(*)
from
(
	select
		mid_id
	from
	(
		select
			mid_id
		from dws.mall__uv_detail_daycount
		where dt>=date_add(next_day('$db_date','monday'),-7)
		and dt<=date_add(next_day('$db_date','monday'),-1)
		group by mid_id
		union all
		select
			mid_id
		from dws.mall__uv_detail_daycount
		where dt>=date_add(next_day('$db_date','monday'),-7*2)
		and dt<=date_add(next_day('$db_date','monday'),-7-1)
		group by mid_id
		union all
		select
			mid_id
		from dws.mall__uv_detail_daycount
		where dt>=date_add(next_day('$db_date','monday'),-7*3)
		and dt<=date_add(next_day('$db_date','monday'),-7*2-1)
		group by mid_id
	)t1
	group by mid_id
	having count(*)=3
)t2
"
$hive -e "$sql"

12.1.8 Users active three consecutive days within the last seven days
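The load query in this subsection uses the classic "date minus rank" trick: rank each device's active dates in order, subtract the rank from the date, and dates from a consecutive run collapse onto the same value. A minimal sketch with inline data (values hypothetical):

select mid_id, date_sub(dt, rk) grp, count(*) run_len
from
(
	select mid_id, dt, rank() over(partition by mid_id order by dt) rk
	from (select 'm1' mid_id, '2020-03-05' dt
	      union all select 'm1', '2020-03-06'
	      union all select 'm1', '2020-03-07') t
) t1
group by mid_id, date_sub(dt, rk);
-- one group with run_len = 3, i.e. a run of three consecutive active days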

  • Create table
drop table if exists ads.mall__continuity_uv_count

CREATE EXTERNAL TABLE `ads.mall__continuity_uv_count`(
`dt` string COMMENT '统计日期',
`wk_dt` string COMMENT '最近 7 天日期',
`continuity_count` bigint
  ) COMMENT '最近七天内连续三天活跃用户数表'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/continuity_uv_count/'
tblproperties ("parquet.compression"="snappy") 
  • 导入数据
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=continuity_uv_count
hive_table_name=$APP2.${APP1}__$table_name


# Use the date supplied via the date variable if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	'$db_date',
	concat(date_add('$db_date',-6),'_','$db_date'),
	count(*)
from
(
	select 
		mid_id
	from
	(
		select 
			mid_id
		from
		(
			select
				mid_id,
				date_sub(dt,rank) date_dif
			from
			(
				select
					mid_id,
					dt,
					rank() over(partition by mid_id order by dt) rank
				from dws.mall__uv_detail_daycount
				where dt>=date_add('$db_date',-6) and
				dt<='$db_date'
			)t1
		)t2
		group by mid_id,date_dif
		having count(*)>=3
	)t3
	group by mid_id
)t4;
"
$hive -e "$sql"
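The core of this query is the classic gaps-and-islands trick: rank each device's active dates chronologically, then subtract the rank from the date. The difference stays constant across a run of consecutive days and jumps whenever there is a gap, so grouping by (mid_id, date_dif) with having count(*)>=3 finds runs of at least three consecutive active days; the outer group by mid_id deduplicates devices that have more than one qualifying run. A sketch with made-up dates for a single device:

dt           rank   date_sub(dt, rank)
2020-03-04   1      2020-03-03
2020-03-05   2      2020-03-03    -- same difference: consecutive run continues
2020-03-07   3      2020-03-04    -- gap on 03-06, so a new group starts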

12.2 Member Topic

12.2.1 Member Topic Summary

  • Create table
drop table if exists ads.mall__user_topic;

CREATE EXTERNAL TABLE `ads.mall__user_topic`(
`dt` string COMMENT 'stat date',
`day_users` string COMMENT 'active member count',
`day_new_users` string COMMENT 'new member count',
`day_new_payment_users` string COMMENT 'new paying member count',
`payment_users` string COMMENT 'total paying member count',
`users` string COMMENT 'total member count',
`day_users2users` decimal(10,2) COMMENT 'member activity rate',
`payment_users2users` decimal(10,2) COMMENT 'member payment rate',
`day_new_users2users` decimal(10,2) COMMENT 'member freshness (new members / active members)'
  ) COMMENT 'member topic summary table'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/user_topic/'
tblproperties ("parquet.compression"="snappy");
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=user_topic
hive_table_name=$APP2.${APP1}__$table_name


# Use the date supplied via the date variable if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	'$db_date',
	sum(if(login_date_last='$db_date',1,0)),
	sum(if(login_date_first='$db_date',1,0)),
	sum(if(payment_date_first='$db_date',1,0)),
	sum(if(payment_count>0,1,0)),
	count(*),
	sum(if(login_date_last='$db_date',1,0))/count(*),
	sum(if(payment_count>0,1,0))/count(*),
	sum(if(login_date_first='$db_date',1,0))/sum(if(login_date_last='$db_date',1,0))
from dwt.mall__user_topic
"
$hive -e "$sql"

12.2.2 Funnel Analysis

Compute the conversion rates along the "view -> cart -> order -> payment" funnel.

Approach: count the number of users who completed each action, then compute the ratio between adjacent steps (a small worked example follows below).
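For instance, if on the stat date 1,000 devices visited, 200 users added something to a cart, 100 placed an order and 80 paid, the ratios would be 200/1000 = 0.20, 100/200 = 0.50 and 80/100 = 0.80. (Illustrative numbers only.)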

  • Create table
drop table if exists ads.mall__user_action_convert_day;

CREATE EXTERNAL TABLE `ads.mall__user_action_convert_day`(
`dt` string COMMENT 'stat date',
`total_visitor_m_count` bigint COMMENT 'total visitor count',
`cart_u_count` bigint COMMENT 'users who added to cart',
`visitor2cart_convert_ratio` decimal(10,2) COMMENT 'visit-to-cart conversion rate',
`order_u_count` bigint COMMENT 'users who placed an order',
`cart2order_convert_ratio` decimal(10,2) COMMENT 'cart-to-order conversion rate',
`payment_u_count` bigint COMMENT 'users who paid',
`order2payment_convert_ratio` decimal(10,2) COMMENT 'order-to-payment conversion rate'
  ) COMMENT 'funnel analysis table'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/user_action_convert_day/'
tblproperties ("parquet.compression"="snappy");
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=user_action_convert_day
hive_table_name=$APP2.${APP1}__$table_name


# Use the date supplied via the date variable if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	'$db_date',
	uv.day_count,
	ua.cart_count,
	cast(ua.cart_count/uv.day_count as decimal(10,2)) visitor2cart_convert_ratio,
	ua.order_count,
	cast(ua.order_count/ua.cart_count as decimal(10,2)) cart2order_convert_ratio,
	ua.payment_count,
	cast(ua.payment_count/ua.order_count as decimal(10,2)) order2payment_convert_ratio
from
(
	select
		dt,
		sum(if(cart_count>0,1,0)) cart_count,
		sum(if(order_count>0,1,0)) order_count,
		sum(if(payment_count>0,1,0)) payment_count
	from dws.mall__user_action_daycount
	where dt='$db_date'
	group by dt
)ua join ads.mall__uv_count uv on uv.dt=ua.dt;
"
$hive -e "$sql"

12.3 Product Topic

12.3.1 Product Count

  • Create table
drop table if exists ads.mall__product_info;

CREATE EXTERNAL TABLE `ads.mall__product_info`(
`dt` string COMMENT 'stat date',
`sku_num` string COMMENT 'number of SKUs',
`spu_num` string COMMENT 'number of SPUs'
  ) COMMENT 'product count table'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/product_info/'
tblproperties ("parquet.compression"="snappy");
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=product_info
hive_table_name=$APP2.${APP1}__$table_name


# Use the date supplied via the date variable if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	'$db_date' dt,
	sku_num,
	spu_num
from
(
	select
		'$db_date' dt,
		count(*) sku_num
	from
	dwt.mall__sku_topic
) tmp_sku_num
join
(
	select
		'$db_date' dt,
		count(*) spu_num
	from
	(
		select
			spu_id
		from
		dwt.mall__sku_topic
		group by
		spu_id
	) tmp_spu_id
) tmp_spu_num
on
tmp_sku_num.dt=tmp_spu_num.dt;
"
$hive -e "$sql"
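Both subqueries in the SQL above collapse dwt.mall__sku_topic to a single row keyed by the same constant date, so the join is effectively a cross join of two one-row results. A simpler equivalent (a sketch; '$db_date' stands for the script's date variable, and this is not the form the script above generates):

insert into table ads.mall__product_info
select
	'$db_date' dt,
	count(*) sku_num,
	count(distinct spu_id) spu_num
from dwt.mall__sku_topic;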

12.3.2 Top Products by Sales

  • Create table
drop table if exists ads.mall__product_sale_topN;

CREATE EXTERNAL TABLE `ads.mall__product_sale_topN`(
`dt` string COMMENT 'stat date',
`sku_id` string COMMENT 'SKU ID',
`payment_amount` bigint COMMENT 'payment amount'
  ) COMMENT 'top products by sales table'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/product_sale_topN/'
tblproperties ("parquet.compression"="snappy");
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=product_sale_topN
hive_table_name=$APP2.${APP1}__$table_name


# Use the date supplied via the date variable if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	'$db_date' dt,
	sku_id,
	payment_amount
from
dws.mall__sku_action_daycount
where
dt='$db_date'
order by payment_amount desc
limit 10;
"
$hive -e "$sql"

12.3.3 Top Products by Favorites

  • Create table
drop table if exists ads.mall__product_favor_topN;

CREATE EXTERNAL TABLE `ads.mall__product_favor_topN`(
`dt` string COMMENT 'stat date',
`sku_id` string COMMENT 'SKU ID',
`favor_count` bigint COMMENT 'favorite count'
  ) COMMENT 'top products by favorites table'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/product_favor_topN/'
tblproperties ("parquet.compression"="snappy");
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=product_favor_topN
hive_table_name=$APP2.${APP1}__$table_name


# Use the date supplied via the date variable if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	'$db_date' dt,
	sku_id,
	favor_count
from
dws.mall__sku_action_daycount
where
dt='$db_date'
order by favor_count desc
limit 10;
"
$hive -e "$sql"

12.3.4 Top Products by Add-to-Cart Count

  • Create table
drop table if exists ads.mall__product_cart_topN;

CREATE EXTERNAL TABLE `ads.mall__product_cart_topN`(
`dt` string COMMENT 'stat date',
`sku_id` string COMMENT 'SKU ID',
`cart_num` bigint COMMENT 'add-to-cart count'
  ) COMMENT 'top products by add-to-cart count table'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/product_cart_topN/'
tblproperties ("parquet.compression"="snappy");
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=product_cart_topN
hive_table_name=$APP2.${APP1}__$table_name


# Use the date supplied via the date variable if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	'$db_date' dt,
	sku_id,
	cart_num
from
dws.mall__sku_action_daycount
where
dt='$db_date'
order by cart_num desc
limit 10;
"
$hive -e "$sql"

12.3.5 Top Products by Refund Rate (Last 30 Days)

  • Create table
drop table if exists ads.mall__product_refund_topN;

CREATE EXTERNAL TABLE `ads.mall__product_refund_topN`(
`dt` string COMMENT 'stat date',
`sku_id` string COMMENT 'SKU ID',
`refund_ratio` decimal(10,2) COMMENT 'refund rate'
  ) COMMENT 'top products by refund rate (last 30 days) table'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/product_refund_topN/'
tblproperties ("parquet.compression"="snappy");
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=product_refund_topN
hive_table_name=$APP2.${APP1}__$table_name


# Use the date supplied via the date variable if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	'$db_date',
	sku_id,
	refund_last_30d_count/payment_last_30d_count*100 refund_ratio
from dwt.mall__sku_topic
order by refund_ratio desc
limit 10;
"
$hive -e "$sql"
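One caveat (a suggested tweak, not part of the original script): SKUs with no payments in the 30-day window have payment_last_30d_count = 0, and Hive returns NULL for division by zero, so such rows carry a NULL ratio. Adding a filter like where payment_last_30d_count>0 before the order by keeps them out of the top-10 ranking.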

12.3.6 Product Negative-Review Rate

  • Create table
drop table if exists ads.mall__appraise_bad_topN;

CREATE EXTERNAL TABLE `ads.mall__appraise_bad_topN`(
`dt` string COMMENT 'stat date',
`sku_id` string COMMENT 'SKU ID',
`appraise_bad_ratio` decimal(10,2) COMMENT 'negative review rate'
  ) COMMENT 'product negative-review rate table'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/appraise_bad_topN/'
tblproperties ("parquet.compression"="snappy");
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=appraise_bad_topN
hive_table_name=$APP2.${APP1}__$table_name


# Use the date supplied via the date variable if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	'$db_date' dt,
	sku_id,
	appraise_bad_count/(appraise_good_count+appraise_mid_count+appraise_bad_count+appraise_default_count) appraise_bad_ratio
from
dws.mall__sku_action_daycount
where
dt='$db_date'
order by appraise_bad_ratio desc
limit 10;
"
$hive -e "$sql"

12.4 Marketing Topic

12.4.1 Order Count Statistics

  • Create table
drop table if exists ads.mall__order_daycount;

CREATE EXTERNAL TABLE `ads.mall__order_daycount`(
dt string comment 'stat date',
order_count bigint comment 'daily order count',
order_amount bigint comment 'daily order amount',
order_users bigint comment 'daily ordering user count'
  ) COMMENT 'order count statistics table'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/order_daycount/'
tblproperties ("parquet.compression"="snappy");
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=order_daycount
hive_table_name=$APP2.${APP1}__$table_name


# Use the date supplied via the date variable if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	'$db_date',
	sum(order_count),
	sum(order_amount),
	sum(if(order_count>0,1,0))
from dws.mall__user_action_daycount
where dt='$db_date';
"
$hive -e "$sql"

12.4.2 Payment Statistics

  • Create table
drop table if exists ads.mall__payment_daycount;

CREATE EXTERNAL TABLE `ads.mall__payment_daycount`(
dt string comment 'stat date',
order_count bigint comment 'daily payment count',
order_amount bigint comment 'daily payment amount',
payment_user_count bigint comment 'daily paying user count',
payment_sku_count bigint comment 'daily paid SKU count',
payment_avg_time double comment 'average time from order to payment, in minutes'
  ) COMMENT 'payment statistics table'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/payment_daycount/'
tblproperties ("parquet.compression"="snappy");
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=payment_daycount
hive_table_name=$APP2.${APP1}__$table_name


# Use the date supplied via the date variable if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	tmp_payment.dt,
	tmp_payment.payment_count,
	tmp_payment.payment_amount,
	tmp_payment.payment_user_count,
	tmp_skucount.payment_sku_count,
	tmp_time.payment_avg_time
from
(
	select
		'$db_date' dt,
		sum(payment_count) payment_count,
		sum(payment_amount) payment_amount,
		sum(if(payment_count>0,1,0)) payment_user_count
	from dws.mall__user_action_daycount
	where dt='$db_date'
)tmp_payment
join
(
	select
		'$db_date' dt,
		sum(if(payment_count>0,1,0)) payment_sku_count
	from dws.mall__sku_action_daycount
	where dt='$db_date'
)tmp_skucount on tmp_payment.dt=tmp_skucount.dt
join
(
	select
		'$db_date' dt,
		sum(unix_timestamp(payment_time)-unix_timestamp(create_time))/count(*)/60
		payment_avg_time
	from dwd.mall__fact_order_info
	where dt='$db_date'
	and payment_time is not null
)tmp_time on tmp_payment.dt=tmp_time.dt
"
$hive -e "$sql"

12.4.3 Repurchase Rate

  • Create table
drop table if exists ads.mall__sale_tm_category1_stat_mn;

CREATE EXTERNAL TABLE `ads.mall__sale_tm_category1_stat_mn`(
tm_id string comment 'brand ID',
category1_id string comment 'level-1 category ID',
category1_name string comment 'level-1 category name',
buycount bigint comment 'buyer count',
buy_twice_last bigint comment 'users who bought at least twice',
buy_twice_last_ratio decimal(10,2) comment 'repurchase rate (bought >=2 times / bought >=1 time)',
buy_3times_last bigint comment 'users who bought at least three times',
buy_3times_last_ratio decimal(10,2) comment 'multi-repurchase rate (bought >=3 times / bought >=1 time)',
stat_mn string comment 'stat month',
stat_date string comment 'stat date'
  ) COMMENT 'repurchase rate table'
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/ads/mall/sale_tm_category1_stat_mn/'
tblproperties ("parquet.compression"="snappy");
  • Load data
#!/bin/bash

db_date=${date}
hive=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/hive
APP1=mall
APP2=ads
table_name=sale_tm_category1_stat_mn
hive_table_name=$APP2.${APP1}__$table_name


# Use the date supplied via the date variable if one was given; otherwise default to yesterday
if [ -n "${date}" ] ;then
        db_date=${date}
else
        db_date=`date -d "-1 day" +%F`
fi

sql=" 
insert into table $hive_table_name
select
	mn.sku_tm_id,
	mn.sku_category1_id,
	mn.sku_category1_name,
	sum(if(mn.order_count>=1,1,0)) buycount,
	sum(if(mn.order_count>=2,1,0)) buy_twice_last,
	sum(if(mn.order_count>=2,1,0))/sum(if(mn.order_count>=1,1,0)) buy_twice_last_ratio,
	sum(if(mn.order_count>=3,1,0)) buy_3times_last,
	sum(if(mn.order_count>=3,1,0))/sum(if(mn.order_count>=1,1,0)) buy_3times_last_ratio,
	date_format('$db_date' ,'yyyy-MM') stat_mn,
	'$db_date' stat_date
from
(
	select
		user_id,
		sd.sku_tm_id,
		sd.sku_category1_id,
		sd.sku_category1_name,
		sum(order_count) order_count
	from dws.mall__sale_detail_daycount sd
	where date_format(dt,'yyyy-MM')=date_format('$db_date' ,'yyyy-MM')
	group by user_id, sd.sku_tm_id, sd.sku_category1_id, sd.sku_category1_name
) mn
group by mn.sku_tm_id, mn.sku_category1_id, mn.sku_category1_name;
"
$hive -e "$sql"
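A worked example of these formulas: if 100 users bought a given brand and level-1 category at least once during the statistics month, 30 of them at least twice, and 10 at least three times, then buycount = 100, buy_twice_last_ratio = 30/100 = 0.30 and buy_3times_last_ratio = 10/100 = 0.10. (Illustrative numbers only.)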
