市场对数据分析的需求量大,但缺少大数据分析平台,需要大数据技术提供数据分析支持;同时数据访问量大,MySQL数据库存在瓶颈(并发量和单次查询的数据量达不到要求)
营销分析断层
产品迭代无法量化
用户运营不精准
全局运营指标监控不实时
数据采集
数据从哪来,到哪去,如何采集
数据仓库(存储)
集中存储各类数据的中央存储系统,提供数据的存储、管理和分析能力;是为企业所有级别的决策制定过程提供所有类型数据支持的战略集合
为数据挖掘,多维分析,决策支持报表等系统和应用提供统一的,准确的,易用的数据
即席查询
用户可以根据自己的需求,灵活的选择查询条件,系统能够根据用户的选择生成相应的统计报表
行为域(流量)分析(行为域数据的集合)
定义用户行为(用户是否活跃,留存,流失)
整体流量概况
累积用户,每日新增用户,日访问量,网站访问次数/时长/深度
用户流量分析,产品被使用情况分析,不同类型用户分布
用户获客分析
新增用户渠道:自然流量、渠道流量
各类渠道新用户人均访问时长
异常流量分析
用户分布分析
用户地区、性别、操作系统、注册时间、浏览器类型
版本访问流量分析
留存,转化率
用户留存、流失原因,用户流失分析
各类关键事件人数,分布,以及发生次数
归因分析
行为路径:事件到事件的路径
行为间隔:业务的转换情况(间隔时长)
转化漏斗分析
业务域分析(业务数据的集合)
用来描述业务:交易,营销,运营活动
购物车分析、交易金额分析、复购率分析、营销分析、广告运营分析、会员分析
用户画像:本质就是为用户打标签(核心)
基本属性:性别,地域
行为习惯:登录,浏览,分享,收藏,下单
消费习惯:金额,积分抵扣,喜欢品牌
数据采集传输工具
Flume、Kafka、Sqoop、DataX、Logstash
数据存储工具
MySQL、HDFS、HBase、Redis、MongoDB、ClickHouse
数据计算
Hive、Tez、Spark、Flink、Storm
数仓的主体、管理核心是Hive
数据查询(即席查询)
Presto、Kylin、Impala、Druid
数据可视化
ECharts、Superset、QuickBI、DataV
任务调度
Azkaban、Oozie、DolphinScheduler
集群监控
Zabbix、Ganglia、Prometheus
元数据管理
Atlas
系统数据流程设计
组件版本选型
框架选型尽量不要选择最新版本,一般选择比最新版本早半年左右、经过验证的稳定版
服务器选型
云 or 物理机
集群资源规划设计
技术架构
数据产生
用户行为日志数据需要在业务系统的前端(或后端)中做埋点
业务数据已经在业务系统的数据库中
历史数据
其他第三方数据
数据采集
行为域
日志前端埋点,生成日志数据
日志服务器存储为日志文件
Flume采集日志文件
日志预处理后写入HDFS
落Hive数仓ODS层
业务域
业务系统增删改数据库,形成业务数据
Sqoop/DataX/Kettle数据抽取
Kettle是一些传统企业比较熟悉的ETL(extract-transfer-load)工具
落地Hive数仓ODS层
增量合并处理
数据存储
底层:HDFS
产出数据:Kylin、mysql
用户明细数据:HBase 给上层接口
报表:Mysql、HBase
多维分析:Kylin
数据处理
核心技术选型:Hive(数据仓库基础设施)
计算引擎:MapReduce+ SparkSQL
辅助工具
Azkaban/Oozie任务调度系统
Atlas元数据和血缘追溯管理(数据治理)
行为域数据流程
业务域数据流程
常见数仓分层模型
ODS层(原始数据层):存放原始数据,直接加载原始数据(日志数据、业务数据),保持数据原样不做处理,主要目的也是为了做数据备份
DWD层(数据明细层)
日志数据:非结构化/半结构化数据 -> 结构化数据(统一格式、解析逻辑、拆分成几张表)
业务数据:结构化数据,处理比较简单
处理办法:数据清洗(去空值、去脏数据、超过极限范围的数据)、脱敏
目的:为了保留明细数据,一般一行信息代表一次业务行为
DWS层(数据服务层):以DWD层数据为基础,做轻度数据聚合操作,一行信息代表一个主题对象
如:一天用户下单次数
ADS层(数据应用层):为各种统计报表提供数据支持
分层原因
拆解复杂任务:将复杂任务进行分解,每一层只处理简单的任务,方便定位问题
减少重复开发:规范数据分层,通过中间数据,能够减少极大的重复开发,增加一次计算的结果重复使用,提高复用性
隔离原始数据:不论是数据的异常还是数据敏感性,使真实的数据和统计数据进行解耦
数据集市
定义:数据集市一般从属于数据仓库,是一种微型的数据仓库,具有更少的数据量、更少的主题区域、更少的历史数据,一般只面向某个部门或局部范围的管理人员提供服务;而数据仓库则面向企业级,能够为整个企业各个部门提供数据支持
独立数据集市:不从属于数仓,一般公司为了快速运行,会使用独立数据集市
数仓命名规范
表命名:分层名称_表名 行为表一般加 _log
如:ods_login_log
dwd_login_log
dim_user
脚本命名:数据源_2(to)_目标_db.sh 或 数据源_2(to)_目标_log.sh
如:kafka_2_hdfs_log/db.sh
表字段类型(业务相关看需求)
数仓理论
范式理论
范式:设计一张表的标准级别,规范和需求
优点:降低数据的冗余
理由:1.磁盘贵,为了减少磁盘存储
2.没有分布式系统,单机增加磁盘个数有限
3.数据一致性问题,一次修改,需要改动多张表
缺点:表过多,多join 查询效率慢
函数依赖
完全函数依赖
部分函数依赖
传递函数依赖
三范式
第一范式:属性不可分割
第二范式:消除部分函数依赖
第三范式:消除传递函数依赖
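以一个简化的订单明细表为例(表名、字段均为示意假设):product_name 只依赖于 product_id(部分依赖),customer_city 经由 customer_id 传递依赖于主键,按三范式可拆分如下:
-- 拆分前:order_detail(order_id, product_id, quantity, product_name, customer_id, customer_city)
-- 拆分后:每张表的非主键属性只直接依赖于本表主键,消除部分依赖与传递依赖
create table orders    (order_id string, customer_id string, product_id string, quantity int);
create table products  (product_id string, product_name string);
create table customers (customer_id string, customer_city string);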
OLAP与OLTP
OLTP:更多的是应用于关系性数据库,面向应用,用于基本的日常事务处理,例如:银行交易
OLAP:数据仓库系统的主要应用,支持复杂的分析操作,侧重决策支持,并提供直观易懂的查询结果
数仓建模理论
关系建模:严格遵守三范式,松散、零碎、物理表较多,数据冗余程度低,这些数据可以被灵活的使用,功能性强,主要应用于OLTP系统,为了保证数据的一致性及避免冗余,所以走三范式设计原则
维度建模:主要应用于OLAP系统,更适合做数据分析,通常以某一个事实表为中心进行表的组织,主要面向于业务,可能有数据冗余,但是方便分析
维度表:一般是对事实的描述信息,每一张维度表对应现实中的一个对象或者概念
特征
1.范围宽 具有较多的属性,列比较多
2.行数少,与事实比较
3.较固定,与事实比较
事实表:以最低粒度来看,每行数据代表一个业务事件(下单、退款、支付、评价等等)
特征
1.数据量非常大
2.列数比较少,主要有外键id和度量值 属于窄表
3.变化比较多,每天会新增很多数据
注意事项
1.可以存在空值度量
2.事实表中外键不能存在空值
分类
1.事务事实表:保留原子属性
2.周期性快照事实表:不保留所有数据,只保留固定时间间隔,增量更新,新增后不做修改
3.累积性快照事实表:用于跟踪业务实时的变化,新增及变化
维度模型分类:星型模型、雪花模型、星座模型
数据仓库建模(重要)
ODS层
行为数据,业务数据
保持数据原貌不做任何修改,起到备份数据的作用,数据采用压缩,减少磁盘的存储空间
压缩:snappy 、Lzo 列式存储:ORC、Parquet
创建分区表,防止全表扫描
DWD层(业务驱动)
Hive建表(json解析包),只能解析内容统一的数据,dwd层一般采用星型模型构建维度
维度建模步骤
选择业务过程:选择需要分析的业务线,一个业务对应一张事实表
声明粒度:精确定义一行数据表示什么,尽可能选择最小粒度,为了应对将来的各种需求场景
确定维度:用来描述事实,人、时间、地点、确定维度外键
确定事实:确定业务中的度量值
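以交易业务线为例的一个建模示意(表名、字段均为假设):业务过程选“下单”,粒度为一行代表一个订单明细,维度为用户/商品/日期,事实为购买数量与金额:
-- 事实表:一行 = 一个订单明细(最小粒度),外键指向各维度表,度量值为数量与金额
create table dwd_fact_order_detail(
    order_id   string,         -- 订单号(退化维度)
    user_id    string,         -- 外键 -> dim_user
    product_id string,         -- 外键 -> dim_product
    date_id    string,         -- 外键 -> dim_date
    quantity   bigint,         -- 度量值:购买数量
    amount     decimal(16,2)   -- 度量值:购买金额
);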
DWS层(需求驱动)
数据服务层,为ADS层的分析,指标做服务
数据源:DWD层
有重复计算问题,构建DWS层聚合数据表,避免重复计算
dws层汇总方法
1.先找维度
2.找到相关的业务(事实表DWD)
3.确定需求
4.确定粒度,在汇总的情况下保持细粒度(比如:一天的购买量,减少数据扫描量)
特点:dws层表是大宽表,方便计算直接在宽表中拿数据
ADS层(需求驱动)
数据应用层
数据源:DWS层优先,若DWS层无可用数据可以去DWD层找
对系统各大主题指标分别进行分析
flume收集数据落地ODS层HDFS
Flume执行配置命令
bin/flume-ng agent -n a2 -c conf/ -f myconf/avro_2_hdfs.conf -Dflume.root.logger=INFO,console
第一版
两个上游,一个下游
上游flume配置如下
# 命名此代理上的组件 a1.sources = r1 a1.sinks = k1 a1.channels = c1 # 描述/配置源 a1.sources.r1.channels = c1 a1.sources.r1.type = TAILDIR a1.sources.r1.filegroups = g1 a1.sources.r1.filegroups.g1 = /var/logdata/xxx_online_data.* a1.sources.r1.batchSize = 5000 # 使用在内存中缓冲事件的通道 a1.channels.c1.type = file a1.channels.c1.checkpointDir = /opt/data/flume-data/checkpoint/ a1.channels.c1.dataDir = /opt/data/flume-data/data/ a1.channels.c1.transactionCapacity = 10000 # 描述接收器 a1.sinks.k1.channel = c1 a1.sinks.k1.type = AVRO a1.sinks.k1.hostname = node03 a1.sinks.k1.port = 4646
下游flume配置
# 命名此代理上的组件 a2.sources = r1 a2.sinks = k1 a2.channels = c1 #描述/配置源 a2.sources.r1.channels = c1 a2.sources.r1.type = avro a2.sources.r1.bind = node03 a2.sources.r1.port = 4646 #使用在内存中缓冲事件的通道 a2.channels.c1.type = file a2.channels.c1.checkpointDir = /opt/data/flume-data/checkpoint/ a2.channels.c1.dataDir = /opt/data/flume-data/data/ a2.channels.c1.transactionCapacity = 10000 #描述接收器 a2.sinks.k1.channel = c1 a2.sinks.k1.type = HDFS a2.sinks.k1.hdfs.path = hdfs://hdfs-xxx/logdata/applog/%Y-%m-%d/ a2.sinks.k1.hdfs.filePrefix = xxx- a2.sinks.k1.hdfs.fileSuffix = .log a2.sinks.k1.hdfs.rollInterval = 300 a2.sinks.k1.hdfs.rollSize = 104857600 a2.sinks.k1.hdfs.rollCount = 50000 a2.sinks.k1.hdfs.fileType = CompressedStream a2.sinks.k1.hdfs.codeC = snappy a2.sinks.k1.hdfs.useLocalTimeStamp = true
第二版
需要配置拦截器,从日志中提取事件时间戳并写入event header
上游Flume sink 配置2个 分别对应下游两台Flume ,下游Flume实现高可用
上游Flume配置
# 命名此代理上的组件 a1.sources = r1 a1.sinks = k1 k2 a1.channels = c1 # 描述/配置源 a1.sources.r1.channels = c1 a1.sources.r1.type = TAILDIR a1.sources.r1.filegroups = g1 a1.sources.r1.filegroups.g1 = /var/logdata/xxx_online_data.* a1.sources.r1.batchSize = 5000 a1.sources.r1.interceptors = i1 a1.sources.r1.interceptors.i1.type = cn.yjx.flume.interceptor.EventTimeStampInterceptor$EventTimeStampInterceptorBuilder a1.sources.r1.interceptors.i1.timestamp_fieldname = timeStamp a1.sources.r1.interceptors.i1.event_header = timestamp # 使用在内存中缓冲事件的通道 a1.channels.c1.type = file a1.channels.c1.checkpointDir = /opt/data/flume-data/checkpoint/ a1.channels.c1.dataDir = /opt/data/flume-data/data/ a1.channels.c1.transactionCapacity = 10000 # 描述接收器 a1.sinks.k1.channel = c1 a1.sinks.k1.type = AVRO a1.sinks.k1.hostname = node02 a1.sinks.k1.port = 4646 a1.sinks.k2.channel = c1 a1.sinks.k2.type = AVRO a1.sinks.k2.hostname = node03 a1.sinks.k2.port = 4646 a1.sinkgroups = p1 a1.sinkgroups.p1.sinks = k1 k2 a1.sinkgroups.p1.processor.type = failover a1.sinkgroups.p1.processor.priority.k1 = 15 a1.sinkgroups.p1.processor.priority.k2 = 10 a1.sinkgroups.p1.processor.maxpenalty = 10000
下游Flume配置
下游有两台flume,flume在哪台节点上需要修改当前节点的a2.sources.r1.bind = node02 若是node03节点,需要修改a2.sources.r1.bind = node03
# 命名此代理上的组件 a2.sources = r1 a2.sinks = k1 a2.channels = c1 #描述/配置源 a2.sources.r1.channels = c1 a2.sources.r1.type = AVRO a2.sources.r1.bind = node02 a2.sources.r1.port = 4646 #使用在内存中缓冲事件的通道 a2.channels.c1.type = file a2.channels.c1.checkpointDir = /opt/data/flume-data/checkpoint/ a2.channels.c1.dataDir = /opt/data/flume-data/data/ a2.channels.c1.transactionCapacity = 10000 #描述接收器 a2.sinks.k1.channel = c1 a2.sinks.k1.type = HDFS a2.sinks.k1.hdfs.path = hdfs://hdfs-xxx/logdata/applog/%Y-%m-%d/ a2.sinks.k1.hdfs.filePrefix = xxx- a2.sinks.k1.hdfs.fileSuffix = .log.snappy a2.sinks.k1.hdfs.rollInterval = 300 a2.sinks.k1.hdfs.rollSize = 104857600 a2.sinks.k1.hdfs.rollCount = 50000 a2.sinks.k1.hdfs.fileType = CompressedStream a2.sinks.k1.hdfs.codeC = snappy a2.sinks.k1.hdfs.useLocalTimeStamp = false
拦截器代码
将拦截器代码打包后,放入Flume安装目录的lib目录下。
package com.xxx.flume.interceptor;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.List;
import java.util.Map;

public class EventTimeStampInterceptor implements Interceptor {

    String timestampFieldName;
    String headerName;

    public EventTimeStampInterceptor(String timestampFieldName, String headerName) {
        this.timestampFieldName = timestampFieldName;
        this.headerName = headerName;
    }

    @Override
    public void initialize() {
    }

    @Override
    public Event intercept(Event event) {
        // 从日志数据提取事件时间戳
        // {"account":"abc","phone":"13883165412","timeStamp":1675265247815,"eventId":"addCart","properties":{"pid":"p001"}}
        byte[] eventBody = event.getBody();
        String json = new String(eventBody);
        JSONObject jsonObject = JSON.parseObject(json);
        Long eventTimeStamp = jsonObject.getLong(this.timestampFieldName);

        // 放入event的header
        Map<String, String> headers = event.getHeaders();
        headers.put(this.headerName, eventTimeStamp + "");
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> list) {
        for (Event event : list) {
            intercept(event);
        }
        return list;
    }

    @Override
    public void close() {
    }

    public static class EventTimeStampInterceptorBuilder implements Interceptor.Builder {
        String timestampFieldName;
        String headerName;

        @Override
        public Interceptor build() {
            return new EventTimeStampInterceptor(timestampFieldName, headerName);
        }

        @Override
        public void configure(Context context) {
            // 与flume配置项对应:i1.timestamp_fieldname / i1.event_header
            timestampFieldName = context.getString("timestamp_fieldname");
            headerName = context.getString("event_header");
        }
    }
}
第三版
脚本
mock数据
#!/bin/bash
while true
do
  echo '{"account":"abc","phone":"13333445555","timeStamp":1234567891234,"eventId":"addCart","properties":{"pid":"p001"}}' >> /var/logdata/app/xxx_online_data.log
  echo '{"account":"bdc","phone":"18758437345","timeStamp":1234534242343,"eventId":"wxapp","properties":{"pid":"p001"}}' >> /var/logdata/wxapp/xxx_online_data.log
  echo '{"account":"def","phone":"45345345555","timeStamp":1234567343433,"eventId":"web","properties":{"pid":"p001"}}' >> /var/logdata/web/xxx_online_data.log
  sleep 0.1
done
一个上游,上游Flume收集多个数据来源
上游flume配置
# 命名此代理上的组件 a1.sources = r1 a1.sinks = k1 k2 a1.channels = c1 # 描述/配置源 a1.sources.r1.channels = c1 a1.sources.r1.type = TAILDIR a1.sources.r1.filegroups = g1 g2 g3 a1.sources.r1.filegroups.g1 = /var/logdata/app/xxx_online_data.* a1.sources.r1.filegroups.g2 = /var/logdata/wxapp/xxx_online_data.* a1.sources.r1.filegroups.g3 = /var/logdata/web/xxx_online_data.* a1.sources.r1.headers.g1.logtype = app a1.sources.r1.headers.g2.logtype = wxapp a1.sources.r1.headers.g3.logtype = web a1.sources.r1.batchSize = 5000 a1.sources.r1.interceptors = i1 a1.sources.r1.interceptors.i1.type = cn.yjx.flume.interceptor.EventTimeStampInterceptor$EventTimeStampInterceptorBuilder a1.sources.r1.interceptors.i1.timestamp_fieldname = timeStamp a1.sources.r1.interceptors.i1.event_header = timestamp # 使用在内存中缓冲事件的通道 a1.channels.c1.type = file a1.channels.c1.checkpointDir = /opt/data/flume-data/checkpoint/ a1.channels.c1.dataDir = /opt/data/flume-data/data/ a1.channels.c1.transactionCapacity = 10000 # 描述接收器 a1.sinks.k1.channel = c1 a1.sinks.k1.type = AVRO a1.sinks.k1.hostname = node02 a1.sinks.k1.port = 4646 a1.sinks.k2.channel = c1 a1.sinks.k2.type = AVRO a1.sinks.k2.hostname = node03 a1.sinks.k2.port = 4646 a1.sinkgroups = p1 a1.sinkgroups.p1.sinks = k1 k2 a1.sinkgroups.p1.processor.type = failover a1.sinkgroups.p1.processor.priority.k1 = 15 a1.sinkgroups.p1.processor.priority.k2 = 10 a1.sinkgroups.p1.processor.maxpenalty = 10000
下游flume配置
下游两台flume高可用
# 命名此代理上的组件 a2.sources = r1 a2.sinks = k1 a2.channels = c1 #描述/配置源 a2.sources.r1.channels = c1 a2.sources.r1.type = avro a2.sources.r1.bind = node03 a2.sources.r1.port = 4646 #使用在内存中缓冲事件的通道 a2.channels.c1.type = file a2.channels.c1.checkpointDir = /opt/data/flume-data/checkpoint/ a2.channels.c1.dataDir = /opt/data/flume-data/data/ a2.channels.c1.transactionCapacity = 10000 #描述接收器 a2.sinks.k1.channel = c1 a2.sinks.k1.type = HDFS a2.sinks.k1.hdfs.path = hdfs://hdfs-xxx/logdata/%{logtype}/%Y-%m-%d/ a2.sinks.k1.hdfs.filePrefix = xxx- a2.sinks.k1.hdfs.fileSuffix = .log.snappy a2.sinks.k1.hdfs.rollInterval = 300 a2.sinks.k1.hdfs.rollSize = 104857600 a2.sinks.k1.hdfs.rollCount = 50000 a2.sinks.k1.hdfs.fileType = CompressedStream a2.sinks.k1.hdfs.codeC = snappy a2.sinks.k1.hdfs.useLocalTimeStamp = false
数据入库
数据在哪?什么时候入库?怎么入库?
flume取日志到HDFS
上游
# 命名此代理上的组件 a1.sources = r1 a1.sinks = k1 k2 a1.channels = c1 # 描述/配置源 a1.sources.r1.channels = c1 a1.sources.r1.type = TAILDIR a1.sources.r1.filegroups = g1 a1.sources.r1.filegroups.g1 = /opt/data/logdata/app/event_log_.* a1.sources.r1.batchSize = 5000 a1.sources.r1.interceptors = i1 a1.sources.r1.interceptors.i1.type = cn.xxx.flume.interceptor.EventTimeStampInterceptor$EventTimeStampInterceptorBuilder a1.sources.r1.interceptors.i1.timestamp_fieldname = timeStamp a1.sources.r1.interceptors.i1.event_header = timestamp # 使用在内存中缓冲事件的通道 a1.channels.c1.type = file a1.channels.c1.checkpointDir = /opt/data/flume-data/checkpoint/ a1.channels.c1.dataDir = /opt/data/flume-data/data/ a1.channels.c1.transactionCapacity = 10000 # 描述接收器 a1.sinks.k1.channel = c1 a1.sinks.k1.type = AVRO a1.sinks.k1.hostname = node02 a1.sinks.k1.port = 4646 a1.sinks.k2.channel = c1 a1.sinks.k2.type = AVRO a1.sinks.k2.hostname = node03 a1.sinks.k2.port = 4646 a1.sinkgroups = p1 a1.sinkgroups.p1.sinks = k1 k2 a1.sinkgroups.p1.processor.type = failover a1.sinkgroups.p1.processor.priority.k1 = 15 a1.sinkgroups.p1.processor.priority.k2 = 10 a1.sinkgroups.p1.processor.maxpenalty = 10000
下游
# 命名此代理上的组件 a2.sources = r1 a2.sinks = k1 a2.channels = c1 #描述/配置源 a2.sources.r1.channels = c1 a2.sources.r1.type = AVRO a2.sources.r1.bind = node02 a2.sources.r1.port = 4646 #使用在内存中缓冲事件的通道 a2.channels.c1.type = file a2.channels.c1.checkpointDir = /opt/data/flume-data/checkpoint/ a2.channels.c1.dataDir = /opt/data/flume-data/data/ a2.channels.c1.transactionCapacity = 10000 #描述接收器 a2.sinks.k1.channel = c1 a2.sinks.k1.type = HDFS a2.sinks.k1.hdfs.path = hdfs://hdfs-xxx/app/ods/ods_app_event_log/dt=%Y-%m-%d/ a2.sinks.k1.hdfs.filePrefix = xxx- a2.sinks.k1.hdfs.fileSuffix = .log.snappy a2.sinks.k1.hdfs.rollInterval = 300 a2.sinks.k1.hdfs.rollSize = 104857600 a2.sinks.k1.hdfs.rollCount = 50000 a2.sinks.k1.hdfs.fileType = CompressedStream a2.sinks.k1.hdfs.codeC = snappy a2.sinks.k1.hdfs.useLocalTimeStamp = false
数据通过flume采集到HDFS落地之后交给hive接管
HIVE接管
由于原始数据为json格式,需要使用hive内置的JsonSerDe
官方文档
官方文档: LanguageManual DDL - Apache Hive - Apache Software Foundation DDL-JSON
也可使用第三方JsonSerde工具包
JsonSerde 的 github 地址: GitHub - rcongiu/Hive-JSON-Serde: Read - Write JSON SerDe for Apache Hive.
JsonSerde 的 jar下载地址: Index of /hive-json-serde
下载 json-serde-1.3.7-jar-with-dependencies.jar 并上传到 Hive 安装目录的 lib 目录下
org.apache.hive.hcatalog.data.JsonSerDe
修改ods_app_event_log的表信息
alter table ods.ods_app_event_log set serde 'org.apache.hive.hcatalog.data.JsonSerDe';
查看表的创建详细信息
show create table ods.ods_app_event_log;
如果需要,也可以把本jar包安装到本地maven库
bin\mvn install:install-file -Dfile=d:/json-serde.1.3.8.jar -DgroupId="org.openx.data" -DartifactId=json-serde -Dversion="1.3.8" -Dpackaging=jar
hive建表语句
create external table if not exists ods.ods_app_event_log (
    account        string,
    appId          string,
    appVersion     string,
    carrier        string,
    deviceId       string,
    deviceType     string,
    eventId        string,
    ip             string,
    latitude       double,
    longitude      double,
    netType        string,
    osName         string,
    osVersion      string,
    properties     map<string,string>,  -- 事件属性
    releaseChannel string,              -- 下载渠道
    resolution     string,              -- 分辨率
    sessionId      string,              -- 会话id
    `timeStamp`    bigint               -- 事件时间
)
partitioned by (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as textfile
location 'hdfs://hdfs-xxx/app/ods/ods_app_event_log';
建表之后因为是按照日期分区,需要更新元数据信息,需要使用hive分区修复
msck repair table ods_app_event_log
查看表的详细信息
desc formatted ods_app_event_log
每天进行入仓,如 2022-05-31 00:00 之后加载 2022-05-30 全天的数据到hive
hive load -> hive> load data inpath '/app/ods/ods_app_event_log/dt=2022-05-30' into table ods.ods_app_event_log partition(dt='2022-05-30')
需要每天手动执行,所以需要放入定时调度 使用shell脚本
入仓脚本 定时执行
入仓脚本
#!/bin/bash
# @date:2022-06-29
# @author:cyx
# @desc:加载app日志数据到数据仓库ODS层
# @department:xxx
export HIVE_HOME=/opt/xxx/apache-hive-3.1.2-bin

pre_day=$(date -d'-1 day' +'%Y-%m-%d')

sql="load data inpath '/app/ods/ods_app_event_log/dt=${pre_day}' into table ods.ods_app_event_log partition(dt='${pre_day}')"

${HIVE_HOME}/bin/hive -e "${sql}"

# 判断任务执行是否成功
if [ $? -eq 0 ];then
  echo "日志数据日期: $pre_day ; 数据源目录:/app/ods/ods_app_event_log/dt=${pre_day} ; 目标表:ods.ods_app_event_log" | mail -s "app日志入库数据加载任务成功" [email protected]
  exit 0
else
  echo "日志数据日期: $pre_day ; 数据源目录:/app/ods/ods_app_event_log/dt=${pre_day} ; 目标表:ods.ods_app_event_log" | mail -s "app日志入库数据加载任务失败" [email protected]
  exit 1
fi
业务域数据采集工具DataX
hive建表语句
create external table ods.ods_oms_order( id string, member_id string, coupon_id string, order_sn string, create_time string, member_username string, total_amount string, pay_amount string, freight_amount string, promotion_amount string, integration_amount string, coupon_amount string, discount_amount string, pay_type string, source_type string, status string, order_type string, delivery_company string, delivery_sn string, auto_confirm_day string, integration string, growth string, promotion_info string, bill_type string, bill_header string, bill_content string, bill_receiver_phone string, bill_receiver_email string, receiver_name string, receiver_phone string, receiver_post_code string, receiver_province string, receiver_city string, receiver_region string, receiver_detail_address string, note string, confirm_status string, delete_status string, use_integration string, payment_time string, delivery_time string, receive_time string, comment_time string, modify_time string ) partitioned by (dt string) stored as orc location 'hdfs://hdfs-xxx/app/ods/ods_oms_order/' tblproperties( "orc.compress"="snappy" );
创建日期分区
hadoop dfs -mkdir -p /app/ods/ods_oms_order/dt=2022-06-28
datax导入json
{ "job":{ "setting":{ "speed":{ "channel":1 } }, "content":[ { "reader":{ "name":"mysqlreader", "parameter":{ "username":"root", "password":"123456", "connection":[ { "querySql":[ "select id ,member_id ,coupon_id ,order_sn,create_time ,member_username ,total_amount ,pay_amount ,freight_amount,promotion_amount ,integration_amount,coupon_amount ,discount_amount ,pay_type,source_type ,status,order_type ,delivery_company ,delivery_sn ,auto_confirm_day,integration ,growth,promotion_info,bill_type ,bill_header,bill_content,bill_receiver_phone,bill_receiver_email ,receiver_name ,receiver_phone,receiver_post_code ,receiver_province ,receiver_city,receiver_region,receiver_detail_address ,note ,confirm_status,delete_status,use_integration ,payment_time ,delivery_time ,receive_time,comment_time ,modify_time from oms_order where modify_time>='2022-06-28 00:00:00' and modify_time<'2022-06-29 00:00:00'" ], "jdbcUrl":[ "jdbc:mysql://node01:3306/dw" ] } ] } }, "writer":{ "name":"hdfswriter", "parameter":{"defaultFS":"hdfs://node01:8020", "fileType":"orc", "path":"/app/ods/ods_oms_order/dt=2022-06-28", "fileName":"oms_order", "column":[ { "name":"id", "type":"STRING" }, { "name":"member_id", "type":"STRING" }, { "name":"coupon_id", "type":"STRING" }, { "name":"order_sn", "type":"STRING" }, { "name":"create_time", "type":"STRING" }, { "name":"member_username", "type":"STRING" }, { "name":"total_amount", "type":"STRING" }, { "name":"pay_amount", "type":"STRING" }, { "name":"freight_amount", "type":"STRING" }, { "name":"promotion_amount", "type":"STRING" }, { "name":"integration_amount", "type":"STRING" }, { "name":"coupon_amount", "type":"STRING" }, { "name":"discount_amount", "type":"STRING" }, {"name":"pay_type", "type":"STRING" }, { "name":"source_type", "type":"STRING" }, { "name":"status", "type":"STRING" }, { "name":"order_type", "type":"STRING" }, { "name":"delivery_company", "type":"STRING" }, { "name":"delivery_sn", "type":"STRING" }, { "name":"auto_confirm_day", "type":"STRING" }, { "name":"integration", "type":"STRING" }, { "name":"growth", "type":"STRING" }, { "name":"promotion_info", "type":"STRING" }, { "name":"bill_type", "type":"STRING" }, { "name":"bill_header", "type":"STRING" }, { "name":"bill_content", "type":"STRING" }, { "name":"bill_receiver_phone", "type":"STRING" }, { "name":"bill_receiver_email", "type":"STRING"}, { "name":"receiver_name", "type":"STRING" }, { "name":"receiver_phone", "type":"STRING" }, { "name":"receiver_post_code", "type":"STRING" }, { "name":"receiver_province", "type":"STRING" }, { "name":"receiver_city", "type":"STRING" }, { "name":"receiver_region", "type":"STRING" }, { "name":"receiver_detail_address", "type":"STRING" }, { "name":"note", "type":"STRING" }, { "name":"confirm_status", "type":"STRING" }, { "name":"delete_status", "type":"STRING" }, { "name":"use_integration", "type":"STRING" }, { "name":"payment_time", "type":"STRING" }, { "name":"delivery_time", "type":"STRING" }, { "name":"receive_time", "type":"STRING" }, {"name":"comment_time", "type":"STRING" }, { "name":"modify_time", "type":"STRING" } ], "writeMode":"append", "fieldDelimiter":",", "compress":"snappy" } } } ] } }
修改分区
msck repair table ods.ods_oms_order
执行入仓脚本
#!/bin/bash
export DATAX_HOME=/opt/xxx/datax

CUR=`date -d'-1 day' +%Y-%m-%d`
NEXT=`date +%Y-%m-%d`

# 以模板生成当日的json配置文件(假设模板名为 oms_order.json)
cp oms_order.json oms_order_${CUR}.json

# 修改json配置文件中的日期
sed -i "s/_CUR_/${CUR}/g" oms_order_${CUR}.json
sed -i "s/_NEXT_/${NEXT}/g" oms_order_${CUR}.json

${HADOOP_HOME}/bin/hdfs dfs -mkdir -p /xxx/app/ods/ods_oms_order/dt=${NEXT}/

python ${DATAX_HOME}/bin/datax.py oms_order_${CUR}.json

# 判断任务是否执行成功,发送邮件
if [ $? -ne 0 ];then
  echo "数据抽取任务失败,oms_order增量数据抽取任务,抽取的数据日期:${CUR}" | mail -s "datax任务失败" [email protected]
else
  echo "数据抽取任务成功,oms_order增量数据抽取任务,抽取的数据日期:${CUR}" | mail -s "datax任务成功" [email protected]
fi
需求分析
ODS到DWD需要进行数据的清洗过滤,去除废弃字段、脏数据、缺少关键字段的记录等
数据解析:将json打平,解析成扁平格式
Session分割
数据的规范处理
数据集成
访客标记
方案设计
GPS地理位置解析
经纬度X公里内坐标点筛选与geohash_bohu83的博客-CSDN博客_经纬度x
GEOHASH编码介绍
Geohash编码是一种地理位置编码技术,它可将一个gps坐标(含经、纬度)点,转化为一个字符串
如:gps坐标数据形如: (130.89892350983459, 38.239879283598) 地理位置: 河北省,石家庄市,裕华区
在地球经纬度范围内,不断通过二分来划分矩形范围,通过观察gps坐标点所落的范围,来反复生成0/1二进制码
根据经纬度计算GeoHash二进制编码
地球纬度区间是[-90,90], 北海公园的纬度是39.928167,可以通过下面算法对纬度39.928167进行逼近编码:
1.区间[-90,90]进行二分为[-90,0),[0,90],称为左右区间,可以确定39.928167属于右区间[0,90],给标记为1
2.接着将区间[0,90]进行二分为 [0,45),[45,90],可以确定39.928167属于左区间 [0,45),给标记为0
3.递归上述过程39.928167总是属于某个区间[a,b]。随着每次迭代区间[a,b]总在缩小,并越来越逼近39.928167
4.如果给定的纬度x(39.928167)属于左区间,则记录0,如果属于右区间则记录1,这样随着算法的进行会产生一个序列1011100,序列的长度跟给定的区间划分次数有关
组合码
通过上述计算,纬度产生的编码为10111 00011,经度产生的编码为11010 01011。偶数位放经度,奇数位放纬度,把2串编码组合生成新串:11100 11101 00100 01111,最后使用用0-9、b-z(去掉a, i, l, o)这32个字母进行base32编码,首先将11100 11101 00100 01111转成十进制,对应着28、29、4、15,十进制对应的编码就是wx4g。同理,将编码转换成经纬度的解码算法与之相反
当geohash base32编码长度为8时,精度在19米左右,而当编码长度为9时,精度在2米左右,编码长度需要根据数据情况进行选择
GEOHASH编码工具包依赖
<dependency>
    <groupId>ch.hsr</groupId>
    <artifactId>geohash</artifactId>
    <version>1.3.0</version>
</dependency>
api调用示例:
String geohashcode = GeoHash.withCharacterPrecision(45.667, 160.876547, 6).toBase32();
IP地址地理位置解析(关联设备ID和登录ID)
ip地址数据形如:202.102.36.87 -> 地理位置: 江苏省,南京市,电信
IP查找算法:将字典中的起始ip和结束ip都设法转成整数,这样ip地址段就变成了整数段;接下来把待查的ip按照相同方法转换成整数,就能相对方便地在字典数据中查找;具体的搜索算法可以使用二分查找
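按上述思路的一个简化示意(Scala 代码,字典结构与示例数据均为假设,仅演示 ip 转整数与二分查找):
object IpLookupDemo {
  // 字典条目:起始IP、结束IP(均已转成整数)及对应地区,仅作示意
  case class IpRange(start: Long, end: Long, area: String)

  // 将点分十进制IP转成整数,例如 202.102.36.87
  def ip2Long(ip: String): Long =
    ip.split("\\.").foldLeft(0L)((acc, seg) => acc * 256 + seg.toLong)

  // 在按 start 升序排列、互不重叠的字典上做二分查找
  def search(dict: Array[IpRange], ip: String): Option[String] = {
    val target = ip2Long(ip)
    var lo = 0
    var hi = dict.length - 1
    while (lo <= hi) {
      val mid = (lo + hi) / 2
      if (target < dict(mid).start) hi = mid - 1
      else if (target > dict(mid).end) lo = mid + 1
      else return Some(dict(mid).area)
    }
    None
  }

  def main(args: Array[String]): Unit = {
    val dict = Array(IpRange(ip2Long("202.102.0.0"), ip2Long("202.102.255.255"), "江苏省,南京市,电信"))
    println(search(dict, "202.102.36.87")) // Some(江苏省,南京市,电信)
  }
}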
IP地理位置处理工具包
开源工具包ip2region(含ip数据库)
项目地址: ip2region: Ip2region (2.0 - xdb) 是一个离线 IP 数据管理框架和定位库,支持亿级别的数据段,10微秒级别的查询性能,提供了许多主流编程语言的 xdb 数据管理引擎的实现。
使用方法
1.引入 jar 包依赖
<dependency>
    <groupId>org.lionsoul</groupId>
    <artifactId>ip2region</artifactId>
    <version>1.7.2</version>
</dependency>
2.Api调用代码
package com.xxx.dataetl.utils

import org.lionsoul.ip2region.{DbConfig, DbSearcher}

object Ip2RegionCode {
  def main(args: Array[String]): Unit = {
    // 初始化配置参数
    val config = new DbConfig
    // 构造搜索器,dbFile是ip地址库字典文件所在路径
    val searcher = new DbSearcher(config, "data/ip2region.db")
    // 使用搜索器,调用查找算法获取地理位置信息
    val block = searcher.memorySearch("202.102.36.87")
    println(block)
  }
}
ID_MAPPING
在登录状态下,日志中会采集到用户的登录id(account),可以做到用户身份的精确标识;而在匿名状态下,日志中没有采集到用户的登录id
如何准确标识匿名状态下的用户?
1.方案一:只使用deviceid来作为判断用户的唯一标识
一部设备上可能出现A账号和B账号 那么就会被认为是一个人
同一个账号,在不同设备上登录使用,那么这些行为数据就会被认为是多个用户
2.方案二:只使用account来作为判断用户的唯一标识
一部设备上A和B登录过,会被正确识别
同一个账号,在不同设备上登录,行为数据也会被正确识别为一个人
匿名行为日志如何处理?
动态绑定
看设备上哪个账号出现的次数最多,就把该账号绑定给设备(设备被动态绑定给一个账号)
一个设备ID被绑定到某个登录ID(A)之后,如果该设备在后续一段时间(比如一个月内)被一个新的登录ID(B)更频繁使用,则该设备ID会被调整至绑定登录ID(B)
设计图
开发代码
行为域地理信息维表(DIM)
从mysql读取行政区划数据,组合字段生成地理位置维度表并导入hive
创建hive中维度表建表语句
create database dim;
create external table if not exists dim.dim_area_dict (
    geohash  string,
    province string,
    city     string,
    region   string
)
stored as parquet
location 'hdfs://hdfs-xxx/app/dim/dim_area_dict'
tblproperties ('parquet.compression'='snappy');
地理位置信息spark转换
spark代码实操
package com.xxx.dataetl.utils import ch.hsr.geohash.GeoHash import org.apache.spark.sql.SparkSession import java.util.Properties object AreaDicUtil { def main(args: Array[String]): Unit = { val spark = SparkSession.builder() .appName("地理位置字典加工") .master("local") .enableHiveSupport() //开启hive支持 .config("spark.sql.shuffle.partition", "1") .getOrCreate() //链接数据库 val properties = new Properties() properties.put("user", "root") properties.put("password", "123456") val frame = spark.read .jdbc("jdbc:mysql://node01:3306/dw?useUnicode=true&characterEncoding=utf-8", "t_md_areas", properties) frame.createTempView("area") //spark geo自定义函数 var gps2geo = (lat:Double,lng:Double) => { GeoHash.geoHashStringWithCharacterPrecision(lat,lng,6) } //注册自定义函数 spark.udf.register("geo",gps2geo) spark.sql( """ |insert into dim.dim_area_dict |select | geo(l4.bd09_lat,l4.bd09_lng), | l1.areaname, | l2.areaname, | l3.areaname |from | area l4 join area l3 on l4.level=4 and l4.parentid=l3.id | join area l2 on l3.parentid=l2.id | join area l1 on l2.parentid=l1.id | |""".stripMargin ) spark.close() } }
DWS层设备-账号绑定评分表
hive建表语句
create external table if not exists dws.dws_device_account_bind_score( deviceid string ,account string ,score double ,last_login string ) partitioned by (dt string) stored as orcfile location 'hdfs://hdfs-xxx/app/dws/dws_device_account_bind_score' tblproperties("orc.compress"="snappy")
将当日绑定评分表插入到DWS层dws_device_account_bind_score中
spark代码实操:拉取数据到DWS层
package com.xxx.dataetl.etl import org.apache.spark.sql.SparkSession object DeviceAccountBind { def main(args: Array[String]): Unit = { val spark = SparkSession.builder() .appName("设备账号绑定计算") .master("local") .config("spark.sql.shuffle.partition", "1") .enableHiveSupport() .getOrCreate() //1.读取当天日志信息表 val curDayLog = spark.read.table("ods.ods_app_event_log") .where("dt='2022-06-29' ") .selectExpr("deviceid", "if(account='',null,account) as account", "sessionid", "timestamp") curDayLog.createTempView("curDayLog") //2.聚合当天日志,得出每一种 设备-账号 组合数 val curDayCombineSessionCnt = spark.sql( """ | |select | deviceid, | account, | 100*count(distinct sessionid) as score, | MAX(timestamp) as last_login |from | curDayLog |group by | deviceid,account |""".stripMargin) curDayCombineSessionCnt.createTempView("cur") //3.加载前一日的绑定评分表 spark.read.table("dws.dws_device_account_bind_score") .where("dt='2022-06-28' ") .createTempView("pre") //4.当日(设备-账号)聚合结果 full join 前一日绑定评分表 /** * 前一日出现的组合,今日没出现,则分数衰减 * 前一日出现的组合,今日也出现,则分数累加,并更新时间戳 * 前一日没出现的组合,今日出现,则取今日的得分 * 今日 : cur 前一日: pre */ val curBindScoreResult = spark.sql( """ |insert into table dws.dws_device_account_bind_score partition(dt='2022-06-29') |select | nvl(cur.deviceid,pre.deviceid) as deviceid, | nvl(cur.account,pre.account) as account, | case | when cur.score is null and pre.score is not null then pre.score * 0.6 | when cur.score is not null and pre.score is not null then cur.score + pre.score | when cur.score is not null and pre.score is null then cur.score | end as score, | nvl(cur.last_login,pre.last_login) as last_login |from | cur full join pre on cur.deviceid = pre.deviceid | and if(cur.account is null,'',cur.account) = if(pre.account is null,'',pre.account) |""".stripMargin) //curBindScoreResult.show(100, false) } }
日志数据ETL从ODS层写入DWD层
分为6步进行数据的ETL
一.数据的清洗过滤:1.过滤掉缺少关键字段(deviceid/properties/eventid/sessionid)的日志记录
2.过滤掉日志中不符合时间段的记录
//1.数据清洗过滤 val curLog: Dataset[Row] = spark.read.table("ods.ods_app_event_log") .where("dt='2022-06-29'") //curLog.show() //过滤掉日志中缺少的关键字段(deviceid/properties/eventid/sessionid) import spark.implicits._ import org.apache.spark.sql.functions._ //自定义函数除去空值 val isNotBlank = udf((s: String) => { StringUtils.isNotBlank(s) }) //过滤掉日志中不符合时间段的记录(app上报延迟等原因) val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss") val startTime = format.parse("2022-06-29 00:00:00").getTime val endTime = format.parse("2022-06-30 00:00:00").getTime //过滤后的数据filtered val filtered = curLog.where(isNotBlank($"deviceid") and col("properties").isNotNull and isNotBlank($"eventid") and isNotBlank($"sessionid")) .where(s"timestamp >=$startTime and timestamp <= $endTime")
二.数据规范化处理
生成一份要聚合到dwd层的数据表,对表中数据进行补充,给予默认值等操作,封装成bean对象,方便下一步进行计算
filtered.createTempView("filtered") val regulared: DataFrame = spark.sql( """ |select |if(account='',null,account) as account , |appId , |appVersion , |carrier , |deviceId , |deviceType , |eventId , |ip , |latitude , |longitude , |netType , |osName , |osVersion , |properties , |releaseChannel , |resolution , |sessionId , |timeStamp , | null as splitedSessionId , | null as filledAccount , | null as province , | null as city , | null as region , | -1 as guid , | 0 as isnew |from filtered | |""".stripMargin) //将数据转成Bean的形式方便操作 val logBean:Dataset[LogBean] = regulared.as[LogBean]
三.session分割:同一个sessionid的连续访问记录中,若前一条与后一条的时间间隔超过30分钟,则从该处切分为两个会话
//3.session分割 val splitedSessionId = logBean.rdd .groupBy((bean: LogBean) => { bean.sessionId }) .flatMap(tp => { val actions = tp._2.toList.sortBy(_.timeStamp) var newSessionId = UUID.randomUUID().toString for (i <- 0 until actions.size) { //为bean中的 新会话id赋值 actions(i).splitedSessionId = newSessionId //如果 下一条日志的时间戳距离当前这条的时间戳 大于 30分钟 就更换newSessionId的值 if (i < actions.size - 1 && (actions(i + 1).timeStamp - actions(i).timeStamp > 30 * 60 * 1000)) { newSessionId = UUID.randomUUID().toString } } actions }).toDS() //splitedSessionId.show()
四.数据集成(gps、ip集成省市区),把dim层算好的地理位置数据插入到dwd层的表中
spark实操,gps能匹配就用gps,若gps不能匹配就用ip集成
//4.数据集成(gps、ip集成省市区) //读取hive中的gps参考点知识库数据 val areaDict: Dataset[Row] = spark.read.table("dim.dim_area_dict").where("geohash is not null and geohash != ''") val gpsDictMap = areaDict.rdd.map({ case Row(geohash: String, province: String, city: String, region: String) => (geohash, (province, city, region)) }).collectAsMap() //将数据广播出去 val bc = spark.sparkContext.broadcast(gpsDictMap) //读取ip2region库文件 val configuration = new Configuration() val fs = FileSystem.get(configuration) val path = new Path("/xxx/dict/ip2region.db") val inputStream = fs.open(path) val len = fs.getFileStatus(path).getLen val bytes = new Array[Byte](len.toInt) IOUtils.readFully(inputStream, bytes) IOUtils.closeQuietly(inputStream) //广播出去 val bc2 = spark.sparkContext.broadcast(bytes) //地理位置集成 val areaed = logBean.mapPartitions(iter => { //构造 val gpsDict = bc.value val ip2regionBytes = bc2.value val config = new DbConfig() val searcher = new DbSearcher(config, ip2regionBytes) //逐条映射 iter.map( bean => { //判断gps能不能使用,如果不能使用则使用ip解析 var flag = false try { val lat = bean.latitude val lng = bean.longitude //计算gps数据 val geoStr = GeoHash.geoHashStringWithCharacterPrecision(lat, lng, 6) //用geo知识库匹配数据 if (gpsDict.contains(geoStr)) { flag = true val areaInfo: (String, String, String) = gpsDict.getOrElse(geoStr, ("", "", "")) bean.province = areaInfo._1 bean.city = areaInfo._2 bean.region = areaInfo._3 } } catch { case exception: Exception => exception.printStackTrace() } //通过ip匹配数据 if (!flag) { val block = searcher.memorySearch(bean.ip) val strings = block.getRegion.split("|") if (strings.length >= 6) { bean.province = strings(3) bean.city = strings(4) } } bean }) }) areaed.show(100,false)
五.idmapping操作
按照topN取出用户account,填充filledAccount
使用开窗函数,按deviceid分组、按score和last_login(最近登录时间)排序取Top1的deviceid和account,再与数据集成后的areaed进行left join,取出对应字段
spark代码实操
//加载设备账号绑定评分表,进行过滤处理(只留账号且分数最高的那一条 TopN) areaed.createTempView("areaed") spark.sql( """ |select | areaed.account , | areaed.appId , | areaed.appVersion , | areaed.carrier , | areaed.deviceId , | areaed.deviceType , | areaed.eventId , | areaed.ip , | areaed.latitude , | areaed.longitude , | areaed.netType , | areaed.osName , | areaed.osVersion , | areaed.properties , | areaed.releaseChannel , | areaed.resolution , | areaed.sessionId , | areaed.timeStamp , | areaed.splitedSessionId , | nvl(areaed.account,t2.account) as filledAccount, | areaed.province , | areaed.city , | areaed.region , | areaed.guid , | areaed.isnew |from | areaed |left join | (select | deviceid, | account | from | (select | deviceid, | account, | row_number() over(partition by deviceid order by score desc,last_login desc) as r | from | dws.dws_device_account_bind_score | where | dt = '2022-06-29' and account is not null) t1 | where | r=1) t2 |on areaed.deviceid = t2.deviceid | | |""".stripMargin).show()
idmapping
填充guid spark代码实操
//idmapping /* * 若T-1日(昨天)存在用户id 用户登录 不做处理 直接取出用户id做guid * 若T-1日(昨天)不存在用户id 拿绑定关系里的设备id查,存在设备id,就将用户id将其替换 * 若T-1日(昨天)不存在用户id和设备id,就新增一个设备id,guid自增 * */ val anonymousBean: Dataset[LogBean] = anonymousFilled.as[LogBean] val res: Dataset[LogBean] = anonymousBean.mapPartitions(iter => { //构造 val jedis = new Jedis("node02", 6389) //加密码 jedis.auth("123456") //映射 iter.map(bean => { //先用账号取guid var guidStr: String = null try { guidStr = jedis.get(bean.filledAccount) } catch { case exception: Exception => } if (bean.filledAccount != null && guidStr != null) { bean.guid = guidStr.toLong } else { //如果账号没有找到,则用deviceid取 guidStr = jedis.get(bean.deviceId) if (guidStr != null) { bean.guid = guidStr.toLong //将redis中的deviceid数据key换成account if (bean.filledAccount != null) { jedis.del(bean.deviceId) jedis.set(bean.filledAccount, guidStr) } } else { //用户deviceid查不到 //使用计数器获得一个新的guid val newGuid = jedis.incr("guid_cnt") //并将结果插入至redis val key = if (bean.filledAccount == null) bean.deviceId else bean.filledAccount //将guid设置到bean jedis.set(key, newGuid + "") bean.guid = newGuid.toLong } } //设置新老访客:1(新用户) //TODO 新老访客 if (bean.guid > lastMaxGuid) bean.isnew = 1 bean }) })
判断新老用户
在redis中拿到最大的guid,若新的guid大于redis中的guid则认为是新用户
//获取redis中的最大Guid,在程序的开始获取lastMaxGuid
val jedis1 = new Jedis("node02", 6389)
//加密码(与上文redis连接一致)
jedis1.auth("123456")
val str = jedis1.get("guid_cnt")
var lastMaxGuid = 0L
if (str != null) lastMaxGuid = str.toLong
六.将结果写入hive的DWD层
res.createTempView("res") spark.sql( """ | |insert into table dwd.dwd_app_event_detail partition(dt='2022-06-29') |select | account , | appId , | appVersion , | carrier , | deviceId , | deviceType , | eventId , | ip , | latitude , | longitude , | netType , | osName , | osVersion , | properties , | releaseChannel , | resolution , | sessionId , | timeStamp , | splitedSessionId , | filledAccount, | province , | city , | region , | guid , | isnew |from | res |""".stripMargin) }
DWD层dwd.dwd_app_event_detail 建表语句
create external table dwd.dwd_app_event_detail(
    account          string,
    appId            string,
    appVersion       string,
    carrier          string,
    deviceId         string,
    deviceType       string,
    eventId          string,
    ip               string,
    latitude         double,
    longitude        double,
    netType          string,
    osName           string,
    osVersion        string,
    properties       map<string,string>,
    releaseChannel   string,
    resolution       string,
    sessionId        string,
    `timeStamp`      string,
    splitedSessionId string,
    filledAccount    string,
    province         string,
    city             string,
    region           string,
    guid             bigint,
    isnew            int
)
partitioned by (dt string)
stored as parquet
location 'hdfs://hdfs-xxx/app/dwd/dwd_app_event_detail'
tblproperties('parquet.compression'='snappy');
任务提交
spark-submit \
--master yarn \
--deploy-mode cluster \
--class com.xxx.dataetl.etl.OdsApp2DwsApp \
--name "app端日志ODS数据到DWD" \
--conf spark.sql.shuffle.partitions=5 \
--driver-memory 1024M \
--executor-memory 2G \
--driver-cores 1 \
--queue default \
--num-executors 2 \
/opt/xxx/tasks/dataxxx.jar node02 2022-06-29 2022-06-30
app日志的dwd明细表的质量检查
质量检查的手段:sql统计质量指标
质量指标:某条件下的数据条数(null值,非null值,全局总条数,被切割的会话数等...)
地理位置填充质量建表语句
create table dq.dq_dwd_app_event_detail_area( dt string, province_null bigint, province_notnull bigint, city_null bigint, city_notnull bigint, region_null bigint, region_notnull bigint, table_record_cnt bigint ) stored as parquet location 'hdfs://hdfs-xxx/app/dq/dq_dwd_app_event_detail_area'
sql统计地理位置填充质量,有多少没有被填充
insert into dq.dq_dwd_app_event_detail_area select '2022-06-29' as dt, count(if(province is null,1,null)) as province_null, count(province) as province_notnull, count(if(city is null,1,null)) as city_null, count(city) as city_notnull, count(if(region is null,1,null)) as region_null, count(region) as region_notnull, count(*) as table_record_cnt from dwd.dwd_app_event_detail where dt='2022-06-29';
sessionid分割率建表语句
create table dq.dq_dwd_app_event_detail_ssp( dt string, origin_sess_cnt bigint, split_sess_cnt bigint, origin_splited_cnt bigint, origin_not_splited_cnt bigint ) stored as parquet location 'hdfs://hdfs-xxx/app/dq/dq_dwd_app_event_detail_ssp'
sql统计session 切割比
原生会话数 sessionid
切割后的会话总数 splitedSessionId
被切割的原生会话数 一个sessionid下的splitedSessionId >=2 说明被切割
没被切割的原生会话数 一个sessionid下的splitedSessionId=1 说明没被切割
insert into dq.dq_dwd_app_event_detail_ssp select count(1) orgin_scnt ,sum(splited_scnt) split_scnt ,count(if(splited_scnt >1,1,null)) as orgin_split_cnt ,count(if(splited_scnt =1,1,null)) as orgin_not_split_cnt from (select sessionId, count(distinct splitedSessionId) as splited_scnt from dwd.dwd_app_event_detail where dt="2022-06-29" group by sessionId) o;
主要表结构
存储各业务表的全量快照
存储各业务表的拉链表
增量合并成全量快照
滚动更新策略
为了便于后续的统计分析,用增量抽取策略抽取过来的增量数据,都要每天进行滚动合并
合并的技术手段:
方便起见,可以使用 sqoop merge 命令进行
如果有特别情况,可以自己写 spark程序 来实现
直接用hive的sql来实现(分组top1模式 或者 FULL JOIN模式)
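其中“分组top1模式”的大致写法如下(以一个简化的订单表为例,表名、字段为示意):把前日全量快照与当日增量 union 后,按主键取 modify_time 最新的一条,即得到T日的全量快照:
insert overwrite table dwd.oms_order_demo partition(dt='2022-06-29')
select id, status, pay_amount, modify_time
from (
    select id, status, pay_amount, modify_time,
           row_number() over(partition by id order by modify_time desc) as rn
    from (
        select id, status, pay_amount, modify_time from dwd.oms_order_demo where dt='2022-06-28'      -- 前日全量快照
        union all
        select id, status, pay_amount, modify_time from ods.ods_oms_order_demo where dt='2022-06-29'  -- 当日增量
    ) t
) ranked
where rn = 1;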
筛选增量数据
根据业务表中的modify_time之类的字段进行限定
或者根据业务表中的自增id来限定(只有新增数据、不做修改的表也称为流水表)
缺点:
如果历史数据不删,会产生大量的冗余,空间浪费
如果删了,就只有最新的快照表,找不到往日的数据
拉链表
好处:
可以查询任意一条记录在任意一天的数据
可以查询到整个表的任意一天的全量状态
省空间,如果一条数据没有修改,也就只存在一条记录
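例如,要查询某一天(如 2022-06-15)的全量状态,只需取生效区间覆盖该日期的记录(表名、字段沿用下文实现代码中的示例):
select orderid, guid, step, amount
from oms_order_zip
where dt = '2022-06-29'              -- 取最新的拉链分区
  and start_date <= '2022-06-15'
  and end_date   >= '2022-06-15';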
实现代码:
select
    zip.orderid,
    zip.guid,
    zip.step,
    zip.amount,
    zip.start_date,
    if(zip.end_date='9999-12-31' and inc.orderid is not null, '2022-06-29', zip.end_date) as end_date
from (
    select orderid,guid,step,amount,start_date,end_date
    from oms_order_zip
    where dt='2022-06-28'
) as zip
left join (
    select orderid,guid,step,amount from oms_order_inc
) as inc
using(orderid,guid)

union all

select orderid,guid,step,amount,'2022-06-29' as start_date,'9999-12-31' as end_date
from oms_order_inc
全量快照表建表语句
DROP TABLE IF EXISTS DWD.oms_order; CREATE TABLE DWD.`oms_order` ( `id` bigint COMMENT '订单id', `member_id` bigint , `coupon_id` bigint , `order_sn` string COMMENT '订单编号', `create_time` timestamp COMMENT '提交时间', `member_username` string COMMENT '用户帐号', `total_amount` decimal COMMENT '订单总金额', `pay_amount` decimal COMMENT '应付金额(实际支付金额)', `freight_amount` decimal COMMENT '运费金额', `promotion_amount` decimal COMMENT '促销优化金额(促销价、满减、阶梯价)', `integration_amount` decimal COMMENT '积分抵扣金额', `coupon_amount` decimal COMMENT '优惠券抵扣金额', `discount_amount` decimal COMMENT '管理员后台调整订单使用的折扣金额', `pay_type` int COMMENT '支付方式:0->未支付;1->支付宝;2->微信', `source_type` int COMMENT '订单来源:0->PC订单;1->app订单', `status` int COMMENT '订单状态:0->待付款;1->待发货;2->已发货;3->已完成;4->已关闭;5-> 无效订单', `order_type` int COMMENT '订单类型:0->正常订单;1->秒杀订单', `delivery_company` string COMMENT '物流公司(配送方式)', `delivery_sn` string COMMENT '物流单号', `auto_confirm_day` int COMMENT '自动确认时间(天)', `integration` int COMMENT '可以获得的积分', `growth` int COMMENT '可以活动的成长值', `promotion_info` string COMMENT '活动信息', `bill_type` int COMMENT '发票类型:0->不开发票;1->电子发票;2->纸质发票', `bill_header` string COMMENT '发票抬头', `bill_content` string COMMENT '发票内容', `bill_receiver_phone` string COMMENT '收票人电话', `bill_receiver_email` string COMMENT '收票人邮箱', `receiver_name` string COMMENT '收货人姓名', `receiver_phone` string COMMENT '收货人电话', `receiver_post_code` string COMMENT '收货人邮编', `receiver_province` string COMMENT '省份/直辖市', `receiver_city` string COMMENT '城市', `receiver_region` string COMMENT '区', `receiver_detail_address` string COMMENT '详细地址', `note` string COMMENT '订单备注', `confirm_status` int COMMENT '确认收货状态:0->未确认;1->已确认', `delete_status` int COMMENT '删除状态:0->未删除;1->已删除', `use_integration` int COMMENT '下单时使用的积分', `payment_time` timestamp COMMENT '支付时间', `delivery_time` timestamp COMMENT '发货时间', `receive_time` timestamp COMMENT '确认收货时间', `comment_time` timestamp COMMENT '评价时间', `modify_time` timestamp COMMENT '修改时间' ) partitioned by (dt string) stored as PARQUET TBLPROPERTIES("parquet.compress"="snappy")
增量数据合并到全量快照
#!/bin/bash export JAVA_HOME=/opt/apps/jdk1.8.0_191/ export HIVE_HOME=/opt/apps/hive-3.1.2/ export HADOOP_HOME=/opt/apps/hadoop-3.1.1/ export SPARK_HOME=/opt/apps/spark-2.4.4/ export SQOOP_HOME=/opt/apps/sqoop-1.4.7/ DT_CALC=$(date -d'-1 day' +%Y-%m-%d) DT_PRE=$(date -d'-2 day' +%Y-%m-%d) if [ $1 ] then DT_CALC=$1 DT_PRE=`date -d"${DT_CALC} -1day" +%Y-%m-%d` fi ${HIVE_HOME}/bin/hive -e " -- 全量快照 with a as ( SELECT * FROM dwd.oms_order where dt='${DT_PRE}' ) -- 增量数据 ,b as ( SELECT * FROM ods.oms_order where dt='${DT_CALC}' ) INSERT INTO TABLE dwd.oms_order PARTITION(dt='${DT_CALC}') SELECT nvl(b.id,a.id) as id , nvl(b.member_id,a.member_id) as member_id , nvl(b.coupon_id,a.coupon_id) as coupon_id , nvl(b.order_sn,a.order_sn) as order_sn , nvl(b.create_time,a.create_time) as create_time , nvl(b.member_username,a.member_username) as member_username , nvl(b.total_amount,a.total_amount) as total_amount , nvl(b.pay_amount,a.pay_amount) as pay_amount , nvl(b.freight_amount,a.freight_amount) as freight_amount , nvl(b.promotion_amount,a.promotion_amount) as promotion_amount , nvl(b.integration_amount,a.integration_amount) as integration_amount , nvl(b.coupon_amount,a.coupon_amount) as coupon_amount , nvl(b.discount_amount,a.discount_amount) as discount_amount , nvl(b.pay_type,a.pay_type) as pay_type , nvl(b.source_type,a.source_type) as source_type , nvl(b.status,a.status) as status , nvl(b.order_type,a.order_type) as order_type , nvl(b.delivery_company,a.delivery_company) as delivery_company , nvl(b.delivery_sn,a.delivery_sn) as delivery_sn , nvl(b.auto_confirm_day,a.auto_confirm_day) as auto_confirm_day , nvl(b.integration,a.integration) as integration , nvl(b.growth,a.growth) as growth , nvl(b.promotion_info,a.promotion_info) as promotion_info , nvl(b.bill_type,a.bill_type) as bill_type , nvl(b.bill_header,a.bill_header) as bill_header , nvl(b.bill_content,a.bill_content) as bill_content , nvl(b.bill_receiver_phone,a.bill_receiver_phone) as bill_receiver_phone , nvl(b.bill_receiver_email,a.bill_receiver_email) as bill_receiver_email , nvl(b.receiver_name,a.receiver_name) as receiver_name , nvl(b.receiver_phone,a.receiver_phone) as receiver_phone , nvl(b.receiver_post_code,a.receiver_post_code) as receiver_post_code , nvl(b.receiver_province,a.receiver_province) as receiver_province , nvl(b.receiver_city,a.receiver_city) as receiver_city , nvl(b.receiver_region,a.receiver_region) as receiver_region , nvl(b.receiver_detail_address,a.receiver_detail_address) as receiver_detail_address , nvl(b.note,a.note) as note , nvl(b.confirm_status,a.confirm_status) as confirm_status , nvl(b.delete_status,a.delete_status) as delete_status , nvl(b.use_integration,a.use_integration) as use_integration , nvl(b.payment_time,a.payment_time) as payment_time , nvl(b.delivery_time,a.delivery_time) as delivery_time , nvl(b.receive_time,a.receive_time) as receive_time , nvl(b.comment_time,a.comment_time) as comment_time , nvl(b.modify_time,a.modify_time) as modify_time FROM a full join b on a.id=b.id " if [ $? -eq 0 ] then echo "任务执行成功" echo "任务成功: $(date) :oms_order全量快照生成, T日: ${DT_CALC} " | mail -s "hive任务 成功" [email protected] else echo "任务失败" echo "任务失败: $(date) :oms_order全量快照生成, T日: ${DT_CALC} " | mail -s "hive任务 成功" [email protected] fi
拉链表建表语句
从增量数据更新到拉链表
#!/bin/bash export JAVA_HOME=/opt/apps/jdk1.8.0_191/ export HIVE_HOME=/opt/apps/hive-3.1.2/ export HADOOP_HOME=/opt/apps/hadoop-3.1.1/ export SPARK_HOME=/opt/apps/spark-2.4.4/ export SQOOP_HOME=/opt/apps/sqoop-1.4.7/ DT_CALC=$(date -d'-1 day' +%Y-%m-%d) DT_PRE=$(date -d'-2 day' +%Y-%m-%d) if [ $1 ] then DT_CALC=$1 DT_PRE=`date -d"${DT_CALC} -1day" +%Y-%m-%d` fi ${HIVE_HOME}/bin/hive -e " -- 全量快照 with a as ( SELECT * FROM dwd.dwd_oms_order_zip where dt='${DT_PRE}' ) -- 增量数据 ,b as ( SELECT * FROM ods.ods_oms_order where dt='${DT_CALC}' ) INSERT INTO TABLE dwd.dwd_oms_order_zip PARTITION(dt='${DT_CALC}') SELECT a.id , a.member_id , a.coupon_id , a.order_sn , a.create_time , a.member_username , a.total_amount , a.pay_amount , a.freight_amount , a.promotion_amount , a.integration_amount , a.coupon_amount , a.discount_amount , a.pay_type , a.source_type , a.status , a.order_type , a.delivery_company , a.delivery_sn , a.auto_confirm_day , a.integration , a.growth , a.promotion_info , a.bill_type , a.bill_header , a.bill_content , a.bill_receiver_phone , a.bill_receiver_email , a.receiver_name , a.receiver_phone , a.receiver_post_code , a.receiver_province , a.receiver_city , a.receiver_region , a.receiver_detail_address , a.note , a.confirm_status , a.delete_status , a.use_integration , a.payment_time , a.delivery_time , a.receive_time , a.comment_time , a.modify_time , a.start_dt , if(a.end_dt='9999-12-31' and b.id is not null,a.dt,a.end_dt) as end_dt FROM a left join b on a.id=b.id UNION ALL select id , member_id , coupon_id , order_sn , create_time , member_username , total_amount , pay_amount , freight_amount , promotion_amount , integration_amount , coupon_amount , discount_amount , pay_type , source_type , status , order_type , delivery_company , delivery_sn , auto_confirm_day , integration , growth , promotion_info , bill_type , bill_header , bill_content , bill_receiver_phone , bill_receiver_email , receiver_name , receiver_phone ,receiver_post_code , receiver_province , receiver_city , receiver_region , receiver_detail_address , note , confirm_status , delete_status , use_integration , payment_time , delivery_time , receive_time , comment_time , modify_time , dt as start_dt , '9999-12-31' as end_dt from b " if [ $? -eq 0 ] then echo "任务执行成功" echo "任务成功: $(date) :oms_order拉链表生成, T日: ${DT_CALC} " | mail -s "hive任务成 功" [email protected] # 计算完成后,删除过期的拉链分区 ${HIVE_HOME}/bin/hive -e "alter table dwd.oms_order_zip drop partition (dt='${DT_PRE}')" if [ $? -eq 0 ] then echo "过期拉链表分区清理成功: ${DT_PRE} " | mail -s "hive任务成功" [email protected] else echo "过期拉链表分区清理失败: ${DT_PRE} " | mail -s "hive任务失败" [email protected] fi else echo "任务失败" echo "任务失败: $(date) :oms_order拉链表生成, T日: ${DT_CALC} " | mail -s "hive任务成 功" [email protected] fi
订单商品明细表
购物车明细表
优惠卷使用明细表
秒杀活动订阅明细表
需求分析
流量多维分析
用户活跃度分析
方案设计
流量多维分析
一.核心度量
UV数(去重访客数)
PV数
IP数(不精准)
用户访问次数(会话次数)
跳出次数(只访问一个页面就退出的会话)
用户访问时长
二.统计维度
时间维度(日,周,月,销售季,每月第一天,每周第一天,每季第一天,一年中的第几周)
访客维度
入口页面
退出页面
被访问的商品
访问所属的时间段(小时段,日,周,月,年)
三.维度数据生成
从业务表中抽取需要的维度表(商品信息表,用户注册信息表,商品品类表,品牌信息表……)
从用户画像表中抽取需要的维度(用户消费等级画像表,消费偏好画像表,活跃等级画像表)
从dwd数据中为本需求特别构建维度(如入口页面,跳出页面等)
四.得出结果
流量基础指标统计用的大宽表(dws层)
五.需求案例
各手机型号 PV、UV、会话数、访问时长、IP、 跳出会话数
select
    devicetype,
    count(1) as pv_cnt,
    count(distinct guid) as uv_cnt,
    count(distinct splitedsessionid) as ses_cnt,
    sum(page_acc_tml) as acc_tml,
    count(distinct ip) as ip_cnt,
    count(distinct if(is_jumpout='Y',sessionid,null)) as jpt_ses_cnt
from dws.dws_app_tfc_dkb   -- 粒度:一条数据就是一次pageView
where dt='2022-06-29'
group by devicetype
高阶聚合函数
group by a,b,c,d with cube 把所有的组合可能性全都计算了
group by a,b,c,d grouping sets((a),(a,b,c),(a,c)) 取出想要的维度组合
group by a,b,c ,d with rollup 上卷
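一个简化示意(沿用上文示例中的 dws.dws_app_tfc_dkb 表,仅取 province、devicetype 两个维度对比三种写法):
-- with cube:计算 (province, devicetype) 的全部组合
select province, devicetype, count(distinct guid) as uv_cnt
from dws.dws_app_tfc_dkb where dt='2022-06-29'
group by province, devicetype with cube;

-- grouping sets:只计算指定的维度组合
select province, devicetype, count(distinct guid) as uv_cnt
from dws.dws_app_tfc_dkb where dt='2022-06-29'
group by province, devicetype grouping sets((province),(province,devicetype),());

-- with rollup:按层级上卷 (province,devicetype) -> (province) -> ()
select province, devicetype, count(distinct guid) as uv_cnt
from dws.dws_app_tfc_dkb where dt='2022-06-29'
group by province, devicetype with rollup;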
kylin 麒麟(维度剪枝)
维度的基数 = 维度字段去重后不同取值的个数
is_jumpout(Y,N)
province(40)
高基维:基数很大的维度(如用户id、设备id)
特高基维:基数极大、取值几乎不重复的维度(如订单号)
开发代码
DWS层宽表开发
会话信息聚合维表
create external table dws.dws_app_session_agr( guid bigint, splitedsessionid string, enter_page_id string, exit_page_id string, start_time bigint, end_time bigint, is_jumpout string, pv_cnt bigint ) partitioned by (dt string) stored as parquet location 'hdfs://hdfs-xxx/app/dws/dws_app_session_agr';
将数据写入维表
INSERT INTO TABLE dws.dws_app_session_agr PARTITION(dt='2022-06-29') SELECT guid, splitedsessionid, split(min(if(eventid='pageView',concat_ws('_',`timestamp`,eventid,properties['pageId']),'9')),'_')[2] as enter_page_id,-- 入口页 split(max(if(eventid='pageView',concat_ws('_',`timestamp`,eventid,properties['pageId']),'0')),'_')[2] as exit_page_id,-- 跳出页 min(`timestamp`) as start_time , -- 起始时间 max(`timestamp`) as end_time , -- 结束时间 if(count(if(eventid='pageView',1,null)) < 2,'Y','N') as is_jumpout , -- 是否跳出页 count(if(eventid='pageView',1,null)) as pv_cnt -- PV总数 FROM dwd.dwd_app_event_detail WHERE dt='2022-06-29' GROUP BY guid,splitedsessionid;
建立app端流量明细大宽表
create external table dws.dws_app_tfc_dkb(
    account          string,
    appId            string,
    appVersion       string,
    carrier          string,
    deviceId         string,
    deviceType       string,
    eventId          string,
    ip               string,
    latitude         double,
    longitude        double,
    netType          string,
    osName           string,
    osVersion        string,
    properties       map<string,string>,
    releaseChannel   string,
    resolution       string,
    sessionId        string,
    `timeStamp`      string,
    splitedSessionId string,
    filledAccount    string,
    province         string,
    city             string,
    region           string,
    guid             bigint,
    isnew            int,
    page_acc_tml     string,
    enter_page_id    string,
    exit_page_id     string,
    start_time       bigint,
    end_time         bigint,
    is_jumpout       string,
    pv_cnt           bigint
)
partitioned by (dt string)
stored as parquet
location 'hdfs://hdfs-xxx/app/dws/dws_app_tfc_dkb';
向大宽表中写入数据
with event as ( select account ,appId ,appVersion ,carrier ,deviceId ,deviceType ,eventId ,ip ,latitude ,longitude ,netType ,osName ,osVersion ,properties ,releaseChannel ,resolution ,sessionId ,`timeStamp` ,splitedSessionId ,filledAccount ,province ,city ,region ,guid ,isnew from dwd.dwd_app_event_detail where dt='2022-06-29' and eventid='pageView' ), session_agr as ( select guid ,splitedSessionId ,enter_page_id ,exit_page_id ,start_time ,end_time ,is_jumpout ,pv_cnt from dws.dws_app_session_agr where dt='2022-06-29' ) insert into table dws.dws_app_tfc_dkb partition(dt='2022-06-29') select event.account ,event.appId ,event.appVersion ,event.carrier ,event.deviceId ,event.deviceType ,event.eventId ,event.ip ,event.latitude ,event.longitude ,event.netType ,event.osName ,event.osVersion ,event.properties ,event.releaseChannel ,event.resolution ,event.sessionId ,event.`timeStamp` ,event.splitedSessionId ,event.filledAccount ,event.province ,event.city ,event.region ,event.guid ,event.isnew ,lead(event.`timestamp`,1,session_agr.end_time) over(partition by session_agr.splitedSessionId order by `timestamp`) - event.`timestamp` ,session_agr.enter_page_id ,session_agr.exit_page_id ,session_agr.start_time ,session_agr.end_time ,session_agr.is_jumpout ,session_agr.pv_cnt from event left join session_agr on event.splitedSessionId=session_agr.splitedSessionId;
创建ads多维流量分析cube表
create external table ads.ads_app_tfc_cube( appid string ,appversion string ,carrier string ,devicetype string ,nettype string ,osname string ,releasechannel string ,province string ,city string ,region string ,isnew int ,enter_page_id string ,exit_page_id string ,is_jumpout string ,pv_cnt bigint ,uv_cnt bigint ,ses_cnt bigint ,acc_tml bigint ,avg_ses_tml bigint ,ip_cnt bigint ,jpt_ses_cnt bigint ) partitioned by (dt string) stored as parquet location 'hdfs://hdfs-xxx/app/ads/ads_app_tfc_cube' tblproperties( 'parquet.compress'='snappy');
将数据写入ads层cube表
with tmp as( select nvl(account ,'UNKOWN')as account ,nvl(appid ,'UNKOWN')as appid ,nvl(appversion ,'UNKOWN')as appversion ,nvl(carrier ,'UNKOWN')as carrier ,nvl(deviceid ,'UNKOWN')as deviceid ,nvl(devicetype ,'UNKOWN')as devicetype ,nvl(eventid ,'UNKOWN')as eventid ,nvl(ip ,'UNKOWN')as ip ,nvl(latitude ,'UNKOWN')as latitude ,nvl(longitude ,'UNKOWN')as longitude ,nvl(nettype ,'UNKOWN')as nettype ,nvl(osname ,'UNKOWN')as osname ,nvl(osversion ,'UNKOWN')as osversion ,nvl(releasechannel ,'UNKOWN')as releasechannel ,nvl(resolution ,'UNKOWN')as resolution ,nvl(sessionid ,'UNKOWN')as sessionid ,nvl(`timestamp` ,'UNKOWN')as `timestamp` ,nvl(filledaccount ,'UNKOWN')as filledaccount ,nvl(province ,'UNKOWN')as province ,nvl(city ,'UNKOWN')as city ,nvl(region ,'UNKOWN')as region ,nvl(guid ,'UNKOWN')as guid ,nvl(isnew ,'UNKOWN')as isnew ,nvl(page_acc_tml ,'UNKOWN')as page_acc_tml ,nvl(splitedsessionid,'UNKOWN')as splitedsessionid ,nvl(enter_page_id ,'UNKOWN')as enter_page_id ,nvl(exit_page_id ,'UNKOWN')as exit_page_id ,nvl(start_time ,'UNKOWN')as start_time ,nvl(end_time ,'UNKOWN')as end_time ,nvl(is_jumpout ,'UNKOWN')as is_jumpout ,nvl(pv_cnt ,'UNKOWN')as pv_cnt from dws.dws_app_tfc_dkb where dt='2022-06-29') insert into table ads.ads_app_tfc_cube partition(dt='2022-06-29') select appid ,appversion ,carrier ,devicetype ,nettype ,osname ,releasechannel ,province ,city ,region ,isnew ,enter_page_id ,exit_page_id ,is_jumpout ,count(1) pv_cnt ,count(distinct guid) as uv_cnt ,count(distinct splitedsessionid) as ses_cnt ,sum(page_acc_tml) as acc_tml ,sum(page_acc_tml)/count(distinct splitedsessionid) as avg_ses_tml ,count(distinct ip) as ip_cnt ,count(distinct if(is_jumpout='Y',sessionid,null)) as jpt_ses_cnt from tmp group by appid ,appversion ,carrier ,devicetype ,nettype ,osname ,releasechannel ,province ,city ,region ,isnew ,enter_page_id ,exit_page_id ,is_jumpout grouping sets( (appid) ,(appid,appversion) ,(devicetype) ,(osname) ,(carrier,nettype) ,(releasechannel) ,(province) ,(province,city) ,(province,city,region) ,(isnew) ,(is_jumpout) ,(province,isnew) ,(enter_page_id) ,(exit_page_id) ,());
多维层级(再聚合)的方案
count(distinct)类型的指标,是非可加事实,不能直接进行累加,这里采用bitmap数据结构实现聚合
bitmap:就是用一个 bit 位来标记某个元素对应的 value,而 key 即是这个元素。由于采用 bit 为单位来存储数据,因此可以大大节省存储空间。
在做上卷时,distinct 后的结果不能直接累加,我们就可以采用 bitmap 的方式对 guid 进行存储:bitmap 以二进制形式存储,所占空间较小,合并(or)之后取基数即可得到去重数,查询速度也有提升。
模拟数据源
对设备类型,省份,是否为新用户进行聚合,还要计算uv和页面访问时长
在这个基础上,进行二次聚合,只看设备类型和省份
使用
import org.roaringbitmap.RoaringBitmap
定义UDAF函数
第一次聚合自定义函数代码
//三个参数,对应 in,buf,out;buf相当于中间计算结果,用buf和接下来的输入进行计算
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import org.roaringbitmap.RoaringBitmap

object BitmapUDAF extends Aggregator[Int, Array[Byte], Array[Byte]] {

  /**
   * 初始化Buffer,本质上是做了一个空的Bitmap并序列化得到字节数组
   */
  override def zero: Array[Byte] = {
    val bitmap = RoaringBitmap.bitmapOf()
    //序列化
    serBitmap(bitmap)
  }

  /**
   * 聚合逻辑,本质上就是函数接收一组guid中的一个,将这个guid添加到buf的bitmap上
   */
  override def reduce(b: Array[Byte], guid: Int): Array[Byte] = {
    //反序列化出buf的bitmap对象
    val bitmap = desBitmap(b)
    //添加新的元素
    bitmap.add(guid)
    //将添加元素后的bitmap序列化成字节数组
    serBitmap(bitmap)
  }

  override def merge(b1: Array[Byte], b2: Array[Byte]): Array[Byte] = {
    //or或运算
    val bitmap1 = desBitmap(b1)
    val bitmap2 = desBitmap(b2)
    bitmap1.or(bitmap2)
    serBitmap(bitmap1)
  }

  override def finish(reduction: Array[Byte]): Array[Byte] = reduction

  override def bufferEncoder: Encoder[Array[Byte]] = Encoders.BINARY

  override def outputEncoder: Encoder[Array[Byte]] = Encoders.BINARY

  def serBitmap(bitmap: RoaringBitmap): Array[Byte] = {
    val baout = new ByteArrayOutputStream()
    val dout = new DataOutputStream(baout)
    bitmap.serialize(dout)
    baout.toByteArray
  }

  def desBitmap(b: Array[Byte]): RoaringBitmap = {
    val bitmap = RoaringBitmap.bitmapOf()
    val baIn = new ByteArrayInputStream(b)
    val dIn = new DataInputStream(baIn)
    bitmap.deserialize(dIn)
    bitmap
  }
}
第二次聚合自定义函数代码
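上一步聚合输出的是序列化后的 bitmap 字节数组;第二次(上卷)聚合需要把多个 bitmap 做 or 合并。下面是与上文 BitmapUDAF 同一套序列化方式的示意实现,对应下文注册的 BitmapAggr:
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import org.roaringbitmap.RoaringBitmap

// 输入是已序列化的bitmap字节数组,聚合逻辑是反序列化后做or合并,再序列化输出
object BitmapAggr extends Aggregator[Array[Byte], Array[Byte], Array[Byte]] {

  override def zero: Array[Byte] = serBitmap(RoaringBitmap.bitmapOf())

  override def reduce(buf: Array[Byte], in: Array[Byte]): Array[Byte] = merge(buf, in)

  override def merge(b1: Array[Byte], b2: Array[Byte]): Array[Byte] = {
    val bitmap1 = desBitmap(b1)
    val bitmap2 = desBitmap(b2)
    bitmap1.or(bitmap2)
    serBitmap(bitmap1)
  }

  override def finish(reduction: Array[Byte]): Array[Byte] = reduction

  override def bufferEncoder: Encoder[Array[Byte]] = Encoders.BINARY

  override def outputEncoder: Encoder[Array[Byte]] = Encoders.BINARY

  def serBitmap(bitmap: RoaringBitmap): Array[Byte] = {
    val baout = new ByteArrayOutputStream()
    bitmap.serialize(new DataOutputStream(baout))
    baout.toByteArray
  }

  def desBitmap(bytes: Array[Byte]): RoaringBitmap = {
    val bitmap = RoaringBitmap.bitmapOf()
    bitmap.deserialize(new DataInputStream(new ByteArrayInputStream(bytes)))
    bitmap
  }
}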
使用UDAF函数,进行聚合操作
object RollupBitmapDemo { def main(args: Array[String]): Unit = { val spark = SparkSession.builder() .appName("rollup test") .master("local") .config("fs.defaultFS", "file:///") .getOrCreate() val logDF = spark.read.option("header", "true").option("inferSchema","true").csv("data/rollup.csv") logDF.createTempView("log") //调用自定义聚合函数,求bitmap import org.apache.spark.sql.functions._ val mybitmap = udaf(BitmapUDAF) spark.udf.register("agr2bitmap",mybitmap) val getCardinality = (b:Array[Byte])=>{ val bitmap = RoaringBitmap.bitmapOf() val baIn = new ByteArrayInputStream(b) val dIn = new DataInputStream(baIn) bitmap.deserialize(dIn) bitmap.getCardinality } spark.udf.register("getcnt",getCardinality) val res = spark.sql( """ | |select | devicetype,province,isnew, | getcnt(agr2bitmap(guid)) as uv, | agr2bitmap(guid) as uv_bitmap, | sum(acc_tml) as acc_tml | |from log |group by devicetype,province,isnew | |""".stripMargin) res.show(10,false) res.createTempView("dim_base_cuboid") //从维度多的聚合结果中,直接计算出维度少的聚合结果 spark.udf.register("bm_aggr",udaf(BitmapAggr)) val lev2Cuboid = spark.sql( """ | |select | devicetype,province, | bm_aggr(uv_bitmap) as uv_bitmap, | getcnt(bm_aggr(uv_bitmap)) as uv, | sum(acc_tml) as acc_tml |from | dim_base_cuboid |group by | devicetype,province | |""".stripMargin) lev2Cuboid.show() spark.close() }
得出结果
用户活跃分析
利用一个类似拉链表的用户连续活跃区间记录表,记录着每个人每天活跃状态,但是又不用每天都存储一条
字段:guid,首访日期,连续活跃起始日,连续活跃结束日
需求分析案例一:
最近一个月内,有过连续活跃10+天的人
最近一个月内,每个用户的平均活跃天数
最近一个月内,连续活跃[1-10)天的人数,[10-20)天的人数,[20+ 天的人数
任意指定的一段日期范围内,连续活跃5+天的人(比如,12-01,12-02,12-03连续3天活跃的人)
最近30天内,沉默天数超过3的有多少人,超过5天有多少人
建表语句
CREATE TABLE ads.ads_app_useract_stat_m( calc_date string, month string, continuous_5days int, -- 本月内连续活跃>=5天的人数 continuous_7days int, -- 本月内连续活跃>=7天的人数 continuous_14days int, -- 本月内连续活跃>=14天的人数 continuous_20days int, continuous_30days int ) STORED AS PARQUET;
数据计算
WITH tmp AS ( SELECT guid, max(datediff(if(rng_end='9999-12-31','2020-10-07',rng_end),if(rng_start<'2020-10- 01','2020-10-01',rng_start))+1) as max_continuous_days FROM dws.dws_app_useract_range WHERE dt='2020-10-07' AND rng_end >= '2020-10-01' GROUP BY guid ) -- 举例 /* +---------------+------------------+ | guid | continuous_days | +---------------+------------------+ | 0TaMafscth6X | 1 | | 1ANCwzVmNPZo | 1 | | 1hyVlok7H49P | 5 | | 1iCCizS18Us0 | 1 | | 1kFxP0IfmhSf | 6 | | 1ly5nTS29obZ | 1 | | 2KiHc1ur1hEG | 1 | | 2MSQj5R4ITn3 | 1 | | 2OOMY1MjfK89 | 10 | | 2Q5lMJtXYuQX | 25 | +---------------+------------------+ */ INSERT INTO TABLE ads.ads_app_useract_stat_m SELECT '2020-10-07' as calc_date, month('2020-10-07') as month, count(if(max_continuous_days>=5,1,null)) as continuous_5days , -- 本月内连续活 跃>=5天的人数 count(if(max_continuous_days>=7,1,null)) as continuous_7days , -- 本月内连续活 跃>=7天的人数 count(if(max_continuous_days>=14,1,null)) as continuous_14days , -- 本月内连续活 跃>=14天的人数 count(if(max_continuous_days>=20,1,null)) as continuous_20days , count(if(max_continuous_days>=30, 1,null)) as continuous_30days FROM tmp;
Requirement analysis, case 2:
Developing the DWS-layer continuous-activity range table
Approach 1
Range record table
-- continuous-activity range records as of 2021-08-09
1,2021-07-01,2021-07-01,2021-07-08
1,2021-07-01,2021-08-01,2021-08-02
2,2021-07-05,2021-07-05,2021-07-15
2,2021-07-05,2021-08-05,9999-12-31
3,2021-08-01,2021-08-01,2021-08-05
3,2021-08-01,2021-08-08,2021-08-08
4,2021-07-05,2021-07-05,2021-07-15
4,2021-07-05,2021-08-05,9999-12-31

create table dws.dws_app_user_ctnu_rng(
  guid          bigint,
  first_acc_dt  string,
  rng_start_dt  string,
  rng_end_dt    string
)
partitioned by (dt string)
row format delimited fields terminated by ',';

load data local inpath '/root/rng.csv' into table dws.dws_app_user_ctnu_rng partition(dt='2021-08-09');
Daily active (DAU) table
-- DAU table for 2021-08-10
3,0,2021-08-01
4,0,2021-07-05
5,1,2021-08-10

create table dws.dws_app_user_dau(
  guid          bigint,
  isnew         int,
  first_acc_dt  string
)
partitioned by (dt string)
row format delimited fields terminated by ',';

load data local inpath '/root/dau.csv' into table dws.dws_app_user_dau partition(dt='2021-08-10');
Core logic
-- continuous-activity range records to produce for 2021-08-10

-- step 1: keep all closed ranges (end date != 9999-12-31) unchanged
1,2021-07-01,2021-07-01,2021-07-08
1,2021-07-01,2021-08-01,2021-08-02
2,2021-07-05,2021-07-05,2021-07-15
3,2021-08-01,2021-08-01,2021-08-05
3,2021-08-01,2021-08-08,2021-08-08
4,2021-07-05,2021-07-05,2021-07-15

-- step 2: take the open ranges (end date = 9999-12-31) and full join them with today's DAU data
--   rows present only on the left (no activity today): set range end = current date - 1 day
--   all other rows keep range end = 9999-12-31
--   (history side, open ranges)                        (today's DAU guid)
4,2021-07-05,2021-08-05,9999-12-31                      4
2,2021-07-05,2021-08-05,9999-12-31 -> 2021-08-09        null
null                                                    3   -> new range, end = 9999-12-31
null                                                    5   -> new range, end = 9999-12-31

-- step 3: union all the two parts
-- full-join cases:
--   1. left only, right null : not active today, close the range (update 9999-12-31 to current date - 1)
--   2. left and right both   : still continuously active, keep the range unchanged
--   3. left null, right only : new user or new activity, insert a new range with end = 9999-12-31
Code
with tmp1 as (
  select
    guid,
    first_acc_dt,
    rng_start_dt,
    rng_end_dt
  from dws.dws_app_user_ctnu_rng
  where dt='2021-08-09' and rng_end_dt != '9999-12-31'
),
tmp2 as (
  select
    nvl(his.guid, cur.guid)                 as guid,
    nvl(his.first_acc_dt, cur.first_acc_dt) as first_acc_dt,
    nvl(his.rng_start_dt, '2021-08-10')     as rng_start_dt,
    if(his.guid is not null and cur.guid is null,
       date_sub('2021-08-10', 1),
       '9999-12-31')                        as rng_end_dt
  from
    (select guid, first_acc_dt, rng_start_dt, rng_end_dt
     from dws.dws_app_user_ctnu_rng
     where dt='2021-08-09' and rng_end_dt = '9999-12-31') his
  full join
    (select * from dws.dws_app_user_dau where dt='2021-08-10') cur
  on his.guid = cur.guid
)
insert into table dws.dws_app_user_ctnu_rng partition(dt='2021-08-10')
select * from tmp1
union all
select * from tmp2;
Approach 2
Key idea: use a bitmap to record each user's activity over the last 30 days
Each bit position represents the user's activity state on the date that the bit index maps to: 0 means not active, 1 means active. In essence, the table records each user's daily activity state over the last N (31) days; bit-level queries on this table are sketched after the update SQL below
Table DDL
drop table test.active_user_day;
create table test.active_user_day(
  dt    string,
  guid  string
)
row format delimited fields terminated by ',';

load data local inpath '/root/dau.txt' into table test.active_user_day;
Computation: initialize the activity-state bitmap table
-- as of 2021-01-21
-- g001,358     000000000....111
-- g002,103
-- initialize the activity-state bitmap table
with tmp as (
  select
    guid,
    cast(sum(pow(2, datediff('2021-01-21', dt))) as int) as bitmap
  from test.active_user_day
  where dt between date_sub('2021-01-21', 30) and '2021-01-21'
  group by guid
)
Update the activity-state bitmap table
-- Update algorithm: join first and check whether the user was active today; if active, new bitmap = old bitmap * 2 + 1; if not active, new bitmap = old bitmap * 2
create table test.user_active_bitmap as
select
  nvl(a.guid, b.guid) as guid,
  case
    when a.guid is not null and b.guid is not null then (1073741823 & bitmap) * 2 + 1
    when a.guid is not null and b.guid is null     then (1073741823 & bitmap) * 2
    else 1
  end as bitmap
from tmp a
full join
  (select * from test.active_user_day where dt='2021-01-22') b
on a.guid = b.guid;
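The mask 1073741823 is 2^30 - 1: it drops the oldest bit before the shift so the bitmap always covers a rolling window. Once the table exists, daily-activity questions can be answered with bit operations; a minimal sketch using standard Hive functions (bin(), regexp_replace(), pow()) and the bitwise & operator:

-- number of active days in the window, per user: count the 1-bits in the bitmap
select
  guid,
  length(regexp_replace(bin(bitmap), '0', '')) as active_days
from test.user_active_bitmap;

-- was the user active exactly 7 days ago? test bit 7 with a mask
select
  guid,
  if((bitmap & cast(pow(2, 7) as int)) > 0, 1, 0) as active_7_days_ago
from test.user_active_bitmap;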
User retention analysis
Computed by querying two tables (a retention-rate sketch follows the code below)
Code
select
  '2022-09-09'                         as cal_dt,
  first_acc_dt,
  datediff('2022-09-09', first_acc_dt) as retention_days,
  count(1)                             as retention_cnt
from dws.dws_app_user_ctnu_rng
where dt='2022-09-09'
  and datediff('2022-09-09', first_acc_dt) <= 30
  and rng_end_dt='9999-12-31'
group by first_acc_dt, datediff('2022-09-09', first_acc_dt);
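The statement above yields retained-user counts per signup cohort. To turn them into retention rates, the cohort sizes (new users per day) are also needed; a minimal sketch, assuming cohort sizes can be taken from the daily active table dws.dws_app_user_dau with isnew = 1:

with retained as (
  select
    first_acc_dt,
    datediff('2022-09-09', first_acc_dt) as retention_days,
    count(1)                             as retention_cnt
  from dws.dws_app_user_ctnu_rng
  where dt='2022-09-09'
    and datediff('2022-09-09', first_acc_dt) <= 30
    and rng_end_dt='9999-12-31'
  group by first_acc_dt, datediff('2022-09-09', first_acc_dt)
),
cohort as (
  -- cohort size: new users per day, taken from the DAU partitions
  select dt as first_acc_dt, count(1) as new_cnt
  from dws.dws_app_user_dau
  where isnew = 1
  group by dt
)
select
  r.first_acc_dt,
  r.retention_days,
  r.retention_cnt,
  round(r.retention_cnt / c.new_cnt, 4) as retention_rate
from retained r
join cohort c on r.first_acc_dt = c.first_acc_dt;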
Funnel analysis
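No implementation is given here in the notes. A minimal sketch of a 3-step funnel (adShow -> productView -> addCart), assuming the same dwd.dwd_app_event_detail table used in the attribution example below: collect each user's time-ordered event string and match funnel prefixes with regular expressions.

with user_events as (
  select
    guid,
    -- events concatenated in time order, e.g. "ts1_adShow,ts2_productView,..."
    concat_ws(',', sort_array(collect_list(concat_ws('_', `timestamp`, eventid)))) as ev
  from dwd.dwd_app_event_detail
  where dt='2022-06-29'
    and eventid in ('adShow', 'productView', 'addCart')
  group by guid
)
select
  count(if(ev rlike 'adShow', 1, null))                       as step1_uv,
  count(if(ev rlike 'adShow.*productView', 1, null))          as step2_uv,
  count(if(ev rlike 'adShow.*productView.*addCart', 1, null)) as step3_uv
from user_events;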
Event attribution analysis
Determine which causes or events contributed to the achievement of a business goal, and what share of the credit each of them carries
Attribution calculation strategies
First-touch attribution: among the events to be attributed, the earliest one is considered the sole cause of the business result
Last-touch attribution: among the events to be attributed, the most recent one is considered the sole cause of the business result (see the sketch after the linear-attribution code below)
Linear attribution: every event to be attributed is considered to have influenced the business result, and the credit is split evenly
Position-based attribution: credit is assigned by a rule, for example the earliest and latest events each take 40% of the credit and the events in between share the rest evenly
Time-decay attribution: the later an event occurs, the more credit it receives for the business result
Linear attribution code
package com.xxx.dataetl.etl

import org.apache.commons.lang3.StringUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

import scala.collection.mutable

/**
 * @author cyx
 * @date 2022/7/7
 */
object EventAttribute {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("event attribution analysis")
      .enableHiveSupport()
      .master("local")
      .config("spark.sql.shuffle.partitions", 1)
      .getOrCreate()
    import spark.implicits._

    val events = spark.sql(
      """
        |select
        |  guid,
        |  sort_array(collect_list(concat_ws('_',timestamp,eventid))) as events
        |from
        |  dwd.dwd_app_event_detail
        |where
        |  dt='2022-06-29'
        |and
        |  (  (eventid='fetchCoupon')
        |  or (eventid='adShow')
        |  or (eventid='productView')
        |  or (eventid='addCart')
        |  )
        |group by guid
        |""".stripMargin)

    val rdd: RDD[(Long, String, String)] = events.rdd.flatMap(row => {
      // get the guid
      val guid = row.getAs[Long](0)
      // get the collected field; it comes back as a mutable.WrappedArray
      val events = row.getAs[mutable.WrappedArray[String]]("events")
      // keep only the eventid of each element and join the list into one string
      val str = events.map(s => s.split("_")(1)).mkString(",")
      // split on the target event name
      val eventStrArray = str.split("fetchCoupon")
        .filter(s => StringUtils.isNotBlank(s.replaceAll(",", "")))
      eventStrArray.map(str => (guid, "fetchCoupon", str.substring(0, str.length - 1)))
    })

    // compute the attribution weights
    val res = linearAttribute(rdd).toDF("strategy", "guid", "dest_event", "attr_event", "weight")
    res.show()

    spark.close()
  }

  /**
   * Linear attribution strategy
   *
   * @param rdd (guid, target event, comma-separated preceding events)
   * @return (strategy, guid, target event, attributed event, weight)
   */
  def linearAttribute(rdd: RDD[(Long, String, String)]): RDD[(String, Long, String, String, Double)] = {
    rdd.flatMap(tp => {
      val guid = tp._1
      val destEvent = tp._2
      val events = tp._3
      val eventArr = events.split(",").filter(s => StringUtils.isNotBlank(s))
      val size = eventArr.length // number of events sharing the credit
      eventArr.map(e => ("linear attribution", guid, destEvent, e, 100.0 / size))
    })
  }
}
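For comparison, a minimal sketch of the last-touch strategy from the list above, under the same RDD shape (guid, target event, comma-separated preceding events); the helper name lastTouchAttribute is not from the original notes:

  /** Last-touch attribution: the most recent event before the target gets all the credit */
  def lastTouchAttribute(rdd: RDD[(Long, String, String)]): RDD[(String, Long, String, String, Double)] = {
    rdd.flatMap(tp => {
      val (guid, destEvent, events) = tp
      val eventArr = events.split(",").filter(s => StringUtils.isNotBlank(s))
      // only the last event in the time-ordered sequence carries weight, and it gets 100%
      eventArr.lastOption.map(e => ("last-touch attribution", guid, destEvent, e, 100.0))
    })
  }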
Order amount analysis
Fact table: oms_order
Core measures
GMV (GMV covers all submitted orders: the order amount)
Payable / actually paid amount
Coupon deduction amount (some orders use coupons)
Promotion discount amount
Points deduction amount
Analysis dimensions
Time period (hour, day, week, month, quarter, ...)
Member level
Order type (regular order, group-buy order, flash-sale order)
Order source (PC, app, WeChat mini program, H5)
Join statement (a fuller sketch follows the skeleton below)
select .... from oms_order (order main table) join ums_member (member info) join ums_member_level (level definition table) group by .....
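A minimal sketch of that join for GMV and coupon deduction by hour and member level; the dwd-layer prefixes and the member-table column names (id, member_level_id, level_name) are assumptions, while total_amount, coupon_amount and create_time match the profile SQL used later:

select
  date_format(od.create_time, 'yyyy-MM-dd HH') as hour_seg,
  lv.level_name                                as member_level,   -- assumed column
  sum(od.total_amount)                         as gmv,            -- all submitted orders
  sum(od.coupon_amount)                        as coupon_amount
from dwd.oms_order od
join dwd.ums_member m        on od.member_id = m.id               -- assumed join keys
join dwd.ums_member_level lv on m.member_level_id = lv.id
where od.dt = '2021-01-21'
group by date_format(od.create_time, 'yyyy-MM-dd HH'), lv.level_name;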
Analysis of order count, ordering-user count, return-order count, returning-user count and returned-item count (dws.dws_oms_order_and_return)
Order count (oms_order main table): count(1)
Ordering-user count (oms_order main table): count(distinct member_id)
Cancelled-order count (oms_order main table): count(if(status = 5, 1, null))
Cancelling-user count (oms_order main table): count(distinct if(status = 5, member_id, null))
Return-order count (joined table): count(if(right table's orderid is not null, 1, null)) (statistical caliber: the return-apply date is used as the time condition)
Returning-user count (joined table): count(distinct if(right table's orderid is not null, memberid, null))
Returned-item count (joined table): sum of the right table's returned-item quantity
Return amount (joined table): sum of the right table's return amount
Analysis dimensions
Time period (hour, day, week, month, quarter, ...)
Member level
Order type (regular order, group-buy order, flash-sale order)
Order source (PC, app, WeChat mini program, H5)
Design a DWS-layer service table: order table + return-apply record table + member level info
First, aggregate from the return-apply table, per returned order: the returned-item quantity, return amount and apply time
Order main table left join the aggregated table above (in a sense, a dimension table); see the sketch below
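A minimal sketch of that build, reusing dwd.oms_order and dwd.oms_order_return_apply; the return_quantity and return-apply create_time column names, as well as the target table's exact column list, are assumptions:

-- step 1: aggregate the return-apply records per order
with rt as (
  select
    order_id,
    sum(return_quantity) as return_quantity,   -- assumed column name
    sum(return_amount)   as return_amount,
    min(create_time)     as return_apply_time  -- assumed column name
  from dwd.oms_order_return_apply
  where dt = '2021-01-21'
  group by order_id
)
-- step 2: order main table left join the aggregation, keeping the dimension columns needed downstream
insert overwrite table dws.dws_oms_order_and_return partition(dt='2021-01-21')
select
  od.id,
  od.member_id,
  od.status,
  od.create_time,
  rt.order_id          as return_order_id,
  rt.return_quantity,
  rt.return_amount,
  rt.return_apply_time
from (select * from dwd.oms_order where dt = '2021-01-21') od
left join rt on od.id = rt.order_id;
-- the ADS layer then applies the count()/sum() expressions listed above on this table,
-- grouped by time period, member level, order type and order source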
Repurchase analysis (see the sketch after this list)
Shopping-cart analysis DWS table: dws.dws_oms_cart_detail
Coupon analysis: number of coupons claimed, number used, number of users who used them
Flash-sale analysis by multiple dimensions: subscriber count, completed-order count, completed-order amount
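A minimal sketch of a 30-day repurchase rate on dwd.oms_order: users with two or more orders divided by users with at least one order. Whether cancelled or returned orders should be excluded is a statistical-caliber decision not settled here.

with user_orders as (
  select
    member_id,
    count(1) as order_cnt
  from dwd.oms_order
  where dt = '2021-01-21'
    and datediff(dt, to_date(create_time)) < 30
  group by member_id
)
select
  count(if(order_cnt >= 2, 1, null))                      as repurchase_users,
  count(1)                                                as purchase_users,
  round(count(if(order_cnt >= 2, 1, null)) / count(1), 4) as repurchase_rate
from user_orders;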
User consumption-statistics profile table
Tag users with consumption-related statistical labels (orders, returns, amounts, average order value)
Table DDL
drop table if exists ads_user_order_tag;
create table ads_user_order_tag(
  member_id                 bigint , -- user id
  first_order_time          string , -- first order date
  last_order_time           string , -- last order date
  first_order_ago           bigint , -- days since first order
  last_order_ago            bigint , -- days since last order
  month1_order_cnt          bigint , -- order count in the last 30 days
  month1_order_amt          double , -- purchase amount in the last 30 days (total)
  month2_order_cnt          bigint , -- order count in the last 60 days
  month2_order_amt          double , -- purchase amount in the last 60 days
  month3_order_cnt          bigint , -- order count in the last 90 days
  month3_order_amt          double , -- purchase amount in the last 90 days
  max_order_amt             double , -- largest order amount
  min_order_amt             double , -- smallest order amount
  total_order_cnt           bigint , -- cumulative order count (excluding returns/rejections)
  total_order_amt           double , -- cumulative order amount (excluding returns/rejections)
  total_coupon_amt          double , -- cumulative coupon amount used
  user_avg_order_amt        double , -- average order amount (including returns/rejections)
  month3_user_avg_amt       double , -- average order amount in the last 90 days (including returns/rejections)
  common_address            string , -- most-used shipping address
  common_paytype            string , -- most-used payment method
  month1_cart_goods_cnt_30  bigint , -- cart items added in the last 30 days
  month1_cart_goods_amt_30  bigint , -- cart amount added in the last 30 days
  month1_cart_cancel_cnt    bigint , -- cart items removed in the last 30 days
  month1_cart_cancel_amt    bigint , -- cart amount removed in the last 30 days
  dw_date                   string   -- calculation date
)
partitioned by (dt string);
Compute the data
-- order amounts, order counts, etc.
with tmp1 as (
  SELECT
    od.member_id,
    to_date(min(od.create_time)) as first_order_time,
    to_date(max(od.create_time)) as last_order_time,
    datediff(od.dt, to_date(min(od.create_time))) as first_order_ago,
    datediff(od.dt, to_date(max(od.create_time))) as last_order_ago,
    count(if(datediff(od.dt,to_date(od.create_time))<30, 1, null))          as month1_order_cnt,
    sum(if(datediff(od.dt,to_date(od.create_time))<30, od.total_amount, 0)) as month1_order_amt,
    count(if(datediff(od.dt,to_date(od.create_time))<60, 1, null))          as month2_order_cnt,
    sum(if(datediff(od.dt,to_date(od.create_time))<60, od.total_amount, 0)) as month2_order_amt,
    count(if(datediff(od.dt,to_date(od.create_time))<90, 1, null))          as month3_order_cnt,
    sum(if(datediff(od.dt,to_date(od.create_time))<90, od.total_amount, 0)) as month3_order_amt,
    max(od.total_amount)                            as max_order_amt,
    min(od.total_amount)                            as min_order_amt,
    count(if(rt.order_id is null, 1, null))         as total_order_cnt,
    sum(od.total_amount - nvl(rt.return_amount, 0)) as total_order_amt,
    sum(od.coupon_amount)                           as total_coupon_amt,
    avg(od.total_amount)                            as user_avg_order_amt,
    avg(if(datediff(od.dt,to_date(od.create_time))<90, total_amount, null)) as month3_user_avg_amt
  FROM (
    SELECT * FROM dwd.oms_order WHERE dt='2021-01-21'
  ) od
  LEFT JOIN (
    SELECT order_id, sum(return_amount) as return_amount
    FROM dwd.oms_order_return_apply
    WHERE dt='2021-01-21'
    GROUP BY order_id
  ) rt
  ON od.id = rt.order_id
  GROUP BY od.member_id, od.dt
)
-- most-used shipping address
,tmp2 as (
  SELECT member_id, addr as common_address
  FROM (
    SELECT
      member_id,
      concat_ws(',',receiver_province,receiver_city,receiver_region,receiver_detail_address) as addr,
      row_number() over(PARTITION BY member_id ORDER BY count(1) desc) as rn
    FROM (
      SELECT * FROM dwd.oms_order WHERE dt='2021-01-21'
    ) od
    GROUP BY member_id, concat_ws(',',receiver_province,receiver_city,receiver_region,receiver_detail_address)
  ) o1
  WHERE rn=1
)
-- most-used payment method
,tmp3 as (
  SELECT member_id, pay_type as common_paytype
  FROM (
    SELECT
      member_id,
      pay_type,
      row_number() over(PARTITION BY member_id ORDER BY count(1) desc) as rn
    FROM (
      SELECT * FROM dwd.oms_order WHERE dt='2021-01-21'
    ) od
    GROUP BY member_id, pay_type
  ) o1
  WHERE rn=1
)
-- shopping cart
,tmp4 as (
  SELECT
    member_id,
    sum(quantity)                         as month1_cart_goods_cnt,
    sum(if(delete_status=1, quantity, 0)) as month1_cart_cancel_cnt
  FROM (
    SELECT * FROM dwd.oms_cart_item
    WHERE dt='2021-01-21' and datediff(dt, to_date(create_date)) < 30
  ) ct
  GROUP BY member_id
)
-- all member_id values present in either the order table or the cart table
,ids as (
  SELECT member_id FROM dwd.oms_order WHERE dt='2021-01-21' GROUP BY member_id
  UNION
  SELECT member_id FROM dwd.oms_cart_item
  WHERE dt='2021-01-21' and datediff(dt, to_date(create_date)) < 30
  GROUP BY member_id
)
-- put everything together
INSERT INTO TABLE dws.profile_order_tag PARTITION(dt='2021-01-21')
SELECT
  ids.member_id,
  tmp1.first_order_time,
  tmp1.last_order_time,
  tmp1.first_order_ago,
  tmp1.last_order_ago,
  tmp1.month1_order_cnt,
  tmp1.month1_order_amt,
  tmp1.month2_order_cnt,
  tmp1.month2_order_amt,
  tmp1.month3_order_cnt,
  tmp1.month3_order_amt,
  tmp1.max_order_amt,
  tmp1.min_order_amt,
  tmp1.total_order_cnt,
  tmp1.total_order_amt,
  tmp1.total_coupon_amt,
  tmp1.user_avg_order_amt,
  tmp1.month3_user_avg_amt,
  tmp2.common_address,
  tmp3.common_paytype,
  tmp4.month1_cart_goods_cnt,
  tmp4.month1_cart_cancel_cnt
FROM ids
LEFT JOIN tmp1 on ids.member_id = tmp1.member_id
LEFT JOIN tmp2 on ids.member_id = tmp2.member_id
LEFT JOIN tmp3 on ids.member_id = tmp3.member_id
LEFT JOIN tmp4 on ids.member_id = tmp4.member_id;
Writing the user-profile data into HBase
package cn.xxx.dataload

import org.apache.commons.lang3.StringUtils
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

import scala.collection.mutable.ListBuffer

/**
 * The HBase table must be created first:
 * hbase(main):002:0> create 'user_profile','f'
 *
 * @desc generate HFiles for the user-profile table (activity-attribute tags)
 *       command to load the generated HFiles into HBase:
 *       hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /hfile_tmp/user_profile/ user_profile
 */
object UserProfileLoader {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val spark = SparkSession.builder()
      .appName("user profile hfile generation")
      .config("spark.sql.shuffle.partitions", 2)
      .master("local")
      .enableHiveSupport()
      .getOrCreate()

    /**
     * Data preparation
     * Step 1: turn the input data into a sorted RDD[(ImmutableBytesWritable, KeyValue)]
     *         guid -> guid:f:tag name:tag value
     */
    // read the activity-attribute tag table
    val hivetable1 = spark.read.table("dws.user_tags_active")
    val tags1: RDD[(String, String, String, String)] = hivetable1.rdd.flatMap({
      case Row(
        guid: Long,
        first_dt: String,
        pv_30: Int,
        se_30: Int,
        se_per_day_30: Float,
        pv_per_se_30: Float,
        act_days_30: Int,
        max_ctu_30: Int,
        max_slc_30: Int,
        ol_time_amt: Long,
        ol_time_per_day: Float,
        dt: String
      ) => {
        val buffer: ListBuffer[(String, String, String, String)] = new ListBuffer()
        val rk = StringUtils.leftPad(guid + "", 8, "0")
        buffer += ((rk, "f", "first_dt", first_dt))
        buffer += ((rk, "f", "pv_30", pv_30.toString))
        buffer += ((rk, "f", "se_30", se_30.toString))
        buffer += ((rk, "f", "se_per_day_30", se_per_day_30.toString))
        buffer += ((rk, "f", "pv_per_se_30", pv_per_se_30.toString))
        buffer += ((rk, "f", "act_days_30", act_days_30.toString))
        buffer += ((rk, "f", "max_ctu_30", max_ctu_30.toString))
        buffer += ((rk, "f", "max_slc_30", max_slc_30.toString))
        buffer += ((rk, "f", "ol_time_amt", ol_time_amt.toString))
        buffer += ((rk, "f", "ol_time_per_day", ol_time_per_day.toString))
        buffer
      }
    })

    // read the interaction-behavior tag table
    val hivetable2 = spark.read.table("dws.user_tags_interact")
    val tags2: RDD[(String, String, String, String)] = hivetable2.rdd.flatMap(row => {
      val guid = row.getAs[Long]("guid")
      val dzs_30 = row.getAs[Int]("dzs_30")
      val zfs_30 = row.getAs[Int]("zfs_30")
      val scs_30 = row.getAs[Int]("scs_30")
      val fxs_30 = row.getAs[Float]("fxs_30")
      val hps_30 = row.getAs[Float]("hps_30")
      val cps_30 = row.getAs[Int]("cps_30")
      val gzs_30 = row.getAs[Int]("gzs_30")
      val dzs_14 = row.getAs[Int]("dzs_14")
      val zfs_14 = row.getAs[Int]("zfs_14")
      val scs_14 = row.getAs[Int]("scs_14")
      val fxs_14 = row.getAs[Float]("fxs_14")
      val hps_14 = row.getAs[Float]("hps_14")
      val cps_14 = row.getAs[Int]("cps_14")
      val gzs_14 = row.getAs[Int]("gzs_14")
      val dzs_7 = row.getAs[Int]("dzs_7")
      val zfs_7 = row.getAs[Int]("zfs_7")
      val scs_7 = row.getAs[Int]("scs_7")
      val fxs_7 = row.getAs[Float]("fxs_7")
      val hps_7 = row.getAs[Float]("hps_7")
      val cps_7 = row.getAs[Int]("cps_7")
      val gzs_7 = row.getAs[Int]("gzs_7")

      val lstBuffer = new ListBuffer[(String, String, String, String)]()
      val rk = StringUtils.leftPad(guid + "", 8, "0")
      lstBuffer += ((rk, "f", "dzs_30", dzs_30 + ""))
      lstBuffer += ((rk, "f", "zfs_30", zfs_30 + ""))
      lstBuffer += ((rk, "f", "scs_30", scs_30 + ""))
      lstBuffer += ((rk, "f", "fxs_30", fxs_30 + ""))
      lstBuffer += ((rk, "f", "hps_30", hps_30 + ""))
      lstBuffer += ((rk, "f", "cps_30", cps_30 + ""))
      lstBuffer += ((rk, "f", "gzs_30", gzs_30 + ""))
      lstBuffer += ((rk, "f", "dzs_14", dzs_14 + ""))
      lstBuffer += ((rk, "f", "zfs_14", zfs_14 + ""))
      lstBuffer += ((rk, "f", "scs_14", scs_14 + ""))
      lstBuffer += ((rk, "f", "fxs_14", fxs_14 + ""))
      lstBuffer += ((rk, "f", "hps_14", hps_14 + ""))
      lstBuffer += ((rk, "f", "cps_14", cps_14 + ""))
      lstBuffer += ((rk, "f", "gzs_14", gzs_14 + ""))
      lstBuffer += ((rk, "f", "dzs_7", dzs_7 + ""))
      lstBuffer += ((rk, "f", "zfs_7", zfs_7 + ""))
      lstBuffer += ((rk, "f", "scs_7", scs_7 + ""))
      lstBuffer += ((rk, "f", "fxs_7", fxs_7 + ""))
      lstBuffer += ((rk, "f", "hps_7", hps_7 + ""))
      lstBuffer += ((rk, "f", "cps_7", cps_7 + ""))
      lstBuffer += ((rk, "f", "gzs_7", gzs_7 + ""))
      lstBuffer
    })

    // union all tag 4-tuples and sort by (rowkey, family, qualifier)
    val sorted = tags1.union(tags2)
      .sortBy(tp => {
        (tp._1, tp._2, tp._3)
      })

    val data: RDD[(ImmutableBytesWritable, KeyValue)] = sorted
      .map(tp => {
        (new ImmutableBytesWritable(Bytes.toBytes(tp._1)),
          new KeyValue(Bytes.toBytes(tp._1), Bytes.toBytes(tp._2), Bytes.toBytes(tp._3), Bytes.toBytes(tp._4)))
      })

    /**
     * Write the prepared data out as HFile files
     */
    // job configuration
    val conf = HBaseConfiguration.create()
    conf.set("fs.defaultFS", "hdfs://node01:8020")
    conf.set("hbase.zookeeper.quorum", "node01,node02,node03")
    val job = Job.getInstance(conf)

    // build the HBase table connection
    val conn = ConnectionFactory.createConnection(conf)
    val tableName = TableName.valueOf("user_profile")
    val table = conn.getTable(tableName)
    val locator = conn.getRegionLocator(tableName)

    // output
    HFileOutputFormat2.configureIncrementalLoad(job, table, locator)
    data.saveAsNewAPIHadoopFile("/hfile_tmp/user_profile/",
      classOf[ImmutableBytesWritable],
      classOf[KeyValue],
      classOf[HFileOutputFormat2],
      job.getConfiguration)

    /**
     * Load the generated HFile files into HBase with a bulk-load tool
     */
    //new BulkLoadHFilesTool(conf).bulkLoad(tableName, new Path("/hfile_tmp/user_profile/"))
    //new LoadIncrementalHFiles(job.getConfiguration).doBulkLoad(new Path("/hfile_tmp/user_profile/"), conn.getAdmin, table, locator)

    conn.close()
  }
}