A data warehouse is a strategic collection of data, drawn from all of an enterprise's systems, that supports every decision-making process. Building one involves operations on the data such as cleansing, transformation, classification, merging, splitting and aggregation.
A. Requirements analysis
1) Build the data collection platform
2) Build the layered user-behavior data warehouse
3) Build the layered business data warehouse
4) Produce report analyses on the warehouse data: retention, conversion rate, GMV, repurchase rate, active users, and so on.
B. Technology selection
1) Data collection and transport: Flume, Kafka, Sqoop
2) Data storage: MySQL, HDFS
3) Data computation: Hive, Tez, Spark
4) Data query: Presto, Druid
C. System data-flow design
D. Framework version selection
E. Server selection:
physical machines or cloud hosts
F. Cluster resource planning
1) Determine the cluster size
2) Plan the server layout
This project uses three virtual machines to simulate a test cluster.
A. Basic format of the tracking (buried-point) data
B. Event log data
1) Goods list page
2) Goods click
3) Goods detail page
4) Ads
5) Notifications
6) User foreground activity
7) User background activity
8) Comments
9) Favorites
10) Likes
11) Error logs
C. Start-up log data
Its format differs from the event logs and its structure is simpler.
D. Data-generation program
a) Create a Maven project named log-collector
b) Create the base package bigdata.dw (the code below uses the sub-packages bigdata.dw.appclient and bigdata.dw.bean)
c) Create a class named AppMain in the bigdata.dw.appclient package
d) Add the following to the project's pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>bigdata.dw</groupId>
    <artifactId>log-collector</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <slf4j.version>1.7.20</slf4j.version>
        <logback.version>1.0.7</logback.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.51</version>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-core</artifactId>
            <version>${logback.version}</version>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
            <version>${logback.version}</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>bigdata.dw.appclient.AppMain</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
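The packaging step itself is not shown here; assuming a standard Maven installation, running the following from the log-collector project root should produce the fat jar used in the deployment steps later on:
# Output: target/log-collector-1.0-SNAPSHOT-jar-with-dependencies.jar
mvn clean package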
2) Simulate log-data generation
a) Write the bean classes
Create the bigdata.dw.bean package and write the bean classes in it.
b) Write the main class
Add the following code to the AppMain class:
package bigdata.dw.appclient;
import java.io.UnsupportedEncodingException;
import java.util.Random;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import bigdata.dw.bean.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* 日志行为数据模拟
*/
public class AppMain {
private final static Logger logger = LoggerFactory.getLogger(AppMain.class);
private static Random rand = new Random();
// 设备id
private static int s_mid = 0;
// 用户id
private static int s_uid = 0;
// 商品id
private static int s_goodsid = 0;
public static void main(String[] args) {
// 参数一:控制发送每条的延时时间,默认是0
Long delay = args.length > 0 ? Long.parseLong(args[0]) : 0L;
// 参数二:循环遍历次数
int loop_len = args.length > 1 ? Integer.parseInt(args[1]) : 1000;
// 生成数据
generateLog(delay, loop_len);
}
private static void generateLog(Long delay, int loop_len) {
for (int i = 0; i < loop_len; i++) {
int flag = rand.nextInt(2);
switch (flag) {
case (0):
//应用启动
AppStart appStart = generateStart();
String jsonString = JSON.toJSONString(appStart);
//控制台打印
logger.info(jsonString);
break;
case (1):
JSONObject json = new JSONObject();
json.put("ap", "app");
json.put("cm", generateComFields());
JSONArray eventsArray = new JSONArray();
// 事件日志
// 商品点击,展示
if (rand.nextBoolean()) {
eventsArray.add(generateDisplay());
json.put("et", eventsArray);
}
// 商品详情页
if (rand.nextBoolean()) {
eventsArray.add(generateNewsDetail());
json.put("et", eventsArray);
}
// 商品列表页
if (rand.nextBoolean()) {
eventsArray.add(generateNewList());
json.put("et", eventsArray);
}
// 广告
if (rand.nextBoolean()) {
eventsArray.add(generateAd());
json.put("et", eventsArray);
}
// 消息通知
if (rand.nextBoolean()) {
eventsArray.add(generateNotification());
json.put("et", eventsArray);
}
// 用户前台活跃
if (rand.nextBoolean()) {
eventsArray.add(generateForeground());
json.put("et", eventsArray);
}
// 用户后台活跃
if (rand.nextBoolean()) {
eventsArray.add(generateBackground());
json.put("et", eventsArray);
}
//故障日志
if (rand.nextBoolean()) {
eventsArray.add(generateError());
json.put("et", eventsArray);
}
// 用户评论
if (rand.nextBoolean()) {
eventsArray.add(generateComment());
json.put("et", eventsArray);
}
// 用户收藏
if (rand.nextBoolean()) {
eventsArray.add(generateFavorites());
json.put("et", eventsArray);
}
// 用户点赞
if (rand.nextBoolean()) {
eventsArray.add(generatePraise());
json.put("et", eventsArray);
}
//时间
long millis = System.currentTimeMillis();
//控制台打印
logger.info(millis + "|" + json.toJSONString());
break;
}
// 延迟
try {
Thread.sleep(delay);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
/**
* 公共字段设置
*/
private static JSONObject generateComFields() {
AppBase appBase = new AppBase();
//设备id
appBase.setMid(s_mid + "");
s_mid++;
// 用户id
appBase.setUid(s_uid + "");
s_uid++;
// 程序版本号 5,6等
appBase.setVc("" + rand.nextInt(20));
//程序版本名 v1.1.1
appBase.setVn("1." + rand.nextInt(4) + "." + rand.nextInt(10));
// 安卓系统版本
appBase.setOs("8." + rand.nextInt(3) + "." + rand.nextInt(10));
// 语言 es,en,pt
int flag = rand.nextInt(3);
switch (flag) {
case (0):
appBase.setL("es");
break;
case (1):
appBase.setL("en");
break;
case (2):
appBase.setL("pt");
break;
}
// 渠道号 从哪个渠道来的
appBase.setSr(getRandomChar(1));
// 区域
flag = rand.nextInt(2);
switch (flag) {
case 0:
appBase.setAr("BR");
break;
case 1:
appBase.setAr("MX");
break;
}
// 手机品牌 ba ,手机型号 md,就取2位数字了
flag = rand.nextInt(3);
switch (flag) {
case 0:
appBase.setBa("Sumsung");
appBase.setMd("sumsung-" + rand.nextInt(20));
break;
case 1:
appBase.setBa("Huawei");
appBase.setMd("Huawei-" + rand.nextInt(20));
break;
case 2:
appBase.setBa("HTC");
appBase.setMd("HTC-" + rand.nextInt(20));
break;
}
// 嵌入sdk的版本
appBase.setSv("V2." + rand.nextInt(10) + "." + rand.nextInt(10));
// gmail
appBase.setG(getRandomCharAndNumr(8) + "@gmail.com");
// 屏幕宽高 hw
flag = rand.nextInt(4);
switch (flag) {
case 0:
appBase.setHw("640*960");
break;
case 1:
appBase.setHw("640*1136");
break;
case 2:
appBase.setHw("750*1134");
break;
case 3:
appBase.setHw("1080*1920");
break;
}
// 客户端产生日志时间
long millis = System.currentTimeMillis();
appBase.setT("" + (millis - rand.nextInt(99999999)));
// 手机网络模式 3G,4G,WIFI
flag = rand.nextInt(3);
switch (flag) {
case 0:
appBase.setNw("3G");
break;
case 1:
appBase.setNw("4G");
break;
case 2:
appBase.setNw("WIFI");
break;
}
// 拉丁美洲 西经34°46′至西经117°09;北纬32°42′至南纬53°54′
// 经度
appBase.setLn((-34 - rand.nextInt(83) - rand.nextInt(60) / 10.0) + "");
// 纬度
appBase.setLa((32 - rand.nextInt(85) - rand.nextInt(60) / 10.0) + "");
return (JSONObject) JSON.toJSON(appBase);
}
/**
* 商品展示事件
*/
private static JSONObject generateDisplay() {
AppDisplay appDisplay = new AppDisplay();
boolean boolFlag = rand.nextInt(10) < 7;
// 动作:曝光商品=1,点击商品=2,
if (boolFlag) {
appDisplay.setAction("1");
} else {
appDisplay.setAction("2");
}
// 商品id
String goodsId = s_goodsid + "";
s_goodsid++;
appDisplay.setGoodsid(goodsId);
// 顺序 设置成6条吧
int flag = rand.nextInt(6);
appDisplay.setPlace("" + flag);
// 曝光类型
flag = 1 + rand.nextInt(2);
appDisplay.setExtend1("" + flag);
// 分类
flag = 1 + rand.nextInt(100);
appDisplay.setCategory("" + flag);
JSONObject jsonObject = (JSONObject) JSON.toJSON(appDisplay);
return packEventJson("display", jsonObject);
}
/**
* 商品详情页
*/
private static JSONObject generateNewsDetail() {
AppNewsDetail appNewsDetail = new AppNewsDetail();
// 页面入口来源
int flag = 1 + rand.nextInt(3);
appNewsDetail.setEntry(flag + "");
// 动作
appNewsDetail.setAction("" + (rand.nextInt(4) + 1));
// 商品id
appNewsDetail.setGoodsid(s_goodsid + "");
// 商品来源类型
flag = 1 + rand.nextInt(3);
appNewsDetail.setShowtype(flag + "");
// 商品样式
flag = rand.nextInt(6);
appNewsDetail.setShowtype("" + flag);
// 页面停留时长
flag = rand.nextInt(10) * rand.nextInt(7);
appNewsDetail.setNews_staytime(flag + "");
// 加载时长
flag = rand.nextInt(10) * rand.nextInt(7);
appNewsDetail.setLoading_time(flag + "");
// 加载失败码
flag = rand.nextInt(10);
switch (flag) {
case 1:
appNewsDetail.setType1("102");
break;
case 2:
appNewsDetail.setType1("201");
break;
case 3:
appNewsDetail.setType1("325");
break;
case 4:
appNewsDetail.setType1("433");
break;
case 5:
appNewsDetail.setType1("542");
break;
default:
appNewsDetail.setType1("");
break;
}
// 分类
flag = 1 + rand.nextInt(100);
appNewsDetail.setCategory("" + flag);
JSONObject eventJson = (JSONObject) JSON.toJSON(appNewsDetail);
return packEventJson("newsdetail", eventJson);
}
/**
* 商品列表
*/
private static JSONObject generateNewList() {
AppLoading appLoading = new AppLoading();
// 动作
int flag = rand.nextInt(3) + 1;
appLoading.setAction(flag + "");
// 加载时长
flag = rand.nextInt(10) * rand.nextInt(7);
appLoading.setLoading_time(flag + "");
// 失败码
flag = rand.nextInt(10);
switch (flag) {
case 1:
appLoading.setType1("102");
break;
case 2:
appLoading.setType1("201");
break;
case 3:
appLoading.setType1("325");
break;
case 4:
appLoading.setType1("433");
break;
case 5:
appLoading.setType1("542");
break;
default:
appLoading.setType1("");
break;
}
// 页面 加载类型
flag = 1 + rand.nextInt(2);
appLoading.setLoading_way("" + flag);
// 扩展字段1
appLoading.setExtend1("");
// 扩展字段2
appLoading.setExtend2("");
// 用户加载类型
flag = 1 + rand.nextInt(3);
appLoading.setType("" + flag);
JSONObject jsonObject = (JSONObject) JSON.toJSON(appLoading);
return packEventJson("loading", jsonObject);
}
/**
* 广告相关字段
*/
private static JSONObject generateAd() {
AppAd appAd = new AppAd();
// 入口
int flag = rand.nextInt(3) + 1;
appAd.setEntry(flag + "");
// 动作
flag = rand.nextInt(5) + 1;
appAd.setAction(flag + "");
// 状态
flag = rand.nextInt(10) > 6 ? 2 : 1;
appAd.setContent(flag + "");
// 失败码
flag = rand.nextInt(10);
switch (flag) {
case 1:
appAd.setDetail("102");
break;
case 2:
appAd.setDetail("201");
break;
case 3:
appAd.setDetail("325");
break;
case 4:
appAd.setDetail("433");
break;
case 5:
appAd.setDetail("542");
break;
default:
appAd.setDetail("");
break;
}
// 广告来源
flag = rand.nextInt(4) + 1;
appAd.setSource(flag + "");
// 用户行为
flag = rand.nextInt(2) + 1;
appAd.setBehavior(flag + "");
// 商品类型
flag = rand.nextInt(10);
appAd.setNewstype("" + flag);
// 展示样式
flag = rand.nextInt(6);
appAd.setShow_style("" + flag);
JSONObject jsonObject = (JSONObject) JSON.toJSON(appAd);
return packEventJson("ad", jsonObject);
}
/**
* 启动日志
*/
private static AppStart generateStart() {
AppStart appStart = new AppStart();
//设备id
appStart.setMid(s_mid + "");
s_mid++;
// 用户id
appStart.setUid(s_uid + "");
s_uid++;
// 程序版本号 5,6等
appStart.setVc("" + rand.nextInt(20));
//程序版本名 v1.1.1
appStart.setVn("1." + rand.nextInt(4) + "." + rand.nextInt(10));
// 安卓系统版本
appStart.setOs("8." + rand.nextInt(3) + "." + rand.nextInt(10));
//设置日志类型
appStart.setEn("start");
// 语言 es,en,pt
int flag = rand.nextInt(3);
switch (flag) {
case (0):
appStart.setL("es");
break;
case (1):
appStart.setL("en");
break;
case (2):
appStart.setL("pt");
break;
}
// 渠道号 从哪个渠道来的
appStart.setSr(getRandomChar(1));
// 区域
flag = rand.nextInt(2);
switch (flag) {
case 0:
appStart.setAr("BR");
break;
case 1:
appStart.setAr("MX");
break;
}
// 手机品牌 ba ,手机型号 md,就取2位数字了
flag = rand.nextInt(3);
switch (flag) {
case 0:
appStart.setBa("Sumsung");
appStart.setMd("sumsung-" + rand.nextInt(20));
break;
case 1:
appStart.setBa("Huawei");
appStart.setMd("Huawei-" + rand.nextInt(20));
break;
case 2:
appStart.setBa("HTC");
appStart.setMd("HTC-" + rand.nextInt(20));
break;
}
// 嵌入sdk的版本
appStart.setSv("V2." + rand.nextInt(10) + "." + rand.nextInt(10));
// gmail
appStart.setG(getRandomCharAndNumr(8) + "@gmail.com");
// 屏幕宽高 hw
flag = rand.nextInt(4);
switch (flag) {
case 0:
appStart.setHw("640*960");
break;
case 1:
appStart.setHw("640*1136");
break;
case 2:
appStart.setHw("750*1134");
break;
case 3:
appStart.setHw("1080*1920");
break;
}
// 客户端产生日志时间
long millis = System.currentTimeMillis();
appStart.setT("" + (millis - rand.nextInt(99999999)));
// 手机网络模式 3G,4G,WIFI
flag = rand.nextInt(3);
switch (flag) {
case 0:
appStart.setNw("3G");
break;
case 1:
appStart.setNw("4G");
break;
case 2:
appStart.setNw("WIFI");
break;
}
// 拉丁美洲 西经34°46′至西经117°09;北纬32°42′至南纬53°54′
// 经度
appStart.setLn((-34 - rand.nextInt(83) - rand.nextInt(60) / 10.0) + "");
// 纬度
appStart.setLa((32 - rand.nextInt(85) - rand.nextInt(60) / 10.0) + "");
// 入口
flag = rand.nextInt(5) + 1;
appStart.setEntry(flag + "");
// 开屏广告类型
flag = rand.nextInt(2) + 1;
appStart.setOpen_ad_type(flag + "");
// 状态
flag = rand.nextInt(10) > 8 ? 2 : 1;
appStart.setAction(flag + "");
// 加载时长
appStart.setLoading_time(rand.nextInt(20) + "");
// 失败码
flag = rand.nextInt(10);
switch (flag) {
case 1:
appStart.setDetail("102");
break;
case 2:
appStart.setDetail("201");
break;
case 3:
appStart.setDetail("325");
break;
case 4:
appStart.setDetail("433");
break;
case 5:
appStart.setDetail("542");
break;
default:
appStart.setDetail("");
break;
}
// 扩展字段
appStart.setExtend1("");
return appStart;
}
/**
* 消息通知
*/
private static JSONObject generateNotification() {
AppNotification appNotification = new AppNotification();
int flag = rand.nextInt(4) + 1;
// 动作
appNotification.setAction(flag + "");
// 通知id
flag = rand.nextInt(4) + 1;
appNotification.setType(flag + "");
// 客户端弹时间
appNotification.setAp_time((System.currentTimeMillis() - rand.nextInt(99999999)) + "");
// 备用字段
appNotification.setContent("");
JSONObject jsonObject = (JSONObject) JSON.toJSON(appNotification);
return packEventJson("notification", jsonObject);
}
/**
* 前台活跃
*/
private static JSONObject generateForeground() {
AppActive_foreground appActive_foreground = new AppActive_foreground();
// 推送消息的id
int flag = rand.nextInt(2);
switch (flag) {
case 1:
appActive_foreground.setAccess(flag + "");
break;
default:
appActive_foreground.setAccess("");
break;
}
// 1.push 2.icon 3.其他
flag = rand.nextInt(3) + 1;
appActive_foreground.setPush_id(flag + "");
JSONObject jsonObject = (JSONObject) JSON.toJSON(appActive_foreground);
return packEventJson("active_foreground", jsonObject);
}
/**
* 后台活跃
*/
private static JSONObject generateBackground() {
AppActive_background appActive_background = new AppActive_background();
// 启动源
int flag = rand.nextInt(3) + 1;
appActive_background.setActive_source(flag + "");
JSONObject jsonObject = (JSONObject) JSON.toJSON(appActive_background);
return packEventJson("active_background", jsonObject);
}
/**
* 错误日志数据
*/
private static JSONObject generateError() {
AppErrorLog appErrorLog = new AppErrorLog();
String[] errorBriefs = {"at cn.lift.dfdf.web.AbstractBaseController.validInbound(AbstractBaseController.java:72)", "at cn.lift.appIn.control.CommandUtil.getInfo(CommandUtil.java:67)"}; //错误摘要
String[] errorDetails = {"java.lang.NullPointerException\\n " + "at cn.lift.appIn.web.AbstractBaseController.validInbound(AbstractBaseController.java:72)\\n " + "at cn.lift.dfdf.web.AbstractBaseController.validInbound", "at cn.lift.dfdfdf.control.CommandUtil.getInfo(CommandUtil.java:67)\\n " + "at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\\n" + " at java.lang.reflect.Method.invoke(Method.java:606)\\n"}; //错误详情
//错误摘要
appErrorLog.setErrorBrief(errorBriefs[rand.nextInt(errorBriefs.length)]);
//错误详情
appErrorLog.setErrorDetail(errorDetails[rand.nextInt(errorDetails.length)]);
JSONObject jsonObject = (JSONObject) JSON.toJSON(appErrorLog);
return packEventJson("error", jsonObject);
}
/**
* 为各个事件类型的公共字段(时间、事件类型、Json数据)拼接
*/
private static JSONObject packEventJson(String eventName, JSONObject jsonObject) {
JSONObject eventJson = new JSONObject();
eventJson.put("ett", (System.currentTimeMillis() - rand.nextInt(99999999)) + "");
eventJson.put("en", eventName);
eventJson.put("kv", jsonObject);
return eventJson;
}
/**
* 获取随机字母组合
*
* @param length 字符串长度
*/
private static String getRandomChar(Integer length) {
StringBuilder str = new StringBuilder();
Random random = new Random();
for (int i = 0; i < length; i++) {
// 字符串
str.append((char) (65 + random.nextInt(26)));// 取得大写字母
}
return str.toString();
}
/**
* 获取随机字母数字组合
*
* @param length 字符串长度
*/
private static String getRandomCharAndNumr(Integer length) {
StringBuilder str = new StringBuilder();
Random random = new Random();
for (int i = 0; i < length; i++) {
boolean b = random.nextBoolean();
if (b) { // 字符串
// int choice = random.nextBoolean() ? 65 : 97; 取得65大写字母还是97小写字母
str.append((char) (65 + random.nextInt(26)));// 取得大写字母
} else { // 数字
str.append(String.valueOf(random.nextInt(10)));
}
}
return str.toString();
}
/**
* 收藏
*/
private static JSONObject generateFavorites() {
AppFavorites favorites = new AppFavorites();
favorites.setCourse_id(rand.nextInt(10));
favorites.setUserid(rand.nextInt(10));
favorites.setAdd_time((System.currentTimeMillis() - rand.nextInt(99999999)) + "");
JSONObject jsonObject = (JSONObject) JSON.toJSON(favorites);
return packEventJson("favorites", jsonObject);
}
/**
* 点赞
*/
private static JSONObject generatePraise() {
AppPraise praise = new AppPraise();
praise.setId(rand.nextInt(10));
praise.setUserid(rand.nextInt(10));
praise.setTarget_id(rand.nextInt(10));
praise.setType(rand.nextInt(4) + 1);
praise.setAdd_time((System.currentTimeMillis() - rand.nextInt(99999999)) + "");
JSONObject jsonObject = (JSONObject) JSON.toJSON(praise);
return packEventJson("praise", jsonObject);
}
/**
* 评论
*/
private static JSONObject generateComment() {
AppComment comment = new AppComment();
comment.setComment_id(rand.nextInt(10));
comment.setUserid(rand.nextInt(10));
comment.setP_comment_id(rand.nextInt(5));
comment.setContent(getCONTENT());
comment.setAddtime((System.currentTimeMillis() - rand.nextInt(99999999)) + "");
comment.setOther_id(rand.nextInt(10));
comment.setPraise_count(rand.nextInt(1000));
comment.setReply_count(rand.nextInt(200));
JSONObject jsonObject = (JSONObject) JSON.toJSON(comment);
return packEventJson("comment", jsonObject);
}
/**
* 生成单个汉字
*/
private static char getRandomChar() {
String str = "";
int hightPos; //
int lowPos;
Random random = new Random();
//随机生成汉子的两个字节
hightPos = (176 + Math.abs(random.nextInt(39)));
lowPos = (161 + Math.abs(random.nextInt(93)));
byte[] b = new byte[2];
b[0] = (Integer.valueOf(hightPos)).byteValue();
b[1] = (Integer.valueOf(lowPos)).byteValue();
try {
str = new String(b, "GBK");
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
System.out.println("错误");
}
return str.charAt(0);
}
/**
* 拼接成多个汉字
*/
private static String getCONTENT() {
StringBuilder str = new StringBuilder();
for (int i = 0; i < rand.nextInt(100); i++) {
str.append(getRandomChar());
}
return str.toString();
}
}
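Before deploying to the cluster it can be useful to smoke-test the generator locally; a minimal sketch, assuming the fat jar built above and using the two optional arguments read in main() (delay in milliseconds between records, and number of records):
# Generate 100 records with no delay; where the output lands depends on the logback
# configuration (on the cluster it is /tmp/logs, as used in the collection steps below)
java -classpath target/log-collector-1.0-SNAPSHOT-jar-with-dependencies.jar bigdata.dw.appclient.AppMain 0 100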
A. Server preparation
1) Download VMware and a CentOS image, and install a CentOS virtual machine.
2) Configure the virtual machine
a. Configure the network interface:
cd /etc/sysconfig/network-scripts
sudo vi ifcfg-eth0
b. Disable the firewall:
sudo chkconfig iptables off
c. Disable SELinux:
cd /etc/selinux/
sudo vi config
d. Configure the host-name mappings:
sudo vi /etc/hosts
e. Remove the persistent NIC udev rules (so cloned VMs get clean network devices):
cd /etc/udev/rules.d/
sudo rm -f 70-persistent-net.rules
f. Shut down the virtual machine,
take a snapshot,
and clone three VMs from this base image.
g. Configure the IP address and host name of each cloned VM:
sudo vi /etc/sysconfig/network-scripts/ifcfg-eth0
sudo vi /etc/sysconfig/network
Also configure the hosts file on the local Windows machine.
h. Disable the graphical interface to save resources:
sudo vi /etc/inittab
i. Shut down and take a snapshot.
j. Install SecureCRT and connect to the three VMs.
B. Hadoop installation (8 configuration files)
1) On node1, create the module and software directories under /opt:
cd /opt
sudo mkdir module
sudo mkdir software
On node2 and node3, create the module directory under /opt as well.
Change the owner of module and software:
sudo chown hadoop:hadoop module/ software/
2) Install Java
a. Upload the JDK tarball, the Hadoop tarball and the hadoop-lzo jar
b. Install the JDK
Remove any existing JDK first.
(1) Check whether Java is already installed:
rpm -qa | grep java
(2) If the installed version is lower than 1.7, remove it:
sudo rpm -e <package-name>
Install the JDK:
tar -zxvf jdk-8u144-linux-x64.tar.gz -C /opt/module/
Configure the JDK environment variables:
sudo vi /etc/profile and add the following:
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin
c. Install Hadoop
tar -zxvf hadoop-2.7.2.tar.gz -C /opt/module/
Configure the Hadoop environment variables:
sudo vi /etc/profile
Add the following:
#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
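The new variables only take effect after /etc/profile is re-read; a quick sanity check, assuming the paths configured above:
source /etc/profile
java -version      # should report 1.8.0_144
hadoop version     # should report Hadoop 2.7.2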
d. Write a cluster-wide distribution script:
Create a bin directory under the home directory:
mkdir bin
Write the xsync script:
vim xsync
Add the following:
#!/bin/bash
#1 get the number of arguments; exit if none were given
pcount=$#
if((pcount==0)); then
echo no args;
exit;
fi
#2 get the file name
p1=$1
fname=`basename $p1`
echo fname=$fname
#3 get the absolute path of the parent directory
pdir=`cd -P $(dirname $p1); pwd`
echo pdir=$pdir
#4 get the current user name
user=`whoami`
#5 loop over the other nodes
for((host=2; host<4; host++)); do
echo ------------------- node$host --------------
rsync -rvl $pdir/$fname $user@node$host:$pdir
done
Grant execute permission:
chmod 777 xsync
3) Configure the cluster:
a. Configure core-site.xml:
vim core-site.xml and add the following properties inside <configuration>:
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://node1:9000</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/module/hadoop-2.7.2/data/tmp</value>
</property>
b. HDFS configuration files
Configure hadoop-env.sh:
export JAVA_HOME=/opt/module/jdk1.8.0_144
Configure hdfs-site.xml:
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>node3:50090</value>
</property>
c. YARN configuration files
Configure yarn-env.sh:
export JAVA_HOME=/opt/module/jdk1.8.0_144
Configure yarn-site.xml:
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node2</value>
</property>
d. MapReduce configuration files
Configure mapred-env.sh:
export JAVA_HOME=/opt/module/jdk1.8.0_144
Configure mapred-site.xml:
cp mapred-site.xml.template mapred-site.xml
(then typically set mapreduce.framework.name to yarn in it)
e. Configure the slaves file (list the worker host names, i.e. node1, node2 and node3, one per line).
4) Distribute:
a. Configure passwordless SSH
On node1:
cd .ssh
ssh-keygen -t rsa    (press Enter three times)
ssh-copy-id node1
ssh-copy-id node2
ssh-copy-id node3
On node2 (the YARN node):
cd .ssh
ssh-keygen -t rsa    (press Enter three times)
ssh-copy-id node1
ssh-copy-id node2
ssh-copy-id node3
b. Distribute with the xsync script
Run on node1:
xsync jdk1.8.0_144/
xsync hadoop-2.7.2/
sudo scp /etc/profile root@node3:/etc/profile
sudo scp /etc/profile root@node2:/etc/profile
5) Start the cluster
Run on node1:
Format the NameNode (first start only): bin/hdfs namenode -format
Start HDFS: sbin/start-dfs.sh
Start YARN on node2: sbin/start-yarn.sh
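A quick way to confirm that the daemons came up as planned is to run jps on each node and compare against the layout implied by the configuration above (this assumes the slaves file lists all three nodes as workers):
# Expected, roughly:
#   node1: NameNode, DataNode, NodeManager
#   node2: ResourceManager, DataNode, NodeManager
#   node3: SecondaryNameNode, DataNode, NodeManager
jps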
6) HDFS tuning:
a. LZO compression support
Put the previously uploaded hadoop-lzo-0.4.20.jar into hadoop-2.7.2/share/hadoop/common/:
cp hadoop-lzo-0.4.20.jar /opt/module/hadoop-2.7.2/share/hadoop/common/
Distribute it to the other nodes:
xsync hadoop-lzo-0.4.20.jar
Edit core-site.xml and add the following properties:
<property>
    <name>io.compression.codecs</name>
    <value>
        org.apache.hadoop.io.compress.GzipCodec,
        org.apache.hadoop.io.compress.DefaultCodec,
        org.apache.hadoop.io.compress.BZip2Codec,
        org.apache.hadoop.io.compress.SnappyCodec,
        com.hadoop.compression.lzo.LzoCodec,
        com.hadoop.compression.lzo.LzopCodec
    </value>
</property>
<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
Distribute core-site.xml:
xsync core-site.xml
C. ZooKeeper installation
1) Upload the ZooKeeper tarball:
sftp> put E:\工作\电商数仓项目\2.资料\01_jars\03_zookeeper\zookeeper-3.4.10.tar.gz
2) Extract and install ZooKeeper:
tar -zxvf zookeeper-3.4.10.tar.gz -C /opt/module/
Distribute:
xsync zookeeper-3.4.10/
3) Configure the server IDs
a. Create a zkData directory under /opt/module/zookeeper-3.4.10/:
mkdir zkData
b. Create a file named myid under /opt/module/zookeeper-3.4.10/zkData:
touch myid
c. Edit myid and write the server number for this node (1):
vi myid
d. Distribute:
xsync myid
Then change the myid content to 2 on node2 and to 3 on node3.
4) Configure zoo.cfg
a. Rename zoo_sample.cfg under /opt/module/zookeeper-3.4.10/conf to zoo.cfg:
mv zoo_sample.cfg zoo.cfg
b. Open zoo.cfg:
vim zoo.cfg
Change the data directory:
dataDir=/opt/module/zookeeper-3.4.10/zkData
Add the cluster configuration:
#######################cluster##########################
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888
Distribute:
xsync zoo.cfg
5) Write a ZooKeeper start/stop script:
a. Create the script under /home/hadoop/bin on node1:
vim zk.sh
Add the following:
#! /bin/bash
case $1 in
"start"){
for i in node1 node2 node3
do
ssh $i "/opt/module/zookeeper-3.4.10/bin/zkServer.sh start"
done
};;
"stop"){
for i in node1 node2 node3
do
ssh $i "/opt/module/zookeeper-3.4.10/bin/zkServer.sh stop"
done
};;
"status"){
for i in node1 node2 node3
do
ssh $i "/opt/module/zookeeper-3.4.10/bin/zkServer.sh status"
done
};;
esac
b. Grant execute permission:
chmod 777 zk.sh
c. Make the environment variables visible to non-interactive SSH sessions by appending /etc/profile to ~/.bashrc on each of the three nodes:
cat /etc/profile >> ~/.bashrc    (run this on node1, node2 and node3)
d. Start, stop or check the cluster:
zk.sh stop
zk.sh start
zk.sh status
D. Log generation
1) Copy the previously built jar log-collector-1.0-SNAPSHOT-jar-with-dependencies.jar to /opt/module on node1:
put E:\DWproject\logcollector\target\log-collector-1.0-SNAPSHOT-jar-with-dependencies.jar
Distribute:
xsync log-collector-1.0-SNAPSHOT-jar-with-dependencies.jar
Delete the jar on node3 (to save resources):
On node3: rm -f log-collector-1.0-SNAPSHOT-jar-with-dependencies.jar
2) Run the jar on node2:
java -classpath log-collector-1.0-SNAPSHOT-jar-with-dependencies.jar bigdata.dw.appclient.AppMain >/opt/module/test.log
3) Check the generated log files under /tmp/logs:
cd /tmp/logs/
ls
4) Cluster-wide log-generation script:
a. Create the script under the bin directory:
vim lg.sh
Add the following:
#! /bin/bash
for i in node1 node2
do
ssh $i "java -classpath /opt/module/log-collector-1.0-SNAPSHOT-jar-with-dependencies.jar bigdata.dw.appclient.AppMain $1 $2 >/opt/module/test.log &"
done
b. Grant execute permission:
chmod 777 lg.sh
c. Run the script:
lg.sh
5) Cluster time-adjustment script
a. Create dt.sh under the bin directory:
vim dt.sh
Add the following:
#!/bin/bash
log_date=$1
for i in node1 node2 node3
do
ssh -t $i "sudo date -s $log_date"
done
chmod 777 dt.sh
Run it:
dt.sh '2019-03-10'
6) Script to show all processes across the cluster
vim xcall.sh
Add the following:
#! /bin/bash
for i in node1 node2 node3
do
echo --------- $i ----------
ssh $i "$*"
done
chmod 777 xcall.sh
Run it: xcall.sh jps
E. Log collection
1) Flume installation:
a. Upload the package:
put E:\工作\电商数仓项目\2.资料\01_jars\04_flume\apache-flume-1.7.0-bin.tar.gz
b. Extract apache-flume-1.7.0-bin.tar.gz into /opt/module/:
tar -zxvf apache-flume-1.7.0-bin.tar.gz -C /opt/module/
c. Rename apache-flume-1.7.0-bin to flume:
mv apache-flume-1.7.0-bin flume
d. Rename flume/conf/flume-env.sh.template to flume-env.sh and edit it (typically to set JAVA_HOME):
mv flume-env.sh.template flume-env.sh
vi flume-env.sh
2) Flume agent configuration:
a. Create file-flume-kafka.conf under /opt/module/flume/conf:
vim file-flume-kafka.conf
Add the following:
a1.sources=r1
a1.channels=c1 c2
# configure source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/module/flume/test/log_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /tmp/logs/app.+
a1.sources.r1.fileHeader = true
a1.sources.r1.channels = c1 c2
#interceptor
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = bigdata.dw.flume.interceptor.LogETLInterceptor$Builder
a1.sources.r1.interceptors.i2.type = bigdata.dw.flume.interceptor.LogTypeInterceptor$Builder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = topic
a1.sources.r1.selector.mapping.topic_start = c1
a1.sources.r1.selector.mapping.topic_event = c2
# configure channel
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = node1:9092,node2:9092,node3:9092
a1.channels.c1.kafka.topic = topic_start
a1.channels.c1.parseAsFlumeEvent = false
a1.channels.c1.kafka.consumer.group.id = flume-consumer
a1.channels.c2.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c2.kafka.bootstrap.servers = node1:9092,node2:9092,node3:9092
a1.channels.c2.kafka.topic = topic_event
a1.channels.c2.parseAsFlumeEvent = false
a1.channels.c2.kafka.consumer.group.id = flume-consumer
3) Flume ETL and log-type interceptors
This project defines two custom interceptors: an ETL interceptor and a log-type interceptor.
The ETL interceptor filters out records whose server timestamp is invalid or whose JSON payload is incomplete.
The log-type interceptor separates start-up logs from event logs so that they can be sent to different Kafka topics.
a. Create a Maven project named flume-interceptor
b. Create the package bigdata.dw.flume.interceptor
c. Add the following to pom.xml:
<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.7.0</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
d. Create a class named LogETLInterceptor under the bigdata.dw.flume.interceptor package
(this is the Flume ETL interceptor) and add the following code:
package bigdata.dw.flume.interceptor;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
public class LogETLInterceptor implements Interceptor {
@Override
public void initialize() {
}
@Override
public Event intercept(Event event) {
// 1 获取数据
byte[] body = event.getBody();
String log = new String(body, Charset.forName("UTF-8"));
// 2 判断数据类型并向Header中赋值
if (log.contains("start")) {
if (LogUtils.validateStart(log)){
return event;
}
}else {
if (LogUtils.validateEvent(log)){
return event;
}
}
// 3 返回校验结果
return null;
}
@Override
public List<Event> intercept(List<Event> events) {
ArrayList<Event> interceptors = new ArrayList<>();
for (Event event : events) {
Event intercept1 = intercept(event);
if (intercept1 != null){
interceptors.add(intercept1);
}
}
return interceptors;
}
@Override
public void close() {
}
public static class Builder implements Interceptor.Builder{
@Override
public Interceptor build() {
return new LogETLInterceptor();
}
@Override
public void configure(Context context) {
}
}
}
e. Create the Flume log-validation utility class LogUtils and add the following code:
package bigdata.dw.flume.interceptor;
import org.apache.commons.lang.math.NumberUtils;
public class LogUtils {
public static boolean validateEvent(String log) {
// 服务器时间 | json
// 1549696569054 | {"cm":{"ln":"-89.2","sv":"V2.0.4","os":"8.2.0","g":"[email protected]","nw":"4G","l":"en","vc":"18","hw":"1080*1920","ar":"MX","uid":"u8678","t":"1549679122062","la":"-27.4","md":"sumsung-12","vn":"1.1.3","ba":"Sumsung","sr":"Y"},"ap":"weather","et":[]}
// 1 切割
String[] logContents = log.split("\\|");
// 2 校验
if(logContents.length != 2){
return false;
}
//3 校验服务器时间
if (logContents[0].length()!=13 || !NumberUtils.isDigits(logContents[0])){
return false;
}
// 4 校验json
if (!logContents[1].trim().startsWith("{") || !logContents[1].trim().endsWith("}")){
return false;
}
return true;
}
public static boolean validateStart(String log) {
if (log == null){
return false;
}
// 校验json
if (!log.trim().startsWith("{") || !log.trim().endsWith("}")){
return false;
}
return true;
}
}
f. Create the Flume log-type interceptor LogTypeInterceptor:
package bigdata.dw.flume.interceptor;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
public class LogTypeInterceptor implements Interceptor {
@Override
public void initialize() {
}
@Override
public Event intercept(Event event) {
// 区分日志类型: body header
// 1 获取body数据
byte[] body = event.getBody();
String log = new String(body, Charset.forName("UTF-8"));
// 2 获取header
Map<String, String> headers = event.getHeaders();
// 3 判断数据类型并向Header中赋值
if (log.contains("start")) {
headers.put("topic","topic_start");
}else {
headers.put("topic","topic_event");
}
return event;
}
@Override
public List<Event> intercept(List<Event> events) {
ArrayList<Event> interceptors = new ArrayList<>();
for (Event event : events) {
Event intercept1 = intercept(event);
interceptors.add(intercept1);
}
return interceptors;
}
@Override
public void close() {
}
public static class Builder implements Interceptor.Builder{
@Override
public Interceptor build() {
return new LogTypeInterceptor();
}
@Override
public void configure(Context context) {
}
}
}
g. Package the project:
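The document does not spell the packaging command out; assuming a standard Maven setup with the pom above, something like the following, run from the flume-interceptor project root, produces the jar referenced in the next step:
# Produces target/flume-interceptor-1.0-SNAPSHOT.jar (plus a jar-with-dependencies variant from the assembly plugin)
mvn clean package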
4) Put the jar built above (flume-interceptor-1.0-SNAPSHOT.jar, the one without bundled dependencies) into /opt/module/flume/lib on node1:
put E:\DWproject\flumeinterceptor\target\flume-interceptor-1.0-SNAPSHOT.jar
5) Distribute:
xsync flume/
6) Log-collection Flume start/stop script
Create f1.sh under the bin directory:
vim f1.sh
Add the following:
#! /bin/bash
case $1 in
"start"){
for i in node1 node2
do
echo " --------启动 $i 采集flume-------"
ssh $i "nohup /opt/module/flume/bin/flume-ng agent --conf-file /opt/module/flume/conf/file-flume-kafka.conf --name a1 -Dflume.root.logger=INFO,LOGFILE >/dev/null 2>&1 &"
done
};;
"stop"){
for i in node1 node2
do
echo " --------停止 $i 采集flume-------"
ssh $i "ps -ef | grep file-flume-kafka | grep -v grep |awk '{print \$2}' | xargs kill"
done
};;
esac
Grant execute permission and start collection:
chmod 777 f1.sh
f1.sh start
Stop collection:
f1.sh stop
F. Kafka installation:
1) Kafka cluster installation
a. Upload the packages:
put E:\工作\电商数仓项目\2.资料\01_jars\05_kafka\kafka_2.11-0.11.0.2.tgz
Upload the Kafka Manager package as well:
put E:\工作\电商数仓项目\2.资料\01_jars\05_kafka\kafka-manager-1.3.3.22.zip
b. Extract and install Kafka:
tar -zxvf kafka_2.11-0.11.0.2.tgz -C /opt/module/
Rename the extracted directory:
mv kafka_2.11-0.11.0.2/ kafka
Create the log directory:
mkdir logs
2) Go to the config directory
cd kafka/config/
Set the following in server.properties:
#globally unique broker id; must not be duplicated
broker.id=0
#enable topic deletion
delete.topic.enable=true
#directory where Kafka stores its data/log segments
log.dirs=/opt/module/kafka/logs
#ZooKeeper cluster connection string
zookeeper.connect=node1:2181,node2:2181,node3:2181
3) Configure environment variables:
sudo vi /etc/profile
#KAFKA_HOME
export KAFKA_HOME=/opt/module/kafka
export PATH=$PATH:$KAFKA_HOME/bin
source /etc/profile
4) Distribute:
xsync kafka/
After distributing, remember to update the environment variables on the other machines:
scp /etc/profile root@node2:/etc/profile
scp /etc/profile root@node3:/etc/profile
5) On node2 and node3, change broker.id in /opt/module/kafka/config/server.properties to 1 and 2 respectively (node1 keeps broker.id=0).
6) Kafka cluster start/stop script
Create kf.sh under the bin directory:
vim kf.sh
#! /bin/bash
case $1 in
"start"){
for i in node1 node2 node3
do
echo " --------启动 $i Kafka-------"
# 用于KafkaManager监控
ssh $i "export JMX_PORT=9988 && /opt/module/kafka/bin/kafka-server-start.sh -daemon /opt/module/kafka/config/server.properties "
done
};;
"stop"){
for i in node1 node2 node3
do
echo " --------停止 $i Kafka-------"
ssh $i "/opt/module/kafka/bin/kafka-server-stop.sh stop"
done
};;
esac
chmod 777 kf.sh
b. Start and stop:
kf.sh start
kf.sh stop
Note: the stop command can be unreliable; see the separate note on the Kafka stop script not working.
7) List Kafka topics:
bin/kafka-topics.sh --zookeeper node1:2181 --list
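If topic auto-creation is disabled on the brokers, topic_start and topic_event have to be created before the Flume Kafka channels can write to them; a sketch (the replication factor and partition count here are illustrative choices, not values taken from this document):
bin/kafka-topics.sh --zookeeper node1:2181 --create --replication-factor 1 --partitions 1 --topic topic_start
bin/kafka-topics.sh --zookeeper node1:2181 --create --replication-factor 1 --partitions 1 --topic topic_event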
8) Consume messages:
bin/kafka-console-consumer.sh --zookeeper node1:2181 --from-beginning --topic topic_start
9) Produce messages:
bin/kafka-console-producer.sh --broker-list node1:9092 --topic topic_start
G. Kafka Manager installation
1) Kafka Manager is a Kafka monitoring and management tool from Yahoo.
a. Move kafka-manager-1.3.3.22.zip to /opt/module and unzip it:
mv kafka-manager-1.3.3.22.zip /opt/module/
unzip kafka-manager-1.3.3.22.zip
b. Go to /opt/module/kafka-manager-1.3.3.22/conf and set kafka-manager.zkhosts in application.conf:
vim application.conf
kafka-manager.zkhosts="node1:2181,node2:2181,node3:2181"
c. Start Kafka Manager:
nohup bin/kafka-manager -Dhttp.port=7456 >/opt/module/kafka-manager-1.3.3.22/start.log 2>&1 &
d. Using Kafka Manager:
https://blog.csdn.net/u011089412/article/details/87895652
2) Kafka Manager start/stop script
Under the bin directory:
vim km.sh
#! /bin/bash
case $1 in
"start"){
echo " -------- 启动 KafkaManager -------"
nohup /opt/module/kafka-manager-1.3.3.22/bin/kafka-manager -Dhttp.port=7456 >start.log 2>&1 &
};;
"stop"){
echo " -------- 停止 KafkaManager -------"
jps | grep ProdServerStart |awk '{print $1}' | xargs kill
};;
esac
chmod 777 km.sh
Start and stop:
km.sh start
km.sh stop
H. Flume consumer for the Kafka data
1) On node3, create kafka-flume-hdfs.conf under /opt/module/flume/conf:
vim kafka-flume-hdfs.conf
a1.sources=r1 r2
a1.channels=c1 c2
a1.sinks=k1 k2
## source1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = node1:9092,node2:9092,node3:9092
a1.sources.r1.kafka.topics=topic_start
## source2
a1.sources.r2.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r2.batchSize = 5000
a1.sources.r2.batchDurationMillis = 2000
a1.sources.r2.kafka.bootstrap.servers = node1:9092,node2:9092,node3:9092
a1.sources.r2.kafka.topics=topic_event
## channel1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior1
a1.channels.c1.dataDirs = /opt/module/flume/data/behavior1/
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1000000
a1.channels.c1.keep-alive = 6
## channel2
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /opt/module/flume/checkpoint/behavior2
a1.channels.c2.dataDirs = /opt/module/flume/data/behavior2/
a1.channels.c2.maxFileSize = 2146435071
a1.channels.c2.capacity = 1000000
a1.channels.c2.keep-alive = 6
## sink1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /origin_data/gmall/log/topic_start/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = logstart-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = second
##sink2
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = /origin_data/gmall/log/topic_event/%Y-%m-%d
a1.sinks.k2.hdfs.filePrefix = logevent-
a1.sinks.k2.hdfs.round = true
a1.sinks.k2.hdfs.roundValue = 10
a1.sinks.k2.hdfs.roundUnit = second
## roll settings: avoid generating large numbers of small files
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k2.hdfs.rollInterval = 10
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollCount = 0
## write output files as LZO-compressed streams
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k2.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = lzop
a1.sinks.k2.hdfs.codeC = lzop
## 拼装
a1.sources.r1.channels = c1
a1.sinks.k1.channel= c1
a1.sources.r2.channels = c2
a1.sinks.k2.channel= c2
2) Log-consumption Flume start/stop script
Create f2.sh under the bin directory in node1's home directory:
vim f2.sh
#! /bin/bash
case $1 in
"start"){
for i in node3
do
echo " --------启动 $i 消费flume-------"
ssh $i "nohup /opt/module/flume/bin/flume-ng agent --conf-file /opt/module/flume/conf/kafka-flume-hdfs.conf --name a1 -Dflume.root.logger=INFO,LOGFILE >/opt/module/flume/log.txt 2>&1 &"
done
};;
"stop"){
for i in node3
do
echo " --------停止 $i 消费flume-------"
ssh $i "ps -ef | grep kafka-flume-hdfs | grep -v grep |awk '{print \$2}' | xargs kill"
done
};;
esac
chmod 777 f2.sh
Start and stop the consumer:
f2.sh start
f2.sh stop
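Once f1.sh, Kafka and f2.sh are all running and some logs have been generated, data should start appearing under the HDFS paths configured in the sinks above; a quick check:
hadoop fs -ls /origin_data/gmall/log/topic_start
hadoop fs -ls /origin_data/gmall/log/topic_event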
I. Collection-channel start/stop script
Create cluster.sh under the bin directory:
Add the following:
#! /bin/bash
case $1 in
"start"){
echo " -------- 启动 集群 -------"
echo " -------- 启动 hadoop集群 -------"
/opt/module/hadoop-2.7.2/sbin/start-dfs.sh
ssh node2 "/opt/module/hadoop-2.7.2/sbin/start-yarn.sh"
#启动 Zookeeper集群
zk.sh start
sleep 4s;
#启动 Flume采集集群
f1.sh start
#启动 Kafka采集集群
kf.sh start
sleep 6s;
#启动 Flume消费集群
f2.sh start
#启动 KafkaManager
km.sh start
};;
"stop"){
echo " -------- 停止 集群 -------"
#停止 KafkaManager
km.sh stop
#停止 Flume消费集群
f2.sh stop
#停止 Kafka采集集群
kf.sh stop
sleep 6s;
#停止 Flume采集集群
f1.sh stop
#停止 Zookeeper集群
zk.sh stop
echo " -------- 停止 hadoop集群 -------"
ssh node2 "/opt/module/hadoop-2.7.2/sbin/stop-yarn.sh"
/opt/module/hadoop-2.7.2/sbin/stop-dfs.sh
};;
esac
chmod 777 cluster.sh
2) Start and stop the whole channel:
cluster.sh start
cluster.sh stop
You may need to run these two commands first:
Adjust the cluster date (run this first): dt.sh '2019-03-10'
Generate logs: lg.sh
If LZO compression fails, the native LZO libraries still need to be installed; see the separate LZO installation notes.
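Putting the pieces together, a rough smoke test of the whole collection channel (using the sample date used throughout this document) looks like this:
cluster.sh start                # HDFS, YARN, ZooKeeper, Flume, Kafka, Kafka Manager
dt.sh '2019-03-10'              # set the cluster date that the generated logs will carry
lg.sh                           # generate logs on node1 and node2
xcall.sh jps                    # confirm all processes are alive
hadoop fs -ls /origin_data/gmall/log/topic_start/2019-03-10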
A. Hive & MySQL installation
1) Upload the packages:
put E:\工作\电商数仓项目\2.资料\01_jars\06_hive\apache-hive-1.2.1-bin.tar.gz
put E:\工作\电商数仓项目\2.资料\01_jars\06_hive\apache-tez-0.9.1-bin.tar.gz
put E:\工作\电商数仓项目\2.资料\01_jars\07_mysql\mysql-libs.zip
2) Extract and install:
tar -zxvf apache-hive-1.2.1-bin.tar.gz -C /opt/module
Rename apache-hive-1.2.1-bin to hive:
mv apache-hive-1.2.1-bin/ hive
3) Rename hive-env.sh.template under /opt/module/hive/conf to hive-env.sh:
mv hive-env.sh.template hive-env.sh
4) Configure hive-env.sh
(a) Set HADOOP_HOME:
HADOOP_HOME=/opt/module/hadoop-2.7.2
(b) Set HIVE_CONF_DIR:
export HIVE_CONF_DIR=/opt/module/hive/conf
5) MySQL installation
a. Check whether MySQL is already installed; if so, remove it:
rpm -qa|grep mysql
b. Remove it:
sudo rpm -e --nodeps mysql-libs-5.1.66-2.el6_3.x86_64
c. Unzip mysql-libs.zip in the current directory:
unzip mysql-libs.zip
d. Enter the mysql-libs directory and install the MySQL server:
rpm -ivh MySQL-server-5.6.24-1.el6.x86_64.rpm
Look up the generated random root password:
cat /root/.mysql_secret
Check the MySQL service status:
service mysql status
Start MySQL:
service mysql start
e. Install the MySQL client:
rpm -ivh MySQL-client-5.6.24-1.el6.x86_64.rpm
Connect to MySQL:
mysql -uroot -pF2WKrHqrJE7Zfhb2
Change the password:
SET PASSWORD=PASSWORD('123456');
Exit MySQL:
exit
f. Adjust the host settings in MySQL's user table
so that the root user (with the password) can log in from any host.
1. Log in to MySQL:
[root@node1 mysql-libs]# mysql -uroot -p123456
2. Show databases:
mysql>show databases;
3. Use the mysql database:
mysql>use mysql;
4. Show all tables in the mysql database:
mysql>show tables;
5. Describe the user table:
mysql>desc user;
6. Query the user table:
mysql>select User, Host, Password from user;
7. Update the user table, changing the Host value to %:
mysql>update user set host='%' where host='localhost';
8. Delete the root user's other host entries:
mysql>
delete from user where Host='node1';
delete from user where Host='127.0.0.1';
delete from user where Host='::1';
9. Flush privileges:
mysql>flush privileges;
10. Quit:
mysql>quit;
g. Configure the Hive metastore to use MySQL
1. Copy the JDBC driver
Extract mysql-connector-java-5.1.27.tar.gz under /opt/software/mysql-libs:
tar -zxvf mysql-connector-java-5.1.27.tar.gz
Copy mysql-connector-java-5.1.27-bin.jar from /opt/software/mysql-libs/mysql-connector-java-5.1.27 to /opt/module/hive/lib/:
cp mysql-connector-java-5.1.27-bin.jar /opt/module/hive/lib/
6) Configure the metastore connection
1. Create hive-site.xml under /opt/module/hive/conf:
vi hive-site.xml
2. Following the official documentation, add the following properties to hive-site.xml:
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://node1:3306/metastore?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>username to use against metastore database</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
    <description>password to use against metastore database</description>
</property>
<property>
    <name>hive.cli.print.header</name>
    <value>true</value>
</property>
<property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
</property>
7) Start Hive:
bin/hive
8) Disable metastore schema verification
vim hive-site.xml
<property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
</property>
B. Tez as the Hive execution engine
Tez merges several dependent jobs into a single job, so intermediate results are written to HDFS only once and there are fewer intermediate stages, which greatly improves performance.
1) a. Upload the package (already done above)
b. Extract and install:
tar -zxvf apache-tez-0.9.1-bin.tar.gz -C /opt/module/
c. Rename it:
mv apache-tez-0.9.1-bin/ tez-0.9.1
2) Configure Tez in Hive
a. Go to Hive's config directory: /opt/module/hive/conf
b. Add the Tez environment variables and dependency jar paths to hive-env.sh
Add the following:
export TEZ_HOME=/opt/module/tez-0.9.1    # the directory Tez was extracted to
export TEZ_JARS=""
for jar in `ls $TEZ_HOME |grep jar`; do
export TEZ_JARS=$TEZ_JARS:$TEZ_HOME/$jar
done
for jar in `ls $TEZ_HOME/lib`; do
export TEZ_JARS=$TEZ_JARS:$TEZ_HOME/lib/$jar
done
export HIVE_AUX_JARS_PATH=/opt/module/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20.jar$TEZ_JARS
c. Add the following to hive-site.xml to switch Hive's execution engine to Tez:
<property>
    <name>hive.execution.engine</name>
    <value>tez</value>
</property>
d. Configure Tez
Create a tez-site.xml file under /opt/module/hive/conf with the following properties:
<property>
    <name>tez.lib.uris</name>
    <value>${fs.defaultFS}/tez/tez-0.9.1,${fs.defaultFS}/tez/tez-0.9.1/lib</value>
</property>
<property>
    <name>tez.lib.uris.classpath</name>
    <value>${fs.defaultFS}/tez/tez-0.9.1,${fs.defaultFS}/tez/tez-0.9.1/lib</value>
</property>
<property>
    <name>tez.use.cluster.hadoop-libs</name>
    <value>true</value>
</property>
<property>
    <name>tez.history.logging.service.class</name>
    <value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
</property>
e. Upload Tez to the cluster
Upload /opt/module/tez-0.9.1 to the /tez path on HDFS:
hadoop fs -mkdir /tez
hadoop fs -put /opt/module/tez-0.9.1/ /tez
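A quick listing confirms that the upload matches the tez.lib.uris paths configured above:
hadoop fs -ls /tez/tez-0.9.1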
f. Test
Start Hive:
bin/hive
1) If Tez containers are killed by the NodeManager for exceeding the virtual-memory limit:
disable the virtual-memory check by adding the following to yarn-site.xml:
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
xsync yarn-site.xml
Restart the cluster:
cluster.sh stop
cluster.sh start
2) Create an LZO test table:
create table student(
id int,
name string);
Insert a row:
insert into student values(1,"zhangsan");
If no error is reported, the setup works:
select * from student;
A. Create the database
Create the gmall database:
create database gmall;
Use it:
use gmall;
B. Create the start-up log table ods_start_log
1) Create a partitioned table with LZO input format and text output format (the JSON is kept as a single string column and parsed later):
hive (gmall)>
drop table if exists ods_start_log;
CREATE EXTERNAL TABLE ods_start_log (`line` string)
PARTITIONED BY (`dt` string)
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_start_log';
2) Load data:
load data inpath '/origin_data/gmall/log/topic_start/2019-03-10' into table gmall.ods_start_log partition(dt='2019-03-10');
3) Check that the load succeeded:
select * from ods_start_log limit 2;
C. Create the event log table ods_event_log
1) Create a partitioned table with LZO input format and text output format:
drop table if exists ods_event_log;
CREATE EXTERNAL TABLE ods_event_log(`line` string)
PARTITIONED BY (`dt` string)
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_event_log';
2) Load data:
load data inpath '/origin_data/gmall/log/topic_event/2019-03-10' into table gmall.ods_event_log partition(dt='2019-03-10');
3) Check that the load succeeded:
select * from ods_event_log limit 2;
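Besides sampling a couple of rows, it is worth confirming that the partitions were registered; a sketch using hive -e and the gmall database created above:
/opt/module/hive/bin/hive -e "use gmall; show partitions ods_start_log; show partitions ods_event_log;"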
D. ODS-layer data-loading script
vim ods_log.sh
#!/bin/bash
# 定义变量方便修改
APP=gmall
hive=/opt/module/hive/bin/hive
# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
echo "===日志日期为 $do_date==="
sql="
load data inpath '/origin_data/gmall/log/topic_start/$do_date' into table "$APP".ods_start_log partition(dt='$do_date');
load data inpath '/origin_data/gmall/log/topic_event/$do_date' into table "$APP".ods_event_log partition(dt='$do_date');
"
$hive -e "$sql"
2) Grant execute permission and run the script:
chmod 777 ods_log.sh
ods_log.sh 2019-03-11
3) Check the imported data:
select * from ods_start_log where dt='2019-03-11' limit 2;
select * from ods_event_log where dt='2019-03-11' limit 2;
The DWD layer cleans the ODS data: removing nulls, dirty records and values outside reasonable ranges, converting row-oriented storage to columnar storage, and changing the compression format.
A. Create the start-up log table (dwd_start_log)
1) DDL:
drop table if exists dwd_start_log;
CREATE EXTERNAL TABLE dwd_start_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`entry` string,
`open_ad_type` string,
`action` string,
`loading_time` string,
`detail` string,
`extend1` string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_start_log/';
2) Load data into the start-up table:
insert overwrite table dwd_start_log
PARTITION (dt='2019-03-10')
select
get_json_object(line,'$.mid') mid_id,
get_json_object(line,'$.uid') user_id,
get_json_object(line,'$.vc') version_code,
get_json_object(line,'$.vn') version_name,
get_json_object(line,'$.l') lang,
get_json_object(line,'$.sr') source,
get_json_object(line,'$.os') os,
get_json_object(line,'$.ar') area,
get_json_object(line,'$.md') model,
get_json_object(line,'$.ba') brand,
get_json_object(line,'$.sv') sdk_version,
get_json_object(line,'$.g') gmail,
get_json_object(line,'$.hw') height_width,
get_json_object(line,'$.t') app_time,
get_json_object(line,'$.nw') network,
get_json_object(line,'$.ln') lng,
get_json_object(line,'$.la') lat,
get_json_object(line,'$.entry') entry,
get_json_object(line,'$.open_ad_type') open_ad_type,
get_json_object(line,'$.action') action,
get_json_object(line,'$.loading_time') loading_time,
get_json_object(line,'$.detail') detail,
get_json_object(line,'$.extend1') extend1
from ods_start_log
where dt='2019-03-10';
3) Test:
select * from dwd_start_log limit 2;
C. DWD-layer start-up table loading script
1) Create the script under the bin directory:
vim dwd_start_log.sh
#!/bin/bash
# 定义变量方便修改
APP=gmall
hive=/opt/module/hive/bin/hive
# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table "$APP".dwd_start_log
PARTITION (dt='$do_date')
select
get_json_object(line,'$.mid') mid_id,
get_json_object(line,'$.uid') user_id,
get_json_object(line,'$.vc') version_code,
get_json_object(line,'$.vn') version_name,
get_json_object(line,'$.l') lang,
get_json_object(line,'$.sr') source,
get_json_object(line,'$.os') os,
get_json_object(line,'$.ar') area,
get_json_object(line,'$.md') model,
get_json_object(line,'$.ba') brand,
get_json_object(line,'$.sv') sdk_version,
get_json_object(line,'$.g') gmail,
get_json_object(line,'$.hw') height_width,
get_json_object(line,'$.t') app_time,
get_json_object(line,'$.nw') network,
get_json_object(line,'$.ln') lng,
get_json_object(line,'$.la') lat,
get_json_object(line,'$.entry') entry,
get_json_object(line,'$.open_ad_type') open_ad_type,
get_json_object(line,'$.action') action,
get_json_object(line,'$.loading_time') loading_time,
get_json_object(line,'$.detail') detail,
get_json_object(line,'$.extend1') extend1
from "$APP".ods_start_log
where dt='$do_date';
"
$hive -e "$sql"
2) Grant execute permission and run the script:
chmod 777 dwd_start_log.sh
dwd_start_log.sh 2019-03-11
3) Check the imported data:
select * from dwd_start_log where dt='2019-03-11' limit 2;
E. DWD-layer event data parsing
1) Create the base detail table
a. DDL:
drop table if exists dwd_base_event_log;
CREATE EXTERNAL TABLE dwd_base_event_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`event_name` string,
`event_json` string,
`server_time` string)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_base_event_log/';
b. Note: event_name and event_json hold the event name and the full event JSON respectively. Each raw log line contains one-to-many events, which are split apart here; to flatten the raw logs we need a custom UDF and UDTF.
2) Custom UDF and UDTF functions
a. Upload the prebuilt hivefunction-1.0-SNAPSHOT.jar to /opt/module/hive/ on node1.
b. Add the jar to Hive's classpath:
hive (gmall)> add jar /opt/module/hive/hivefunction-1.0-SNAPSHOT.jar;
c. Create temporary functions bound to the compiled Java classes:
create temporary function base_analizer as 'com.atguigu.udf.BaseFieldUDF';
create temporary function flat_analizer as 'com.atguigu.udtf.EventJsonUDTF';
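A quick way to sanity-check the two temporary functions before running the full insert below; a sketch wrapped in hive -e, assuming the jar path above (temporary functions only live for one session, so the add jar / create statements are repeated):
/opt/module/hive/bin/hive -e "
add jar /opt/module/hive/hivefunction-1.0-SNAPSHOT.jar;
create temporary function base_analizer as 'com.atguigu.udf.BaseFieldUDF';
create temporary function flat_analizer as 'com.atguigu.udtf.EventJsonUDTF';
use gmall;
select base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la') from ods_event_log where dt='2019-03-10' limit 1;
"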
4.2.4 Parse the event logs into the base detail table
1) Parse and load the base detail table:
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_base_event_log
PARTITION (dt='2019-03-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
event_name,
event_json,
server_time
from
(
select
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[0] as mid_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[1] as user_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[2] as version_code,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[3] as version_name,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[4] as lang,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[5] as source,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[6] as os,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[7] as area,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[8] as model,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[9] as brand,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[10] as sdk_version,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[11] as gmail,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[12] as height_width,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[13] as app_time,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[14] as network,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[15] as lng,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[16] as lat,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[17] as ops,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[18] as server_time
from ods_event_log where dt='2019-03-10' and base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la')<>''
) sdk_log lateral view flat_analizer(ops) tmp_k as event_name, event_json;
2) Test:
select * from dwd_base_event_log limit 2;
4.2.5 DWD-layer parsing script
Create the script under the bin directory:
vim dwd_base_log.sh
#!/bin/bash
# 定义变量方便修改
APP=gmall
hive=/opt/module/hive/bin/hive
# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
add jar /opt/module/hive/hivefunction-1.0-SNAPSHOT.jar;
create temporary function base_analizer as 'com.atguigu.udf.BaseFieldUDF';
create temporary function flat_analizer as 'com.atguigu.udtf.EventJsonUDTF';
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table "$APP".dwd_base_event_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source ,
os ,
area ,
model ,
brand ,
sdk_version ,
gmail ,
height_width ,
app_time ,
network ,
lng ,
lat ,
event_name ,
event_json ,
server_time
from
(
select
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[0] as mid_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[1] as user_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[2] as version_code,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[3] as version_name,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[4] as lang,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[5] as source,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[6] as os,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[7] as area,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[8] as model,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[9] as brand,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[10] as sdk_version,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[11] as gmail,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[12] as height_width,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[13] as app_time,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[14] as network,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[15] as lng,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[16] as lat,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[17] as ops,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[18] as server_time
from "$APP".ods_event_log where dt='$do_date' and base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la')<>''
) sdk_log lateral view flat_analizer(ops) tmp_k as event_name, event_json;
"
$hive -e "$sql"
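The execute-permission and run steps are not spelled out here; following the same pattern as ods_log.sh and dwd_start_log.sh, they would be roughly:
chmod 777 dwd_base_log.sh
dwd_base_log.sh 2019-03-11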
4) Check the imported data:
select * from dwd_base_event_log where dt='2019-03-11' limit 2;
4.3 DWD-layer event tables
4.3.1 Goods click table
1) DDL:
drop table if exists dwd_display_log;
CREATE EXTERNAL TABLE dwd_display_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`action` string,
`goodsid` string,
`place` string,
`extend1` string,
`category` string,
`server_time` string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_display_log/';
2) Load data:
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_display_log
PARTITION (dt='2019-03-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.goodsid') goodsid,
get_json_object(event_json,'$.kv.place') place,
get_json_object(event_json,'$.kv.extend1') extend1,
get_json_object(event_json,'$.kv.category') category,
server_time
from dwd_base_event_log
where dt='2019-03-10' and event_name='display';
3) Test:
select * from dwd_display_log limit 2;
4.3.2 Goods detail page table
1) DDL:
drop table if exists dwd_newsdetail_log;
CREATE EXTERNAL TABLE dwd_newsdetail_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`entry` string,
`action` string,
`goodsid` string,
`showtype` string,
`news_staytime` string,
`loading_time` string,
`type1` string,
`category` string,
`server_time` string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_newsdetail_log/';
2) Load data:
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_newsdetail_log
PARTITION (dt='2019-03-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.entry') entry,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.goodsid') goodsid,
get_json_object(event_json,'$.kv.showtype') showtype,
get_json_object(event_json,'$.kv.news_staytime') news_staytime,
get_json_object(event_json,'$.kv.loading_time') loading_time,
get_json_object(event_json,'$.kv.type1') type1,
get_json_object(event_json,'$.kv.category') category,
server_time
from dwd_base_event_log
where dt='2019-03-10' and event_name='newsdetail';
3) Test:
hive (gmall)> select * from dwd_newsdetail_log limit 2;
4.3.3 Goods list page table
1) DDL:
hive (gmall)>
drop table if exists dwd_loading_log;
CREATE EXTERNAL TABLE dwd_loading_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`action` string,
`loading_time` string,
`loading_way` string,
`extend1` string,
`extend2` string,
`type` string,
`type1` string,
`server_time` string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_loading_log/';
2) Load data:
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_loading_log
PARTITION (dt='2019-03-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.loading_time') loading_time,
get_json_object(event_json,'$.kv.loading_way') loading_way,
get_json_object(event_json,'$.kv.extend1') extend1,
get_json_object(event_json,'$.kv.extend2') extend2,
get_json_object(event_json,'$.kv.type') type,
get_json_object(event_json,'$.kv.type1') type1,
server_time
from dwd_base_event_log
where dt='2019-03-10' and event_name='loading';
3) Test:
hive (gmall)> select * from dwd_loading_log limit 2;
4.3.4 Ad table
1) DDL:
hive (gmall)>
drop table if exists dwd_ad_log;
CREATE EXTERNAL TABLE dwd_ad_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`entry` string,
`action` string,
`content` string,
`detail` string,
`ad_source` string,
`behavior` string,
`newstype` string,
`show_style` string,
`server_time` string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_ad_log/';
2) Load data:
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_ad_log
PARTITION (dt='2019-03-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.entry') entry,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.content') content,
get_json_object(event_json,'$.kv.detail') detail,
get_json_object(event_json,'$.kv.source') ad_source,
get_json_object(event_json,'$.kv.behavior') behavior,
get_json_object(event_json,'$.kv.newstype') newstype,
get_json_object(event_json,'$.kv.show_style') show_style,
server_time
from dwd_base_event_log
where dt='2019-03-10' and event_name='ad';
3) Test:
hive (gmall)> select * from dwd_ad_log limit 2;
4.3.5 Notification table
1) DDL:
hive (gmall)>
drop table if exists dwd_notification_log;
CREATE EXTERNAL TABLE dwd_notification_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`action` string,
`noti_type` string,
`ap_time` string,
`content` string,
`server_time` string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_notification_log/';
2)导入数据
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_notification_log
PARTITION (dt='2019-03-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.noti_type') noti_type,
get_json_object(event_json,'$.kv.ap_time') ap_time,
get_json_object(event_json,'$.kv.content') content,
server_time
from dwd_base_event_log
where dt='2019-03-10' and event_name='notification';
3)测试
hive (gmall)> select * from dwd_notification_log limit 2;
4.3.6 用户前台活跃表
1)建表语句
hive (gmall)>
drop table if exists dwd_active_foreground_log;
CREATE EXTERNAL TABLE dwd_active_foreground_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`push_id` string,
`access` string,
`server_time` string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_foreground_log/';
2)导入数据
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_active_foreground_log
PARTITION (dt='2019-03-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.push_id') push_id,
get_json_object(event_json,'$.kv.access') access,
server_time
from dwd_base_event_log
where dt='2019-03-10' and event_name='active_foreground';
3)测试
hive (gmall)> select * from dwd_active_foreground_log limit 2;
4.3.7 用户后台活跃表
1)建表语句
hive (gmall)>
drop table if exists dwd_active_background_log;
CREATE EXTERNAL TABLE dwd_active_background_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`active_source` string,
`server_time` string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_background_log/';
2)导入数据
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_active_background_log
PARTITION (dt='2019-03-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.active_source') active_source,
server_time
from dwd_base_event_log
where dt='2019-03-10' and event_name='active_background';
3)测试
hive (gmall)> select * from dwd_active_background_log limit 2;
4.3.8 评论表
1)建表语句
hive (gmall)>
drop table if exists dwd_comment_log;
CREATE EXTERNAL TABLE dwd_comment_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`comment_id` int,
`userid` int,
`p_comment_id` int,
`content` string,
`addtime` string,
`other_id` int,
`praise_count` int,
`reply_count` int,
`server_time` string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_comment_log/';
2)导入数据
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_comment_log
PARTITION (dt='2019-03-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.comment_id') comment_id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.p_comment_id') p_comment_id,
get_json_object(event_json,'$.kv.content') content,
get_json_object(event_json,'$.kv.addtime') addtime,
get_json_object(event_json,'$.kv.other_id') other_id,
get_json_object(event_json,'$.kv.praise_count') praise_count,
get_json_object(event_json,'$.kv.reply_count') reply_count,
server_time
from dwd_base_event_log
where dt='2019-03-10' and event_name='comment';
3)测试
hive (gmall)> select * from dwd_comment_log limit 2;
4.3.9 收藏表
1)建表语句
hive (gmall)>
drop table if exists dwd_favorites_log;
CREATE EXTERNAL TABLE dwd_favorites_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`id` int,
`course_id` int,
`userid` int,
`add_time` string,
`server_time` string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_favorites_log/';
2)导入数据
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_favorites_log
PARTITION (dt='2019-03-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.id') id,
get_json_object(event_json,'$.kv.course_id') course_id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.add_time') add_time,
server_time
from dwd_base_event_log
where dt='2019-03-10' and event_name='favorites';
3)测试
hive (gmall)> select * from dwd_favorites_log limit 2;
4.3.10 点赞表
1)建表语句
hive (gmall)>
drop table if exists dwd_praise_log;
CREATE EXTERNAL TABLE dwd_praise_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`id` string,
`userid` string,
`target_id` string,
`type` string,
`add_time` string,
`server_time` string
)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_praise_log/';
2)导入数据
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_praise_log
PARTITION (dt='2019-03-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.id') id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.target_id') target_id,
get_json_object(event_json,'$.kv.type') type,
get_json_object(event_json,'$.kv.add_time') add_time,
server_time
from dwd_base_event_log
where dt='2019-03-10' and event_name='praise';
3)测试
hive (gmall)> select * from dwd_praise_log limit 2;
4.3.11 错误日志表
1)建表语句
hive (gmall)>
drop table if exists dwd_error_log;
CREATE EXTERNAL TABLE dwd_error_log(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`errorBrief` string,
`errorDetail` string,
`server_time` string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_error_log/';
2)导入数据
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_error_log
PARTITION (dt='2019-03-10')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.errorBrief') errorBrief,
get_json_object(event_json,'$.kv.errorDetail') errorDetail,
server_time
from dwd_base_event_log
where dt='2019-03-10' and event_name='error';
3)测试
hive (gmall)> select * from dwd_error_log limit 2;
4.3.12 DWD层事件表加载数据总脚本
bin目录下创建脚本
vim dwd_event_log.sh
#!/bin/bash
# 定义变量方便修改
APP=gmall
hive=/opt/module/hive/bin/hive
# 如果输入了日期参数则使用该日期;否则取当前时间的前一天
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table "$APP".dwd_display_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.goodsid') goodsid,
get_json_object(event_json,'$.kv.place') place,
get_json_object(event_json,'$.kv.extend1') extend1,
get_json_object(event_json,'$.kv.category') category,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='display';
insert overwrite table "$APP".dwd_newsdetail_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.entry') entry,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.goodsid') goodsid,
get_json_object(event_json,'$.kv.showtype') showtype,
get_json_object(event_json,'$.kv.news_staytime') news_staytime,
get_json_object(event_json,'$.kv.loading_time') loading_time,
get_json_object(event_json,'$.kv.type1') type1,
get_json_object(event_json,'$.kv.category') category,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='newsdetail';
insert overwrite table "$APP".dwd_loading_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.loading_time') loading_time,
get_json_object(event_json,'$.kv.loading_way') loading_way,
get_json_object(event_json,'$.kv.extend1') extend1,
get_json_object(event_json,'$.kv.extend2') extend2,
get_json_object(event_json,'$.kv.type') type,
get_json_object(event_json,'$.kv.type1') type1,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='loading';
insert overwrite table "$APP".dwd_ad_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.entry') entry,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.content') content,
get_json_object(event_json,'$.kv.detail') detail,
get_json_object(event_json,'$.kv.source') ad_source,
get_json_object(event_json,'$.kv.behavior') behavior,
get_json_object(event_json,'$.kv.newstype') newstype,
get_json_object(event_json,'$.kv.show_style') show_style,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='ad';
insert overwrite table "$APP".dwd_notification_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.action') action,
get_json_object(event_json,'$.kv.noti_type') noti_type,
get_json_object(event_json,'$.kv.ap_time') ap_time,
get_json_object(event_json,'$.kv.content') content,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='notification';
insert overwrite table "$APP".dwd_active_foreground_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.push_id') push_id,
get_json_object(event_json,'$.kv.access') access,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='active_foreground';
insert overwrite table "$APP".dwd_active_background_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.active_source') active_source,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='active_background';
insert overwrite table "$APP".dwd_comment_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.comment_id') comment_id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.p_comment_id') p_comment_id,
get_json_object(event_json,'$.kv.content') content,
get_json_object(event_json,'$.kv.addtime') addtime,
get_json_object(event_json,'$.kv.other_id') other_id,
get_json_object(event_json,'$.kv.praise_count') praise_count,
get_json_object(event_json,'$.kv.reply_count') reply_count,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='comment';
insert overwrite table "$APP".dwd_favorites_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.id') id,
get_json_object(event_json,'$.kv.course_id') course_id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.add_time') add_time,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='favorites';
insert overwrite table "$APP".dwd_praise_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.id') id,
get_json_object(event_json,'$.kv.userid') userid,
get_json_object(event_json,'$.kv.target_id') target_id,
get_json_object(event_json,'$.kv.type') type,
get_json_object(event_json,'$.kv.add_time') add_time,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='praise';
insert overwrite table "$APP".dwd_error_log
PARTITION (dt='$do_date')
select
mid_id,
user_id,
version_code,
version_name,
lang,
source,
os,
area,
model,
brand,
sdk_version,
gmail,
height_width,
app_time,
network,
lng,
lat,
get_json_object(event_json,'$.kv.errorBrief') errorBrief,
get_json_object(event_json,'$.kv.errorDetail') errorDetail,
server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='error';
"
$hive -e "$sql"
脚本使用
dwd_event_log.sh 2019-03-11
查询导入结果:
select * from dwd_comment_log where dt='2019-03-11' limit 2;
6.1 DWS层
目标:统计当日、当周、当月活跃的每个设备明细
6.1.1 每日活跃设备明细
1)建表语句
hive (gmall)>
drop table if exists dws_uv_detail_day;
create external table dws_uv_detail_day
(
`mid_id` string COMMENT '设备唯一标识',
`user_id` string COMMENT '用户标识',
`version_code` string COMMENT '程序版本号',
`version_name` string COMMENT '程序版本名',
`lang` string COMMENT '系统语言',
`source` string COMMENT '渠道号',
`os` string COMMENT '安卓系统版本',
`area` string COMMENT '区域',
`model` string COMMENT '手机型号',
`brand` string COMMENT '手机品牌',
`sdk_version` string COMMENT 'sdkVersion',
`gmail` string COMMENT 'gmail',
`height_width` string COMMENT '屏幕宽高',
`app_time` string COMMENT '客户端日志产生时的时间',
`network` string COMMENT '网络模式',
`lng` string COMMENT '经度',
`lat` string COMMENT '纬度'
)
partitioned by(dt string)
stored as parquet
location '/warehouse/gmall/dws/dws_uv_detail_day'
;
2)数据导入
以设备id(mid_id)为key做单日聚合:如果某个设备在一天中出现了多个操作系统、多个系统版本、多个地区或登录过不同账号,则把这些取值用collect_set去重后以'|'拼接,保留在同一行中。
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dws_uv_detail_day
partition(dt='2019-03-10')
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat
from dwd_start_log
where dt='2019-03-10'
group by mid_id;
3)查询导入结果
hive (gmall)> select * from dws_uv_detail_day limit 1;
hive (gmall)> select count(*) from dws_uv_detail_day;
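由于各字段保存的是collect_set去重后再用'|'拼接的结果,如果想粗略检查一天内是否有设备对应了多个user_id,可以用类似下面的查询(仅作校验示例):
hive (gmall)>
select count(*) from dws_uv_detail_day where dt='2019-03-10' and user_id like '%|%';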
6.1.2 每周活跃设备明细
1)建表语句
hive (gmall)>
drop table if exists dws_uv_detail_wk;
create external table dws_uv_detail_wk(
`mid_id` string COMMENT '设备唯一标识',
`user_id` string COMMENT '用户标识',
`version_code` string COMMENT '程序版本号',
`version_name` string COMMENT '程序版本名',
`lang` string COMMENT '系统语言',
`source` string COMMENT '渠道号',
`os` string COMMENT '安卓系统版本',
`area` string COMMENT '区域',
`model` string COMMENT '手机型号',
`brand` string COMMENT '手机品牌',
`sdk_version` string COMMENT 'sdkVersion',
`gmail` string COMMENT 'gmail',
`height_width` string COMMENT '屏幕宽高',
`app_time` string COMMENT '客户端日志产生时的时间',
`network` string COMMENT '网络模式',
`lng` string COMMENT '经度',
`lat` string COMMENT '纬度',
`monday_date` string COMMENT '周一日期',
`sunday_date` string COMMENT '周日日期'
) COMMENT '活跃用户按周明细'
PARTITIONED BY (`wk_dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_uv_detail_wk/';
2)数据导入
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dws_uv_detail_wk partition(wk_dt)
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat,
date_add(next_day('2019-03-10','MO'),-7),
date_add(next_day('2019-03-10','MO'),-1),
concat(date_add( next_day('2019-03-10','MO'),-7), '_' , date_add(next_day('2019-03-10','MO'),-1)
)
from dws_uv_detail_day
where dt>=date_add(next_day('2019-03-10','MO'),-7) and dt<=date_add(next_day('2019-03-10','MO'),-1)
group by mid_id;
3)查询导入结果
hive (gmall)> select * from dws_uv_detail_wk limit 1;
hive (gmall)> select count(*) from dws_uv_detail_wk;
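monday_date、sunday_date和wk_dt都依赖next_day做日期推算:next_day(dt,'MO')返回dt之后的下一个周一,再减7天即得到dt所在周的周一,减1天得到所在周的周日。可以先单独验证一下(2019-03-10恰好是周日):
hive (gmall)>
select
next_day('2019-03-10','MO') next_monday,
date_add(next_day('2019-03-10','MO'),-7) monday_date,
date_add(next_day('2019-03-10','MO'),-1) sunday_date
from dws_uv_detail_day
where dt='2019-03-10'
limit 1;
-- 预期依次返回 2019-03-11、2019-03-04、2019-03-10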
6.1.3 每月活跃设备明细
1)建表语句
hive (gmall)>
drop table if exists dws_uv_detail_mn;
create external table dws_uv_detail_mn(
`mid_id` string COMMENT '设备唯一标识',
`user_id` string COMMENT '用户标识',
`version_code` string COMMENT '程序版本号',
`version_name` string COMMENT '程序版本名',
`lang` string COMMENT '系统语言',
`source` string COMMENT '渠道号',
`os` string COMMENT '安卓系统版本',
`area` string COMMENT '区域',
`model` string COMMENT '手机型号',
`brand` string COMMENT '手机品牌',
`sdk_version` string COMMENT 'sdkVersion',
`gmail` string COMMENT 'gmail',
`height_width` string COMMENT '屏幕宽高',
`app_time` string COMMENT '客户端日志产生时的时间',
`network` string COMMENT '网络模式',
`lng` string COMMENT '经度',
`lat` string COMMENT '纬度'
) COMMENT '活跃用户按月明细'
PARTITIONED BY (`mn` string)
stored as parquet
location '/warehouse/gmall/dws/dws_uv_detail_mn/'
;
2)数据导入
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dws_uv_detail_mn partition(mn)
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat,
date_format('2019-03-10','yyyy-MM')
from dws_uv_detail_day
where date_format(dt,'yyyy-MM') = date_format('2019-03-10','yyyy-MM')
group by mid_id;
3)查询导入结果
hive (gmall)> select * from dws_uv_detail_mn limit 1;
hive (gmall)> select count(*) from dws_uv_detail_mn ;
6.1.4 DWS层加载数据脚本
bin目录下创建脚本
vim dws_uv_log.sh
#!/bin/bash
# 定义变量方便修改
APP=gmall
hive=/opt/module/hive/bin/hive
# 如果输入了日期参数则使用该日期;否则取当前时间的前一天
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table "$APP".dws_uv_detail_day partition(dt='$do_date')
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat
from "$APP".dwd_start_log
where dt='$do_date'
group by mid_id;
insert overwrite table "$APP".dws_uv_detail_wk partition(wk_dt)
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat,
date_add(next_day('$do_date','MO'),-7),
date_add(next_day('$do_date','MO'),-1),
concat(date_add( next_day('$do_date','MO'),-7), '_' , date_add(next_day('$do_date','MO'),-1)
)
from "$APP".dws_uv_detail_day
where dt>=date_add(next_day('$do_date','MO'),-7) and dt<=date_add(next_day('$do_date','MO'),-1)
group by mid_id;
insert overwrite table "$APP".dws_uv_detail_mn partition(mn)
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat,
date_format('$do_date','yyyy-MM')
from "$APP".dws_uv_detail_day
where date_format(dt,'yyyy-MM') = date_format('$do_date','yyyy-MM')
group by mid_id;
"
$hive -e "$sql"
脚本使用:dws_uv_log.sh 2019-03-11
查询结果:select count(*) from dws_uv_detail_day where dt='2019-03-11';
select count(*) from dws_uv_detail_wk;
select count(*) from dws_uv_detail_mn ;
6.2 ADS层
目标:当日、当周、当月活跃设备数
6.2.1 活跃设备数
1)建表语句
hive (gmall)>
drop table if exists ads_uv_count;
create external table ads_uv_count(
`dt` string COMMENT '统计日期',
`day_count` bigint COMMENT '当日用户数量',
`wk_count` bigint COMMENT '当周用户数量',
`mn_count` bigint COMMENT '当月用户数量',
`is_weekend` string COMMENT 'Y,N是否是周末,用于得到本周最终结果',
`is_monthend` string COMMENT 'Y,N是否是月末,用于得到本月最终结果'
) COMMENT '活跃设备数'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_uv_count/'
;
2)导入数据
hive (gmall)>
insert into table ads_uv_count
select
'2019-03-10' dt,
daycount.ct,
wkcount.ct,
mncount.ct,
if(date_add(next_day('2019-03-10','MO'),-1)='2019-03-10','Y','N') ,
if(last_day('2019-03-10')='2019-03-10','Y','N')
from
(
select
'2019-03-10' dt,
count(*) ct
from dws_uv_detail_day
where dt='2019-03-10'
)daycount join
(
select
'2019-03-10' dt,
count (*) ct
from dws_uv_detail_wk
where wk_dt=concat(date_add(next_day('2019-03-10','MO'),-7),'_' ,date_add(next_day('2019-03-10','MO'),-1) )
) wkcount on daycount.dt=wkcount.dt
join
(
select
'2019-03-10' dt,
count (*) ct
from dws_uv_detail_mn
where mn=date_format('2019-03-10','yyyy-MM')
)mncount on daycount.dt=mncount.dt
;
3)查询导入结果
hive (gmall)> select * from ads_uv_count ;
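is_weekend、is_monthend两个标记分别用"所在周周日是否等于统计日期"和"last_day(统计日期)是否等于统计日期"来判断。以2019-03-10为例(它是周日,但不是3月的最后一天),可以这样验证:
hive (gmall)>
select
if(date_add(next_day('2019-03-10','MO'),-1)='2019-03-10','Y','N') is_weekend,
if(last_day('2019-03-10')='2019-03-10','Y','N') is_monthend
from dws_uv_detail_day
where dt='2019-03-10'
limit 1;
-- 预期返回 Y 和 N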
6.2.2 ADS层加载数据脚本
bin目录下创建脚本
vim ads_uv_log.sh
#!/bin/bash
# 定义变量方便修改
APP=gmall
hive=/opt/module/hive/bin/hive
# 如果输入了日期参数则使用该日期;否则取当前时间的前一天
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table "$APP".ads_uv_count
select
'$do_date' dt,
daycount.ct,
wkcount.ct,
mncount.ct,
if(date_add(next_day('$do_date','MO'),-1)='$do_date','Y','N') ,
if(last_day('$do_date')='$do_date','Y','N')
from
(
select
'$do_date' dt,
count(*) ct
from "$APP".dws_uv_detail_day
where dt='$do_date'
)daycount join
(
select
'$do_date' dt,
count (*) ct
from "$APP".dws_uv_detail_wk
where wk_dt=concat(date_add(next_day('$do_date','MO'),-7),'_' ,date_add(next_day('$do_date','MO'),-1) )
) wkcount on daycount.dt=wkcount.dt
join
(
select
'$do_date' dt,
count (*) ct
from "$APP".dws_uv_detail_mn
where mn=date_format('$do_date','yyyy-MM')
)mncount on daycount.dt=mncount.dt;
"
$hive -e "$sql"
脚本使用
ads_uv_log.sh 2019-03-11
查询导入结果
select * from ads_uv_count;
首次联网使用应用的用户。如果一个用户首次打开某APP,那这个用户定义为新增用户;卸载再安装的设备,不会被算作一次新增。新增用户包括日新增用户、周新增用户、月新增用户。
7.1 DWS层(每日新增设备明细表)
1)建表语句
hive (gmall)>
drop table if exists dws_new_mid_day;
create external table dws_new_mid_day
(
`mid_id` string COMMENT '设备唯一标识',
`user_id` string COMMENT '用户标识',
`version_code` string COMMENT '程序版本号',
`version_name` string COMMENT '程序版本名',
`lang` string COMMENT '系统语言',
`source` string COMMENT '渠道号',
`os` string COMMENT '安卓系统版本',
`area` string COMMENT '区域',
`model` string COMMENT '手机型号',
`brand` string COMMENT '手机品牌',
`sdk_version` string COMMENT 'sdkVersion',
`gmail` string COMMENT 'gmail',
`height_width` string COMMENT '屏幕宽高',
`app_time` string COMMENT '客户端日志产生时的时间',
`network` string COMMENT '网络模式',
`lng` string COMMENT '经度',
`lat` string COMMENT '纬度',
`create_date` string comment '创建时间'
) COMMENT '每日新增设备信息'
stored as parquet
location '/warehouse/gmall/dws/dws_new_mid_day/';
2)导入数据
用每日活跃设备明细表(dws_uv_detail_day)left join每日新增设备表(dws_new_mid_day)自身,关联条件是mid_id相等;如果某设备此前从未进入过新增设备表,则右表mid_id为null,说明它是当日新增设备。
hive (gmall)>
insert into table dws_new_mid_day
select
ud.mid_id,
ud.user_id ,
ud.version_code ,
ud.version_name ,
ud.lang ,
ud.source,
ud.os,
ud.area,
ud.model,
ud.brand,
ud.sdk_version,
ud.gmail,
ud.height_width,
ud.app_time,
ud.network,
ud.lng,
ud.lat,
'2019-03-10'
from dws_uv_detail_day ud left join dws_new_mid_day nm on ud.mid_id=nm.mid_id
where ud.dt='2019-03-10' and nm.mid_id is null;
3)查询导入数据
hive (gmall)> select count(*) from dws_new_mid_day ;
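dws_new_mid_day不做分区,靠left join自身来保证同一设备只会被记为一次新增。可以用下面的查询确认没有设备被重复写入(结果应为空):
hive (gmall)>
select mid_id, count(*) from dws_new_mid_day group by mid_id having count(*)>1;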
7.2 ADS层(每日新增设备表)
1)建表语句
hive (gmall)>
drop table if exists ads_new_mid_count;
create external table ads_new_mid_count
(
`create_date` string comment '创建时间' ,
`new_mid_count` BIGINT comment '新增设备数量'
) COMMENT '每日新增设备信息数量'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_new_mid_count/';
2)导入数据
hive (gmall)>
insert into table ads_new_mid_count
select
create_date,
count(*)
from dws_new_mid_day
where create_date='2019-03-10'
group by create_date;
3)查询导入数据
hive (gmall)> select * from ads_new_mid_count;
8.2.1 DWS层(每日留存用户明细表)
1)建表语句
hive (gmall)>
drop table if exists dws_user_retention_day;
create external table dws_user_retention_day
(
`mid_id` string COMMENT '设备唯一标识',
`user_id` string COMMENT '用户标识',
`version_code` string COMMENT '程序版本号',
`version_name` string COMMENT '程序版本名',
`lang` string COMMENT '系统语言',
`source` string COMMENT '渠道号',
`os` string COMMENT '安卓系统版本',
`area` string COMMENT '区域',
`model` string COMMENT '手机型号',
`brand` string COMMENT '手机品牌',
`sdk_version` string COMMENT 'sdkVersion',
`gmail` string COMMENT 'gmail',
`height_width` string COMMENT '屏幕宽高',
`app_time` string COMMENT '客户端日志产生时的时间',
`network` string COMMENT '网络模式',
`lng` string COMMENT '经度',
`lat` string COMMENT '纬度',
`create_date` string comment '设备新增时间',
`retention_day` int comment '截止当前日期留存天数'
) COMMENT '每日用户留存情况'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_user_retention_day/'
;
2)导入数据(每天计算前1天的新用户访问留存明细)
hive (gmall)>
insert overwrite table dws_user_retention_day
partition(dt="2019-03-11")
select
nm.mid_id,
nm.user_id ,
nm.version_code ,
nm.version_name ,
nm.lang ,
nm.source,
nm.os,
nm.area,
nm.model,
nm.brand,
nm.sdk_version,
nm.gmail,
nm.height_width,
nm.app_time,
nm.network,
nm.lng,
nm.lat,
nm.create_date,
1 retention_day
from dws_uv_detail_day ud join dws_new_mid_day nm on ud.mid_id =nm.mid_id
where ud.dt='2019-03-11' and nm.create_date=date_add('2019-03-11',-1);
3)查询导入数据(每天计算前1天的新用户访问留存明细)
hive (gmall)> select count(*) from dws_user_retention_day;
8.2.2 DWS层(1,2,3,n天留存用户明细表)
1)导入数据(每天计算前1,2,3,n天的新用户访问留存明细)
hive (gmall)>
insert overwrite table dws_user_retention_day
partition(dt="2019-03-11")
select
nm.mid_id,
nm.user_id,
nm.version_code,
nm.version_name,
nm.lang,
nm.source,
nm.os,
nm.area,
nm.model,
nm.brand,
nm.sdk_version,
nm.gmail,
nm.height_width,
nm.app_time,
nm.network,
nm.lng,
nm.lat,
nm.create_date,
1 retention_day
from dws_uv_detail_day ud join dws_new_mid_day nm on ud.mid_id =nm.mid_id
where ud.dt='2019-03-11' and nm.create_date=date_add('2019-03-11',-1)
union all
select
nm.mid_id,
nm.user_id ,
nm.version_code ,
nm.version_name ,
nm.lang ,
nm.source,
nm.os,
nm.area,
nm.model,
nm.brand,
nm.sdk_version,
nm.gmail,
nm.height_width,
nm.app_time,
nm.network,
nm.lng,
nm.lat,
nm.create_date,
2 retention_day
from dws_uv_detail_day ud join dws_new_mid_day nm on ud.mid_id =nm.mid_id
where ud.dt='2019-03-11' and nm.create_date=date_add('2019-03-11',-2)
union all
select
nm.mid_id,
nm.user_id ,
nm.version_code ,
nm.version_name ,
nm.lang ,
nm.source,
nm.os,
nm.area,
nm.model,
nm.brand,
nm.sdk_version,
nm.gmail,
nm.height_width,
nm.app_time,
nm.network,
nm.lng,
nm.lat,
nm.create_date,
3 retention_day
from dws_uv_detail_day ud join dws_new_mid_day nm on ud.mid_id =nm.mid_id
where ud.dt='2019-03-11' and nm.create_date=date_add('2019-03-11',-3);
2)查询导入数据(每天计算前1,2,3天的新用户访问留存明细)
hive (gmall)> select retention_day , count(*) from dws_user_retention_day group by retention_day;
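如果要统计的留存天数继续增加,上面union all的写法会越来越长。一个可行的思路是把"前n天"做成循环变量,用shell拼接SQL后交给hive执行。下面是按这个思路写的一个简单示意(脚本名、循环范围均为假设,并非本项目的正式脚本):
#!/bin/bash
# 示意:循环拼接前1~3天留存明细的SQL
APP=gmall
hive=/opt/module/hive/bin/hive
do_date=$1
sql="insert overwrite table ${APP}.dws_user_retention_day partition(dt='$do_date') "
for i in 1 2 3
do
part="select nm.mid_id, nm.user_id, nm.version_code, nm.version_name, nm.lang, nm.source,
nm.os, nm.area, nm.model, nm.brand, nm.sdk_version, nm.gmail, nm.height_width,
nm.app_time, nm.network, nm.lng, nm.lat, nm.create_date, $i retention_day
from ${APP}.dws_uv_detail_day ud join ${APP}.dws_new_mid_day nm on ud.mid_id=nm.mid_id
where ud.dt='$do_date' and nm.create_date=date_add('$do_date',-$i)"
if [ $i -lt 3 ]; then
sql="$sql $part union all "
else
sql="$sql $part;"
fi
done
$hive -e "$sql"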
8.3 ADS层
8.3.1 留存用户数
1)建表语句
hive (gmall)>
drop table if exists ads_user_retention_day_count;
create external table ads_user_retention_day_count
(
`create_date` string comment '设备新增日期',
`retention_day` int comment '截止当前日期留存天数',
`retention_count` bigint comment '留存数量'
) COMMENT '每日用户留存情况'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_retention_day_count/';
2)导入数据
hive (gmall)>
insert into table ads_user_retention_day_count
select
create_date,
retention_day,
count(*) retention_count
from dws_user_retention_day
where dt='2019-03-11'
group by create_date,retention_day;
3)查询导入数据
hive (gmall)> select * from ads_user_retention_day_count;
8.3.2 留存用户比率
1)建表语句
hive (gmall)>
drop table if exists ads_user_retention_day_rate;
create external table ads_user_retention_day_rate
(
`stat_date` string comment '统计日期',
`create_date` string comment '设备新增日期',
`retention_day` int comment '截止当前日期留存天数',
`retention_count` bigint comment '留存数量',
`new_mid_count` bigint comment '当日设备新增数量',
`retention_ratio` decimal(10,2) comment '留存率'
) COMMENT '每日用户留存情况'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_retention_day_rate/';
2)导入数据
hive (gmall)>
insert into table ads_user_retention_day_rate
select
'2019-03-11',
ur.create_date,
ur.retention_day,
ur.retention_count,
nc.new_mid_count,
ur.retention_count/nc.new_mid_count*100
from
(
select
create_date,
retention_day,
count(*) retention_count
from dws_user_retention_day
where dt='2019-03-11'
group by create_date,retention_day
) ur join ads_new_mid_count nc on nc.create_date=ur.create_date;
3)查询导入数据
hive (gmall)>select * from ads_user_retention_day_rate;
为了分析沉默用户、本周回流用户数、流失用户、最近连续3周活跃用户、最近七天内连续三天活跃用户数,需要提前准备2019-03-12和2019-03-20两天的数据。
1)2019-03-12数据准备
(1)修改日志时间
dt.sh 2019-03-12
(2)启动集群
cluster.sh start
(3)生成日志数据
lg.sh
(4)将HDFS数据导入到ODS层
ods_log.sh 2019-03-12
(5)将ODS数据导入到DWD层
dwd_start_log.sh 2019-03-12
dwd_base_log.sh 2019-03-12
dwd_event_log.sh 2019-03-12
(6)将DWD数据导入到DWS层
dws_uv_log.sh 2019-03-12
(7)验证
hive (gmall)> select * from dws_uv_detail_day where dt='2019-03-12' limit 2;
2)2019-03-20数据准备
(1)修改日志时间
dt.sh 2019-03-20
(2)生成日志数据
lg.sh
(3)将HDFS数据导入到ODS层
ods_log.sh 2019-03-20
(4)将ODS数据导入到DWD层
dwd_start_log.sh 2019-03-20
dwd_base_log.sh 2019-03-20
dwd_event_log.sh 2019-03-20
(5)将DWD数据导入到DWS层
dws_uv_log.sh 2019-03-20
(6)验证
hive (gmall)> select * from dws_uv_detail_day where dt='2019-03-20' limit 2;
沉默用户:指的是只在安装当天启动过,且启动时间是在一周前
10.1 DWS层
使用日活明细表dws_uv_detail_day作为DWS层数据
10.2 ADS层
1)建表语句
hive (gmall)>
drop table if exists ads_slient_count;
create external table ads_slient_count(
`dt` string COMMENT '统计日期',
`slient_count` bigint COMMENT '沉默设备数'
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_slient_count';
2)导入2019-03-20数据
hive (gmall)>
insert into table ads_slient_count
select
'2019-03-20' dt,
count(*) slient_count
from
(
select mid_id
from dws_uv_detail_day
where dt<='2019-03-20'
group by mid_id
having count(*)=1 and min(dt)<=date_add('2019-03-20',-7)
)t1;
3)查询导入数据
hive (gmall)> select * from ads_slient_count;
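判定条件是:该设备总共只启动过一次(count(*)=1),且这次启动不晚于统计日期的7天前(min(dt)<=date_add('2019-03-20',-7))。可以先确认一下这个边界日期:
hive (gmall)>
select date_add('2019-03-20',-7) from dws_uv_detail_day limit 1;
-- 返回 2019-03-13,即只在2019-03-13及之前启动过一次的设备才算沉默设备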
10.3 编写脚本
bin目录下创建脚本
vim ads_slient_log.sh
#!/bin/bash
hive=/opt/module/hive/bin/hive
APP=gmall
if [ -n "$1" ];then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
echo "-----------导入日期$do_date-----------"
sql="
insert into table "$APP".ads_slient_count
select
'$do_date' dt,
count(*) slient_count
from
(
select
mid_id
from "$APP".dws_uv_detail_day
where dt<='$do_date'
group by mid_id
having count(*)=1 and min(dt)<=date_add('$do_date',-7)
)t1;"
$hive -e "$sql"
脚本使用
ads_slient_log.sh 2019-03-20
本周回流=本周活跃-本周新增-上周活跃
11.1 DWS层
使用日活明细表dws_uv_detail_day作为DWS层数据
11.2 ADS层
1)建表语句
hive (gmall)>
drop table if exists ads_back_count;
create external table ads_back_count(
`dt` string COMMENT '统计日期',
`wk_dt` string COMMENT '统计日期所在周',
`wastage_count` bigint COMMENT '回流设备数'
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_back_count';
2)导入数据:
hive (gmall)>
insert into table ads_back_count
select
'2019-03-20' dt,
concat(date_add(next_day('2019-03-20','MO'),-7),'_',date_add(next_day('2019-03-20','MO'),-1)) wk_dt,
count(*)
from
(
select t1.mid_id
from
(
select mid_id
from dws_uv_detail_wk
where wk_dt=concat(date_add(next_day('2019-03-20','MO'),-7),'_',date_add(next_day('2019-03-20','MO'),-1))
)t1
left join
(
select mid_id
from dws_new_mid_day
where create_date<=date_add(next_day('2019-03-20','MO'),-1) and create_date>=date_add(next_day('2019-03-20','MO'),-7)
)t2
on t1.mid_id=t2.mid_id
left join
(
select mid_id
from dws_uv_detail_wk
where wk_dt=concat(date_add(next_day('2019-03-20','MO'),-7*2),'_',date_add(next_day('2019-03-20','MO'),-7-1))
)t3
on t1.mid_id=t3.mid_id
where t2.mid_id is null and t3.mid_id is null
)t4;
3)查询结果
hive (gmall)> select * from ads_back_count;
11.3 编写脚本
1)bin目录下创建脚本
vim ads_back_log.sh
#!/bin/bash
if [ -n "$1" ];then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
hive=/opt/module/hive/bin/hive
APP=gmall
echo "-----------导入日期$do_date-----------"
sql="
insert into table "$APP".ads_back_count
select
'$do_date' dt,
concat(date_add(next_day('$do_date','MO'),-7),'_',date_add(next_day('$do_date','MO'),-1)) wk_dt,
count(*)
from
(
select t1.mid_id
from
(
select mid_id
from "$APP".dws_uv_detail_wk
where wk_dt=concat(date_add(next_day('$do_date','MO'),-7),'_',date_add(next_day('$do_date','MO'),-1))
)t1
left join
(
select mid_id
from "$APP".dws_new_mid_day
where create_date<=date_add(next_day('$do_date','MO'),-1) and create_date>=date_add(next_day('$do_date','MO'),-7)
)t2
on t1.mid_id=t2.mid_id
left join
(
select mid_id
from "$APP".dws_uv_detail_wk
where wk_dt=concat(date_add(next_day('$do_date','MO'),-7*2),'_',date_add(next_day('$do_date','MO'),-7-1))
)t3
on t1.mid_id=t3.mid_id
where t2.mid_id is null and t3.mid_id is null
)t4;
"
$hive -e "$sql"
脚本使用
ads_back_log.sh 2019-03-20
查询结果
select * from ads_back_count;
流失用户:最近7天未登录我们称之为流失用户
12.1 DWS层
使用日活明细表dws_uv_detail_day作为DWS层数据
12.2 ADS层
1)建表语句
hive (gmall)>
drop table if exists ads_wastage_count;
create external table ads_wastage_count(
`dt` string COMMENT '统计日期',
`wastage_count` bigint COMMENT '流失设备数'
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_wastage_count';
2)导入2019-03-20数据
hive (gmall)>
insert into table ads_wastage_count
select
'2019-03-20',
count(*)
from
(
select mid_id
from dws_uv_detail_day
group by mid_id
having max(dt)<=date_add('2019-03-20',-7)
)t1;
12.3 编写脚本
bin目录下创建脚本
vim ads_wastage_log.sh
#!/bin/bash
if [ -n "$1" ];then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
hive=/opt/module/hive/bin/hive
APP=gmall
echo "-----------导入日期$do_date-----------"
sql="
insert into table "$APP".ads_wastage_count
select
'$do_date',
count(*)
from
(
select mid_id
from "$APP".dws_uv_detail_day
group by mid_id
having max(dt)<=date_add('$do_date',-7)
)t1;
"
$hive -e "$sql"
脚本使用
ads_wastage_log.sh 2019-03-20
查询结果
select * from ads_wastage_count;
最近3周连续活跃的用户:通常是周一对前3周的数据做统计,该数据一周计算一次。
13.1 DWS层
使用周活明细表dws_uv_detail_wk作为DWS层数据
13.2 ADS层
1)建表语句
drop table if exists ads_continuity_wk_count;
create external table ads_continuity_wk_count(
`dt` string COMMENT '统计日期,一般用结束周周日日期,如果每天计算一次,可用当天日期',
`wk_dt` string COMMENT '持续时间',
`continuity_count` bigint
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_continuity_wk_count';
13.3 编写脚本
bin目录下创建脚本
vim ads_continuity_wk_log.sh
#!/bin/bash
if [ -n "$1" ];then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
hive=/opt/module/hive/bin/hive
APP=gmall
echo "-----------导入日期$do_date-----------"
sql="
insert into table "$APP".ads_continuity_wk_count
select
'$do_date',
concat(date_add(next_day('$do_date','MO'),-7*3),'_',date_add(next_day('$do_date','MO'),-1)),
count(*)
from
(
select mid_id
from "$APP".dws_uv_detail_wk
where wk_dt>=concat(date_add(next_day('$do_date','MO'),-7*3),'_',date_add(next_day('$do_date','MO'),-7*2-1))
and wk_dt<=concat(date_add(next_day('$do_date','MO'),-7),'_',date_add(next_day('$do_date','MO'),-1))
group by mid_id
having count(*)=3
)t1;"
$hive -e "$sql"
ads_continuity_wk_log.sh 2019-03-20
select * from ads_continuity_wk_count;
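where条件里的两个wk_dt边界分别是最近3个完整周中最早一周和最近一周的分区名,字符串比较即可覆盖这3个周分区。以do_date=2019-03-20为例,可以先看一下这两个边界值:
hive (gmall)>
select
concat(date_add(next_day('2019-03-20','MO'),-7*3),'_',date_add(next_day('2019-03-20','MO'),-7*2-1)) first_wk,
concat(date_add(next_day('2019-03-20','MO'),-7),'_',date_add(next_day('2019-03-20','MO'),-1)) last_wk
from dws_uv_detail_wk
limit 1;
-- 分别为 2019-03-04_2019-03-10 和 2019-03-18_2019-03-24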
最近七天内连续三天活跃用户数:统计最近7天内曾经连续3天及以上活跃的设备数,每天计算一次。
14.1 DWS层
使用日活明细表dws_uv_detail_day作为DWS层数据
14.2 ADS层
1)建表语句
hive (gmall)>
drop table if exists ads_continuity_uv_count;
create external table ads_continuity_uv_count(
`dt` string COMMENT '统计日期',
`wk_dt` string COMMENT '最近7天日期',
`continuity_count` bigint
) COMMENT '连续活跃设备数'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_continuity_uv_count';
2)写出导入数据的SQL语句
hive (gmall)>
insert into table ads_continuity_uv_count
select
'2019-03-12',
concat(date_add('2019-03-12',-6),'_','2019-03-12'),
count(*)
from
(
select mid_id
from
(
select mid_id
from
(
select
mid_id,
date_sub(dt,rank) date_dif
from
(
select
mid_id,
dt,
rank() over(partition by mid_id order by dt) rank
from dws_uv_detail_day
where dt>=date_add('2019-03-12',-6) and dt<='2019-03-12'
)t1
)t2
group by mid_id,date_dif
having count(*)>=3
)t3
group by mid_id
)t4;
3)查询导入结果
hive (gmall)> select * from ads_continuity_uv_count;
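这里用到的是"日期减排名"的技巧:对同一设备按活跃日期升序打rank,日期连续时dt减rank得到的差值相同,因此按(mid_id, 差值)分组后count(*)>=3就表示存在连续3天活跃。用一个设备的三条假设数据演示中间结果(仅为说明,不是真实日志):
dt            rank    date_sub(dt,rank)
2019-03-10    1       2019-03-09
2019-03-11    2       2019-03-09
2019-03-12    3       2019-03-09
三行差值相同,分组计数为3,满足count(*)>=3;外层再按mid_id去重,避免同一设备存在多段连续日期时被重复统计。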
14.3 编写脚本
bin目录下创建脚本
vim ads_continuity_log.sh
#!/bin/bash
if [ -n "$1" ];then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
hive=/opt/module/hive/bin/hive
APP=gmall
echo "-----------导入日期$do_date-----------"
sql="
insert into table "$APP".ads_continuity_uv_count
select
'$do_date',
concat(date_add('$do_date',-6),'_','$do_date') dt,
count(*)
from
(
select mid_id
from
(
select mid_id
from
(
select
mid_id,
date_sub(dt,rank) date_diff
from
(
select
mid_id,
dt,
rank() over(partition by mid_id order by dt) rank
from "$APP".dws_uv_detail_day
where dt>=date_add('$do_date',-6) and dt<='$do_date'
)t1
)t2
group by mid_id,date_diff
having count(*)>=3
)t3
group by mid_id
)t4;
"
$hive -e "$sql"
ads_continuity_log.sh 2019-03-12
select * from ads_continuity_uv_count;
数据同步策略的类型包括:全量表、增量表、新增及变化表、拉链表
全量表:存储完整的数据。
增量表:存储新增加的数据。
新增及变化表:存储新增加的数据和变化的数据。
拉链表:对新增及变化表做定期合并
3.0 配置Hadoop支持Snappy压缩
1)将编译后支持Snappy压缩的Hadoop jar包解压缩,并将lib/native目录中所有文件上传到node1的/opt/module/hadoop-2.7.2/lib/native目录,并分发。
2)重新启动Hadoop。
3)检查支持的压缩方式
hadoop checknative
3.1 业务数据生成(用一系列SQL脚本生成假数据,略)
3.2 业务数据导入数仓
3.2.1 Sqoop安装
1)上传安装包
put E:\工作\电商数仓项目\2.资料\01_jars\08_sqoop\sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
2)解压安装
3)修改名称
mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha/ sqoop
4)重命名配置文件
mv sqoop-env-template.sh sqoop-env.sh
修改配置文件:
export HADOOP_COMMON_HOME=/opt/module/hadoop-2.7.2
export HADOOP_MAPRED_HOME=/opt/module/hadoop-2.7.2
export HIVE_HOME=/opt/module/hive
export ZOOKEEPER_HOME=/opt/module/zookeeper-3.4.10
export ZOOCFGDIR=/opt/module/zookeeper-3.4.10
export HBASE_HOME=/opt/module/hbase
5)拷贝jdbc驱动到sqoop的lib目录下
cp mysql-connector-java-5.1.27-bin.jar /opt/module/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/lib/
6)验证:
bin/sqoop help
7)测试Sqoop是否能够成功连接数据库
bin/sqoop list-databases --connect jdbc:mysql://node1:3306/ --username root --password 123456
3.2.2 Sqoop导入命令
/opt/module/sqoop/bin/sqoop import
--connect
--username
--password
--target-dir
--delete-target-dir
--num-mappers
--fields-terminated-by
--query "$2"' and $CONDITIONS;'
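把上面的参数拼起来,一次最简单的导入大致如下(库名、表名、日期只是示意,实际导入统一使用3.2.4的脚本):
/opt/module/sqoop/bin/sqoop import \
--connect jdbc:mysql://node1:3306/gmall \
--username root \
--password 123456 \
--target-dir /origin_data/gmall/db/user_info/2019-03-10 \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--query "select id, name from user_info where 1=1 and \$CONDITIONS"
其中$CONDITIONS是Sqoop按mapper切分数据时使用的占位条件,使用--query方式时必须带上。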
3.2.4 Sqoop定时导入脚本
bin目录下创建脚本sqoop_import.sh
vim sqoop_import.sh
#!/bin/bash
db_date=$2
echo $db_date
db_name=gmall
import_data() {
/opt/module/sqoop/bin/sqoop import \
--connect jdbc:mysql://node1:3306/$db_name \
--username root \
--password 123456 \
--target-dir /origin_data/$db_name/db/$1/$db_date \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--query "$2"' and $CONDITIONS;'
}
import_sku_info(){
import_data "sku_info" "select
id, spu_id, price, sku_name, sku_desc, weight, tm_id,
category3_id, create_time
from sku_info where 1=1"
}
import_user_info(){
import_data "user_info" "select
id, name, birthday, gender, email, user_level,
create_time
from user_info where 1=1"
}
import_base_category1(){
import_data "base_category1" "select
id, name from base_category1 where 1=1"
}
import_base_category2(){
import_data "base_category2" "select
id, name, category1_id from base_category2 where 1=1"
}
import_base_category3(){
import_data "base_category3" "select id, name, category2_id from base_category3 where 1=1"
}
import_order_detail(){
import_data "order_detail" "select
od.id,
order_id,
user_id,
sku_id,
sku_name,
order_price,
sku_num,
o.create_time
from order_info o, order_detail od
where o.id=od.order_id
and DATE_FORMAT(o.create_time,'%Y-%m-%d')='$db_date'"
}
import_payment_info(){
import_data "payment_info" "select
id,
out_trade_no,
order_id,
user_id,
alipay_trade_no,
total_amount,
subject,
payment_type,
payment_time
from payment_info
where DATE_FORMAT(payment_time,'%Y-%m-%d')='$db_date'"
}
import_order_info(){
import_data "order_info" "select
id,
total_amount,
order_status,
user_id,
payment_way,
out_trade_no,
create_time,
operate_time
from order_info
where (DATE_FORMAT(create_time,'%Y-%m-%d')='$db_date' or DATE_FORMAT(operate_time,'%Y-%m-%d')='$db_date')"
}
case $1 in
"base_category1")
import_base_category1
;;
"base_category2")
import_base_category2
;;
"base_category3")
import_base_category3
;;
"order_info")
import_order_info
;;
"order_detail")
import_order_detail
;;
"sku_info")
import_sku_info
;;
"user_info")
import_user_info
;;
"payment_info")
import_payment_info
;;
"all")
import_base_category1
import_base_category2
import_base_category3
import_order_info
import_order_detail
import_sku_info
import_user_info
import_payment_info
;;
esac
增加脚本执行权限
chmod 777 sqoop_import.sh
执行脚本导入数据
sqoop_import.sh all 2019-03-10
4)在SQLyog中生成2019年3月11日数据
CALL init_data('2019-03-11',1000,200,300,TRUE);
5)执行脚本导入数据
sqoop_import.sh all 2019-03-11
3.3 ODS层
完全仿照业务数据库中的表字段,一模一样地创建ODS层对应表。
3.3.1 创建订单表
hive (gmall)>
drop table if exists ods_order_info;
create external table ods_order_info (
`id` string COMMENT '订单编号',
`total_amount` decimal(10,2) COMMENT '订单金额',
`order_status` string COMMENT '订单状态',
`user_id` string COMMENT '用户id',
`payment_way` string COMMENT '支付方式',
`out_trade_no` string COMMENT '支付流水号',
`create_time` string COMMENT '创建时间',
`operate_time` string COMMENT '操作时间'
) COMMENT '订单表'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ods/ods_order_info/'
;
3.3.2 创建订单详情表
hive (gmall)>
drop table if exists ods_order_detail;
create external table ods_order_detail(
`id` string COMMENT '订单编号',
`order_id` string COMMENT '订单号',
`user_id` string COMMENT '用户id',
`sku_id` string COMMENT '商品id',
`sku_name` string COMMENT '商品名称',
`order_price` string COMMENT '商品价格',
`sku_num` string COMMENT '商品数量',
`create_time` string COMMENT '创建时间'
) COMMENT '订单明细表'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ods/ods_order_detail/'
;
3.3.3 创建商品表
hive (gmall)>
drop table if exists ods_sku_info;
create external table ods_sku_info(
`id` string COMMENT 'skuId',
`spu_id` string COMMENT 'spuid',
`price` decimal(10,2) COMMENT '价格',
`sku_name` string COMMENT '商品名称',
`sku_desc` string COMMENT '商品描述',
`weight` string COMMENT '重量',
`tm_id` string COMMENT '品牌id',
`category3_id` string COMMENT '品类id',
`create_time` string COMMENT '创建时间'
) COMMENT '商品表'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ods/ods_sku_info/'
;
3.3.4 创建用户表
hive (gmall)>
drop table if exists ods_user_info;
create external table ods_user_info(
`id` string COMMENT '用户id',
`name` string COMMENT '姓名',
`birthday` string COMMENT '生日',
`gender` string COMMENT '性别',
`email` string COMMENT '邮箱',
`user_level` string COMMENT '用户等级',
`create_time` string COMMENT '创建时间'
) COMMENT '用户信息'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ods/ods_user_info/'
;
3.3.5 创建商品一级分类表
hive (gmall)>
drop table if exists ods_base_category1;
create external table ods_base_category1(
`id` string COMMENT 'id',
`name` string COMMENT '名称'
) COMMENT '商品一级分类'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ods/ods_base_category1/'
;
3.3.6 创建商品二级分类表
hive (gmall)>
drop table if exists ods_base_category2;
create external table ods_base_category2(
`id` string COMMENT ' id',
`name` string COMMENT '名称',
category1_id string COMMENT '一级品类id'
) COMMENT '商品二级分类'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ods/ods_base_category2/'
;
3.3.7 创建商品三级分类表
hive (gmall)>
drop table if exists ods_base_category3;
create external table ods_base_category3(
`id` string COMMENT ' id',
`name` string COMMENT '名称',
category2_id string COMMENT '二级品类id'
) COMMENT '商品三级分类'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ods/ods_base_category3/'
;
3.3.8 创建支付流水表
hive (gmall)>
drop table if exists ods_payment_info;
create external table ods_payment_info(
`id` bigint COMMENT '编号',
`out_trade_no` string COMMENT '对外业务编号',
`order_id` string COMMENT '订单编号',
`user_id` string COMMENT '用户编号',
`alipay_trade_no` string COMMENT '支付宝交易流水编号',
`total_amount` decimal(16,2) COMMENT '支付金额',
`subject` string COMMENT '交易内容',
`payment_type` string COMMENT '支付类型',
`payment_time` string COMMENT '支付时间'
) COMMENT '支付流水表'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ods/ods_payment_info/'
;
3.3.9 ODS层数据导入脚本
1)bin目录下创建脚本ods_db.sh
ods_db.sh
#!/bin/bash
APP=gmall
hive=/opt/module/hive/bin/hive
# 如果输入了日期参数则使用该日期;否则取当前时间的前一天
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
load data inpath '/origin_data/$APP/db/order_info/$do_date' OVERWRITE into table "$APP".ods_order_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/order_detail/$do_date' OVERWRITE into table "$APP".ods_order_detail partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/sku_info/$do_date' OVERWRITE into table "$APP".ods_sku_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/user_info/$do_date' OVERWRITE into table "$APP".ods_user_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/payment_info/$do_date' OVERWRITE into table "$APP".ods_payment_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/base_category1/$do_date' OVERWRITE into table "$APP".ods_base_category1 partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/base_category2/$do_date' OVERWRITE into table "$APP".ods_base_category2 partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/base_category3/$do_date' OVERWRITE into table "$APP".ods_base_category3 partition(dt='$do_date');
"
$hive -e "$sql"
执行脚本导入数据
ods_db.sh 2019-03-10
ods_db.sh 2019-03-11
查询导入数据
select * from ods_order_info where dt='2019-03-10' limit 1;
select * from ods_order_info where dt='2019-03-11' limit 1;
3.4 DWD层
对ODS层数据进行判空过滤。对商品分类表进行维度退化(降维)。
3.4.1 创建订单表
hive (gmall)>
drop table if exists dwd_order_info;
create external table dwd_order_info (
`id` string COMMENT '',
`total_amount` decimal(10,2) COMMENT '',
`order_status` string COMMENT ' 1 2 3 4 5',
`user_id` string COMMENT 'id',
`payment_way` string COMMENT '',
`out_trade_no` string COMMENT '',
`create_time` string COMMENT '',
`operate_time` string COMMENT ''
)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_order_info/'
tblproperties ("parquet.compression"="snappy")
;
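建好表后,导入时只需在对应的ODS表上做判空过滤再写入动态分区即可,例如订单表(3.4.6的脚本会对所有表统一处理,这里先单独列出一条便于理解):
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_order_info partition(dt)
select * from ods_order_info
where dt='2019-03-10' and id is not null;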
3.4.2 创建订单详情表
hive (gmall)>
drop table if exists dwd_order_detail;
create external table dwd_order_detail(
`id` string COMMENT '',
`order_id` string COMMENT '',
`user_id` string COMMENT 'id',
`sku_id` string COMMENT 'id',
`sku_name` string COMMENT '',
`order_price` string COMMENT '',
`sku_num` string COMMENT '',
`create_time` string COMMENT ''
)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_order_detail/'
tblproperties ("parquet.compression"="snappy")
;
3.4.3 创建用户表
hive (gmall)>
drop table if exists dwd_user_info;
create external table dwd_user_info(
`id` string COMMENT 'id',
`name` string COMMENT '',
`birthday` string COMMENT '',
`gender` string COMMENT '',
`email` string COMMENT '',
`user_level` string COMMENT '',
`create_time` string COMMENT ''
)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_user_info/'
tblproperties ("parquet.compression"="snappy")
;
3.4.4 创建支付流水表
hive (gmall)>
drop table if exists dwd_payment_info;
create external table dwd_payment_info(
`id` bigint COMMENT '',
`out_trade_no` string COMMENT '',
`order_id` string COMMENT '',
`user_id` string COMMENT '',
`alipay_trade_no` string COMMENT '',
`total_amount` decimal(16,2) COMMENT '',
`subject` string COMMENT '',
`payment_type` string COMMENT '',
`payment_time` string COMMENT ''
)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_payment_info/'
tblproperties ("parquet.compression"="snappy")
;
3.4.5 创建商品表(增加分类)
hive (gmall)>
drop table if exists dwd_sku_info;
create external table dwd_sku_info(
`id` string COMMENT 'skuId',
`spu_id` string COMMENT 'spuid',
`price` decimal(10,2) COMMENT '',
`sku_name` string COMMENT '',
`sku_desc` string COMMENT '',
`weight` string COMMENT '',
`tm_id` string COMMENT 'id',
`category3_id` string COMMENT '3id',
`category2_id` string COMMENT '2id',
`category1_id` string COMMENT '1id',
`category3_name` string COMMENT '3',
`category2_name` string COMMENT '2',
`category1_name` string COMMENT '1',
`create_time` string COMMENT ''
)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_sku_info/'
tblproperties ("parquet.compression"="snappy")
;
3.4.6 DWD层数据导入脚本
bin目录下创建脚本dwd_db.sh
vim dwd_db.sh
#!/bin/bash
# 定义变量方便修改
APP=gmall
hive=/opt/module/hive/bin/hive
# 如果输入了日期参数则使用该日期;否则取当前时间的前一天
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table "$APP".dwd_order_info partition(dt)
select * from "$APP".ods_order_info
where dt='$do_date' and id is not null;
insert overwrite table "$APP".dwd_order_detail partition(dt)
select * from "$APP".ods_order_detail
where dt='$do_date' and id is not null;
insert overwrite table "$APP".dwd_user_info partition(dt)
select * from "$APP".ods_user_info
where dt='$do_date' and id is not null;
insert overwrite table "$APP".dwd_payment_info partition(dt)
select * from "$APP".ods_payment_info
where dt='$do_date' and id is not null;
insert overwrite table "$APP".dwd_sku_info partition(dt)
select
sku.id,
sku.spu_id,
sku.price,
sku.sku_name,
sku.sku_desc,
sku.weight,
sku.tm_id,
sku.category3_id,
c2.id category2_id,
c1.id category1_id,
c3.name category3_name,
c2.name category2_name,
c1.name category1_name,
sku.create_time,
sku.dt
from
"$APP".ods_sku_info sku
join "$APP".ods_base_category3 c3 on sku.category3_id=c3.id
join "$APP".ods_base_category2 c2 on c3.category2_id=c2.id
join "$APP".ods_base_category1 c1 on c2.category1_id=c1.id
where sku.dt='$do_date' and c2.dt='$do_date'
and c3.dt='$do_date' and c1.dt='$do_date'
and sku.id is not null;
"
$hive -e "$sql"
chmod 777 dwd_db.sh
dwd_db.sh 2019-03-10
dwd_db.sh 2019-03-11
查看导入数据
select * from dwd_sku_info where dt='2019-03-10' limit 2;
select * from dwd_sku_info where dt='2019-03-11' limit 2;
3.5 DWS层之用户行为宽表
1)为什么要建宽表
需求目标:把每个用户单日的行为聚合起来组成一张多列宽表,以便之后关联用户维度信息后,从不同角度进行统计分析。
3.5.1 创建用户行为宽表
hive (gmall)>
drop table if exists dws_user_action;
create external table dws_user_action
(
user_id string comment '用户 id',
order_count bigint comment '下单次数 ',
order_amount decimal(16,2) comment '下单金额 ',
payment_count bigint comment '支付次数',
payment_amount decimal(16,2) comment '支付金额 ',
comment_count bigint comment '评论次数'
) COMMENT '每日用户行为宽表'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_user_action/'
tblproperties ("parquet.compression"="snappy");
3.5.2 向用户行为宽表导入数据
1)导入数据
hive (gmall)>
with
tmp_order as
(
select
user_id,
count(*) order_count,
sum(oi.total_amount) order_amount
from dwd_order_info oi
where date_format(oi.create_time,'yyyy-MM-dd')='2019-03-10'
group by user_id
) ,
tmp_payment as
(
select
user_id,
sum(pi.total_amount) payment_amount,
count(*) payment_count
from dwd_payment_info pi
where date_format(pi.payment_time,'yyyy-MM-dd')='2019-03-10'
group by user_id
),
tmp_comment as
(
select
user_id,
count(*) comment_count
from dwd_comment_log c
where date_format(c.dt,'yyyy-MM-dd')='2019-03-10'
group by user_id
)
insert overwrite table dws_user_action partition(dt='2019-03-10')
select
user_actions.user_id,
sum(user_actions.order_count),
sum(user_actions.order_amount),
sum(user_actions.payment_count),
sum(user_actions.payment_amount),
sum(user_actions.comment_count)
from
(
select
user_id,
order_count,
order_amount,
0 payment_count,
0 payment_amount,
0 comment_count
from tmp_order
union all
select
user_id,
0,
0,
payment_count,
payment_amount,
0
from tmp_payment
union all
select
user_id,
0,
0,
0,
0,
comment_count
from tmp_comment
) user_actions
group by user_id;
2)查询导入结果
hive (gmall)> select * from dws_user_action;
3.5.3 用户行为数据宽表导入脚本
vim dws_db_wide.sh
#!/bin/bash
# 定义变量方便修改
APP=gmall
hive=/opt/module/hive/bin/hive
# 如果输入了日期参数则使用该日期;否则取当前时间的前一天
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
with
tmp_order as
(
select
user_id,
sum(oi.total_amount) order_amount,
count(*) order_count
from "$APP".dwd_order_info oi
where date_format(oi.create_time,'yyyy-MM-dd')='$do_date'
group by user_id
) ,
tmp_payment as
(
select
user_id,
sum(pi.total_amount) payment_amount,
count(*) payment_count
from "$APP".dwd_payment_info pi
where date_format(pi.payment_time,'yyyy-MM-dd')='$do_date'
group by user_id
),
tmp_comment as
(
select
user_id,
count(*) comment_count
from "$APP".dwd_comment_log c
where date_format(c.dt,'yyyy-MM-dd')='$do_date'
group by user_id
)
insert overwrite table "$APP".dws_user_action partition(dt='$do_date')
select
user_actions.user_id,
sum(user_actions.order_count),
sum(user_actions.order_amount),
sum(user_actions.payment_count),
sum(user_actions.payment_amount),
sum(user_actions.comment_count)
from
(
select
user_id,
order_count,
order_amount,
0 payment_count,
0 payment_amount,
0 comment_count
from tmp_order
union all
select
user_id,
0,
0,
payment_count,
payment_amount,
0
from tmp_payment
union all
select
user_id,
0,
0,
0,
0,
comment_count
from tmp_comment
) user_actions
group by user_id;
"
$hive -e "$sql"
chmod 777 dws_db_wide.sh
dws_db_wide.sh 2019-03-11
4) Check the imported data
hive (gmall)>
select * from dws_user_action where dt='2019-03-11' limit 2;
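If several days need to be loaded or re-loaded, the script can be driven by a simple date loop. This wrapper is an optional sketch, not part of the original material; it assumes dws_db_wide.sh is on the PATH, as in the calls above:
#!/bin/bash
# Hypothetical backfill helper: run dws_db_wide.sh once for every day in [start_date, end_date].
start_date=2019-03-10
end_date=2019-03-12
do_date=$start_date
while [ $(date -d "$do_date" +%s) -le $(date -d "$end_date" +%s) ]
do
dws_db_wide.sh "$do_date"
do_date=$(date -d "$do_date +1 day" +%F)
done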
4.1 ADS Layer
4.1.2 GMV table creation statement
hive (gmall)>
drop table if exists ads_gmv_sum_day;
create external table ads_gmv_sum_day(
`dt` string COMMENT '统计日期',
`gmv_count` bigint COMMENT '当日gmv订单个数',
`gmv_amount` decimal(16,2) COMMENT '当日gmv订单总金额',
`gmv_payment` decimal(16,2) COMMENT '当日支付金额'
) COMMENT 'GMV'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_gmv_sum_day/'
;
4.1.3 Data import
1) Import data
hive (gmall)>
insert into table ads_gmv_sum_day
select
'2019-03-10' dt,
sum(order_count) gmv_count,
sum(order_amount) gmv_amount,
sum(payment_amount) payment_amount
from dws_user_action
where dt ='2019-03-10'
group by dt
;
2) Query the imported data
hive (gmall)> select * from ads_gmv_sum_day;
4.1.4 Data import script
Create the script ads_db_gmv.sh in the bin directory
#!/bin/bash
# Define variables to make changes easier
APP=gmall
hive=/opt/module/hive/bin/hive
# Use the input date if one is given; otherwise default to the day before today
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert into table "$APP".ads_gmv_sum_day
select
'$do_date' dt,
sum(order_count) gmv_count,
sum(order_amount) gmv_amount,
sum(payment_amount) payment_amount
from "$APP".dws_user_action
where dt ='$do_date'
group by dt;
"
$hive -e "$sql"
chmod 777 ads_db_gmv.sh
ads_db_gmv.sh 2019-03-11
4) Check the imported data
hive (gmall)>
select * from ads_gmv_sum_day where dt='2019-03-11' limit 2;
5.1 What Is Conversion Rate
5.2 ADS Layer: Ratio of New Users to Daily Active Users
5.2.1 Table creation statement
hive (gmall)>
drop table if exists ads_user_convert_day;
create external table ads_user_convert_day(
`dt` string COMMENT '统计日期',
`uv_m_count` bigint COMMENT '当日活跃设备',
`new_m_count` bigint COMMENT '当日新增设备',
`new_m_ratio` decimal(10,2) COMMENT '当日新增占日活的比率'
) COMMENT '转化率'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_convert_day/'
;
5.2.2 Data import
1) Import data
hive (gmall)>
insert into table ads_user_convert_day
select
'2019-03-10',
sum(uc.dc) sum_dc,
sum(uc.nmc) sum_nmc,
cast(sum( uc.nmc)/sum( uc.dc)*100 as decimal(10,2)) new_m_ratio
from
(
select
day_count dc,
0 nmc
from ads_uv_count
where dt='2019-03-10'
union all
select
0 dc,
new_mid_count nmc
from ads_new_mid_count
where create_date='2019-03-10'
)uc;
2) Check the imported data
hive (gmall)>
select * from ads_user_convert_day;
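Unlike the other ADS tables, no parameterized daily script is given for this one. A sketch that follows the same pattern as ads_db_gmv.sh (the name ads_uv_convert.sh is only a suggestion; it assumes ads_uv_count and ads_new_mid_count have already been loaded for the same day):
#!/bin/bash
# Hypothetical daily loader for ads_user_convert_day, modeled on the other ads_*.sh scripts.
APP=gmall
hive=/opt/module/hive/bin/hive
# Use the input date if one is given; otherwise default to the day before today
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert into table "$APP".ads_user_convert_day
select
'$do_date',
sum(uc.dc) sum_dc,
sum(uc.nmc) sum_nmc,
cast(sum(uc.nmc)/sum(uc.dc)*100 as decimal(10,2)) new_m_ratio
from
(
select day_count dc, 0 nmc
from "$APP".ads_uv_count
where dt='$do_date'
union all
select 0 dc, new_mid_count nmc
from "$APP".ads_new_mid_count
where create_date='$do_date'
) uc;
"
$hive -e "$sql"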
5.3 ADS Layer: User Behavior Funnel Analysis
5.3.1 Table creation statement
hive (gmall)>
drop table if exists ads_user_action_convert_day;
create external table ads_user_action_convert_day(
`dt` string COMMENT '统计日期',
`total_visitor_m_count` bigint COMMENT '总访问人数',
`order_u_count` bigint COMMENT '下单人数',
`visitor2order_convert_ratio` decimal(10,2) COMMENT '访问到下单转化率',
`payment_u_count` bigint COMMENT '支付人数',
`order2payment_convert_ratio` decimal(10,2) COMMENT '下单到支付的转化率'
) COMMENT '用户行为漏斗分析'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_action_convert_day/'
;
5.3.2 Data import
1) Import data
hive (gmall)>
insert into table ads_user_action_convert_day
select
'2019-03-10',
uv.day_count,
ua.order_count,
cast(ua.order_count/uv.day_count as decimal(10,2)) visitor2order_convert_ratio,
ua.payment_count,
cast(ua.payment_count/ua.order_count as decimal(10,2)) order2payment_convert_ratio
from
(
select
dt,
sum(if(order_count>0,1,0)) order_count,
sum(if(payment_count>0,1,0)) payment_count
from dws_user_action
where dt='2019-03-10'
group by dt
)ua join ads_uv_count uv on uv.dt=ua.dt
;
2) Query the imported data
hive (gmall)> select * from ads_user_action_convert_day;
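A matching daily script for the funnel table can be written the same way. The sketch below follows the document's script pattern but is not part of the original steps (hypothetical name ads_user_conversion.sh):
#!/bin/bash
# Hypothetical daily loader for ads_user_action_convert_day.
APP=gmall
hive=/opt/module/hive/bin/hive
# Use the input date if one is given; otherwise default to the day before today
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert into table "$APP".ads_user_action_convert_day
select
'$do_date',
uv.day_count,
ua.order_count,
cast(ua.order_count/uv.day_count as decimal(10,2)) visitor2order_convert_ratio,
ua.payment_count,
cast(ua.payment_count/ua.order_count as decimal(10,2)) order2payment_convert_ratio
from
(
select
dt,
sum(if(order_count>0,1,0)) order_count,
sum(if(payment_count>0,1,0)) payment_count
from "$APP".dws_user_action
where dt='$do_date'
group by dt
) ua join "$APP".ads_uv_count uv on uv.dt=ua.dt;
"
$hive -e "$sql"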
Requirement: count, on a monthly basis, the users who purchased products two or more times.
6.2 DWS Layer
6.2.1 User purchase detail table (wide table)
hive (gmall)>
drop table if exists dws_sale_detail_daycount;
create external table dws_sale_detail_daycount
(
user_id string comment '用户 id',
sku_id string comment '商品 Id',
user_gender string comment '用户性别',
user_age string comment '用户年龄',
user_level string comment '用户等级',
order_price decimal(10,2) comment '商品价格',
sku_name string comment '商品名称',
sku_tm_id string comment '品牌id',
sku_category3_id string comment '商品三级品类id',
sku_category2_id string comment '商品二级品类id',
sku_category1_id string comment '商品一级品类id',
sku_category3_name string comment '商品三级品类名称',
sku_category2_name string comment '商品二级品类名称',
sku_category1_name string comment '商品一级品类名称',
spu_id string comment '商品 spu',
sku_num int comment '购买个数',
order_count string comment '当日下单单数',
order_amount string comment '当日下单金额'
) COMMENT '用户购买商品明细表'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_user_sale_detail_daycount/'
tblproperties ("parquet.compression"="snappy");
6.2.2 Data import
hive (gmall)>
with
tmp_detail as
(
select
user_id,
sku_id,
sum(sku_num) sku_num,
count(*) order_count,
sum(od.order_price*sku_num) order_amount
from dwd_order_detail od
where od.dt='2019-03-10'
group by user_id, sku_id
)
insert overwrite table dws_sale_detail_daycount partition(dt='2019-03-10')
select
tmp_detail.user_id,
tmp_detail.sku_id,
u.gender,
months_between('2019-03-10', u.birthday)/12 age,
u.user_level,
price,
sku_name,
tm_id,
category3_id,
category2_id,
category1_id,
category3_name,
category2_name,
category1_name,
spu_id,
tmp_detail.sku_num,
tmp_detail.order_count,
tmp_detail.order_amount
from tmp_detail
left join dwd_user_info u on tmp_detail.user_id =u.id and u.dt='2019-03-10'
left join dwd_sku_info s on tmp_detail.sku_id =s.id and s.dt='2019-03-10'
;
6.2.3 Data import script
Create the script dws_sale.sh in the bin directory
#!/bin/bash
# Define variables to make changes easier
APP=gmall
hive=/opt/module/hive/bin/hive
# Use the input date if one is given; otherwise default to the day before today
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
with
tmp_detail as
(
select
user_id,
sku_id,
sum(sku_num) sku_num,
count(*) order_count,
sum(od.order_price*sku_num) order_amount
from "$APP".dwd_order_detail od
where od.dt='$do_date'
group by user_id, sku_id
)
insert overwrite table "$APP".dws_sale_detail_daycount partition(dt='$do_date')
select
tmp_detail.user_id,
tmp_detail.sku_id,
u.gender,
months_between('$do_date', u.birthday)/12 age,
u.user_level,
price,
sku_name,
tm_id,
category3_id,
category2_id,
category1_id,
category3_name,
category2_name,
category1_name,
spu_id,
tmp_detail.sku_num,
tmp_detail.order_count,
tmp_detail.order_amount
from tmp_detail
left join "$APP".dwd_user_info u
on tmp_detail.user_id=u.id and u.dt='$do_date'
left join "$APP".dwd_sku_info s on tmp_detail.sku_id =s.id and s.dt='$do_date';
"
$hive -e "$sql"
chmod 777 dws_sale.sh
dws_sale.sh 2019-03-11
Check the imported data
hive (gmall)>
select * from dws_sale_detail_daycount where dt='2019-03-11' limit 2;
6.3 ADS Layer: Brand Repurchase Rate
6.3.1 Table creation statement
hive (gmall)>
drop table if exists ads_sale_tm_category1_stat_mn;
create external table ads_sale_tm_category1_stat_mn
(
tm_id string comment '品牌id',
category1_id string comment '1级品类id ',
category1_name string comment '1级品类名称 ',
buycount bigint comment '购买人数',
buy_twice_last bigint comment '两次以上购买人数',
buy_twice_last_ratio decimal(10,2) comment '单次复购率',
buy_3times_last bigint comment '三次以上购买人数',
buy_3times_last_ratio decimal(10,2) comment '多次复购率',
stat_mn string comment '统计月份',
stat_date string comment '统计日期'
) COMMENT '复购率统计'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_sale_tm_category1_stat_mn/'
;
6.3.2 Data import
1) Import data
hive (gmall)>
insert into table ads_sale_tm_category1_stat_mn
select
mn.sku_tm_id,
mn.sku_category1_id,
mn.sku_category1_name,
sum(if(mn.order_count>=1,1,0)) buycount,
sum(if(mn.order_count>=2,1,0)) buyTwiceLast,
sum(if(mn.order_count>=2,1,0))/sum( if(mn.order_count>=1,1,0)) buyTwiceLastRatio,
sum(if(mn.order_count>=3,1,0)) buy3timeLast ,
sum(if(mn.order_count>=3,1,0))/sum( if(mn.order_count>=1,1,0)) buy3timeLastRatio ,
date_format('2019-03-10' ,'yyyy-MM') stat_mn,
'2019-03-10' stat_date
from
(
select
user_id,
sd.sku_tm_id,
sd.sku_category1_id,
sd.sku_category1_name,
sum(order_count) order_count
from dws_sale_detail_daycount sd
where date_format(dt,'yyyy-MM')=date_format('2019-03-10' ,'yyyy-MM')
group by user_id, sd.sku_tm_id, sd.sku_category1_id, sd.sku_category1_name
) mn
group by mn.sku_tm_id, mn.sku_category1_id, mn.sku_category1_name
;
2) Query the imported data
hive (gmall)> select * from ads_sale_tm_category1_stat_mn;
6.3.3 Data import script
Create the script ads_sale.sh in the bin directory
#!/bin/bash
# Define variables to make changes easier
APP=gmall
hive=/opt/module/hive/bin/hive
# Use the input date if one is given; otherwise default to the day before today
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table "$APP".ads_sale_tm_category1_stat_mn
select
mn.sku_tm_id,
mn.sku_category1_id,
mn.sku_category1_name,
sum(if(mn.order_count>=1,1,0)) buycount,
sum(if(mn.order_count>=2,1,0)) buyTwiceLast,
sum(if(mn.order_count>=2,1,0))/sum( if(mn.order_count>=1,1,0)) buyTwiceLastRatio,
sum(if(mn.order_count>=3,1,0)) buy3timeLast,
sum(if(mn.order_count>=3,1,0))/sum( if(mn.order_count>=1,1,0)) buy3timeLastRatio ,
date_format('$do_date' ,'yyyy-MM') stat_mn,
'$do_date' stat_date
from
(
select
user_id,
od.sku_tm_id,
od.sku_category1_id,
od.sku_category1_name,
sum(order_count) order_count
from "$APP".dws_sale_detail_daycount od
where date_format(dt,'yyyy-MM')=date_format('$do_date' ,'yyyy-MM')
group by user_id, od.sku_tm_id, od.sku_category1_id, od.sku_category1_name
) mn
group by mn.sku_tm_id, mn.sku_category1_id, mn.sku_category1_name;
"
$hive -e "$sql"
chmod 777 ads_sale.sh
ads_sale.sh 2019-03-11
Check the imported data
hive (gmall)>
select * from ads_sale_tm_category1_stat_mn limit 2;
7.1 Create Tables in MySQL
All of the following table creation is done in SQLyog.
7.1.1 Daily active statistics
1) Create the ads_uv_count table in MySQL
DROP TABLE IF EXISTS `ads_uv_count`;
CREATE TABLE `ads_uv_count` (
`dt` varchar(255) DEFAULT NULL COMMENT '统计日期',
`day_count` bigint(200) DEFAULT NULL COMMENT '当日用户数量',
`wk_count` bigint(200) DEFAULT NULL COMMENT '当周用户数量',
`mn_count` bigint(200) DEFAULT NULL COMMENT '当月用户数量',
`is_weekend` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL COMMENT 'Y,N是否是周末,用于得到本周最终结果',
`is_monthend` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL COMMENT 'Y,N是否是月末,用于得到本月最终结果'
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci COMMENT = '每日活跃用户数量' ROW_FORMAT = Dynamic;
7.1.2 Retention rate statistics
1) Create the ads_user_retention_day_rate table in MySQL
DROP TABLE IF EXISTS `ads_user_retention_day_rate`;
CREATE TABLE `ads_user_retention_day_rate` (
`stat_date` varchar(255) DEFAULT NULL COMMENT '统计日期',
`create_date` varchar(255) DEFAULT NULL COMMENT '设备新增日期',
`retention_day` bigint(200) DEFAULT NULL COMMENT '截止当前日期留存天数',
`retention_count` bigint(200) DEFAULT NULL COMMENT '留存数量',
`new_mid_count` bigint(200) DEFAULT NULL COMMENT '当日设备新增数量',
`retention_ratio` decimal(10, 2) DEFAULT NULL COMMENT '留存率'
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci COMMENT = '每日用户留存情况' ROW_FORMAT = Dynamic;
7.1.3 Funnel analysis
1) Create the ads_user_action_convert_day table in MySQL
DROP TABLE IF EXISTS `ads_user_action_convert_day`;
CREATE TABLE `ads_user_action_convert_day` (
`dt` varchar(200) DEFAULT NULL COMMENT '统计日期',
`total_visitor_m_count` bigint(20) DEFAULT NULL COMMENT '总访问人数',
`order_u_count` bigint(20) DEFAULT NULL COMMENT '下单人数',
`visitor2order_convert_ratio` decimal(10, 2) DEFAULT NULL COMMENT '购物车到下单转化率',
`payment_u_count` bigint(20) DEFAULT NULL COMMENT '支付人数',
`order2payment_convert_ratio` decimal(10, 2) DEFAULT NULL COMMENT '下单到支付的转化率'
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci COMMENT = '每日用户行为转化率统计' ROW_FORMAT = Dynamic;
7.1.4 GMV statistics
1) Create the ads_gmv_sum_day table in MySQL
DROP TABLE IF EXISTS ads_gmv_sum_day;
CREATE TABLE ads_gmv_sum_day(
`dt` varchar(200) DEFAULT NULL COMMENT '统计日期',
`gmv_count` bigint(20) DEFAULT NULL COMMENT '当日gmv订单个数',
`gmv_amount` decimal(16, 2) DEFAULT NULL COMMENT '当日gmv订单总金额',
`gmv_payment` decimal(16, 2) DEFAULT NULL COMMENT '当日支付金额'
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci COMMENT = '每日活跃用户数量' ROW_FORMAT = Dynamic;
7.1.5 Nationwide product sales
1) Create the ads_gmv_sum_province table in MySQL
DROP TABLE IF EXISTS `ads_gmv_sum_province`;
CREATE TABLE `ads_gmv_sum_province` (
`province` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`gmv` bigint(255) DEFAULT NULL,
`remark` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;
7.2 Sqoop Export Script
1) Write the Sqoop export script
Create the script sqoop_export.sh in the bin directory
vim sqoop_export.sh
Fill in the following content
#!/bin/bash
db_name=gmall
export_data() {
/opt/module/sqoop/bin/sqoop export \
--connect "jdbc:mysql://hadoop102:3306/${db_name}?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 000000 \
--table $1 \
--num-mappers 1 \
--export-dir /warehouse/$db_name/ads/$1 \
--input-fields-terminated-by "\t" \
--update-mode allowinsert \
--update-key "tm_id,category1_id,stat_mn,stat_date" \
--input-null-string '\\N' \
--input-null-non-string '\\N'
}
case $1 in
"ads_uv_count")
export_data "ads_uv_count"
;;
"ads_user_action_convert_day")
export_data "ads_user_action_convert_day"
;;
"ads_gmv_sum_day")
export_data "ads_gmv_sum_day"
;;
"all")
export_data "ads_uv_count"
export_data "ads_user_action_convert_day"
export_data "ads_gmv_sum_day"
;;
esac
Notes on whether the export updates or inserts:
--update-mode:
updateonly: only updates existing rows; new rows cannot be inserted.
allowinsert: also allows inserting new rows.
--update-key: when updates are allowed, specifies which fields identify the same record, so that matching rows are updated rather than inserted again. Separate multiple fields with commas.
--input-null-string and --input-null-non-string:
For string and non-string columns respectively, these tell Sqoop which value in the exported files (here '\N') should be interpreted as NULL when writing to MySQL.
Official documentation: http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
Sqoop will by default import NULL values as string null. Hive is however using string \N to denote NULL values and therefore predicates dealing with NULL (like IS NULL) will not work correctly. You should append parameters --null-string and --null-non-string in case of import job or --input-null-string and --input-null-non-string in case of an export job if you wish to properly preserve NULL values. Because sqoop is using those parameters in generated code, you need to properly escape value \N to \\N.
Hive stores NULL on disk as the string "\N", whereas MySQL stores NULL as a true NULL. To keep the two sides consistent, use --input-null-string and --input-null-non-string when exporting data, and --null-string and --null-non-string when importing data.
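For reference, the import direction uses the corresponding --null-string and --null-non-string flags. The command below is only an illustrative sketch; the table name and target directory are placeholders, and the connection values simply reuse those from the export script above:
# Hypothetical import counterpart: preserve MySQL NULLs as Hive's \N on the way in.
/opt/module/sqoop/bin/sqoop import \
--connect "jdbc:mysql://hadoop102:3306/gmall" \
--username root \
--password 000000 \
--table order_info \
--num-mappers 1 \
--fields-terminated-by "\t" \
--target-dir /origin_data/gmall/db/order_info/2019-03-10 \
--delete-target-dir \
--null-string '\\N' \
--null-non-string '\\N'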
3) Run the Sqoop export script
chmod 777 sqoop_export.sh
sqoop_export.sh all
4) Check the results in MySQL
SELECT * FROM ads_uv_count;
SELECT * FROM ads_user_retention_day_rate;
SELECT * FROM ads_user_action_convert_day;
SELECT * FROM ads_gmv_sum_day;
SELECT * FROM ads_gmv_sum_province;
7.3 Viewing the Web Pages
1) Run the spring-boot-echarts-master program
2) View the displayed results on the web page
http://localhost:8080/active
8.1 Azkaban Installation
1 Pre-installation preparation
1) Copy the Azkaban web server, Azkaban executor server, Azkaban SQL scripts, and the MySQL installation package to the /opt/software directory on the node1 virtual machine
a)azkaban-web-server-2.5.0.tar.gz
b)azkaban-executor-server-2.5.0.tar.gz
c)azkaban-sql-script-2.5.0.tar.gz
d)mysql-libs.zip
2.2 Install Azkaban
1) Create an azkaban directory under /opt/module/
mkdir azkaban
2) Extract azkaban-web-server-2.5.0.tar.gz, azkaban-executor-server-2.5.0.tar.gz, and azkaban-sql-script-2.5.0.tar.gz into /opt/module/azkaban
tar -zxvf azkaban-web-server-2.5.0.tar.gz -C /opt/module/azkaban/
tar -zxvf azkaban-executor-server-2.5.0.tar.gz -C /opt/module/azkaban/
tar -zxvf azkaban-sql-script-2.5.0.tar.gz -C /opt/module/azkaban/
3) Rename the extracted directories
mv azkaban-web-2.5.0/ server
mv azkaban-executor-2.5.0/ executor
4) Import the Azkaban SQL scripts
Enter MySQL, create the azkaban database, and import the extracted scripts into it.
mysql -uroot -p123456
mysql> create database azkaban;
mysql> use azkaban;
mysql> source /opt/module/azkaban/azkaban-2.5.0/create-all-sql-2.5.0.sql
Note: source followed by a .sql file runs the SQL statements in that file as a batch.
2.3 Generate the Keystore
Keytool is Java's certificate management tool; it lets users manage their own public/private key pairs and the related certificates.
-keystore: specifies the name and location of the keystore (the generated material is then stored there instead of in the default .keystore file)
-genkey: generates a key pair, by default into a ".keystore" file in the user's home directory
-alias: the alias for the generated keystore entry; defaults to mykey if not specified
-keyalg: the key algorithm, RSA or DSA; the default is DSA
1) Generate the keystore, its password, and the associated information
keytool -keystore keystore -alias jetty -genkey -keyalg RSA
2) Copy the keystore into the Azkaban web server's root directory
mv keystore /opt/module/azkaban/server/
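Optionally, the generated keystore can be inspected with keytool's list command. This is only an extra sanity check, not a step in the original guide; it will prompt for the keystore password entered above:
# Optional check: list the entries in the copied keystore.
keytool -list -keystore /opt/module/azkaban/server/keystore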
2.5 Configuration Files
2.5.1 Web server configuration
1) Go to the conf directory under the Azkaban web server installation directory and open azkaban.properties
vim azkaban.properties
2) Modify azkaban.properties as follows.
#Azkaban Personalization Settings
#Server UI name, displayed at the top of the web UI
azkaban.name=Test
#Description
azkaban.label=My Local Azkaban
#UI color
azkaban.color=#FF3601
azkaban.default.servlet.path=/index
#Default directory where the web server stores its web files
web.resource.dir=/opt/module/azkaban/server/web/
#Default time zone, changed to Asia/Shanghai (the default is a US time zone)
default.timezone.id=Asia/Shanghai
#Azkaban UserManager class
user.manager.class=azkaban.user.XmlUserManager
#Default user permission management file (absolute path)
user.manager.xml.file=/opt/module/azkaban/server/conf/azkaban-users.xml
#Loader for projects
#Location of the global configuration file (absolute path)
executor.global.properties=/opt/module/azkaban/executor/conf/global.properties
azkaban.project.dir=projects
#Database type
database.type=mysql
#Port
mysql.port=3306
#Database host
mysql.host=node1
#Database name
mysql.database=azkaban
#Database user
mysql.user=root
#Database password
mysql.password=123456
#Maximum number of connections
mysql.numconnections=100
# Velocity dev mode
velocity.dev.mode=false
# Azkaban Jetty server properties.
#Maximum number of threads
jetty.maxThreads=25
#Jetty SSL port
jetty.ssl.port=8443
#Jetty port
jetty.port=8081
#SSL keystore file (absolute path)
jetty.keystore=/opt/module/azkaban/server/keystore
#Keystore password
jetty.password=123456
#Key password, same as the keystore password
jetty.keypassword=123456
#SSL truststore file (absolute path)
jetty.truststore=/opt/module/azkaban/server/keystore
#Truststore password
jetty.trustpassword=123456
# Azkaban Executor settings
executor.port=12321
# mail settings
mail.sender=
mail.host=
job.failure.email=
job.success.email=
lockdown.create.projects=false
cache.directory=cache
3) Web server user configuration
In the conf directory under the Azkaban web server installation directory, modify the azkaban-users.xml file as follows to add an administrator user.
vim azkaban-users.xml
2.5.2 Executor server configuration
1) Go to the conf directory under the executor server installation directory and open azkaban.properties
vim azkaban.properties
2) Modify azkaban.properties as follows.
#Azkaban
#Time zone
default.timezone.id=Asia/Shanghai
# Azkaban JobTypes Plugins
#Location of the jobtype plugins
azkaban.jobtype.plugin.dir=plugins/jobtypes
#Loader for projects
executor.global.properties=/opt/module/azkaban/executor/conf/global.properties
azkaban.project.dir=projects
database.type=mysql
mysql.port=3306
mysql.host=node1
mysql.database=azkaban
mysql.user=root
mysql.password=123456
mysql.numconnections=100
# Azkaban Executor settings
#Maximum number of threads
executor.maxThreads=50
#Port (if changed, keep it consistent with the web server configuration)
executor.port=12321
#Number of flow threads
executor.flow.threads=30
2.6 Start the Executor Server
Run the start command from the executor server directory
bin/azkaban-executor-start.sh
2.7 Start the Web Server
Run the start command from the Azkaban web server directory
bin/azkaban-web-start.sh
Note:
Start the executor first, then the web server; otherwise the web server may fail to start because it cannot find an executor.
Check the processes with jps
jps
3601 AzkabanExecutorServer
5880 Jps
3661 AzkabanWebServer
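Following the helper-script convention used elsewhere in this document (zk.sh, f1.sh, kf.sh), the start order can be enforced with a small wrapper. The script name az.sh and the sleep interval are assumptions, not part of the original material:
#!/bin/bash
# Hypothetical Azkaban start helper: executor first, then the web server.
AZ_HOME=/opt/module/azkaban
cd $AZ_HOME/executor && bin/azkaban-executor-start.sh
# Give the executor a moment to come up before the web server looks for it.
sleep 3
cd $AZ_HOME/server && bin/azkaban-web-start.sh
# If azkaban-web-start.sh stays in the foreground on your version, run it in a separate session instead.
jps | grep -E 'AzkabanExecutorServer|AzkabanWebServer'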
Once startup is complete, open https://<server IP>:8443 in a browser (Chrome is recommended) to access the Azkaban service
8.2 End-to-End Scheduling for the GMV Metric
1) Generate data
CALL init_data('2019-03-12',300,200,300,FALSE);
2) Write the Azkaban jobs
(1) import.job file
type=command
do_date=${dt}
command=/home/hadoop/bin/sqoop_import.sh all ${do_date}
(2) ods.job file
type=command
do_date=${dt}
dependencies=import
command=/home/hadoop/bin/ods_db.sh ${do_date}
(3) dwd.job file
type=command
do_date=${dt}
dependencies=ods
command=/home/hadoop/bin/dwd_db.sh ${do_date}
(4) dws.job file
type=command
do_date=${dt}
dependencies=dwd
command=/home/hadoop/bin/dws_db_wide.sh ${do_date}
(5) ads.job file
type=command
do_date=${dt}
dependencies=dws
command=/home/hadoop/bin/ads_db_gmv.sh ${do_date}
(6) export.job file
type=command
dependencies=ads
command=/home/hadoop/bin/sqoop_export.sh ads_gmv_sum_day
(7) Compress the six files above into a gmv-job.zip archive, for example:
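From the directory that contains the six .job files (assuming the zip utility is installed):
# Package the six job definitions into the archive Azkaban expects.
zip gmv-job.zip import.job ods.job dwd.job dws.job ads.job export.job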
3) Create an Azkaban project and upload the gmv-job.zip file.
4) Open https://node1:8443 in a browser, create the project on the page, and run the gmv-job flow. (The cluster must already be running.)
5) Wait roughly 20 minutes, then check the result in MySQL
select * from ads_gmv_sum_day;
9.5 Building the Zipper Table
1) Generate 10 initial order records
SQLyog: CALL init_data('2019-03-13',10,5,10,TRUE);
sqoop_import.sh all 2019-03-13
ods_db.sh 2019-03-13
dwd_db.sh 2019-03-13
2) Create the zipper table
hive (gmall)>
drop table if exists dwd_order_info_his;
create external table dwd_order_info_his(
`id` string COMMENT '订单编号',
`total_amount` decimal(10,2) COMMENT '订单金额',
`order_status` string COMMENT '订单状态',
`user_id` string COMMENT '用户id' ,
`payment_way` string COMMENT '支付方式',
`out_trade_no` string COMMENT '支付流水号',
`create_time` string COMMENT '创建时间',
`operate_time` string COMMENT '操作时间',
`start_date` string COMMENT '有效开始日期',
`end_date` string COMMENT '有效结束日期'
) COMMENT '订单拉链表'
stored as parquet
location '/warehouse/gmall/dwd/dwd_order_info_his/'
tblproperties ("parquet.compression"="snappy");
3) Initialize the zipper table
hive (gmall)>
insert overwrite table dwd_order_info_his
select
id,
total_amount,
order_status,
user_id,
payment_way,
out_trade_no,
create_time,
operate_time,
'2019-03-13',
'9999-99-99'
from ods_order_info oi
where oi.dt='2019-03-13';
4) Query the zipper table
hive (gmall)> select * from dwd_order_info_his limit 2;
9.5.2 Step 1: Produce the day's changed data (new and modified rows); executed every day
(1) Add 2 new order records for 2019-03-14
CALL init_data('2019-03-14',2,5,10,TRUE);
(2) Import all of the 2019-03-14 data with Sqoop
sqoop_import.sh all 2019-03-14
(3) Load the ODS layer
ods_db.sh 2019-03-14
(4) Load the DWD layer
dwd_db.sh 2019-03-14
9.5.3 Step 2: Merge the changed records first, then append the new records, and insert the result into a temporary table
1) Create the temporary table
hive (gmall)>
drop table if exists dwd_order_info_his_tmp;
create table dwd_order_info_his_tmp(
`id` string COMMENT '订单编号',
`total_amount` decimal(10,2) COMMENT '订单金额',
`order_status` string COMMENT '订单状态',
`user_id` string COMMENT '用户id' ,
`payment_way` string COMMENT '支付方式',
`out_trade_no` string COMMENT '支付流水号',
`create_time` string COMMENT '创建时间',
`operate_time` string COMMENT '操作时间',
`start_date` string COMMENT '有效开始日期',
`end_date` string COMMENT '有效结束日期'
) COMMENT '订单拉链临时表'
stored as parquet
location '/warehouse/gmall/dwd/dwd_order_info_his_tmp/'
tblproperties ("parquet.compression"="snappy");
2) Load the temporary table
hive (gmall)>
insert overwrite table dwd_order_info_his_tmp
select * from
(
select
id,
total_amount,
order_status,
user_id,
payment_way,
out_trade_no,
create_time,
operate_time,
'2019-03-14' start_date,
'9999-99-99' end_date
from dwd_order_info where dt='2019-03-14'
union all
select oh.id,
oh.total_amount,
oh.order_status,
oh.user_id,
oh.payment_way,
oh.out_trade_no,
oh.create_time,
oh.operate_time,
oh.start_date,
if(oi.id is null, oh.end_date, date_add(oi.dt,-1)) end_date
from dwd_order_info_his oh left join
(
select
*
from dwd_order_info
where dt='2019-03-14'
) oi
on oh.id=oi.id and oh.end_date='9999-99-99'
)his
order by his.id, start_date;
9.5.4 Step 3: Overwrite the zipper table with the temporary table
1) Import data
hive (gmall)>
insert overwrite table dwd_order_info_his
select * from dwd_order_info_his_tmp;
2) Query the imported data
hive (gmall)> select * from dwd_order_info_his;
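The three daily steps above (load the day's changes, merge into the temporary table, overwrite the zipper table) can also be collected into one parameterized script in the same style as the other dwd scripts. The script below is a sketch, not part of the original material (hypothetical name dwd_order_info_his.sh); it assumes dwd_order_info has already been loaded for the given day:
#!/bin/bash
# Hypothetical daily zipper-table update for dwd_order_info_his.
APP=gmall
hive=/opt/module/hive/bin/hive
# Use the input date if one is given; otherwise default to the day before today
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert overwrite table "$APP".dwd_order_info_his_tmp
select * from
(
select
id,
total_amount,
order_status,
user_id,
payment_way,
out_trade_no,
create_time,
operate_time,
'$do_date' start_date,
'9999-99-99' end_date
from "$APP".dwd_order_info where dt='$do_date'
union all
select
oh.id,
oh.total_amount,
oh.order_status,
oh.user_id,
oh.payment_way,
oh.out_trade_no,
oh.create_time,
oh.operate_time,
oh.start_date,
if(oi.id is null, oh.end_date, date_add(oi.dt,-1)) end_date
from "$APP".dwd_order_info_his oh left join
(
select * from "$APP".dwd_order_info where dt='$do_date'
) oi on oh.id=oi.id and oh.end_date='9999-99-99'
) his
order by his.id, start_date;
insert overwrite table "$APP".dwd_order_info_his
select * from "$APP".dwd_order_info_his_tmp;
"
$hive -e "$sql"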
1.2 Presto Installation
1.2.1 Presto Server installation
Upload presto-server-0.196.tar.gz to the /opt/software directory on node1 and extract it to /opt/module
tar -zxvf presto-server-0.196.tar.gz -C /opt/module/
3) Rename the directory to presto
mv presto-server-0.196/ presto
4) Go to /opt/module/presto and create a directory for data
mkdir data
5) Go to /opt/module/presto and create a directory for configuration files
mkdir etc
6) Add a jvm.config configuration file under /opt/module/presto/etc
vim jvm.config
Add the following content
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
7) Presto supports multiple data sources, which it calls catalogs. Here we configure a Hive data source, i.e. a Hive catalog
mkdir catalog
vim hive.properties
Add the following content
connector.name=hive-hadoop2
hive.metastore.uri=thrift://node1:9083
8) Distribute presto from node1 to node2 and node3
xsync presto
9) After distribution, go to /opt/module/presto/etc on node1, node2, and node3 and configure the node properties; node.id must be different on every node.
On node1:
vim node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/opt/module/presto/data
On node2:
vim node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-fffffffffffe
node.data-dir=/opt/module/presto/data
On node3:
vim node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-fffffffffffd
node.data-dir=/opt/module/presto/data
10) Presto consists of one coordinator node and multiple worker nodes. Configure node1 as the coordinator, and node2 and node3 as workers.
(1) Configure the coordinator on node1
vim config.properties
Add the following content
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8881
query.max-memory=50GB
discovery-server.enabled=true
discovery.uri=http://node1:8881
(2) Configure the worker nodes on node2 and node3
etc]$ vim config.properties
Add the following content on node2
coordinator=false
http-server.http.port=8881
query.max-memory=50GB
discovery.uri=http://node1:8881
etc]$ vim config.properties
Add the following content on node3
coordinator=false
http-server.http.port=8881
query.max-memory=50GB
discovery.uri=http://node1:8881
11) In /opt/module/hive on node1, start the Hive Metastore as the atguigu user
hive]$
nohup bin/hive --service metastore >/dev/null 2>&1 &
12) Start the Presto Server on node1, node2, and node3
(1) Start Presto in the foreground, with logs printed to the console
presto]$ bin/launcher run
presto]$ bin/launcher run
presto]$ bin/launcher run
(2) Start Presto in the background
presto]$ bin/launcher start
presto]$ bin/launcher start
presto]$ bin/launcher start
13) Logs are written under /opt/module/presto/data/var/log
1.2.2 Presto Command-Line Client Installation
1) Download the Presto client
https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.196/presto-cli-0.196-executable.jar
2) Upload presto-cli-0.196-executable.jar to the /opt/module/presto directory on node1
3) Rename the file
presto]$ mv presto-cli-0.196-executable.jar prestocli
4) Make it executable
presto]$ chmod +x prestocli
5) Start prestocli
presto]$ ./prestocli --server node1:8881 --catalog hive --schema default
6) Presto command-line operations
Working in the Presto CLI is much like working in the Hive CLI, except that every table must be qualified with its schema.
For example:
select * from schema.table limit 100
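The client can also run a single statement without entering the interactive shell by using --execute, which is convenient for quick checks. This is an optional usage example; the gmall schema and table name simply reuse names from the earlier chapters:
# One-off query without starting the interactive shell.
presto]$ ./prestocli --server node1:8881 --catalog hive --schema gmall --execute "select * from ads_gmv_sum_day limit 5"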
1.2.3 Presto Visual Client Installation
1) Upload yanagishima-18.0.zip to /opt/module on node1
2) Unzip yanagishima
module]$ unzip yanagishima-18.0.zip
cd yanagishima-18.0
3) Go to /opt/module/yanagishima-18.0/conf and edit the yanagishima.properties configuration
conf]$ vim yanagishima.properties
Add the following content
jetty.port=7080
presto.datasources=node1-presto
presto.coordinator.server.node1-presto=http://node1:8881
catalog.node1-presto=hive
schema.node1-presto=default
sql.query.engines=presto
4) Start yanagishima from the /opt/module/yanagishima-18.0 directory
yanagishima-18.0]$
nohup bin/yanagishima-start.sh >y.log 2>&1 &
5) Open the web page
http://node1:7080
Once the page loads, you can start running queries.
2.4 Druid Installation (Single Node)
1) Upload imply-2.7.10.tar.gz to /opt/software on node1 and extract it
software]$ tar -zxvf imply-2.7.10.tar.gz -C /opt/module
2) Rename /opt/module/imply-2.7.10 to /opt/module/imply
module]$ mv imply-2.7.10/ imply
3) Modify the configuration files
(1) Change Druid's ZooKeeper configuration
_common]$ vi /opt/module/imply/conf/druid/_common/common.runtime.properties
Modify the following content
druid.zk.service.host=node1:2181,node2:2181,node3:2181
(2) Modify the startup command parameters so that the default checks are skipped and the built-in ZooKeeper is not started
supervise]$
vim /opt/module/imply/conf/supervise/quickstart.conf
Modify the following content
:verify bin/verify-java
#:verify bin/verify-default-ports
#:verify bin/verify-version-check
:kill-timeout 10
#!p10 zk bin/run-zk conf-quickstart
4) Start the services
(1) Start ZooKeeper
imply]$ zk.sh start
(2) Start imply
imply]$ bin/supervise -c conf/supervise/quickstart.conf
Note: each service that starts prints one log line; the startup logs for each service can be found under /opt/module/imply/var/sv/
(3) Start the collection Flume agent and Kafka (mainly to save memory; node1's memory is also adjusted to 8 GB)
imply]$ f1.sh start
imply]$ kf.sh start
2.4.3 Using the Web UI
0) Start the log generation program (sending one log entry per second)
server]$ lg.sh 1000 5000
1) Open node1:9095 to view the data