数据仓库,简称数仓,( Data Warehouse )。
从逻辑上理解,数据库和数仓没有区别,都是通过数据库软件实现存放数据的地方,
只不过从数据量来说,数据仓库要比数据库更庞大。
数仓主要是为企业制定决策,提供数据支持的。
当业务简单,可以用数据库来存储,分析,制表。
但当数据量几何式增长,需要跨机器整合时,数仓就是非常必要的了。
几乎所有的数仓软件本身都不存储数据
数仓中存储的数据来源于多个数据源,原始数据在不同数据源中的存储方式各不相同。
要整合成为最终的数据集合,需要从数据源经过一系列抽取、清洗、转换的过程。
数仓中保存的数据是历史记录,不允许被修改。用户只能通过分析工具进行查询和分析。
数仓的数据会随时间变化而定期更新,这里的定期更新不是指修改数据,
一般是将业务系统发生变化的数据定期同步到数仓,和稳定性不冲突。
不可更新是针对应用而言,即用户分析处理时不更新数据
传统数据库对应的业务不同,数仓需要根据需求,将不同数据源的数据进行整合,
即数据一般都围绕某一业务主题进行建模。例如“贷款”主题、“存款”主题等。
之所以大型数据仓库系统架构设计复杂,是因为考虑到了未来3-5年的扩展性,
这样的话,未来不用花太大的精力去重建数据仓库系统,就能很稳定运行。
主要体现在数据建模的合理性,数据仓库方案中多出一些中间层,
使海量数据流有足够的缓冲,不至于数据量大很多,就运行不起来了。
就是各行业在处理事务过程中产生的数据。
比如用户在电商网站中登录、下单、支付等过程中,需要和网站后台数据库进行增删改查交互,
产生的数据就是业务数据。业务数据通常存储在MySQL、Oracle等数据库中。
用户在使用产品过程中,通过埋点收集与客户端产品交互过程中产生的数据,并发往日志服务器进行保存。 比如页面浏览、点击、停留、评论、点赞、收藏等。用户行为数据通常存储在日志文件中。
合作方提供数据接口或数据源
通常通过技术手段获取其他公司网站的数据。不建议这样去做。
一个好的维度,决定着数仓的生死。数仓分层有利于:
1. 减少重复开发,在数据开发的过程中可以产生中间层,将公共逻辑下沉,减少重复计算;
2. 清晰数据结构,每个分层分工明确,方便开发人员理解;
3. 方便定位问题,通过分层了解数据血缘关系,在出问题的时候通过回溯定位问题;
4. 简单化复杂问题,和分治法思想类似,分而治之,将复杂的问题简单化,还能解耦
严格遵守三范式理论
数据冗余度低,但在大数据的场景下,查询效率低。
维度模型相对清晰、简洁。
以数据分析作为出发点,不遵循三范式,故数据存在一定的冗余。
维度模型面向业务,将业务用事实表和维度表呈现出来,表结构简单,故查询简单,查询效率较高。
不同的企业不同的业务类型服务不同的用户,没有定论说分3、4层还是5层,只有最合适的分层。
一般的数仓包含:离线数仓、准实时、实时数仓;我们这里是采用T+1离线数仓,分为五层架构。
原始数据,此层保存最原始数据,并且备份,备份时可以压缩。
数据清洗,脱敏,维度退化。
DWS层就是关于各个主题的加工和使用,是宽表聚合值。
和DWS同,不过DWS层单位是日,而DWT是对截止到当日、或者近7日、近30日的汇总。
ADS层是面向业务定制的应用数据层。
保持数据原貌不做任何修改,保留历史数据,储存起到备份数据作用;
数据一般采用lzo、Snappy、parquet等压缩格式;
创建分区表,防止后续的全表扫描,减少集群资源访问数仓的压力,一般按天存储在数仓中。
由Kafka和flume采集到HDFS上;前端埋点日志以JSON格式形式存在,
又分为两部分:启动日志和事件日志。
由mysql
数据库数据通过sqoop
这种数据同步工具,采集到HDFS上;
DWD层是对事实表的处理,代表的是业务的最小粒度层。任何数据的记录都可以从这一层获取,为后续的DWS和DWT层做准备。还有对日志行为进行解析,以及对业务数据采用维度模型的方式重新建模(维度退化)。
DWD主要完成:
清洗的手段包括sql、mr、rdd、kettle、python等等。
清洗掉数据不能太多也不能很少。合理范围:1 万条数据清洗掉 1 条。
脱敏
对手机号、身份证号等敏感数据脱敏
维度退化
对业务数据传过来的表进行维度退化和降维。(商品一级二级三级、省市县、年月日)
压缩
LZO,列式存储 parquet
DWS层就是关于各个主题的加工和使用,这层是宽表聚合值,是各个事实表的聚合值。这里做轻度的汇总会让以后的计算更加的高效,如:统计各个主题对象计算7天、30天、90天的行为, 应对特殊需求(例如,购买行为,统计商品复购率)会快很多不必走ODS层反复拿数据做加工。
这层会把每个用户单日的行为聚合起来组成一张多列宽表,以便之后关联用户维度信息后进行,不同角度的统计分析。
涉及的主题包括:访客主题、用户主题、商品主题、优惠券主题、活动主题、地区主题等,更详细可参考【6】.
这层涉及的主题和DWS层一样包括:访客主题、用户主题、商品主题、优惠券主题、活动主题、地区主题等。只不过DWS层的粒度是对当日用户汇总信息,而DWT层是对截止到当日、或者近7日、近30日等的汇总信息。
以用户主题这个来举列:
*DWS层:用户主题层是记录某一个用户在某一天的汇总行为。
*DWT层:用户主题层是记录某一个用户截止在当日的汇总行为。
ADS层数据是专门给业务使用的数据层,这层是面向业务定制的应用数据层。
ADS主要完成:
(1)提供为数据产品使用的结果数据、指标等。
(2)提供给数据产品和数据分析使用的数据,一般会存放在 ES、MySQL等系统中供线上系统使用,也可能会存在 Hive 或者 Druid 中供数据分析和数据挖掘使用。如报表数据,或者说那种大宽表。
drop database if exists livestreaming_commerce;
-- 创建数据库
create database livestreaming_commerce default character set utf8mb3 collate utf8mb3_bin;
-- 使用数据库
use livestreaming_commerce;
-- 地区表
create table region(
region_code int not null comment '地区代码' primary key,
region_name varchar(10) null comment '地区名称',
parent_code int null comment '当前地区的上一级地区代码'
) comment '中国地区表' charset = utf8mb3;
create table region_district(
region_code int not null comment '地区代码' primary key,
region_name varchar(10) null comment '地区名称',
parent_code int null comment '当前地区的上一级地区代码'
) comment '中国区县表' charset = utf8mb3;
insert into region_district (region_code,region_name,parent_code)
select region_code,region_name,parent_code from region where region_code > 100000;
-- 数据建模
-- 用户表(user)
CREATE TABLE user (
user_id INT COMMENT '用户ID' AUTO_INCREMENT,
username VARCHAR(50) COMMENT '用户名',
gender VARCHAR(10) COMMENT '性别',
age INT COMMENT '年龄',
location VARCHAR(50) COMMENT '地点',
register_time DATETIME COMMENT '注册时间',
PRIMARY KEY (user_id)
);
-- 视频表(video)
CREATE TABLE video (
video_id INT COMMENT '视频ID' AUTO_INCREMENT,
user_id INT COMMENT '用户ID',
title VARCHAR(100) COMMENT '标题',
description TEXT COMMENT '描述',
duration INT COMMENT '时长',
upload_time DATETIME COMMENT '上传时间',
FOREIGN KEY (user_id) REFERENCES user(user_id),
PRIMARY KEY (video_id)
);
-- 直播间表(live_room)
CREATE TABLE live_room (
room_id INT COMMENT '直播间ID' AUTO_INCREMENT,
user_id INT COMMENT '用户ID',
title VARCHAR(100) COMMENT '标题',
description TEXT COMMENT '描述',
start_time DATETIME COMMENT '开始时间',
end_time DATETIME COMMENT '结束时间',
FOREIGN KEY (user_id) REFERENCES user(user_id),
PRIMARY KEY (room_id)
);
-- 商品表(product)
CREATE TABLE product (
product_id INT COMMENT '商品ID' AUTO_INCREMENT,
seller_id INT COMMENT '卖家ID',
name VARCHAR(100) COMMENT '名称',
description TEXT COMMENT '描述',
price DECIMAL(10, 2) COMMENT '价格',
create_time DATETIME COMMENT '创建时间',
FOREIGN KEY (seller_id) REFERENCES user(user_id),
PRIMARY KEY (product_id)
);
-- 商品订单表(product_order)
CREATE TABLE product_order (
order_id INT COMMENT '订单ID' AUTO_INCREMENT,
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time DATETIME COMMENT '下单时间',
FOREIGN KEY (buyer_id) REFERENCES user(user_id),
FOREIGN KEY (seller_id) REFERENCES user(user_id),
FOREIGN KEY (product_id) REFERENCES product(product_id),
PRIMARY KEY (order_id)
);
-- 直播间订单表(live_order)
CREATE TABLE live_order (
order_id INT COMMENT '订单ID' AUTO_INCREMENT,
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
room_id INT COMMENT '直播间ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time DATETIME COMMENT '下单时间',
FOREIGN KEY (buyer_id) REFERENCES user(user_id),
FOREIGN KEY (seller_id) REFERENCES user(user_id),
FOREIGN KEY (room_id) REFERENCES live_room(room_id),
FOREIGN KEY (product_id) REFERENCES product(product_id),
PRIMARY KEY (order_id)
);
-- 评论表(comment)
CREATE TABLE `comment` (
comment_id INT COMMENT '评论ID' AUTO_INCREMENT,
user_id INT COMMENT '用户ID',
video_id INT COMMENT '视频ID',
live_room_id INT COMMENT '直播间ID',
content TEXT COMMENT '内容',
comment_time DATETIME COMMENT '评论时间',
FOREIGN KEY (user_id) REFERENCES user(user_id),
FOREIGN KEY (video_id) REFERENCES video(video_id),
FOREIGN KEY (live_room_id) REFERENCES live_room(room_id),
PRIMARY KEY (comment_id)
);
-- 退货表(return_order)
CREATE TABLE return_order (
return_id INT COMMENT '退货ID' AUTO_INCREMENT,
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
order_id INT COMMENT '订单ID',
reason TEXT COMMENT '原因',
return_time DATETIME COMMENT '退货时间',
FOREIGN KEY (buyer_id) REFERENCES user(user_id),
FOREIGN KEY (seller_id) REFERENCES user(user_id),
FOREIGN KEY (product_id) REFERENCES product(product_id),
FOREIGN KEY (order_id) REFERENCES product_order(order_id),
PRIMARY KEY (return_id)
);
-- 举报表(report)
CREATE TABLE report (
report_id INT COMMENT '举报ID' AUTO_INCREMENT,
user_id INT COMMENT '用户ID',
live_room_id INT COMMENT '直播间ID',
reason TEXT COMMENT '原因',
report_time DATETIME COMMENT '举报时间',
FOREIGN KEY (user_id) REFERENCES user(user_id),
FOREIGN KEY (live_room_id) REFERENCES live_room(room_id),
PRIMARY KEY (report_id)
);
-- 纠纷表(dispute)
CREATE TABLE dispute (
dispute_id INT COMMENT '纠纷ID' AUTO_INCREMENT,
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
order_id INT COMMENT '订单ID',
reason TEXT COMMENT '原因',
dispute_time DATETIME COMMENT '纠纷时间',
FOREIGN KEY (buyer_id) REFERENCES user(user_id),
FOREIGN KEY (seller_id) REFERENCES user(user_id),
FOREIGN KEY (order_id) REFERENCES product_order(order_id),
PRIMARY KEY (dispute_id)
);
warehouse/data/mysql/livestreaming_commerce.sql
mysql livestreaming_commerce -h spark03 -uroot -p < warehouse/shell/sql/mysql/livestreaming_commerce_mysql.sql
- 创建
create_db.sql
文件- 执行
create_db.sql
create_db.sql
文件内容如下:
-- 删除数据库
drop database if exists ods_lc;
drop database if exists dwd_lc;
drop database if exists dwt_lc;
drop database if exists dws_lc;
drop database if exists ads_lc;
drop database if exists app_lc;
-- 创建数据库
create database if not exists ods_lc location '/data/ods/livestreaming_commerce';
create database if not exists dwd_lc location '/data/dwd/livestreaming_commerce';
create database if not exists dwt_lc location '/data/dwt/livestreaming_commerce';
create database if not exists dws_lc location '/data/dwt/livestreaming_commerce';
create database if not exists ads_lc location '/data/dws/livestreaming_commerce';
create database if not exists app_lc location '/data/ads/livestreaming_commerce';
执行 create_db.sql
hive -f warehouse/shell/hive/create_db.sql
- 创建
create_ods_table.sql
文件- 执行
create_ods_table.sql
文件
create_ods_table.sql
文件-- ODS层
-- 公共数据地区表
-- 地区表
create external table ods_lc.region(
region_code string comment '地区代码',
region_name string comment '地区名称',
parent_code string comment '当前地区的上一级地区代码'
) comment '中国地区表'
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/public/region';
create external table ods_lc.region_district(
region_code string comment '地区代码',
region_name string comment '地区名称',
parent_code string comment '当前地区的上一级地区代码'
) comment '中国区县表'
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/public/region_district';
-- 抖音数据存放在 /ods/douyin 目录, 数据格式:867,3,汽车,用车知识,206,2022-01-01 23:18:38
-- 数据建模
-- 用户表(user)
CREATE external TABLE ods_lc.douyin_user (
user_id INT COMMENT '用户ID',
username string COMMENT '用户名',
gender string COMMENT '性别',
age INT COMMENT '年龄',
location string COMMENT '地点',
register_time timestamp COMMENT '注册时间'
) comment '用户表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/user';
-- 视频表(video)
CREATE external TABLE ods_lc.douyin_video (
video_id INT COMMENT '视频ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
duration INT COMMENT '时长',
upload_time timestamp COMMENT '上传时间'
) comment '视频表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/video';
-- 直播间表(live_room)
CREATE external TABLE ods_lc.douyin_live_room (
room_id INT COMMENT '直播间ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
start_time timestamp COMMENT '开始时间',
end_time timestamp COMMENT '结束时间'
) comment '直播间'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/live_room';
-- 商品表(product)
CREATE external TABLE ods_lc.douyin_product (
product_id INT COMMENT '商品ID',
seller_id INT COMMENT '卖家ID',
name string COMMENT '名称',
description string COMMENT '描述',
price DECIMAL(10, 2) COMMENT '价格',
create_time timestamp COMMENT '创建时间'
) comment '商品表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/product';
-- 商品订单表(product_order)
CREATE external TABLE ods_lc.douyin_product_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '商品订单表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/product_order';
-- 直播间订单表(live_order)
CREATE external TABLE ods_lc.douyin_live_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
room_id INT COMMENT '直播间ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '直播间订单表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/live_order';
-- 评论表(comment)
CREATE external TABLE ods_lc.douyin_comment (
comment_id INT COMMENT '评论ID',
user_id INT COMMENT '用户ID',
video_id INT COMMENT '视频ID',
live_room_id INT COMMENT '直播间ID',
content string COMMENT '内容',
comment_time timestamp COMMENT '评论时间'
) comment '评论表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/comment';
-- 退货表(return_order)
CREATE external TABLE ods_lc.douyin_return_order (
return_id INT COMMENT '退货ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
return_time timestamp COMMENT '退货时间'
) comment '退货表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/return_order';
-- 投诉表(report)
CREATE external TABLE ods_lc.douyin_report (
report_id INT COMMENT '举报ID',
user_id INT COMMENT '用户ID',
live_room_id INT COMMENT '直播间ID',
reason string COMMENT '原因',
report_time timestamp COMMENT '举报时间'
) comment '投诉表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/report';
-- 纠纷表(dispute)
CREATE external TABLE ods_lc.douyin_dispute (
dispute_id INT COMMENT '纠纷ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
dispute_time timestamp COMMENT '纠纷时间'
) comment '纠纷表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/dispute';
-- 快手数据存放在 /ods/kuaishou 目录,数据格式:{"videoId":1,"userId":494,"title":"动物","description":"野生动物","duration":59,"uploadTime":"2022-01-13 05:25:12"}
-- 数据建模
-- 用户表(user)
CREATE external TABLE ods_lc.kuaishou_user (
user_id INT COMMENT '用户ID',
username string COMMENT '用户名',
gender string COMMENT '性别',
age INT COMMENT '年龄',
location string COMMENT '地点',
register_time timestamp COMMENT '注册时间'
) comment '用户表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/user';
-- WITH SERDEPROPERTIES ( "mapping.ts" = "timestamp" )
-- ALTER TABLE user SET SERDEPROPERTIES ( "ignore.malformed.json" = "true" );
-- 视频表(video)
CREATE external TABLE ods_lc.kuaishou_video (
video_id INT COMMENT '视频ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
duration INT COMMENT '时长',
upload_time timestamp COMMENT '上传时间'
) comment '视频表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/video';
-- 直播间表(live_room)
CREATE external TABLE ods_lc.kuaishou_live_room (
room_id INT COMMENT '直播间ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
start_time timestamp COMMENT '开始时间',
end_time timestamp COMMENT '结束时间'
) comment '直播间'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/live_room';
-- 商品表(product)
CREATE external TABLE ods_lc.kuaishou_product (
product_id INT COMMENT '商品ID',
seller_id INT COMMENT '卖家ID',
name string COMMENT '名称',
description string COMMENT '描述',
price DECIMAL(10, 2) COMMENT '价格',
create_time timestamp COMMENT '创建时间'
) comment '商品表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/product';
-- 商品订单表(product_order)
CREATE external TABLE ods_lc.kuaishou_product_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '商品订单表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/product_order';
-- 直播间订单表(live_order)
CREATE external TABLE ods_lc.kuaishou_live_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
room_id INT COMMENT '直播间ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '直播间订单表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/live_order';
-- 评论表(comment)
CREATE external TABLE ods_lc.kuaishou_comment (
comment_id INT COMMENT '评论ID',
user_id INT COMMENT '用户ID',
video_id INT COMMENT '视频ID',
live_room_id INT COMMENT '直播间ID',
content string COMMENT '内容',
comment_time timestamp COMMENT '评论时间'
) comment '评论表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/comment';
-- 退货表(return_order)
CREATE external TABLE ods_lc.kuaishou_return_order (
return_id INT COMMENT '退货ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
return_time timestamp COMMENT '退货时间'
) comment '退货表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/return_order';
-- 投诉表(report)
CREATE external TABLE ods_lc.kuaishou_report (
report_id INT COMMENT '举报ID',
user_id INT COMMENT '用户ID',
live_room_id INT COMMENT '直播间ID',
reason string COMMENT '原因',
report_time timestamp COMMENT '举报时间'
) comment '投诉表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/report';
-- 纠纷表(dispute)
CREATE external TABLE ods_lc.kuaishou_dispute (
dispute_id INT COMMENT '纠纷ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
dispute_time timestamp COMMENT '纠纷时间'
) comment '纠纷表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/dispute';
-- 小红书数据存放在 /ods/kuaishou 目录,数据格式:{"data:[{"videoId":1,"userId":494,"title":"动物","description":"野生动物","duration":59,"uploadTime":"2022-01-13 05:25:12"}]"}
create external table ods_lc.xiaohongshu_user(info string) comment '用户表' location '/ods/xiaohongshu/user';
create external table ods_lc.xiaohongshu_video(info string) comment '视频表' location '/ods/xiaohongshu/video';
create external table ods_lc.xiaohongshu_live_room(info string) comment '直播间' location '/ods/xiaohongshu/live_room';
create external table ods_lc.xiaohongshu_product(info string) comment '商品表' location '/ods/xiaohongshu/product';
create external table ods_lc.xiaohongshu_prouct_order(info string) comment '商品订单表' location '/ods/xiaohongshu/product_order';
create external table ods_lc.xiaohongshu_live_order(info string) comment '直播间订单表' location '/ods/xiaohongshu/live_order';
create external table ods_lc.xiaohongshu_comment(info string) comment '评论表' location '/ods/xiaohongshu/comment';
create external table ods_lc.xiaohongshu_return_order(info string) comment '退货表' location '/ods/xiaohongshu/return_order';
create external table ods_lc.xiaohongshu_report(info string) comment '投诉表' location '/ods/xiaohongshu/report';
create external table ods_lc.xiaohongshu_dispute(info string) comment '纠纷表' location '/ods/xiaohongshu/dispute';
执行 create_ods_table.sql
文件
hive -f /warehouse/shell/hive/create_ods_table.sql
方法一:sqoop + shell
方法二:springboot + quarz + hdfs | hive
方法三:springboot + spark + quarz + hdfs | hive
vim warehouse/shell/sqoop/sqoop_collect_data_util.sh
if [$# != 2]
then
echo "参数异常:sqoop_collect_data.util.sh "
exit 33
fi
# 第一个参数为sql语句
sql=$1
# 第二个参数为hdfs存储路径
hdfs_path=$2
sqoop import \
--username root \
--password 'Lihaozhe!!@@1122' \
--connect jdbc:mysql://spark03:3306/livestreaming_commerce?characterEncoding=UTF-8 \
--input-fields-terminated-by ',' \
--input-lines-terminated-by '\n' \
--num-mappers 1 \
--target-dir "${hdfs_path}" \
--delete-target-dir \
--query "${sql}"' and $CONDITIONS' \
--null-string '\\N' \
--null-non-string '\\N'
vim warehouse/shell/sqoop/collect_data_full.sh
#!/bin/bash
# 全量数据采集
# 当调用该脚本的时候没有传递参数 z=z 成立 则获取当前日期前一天赋值给变量dt
# 当调用该脚本的时候传递参数 z=z 不成立 则将第一个参数赋值给变量dt
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 day ago"`
else
dt=$1
fi
# sql语句
region_sql="select region_code, region_name, parent_code from region where 1=1"
region_district_sql="select region_code, region_name, parent_code from region_district where 1=1"
# 路径前缀
path_prefix="/ods/public"
# HDFS存储路径
region_path="${path_prefix}/region/${dt}"
region_district_path="${path_prefix}/region_district/${dt}"
# 采集数据
echo "开始采集..."
echo "采集数据表 region 数据..."
bash sqoop_collect_data_util.sh "${region_sql}" "${region_path}"
echo "采集数据表 region_district 数据..."
bash sqoop_collect_data_util.sh "${region_district_sql}" "${region_district_path}"
echo "结束采集..."
bash warehouse/shell/sqoop/collect_data_full.sh
vim warehouse/shell/sqoop/collect_data_incrment.sh
#!/bin/bash
# 曾量数据采集 每天凌晨之后 一个固定时间段执行一次 采集前一天的数据
# 当调用该脚本的时候没有传递参数 z=z 成立 则获取当前日期前一天赋值给变量dt
# 当调用该脚本的时候传递参数 z=z 不成立 则将第一个参数赋值给变量dt
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 day ago"`
pt1=`date +%Y --date="${dt}"`
pt2=`date +%m --date="${dt}"`
pt3=`date +%d --date="${dt}"`
else
dt=$1
pt1=`date +%Y --date="$1"`
pt2=`date +%m --date="$1"`
pt3=`date +%d --date="$1"`
fi
# 转换日期格式,20260101 改为 2026-01-01 sql查询日期区间
dt_new=`date +%Y-%m-%d --date="${dt}"`
# sql语句
user_sql="select user_id, username, gender, age, location, date_format(register_time,'%Y-%m-%d %H:%i:%s') from user where register_time between '${dt_new} 00:00:00' and '${dt_new} 23:59:59' and 1=1"
video_sql="select video_id, user_id, title, description, duration, date_format(upload_time,'%Y-%m-%d %H:%i:%s') from video where upload_time between '${dt_new} 00:00:00' and '${dt_new} 23:59:59' and 1=1"
live_room_sql="select room_id, user_id, title, description
, date_format(start_time,'%Y-%m-%d %H:%i:%s')
, date_format(end_time,'%Y-%m-%d %H:%i:%s') from live_room
where end_time between '${dt_new} 00:00:00' and '${dt_new} 23:59:59' and 1=1"
product_sql="select product_id, seller_id, name, description, price, date_format(create_time,'%Y-%m-%d %H:%i:%s') from product where create_time between '${dt_new} 00:00:00' and '${dt_new} 23:59:59' and 1=1"
product_order_sql="select order_id, buyer_id, seller_id, product_id, quantity, price, date_format(order_time,'%Y-%m-%d %H:%i:%s') from product_order where order_time between '${dt_new} 00:00:00' and '${dt_new} 23:59:59' and 1=1"
# 路径前缀
path_prefix="/ods/douyin"
# HDFS存储路径
user_path="${path_prefix}/user/${pt1}/${pt2}/${pt3}"
video_path="${path_prefix}/video/${pt1}/${pt2}/${pt3}"
live_room_path="${path_prefix}/live_room/${pt1}/${pt2}/${pt3}"
product_path="${path_prefix}/product/${pt1}/${pt2}/${pt3}"
product_order_path="${path_prefix}/product_order/${pt1}/${pt2}/${pt3}"
# 采集数据
echo "开始采集..."
echo "采集数据表 user 数据..."
bash sqoop_collect_data_util.sh "${user_sql}" "${user_path}"
echo "采集数据表 video 数据..."
bash sqoop_collect_data_util.sh "${video_sql}" "${video_path}"
echo "采集数据表 live_room 数据..."
bash sqoop_collect_data_util.sh "${live_room_sql}" "${live_room_path}"
echo "采集数据表 product 数据..."
bash sqoop_collect_data_util.sh "${product_sql}" "${product_path}"
echo "采集数据表 product_order 数据..."
bash sqoop_collect_data_util.sh "${product_order_sql}" "${product_order_path}"
echo "结束采集..."
采集某天的数据
bash warehouse/shell/sqoop/collect_data_incrment.sh 20231122
采集前一天的数据
bash warehouse/shell/sqoop/collect_data_incrment.sh
vim warehouse/shell/sqoop/collect_data_year_incrment.sh
#!/bin/bash
# 如果没有输入参数,则获取当前年份
if [ $# -eq 0 ]; then
current_year=$(date +'%Y')
years=($current_year)
else
years=("$@")
fi
for year in "${years[@]}"
do
# 判断是否为闰年
if [ $((year % 4)) -eq 0 ] && [ $((year % 100)) -ne 0 ] || [ $((year % 400)) -eq 0 ]; then
days_in_feb=29
else
days_in_feb=28
fi
# 循环打印每一天
for month in {1..12}
do
# 获取当前月份的天数
case $month in
1|3|5|7|8|10|12) days_in_month=31 ;;
4|6|9|11) days_in_month=30 ;;
2) days_in_month=$days_in_feb ;;
esac
for (( day=1; day<=$days_in_month; day++ ))
do
# 格式化日期
date_string=$(date -d "$month/$day/$year" +'%Y%m%d')
bash collect_data_incrment.sh $date_string
done
done
done
采集某年全年数据
bash warehouse/shell/sqoop/collect_data_year_incrment.sh 2023
采集执行时间所在年份全年数据
warehouse/shell/sqoop/collect_data_year_incrment.sh
vim warehouse/shell/sqoop/collect_data_year_month_incrment.sh
#!/bin/bash
# 如果没有输入参数,则获取当前年份
if [ $# -eq 0 ]; then
current_year=$(date +'%Y')
years=($current_year)
else
years=()
for arg in "$@"
do
if [[ $arg =~ ^[0-9]{4}$ ]]; then
years+=($arg)
elif [[ $arg =~ ^[0-9]{6}$ ]]; then
years+=(${arg:0:4})
months+=(${arg:4:2})
fi
done
fi
for year in "${years[@]}"
do
# 判断是否为闰年
if [ $((year % 4)) -eq 0 ] && [ $((year % 100)) -ne 0 ] || [ $((year % 400)) -eq 0 ]; then
days_in_feb=29
else
days_in_feb=28
fi
# 如果参数是yyyyMM格式,则打印出该年月的每一天
if [ ${#months[@]} -gt 0 ]; then
for month in "${months[@]}"
do
# 获取当前月份的天数
case $month in
01|03|05|07|08|10|12) days_in_month=31 ;;
04|06|09|11) days_in_month=30 ;;
02) days_in_month=$days_in_feb ;;
esac
# 循环打印每一天
for (( day=1; day<=$days_in_month; day++ ))
do
# 格式化日期
date_string=$(date -d "$month/$day/$year" +'%Y%m%d')
bash collect_data_incrment.sh $date_string
done
done
else
# 打印当前参数所在年份的每一天
for month in {01..12}
do
case $month in
01|03|05|07|08|10|12) days_in_month=31 ;;
04|06|09|11) days_in_month=30 ;;
02) days_in_month=$days_in_feb ;;
esac
for (( day=1; day<=$days_in_month; day++ ))
do
date_string=$(date -d "$month/$day/$year" +'%Y%m%d')
bash collect_data_incrment.sh $date_string
done
done
fi
done
采集某年全年数据
bash warehouse/shell/sqoop/collect_data_year_month_incrment.sh 2023
采集某年某月全月数据
warehouse/shell/sqoop/collect_data_year_month_incrment.sh 20231122
vim warehouse/shell/hive/add_partition.sh
#!/bin/bash
# 加外部分区表
# 接收三个参数
# 1、表名称
# 2、分区字段dt的值,格式:20260201
# 3、分区路径(相对或者绝对路径都可以)但个人习惯使用绝对路径
if [ $# != 3 ]
then
echo "参数异常:add_partition.sh "
exit 100
fi
# 表名称
table_name=$1
# 分区字段的值,格式:20260201
date=$2
year=$(date -d "$date" +%Y)
month=$(date -d "$date" +%m)
day=$(date -d "$date" +%d)
# 分区路径
path=$3
echo "alter table "${table_name}" add if not exists partition ("y=${year},m=${month},d=${day}") location '"${path}"';"
hive -e "
alter table "${table_name}" add if not exists partition ("y=${year},m=${month},d=${day}") location '"${path}"';
"
使用举例:
bash warehouse/shell/hive/add_partition.sh 20231122 /ods/douyin/user/2022/01/01
vim warehouse/shell/hive/warehouse/shell/hive/add_partition.sh
if [ $# -eq 0 ]; then
current_year=$(date +'%Y')
years=($current_year)
else
years=()
for arg in "$@"
do
if [[ $arg =~ ^[0-9]{4}$ ]]; then
years+=($arg)
elif [[ $arg =~ ^[0-9]{6}$ ]]; then
years+=(${arg:0:4})
months+=(${arg:4:2})
fi
done
fi
for year in "${years[@]}"
do
# 判断是否为闰年
if [ $((year % 4)) -eq 0 ] && [ $((year % 100)) -ne 0 ] || [ $((year % 400)) -eq 0 ]; then
days_in_feb=29
else
days_in_feb=28
fi
# 如果参数是yyyyMM格式,则打印出该年月的每一天
if [ ${#months[@]} -gt 0 ]; then
for month in "${months[@]}"
do
# 获取当前月份的天数
case $month in
01|03|05|07|08|10|12) days_in_month=31 ;;
04|06|09|11) days_in_month=30 ;;
02) days_in_month=$days_in_feb ;;
esac
# 循环打印每一天
for (( day=1; day<=$days_in_month; day++ ))
do
# 格式化日期
date_string=$(date -d "$month/$day/$year" +'%Y%m%d')
partation_string=$(date -d "$month/$day/$year" +'%Y/%m/%d')
bash add_partation.sh ods_lc.douyin_user $date_string /ods/douyin/user/$partation_string
bash add_partation.sh ods_lc.douyin_video $date_string /ods/douyin/video/$partation_string
bash add_partation.sh ods_lc.douyin_live_room $date_string /ods/douyin/live_room/$partation_string
bash add_partation.sh ods_lc.douyin_product $date_string /ods/douyin/product/$partation_string
bash add_partation.sh ods_lc.douyin_product_order $date_string /ods/douyin/product_order/$partation_string
done
done
else
# 打印当前参数所在年份的每一天
for month in {01..12}
do
case $month in
01|03|05|07|08|10|12) days_in_month=31 ;;
04|06|09|11) days_in_month=30 ;;
02) days_in_month=$days_in_feb ;;
esac
for (( day=1; day<=$days_in_month; day++ ))
do
date_string=$(date -d "$month/$day/$year" +'%Y%m%d')
partation_string=$(date -d "$month/$day/$year" +'%Y/%m/%d')
bash add_partation.sh ods_lc.douyin_user $date_string /ods/douyin/user/$partation_string
bash add_partation.sh ods_lc.douyin_video $date_string /ods/douyin/video/$partation_string
bash add_partation.sh ods_lc.douyin_live_room $date_string /ods/douyin/live_room/$partation_string
bash add_partation.sh ods_lc.douyin_product $date_string /ods/douyin/product/$partation_string
bash add_partation.sh ods_lc.douyin_product_order $date_string /ods/douyin/product_order/$partation_string
done
done
fi
done
使用举例:
为2023年11月每天所有目录添加分区
bash warehouse/shell/hive/add_partition.sh 202311
为2023年全年所有目录添加分区
bash warehouse/shell/hive/add_partition.sh 202311
-- 快手数据存放在 /ods/kuaishou 目录,数据格式:{"videoId":1,"userId":494,"title":"动物","description":"野生动物","duration":59,"uploadTime":"2022-01-13 05:25:12"}
-- 数据建模
-- 用户表(user)
CREATE external TABLE ods_lc.kuaishou_user (
user_id INT COMMENT '用户ID',
username string COMMENT '用户名',
gender string COMMENT '性别',
age INT COMMENT '年龄',
location string COMMENT '地点',
register_time DATE COMMENT '注册时间'
) comment '用户表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/user';
-- 视频表(video)
CREATE external TABLE ods_lc.kuaishou_video (
video_id INT COMMENT '视频ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
duration INT COMMENT '时长',
upload_time timestamp COMMENT '上传时间'
) comment '视频表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/video';
-- 直播间表(live_room)
CREATE external TABLE ods_lc.kuaishou_live_room (
room_id INT COMMENT '直播间ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
start_time timestamp COMMENT '开始时间',
end_time timestamp COMMENT '结束时间'
) comment '直播间'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/live_room';
-- 商品表(product)
CREATE external TABLE ods_lc.kuaishou_product (
product_id INT COMMENT '商品ID',
seller_id INT COMMENT '卖家ID',
name string COMMENT '名称',
description string COMMENT '描述',
price DECIMAL(10, 2) COMMENT '价格',
create_time timestamp COMMENT '创建时间'
) comment '商品表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/product';
-- 商品订单表(product_order)
CREATE external TABLE ods_lc.kuaishou_product_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '商品订单表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/product_order';
-- 直播间订单表(live_order)
CREATE external TABLE ods_lc.kuaishou_live_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
room_id INT COMMENT '直播间ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '直播间订单表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/live_order';
-- 评论表(comment)
CREATE external TABLE ods_lc.kuaishou_comment (
comment_id INT COMMENT '评论ID',
user_id INT COMMENT '用户ID',
video_id INT COMMENT '视频ID',
live_room_id INT COMMENT '直播间ID',
content string COMMENT '内容',
comment_time timestamp COMMENT '评论时间'
) comment '评论表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/comment';
-- 退货表(return_order)
CREATE external TABLE ods_lc.kuaishou_return_order (
return_id INT COMMENT '退货ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
return_time timestamp COMMENT '退货时间'
) comment '退货表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/return_order';
-- 投诉表(report)
CREATE external TABLE ods_lc.kuaishou_report (
report_id INT COMMENT '举报ID',
user_id INT COMMENT '用户ID',
live_room_id INT COMMENT '直播间ID',
reason string COMMENT '原因',
report_time timestamp COMMENT '举报时间'
) comment '投诉表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/report';
-- 纠纷表(dispute)
CREATE external TABLE ods_lc.kuaishou_dispute (
dispute_id INT COMMENT '纠纷ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
dispute_time timestamp COMMENT '纠纷时间'
) comment '纠纷表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/dispute';
hdfs dfs -put warehouse/data/kuaishou/user.log /ods/kuaishou/user
hdfs dfs -put warehouse/data/kuaishou/video.log /ods/kuaishou/video
hdfs dfs -put warehouse/data/kuaishou/live_room.log /ods/kuaishou/live_room
hdfs dfs -put warehouse/data/kuaishou/product.log /ods/kuaishou/product
hdfs dfs -put warehouse/data/kuaishou/product_order.log /ods/kuaishou/product_order
create external table ods_lc.xiaohongshu_user(info string) comment '用户表' location '/ods/xiaohongshu/user';
create external table ods_lc.xiaohongshu_video(info string) comment '视频表' location '/ods/xiaohongshu/video';
create external table ods_lc.xiaohongshu_live_room(info string) comment '直播间' location '/ods/xiaohongshu/live_room';
create external table ods_lc.xiaohongshu_product(info string) comment '商品表' location '/ods/xiaohongshu/product';
create external table ods_lc.xiaohongshu_prouct_order(info string) comment '商品订单表' location '/ods/xiaohongshu/product_order';
create external table ods_lc.xiaohongshu_live_order(info string) comment '直播间订单表' location '/ods/xiaohongshu/live_order';
create external table ods_lc.xiaohongshu_comment(info string) comment '评论表' location '/ods/xiaohongshu/comment';
create external table ods_lc.xiaohongshu_return_order(info string) comment '退货表' location '/ods/xiaohongshu/return_order';
create external table ods_lc.xiaohongshu_report(info string) comment '投诉表' location '/ods/xiaohongshu/report';
create external table ods_lc.xiaohongshu_dispute(info string) comment '纠纷表' location '/ods/xiaohongshu/dispute';
hdfs dfs -put warehouse/data/xiaohongshu/user.json /ods/xiaohongshu/user
hdfs dfs -put warehouse/data/xiaohongshu/video.json /ods/xiaohongshu/video
hdfs dfs -put warehouse/data/xiaohongshu/live_room.json /ods/xiaohongshu/live_room
hdfs dfs -put warehouse/data/xiaohongshu/product.json /ods/xiaohongshu/product
hdfs dfs -put warehouse/data/xiaohongshu/product_order.json /ods/xiaohongshu/product_order
-- DWD
-- 数据建模
-- 用户表(user)
CREATE external TABLE dwd_lc.xiaohongshu_user (
user_id INT COMMENT '用户ID',
username string COMMENT '用户名',
gender string COMMENT '性别',
age INT COMMENT '年龄',
location string COMMENT '地点',
register_time timestamp COMMENT '注册时间'
) comment '用户表'
location '/dwd/xiaohongshu/user';
-- 视频表(video)
CREATE external TABLE dwd_lc.xiaohongshu_video (
video_id INT COMMENT '视频ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
duration INT COMMENT '时长',
upload_time timestamp COMMENT '上传时间'
) comment '视频表'
location '/dwd/xiaohongshu/video';
-- 直播间表(live_room)
CREATE external TABLE dwd_lc.xiaohongshu_live_room (
room_id INT COMMENT '直播间ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
start_time timestamp COMMENT '开始时间',
end_time timestamp COMMENT '结束时间'
) comment '直播间'
location '/dwd/xiaohongshu/live_room';
-- 商品表(product)
CREATE external TABLE dwd_lc.xiaohongshu_product (
product_id INT COMMENT '商品ID',
seller_id INT COMMENT '卖家ID',
name string COMMENT '名称',
description string COMMENT '描述',
price DECIMAL(10, 2) COMMENT '价格',
create_time timestamp COMMENT '创建时间'
) comment '商品表'
location '/dwd/xiaohongshu/product';
-- 商品订单表(product_order)
CREATE external TABLE dwd_lc.xiaohongshu_product_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '商品订单表'
location '/dwd/xiaohongshu/product_order';
-- 直播间订单表(live_order)
CREATE external TABLE dwd_lc.xiaohongshu_live_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
room_id INT COMMENT '直播间ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '直播间订单表'
location '/dwd/xiaohongshu/live_order';
-- 评论表(comment)
CREATE external TABLE dwd_lc.xiaohongshu_comment (
comment_id INT COMMENT '评论ID',
user_id INT COMMENT '用户ID',
video_id INT COMMENT '视频ID',
live_room_id INT COMMENT '直播间ID',
content string COMMENT '内容',
comment_time timestamp COMMENT '评论时间'
) comment '评论表'
location '/dwd/xiaohongshu/comment';
-- 退货表(return_order)
CREATE external TABLE dwd_lc.xiaohongshu_return_order (
return_id INT COMMENT '退货ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
return_time timestamp COMMENT '退货时间'
) comment '退货表'
location '/dwd/xiaohongshu/return_order';
-- 投诉表(report)
CREATE external TABLE dwd_lc.xiaohongshu_report (
report_id INT COMMENT '举报ID',
user_id INT COMMENT '用户ID',
live_room_id INT COMMENT '直播间ID',
reason string COMMENT '原因',
report_time timestamp COMMENT '举报时间'
) comment '投诉表'
location '/dwd/xiaohongshu/report';
-- 纠纷表(dispute)
CREATE external TABLE dwd_lc.xiaohongshu_dispute (
dispute_id INT COMMENT '纠纷ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
dispute_time timestamp COMMENT '纠纷时间'
) comment '纠纷表'
location '/dwd/xiaohongshu/dispute';
ETL是将业务系统的数据经过**抽取(Extract)、清洗转换(Transform)之后加载(Load)**到数据仓库的过程,
目的是将企业中的分散、零乱、标准不统一的数据整合到一起,为企业的决策提供分析依据, ETL是BI(商业智能)项目重要的一个环节。
-- 小红书 ETL
-- 用户表(user)
insert overwrite table dwd_lc.xiaohongshu_user (
select json_tuple(people,'user_id','username','gender','age','location','register_time') as (user_id,username,gender,age,location,register_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),";") people_list from ods_lc.xiaohongshu_user) ods
lateral view explode(people_list) t1 as people) t2);
-- 视频表(video)
insert overwrite table dwd_lc.xiaohongshu_video (
select json_tuple(people,'video_id','user_id','title','description','duration','upload_time') as (video_id,user_id,title,description,duration,upload_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),";") people_list from ods_lc.xiaohongshu_video) ods
lateral view explode(people_list) t1 as people) t2);
-- 直播间表(live_room)
insert overwrite table dwd_lc.xiaohongshu_live_room (
select json_tuple(people,'room_id','user_id','title','description','start_time','end_time') as (room_id,user_id,title,description,start_time,end_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),";") people_list from ods_lc.xiaohongshu_live_room) ods
lateral view explode(people_list) t1 as people) t2);
-- 商品表(product)
insert overwrite table dwd_lc.xiaohongshu_product (
select json_tuple(people,'product_id','seller_id','name','description','price','create_time') as (product_id,seller_id,name,description,price,create_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),";") people_list from ods_lc.xiaohongshu_product) ods
lateral view explode(people_list) t1 as people) t2);
-- 商品订单表(product_order)
insert overwrite table dwd_lc.xiaohongshu_product_order (
select json_tuple(people,'order_id','buyer_id','seller_id','product_id','quantity','price','order_time') as (order_id,buyer_id,seller_id,product_id,quantity,price,order_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),";") people_list from ods_lc.xiaohongshu_prouct_order) ods
lateral view explode(people_list) t1 as people) t2);
-- 数据分析需求示例
-- 1. 求每个用户的视频数量
SELECT user_id, COUNT(*) AS video_count
FROM video
GROUP BY user_id
ORDER BY video_count DESC
LIMIT 30;
-- 2. 求每个视频的点赞数量
SELECT video_id, COUNT(*) AS like_count
FROM video_like
GROUP BY video_id
ORDER BY like_count DESC
LIMIT 30;
-- 3. 求每个商品的销量和销售额
SELECT product_id, SUM(quantity) AS total_sales, SUM(quantity * price) AS total_revenue
FROM product_order
GROUP BY product_id
ORDER BY total_sales DESC
LIMIT 30;
-- 4. 求每个直播间的观众数量
SELECT room_id, COUNT(*) AS viewer_count
FROM live_viewer
GROUP BY room_id
ORDER BY viewer_count DESC
LIMIT 30;
-- 5. 求每个用户的评论数量
SELECT user_id, COUNT(*) AS comment_count
FROM user_comment
GROUP BY user_id
ORDER BY comment_count DESC
LIMIT 30;
-- 6. 求每个用户的评价数量
SELECT user_id, COUNT(*) AS review_count
FROM product_review
GROUP BY user_id
ORDER BY review_count DESC
LIMIT 30;
-- 7. 求每个用户的直播间评价数量
SELECT user_id, COUNT(*) AS review_count
FROM live_review
GROUP BY user_id
ORDER BY review_count DESC
LIMIT 30;
-- 8. 求每个用户的点赞数量
SELECT user_id, COUNT(*) AS like_count
FROM video_like
GROUP BY user_id
ORDER BY like_count DESC
LIMIT 30;
-- 9. 求每个用户的观众数量
SELECT user_id, COUNT(*) AS viewer_count
FROM live_viewer
GROUP BY user_id
ORDER BY viewer_count DESC
LIMIT 30;
-- 10. 求每个用户的购买数量和消费金额
SELECT buyer_id, COUNT(*) AS purchase_count, SUM(quantity * price) AS total_spent
FROM product_order
GROUP BY buyer_id
ORDER BY purchase_count DESC
LIMIT 30;
-- 11. 求每个用户的卖出数量和收入金额
SELECT seller_id, COUNT(*) AS sales_count, SUM(quantity * price) AS total_income
FROM product_order
GROUP BY seller_id
ORDER BY sales_count DESC
LIMIT 30;
-- 12. 求每个用户的直播间订单数量和收入金额
SELECT seller_id, COUNT(*) AS sales_count, SUM(quantity * price) AS total_income
FROM live_order
GROUP BY seller_id
ORDER BY sales_count DESC
LIMIT 30;
-- 13. 求每个用户的直播间观众数量
SELECT user_id, COUNT(*) AS viewer_count
FROM live_viewer
GROUP BY user_id
ORDER BY viewer_count DESC
LIMIT 30;
-- 14. 求每个用户的视频观看数量
SELECT user_id, COUNT(*) AS view_count
FROM video_view
GROUP BY user_id
ORDER BY view_count DESC
LIMIT 30;
-- 15. 求每个用户的视频观看时长
SELECT user_id, SUM(duration) AS total_duration
FROM video
GROUP BY user_id
ORDER BY total_duration DESC
LIMIT 30;
-- 16. 求每个直播间的观众数量和观看时长
SELECT room_id, COUNT(*) AS viewer_count, SUM(duration) AS total_duration
FROM live_viewer lv
JOIN video v ON lv.room_id = v.video_id
GROUP BY room_id
ORDER BY viewer_count DESC
LIMIT 30;
-- 17. 求每个商品的评价数量
SELECT product_id, COUNT(*) AS review_count
FROM product_review
GROUP BY product_id
ORDER BY review_count DESC
LIMIT 30;
-- 18. 求每个直播间的评价数量
SELECT room_id, COUNT(*) AS review_count
FROM live_review
GROUP BY room_id
ORDER BY review_count DESC
LIMIT 30;
-- 19. 求每个视频的评论数量
SELECT video_id, COUNT(*) AS comment_count
FROM user_comment
WHERE target_id IS NOT NULL
GROUP BY video_id
ORDER BY comment_count DESC
LIMIT 30;
-- 20. 求每个直播间的评论数量
SELECT room_id, COUNT(*) AS comment_count
FROM user_comment
WHERE target_id IS NULL
GROUP BY room_id
ORDER BY comment_count DESC
LIMIT 30;
-- 21. 求每个用户的视频点赞数量
SELECT user_id, COUNT(*) AS like_count
FROM video_like vl
JOIN video v ON vl.video_id = v.video_id
GROUP BY user_id
ORDER BY like_count DESC
LIMIT 30;
-- 22. 求每个用户的直播间观众数量
SELECT user_id, COUNT(*) AS viewer_count
FROM live_viewer
GROUP BY user_id
ORDER BY viewer_count DESC
LIMIT 30;
-- 23. 求每个用户
-- 23. 求每个用户的视频观看数量和观看时长
SELECT user_id, COUNT(*) AS view_count, SUM(duration) AS total_duration
FROM video_view vv
JOIN video v ON vv.video_id = v.video_id
GROUP BY user_id
ORDER BY view_count DESC
LIMIT 30;
-- 24. 求每个用户的直播间观众数量和观看时长
SELECT user_id, COUNT(*) AS viewer_count, SUM(duration) AS total_duration
FROM live_viewer lv
JOIN video v ON lv.room_id = v.video_id
GROUP BY user_id
ORDER BY viewer_count DESC
LIMIT 30;
-- 25. 求每个用户的购买数量和消费金额
SELECT buyer_id, COUNT(*) AS purchase_count, SUM(quantity * price) AS total_spent
FROM product_order
GROUP BY buyer_id
ORDER BY purchase_count DESC
LIMIT 30;
-- 26. 求每个用户的卖出数量和收入金额
SELECT seller_id, COUNT(*) AS sales_count, SUM(quantity * price) AS total_income
FROM product_order
GROUP BY seller_id
ORDER BY sales_count DESC
LIMIT 30;
-- 27. 求每个用户的直播间订单数量和收入金额
SELECT seller_id, COUNT(*) AS sales_count, SUM(quantity * price) AS total_income
FROM live_order
GROUP BY seller_id
ORDER BY sales_count DESC
LIMIT 30;
-- 28. 求每个用户的视频观看数量
SELECT user_id, COUNT(*) AS view_count
FROM video_view
GROUP BY user_id
ORDER BY view_count DESC
LIMIT 30;
-- 29. 求每个用户的视频观看时长
SELECT user_id, SUM(duration) AS total_duration
FROM video
GROUP BY user_id
ORDER BY total_duration DESC
LIMIT 30;
-- 30. 求每个用户的直播间观众数量和观看时长
SELECT user_id, COUNT(*) AS viewer_count, SUM(duration) AS total_duration
FROM live_viewer lv
JOIN video v ON lv.room_id = v.video_id
GROUP BY user_id
ORDER BY viewer_count DESC
LIMIT 30;
-- hive
-- 1. 初始化数据库
-- 删除数据库
drop database if exists ods_lc cascade;
drop database if exists dwd_lc cascade;
drop database if exists dwt_lc cascade;
drop database if exists dws_lc cascade;
drop database if exists ads_lc cascade;
drop database if exists app_lc cascade;
-- 创建数据库
create database if not exists ods_lc location '/data/ods/livestreaming_commerce';
create database if not exists dwd_lc location '/data/dwd/livestreaming_commerce';
create database if not exists dwt_lc location '/data/dwt/livestreaming_commerce';
create database if not exists dws_lc location '/data/dwt/livestreaming_commerce';
create database if not exists ads_lc location '/data/dws/livestreaming_commerce';
create database if not exists app_lc location '/data/ads/livestreaming_commerce';
-- 数据建模
-- ODS层
-- 公共数据地区表
-- 地区表
create external table ods_lc.region(
region_code string comment '地区代码',
region_name string comment '地区名称',
parent_code string comment '当前地区的上一级地区代码'
) comment '中国地区表'
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/public/region';
create external table ods_lc.region_district(
region_code string comment '地区代码',
region_name string comment '地区名称',
parent_code string comment '当前地区的上一级地区代码'
) comment '中国区县表'
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/public/region_district';
-- 抖音数据存放在 /ods/douyin 目录, 数据格式:867,3,汽车,用车知识,206,2022-01-01 23:18:38
-- 数据建模
-- 用户表(user)
CREATE external TABLE ods_lc.douyin_user (
user_id INT COMMENT '用户ID',
username string COMMENT '用户名',
gender string COMMENT '性别',
age INT COMMENT '年龄',
location string COMMENT '地点',
register_time timestamp COMMENT '注册时间'
) comment '用户表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/user';
-- 视频表(video)
CREATE external TABLE ods_lc.douyin_video (
video_id INT COMMENT '视频ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
duration INT COMMENT '时长',
upload_time timestamp COMMENT '上传时间'
) comment '视频表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/video';
-- 直播间表(live_room)
CREATE external TABLE ods_lc.douyin_live_room (
room_id INT COMMENT '直播间ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
start_time timestamp COMMENT '开始时间',
end_time timestamp COMMENT '结束时间'
) comment '直播间'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/live_room';
-- 商品表(product)
CREATE external TABLE ods_lc.douyin_product (
product_id INT COMMENT '商品ID',
seller_id INT COMMENT '卖家ID',
name string COMMENT '名称',
description string COMMENT '描述',
price DECIMAL(10, 2) COMMENT '价格',
create_time timestamp COMMENT '创建时间'
) comment '商品表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/product';
-- 商品订单表(product_order)
CREATE external TABLE ods_lc.douyin_product_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '商品订单表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/product_order';
-- 直播间订单表(live_order)
CREATE external TABLE ods_lc.douyin_live_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
room_id INT COMMENT '直播间ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '直播间订单表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/live_order';
-- 评论表(comment)
CREATE external TABLE ods_lc.douyin_comment (
comment_id INT COMMENT '评论ID',
user_id INT COMMENT '用户ID',
video_id INT COMMENT '视频ID',
live_room_id INT COMMENT '直播间ID',
content string COMMENT '内容',
comment_time timestamp COMMENT '评论时间'
) comment '评论表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/comment';
-- 退货表(return_order)
CREATE external TABLE ods_lc.douyin_return_order (
return_id INT COMMENT '退货ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
return_time timestamp COMMENT '退货时间'
) comment '退货表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/return_order';
-- 投诉表(report)
CREATE external TABLE ods_lc.douyin_report (
report_id INT COMMENT '举报ID',
user_id INT COMMENT '用户ID',
live_room_id INT COMMENT '直播间ID',
reason string COMMENT '原因',
report_time timestamp COMMENT '举报时间'
) comment '投诉表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/report';
-- 纠纷表(dispute)
CREATE external TABLE ods_lc.douyin_dispute (
dispute_id INT COMMENT '纠纷ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
dispute_time timestamp COMMENT '纠纷时间'
) comment '纠纷表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by '\n'
location '/ods/douyin/dispute';
-- 快手数据存放在 /ods/kuaishou 目录,数据格式:{"videoId":1,"userId":494,"title":"动物","description":"野生动物","duration":59,"uploadTime":"2022-01-13 05:25:12"}
-- 数据建模
-- 用户表(user)
CREATE external TABLE ods_lc.kuaishou_user (
user_id INT COMMENT '用户ID',
username string COMMENT '用户名',
gender string COMMENT '性别',
age INT COMMENT '年龄',
location string COMMENT '地点',
register_time timestamp COMMENT '注册时间'
) comment '用户表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/user';
-- WITH SERDEPROPERTIES ( "mapping.ts" = "timestamp" )
-- ALTER TABLE user SET SERDEPROPERTIES ( "ignore.malformed.json" = "true" );
-- 视频表(video)
CREATE external TABLE ods_lc.kuaishou_video (
video_id INT COMMENT '视频ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
duration INT COMMENT '时长',
upload_time timestamp COMMENT '上传时间'
) comment '视频表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/video';
-- 直播间表(live_room)
CREATE external TABLE ods_lc.kuaishou_live_room (
room_id INT COMMENT '直播间ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
start_time timestamp COMMENT '开始时间',
end_time timestamp COMMENT '结束时间'
) comment '直播间'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/live_room';
-- 商品表(product)
CREATE external TABLE ods_lc.kuaishou_product (
product_id INT COMMENT '商品ID',
seller_id INT COMMENT '卖家ID',
name string COMMENT '名称',
description string COMMENT '描述',
price DECIMAL(10, 2) COMMENT '价格',
create_time timestamp COMMENT '创建时间'
) comment '商品表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/product';
-- 商品订单表(product_order)
CREATE external TABLE ods_lc.kuaishou_product_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '商品订单表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/product_order';
-- 直播间订单表(live_order)
CREATE external TABLE ods_lc.kuaishou_live_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
room_id INT COMMENT '直播间ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '直播间订单表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/live_order';
-- 评论表(comment)
CREATE external TABLE ods_lc.kuaishou_comment (
comment_id INT COMMENT '评论ID',
user_id INT COMMENT '用户ID',
video_id INT COMMENT '视频ID',
live_room_id INT COMMENT '直播间ID',
content string COMMENT '内容',
comment_time timestamp COMMENT '评论时间'
) comment '评论表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/comment';
-- 退货表(return_order)
CREATE external TABLE ods_lc.kuaishou_return_order (
return_id INT COMMENT '退货ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
return_time timestamp COMMENT '退货时间'
) comment '退货表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/return_order';
-- 投诉表(report)
CREATE external TABLE ods_lc.kuaishou_report (
report_id INT COMMENT '举报ID',
user_id INT COMMENT '用户ID',
live_room_id INT COMMENT '直播间ID',
reason string COMMENT '原因',
report_time timestamp COMMENT '举报时间'
) comment '投诉表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/report';
-- 纠纷表(dispute)
CREATE external TABLE ods_lc.kuaishou_dispute (
dispute_id INT COMMENT '纠纷ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
dispute_time timestamp COMMENT '纠纷时间'
) comment '纠纷表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/dispute';
-- 小红书数据存放在 /ods/kuaishou 目录,数据格式:{"data:[{"videoId":1,"userId":494,"title":"动物","description":"野生动物","duration":59,"uploadTime":"2022-01-13 05:25:12"}]"}
create external table ods_lc.xiaohongshu_user(info string) comment '用户表' location '/ods/xiaohongshu/user';
create external table ods_lc.xiaohongshu_video(info string) comment '视频表' location '/ods/xiaohongshu/video';
create external table ods_lc.xiaohongshu_live_room(info string) comment '直播间' location '/ods/xiaohongshu/live_room';
create external table ods_lc.xiaohongshu_product(info string) comment '商品表' location '/ods/xiaohongshu/product';
create external table ods_lc.xiaohongshu_prouct_order(info string) comment '商品订单表' location '/ods/xiaohongshu/product_order';
create external table ods_lc.xiaohongshu_live_order(info string) comment '直播间订单表' location '/ods/xiaohongshu/live_order';
create external table ods_lc.xiaohongshu_comment(info string) comment '评论表' location '/ods/xiaohongshu/comment';
create external table ods_lc.xiaohongshu_return_order(info string) comment '退货表' location '/ods/xiaohongshu/return_order';
create external table ods_lc.xiaohongshu_report(info string) comment '投诉表' location '/ods/xiaohongshu/report';
create external table ods_lc.xiaohongshu_dispute(info string) comment '纠纷表' location '/ods/xiaohongshu/dispute';
-- DWD
-- 数据建模
-- 小红书
-- 用户表(user)
CREATE external TABLE dwd_lc.xiaohongshu_user (
user_id INT COMMENT '用户ID',
username string COMMENT '用户名',
gender string COMMENT '性别',
age INT COMMENT '年龄',
location string COMMENT '地点',
register_time timestamp COMMENT '注册时间'
) comment '用户表'
location '/dwd/xiaohongshu/user';
-- 视频表(video)
CREATE external TABLE dwd_lc.xiaohongshu_video (
video_id INT COMMENT '视频ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
duration INT COMMENT '时长',
upload_time timestamp COMMENT '上传时间'
) comment '视频表'
location '/dwd/xiaohongshu/video';
-- 直播间表(live_room)
CREATE external TABLE dwd_lc.xiaohongshu_live_room (
room_id INT COMMENT '直播间ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
start_time timestamp COMMENT '开始时间',
end_time timestamp COMMENT '结束时间'
) comment '直播间'
location '/dwd/xiaohongshu/live_room';
-- 商品表(product)
CREATE external TABLE dwd_lc.xiaohongshu_product (
product_id INT COMMENT '商品ID',
seller_id INT COMMENT '卖家ID',
name string COMMENT '名称',
description string COMMENT '描述',
price DECIMAL(10, 2) COMMENT '价格',
create_time timestamp COMMENT '创建时间'
) comment '商品表'
location '/dwd/xiaohongshu/product';
-- 商品订单表(product_order)
CREATE external TABLE dwd_lc.xiaohongshu_product_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '商品订单表'
location '/dwd/xiaohongshu/product_order';
-- 直播间订单表(live_order)
CREATE external TABLE dwd_lc.xiaohongshu_live_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
room_id INT COMMENT '直播间ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '直播间订单表'
location '/dwd/xiaohongshu/live_order';
-- 评论表(comment)
CREATE external TABLE dwd_lc.xiaohongshu_comment (
comment_id INT COMMENT '评论ID',
user_id INT COMMENT '用户ID',
video_id INT COMMENT '视频ID',
live_room_id INT COMMENT '直播间ID',
content string COMMENT '内容',
comment_time timestamp COMMENT '评论时间'
) comment '评论表'
location '/dwd/xiaohongshu/comment';
-- 退货表(return_order)
CREATE external TABLE dwd_lc.xiaohongshu_return_order (
return_id INT COMMENT '退货ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
return_time timestamp COMMENT '退货时间'
) comment '退货表'
location '/dwd/xiaohongshu/return_order';
-- 投诉表(report)
CREATE external TABLE dwd_lc.xiaohongshu_report (
report_id INT COMMENT '举报ID',
user_id INT COMMENT '用户ID',
live_room_id INT COMMENT '直播间ID',
reason string COMMENT '原因',
report_time timestamp COMMENT '举报时间'
) comment '投诉表'
location '/dwd/xiaohongshu/report';
-- 纠纷表(dispute)
CREATE external TABLE dwd_lc.xiaohongshu_dispute (
dispute_id INT COMMENT '纠纷ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
dispute_time timestamp COMMENT '纠纷时间'
) comment '纠纷表'
location '/dwd/xiaohongshu/dispute';
-- 小红书 ETL
-- 用户表(user)
insert overwrite table dwd_lc.xiaohongshu_user (
select json_tuple(people,'user_id','username','gender','age','location','register_time') as (user_id,username,gender,age,location,register_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),";") people_list from ods_lc.xiaohongshu_user) ods
lateral view explode(people_list) t1 as people) t2);
-- 视频表(video)
insert overwrite table dwd_lc.xiaohongshu_video (
select json_tuple(people,'video_id','user_id','title','description','duration','upload_time') as (video_id,user_id,title,description,duration,upload_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),";") people_list from ods_lc.xiaohongshu_video) ods
lateral view explode(people_list) t1 as people) t2);
-- 直播间表(live_room)
insert overwrite table dwd_lc.xiaohongshu_live_room (
select json_tuple(people,'room_id','user_id','title','description','start_time','end_time') as (room_id,user_id,title,description,start_time,end_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),";") people_list from ods_lc.xiaohongshu_live_room) ods
lateral view explode(people_list) t1 as people) t2);
-- 商品表(product)
insert overwrite table dwd_lc.xiaohongshu_product (
select json_tuple(people,'product_id','seller_id','name','description','price','create_time') as (product_id,seller_id,name,description,price,create_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),";") people_list from ods_lc.xiaohongshu_product) ods
lateral view explode(people_list) t1 as people) t2);
-- 商品订单表(product_order)
insert overwrite table dwd_lc.xiaohongshu_product_order (
select json_tuple(people,'order_id','buyer_id','seller_id','product_id','quantity','price','order_time') as (order_id,buyer_id,seller_id,product_id,quantity,price,order_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),";") people_list from ods_lc.xiaohongshu_prouct_order) ods
lateral view explode(people_list) t1 as people) t2);
-- 直播间订单表(live_order)
insert overwrite table dwd_lc.xiaohongshu_live_order (
select json_tuple(people,'order_id','buyer_id','seller_id','seller_id','room_id','product_id','quantity','price','order_time') as (order_id,buyer_id,seller_id,room_id,product_id,quantity,price,order_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),";") people_list from ods_lc.xiaohongshu_live_order) ods
lateral view explode(people_list) t1 as people) t2);
-- 评论表(comment)
insert overwrite table dwd_lc.xiaohongshu_comment (
select json_tuple(people,'comment_id','user_id','video_id','live_room_id','content','upload_time') as (comment_id,user_id,video_id,live_room_id,content,comment_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),";") people_list from ods_lc.xiaohongshu_comment) ods
lateral view explode(people_list) t1 as people) t2);
-- 退货表(return_order)
insert overwrite table dwd_lc.xiaohongshu_return_order (
select json_tuple(people,'return_id','buyer_id','seller_id','product_id','order_id','reason','return_time') as (return_id,buyer_id,seller_id,product_id,order_id,reason,return_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),";") people_list from ods_lc.xiaohongshu_return_order) ods
lateral view explode(people_list) t1 as people) t2);
-- 投诉表(report)
insert overwrite table dwd_lc.xiaohongshu_report (
select json_tuple(people,'report_id','user_id','live_room_id','reason','report_time') as (report_id,user_id,live_room_id,reason,report_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),";") people_list from ods_lc.xiaohongshu_report) ods
lateral view explode(people_list) t1 as people) t2);
-- 纠纷表(dispute)
insert overwrite table dwd_lc.xiaohongshu_dispute (
select json_tuple(people,'dispute_id','buyer_id','seller_id','order_id','reason','dispute_time') as (dispute_id,buyer_id,seller_id,order_id,reason,dispute_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),";") people_list from ods_lc.xiaohongshu_dispute) ods
lateral view explode(people_list) t1 as people) t2);
-- 快手
select name, description, price, total_quantity from product inner join
(select product_id,sum(quantity) total_quantity from product_order
group by product_id order by total_quantity) t
on t.product_id = product.product_id;
-- 商品销售数量统计 原谅我一不小心做成了内部表 该表的数据从ods_lc.product_order中抽取
create table dwd_lc.kuaishou_product_order_total_quantity
(product_id int,total_quantity int)
partitioned by (create_date date);
-- 从ods层导入输入到 dwd_lc.product_order_total_quantity
insert into table dwd_lc.kuaishou_product_order_total_quantity
select product_id,sum(quantity) total_quantity,current_date() create_date from ods_lc.kuaishou_product_order
group by product_id order by total_quantity;
-- DWS
-- 数据建模
-- 快手
-- 商品销售数量统计 原谅我一不小心做成了内部表 该表的数据从ods_lc.product_order和dwd_lc.product_order_total_quantity中抽取
create table dws_lc.kuaishou_product_order_total_quantity(
product_id int,
name string,
description string,
price decimal(10,2),
total_quantity int
)
partitioned by (create_date date);
-- 从ods层和dwd层导入输入到 ws_lc.kuaishou_product_order_total_quantity
insert into table dwd_lc.kuaishou_product_order_total_quantity
select product_id,name,description,price from ods_lc.kuaishou_product_order ods
inner join
(select product_id,sum(quantity) total_quantity,current_date() create_date from ods_lc.kuaishou_product_order
group by product_id order by total_quantity) dwd
on dwd.product_id = ods.product_id;
# use livestreaming_commerce;
# alter database livestreaming_commerce character set utf8mb3 collate utf8mb3_bin;
# alter table comment convert to character set utf8mb3 collate utf8mb3_bin;
# alter table dispute convert to character set utf8mb3 collate utf8mb3_bin;
# alter table live_order convert to character set utf8mb3 collate utf8mb3_bin;
# alter table live_room convert to character set utf8mb3 collate utf8mb3_bin;
# alter table product convert to character set utf8mb3 collate utf8mb3_bin;
# alter table product_order convert to character set utf8mb3 collate utf8mb3_bin;
# alter table region convert to character set utf8mb3 collate utf8mb3_bin;
# alter table region_district convert to character set utf8mb3 collate utf8mb3_bin;
# alter table report convert to character set utf8mb3 collate utf8mb3_bin;
# alter table return_order convert to character set utf8mb3 collate utf8mb3_unicode_ci;
# alter table user convert to character set utf8mb3 collate utf8mb3_bin;
# alter table video convert to character set utf8mb3 collate utf8mb3_bin;
create database dwd_lc;
create table dwd_lc.product_order_total_quantity (product_id int,total_quantity int,create_dated date);
insert into dwd_lc.product_order_total_quantity
select product_id,sum(quantity) total_quantity,curdate() from product_order
group by product_id;
create database dws_lc;
create table dwd_lc.product_order_total_quantity (
product_id int,total_quantity int,create_dated date
);
create database dwt_lc;
create table dwt_lc.product_order_total_quantity (
product_id int,
name varchar(50),
description varchar(50),
price decimal(10,2),
total_quantity int,
create_dated date
);
insert into dwt_lc.product_order_total_quantity
select ods.product_id, name, description, price, total_quantity, create_dated create_date from livestreaming_commerce.product ods
inner join
(select product_id, total_quantity, create_dated from dwd_lc.product_order_total_quantity) dwd
on dwd.product_id = ods.product_id;
create database aps_lc;
create table aps_lc.product_order_total_quantity (
product_id int,
name varchar(50),
description varchar(50),
price decimal(10,2),
total_quantity int,
create_dated date
);
insert into aps_lc.product_order_total_quantity
select product_id, name, description, price, total_quantity, create_dated create_date
from dwt_lc.product_order_total_quantity order by total_quantity desc limit 10;
openEuler Linux 部署 FineBi
环境 | 版本 |
---|---|
openEuler Linux | 22.03 |
MySQL | 8.0.35 |
JDK | 1.8 |
Tomcat | 9.0.83 |
FineBi | 6.0 |
yum -y update
reboot
yum -y install vim tar net-tools
sudo yum -y install https://repo.mysql.com//mysql80-community-release-el8-9.noarch.rpm
sudo yum repolist enabled | grep mysql
sudo yum -y install mysql-community-server
使用以下命令启动 MySQL 服务器:
sudo systemctl start mysqld
您可以使用以下命令检查 MySQL 服务器的状态:
sudo systemctl status mysqld
使用以下命令开启自启动 MySQL 服务器:
sudo systemctl enable mysqld
在服务器初始启动时,假设服务器的数据目录为空,会发生以下情况:
服务器已初始化。
SSL 证书和密钥文件在数据目录中生成。
validate_password
已安装并启用。
创建了一个超级用户帐户'root'@'localhost
。超级用户的密码已设置并存储在错误日志文件中。要显示它,请使用以下命令:
sudo grep 'temporary password' /var/log/mysqld.log
通过使用生成的临时密码登录并为超级用户帐户设置自定义密码,尽快更改 root 密码:
mysql -uroot -p
ALTER USER 'root'@'localhost' IDENTIFIED BY 'Lihaozhe!!@@1122';
FLUSH PRIVILEGES;
不检查指定为散列值的密码,因为原始密码值不可用于检查:
ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'Lihaozhe!!@@1122';
FLUSH PRIVILEGES;
update mysql.user set host = '%',plugin='mysql_native_password' where user='root';
FLUSH PRIVILEGES;
exit;
sudo systemctl restart mysqld
–zone #作用域
–add-port=3306/tcp #添加端口,格式为:端口/通讯协议
–permanent #永久生效,没有此参数重启后失效
firewall-cmd --zone=public --add-port=3306/tcp --permanent
firewall-cmd --reload
# 使用新密码连接
mysql -h 你自己的IP地址 -uroot -p
字符集应当为 utf8 ,排序规则为 utf8_bin。
create database finebi character set utf8 collate utf8_bin;
jdk 官网下载页面:https://www.oracle.com/java/technologies/downloads/
tomcat 官网下载页面:https://tomcat.apache.org/download-90.cgi
tar -zxvf jdk-8u391-linux-x64.tar.gz -C /usr/local/
tar -zxvf apache-tomcat-9.0.83.tar.gz -C /usr/local/
mv /usr/local/jdk1.8.0_391/ /usr/local/jdk-8
mv /usr/local/apache-tomcat-9.0.83 /usr/local/tomcat-9
其中tomcat.pid文件本身不存在,设置之后,启动Tomcat会自动生成
vim /usr/local/tomcat-9/bin/catalina.sh
定位文件第152行找到 PRGDIR=
dirname "$PRG"
并在该行下追加以下内容
CATALINA_PID="/usr/local/tomcat-9/bin/tomcat.pid"
vim /etc/profile.d/my_env.sh
export JAVA_HOME=/usr/local/jdk-8
export CATALINA_BASE=/usr/local/tomcat-9
export CATALINA_HOME=/usr/local/tomcat-9
export PATH=$PATH:$JAVA_HOME/bin:$CATALINA_BASE/bin:$CATALINA_HOME/bin
source /etc/profile.d/my_env.sh
cd /usr/lib/systemd/system
vim tomcat-9.service
[Unit]
Description=finereport-server
After=network.target
[Service]
Type=forking
Environment="JAVA_HOME=/usr/local/jdk-8"
PIDFile=/usr/local/tomcat-9/bin/tomcat.pid
ExecStart=/usr/local/tomcat-9/bin/startup.sh
ExecReload=/usr/local/tomcat-9/bin/shutdown.sh
ExecReload=/bin/kill -s HUP $MAINPID
PrivateTmp=true
[Install]
WantedBy=multi-user.target
systemctl daemon-reload
systemctl start tomcat-9.service
systemctl enable tomcat-9.service
firewall-cmd --zone=public --add-port=8080/tcp --permanent
firewall-cmd --reload
reboot
http://服务器IP地址:8080
https://www.finebi.com/product/download
tomcat-linux-x64.tar.gz
/usr/local
tar -zxvf tomcat-linux-x64.tar.gz -C /usr/local/
解压后的目录名为 tomcat-linux-64
### 上传 MySQL 数据库驱动
```bash
cd /usr/local/tomcat-linux-x64/webapps/webroot/WEB-INF/lib/
ls mysql*
删除MySQL5的驱动后上传MySQL8的驱动
mysql-connector-j-8.0.33.jar mysql驱动
protobuf-java-3.5.1.jar mysql驱动依赖
rm -f mysql-connector-java-5.1.49-bin.jar /usr/local/tomcat-9/webapps/finebi
cp -r /usr/local/tomcat-linux-x64/webapps/webroot /usr/local/tomcat-9/webapps/finebi
本案例使用
MySQL8
作为finebi
外接数据库
- 关闭 BI 工程
- 删除工程webapps\webroot\WEB-INF\lib目录下的
mysql-connector-java-5.1.49-bin.jar
- 下载对应版本的
MySQL8
驱动,并上传至工程webapps\webroot\WEB-INF\lib目录下。
- 上传mysql8驱动
mysql-connector-j-8.0.33.jar
- 上传mysql8驱动依赖
protobuf-java-3.5.1.jar
- 重启 BI 工程。
字符集:
- 平台外接 MySQL 数据库时,需要设置字符集排序规则(大小写敏感)。
- 推荐使用 utf8 – UTF-8 Unicode 编码。
- 不支持 utf8mb4 编码
排序规则:
MySQL 中存在多种格式的 utf8 编码的排序规则,其中最常见的两种如下表所示:
- utf8_bin(推荐)
定义:将字符串中的每一个字符用二进制数据存储,区分大小写
示例:Alex 和 alex 被认为是两个不同的值- utf8_general_ci
定义:ci 为 case insensitive 的缩写,即大小写不敏感,为 utf8 默认编码
示例:Alex 和 alex 被认为是相同的值
登录
mysql
创建 finebi 数据库
create database finebi default character set utf8 collate utf8_bin;
systemctl restart tomcat-9
http://服务器IP地址:8080/finebi/decision
账号密码设置成功
本案例使用外接数据库
外接数据库配置
修改外接数据库参数
- 数据库类型
mysql
- 驱动
MySQL5
com.mysql.jdbc.DriverMySQL8
com.mysql.cj.jdbc.Driver
- 主机
MySQL服务器IP地址
- 端口号
3306
- 数据库名称
finebi
- 用户名
root
- 密码 ·MySQL数据库密码·
修改外接数据库参数后点击启用新数据库
finebi
连锁超市销售数据分析
步骤:
鼠标悬停在文件夹上,右侧出现 +
鼠标悬停在文件夹上,右侧出现 +,点击 + 选择 Excel数据集
保存数据前可以调整数据集,例如:数据类型、列位置
更新数据
两种方式:
鼠标选中数据标题,拖拽即可
点击字段设置
图形和表格的载体,在组件内可以制作图形和表格并对对其进行编辑
维度:分析数据的角度,例如:时间维度、空间维度、类别维度
指标:维度上量化的值
需求分析:
维度:商品名称
指标:销售额
使用分组表
原始数据并没有毛利额,毛利额需要我们自己计算产生
毛利额 = 销售额 - 成本额
毛利额 = 销售额 - 成本额
注意:数据列只能从鞥左侧拖拽不能手写
维度:商品类别
指标:销售额
单击标签中的销售额,在标签中销售额的下拉列表中选择数值格式
在数值格式中的数量单位下拉列表中选择万
销售额占比 = 类型销售额数量 / 销售总额
- 快速计算
- 占比
同比增长率=(本期数-同期数)÷同期数×100%
环比增长率=(本期数-上期数)/上期数×100%。
维度:时间
指标:毛利额
自定义图表
由于纵轴指标聚合,毛利额和毛利率数值相差巨大故而毛利率显示并不明显
解决方法:
将纵轴指标聚合修改为指标并列
设置值轴
在图形属性中找到毛利额增长率,点开下拉列表
在下拉列表中选择线
将数据中的毛利额拖入到图形属性中毛利额的标签属性中
注意:纵轴使用的是指标聚合
将数据中的毛利额拖入到图形属性中毛利额环比增长率的标签属性中
图形属性标签
- 确定表达的主题
- 确定表达的关系
- 选择图表形式
观察数据的角度
毛利率 = (销售额 - 成本额) / 销售额 * 100%
毛利额 = 销售额 - 成本额
毛利率 = 毛利额 / 销售额 * 100%
搜索右侧三个点点开
毛利率 = (销售额 - 成本额) / 销售额 * 100%
毛利额 = 销售额 - 成本额
毛利率 = 毛利额 / 销售额 * 100%
由于毛利额已经计算过所此处直接引用
假设使用毛利率升序
通过指标排序发现湖南省毛利率最低,于是逐级钻取
从左侧列表拖拽组件到仪表板
本案例使用MySQL数据库
FineBI内置了 MySQL5 版本的驱动,如使用 MySQL8 版本的数据库,需要手动更换驱动。
关闭 BI 工程
下载对应版本的 MySQL8.0 驱动,并上传至工程WEB-INF/lib目录下。
删除工程WEB-INF/lib目录下的mysql-connector-java-5.1.49-bin.jar。
重启 BI 工程。
数据连接
数据连接管理
新建数据连接
在常用中点击MySQL
MySQL数据连接
注意:如果连接连接MySQL8,将驱动修改为
com.mysql.cj.jdbc.Driver
若出现中文乱码或日期错乱的情况,可在「数据连接URL」后加后缀,格式为:jdbc:mysql://hostname:port/database?generateSimpleParameterMetadata=true&useUnicode=true&characterEncoding=utf8&useSSL=false&serverTimezone=Asia/Shanghai
其中
serverTimezone=Asia/Shanghai – 设置以"上海时区"为准
characterEncoding=utf8 – 编码转化
测试数据连接
保存数据连接
select name, description, price, total_quantity from product inner join
(select product_id,sum(quantity) total_quantity from product_order
group by product_id order by total_quantity desc limit 10) t
on t.product_id = product.product_id