目录
一、项目需求
二、数据介绍
三、创建表结构
四、数据清洗
五、数据加载
六、业务数据分析
七、原始数据
1.统计视频观看数 Top10
2.统计视频类别热度Top10
3.统计出视频观看数最高的20个视频的所属类别以及类别包含这Top20视频的个数
4.统计视频观看数Top50所关联视频的所属类别的热度排名
5.统计每个类别中的视频热度Top10,以 Music为例
6.统计每个类别中视频流量 Top10 ,以 Music为例
7.统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频
8.统计每个类别视频观看数Top10(分组取topN)
1.视频数据表:
2.用户表:
1.视频表:
-
create
table youtube_ori(
-
videoId
string,
-
uploader
string,
-
age
int,
-
category
array<
string>,
-
length
int,
-
views
int,
-
rate
float,
-
ratings
int,
-
comments
int,
-
relatedId
array<
string>)
-
row
format
delimited
-
fields
terminated
by
"\t"
-
collection items
terminated
by
"&" ;
-
create
table youtube_orc(
-
videoId
string,
-
uploader
string,
-
age
int,
-
category
array<
string>,
-
length
int,
-
views
int,
-
rate
float,
-
ratings
int,
-
comments
int,
-
relatedId
array<
string>)
-
clustered
by (uploader)
into
8 buckets
-
row
format
delimited
-
fields
terminated
by
"\t"
-
collection items
terminated
by
"&"
-
stored
as orc;
2.用户表:
-
create
table youtube_user_ori(
-
uploader
string,
-
videos
int,
-
friends
int)
-
clustered
by (uploader)
into
24 buckets
-
row
format
delimited
fields
terminated
by
"\t";
-
create
table youtube_user_orc(
-
uploader
string,
-
videos
int,
-
friends
int)
-
clustered
by (uploader)
into
24 buckets
-
row
format
delimited
fields
terminated
by
"\t"
-
stored
as orc;
通过观察原始数据形式,可以发现,视频可以有多个所属分类,每个所属分类用&符号分割, 且分割的两边有空格字符,同时相关视频也是可以有多个元素,多个相关视频又用“\t”进 行分割。为了分析数据时方便对存在多个子元素的数据进行操作,我们首先进行数据重组清 洗操作。即:将所有的类别用“&”分割,同时去掉两边空格,多个相关视频 id 也使用“&” 进行分割。
1.ETLUtil
-
package com.company.sparksql;
-
-
public
class ETLUtil {
-
public static String oriString2ETLString(String ori){
-
StringBuilder etlString =
new StringBuilder();
-
String[] splits = ori.split(
"\t");
-
if(splits.length <
9)
return
null;
-
splits[
3] = splits[
3].replace(
" ",
"");
-
for(
int i =
0; i < splits.length; i++){
-
if(i <
9){
-
if(i == splits.length -
1){
-
etlString.append(splits[i]);
-
}
else{
-
etlString.append(splits[i] +
"\t");
-
}
-
}
else{
-
if(i == splits.length -
1){
-
etlString.append(splits[i]);
-
}
else{
-
etlString.append(splits[i] +
"&");
-
}
-
}
-
}
-
-
return etlString.toString();
-
}
-
}
2.DataCleaner
-
package com.company.sparksql
-
-
import org.apache.log4j.{Level, Logger}
-
import org.apache.spark.sql.SparkSession
-
-
object DataCleaner {
-
def main(args: Array[String]): Unit = {
-
val spark = SparkSession
-
.builder()
-
.master(
"local")
-
.appName(DataCleaner.getClass.getSimpleName)
-
.getOrCreate()
-
Logger.getLogger(
"org.apache.spark").setLevel(Level.OFF)
-
Logger.getLogger(
"org.apache.hadoop").setLevel(Level.OFF)
-
val lineDS = spark.read.textFile(
"e:/0.txt")
-
import spark.implicits._
-
val splitedDS = lineDS.map(ETLUtil.oriString2ETLString(_))
-
splitedDS.write.format(
"text").save(
"e:/movie")
-
-
}
-
-
}
1.视频表:
加载清洗之后的数据到原始视频表
load data local inpath "/opt/datas/cleaned.txt" into table youtube_ori;
加载数据到视频的ORC表
insert overwrite table youtube_orc select * from youtube_ori;
2用户表:
加载清洗之后的数据到原始用户表
load data local inpath "/opt/datas/user.txt" into table user_ori;
加载数据到用户的ORC表
insert overwrite table user_orc select * from user_ori;
1.统计视频观看数 Top10
-
select
-
videoId,
-
uploader,
-
age ,
-
category ,
-
length ,
-
views ,
-
rate ,
-
ratings ,
-
comments
-
from youtube_orc
-
order
by views
desc
-
limit
10;
2.统计视频类别热度Top10
-
第一种方式:
select
-
category_name,
-
count(videoId)
as video_num
-
from youtube_orc lateral
view explode(
category) youtube_view
as category_name
-
group
by category_name
-
order
by video_num
desc
-
limit
10;
-
第二种方式:
select category_name
as
category,
-
count(t1.videoId)
as hot
-
from (
select
-
videoId,
-
category_name
-
from youtube_orc lateral
view explode(
category) t_catetory
as category_name) t1
-
group
by t1.category_name
-
order
by hot
desc
-
limit
10;
3.统计出视频观看数最高的20个视频的所属类别以及类别包含这Top20视频的个数
-
select category_name,
-
count(videoId)
as videonums
-
from (
-
select
-
videoId,
-
category_name
-
from
-
(
select
-
category,
-
videoId,
-
views
-
from
-
youtube_orc
-
order
by views
desc
-
limit
20) top20view lateral
view explode(
category) t1_view
as category_name )t2_alias
group
by category_name
order
by videonums
desc;
4.统计视频观看数Top50所关联视频的所属类别的热度排名
-
select
-
category_name,
-
count(relatedvideoId)
as hot
-
from
-
(
select
-
relatedvideoId,
-
category
-
from
-
(
select
-
distinct relatedvideoId
-
from (
select
-
views,
-
relatedId
-
from
-
youtube_orc
-
order
by views
desc
-
limit
50 )t1 lateral
view explode(relatedId) explode_viedeo
as relatedvideoId)t2
join youtube_orc
on youtube_orc.videoId = t2.relatedvideoId)t3 lateral
view explode(
category) explode_category
as category_name
-
group
by category_name
-
order
by hot
desc;
5.统计每个类别中的视频热度Top10,以 Music为例
-
select
-
videoId,
-
views
-
from youtube_orc lateral
view explode(
category) t1_view
as category_name
-
where category_name =
"Music"
-
order
by views
desc
-
limit
10 ;
6.统计每个类别中视频流量 Top10 ,以 Music为例
-
select
-
videoId ,
-
ratings
-
from youtube_orc lateral
view explode(
category) t1_view
as category_name
-
where category_name =
"Music"
-
order
by ratings
desc
-
limit
10 ;
7.统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频
-
select t1.uploader,youtube_orc.videoId,youtube_orc.views
-
from
-
(
select
-
uploader,videos
-
from youtube_user_orc
-
order
by videos
desc
-
limit
10) t1
inner
join youtube_orc
on t1.uploader = youtube_orc.uploader
order
by views
desc
limit
20 ;
8.统计每个类别视频观看数Top10(分组取topN)
-
select
-
t2_alias.category_name,
-
t2_alias.videoId,t2_alias.views
-
from
-
(
-
select
-
category_name,
-
videoId,views ,
-
row_number()
over(
partition
by category_name
order
by views
desc)
rank
-
from
-
youtube_orc lateral
view explode(
category) t1_view
as category_name ) t2_alias
where
rank <=
10;
1.视频表
2.用户表
3.清洗之后的视频表
链接:https://pan.baidu.com/s/1OkQ2E5_KCngVRbTkalodrQ
提取码:yzk7