Common metrics for a video site: various Top-N statistics.
Two data files are provided: video.txt and user.txt.
Fields in video.txt (tab-separated):
- video id: the video's unique id, an 11-character string
- uploader: username of the user who uploaded the video (string)
- age: integer number of days between the upload date and February 15, 2007 (a YouTube-specific convention)
- category: the categories assigned when the video was uploaded (array); note that a category such as "People & Blogs" ends up as the array ["People", "Blogs"]
- length: video length, an integer
- views: number of times the video has been viewed
- rate: video rating, out of 5
- ratings: number of ratings (traffic), an integer
- comments: number of comments on the video, an integer
- related ids: ids of related videos, at most 20 (array)
Fields in user.txt:
- uploader: uploader username (string)
- videos: number of videos uploaded (int)
- friends: number of friends (int)
user.txt can be used as-is without any cleaning.
video.txt, however, needs to be cleaned first.
Core code:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ETLMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private Text mapKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String str = value.toString();
        String[] infos = str.split("\t");
        // drop malformed records: a valid line has at least 9 fields
        if (infos.length < 9) {
            return;
        }
        // remove spaces inside the category field, e.g. "People & Blogs" -> "People&Blogs"
        infos[3] = infos[3].replaceAll(" ", "");
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < infos.length; i++) {
            result.append(infos[i]);
            if (i != infos.length - 1) {
                // the first 9 fields are comma-separated; the related ids
                // that follow are joined with '&'
                result.append(i < 9 ? "," : "&");
            }
        }
        mapKey.set(result.toString());
        context.write(mapKey, NullWritable.get());
    }
}
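The same cleaning rules can be exercised without a Hadoop cluster by extracting the string transform into a plain method (a minimal sketch; the class name and the sample line below are made up for illustration):

```java
public class EtlLineCleaner {

    // Applies the same rules as ETLMapper.map: drop lines with fewer than
    // 9 fields, strip spaces from the category field, join the first 9
    // fields with ',' and the remaining related ids with '&'.
    public static String clean(String line) {
        String[] infos = line.split("\t");
        if (infos.length < 9) {
            return null; // malformed record, discarded just like in the mapper
        }
        infos[3] = infos[3].replaceAll(" ", "");
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < infos.length; i++) {
            result.append(infos[i]);
            if (i != infos.length - 1) {
                result.append(i < 9 ? "," : "&");
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String line = "LKh7zAJ4nwo\thoward\t1\tPeople & Blogs\t100\t5000\t4.5\t30\t10\taaa\tbbb";
        // -> LKh7zAJ4nwo,howard,1,People&Blogs,100,5000,4.5,30,10,aaa&bbb
        System.out.println(clean(line));
    }
}
```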
Put the cleaned data on HDFS. In my setup the directories are /user/howard/etl/video and /user/howard/etl/user.
create external table user(
  uploader string,
  videos int,
  friends int
)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/howard/etl/user'; -- HDFS directory for user.txt
create external table video(
  videoid string,
  uploader string,
  age int,
  category array<string>,
  length int,
  views int,
  rate float,
  ratings int,
  comments int,
  relatedId array<string>
)
row format delimited fields terminated by ','
collection items terminated by '&'
stored as textfile
location '/user/howard/etl/video'; -- HDFS directory for video.txt
PS: these are external tables; when an external table is dropped, the underlying data and directory are not deleted.
Top 10 most-viewed videos:
select * from video order by views desc limit 10;
PS: nothing tricky here.
Top 10 hottest categories:
select category_name, count(*) as hot from (select videoid, category_name from video lateral view explode(category) t_category as category_name) t group by category_name order by hot desc limit 10;
PS: first expand the categories with the explode function, then group by category and count.
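Outside Hive, the explode-then-group pattern is just flatMap plus a grouped count. A minimal Java sketch (class name and sample rows are made up; each inner list stands for one video's category array):

```java
import java.util.*;
import java.util.stream.*;

public class CategoryHot {

    // Mirrors: lateral view explode(category) ... group by category_name
    //          order by hot desc limit n
    public static List<Map.Entry<String, Long>> topCategories(List<List<String>> categories, int n) {
        return categories.stream()
                .flatMap(List::stream)                                         // explode the arrays
                .collect(Collectors.groupingBy(c -> c, Collectors.counting())) // group by + count(*)
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()) // order by hot desc
                .limit(n)                                                      // limit n
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Music", "Entertainment"),
                Arrays.asList("Music"),
                Arrays.asList("Comedy"));
        System.out.println(topCategories(rows, 2)); // "Music" ranks first with count 2
    }
}
```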
Category distribution of the 20 most-viewed videos:
select t2.category_name, count(*) as hot from (select category_name from (select * from video order by views desc limit 20) t1 lateral view explode(category) t_category as category_name) t2 group by t2.category_name order by hot desc;
PS: again mainly the explode function, with several nested subqueries.
Categories of the videos related to the 50 most-viewed videos:
select distinct(t2.related_id), t3.category from (select explode(relatedid) as related_id from (select * from video order by views desc limit 50) t1) t2 inner join video t3 on t2.related_id = t3.videoid;
PS: use an inner join to pull in the related video rows, and distinct to de-duplicate.
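The distinct-plus-inner-join step has a simple in-memory analogue: de-duplicate the related ids, then keep only those that have a matching video row. A sketch (class name and data are made up; the map stands in for the video table's id-to-category lookup):

```java
import java.util.*;
import java.util.stream.*;

public class RelatedCategories {

    // Mirrors: select distinct(related_id), category ... inner join video
    // Unmatched ids are dropped, which is exactly inner-join semantics.
    public static Map<String, String> lookup(List<String> relatedIds, Map<String, String> videoCategory) {
        return relatedIds.stream()
                .distinct()                             // distinct(related_id)
                .filter(videoCategory::containsKey)     // inner join keeps matches only
                .collect(Collectors.toMap(id -> id, videoCategory::get));
    }

    public static void main(String[] args) {
        Map<String, String> videoCategory = new HashMap<>();
        videoCategory.put("v1", "Music");
        videoCategory.put("v2", "Comedy");
        // "v9" has no video row, so it is dropped
        System.out.println(lookup(Arrays.asList("v1", "v1", "v2", "v9"), videoCategory));
    }
}
```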
Top 10 most-commented videos in a given category (Film here):
select *, category_name from video lateral view explode(category) t_category as category_name where category_name = 'Film' order by comments desc limit 10;
PS: in a real application the category would be passed in as a parameter.
Top 10 highest-rated videos in a given category:
select *, category_name from video lateral view explode(category) t_category as category_name where category_name = 'Film' order by ratings desc limit 10;
PS: same idea as above.
Videos uploaded by the top 10 users by upload count:
select * from (select * from user order by videos desc limit 10) t join video t2 on t.uploader = t2.uploader;
PS: a join to combine the two tables.
Top 10 most-viewed videos in every category:
select * from (select * from (select *, row_number() over(partition by category_name order by views desc) rank from (select * from video lateral view explode(category) t_category as category_name) t) t1 where rank <= 10;
PS: unlike the earlier queries, this one has to show every category, sort within each partition, and then keep the top 10 rows per category. The key tool is the row_number() over window function.
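The partition-then-rank pattern can be simulated in plain Java: group the rows by category, sort each group by views descending, and keep the first n per group. A sketch (class name and data are made up; each row is {category, videoId, views}):

```java
import java.util.*;
import java.util.stream.*;

public class TopNPerCategory {

    // Mirrors: row_number() over (partition by category_name order by views desc)
    //          ... where rank <= n
    public static Map<String, List<String>> topN(List<String[]> rows, int n) {
        return rows.stream()
                .collect(Collectors.groupingBy(r -> r[0]))   // partition by category
                .entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey, e -> e.getValue().stream()
                        .sorted(Comparator.comparingLong(
                                (String[] r) -> Long.parseLong(r[2])).reversed()) // order by views desc
                        .limit(n)                            // keep rank <= n
                        .map(r -> r[1])                      // report the video ids
                        .collect(Collectors.toList())));
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"Music", "v1", "300"},
                new String[]{"Music", "v2", "500"},
                new String[]{"Music", "v3", "100"},
                new String[]{"Comedy", "v4", "50"});
        // Music -> [v2, v1], Comedy -> [v4]
        System.out.println(topN(rows, 2));
    }
}
```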
In Hive the explode function is extremely useful and well worth mastering.
Nested queries are really not hard; just work through them step by step.
lateral view, join, row_number() over and the like also deserve solid practice.
If you have any questions or ideas, feel free to contact me any time so we can learn from each other.
My email: [email protected]