Hive Big Data Hands-On Project

Table of Contents

I. Project Requirements

II. Data Overview

III. Creating the Tables

IV. Data Cleaning

V. Data Loading

VI. Business Data Analysis

VII. Raw Data


I. Project Requirements

1. Top 10 videos by view count

2. Top 10 video categories by popularity

3. For the 20 most-viewed videos, their categories and how many of these Top 20 videos each category contains

4. Popularity ranking of the categories of the videos related to the Top 50 most-viewed videos

5. Top 10 most popular videos within each category, using Music as an example

6. Top 10 videos by traffic within each category, using Music as an example

7. Top 10 users by number of uploaded videos, and the 20 most-viewed videos among their uploads

8. Top 10 videos by view count within each category (group-wise top-N)

II. Data Overview

1. Video data table: one record per video, containing videoId, uploader, age, category (one or more categories, separated by " & " in the raw file), length, views, rate, ratings, comments, and the tab-separated IDs of related videos (relatedId). The exact column types are given by the DDL in Section III.

2. User table: one record per uploader, containing uploader, videos (the number of videos uploaded), and friends.

III. Creating the Tables

1. Video tables:

create table youtube_ori(
  videoId string,
  uploader string,
  age int,
  category array<string>,
  length int,
  views int,
  rate float,
  ratings int,
  comments int,
  relatedId array<string>)
row format delimited
fields terminated by "\t"
collection items terminated by "&";

create table youtube_orc(
  videoId string,
  uploader string,
  age int,
  category array<string>,
  length int,
  views int,
  rate float,
  ratings int,
  comments int,
  relatedId array<string>)
clustered by (uploader) into 8 buckets
row format delimited
fields terminated by "\t"
collection items terminated by "&"
stored as orc;

2. User tables:

create table youtube_user_ori(
  uploader string,
  videos int,
  friends int)
clustered by (uploader) into 24 buckets
row format delimited fields terminated by "\t";

create table youtube_user_orc(
  uploader string,
  videos int,
  friends int)
clustered by (uploader) into 24 buckets
row format delimited fields terminated by "\t"
stored as orc;

IV. Data Cleaning

Looking at the raw data, a video can belong to several categories, separated by "&" with a space on either side, and it can also have several related videos, whose IDs are separated by "\t" at the end of the line. To make rows with these multi-valued fields easier to work with during analysis, we first restructure and clean the data: the categories stay separated by "&" but the surrounding spaces are removed, and the related-video IDs are re-joined with "&" as well. For example, a category field like "A & B" becomes "A&B", and the trailing tab-separated related IDs are merged into a single "&"-separated column, so every cleaned row has a fixed number of tab-separated fields.

1. ETLUtil

package com.company.sparksql;

public class ETLUtil {
    public static String oriString2ETLString(String ori) {
        StringBuilder etlString = new StringBuilder();
        String[] splits = ori.split("\t");
        // Discard malformed lines that do not contain the 9 fixed fields.
        if (splits.length < 9) return null;
        // splits[3] is the category field: remove spaces so categories are separated by a bare "&".
        splits[3] = splits[3].replace(" ", "");
        for (int i = 0; i < splits.length; i++) {
            if (i < 9) {
                // The 9 fixed fields stay tab-separated.
                if (i == splits.length - 1) {
                    etlString.append(splits[i]);
                } else {
                    etlString.append(splits[i] + "\t");
                }
            } else {
                // The remaining fields are related-video IDs: re-join them with "&".
                if (i == splits.length - 1) {
                    etlString.append(splits[i]);
                } else {
                    etlString.append(splits[i] + "&");
                }
            }
        }
        return etlString.toString();
    }
}

2. DataCleaner

package com.company.sparksql

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object DataCleaner {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local")
      .appName(DataCleaner.getClass.getSimpleName)
      .getOrCreate()
    Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
    Logger.getLogger("org.apache.hadoop").setLevel(Level.OFF)
    // Read the raw file, clean each line, and drop malformed lines
    // (ETLUtil returns null for lines with fewer than 9 fields).
    val lineDS = spark.read.textFile("e:/0.txt")
    import spark.implicits._
    val splitedDS = lineDS.map(ETLUtil.oriString2ETLString(_)).filter(_ != null)
    splitedDS.write.format("text").save("e:/movie")
  }
}

V. Data Loading

1. Video tables:

Load the cleaned data into the raw video table:

load data local inpath "/opt/datas/cleaned.txt" into table youtube_ori;
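
After the load, a quick sanity check (a minimal sketch; the chosen columns and limit are arbitrary) can confirm that the "&"-separated array columns were parsed as expected:

select videoId, category, relatedId
from youtube_ori
limit 3;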

Load the data into the video ORC table:

insert overwrite table youtube_orc select * from youtube_ori;
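
Both ORC tables are declared with a clustered by clause. On Hive releases before 2.0, bucketing is only honored when it is explicitly enforced, so (assuming such a release) the following setting may be needed before running the two insert ... select statements; from Hive 2.0 onward bucketing is always enforced and it is no longer required:

set hive.enforce.bucketing = true;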

2. User tables:

Load the user data into the raw user table:

load data local inpath "/opt/datas/user.txt" into table youtube_user_ori;

Load the data into the user ORC table:

insert overwrite table youtube_user_orc select * from youtube_user_ori;

VI. Business Data Analysis

1. Top 10 videos by view count

select
  videoId,
  uploader,
  age,
  category,
  length,
  views,
  rate,
  ratings,
  comments
from youtube_orc
order by views desc
limit 10;

2. Top 10 video categories by popularity

-- Option 1:
select
  category_name,
  count(videoId) as video_num
from youtube_orc lateral view explode(category) youtube_view as category_name
group by category_name
order by video_num desc
limit 10;

-- Option 2:
select
  category_name as category,
  count(t1.videoId) as hot
from (
  select
    videoId,
    category_name
  from youtube_orc lateral view explode(category) t_catetory as category_name) t1
group by t1.category_name
order by hot desc
limit 10;
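
Since lateral view explode(category) drives most of the queries below, it can help to peek at the rows it produces before aggregation; a minimal sketch (the limit is arbitrary) is:

select videoId, category_name
from youtube_orc lateral view explode(category) t_view as category_name
limit 5;

Each video row is duplicated once per category it belongs to, which is what the count(videoId) above is counting.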

3. For the 20 most-viewed videos, their categories and how many of these Top 20 videos each category contains

select
  category_name,
  count(videoId) as videonums
from (
  select
    videoId,
    category_name
  from (
    -- Top 20 videos by view count
    select
      category,
      videoId,
      views
    from youtube_orc
    order by views desc
    limit 20) top20view
  lateral view explode(category) t1_view as category_name) t2_alias
group by category_name
order by videonums desc;

4. Popularity ranking of the categories of the videos related to the Top 50 most-viewed videos

select
  category_name,
  count(relatedvideoId) as hot
from (
  select
    relatedvideoId,
    category
  from (
    select distinct relatedvideoId
    from (
      -- Top 50 videos by view count, keeping their related-video IDs
      select
        views,
        relatedId
      from youtube_orc
      order by views desc
      limit 50) t1
    lateral view explode(relatedId) explode_viedeo as relatedvideoId) t2
  join youtube_orc on youtube_orc.videoId = t2.relatedvideoId) t3
lateral view explode(category) explode_category as category_name
group by category_name
order by hot desc;

5. Top 10 most popular videos within each category, using Music as an example

select
  videoId,
  views
from youtube_orc lateral view explode(category) t1_view as category_name
where category_name = "Music"
order by views desc
limit 10;

6. Top 10 videos by traffic within each category, using Music as an example

select
  videoId,
  ratings
from youtube_orc lateral view explode(category) t1_view as category_name
where category_name = "Music"
order by ratings desc
limit 10;

7. Top 10 users by number of uploaded videos, and the 20 most-viewed videos among their uploads

select
  t1.uploader, youtube_orc.videoId, youtube_orc.views
from (
  -- Top 10 uploaders by number of uploaded videos
  select
    uploader, videos
  from youtube_user_orc
  order by videos desc
  limit 10) t1
inner join youtube_orc on t1.uploader = youtube_orc.uploader
order by views desc
limit 20;

8. Top 10 videos by view count within each category (group-wise top-N)

select
  t2_alias.category_name,
  t2_alias.videoId,
  t2_alias.views
from (
  select
    category_name,
    videoId,
    views,
    row_number() over (partition by category_name order by views desc) rank
  from youtube_orc lateral view explode(category) t1_view as category_name) t2_alias
where rank <= 10;
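
row_number() gives tied view counts an arbitrary order. If ties should instead share a position, rank() or dense_rank() can be swapped in; a minimal variant of the inner query (same table and aliases as above) would be:

select
  category_name,
  videoId,
  views,
  rank() over (partition by category_name order by views desc) rank
from youtube_orc lateral view explode(category) t1_view as category_name;

With rank() or dense_rank(), the outer where rank <= 10 can return more than 10 rows per category when view counts tie; rank() also leaves gaps in the numbering after ties while dense_rank() does not.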

VII. Raw Data

1. Video table

2. User table

3. Cleaned video table

Link: https://pan.baidu.com/s/1OkQ2E5_KCngVRbTkalodrQ
Extraction code: yzk7
