大数据开发之Hive(统计影音视频网站的常规指标)

第 11 章:Hive实战

11.1 数据结构

1、视频表

字段 备注 详细描述
videoId 视频唯一id(String) 11位字符串
uploader 视频上传者(String) 上传视频的用户名String
age 视频年龄(int) 视频在平台上的整天数
category 视频类别(Array) 上传视频指定的视频分类
length 视频长度(Int) 整形数字标识的视频长度
views 观看次数(Int) 视频被浏览的次数
rate 视频评分(Double) 满分5分
Ratings 流量(Int) 视频的流量,整形数字
comments 评论数(Int) 一个视频的整数评论数
relatedId 相关视频id(Array) 相关视频的id,最多20个

2、用户表

字段 备注 字段类型
uploader 上传者用户名 string
videos 上传视频数 int
friends 朋友数量 int

11.2 准备工作

1、需要准备的表
1)创建原始数据表:gulivideo_ori,gulivideo_user_ori,
2)创建最终表:gulivideo_orc,gulivideo_user_orc
2、创建原始数据表
1)创建原始数据表gulivideo_ori

create external table gulivideo_ori(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>
)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as textfile
location '/gulivideo/video';

2)创建原始数据表:gulivideo_user_ori

create external table gulivideo_user_ori(
    uploader string,
    videos int,
    friends int
)
row format delimited 
fields terminated by "\t" 
stored as textfile
location '/gulivideo/user';

3)创建orc存储格式带snappy压缩的表gulivideo_orc

create table gulivideo_orc(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>
)
stored as orc
tblproperties("orc.compress"="SNAPPY");

4)创建orc存储格式带snappy压缩的表gulivideo_user_orc

create table gulivideo_user_orc(
    uploader string,
    videos int,
    friends int
)
row format delimited 
fields terminated by "\t" 
stored as orc
tblproperties("orc.compress"="SNAPPY");

5)向ori表插入数据

load data local inpath "/opt/module/hive/datas/video" into table gulivideo_ori;
load data local inpath "/opt/module/hive/datas/user.txt" into table gulivideo_user_ori;

6)向orc表插入数据

insert into table gulivideo_orc select * from gulivideo_ori;
insert into table gulivideo_user_orc select * from gulivideo_user_ori;

11.3 业务分析

11.3.1 统计视频观看数Top10

1、思路:
使用order by按照views字段做一个全局排序即可,同时我们设置只显示前10条。
2、代码

select
    videoId,
    `views`
from gulivideo_orc
order by `views` desc 
limit 10;
OK
videoid          views
dMH0bHeiRNg     42513417
0XxI-hvPRRA     20282464
1dmVU08zVpA     16087899
RB-wUgnyGv0     15712924
QjA5faZF1A8     15256922
-_CSo1gOd48     13199833
49IDp76kjPw     11970018
tYnn51C3X_w     11823701
pv5zWaTEVkI     11672017
D2kJZOfq7zk     11184051

11.3.2 统计视频类别热度Top10(类别热度:类别下的总视频数)

1、思路:
1)统计每个类别有多少个视频,显示出包含视频最多的前10个类别。
2)我们需要按照类别group by聚合,然后count组内的videoId个数即可。
3)因为当前表结构为:一个视频对应一个或多个类别。所以如果要group by类别,需要先将类别进行行列转化(展开),然后再进行count即可。
4)最后按照热度排序,显示前10条。
2、代码

select
    tmp01.category_col,
    count(tmp01.videoId) num
from (
     select
         videoId,
         category_col
     from gulivideo_orc
              lateral view
                  explode(category) t as category_col
) tmp01
group by tmp01.category_col
order by num desc
limit 10;
// 结果显示
OK
tmp01.category_col        num
Music                      179049
Entertainment             127674
Comedy                     87818
Animation                 73293
Film                       73293
Sports                     67329
Gadgets                    59817
Games                      59817
Blogs                      48890
People                     48890

11.3.3 统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

1、思路
1)先找到观看书最高的20个视频所属条目的所有信息(主要是类目),降序排列
2)先把20条信息中的category分裂出来(列转行),形成新的字段category_name
3)在第二步的结果下,按照炸开的视频类别category_name分组,然后统计组内的个数category_count
2、最终代码

 select
    table02.categroy_name,
    count(table02.videoId) num
from (
     select
         videoId,
         categroy_name
     from (
              select
                  videoId,
                  `views`,
                  category
              from gulivideo_orc
              order by `views` desc
              limit 20
          ) table01
              lateral view
                  explode(category) tmp as categroy_name
) table02
group by table02.categroy_nam;
// 结果显示
OK
table02.categroy_name   num
 Blogs                     2
 UNA                        1
Comedy                     6
Entertainment             6
Music                      5
People                     2

11.3.4 统计视频观看数Top50所关联视频的所属类别排序

1、思路
1)先找到观看数前50的视频信息(主要是求出关联视频)
2)炸开第一步求出的关联视频array,形成一个新字段new_relatedid
3)用new_relatedid和gulivideo_orc表进行join,求出new_relatedid的类别
4)炸开第三步结果的category,形成新字段category_name
5)按照catedory_name分组,然后求出每个分组的个数category_count
6)对category_count进行排序,利用开窗函数
2、代码

select 
    t5.category_name,
    t5.num,
    rank() over(order by t5.num desc ) rk
from (
     select
         t4.category_name,
         count(t4.realte_id) num
     from (
              select
                  t3.realte_id,
                  category_name
              from (
                       select
                           t2.realte_id,
                           g.category
                       from (
                                select
                                    realte_id
                                from (
                                         select
                                             videoId,
                                             relatedId,
                                             `views`
                                         from gulivideo_orc
                                         order by `views` desc
                                         limit 50
                                     ) t1
                                         lateral view
                                             explode(t1.relatedId) tmp as realte_id
                            ) t2 join gulivideo_orc g on t2.realte_id = g.videoId
                   ) t3
                       lateral view
                           explode(t3.category) tmp as category_name
          ) t4
     group by t4.category_name        
) t5 ;
// 结果显示OK
t5.category_name        t5.num  rk
Comedy  237     1
Entertainment   216     2
Music   195     3
People  51      4
Blogs   51      4
Animation       47      6
Film    47      6
News    24      8
Politics        24      8
Games   22      10
Gadgets 22      10
Sports  19      12
Howto   14      13
DIY     14      13
UNA     13      15
Travel  12      16
Places  12      16
Animals 11      18
Pets    11      18
Autos   4       20
Vehicles        4       20

11.3.5统计每个类别中的视频热度Top10,以Music为例

1、思路
1)要想统计Music类别中的视频热度Top10,需要先找到Music类别,那么就需要将category展开成新的字段categary_name。
2)然后通过category_name过滤“Music”分类的所有视频信息,按照视频观看数倒序排序,取前10
3)统计对应类别(Music)中的视频热度
2、代码

select
    videoId,
    `views` hot
from (
     select
         videoId,
         category_name,
         `views`
     from gulivideo_orc
              lateral view
                  explode(category) tmp as category_name        
) t1
where category_name = "Music"
order by hot desc 
limit 10;
// 结果显示
OK
videoid          hot
QjA5faZF1A8     15256922
tYnn51C3X_w     11823701
pv5zWaTEVkI     11672017
8bbTtPL1jRs     9579911
UMf40daefsI     7533070
-xEzGIuY7kw     6946033
d6C0bNDqf3Y     6935578
HSoVKUVOnfQ     6193057
3URfWTEPmtE     5581171
thtmaZnxk_0     5142238

11.3.6 统计每个类别视频观看数Top10

1、思路
1)把每个原始表的类别炸开,形成新的字段category_name
2)按照炸裂开的类别字段category_name分区,按照视频观看数views倒叙排序进行开窗,求出每个类别下的所有视频的观看次数排序rk
3)按照rk字段对全表进行where过滤,求出每个类别观看书Top10
2、代码

select
    t2.category_name,
    t2.views,
    t2.rk
from (
     select
         t1.category_name,
         t1.views,
         rank() over(partition by t1.category_name order by t1.views desc ) rk
     from (   
           select
               category_name,
               `views`
           from gulivideo_orc
                    lateral view
                        explode(category) tmp as category_name
       ) t1
) t2
where rk <= 10// 结果显示
OK
t2.category_name        t2.views        t2.rk
Comedy  42513417        1
Comedy  20282464        2
Comedy  11970018        3
Comedy  10107491        4
Comedy  9566609 5
Comedy  7066676 6
Comedy  6322117 7
Comedy  5826923 8
Comedy  5587299 9
Comedy  5508079 10
News    4706030 1
News    2899397 2
News    2817078 3
News    2803520 4
News    2348709 5
News    2335060 6
News    2326680 7
News    2318782 8
News    2310583 9
News    2291369 10
……
Time taken: 11.376 seconds, Fetched: 210 row(s)

11.3.7 统计上传视频最多的用户Top10以及它们上传的视频观看次数在前20的视频

有三种理解
理解一:取Top10中所有人上传的视频的观看次数前20
1、思路
1)去用户表gulivideo_user_orc求出上传视频最多的十个用户
2)关联gulivideo_orc表,求出这10个用户上传的所有的视频,按照观看数取前20
2、代码

SELECT
    t1.uploader,
    t2.videoid,
    t2.views
FROM
    (
        select
            uploader,
            videos
        from gulivideo_user_orc
        order by videos DESC
        limit 10
    ) t1
        JOIN
    gulivideo_orc t2
    on t1.uploader = t2.uploader
ORDER BY t2.views DESC
LIMIT 20;
// 结果显示
OK
t1.uploader     t2.videoid      t2.views
expertvillage   -IxHBW0YpZw     39059
expertvillage   BU-fT5XI_8I     29975
expertvillage   ADOcaBYbMl0     26270
expertvillage   yAqsULIDJFE     25511
expertvillage   vcm-t0TJXNg     25366
expertvillage   0KYGFawp14c     24659
expertvillage   j4DpuPvMLF4     22593
expertvillage   Msu4lZb2oeQ     18822
expertvillage   ZHZVj44rpjE     16304
expertvillage   foATQY3wovI     13576
expertvillage   -UnQ8rcBOQs     13450
expertvillage   crtNd46CDks     11639
expertvillage   D1leA0JKHhE     11553
expertvillage   NJu2oG1Wm98     11452
expertvillage   CapbXdyv4j4     10915
expertvillage   epr5erraEp4     10817
expertvillage   IyQoDgaLM7U     10597
expertvillage   tbZibBnusLQ     10402
expertvillage   _GnCHodc7mk     9422
expertvillage   hvEYlSlRitU     7123
Time taken: 57.272 seconds, Fetched: 20 row(s)

理解二:取Top10中每个人上传的视频的观看次数前20
1、思路
1)去用户表gulivideo_user_orc求出上传视频最多的10个用户
2)关联gulivideo_orc表,求出这10个用户上传的所有视频id,视频观看次数,还要按照uploader分区,views倒叙排序,求出每个uploder的上传的视频的观看排名
3)按照rk进行where过滤,求出rk<=20的数据
2、代码

select
    t3.uploader,
    t3.videoId,
    t3.views,
    t3.rk
from (
     select
         t2.uploader,
         t2.videoId,
         t2.views,
         rank() over(partition by uploader order by t2.views desc ) rk
     from (
              select
                  t1.uploader,
                  g.videoId,
                  g.`views`
              from (
                       select
                           uploader
                       from gulivideo_user_orc
                       order by videos desc
                       limit 10
                   ) t1 join gulivideo_orc g on t1.uploader = g.uploader
          ) t2    
) t3
where rk <=20;
// 结果显示
OK
t3.uploader     t3.videoid      t3.views        t3.rk
expertvillage   -IxHBW0YpZw     39059   1
expertvillage   BU-fT5XI_8I     29975   2
expertvillage   ADOcaBYbMl0     26270   3
expertvillage   yAqsULIDJFE     25511   4
expertvillage   vcm-t0TJXNg     25366   5
expertvillage   0KYGFawp14c     24659   6
expertvillage   j4DpuPvMLF4     22593   7
expertvillage   Msu4lZb2oeQ     18822   8
expertvillage   ZHZVj44rpjE     16304   9
expertvillage   foATQY3wovI     13576   10
expertvillage   -UnQ8rcBOQs     13450   11
expertvillage   crtNd46CDks     11639   12
expertvillage   D1leA0JKHhE     11553   13
expertvillage   NJu2oG1Wm98     11452   14
expertvillage   CapbXdyv4j4     10915   15
expertvillage   epr5erraEp4     10817   16
expertvillage   IyQoDgaLM7U     10597   17
expertvillage   tbZibBnusLQ     10402   18
expertvillage   _GnCHodc7mk     9422    19
expertvillage   hvEYlSlRitU     7123    20
Ruchaneewan     5_T5Inddsuo     3132    1
Ruchaneewan     wje4lUtbYNU     1086    2
Ruchaneewan     i8rLbOUhAlM     549     3
Ruchaneewan     OwnEtde9_Co     453     4
Ruchaneewan     5Zf0lbAdJP0     441     5
Ruchaneewan     wenI5MrYT20     426     6
Ruchaneewan     Iq4e3SopjxQ     420     7
Ruchaneewan     3hzOiFP-5so     420     7
Ruchaneewan     JgyOlXjjuw0     418     9
Ruchaneewan     fGBVShTsuyo     395     10
Ruchaneewan     O3aoL70DlVc     389     11
Ruchaneewan     q4y2ZS5OQ88     344     12
Ruchaneewan     lyUJB2eMVVg     271     13
Ruchaneewan     _RF_3VhaQpw     242     14
Ruchaneewan     DDl2cjI-aJs     231     15
Ruchaneewan     xbYyjUdhtJw     227     16
Ruchaneewan     4dkKeIUkN7E     226     17
Ruchaneewan     qCfuQA6N4K0     213     18
Ruchaneewan     TmYbGQaRcNM     209     19
Ruchaneewan     dOlfPsFSjw0     206     20
Time taken: 30.772 seconds, Fetched: 40 row(s

理解三:Top10用户上传的所有视频,有哪些视频是在视频观看次数前20的视频
1、思路
1)去用户表gulivideo_user_orc求出上传视频最多的10个用户
2)关联gulivideo_orc表,求出这10个用户上传的所有的视频id,视频观看次数
3)在第二步的结果上,与视频表观看次数前20的数据进行内连接,求出Top10用户上传的视频有哪些是观看次数前20的视频
2、代码

SELECT
    t3.uploader,
    t3.videoid,
    t3.views
FROM
    (
        SELECT
            t1.uploader,
            t2.videoid,
            t2.views
        FROM
            (
                select
                    uploader,
                    videos
                from gulivideo_user_orc
                order by videos DESC
                limit 10
            ) t1
                JOIN
            gulivideo_orc t2
            on t1.uploader = t2.uploader
    ) t3
        JOIN
    (
        select
            videoid,
            `views`
        from gulivideo_orc
        order by `views` desc
        limit 20
    ) t4
on t3.videoid = t4.videoid;

你可能感兴趣的:(大数据,hive,hadoop)