Hive_ Hive 中 TopN 的实现 -》 利用row_number() 函数实现分组TopN

 

参考文章: 

1. Hive row_number() 等用法

https://www.cnblogs.com/Allen-rg/p/9268627.html

 

2.Hive中分组取前N个值

https://www.cnblogs.com/1130136248wlxk/articles/5352145.html

 

 

 

TopN 功能一直是一个热门的话题,下面我们看在Hive 中实现分组 TopN .

 Hive 在0.11 之后引入了 一个函数 ROW_NUMBER() 可以非常方便的解决此类问题

 

0.11 前相近功能的实现

  在Hive 0.11 之前的版本是没有 topN 函数的。那么我们在0.11 之前的版本该如何实现 topN 呢? 这里有一篇不错的文章是通过 

Hive 的udf 去做实现的,我们这里做一个参数。

https://www.cnblogs.com/1130136248wlxk/articles/5352145.html

 

 

函数使用

 

下面是函数的原型“

     ROW_NUMBER() OVER (partition BY COLUMN_A ORDER BY COLUMN_B ASC/DESC) 

 

首先我们构造一个场景,假设提供一个成绩表,有 学生姓名,所选科目 ,分数 3列需要统计每个科目的前三名。

 

表结构如下:

+----------------------------------------------------+--+
|                   createtab_stmt                   |
+----------------------------------------------------+--+
| CREATE TABLE `student2`(                           |
|   `name` string,                                   |
|   `score` double,                                  |
|   `subject` string)                                |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'hdfs://DMP-BIG001.A1.BJ.JD:8020/user/hive/warehouse/test.db/student2' |
| TBLPROPERTIES (                                    |
|   'COLUMN_STATS_ACCURATE'='true',                  |
|   'last_modified_by'='cloudera-scm',               |
|   'last_modified_time'='1551687368',               |
|   'numFiles'='6',                                  |
|   'numRows'='18',                                  |
|   'rawDataSize'='237',                             |
|   'totalSize'='255',                               |
|   'transient_lastDdlTime'='1551687784')            |
+----------------------------------------------------+--+

表有如下的数据:

+----------------+-----------------+-------------------+--+
| student2.name  | student2.score  | student2.subject  |
+----------------+-----------------+-------------------+--+
| a              | 22.2            | english           |
| a              | 90.2            | chinese           |
| a              | 33.0            | english           |
| b              | 72.2            | english           |
| b              | 80.2            | chinese           |
| b              | 63.0            | math              |
| c              | 64.2            | english           |
| c              | 85.2            | chinese           |
| c              | 73.0            | math              |
| d              | 24.2            | english           |
| d              | 75.2            | chinese           |
| d              | 43.0            | math              |
| e              | 74.2            | english           |
| e              | 55.2            | chinese           |
| e              | 93.0            | math              |
| f              | 76.2            | english           |
| f              | 20.2            | chinese           |
| f              | 84.0            | math              |
| f              | 63.0            | math              |
+----------------+-----------------+-------------------+--+

 

0: jdbc:hive2://10.180.0.26:10000> select * from (select name, subject, score, row_number() over(partition by subject order by score desc) rank  from student2) tmp  where tmp.rank <= 3;

 

 select * from (

      select name, subject, score, row_number() over(partition by subject order by score desc) rank  

      from student2)  tmp  

where tmp.rank <= 3;

 

对语句简单讲解一下 :内嵌的子查询 根据科目分组 求该分组下的排名,如果有相同分数,排名+1
 

结果: 

+-----------+--------------+------------+-----------+--+
| tmp.name  | tmp.subject  | tmp.score  | tmp.rank  |
+-----------+--------------+------------+-----------+--+
| a         | chinese      | 90.2       | 1         |
| c         | chinese      | 85.2       | 2         |
| b         | chinese      | 80.2       | 3         |
| f         | english      | 76.2       | 1         |
| e         | english      | 74.2       | 2         |
| b         | english      | 72.2       | 3         |
| e         | math         | 93.0       | 1         |
| f         | math         | 84.0       | 2         |
| c         | math         | 73.0       | 3         |
+-----------+--------------+------------+-----------+--+

 

执行流程

 

     除了会使用函数之外,我们应该也了解下函数的执行顺序。

 

    在使用 row_number() over()函数时候,over()里头的分组以及排序的执行晚于 where group by  order by 的执行。

 

 

相近的函数

    

     除了 Hive 的 row_number() , Hive 有没有提供功能相近的函数呢。答案是有的。

有两个函数, 分别是 

 

rank() over()   

dense_rank() over()

 

rank() over()   

rank() over()   跟 row_number()  over()  的功能基本相同:

不同点在于,分组中存在相同值的处理流程。

rank() over()  更接近于一般的排名逻辑,比如有两个并列第一,那么就会显示

88分 第1 

88分 第1

80分 第3

 

参考例子:

0: jdbc:hive2://10.180.0.26:10000> select * from (select name, subject, score, rank() over(partition by subject order by score desc) rank  from student2) tmp  where tmp.rank <= 5;

 

select * from (

     select name, subject, score, rank() over(partition by subject order by score desc) rank  

     from student2) tmp  

where tmp.rank <= 5;
 

+-----------+--------------+------------+-----------+--+
| tmp.name  | tmp.subject  | tmp.score  | tmp.rank  |
+-----------+--------------+------------+-----------+--+
| a         | chinese      | 90.2       | 1         |
| c         | chinese      | 85.2       | 2         |
| b         | chinese      | 80.2       | 3         |
| d         | chinese      | 75.2       | 4         |
| e         | chinese      | 55.2       | 5         |
| f         | english      | 76.2       | 1         |
| e         | english      | 74.2       | 2         |
| b         | english      | 72.2       | 3         |
| c         | english      | 64.2       | 4         |
| a         | english      | 33.0       | 5         |
| e         | math         | 93.0       | 1         |
| f         | math         | 84.0       | 2         |
| c         | math         | 73.0       | 3         |
| f         | math         | 63.0       | 4         |
| b         | math         | 63.0       | 4         |
+-----------+--------------+------------+-----------+--+

可以看到 f 与 b 都是 63分,显示都是第 4名

 

 

dense_rank() over()

 

该函数与 row_number() over() 的区别也在于相同值的处理过程上。

不同于 rank() over() 的跳跃排序, 即两个第2名,之后是第4名,

该函数是连续排序,即两个相同第2名,之后是第3名,示例如下:

 

select name, subject, score, dense_rank() over(partition by subject order by score desc) rank  

from student2;
 

0: jdbc:hive2://10.180.0.26:10000> select name, subject, score, dense_rank() over(partition by subject order by score desc) rank  from student2;

结果:

+-------+----------+--------+-------+--+
| name  | subject  | score  | rank  |
+-------+----------+--------+-------+--+
| a     | chinese  | 90.2   | 1     |
| c     | chinese  | 85.2   | 2     |
| b     | chinese  | 80.2   | 3     |
| d     | chinese  | 75.2   | 4     |
| e     | chinese  | 55.2   | 5     |
| f     | chinese  | 20.2   | 6     |
| f     | english  | 76.2   | 1     |
| e     | english  | 74.2   | 2     |
| b     | english  | 72.2   | 3     |
| c     | english  | 64.2   | 4     |
| a     | english  | 33.0   | 5     |
| d     | english  | 24.2   | 6     |
| a     | english  | 22.2   | 7     |
| e     | math     | 93.0   | 1     |
| f     | math     | 84.0   | 2     |
| c     | math     | 73.0   | 3     |
| f     | math     | 63.0   | 4     |
| b     | math     | 63.0   | 4     |
| d     | math     | 43.0   | 5     |
+-------+----------+--------+-------+--+

 

 

     好了。通过这篇文章你应该对 Hive 的分组排序功能有个非常细致的理解。

祝大家每日进步,加油!

 

 

实际场景

 

我们有一张表 ,记录了 用户的姓名,性别 ,分数,我们想找到 男生与女生的前3名,

表中数据如下:

0: jdbc:hive2://cdh-manager:10000> select * from test_sex_rank
. . . . . . . . . . . . . . . . .> ;
INFO  : Compiling command(queryId=hive_20190409043814_2dcc9da2-b928-4de4-bda7-87d123c116c8): select * from test_sex_rank
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:test_sex_rank.name, type:string, comment:null), FieldSchema(name:test_sex_rank.sex, type:boolean, comment:null), FieldSchema(name:test_sex_rank.score, type:double, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20190409043814_2dcc9da2-b928-4de4-bda7-87d123c116c8); Time taken: 0.117 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20190409043814_2dcc9da2-b928-4de4-bda7-87d123c116c8): select * from test_sex_rank
INFO  : Completed executing command(queryId=hive_20190409043814_2dcc9da2-b928-4de4-bda7-87d123c116c8); Time taken: 0.0 seconds
INFO  : OK
+---------------------+--------------------+----------------------+
| test_sex_rank.name  | test_sex_rank.sex  | test_sex_rank.score  |
+---------------------+--------------------+----------------------+
| a1                  | false              | 82.0                 |
| a2                  | false              | 98.0                 |
| a3                  | false              | 67.4                 |
| a4                  | false              | 87.0                 |
| b1                  | true               | 42.0                 |
| b2                  | true               | 98.0                 |
| b3                  | true               | 77.4                 |
| b4                  | true               | 87.0                 |
+---------------------+--------------------+----------------------+
8 rows selected (0.179 seconds)

 

我们编写的SQL 如下:

0: jdbc:hive2://cdh-manager:10000> select * from (select name, score, sex, rank() over(partition by sex order by score) as rank from test_sex_rank) tmp  where rank <=3;

 

select * from

    (

     select name, score, sex, rank() over(partition by sex order by score) as rank

     from test_sex_rank

    ) tmp

 where rank <=3;
 

 

结果如下:

+-----------+------------+----------+-----------+
| tmp.name  | tmp.score  | tmp.sex  | tmp.rank  |
+-----------+------------+----------+-----------+
| a3        | 67.4       | false    | 1         |
| a1        | 82.0       | false    | 2         |
| a4        | 87.0       | false    | 3         |
| b1        | 42.0       | true     | 1         |
| b3        | 77.4       | true     | 2         |
| b4        | 87.0       | true     | 3         |
+-----------+------------+----------+-----------+

 

你可能感兴趣的:(Hive)