参考文章:
1. Hive row_number() 等用法
https://www.cnblogs.com/Allen-rg/p/9268627.html
2.Hive中分组取前N个值
https://www.cnblogs.com/1130136248wlxk/articles/5352145.html
TopN 功能一直是一个热门的话题,下面我们看在Hive 中实现分组 TopN .
Hive 在0.11 之后引入了 一个函数 ROW_NUMBER() 可以非常方便的解决此类问题
在Hive 0.11 之前的版本是没有 topN 函数的。那么我们在0.11 之前的版本该如何实现 topN 呢? 这里有一篇不错的文章是通过
Hive 的udf 去做实现的,我们这里做一个参数。
https://www.cnblogs.com/1130136248wlxk/articles/5352145.html
下面是函数的原型“
ROW_NUMBER() OVER (partition BY COLUMN_A ORDER BY COLUMN_B ASC/DESC)
首先我们构造一个场景,假设提供一个成绩表,有 学生姓名,所选科目 ,分数 3列需要统计每个科目的前三名。
表结构如下:
+----------------------------------------------------+--+
| createtab_stmt |
+----------------------------------------------------+--+
| CREATE TABLE `student2`( |
| `name` string, |
| `score` double, |
| `subject` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://DMP-BIG001.A1.BJ.JD:8020/user/hive/warehouse/test.db/student2' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'last_modified_by'='cloudera-scm', |
| 'last_modified_time'='1551687368', |
| 'numFiles'='6', |
| 'numRows'='18', |
| 'rawDataSize'='237', |
| 'totalSize'='255', |
| 'transient_lastDdlTime'='1551687784') |
+----------------------------------------------------+--+
表有如下的数据:
+----------------+-----------------+-------------------+--+
| student2.name | student2.score | student2.subject |
+----------------+-----------------+-------------------+--+
| a | 22.2 | english |
| a | 90.2 | chinese |
| a | 33.0 | english |
| b | 72.2 | english |
| b | 80.2 | chinese |
| b | 63.0 | math |
| c | 64.2 | english |
| c | 85.2 | chinese |
| c | 73.0 | math |
| d | 24.2 | english |
| d | 75.2 | chinese |
| d | 43.0 | math |
| e | 74.2 | english |
| e | 55.2 | chinese |
| e | 93.0 | math |
| f | 76.2 | english |
| f | 20.2 | chinese |
| f | 84.0 | math |
| f | 63.0 | math |
+----------------+-----------------+-------------------+--+
0: jdbc:hive2://10.180.0.26:10000> select * from (select name, subject, score, row_number() over(partition by subject order by score desc) rank from student2) tmp where tmp.rank <= 3;
select * from (
select name, subject, score, row_number() over(partition by subject order by score desc) rank
from student2) tmp
where tmp.rank <= 3;
对语句简单讲解一下 :内嵌的子查询 根据科目分组 求该分组下的排名,如果有相同分数,排名+1
结果:
+-----------+--------------+------------+-----------+--+
| tmp.name | tmp.subject | tmp.score | tmp.rank |
+-----------+--------------+------------+-----------+--+
| a | chinese | 90.2 | 1 |
| c | chinese | 85.2 | 2 |
| b | chinese | 80.2 | 3 |
| f | english | 76.2 | 1 |
| e | english | 74.2 | 2 |
| b | english | 72.2 | 3 |
| e | math | 93.0 | 1 |
| f | math | 84.0 | 2 |
| c | math | 73.0 | 3 |
+-----------+--------------+------------+-----------+--+
除了会使用函数之外,我们应该也了解下函数的执行顺序。
在使用 row_number() over()函数时候,over()里头的分组以及排序的执行晚于 where group by order by 的执行。
除了 Hive 的 row_number() , Hive 有没有提供功能相近的函数呢。答案是有的。
有两个函数, 分别是
rank() over()
dense_rank() over()
rank() over() 跟 row_number() over() 的功能基本相同:
不同点在于,分组中存在相同值的处理流程。
rank() over() 更接近于一般的排名逻辑,比如有两个并列第一,那么就会显示
88分 第1
88分 第1
80分 第3
参考例子:
0: jdbc:hive2://10.180.0.26:10000> select * from (select name, subject, score, rank() over(partition by subject order by score desc) rank from student2) tmp where tmp.rank <= 5;
select * from (
select name, subject, score, rank() over(partition by subject order by score desc) rank
from student2) tmp
where tmp.rank <= 5;
+-----------+--------------+------------+-----------+--+
| tmp.name | tmp.subject | tmp.score | tmp.rank |
+-----------+--------------+------------+-----------+--+
| a | chinese | 90.2 | 1 |
| c | chinese | 85.2 | 2 |
| b | chinese | 80.2 | 3 |
| d | chinese | 75.2 | 4 |
| e | chinese | 55.2 | 5 |
| f | english | 76.2 | 1 |
| e | english | 74.2 | 2 |
| b | english | 72.2 | 3 |
| c | english | 64.2 | 4 |
| a | english | 33.0 | 5 |
| e | math | 93.0 | 1 |
| f | math | 84.0 | 2 |
| c | math | 73.0 | 3 |
| f | math | 63.0 | 4 |
| b | math | 63.0 | 4 |
+-----------+--------------+------------+-----------+--+
可以看到 f 与 b 都是 63分,显示都是第 4名
该函数与 row_number() over() 的区别也在于相同值的处理过程上。
不同于 rank() over() 的跳跃排序, 即两个第2名,之后是第4名,
该函数是连续排序,即两个相同第2名,之后是第3名,示例如下:
select name, subject, score, dense_rank() over(partition by subject order by score desc) rank
from student2;
0: jdbc:hive2://10.180.0.26:10000> select name, subject, score, dense_rank() over(partition by subject order by score desc) rank from student2;
结果:
+-------+----------+--------+-------+--+
| name | subject | score | rank |
+-------+----------+--------+-------+--+
| a | chinese | 90.2 | 1 |
| c | chinese | 85.2 | 2 |
| b | chinese | 80.2 | 3 |
| d | chinese | 75.2 | 4 |
| e | chinese | 55.2 | 5 |
| f | chinese | 20.2 | 6 |
| f | english | 76.2 | 1 |
| e | english | 74.2 | 2 |
| b | english | 72.2 | 3 |
| c | english | 64.2 | 4 |
| a | english | 33.0 | 5 |
| d | english | 24.2 | 6 |
| a | english | 22.2 | 7 |
| e | math | 93.0 | 1 |
| f | math | 84.0 | 2 |
| c | math | 73.0 | 3 |
| f | math | 63.0 | 4 |
| b | math | 63.0 | 4 |
| d | math | 43.0 | 5 |
+-------+----------+--------+-------+--+
好了。通过这篇文章你应该对 Hive 的分组排序功能有个非常细致的理解。
祝大家每日进步,加油!
我们有一张表 ,记录了 用户的姓名,性别 ,分数,我们想找到 男生与女生的前3名,
表中数据如下:
0: jdbc:hive2://cdh-manager:10000> select * from test_sex_rank
. . . . . . . . . . . . . . . . .> ;
INFO : Compiling command(queryId=hive_20190409043814_2dcc9da2-b928-4de4-bda7-87d123c116c8): select * from test_sex_rank
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:test_sex_rank.name, type:string, comment:null), FieldSchema(name:test_sex_rank.sex, type:boolean, comment:null), FieldSchema(name:test_sex_rank.score, type:double, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20190409043814_2dcc9da2-b928-4de4-bda7-87d123c116c8); Time taken: 0.117 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20190409043814_2dcc9da2-b928-4de4-bda7-87d123c116c8): select * from test_sex_rank
INFO : Completed executing command(queryId=hive_20190409043814_2dcc9da2-b928-4de4-bda7-87d123c116c8); Time taken: 0.0 seconds
INFO : OK
+---------------------+--------------------+----------------------+
| test_sex_rank.name | test_sex_rank.sex | test_sex_rank.score |
+---------------------+--------------------+----------------------+
| a1 | false | 82.0 |
| a2 | false | 98.0 |
| a3 | false | 67.4 |
| a4 | false | 87.0 |
| b1 | true | 42.0 |
| b2 | true | 98.0 |
| b3 | true | 77.4 |
| b4 | true | 87.0 |
+---------------------+--------------------+----------------------+
8 rows selected (0.179 seconds)
我们编写的SQL 如下:
0: jdbc:hive2://cdh-manager:10000> select * from (select name, score, sex, rank() over(partition by sex order by score) as rank from test_sex_rank) tmp where rank <=3;
select * from
(
select name, score, sex, rank() over(partition by sex order by score) as rank
from test_sex_rank
) tmp
where rank <=3;
结果如下:
+-----------+------------+----------+-----------+
| tmp.name | tmp.score | tmp.sex | tmp.rank |
+-----------+------------+----------+-----------+
| a3 | 67.4 | false | 1 |
| a1 | 82.0 | false | 2 |
| a4 | 87.0 | false | 3 |
| b1 | 42.0 | true | 1 |
| b3 | 77.4 | true | 2 |
| b4 | 87.0 | true | 3 |
+-----------+------------+----------+-----------+