hive sql系列(十)

每天分享一个sql,帮助大家找到sql的快乐

需求

编写sql语句实现每班前三名,分数一样并列,
同时求出前三名按名次排序的分差

建表语句

create table student(
    sid string,--学号
    cid string,--班级号
    score string -- 分数
)
row format delimited fields terminated by '\t'
;

数据

#说明:
数据1具有偶然性,适合不重复的情况,实现可以用扩展部分写法1实现
数据2具有通用性,适合重复数据的情况

#数据1
insert overwrite table student values
("1","1901","90"),
("2","1901","90"),
("3","1901","83"),
("4","1901","60"),
("5","1902","66"),
("6","1902","23"),
("7","1902","99"),
("8","1902","67"),
("9","1902","87");

#数据2
insert overwrite table student values
("1","1901","90"),
("2","1901","80"),
("3","1901","80"),
("4","1901","70"),
("5","1901","75"),
("6","1902","60"),
("7","1902","66"),
("8","1902","23"),
("9","1902","99"),
("10","1902","67");

实现

select
    t1.cid,
    t1.score,
    t1.rank,
    t2.diff_score
from
    (select
        cid,
        score,
        dense_rank() over(partition by cid order by score desc) rank
    from
        student
    )t1
    left join
    (select
        cid,
        score,
        nvl(score-lag(score) over(partition by cid order by score desc),0) diff_score
    from
        student
    group by cid,score
    )t2
    on t1.score = t2.score
where t1.rank <= 3
;

结果

t1.cid	t1.score	t1.rank	t2.diff_score
1901	90	1	0.0
1901	80	2	-10.0
1901	80	2	-10.0
1901	75	3	-5.0
1902	99	1	0.0
1902	67	2	-32.0
1902	66	3	-1.0
Time taken: 88.539 seconds, Fetched: 7 row(s)

分析

1、根据需求发现需要对班级开窗,使用over函数指定班级为分区键
2、根据分数排序,按常识倒序,且分数相同名次一样,所以使用dense_rank,
3、根据名次计算差值,使用lag函数,取出当前行的前第几个数,然后计算差值
4、lag函数,如果是第一个数,则默认取不到,返回null值,需要转化为0,所以使用nvl函数,可以减少使用if函数

#补充
lag函数根据排序规则默认取上一行,比如数据排序是[90,80,80..],那么第二个80取到的还是80,导致计算差值时结果变成了[0,-10,0],显然这不是我们想要的结果。

解决办法【序号和差值分开做,再关联】:先做排序,再补充分数差a表。分数差的做法,做一个去重排序差值,得到有序名词、分数、差值的b表,再用a表和b表通过排序序号关联补充分数差。

扩展

#写法1
select
    t1.cid,
    t1.score,
    t1.rank,
    t1.diff_score
from
    (select
        cid,
        score,
        dense_rank() over(partition by cid order by score desc) rank,
        nvl(score-lag(score) over(partition by cid order by score desc),0) diff_score
    from
        student
    )t1
where rank <= 3
;
OK
1901	90	1	0.0
1901	90	1	0.0
1901	83	2	-7.0
1901	60	3	-23.0
1902	99	1	0.0
1902	87	2	-12.0
1902	67	3	-20.0
Time taken: 29.836 seconds, Fetched: 7 row(s)

#写法2
select
    t1.cid,
    t1.score,
    t1.rank,
    nvl(t1.score-lag(t1.score) over(partition by t1.cid order by t1.score desc),0) diff_score
from
    (select
        cid,
        score,
        dense_rank() over(partition by cid order by score desc) rank
    from
        student
    )t1
where t1.rank <= 3
;
Total MapReduce CPU Time Spent: 6 seconds 740 msec
OK
1901	90	1	0.0
1901	90	1	0.0
1901	83	2	-7.0
1901	60	3	-23.0
1902	99	1	0.0
1902	87	2	-12.0
1902	67	3	-20.0
Time taken: 49.058 seconds, Fetched: 7 row(s)

#写法2日志
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-04-15 15:05:31,734 Stage-1 map = 0%,  reduce = 0%
2021-04-15 15:05:39,196 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.4 sec
2021-04-15 15:05:46,505 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.88 sec
MapReduce Total cumulative CPU time: 3 seconds 880 msec
Ended Job = job_1618468831901_0003
Launching Job 2 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1618468831901_0004, Tracking URL = http://my-hdp2:8088/proxy/application_1618468831901_0004/
Kill Command = /usr/hdp/2.6.5.0-292/hadoop/bin/hadoop job  -kill job_1618468831901_0004
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2021-04-15 15:05:55,424 Stage-2 map = 0%,  reduce = 0%
2021-04-15 15:06:02,852 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.97 sec
2021-04-15 15:06:11,485 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 2.86 sec
MapReduce Total cumulative CPU time: 2 seconds 860 msec
Ended Job = job_1618468831901_0004
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.88 sec   HDFS Read: 8747 HDFS Write: 278 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 2.86 sec   HDFS Read: 7919 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 740 msec

#写法1日志
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1618468831901_0002, Tracking URL = http://my-hdp2:8088/proxy/application_1618468831901_0002/
Kill Command = /usr/hdp/2.6.5.0-292/hadoop/bin/hadoop job  -kill job_1618468831901_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-04-15 14:59:54,428 Stage-1 map = 0%,  reduce = 0%
2021-04-15 15:00:06,550 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.97 sec
2021-04-15 15:00:14,991 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.91 sec
MapReduce Total cumulative CPU time: 2 seconds 910 msec
Ended Job = job_1618468831901_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.91 sec   HDFS Read: 10536 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 910 msec


对比实现部分sql,从控制台任务执行日志来看,写法1性能较高,时间较短

#lag默认用法
hive> select
    >     cid,
    >     score,
    >     dense_rank() over(partition by cid order by score desc) rank,
    >     lag(score) over(partition by cid order by score desc)
    > from
    >     student
    > ;
OK
1901	90	1	NULL
1901	90	1	90
1901	83	2	90
1901	60	3	83
1902	99	1	NULL
1902	87	2	99
1902	67	3	87
1902	66	4	67
1902	23	5	66
Time taken: 22.886 seconds, Fetched: 9 row(s)

知识点

1、nvl(value1,value2):如果value1是null,则返回value2,如果不是,则返回value1
2、dense_rank():基于over开窗函数的排序函数,如果值相同,则排序的序号相同,紧接的序号不跳过。举例123,1123,1223这样
3、lag(字段,n):基于over开窗函数,根据排序规则取当前行前第n个数,如果不指定n,则默认取前一个

 

你可能感兴趣的:(hive,hive,大数据,sql)