现在有一个面试题
北京市学生成绩分析
exercise5_1.txt 文件中的每一行就是一个学生的成绩信息。字段之间的分隔符是","
时间,学校,年纪,姓名,科目,成绩
2013,北大,1,黄渤,语文,97
2013,北大,1,徐峥,语文,52
2013,北大,1,刘德华,语文,85
2012,清华,0,马云,英语,61
2015,北理工,3,李彦宏,物理,81
2016,北科,4,马化腾,化学,92
2014,北航,2,刘强东,数学,70
2012,清华,0,刘诗诗,英语,59
2014,北航,2,刘亦菲,数学,49
2014,北航,2,刘嘉玲,数学,77
create database if not exists exercise;
use exercise;
drop table if exists exercise5_1;
create table exercise5_1(year int, school string, grade int, name string, course string, score int) row format delimited fields terminated by ',';
load data local inpath "/home/hadoop/exercise5_1.txt" into table exercise5_1;
select * from exercise5_1;
desc exercise5_1;
select t.*
from
(select school, grade, course, score,
row_number() over (partition by school, grade, course order by score desc) rank_code
from exercise5_1
where year = "2017"
) t
where t.rank_code <= 3;
详解如下:
row_number函数:row_number() 按指定的列进行分组生成行序列,从 1 开始,如果两行记录的分组列相同,则行序列 +1。
over 函数:是一个窗口函数。
over (order by score) 按照 score 排序进行累计,order by 是个默认的开窗函数。
over (partition by grade) 按照班级分区。
over (partition by grade order by score) 按照班级分区,并按着分数排序。
over (order by score range between 2 preceding and 2 following) 窗口范围为当前行的数据幅度减2加2后的范围内的数据求和。
select school, grade, course, score,
sum(score) over (order by score range between 2 preceding and 2 following) sscore
from exercise5_1
where year = "2017" and school="北航";
select school, grade, name, sum(score) as total_score,
count(1) over (partition by school, grade) nct
from exercise5_1
where year = "2017" and school="清华" and grade = 1
group by school, grade, name
having total_score > 200;
having 是分组(group by)后的筛选条件,分组后的数据组内再筛选,也就是说 HAVING 子句可以让我们筛选成组后的各组数据。
where 则是在分组,聚合前先筛选记录。也就是说作用在 GROUP BY 子句和 HAVING 子句前。