建表语句
create table BD34 (id int,name string, score int,sex int,age int) row format delimited fields terminated by ‘,’;
create table BD12 (id int,name string, score int,sex int,age int) row format delimited fields terminated by ‘,’;
create table JAVA12 (id int,name string, score int,sex int,age int) row format delimited fields terminated by ‘,’;
create table JAVA34 (id int,name string, score int,sex int,age int) row format delimited fields terminated by ‘,’;
数据加载语句
load data local inpath ‘/opt/hive/BD12.txt’ overwrite into table BD34;
load data local inpath ‘/opt/hive/BD34.txt’ overwrite into table BD12;
load data local inpath ‘/opt/hive/JAVA12.txt’ overwrite into table JAVA34;
load data local inpath ‘/opt/hive/JAVA34.txt’ overwrite into table JAVA12;
BD12.txt
1,BD12name1,1,19
2,BD12name2,1,20
3,BD12name3,1,21
4,BD12name4,1,22
5,BD12name5,1,23
6,BD12name6,1,24
7,BD12name7,1,25
BD34.txt
3,BD34name1,1,21
4,BD34name2,1,22
5,BD34name3,1,23
6,BD34name4,1,24
7,BD34name5,1,25
8,BD34name6,1,26
9,BD34name7,1,27
JAVA34.txt
3,JAVA12name1,1,24
4,JAVA12name2,1,25
5,JAVA12name3,1,26
6,JAVA12name4,1,27
7,JAVA12name5,1,28
8,JAVA12name6,1,29
9,JAVA12name7,1,30
JAVA12.txt
1,JAVA12name1,1,24
2,JAVA12name2,1,25
3,JAVA12name3,1,26
4,JAVA12name4,1,27
5,JAVA12name5,1,28
6,JAVA12name6,0,29
7,JAVA12name7,0,30
GROUP BY语句通常会和聚合函数一起使用,按照一个或者多个列队结果进行分组,然后对每个组执行聚合操作。
案例实操:
(1)计算每个学生的平均分数
select s_id ,avg(s_score) from score group by s_id;
(2)计算每个学生最高成绩
select s_id ,max(s_score) from score group by s_id;
结论1:group by的字段,必须是select后面的字段,而且需要和select后面的字段相同
错误示例: select id,score from bd34 group by id;
正确示例: select id,score from bd34 group by id,score;
结论2: select 后面的字段可以不出现在,group by 内,但必须使用聚合函数
错误示例: select id,score from bd34 group by id;
正确示例:
select id,max(score) from bd34 group by id;
select id,max(score),min(age) from bd34 group by id;
注: 注意group by的字段,必须是select后面的字段,select后面的字段不能比group by的字段多
group by语法中出现在select 后面的字段两个要求
1字段是分组字段
2必须使用聚合函数应用
select by s_id,s_score from score group by s_id;
1)having与where不同点
(1)where针对表中的列发挥作用,查询数据;having针对查询结果中的列发挥作用,筛选数据。
select id,name,score from bd34 group by id,name,score;
select id,name,score from bd34
group by id,name,score having age>18 ;
正确示例:
select id,name,score from bd34
group by id,name,score having id >5 ;
(2)where后面不能写分组函数,而having后面可以使用分组函数。
select id,name,score from bd34
group by id,name,score having avg(id) >5;
(3)having只用于group by分组统计语句。
2)案例实操:
求每个学生的平均分数
select s_id ,avg(s_score) from score group by s_id;
求每个学生平均分数大于85的人
select s_id ,avg(s_score) avgscore
from score group by s_id having avgscore > 85;
Distribute By:类似MR中partition,进行分区,结合sort by使用。
注意,Hive要求DISTRIBUTE BY语句要写在SORT BY语句之前。
对于distribute by进行测试,一定要分配多reduce进行处理,否则无法看到distribute by的效果。
案例实操:
(1)先按照学生id进行分区,再按照学生成绩进行排序。
设置reduce的个数,将我们对应的s_id划分到对应的reduce当中去
set mapreduce.job.reduces=3;
通过distribute by 进行数据的分区
insert overwrite local directory
'/export/servers/hivedatas/sort' select * from score distribute by s_id ;
Sort By:每个MapReduce内部进行排序,对全局结果集来说不是排序。
1)设置reduce个数
set mapreduce.job.reduces=3;
2)查看设置reduce个数
set mapreduce.job.reduces;
3)查询成绩,使用s_id将成绩分为三个reduce,并按照成绩降序排列
select * from score distribute by s_id sort by s_score;
1)将查询结果导入到文件中(按照成绩降序排列)
insert overwrite local directory
'/export/servers/hivedatas/sort1' select * from score distribute by s_id sort by s_score;
当distribute by和sort by字段相同时,可以使用cluster by方式。
D+S=C(DS字段相同)
Distribute by+Sort by=Cluster by
cluster by除了具有distribute by的功能外还兼具sort by的功能。但是排序只能是倒序排序,不能指定排序规则为ASC或者DESC。
1)以下两种写法等价
select * from score cluster by s_id;
select * from score distribute by s_id sort by s_id;