Hive Advanced Queries: order by, group by

1 Hive offers many advanced query operations; the main ones are:

class = "hljs vbnet" >group by #按K来把数据进行分组
order by #全局排序
join    #两个表进行连接
distribute by  #把数据打散,按字段把数据分到不同的文件里面
sort by  #会把每个reducer上的数据进行排序,局部排序
cluster by  #cluster by 除了具有 distribute by 的功能外还兼具 sort by 的功能。
union all 把多个表进行组合起来形成一个新表

Under the hood, all of these operations are implemented as MapReduce jobs.
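For instance, here is a minimal sketch of how distribute by + sort by relate to cluster by, assuming a hypothetical table logs(uid string, ts bigint) that is not part of this post:

-- Hypothetical table logs(uid string, ts bigint), for illustration only.
-- Rows with the same uid land on the same reducer and are sorted by ts within it:
select uid, ts from logs distribute by uid sort by ts;

-- cluster by uid is shorthand for distribute by uid sort by uid:
select uid, ts from logs cluster by uid;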

2 A few simple aggregate operations

class = "hljs axapta" >count计数
     count(*)     count( 1 )  count(col)
sum求和
     sum(可转成数字的值)返回bigint
     sum(col)+cast( 1 as bigint)
avg求平均值
     avg(可转成数字的值)返回 double
distinct不同值个数
     count(distinct col)
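To see how these compose, here is a small sketch against the M table that is created later in this post (col string, col2 string); since col2 stores digits as text, it is cast explicitly before aggregating:

select count(*)                  as row_cnt,       -- counts all rows
       count(distinct col)       as distinct_keys, -- number of distinct key values
       sum(cast(col2 as bigint)) as total,         -- returns bigint
       avg(cast(col2 as bigint)) as mean           -- returns double
from M;

With the sample data loaded below (1, 5, 2, 3), this yields 4, 3, 11, and 2.75.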

3 order by
This clause sorts the result set by the given columns.
A template:

class = "hljs sql" >    select col1,other...
     from table
     where condition
     order by col1,col2[asc|desc]

Points worth noting about order by:
order by can take multiple columns; the default order is ascending (lexicographic for strings);
order by performs a global sort;
order by requires a reduce phase, and only one reducer is used, no matter how many are configured.
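Because everything funnels through that single reducer, a common pattern (my note, not from the original post) is to cap the output with limit so the reducer only has to emit the top rows. A sketch against a hypothetical table t(col1 string, col2 int):

-- Only the top 5 rows reach the single reducer's output:
select col1, col2
from t
order by col2 desc
limit 5;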

4 Good, now let's put this into practice: create a table M.

class = "hljs vbnet" >hive> create table M(
     > col string,
     > col2 string
     > )
     > row format delimited fields terminated by '\t'
     > lines terminated by '\n'
     > stored as textfile;
OK
Time taken: 0.283 seconds
hive>

Load local data into table M:

class = "hljs applescript" >hive> load data local inpath '/usr/host/M' into table M;
Copying data from file:/usr/host/M
Copying file: file:/usr/host/M
Loading data to table default .m
OK
Time taken: 0.721 seconds
hive>

Next, run some queries:

class = "hljs vhdl" >hive> select * from M;
OK
A   1
B   5
B   2
C   3
Time taken: 0.288 seconds
hive> select * from M order by col desc,col2 asc;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
Job running in-process (local Hadoop)
Hadoop job information for null : number of mappers: 1 ; number of reducers: 1
2016 - 06 - 06 01 : 28 : 20 , 284 null map = 0 %,  reduce = 0 %
2016 - 06 - 06 01 : 28 : 40 , 233 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.18 sec
2016 - 06 - 06 01 : 28 : 43 , 409 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.18 sec
2016 - 06 - 06 01 : 28 : 44 , 480 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.18 sec
2016 - 06 - 06 01 : 28 : 45 , 560 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.18 sec
2016 - 06 - 06 01 : 28 : 46 , 621 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.18 sec
2016 - 06 - 06 01 : 28 : 47 , 676 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.18 sec
2016 - 06 - 06 01 : 28 : 48 , 753 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.18 sec
2016 - 06 - 06 01 : 28 : 49 , 831 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.18 sec
2016 - 06 - 06 01 : 28 : 50 , 918 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.18 sec
2016 - 06 - 06 01 : 28 : 51 , 987 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.18 sec
2016 - 06 - 06 01 : 28 : 53 , 041 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.18 sec
2016 - 06 - 06 01 : 28 : 54 , 137 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.18 sec
2016 - 06 - 06 01 : 28 : 55 , 198 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.18 sec
2016 - 06 - 06 01 : 28 : 56 , 242 null map = 100 %,  reduce = 100 %, Cumulative CPU 1.86 sec
2016 - 06 - 06 01 : 28 : 57 , 284 null map = 100 %,  reduce = 100 %, Cumulative CPU 1.86 sec
2016 - 06 - 06 01 : 28 : 58 , 326 null map = 100 %,  reduce = 100 %, Cumulative CPU 1.86 sec
MapReduce Total cumulative CPU time: 1 seconds 860 msec
Ended Job = job_1465200327080_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
C   3
B   2
B   5
A   1
Time taken: 80.999 seconds
hive>

Note: desc sorts descending, asc ascending. Clearly col is sorted in descending order and col2 in ascending order, which is why the output contains:

class = "hljs " >B   2
B   5

One more thing: in production, order by is reportedly rarely done inside Hive because it is slow; instead, the aggregation is done in Hive and part of the result is exported to a relational database, where the order by runs quickly. I think this is indeed the case, since queries run much faster after importing into MySQL.
group by

The purpose of group by: it groups rows by the values of certain columns, putting rows with the same value together.
A template:

select col1[, col2], count(1), sel_expr (aggregate expression)
from table
where condition
group by col1[, col2]
[having]

Note:
any non-aggregated column after select must appear in the group by;
besides plain columns, only aggregate operations may appear;
group by can also be followed by an expression, e.g. substr(col) — see the sketch below.
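A minimal sketch against the M table from above, using having to keep only keys that occur more than once:

-- Every non-aggregated select column (col) appears in the group by;
-- having filters the groups after aggregation.
select col, count(1) as cnt
from M
group by col
having count(1) > 1;

With the sample data this returns only B, with cnt = 2.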

Now let's experiment:

class = "hljs lasso" >hive> desc M;     
OK
col string 
col2    string 
Time taken: 0.28 seconds
hive> select col from M group by col;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of **reducers:
   set hive.exec.reducers.max=
In order to set a constant number of reducers:
   set mapred.reduce.tasks=**
Job running in-process (local Hadoop)
Hadoop job information for null : number of mappers: 1 ; number of reducers: 1
2016 - 06 - 06 02 : 33 : 50 , 712 null map = 0 %,  reduce = 0 %
2016 - 06 - 06 02 : 34 : 12 , 802 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.53 sec
2016 - 06 - 06 02 : 34 : 13 , 911 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.53 sec
2016 - 06 - 06 02 : 34 : 15 , 018 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.53 sec
2016 - 06 - 06 02 : 34 : 16 , 099 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.53 sec
2016 - 06 - 06 02 : 34 : 17 , 315 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.53 sec
2016 - 06 - 06 02 : 34 : 18 , 452 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.53 sec
2016 - 06 - 06 02 : 34 : 19 , 558 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.53 sec
2016 - 06 - 06 02 : 34 : 20 , 612 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.53 sec
2016 - 06 - 06 02 : 34 : 21 , 699 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.53 sec
2016 - 06 - 06 02 : 34 : 22 , 804 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.53 sec
2016 - 06 - 06 02 : 34 : 23 , 870 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.53 sec
2016 - 06 - 06 02 : 34 : 24 , 937 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.53 sec
2016 - 06 - 06 02 : 34 : 25 , 978 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.53 sec
2016 - 06 - 06 02 : 34 : 27 , 075 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.53 sec
2016 - 06 - 06 02 : 34 : 28 , 145 null map = 100 %,  reduce = 100 %, Cumulative CPU 2.33 sec
2016 - 06 - 06 02 : 34 : 29 , 255 null map = 100 %,  reduce = 100 %, Cumulative CPU 2.33 sec
MapReduce Total cumulative CPU time: 2 seconds 330 msec
Ended Job = job_1465200327080_0002
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
A
B
C
Time taken: 63.381 seconds
hive>

In fact, a group by statement can be used for deduplication.

class = "hljs lasso" >hive> select distinct col from M;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=
In order to set a constant number of reducers:
   set mapred.reduce.tasks=
16 / 06 / 06 02 : 36 : 49 INFO Configuration.deprecation: mapred.job.name is
Job running in-process (local Hadoop)
Hadoop job information for null : number of mappers: 1 ; number of reducers: 1
2016 - 06 - 06 02 : 37 : 00 , 159 null map = 0 %,  reduce = 0 %
2016 - 06 - 06 02 : 37 : 18 , 943 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.1 sec
2016 - 06 - 06 02 : 37 : 20 , 203 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.1 sec
2016 - 06 - 06 02 : 37 : 21 , 344 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.1 sec
2016 - 06 - 06 02 : 37 : 23 , 459 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.1 sec
2016 - 06 - 06 02 : 37 : 24 , 554 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.1 sec
2016 - 06 - 06 02 : 37 : 25 , 589 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.1 sec
2016 - 06 - 06 02 : 37 : 26 , 660 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.1 sec
2016 - 06 - 06 02 : 37 : 27 , 735 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.1 sec
2016 - 06 - 06 02 : 37 : 28 , 815 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.1 sec
2016 - 06 - 06 02 : 37 : 29 , 906 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.1 sec
2016 - 06 - 06 02 : 37 : 30 , 989 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.1 sec
2016 - 06 - 06 02 : 37 : 32 , 054 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.1 sec
2016 - 06 - 06 02 : 37 : 33 , 111 null map = 100 %,  reduce = 100 %, Cumulative CPU 1.96 sec
2016 - 06 - 06 02 : 37 : 34 , 223 null map = 100 %,  reduce = 100 %, Cumulative CPU 1.96 sec
MapReduce Total cumulative CPU time: 1 seconds 960 msec
Ended Job = job_1465200327080_0003
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
A
B
C
Time taken: 55.682 seconds
select distinct col from M; -- same result as the previous statement; usable for deduplication
class = "hljs lasso" >hive> select col from m group by col,col2;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=
In order to set a constant number of reducers:
   set mapred.reduce.tasks=
Job running in-process (local Hadoop)
Hadoop job information for null : number of mappers: 1 ; number of reducers: 1
2016 - 06 - 06 02 : 38 : 48 , 837 null map = 0 %,  reduce = 0 %
2016 - 06 - 06 02 : 39 : 06 , 717 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.24 sec
2016 - 06 - 06 02 : 39 : 08 , 045 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.24 sec
2016 - 06 - 06 02 : 39 : 09 , 271 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.24 sec
2016 - 06 - 06 02 : 39 : 10 , 428 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.24 sec
2016 - 06 - 06 02 : 39 : 11 , 590 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.24 sec
2016 - 06 - 06 02 : 39 : 12 , 696 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.24 sec
2016 - 06 - 06 02 : 39 : 13 , 765 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.24 sec
2016 - 06 - 06 02 : 39 : 14 , 879 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.24 sec
2016 - 06 - 06 02 : 39 : 15 , 949 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.24 sec
2016 - 06 - 06 02 : 39 : 17 , 099 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.24 sec
2016 - 06 - 06 02 : 39 : 18 , 173 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.24 sec
2016 - 06 - 06 02 : 39 : 19 , 281 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.24 sec
2016 - 06 - 06 02 : 39 : 20 , 357 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.24 sec
2016 - 06 - 06 02 : 39 : 21 , 420 null map = 100 %,  reduce = 100 %, Cumulative CPU 2.05 sec
2016 - 06 - 06 02 : 39 : 22 , 560 null map = 100 %,  reduce = 100 %, Cumulative CPU 2.05 sec
MapReduce Total cumulative CPU time: 2 seconds 50 msec
Ended Job = job_1465200327080_0004
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
A
B
B
C
Time taken: 56.956 seconds
hive>

Note that B appears twice here: the grouping key is the pair (col, col2), and B has two distinct col2 values, so it forms two groups.
5 Characteristics:
group by uses a reduce phase and is constrained by the number of reducers, set via the parameter mapred.reduce.tasks;
the number of output files equals the number of reducers, and file sizes depend on how much data each reducer processes.
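To make the file-count point concrete, here is a sketch (the output path /tmp/m_grouped is made up for illustration): force three reducers and write a grouped result out, and three output files should appear under the directory, some possibly empty:

set mapred.reduce.tasks=3;
-- Each reducer writes its own file under the target directory.
insert overwrite directory '/tmp/m_grouped'
select col, count(1) from M group by col;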

Problems: heavy network load;
data skew; tuning parameter: hive.groupby.skewindata

6 So what is data skew?
A simple way to picture it: suppose one key accounts for a huge share of the data. With 10 reducers, 9 of them get little data and finish quickly, while the remaining one gets the huge share, so the 9 sit waiting for that single reducer. In other words, the reducer handling that key is extremely slow.
The way to think about a fix: Hive executes in stages, and the amount of data a map phase handles is determined by the previous stage's reduce output, so distributing the data evenly across the reducers is the root of solving data skew.
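As one hand-rolled illustration of that idea (my sketch, not something the post itself runs): salt the key so a hot key's rows spread across reducers, partially aggregate, then aggregate again without the salt:

-- Inner query: attach a random salt bucket to every row.
-- Middle query: partial counts per (country, salt) spread over many reducers.
-- Outer query: drop the salt and sum the partial counts.
select country, sum(partial_num) as num
from (
  select country, salt, count(1) as partial_num
  from (
    select country, cast(rand() * 10 as int) as salt
    from city1
  ) salted
  group by country, salt
) partial
group by country;

This is essentially what hive.groupby.skewindata=true does automatically, as shown next.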

set mapred.reduce.tasks=5;
select * from M order by col desc, col2 asc;

set hive.groupby.skewindata=true; -- avoids data skew: the total number of jobs becomes 2; this parameter is useful because the two jobs balance the load

class = "hljs sql" >select country,count( 1 ) as num from city1 group by country;

Let's verify:

class = "hljs avrasm" >hive> set hive.groupby.skewindata= true ;
hive> select country,count( 1 ) as num from city1 group by country;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=
In order to set a constant number of reducers:
   set mapred.reduce.tasks=
Job running in-process (local Hadoop)
Hadoop job information for null : number of mappers: 1 ; number of reducers: 1
2016 - 06 - 06 03 : 03 : 27 , 536 null map = 0 %,  reduce = 0 %
2016 - 06 - 06 03 : 03 : 45 , 874 null map = 100 %,  reduce = 0 %
2016 - 06 - 06 03 : 04 : 00 , 051 null map = 100 %,  reduce = 100 %, Cumulative CPU 2.72 sec
2016 - 06 - 06 03 : 04 : 01 , 156 null map = 100 %,  reduce = 100 %, Cumulative CPU 2.72 sec
2016 - 06 - 06 03 : 04 : 02 , 280 null map = 100 %,  reduce = 100 %, Cumulative CPU 2.72 sec
MapReduce Total cumulative CPU time: 2 seconds 720 msec
Ended Job = job_1465200327080_0005
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 2 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=
In order to set a constant number of reducers:
   set mapred.reduce.tasks=
16 / 06 / 06 03 : 04 : 13 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
16 / 06 / 06 03 : 04 : 13 INFO Configuration.deprecation: mapred.system.dir is deprecated. Instead, use mapreduce.jobtracker.system.dir
16 / 06 / 06 03 : 04 : 13 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/root/root_20160606030303_333ea55a-be99-4d58-8c47-ba7f02def6b0.log
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/host/hadoop/hadoop- 2.2 . 0 /share/hadoop/common/lib/slf4j-log4j12- 1.7 . 5 .jar!/org/slf4j/impl/StaticLoggerBinder. class ]
SLF4J: Found binding in [jar:file:/usr/host/hive/lib/slf4j-log4j12- 1.6 . 1 .jar!/org/slf4j/impl/StaticLoggerBinder. class ]
SLF4J: See https: //www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Job running in-process (local Hadoop)
Hadoop job information for null : number of mappers: 1 ; number of reducers: 1
2016 - 06 - 06 03 : 04 : 27 , 453 null map = 0 %,  reduce = 0 %
2016 - 06 - 06 03 : 04 : 41 , 642 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.15 sec
2016 - 06 - 06 03 : 04 : 43 , 014 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.15 sec
2016 - 06 - 06 03 : 04 : 44 , 169 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.15 sec
2016 - 06 - 06 03 : 04 : 45 , 329 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.15 sec
2016 - 06 - 06 03 : 04 : 46 , 549 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.15 sec
2016 - 06 - 06 03 : 04 : 47 , 637 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.15 sec
2016 - 06 - 06 03 : 04 : 49 , 667 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.15 sec
2016 - 06 - 06 03 : 04 : 50 , 747 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.15 sec
2016 - 06 - 06 03 : 04 : 51 , 827 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.15 sec
2016 - 06 - 06 03 : 04 : 52 , 907 null map = 100 %,  reduce = 0 %, Cumulative CPU 1.15 sec
2016 - 06 - 06 03 : 04 : 53 , 969 null map = 100 %,  reduce = 100 %, Cumulative CPU 1.92 sec
2016 - 06 - 06 03 : 04 : 55 , 053 null map = 100 %,  reduce = 100 %, Cumulative CPU 1.92 sec
2016 - 06 - 06 03 : 04 : 56 , 139 null map = 100 %,  reduce = 100 %, Cumulative CPU 1.92 sec
MapReduce Total cumulative CPU time: 1 seconds 920 msec
Ended Job = job_1465200327080_0006
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
china   10
Time taken: 117.33 seconds
hive>

And indeed there are now two jobs.
How hive.groupby.skewindata=true works: it performs load balancing when the data is skewed. With the option set to true, the generated query plan has two MR jobs. In the first MR job, the map output is distributed randomly across the reducers; each reducer does a partial aggregation and emits its results. Because rows with the same group by key may be sent to different reducers, the load gets balanced. The second MR job then distributes the partially aggregated results to the reducers by the group by key (this time guaranteeing that identical group by keys land on the same reducer) and completes the final aggregation.

Summary: to avoid data skew problems with group by or distinct, set hive.groupby.skewindata=true.
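For reference, the tuning knobs that appear in this post, with placeholder values (illustrative, not recommendations):

set mapred.reduce.tasks=5;                            -- fix the number of reducers
set hive.exec.reducers.bytes.per.reducer=1000000000;  -- average load per reducer, in bytes
set hive.exec.reducers.max=99;                        -- upper bound on the reducer count
set hive.groupby.skewindata=true;                     -- two-job plan for skewed group by / distinct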
