1. What sorting means
Sorting arranges a sequence of records in ascending or descending order according to one or more of their key fields.
2. Sorting in Hive
2.1 order by
order by performs a global sort over its input, so it always runs with a single reducer; on a large dataset this can take a long time. As in a relational database, Hive's order by sorts a complete result set; the difference lies in the underlying architecture. The hive.mapred.mode parameter in Hive's hive-site.xml configuration file controls the execution mode: under strict, an order by must be accompanied by a limit (and, for a partitioned table, a partition predicate); under nonstrict, it behaves much like its relational counterpart. Either way, because order by uses only one reducer, a large result set makes for a long-running query.
Note: instead of editing the configuration file, you can run set hive.mapred.mode=nonstrict; or set hive.mapred.mode=strict; to achieve the same effect for the current session.
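For instance, a quick in-session check (a minimal sketch; the printed value is illustrative):
hive> set hive.mapred.mode; -- print the current value for this session
hive.mapred.mode=nonstrict
hive> set hive.mapred.mode=strict; -- switch to strict for this session only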
Tests:
--strict mode not enabled, i.e. nonstrict mode
hive> select id,devid,devname from tb_in_base order by devid;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201307151509_15431, Tracking URL = http://mwtec-50:50030/jobdetails.jsp?jobid=job_201307151509_15431
Kill Command = /home/hadoop/hadoop-0.20.2/bin/hadoop job -Dmapred.job.tracker=mwtec-50:9002 -kill job_201307151509_15431
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-08-05 16:33:21,817 Stage-1 map = 0%, reduce = 0%
2013-08-05 16:33:23,828 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.94 sec
2013-08-05 16:33:24,834 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.94 sec
2013-08-05 16:33:25,843 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.94 sec
2013-08-05 16:33:26,849 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.94 sec
2013-08-05 16:33:27,855 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.94 sec
2013-08-05 16:33:28,860 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.94 sec
2013-08-05 16:33:29,873 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.94 sec
2013-08-05 16:33:30,880 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 0.94 sec
2013-08-05 16:33:31,888 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.51 sec
2013-08-05 16:33:32,893 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.51 sec
2013-08-05 16:33:33,899 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.51 sec
MapReduce Total cumulative CPU time: 2 seconds 510 msec
Ended Job = job_201307151509_15431
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 2.51 sec HDFS Read: 559 HDFS Write: 138 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 510 msec
OK
1 121212 test1
2 131313 test2
3 141414 test3
4 151515 test5
5 161616 test6
6 171717 test7
8 191919 test9overwrite
8 191919 test9overwrite
Time taken: 16.872 seconds
Result: with strict mode off, order by behaves much as it does in a relational database.
--strict mode enabled, no limit after order by
hive> select id,devid from tb_in_base order by devid;
FAILED: Error in semantic analysis: 1:41 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'devid'
Note: the query fails because no limit was specified.
--strict mode enabled, querying a partitioned table without a partition predicate
hive> select * from tb_in_base;
FAILED: Error in semantic analysis: No partition predicate found for Alias "tb_in_base" Table "tb_in_base"
Result: in strict mode, a partitioned table cannot be queried without restricting it to a partition.
--strict mode enabled, with a limit after order by and a partition specified
hive> select * from tb_in_base where job_time=030729 order by devid limit 2;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201307151509_15432, Tracking URL = http://mwtec-50:50030/jobdetails.jsp?jobid=job_201307151509_15432
Kill Command = /home/hadoop/hadoop-0.20.2/bin/hadoop job -Dmapred.job.tracker=mwtec-50:9002 -kill job_201307151509_15432
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-08-05 16:47:32,900 Stage-1 map = 0%, reduce = 0%
2013-08-05 16:47:34,920 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.39 sec
2013-08-05 16:47:35,927 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.39 sec
2013-08-05 16:47:36,934 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.39 sec
2013-08-05 16:47:37,941 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.39 sec
2013-08-05 16:47:38,946 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.39 sec
2013-08-05 16:47:39,953 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.39 sec
2013-08-05 16:47:40,959 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.39 sec
2013-08-05 16:47:41,965 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.39 sec
2013-08-05 16:47:42,971 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.05 sec
2013-08-05 16:47:43,977 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.05 sec
2013-08-05 16:47:44,983 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.05 sec
MapReduce Total cumulative CPU time: 3 seconds 50 msec
Ended Job = job_201307151509_15432
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 3.05 sec HDFS Read: 458 HDFS Write: 44 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 50 msec
OK
1 121212 test1 030729
2 131313 test2 030729
Time taken: 17.597 seconds
Result: in strict mode, order by requires a limit, and if the table is partitioned, a partition predicate must be specified as well.
2.2 sort by
sort by guarantees that the file produced by each reducer is itself sorted; a second merge pass over these sorted files then yields the fully ordered result. sort by has the following characteristics:
1). sort by is largely unaffected by whether hive.mapred.mode is strict or nonstrict, although strict mode still requires a partition predicate when the table is partitioned.
2). within each reducer, sort by orders the data by the specified field(s).
3). the number of reducers used by sort by can be specified, e.g. set mapred.reduce.tasks=5; running a merge sort over the outputs then yields the complete ordered result (see the sketch below).
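One way to realize that final merge inside Hive itself (a hedged sketch over the same tb_in_base table, assuming nonstrict mode): let sort by pre-sort each reducer's partition, then have a final single-reducer order by merge the already-sorted data.
hive> set mapred.reduce.tasks=5; -- five sorted output files, one per reducer
hive> select id,devid from (select id,devid from tb_in_base sort by devid) t order by devid limit 100; -- final merge into one global order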
-- with hive.mapred.mode=nonstrict
hive> select id,devid from tb_in_base sort by devid;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201307151509_15434, Tracking URL = http://mwtec-50:50030/jobdetails.jsp?jobid=job_201307151509_15434
Kill Command = /home/hadoop/hadoop-0.20.2/bin/hadoop job -Dmapred.job.tracker=mwtec-50:9002 -kill job_201307151509_15434
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-08-05 17:15:38,244 Stage-1 map = 0%, reduce = 0%
2013-08-05 17:15:40,258 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.3 sec
2013-08-05 17:15:41,269 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.3 sec
2013-08-05 17:15:42,275 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.3 sec
2013-08-05 17:15:43,281 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.3 sec
2013-08-05 17:15:44,286 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.3 sec
2013-08-05 17:15:45,292 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.3 sec
2013-08-05 17:15:46,298 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.3 sec
2013-08-05 17:15:47,305 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.3 sec
2013-08-05 17:15:48,311 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.07 sec
2013-08-05 17:15:49,317 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.07 sec
2013-08-05 17:15:50,323 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.07 sec
MapReduce Total cumulative CPU time: 3 seconds 70 msec
Ended Job = job_201307151509_15434
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 3.07 sec HDFS Read: 559 HDFS Write: 72 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 70 msec
OK
1 121212
2 131313
3 141414
4 151515
5 161616
6 171717
8 191919
8 191919
Time taken: 15.969 seconds
---hive.mapred.mode set to strict
hive> set hive.mapred.mode=strict;
hive> select * from tb_in_base order by devid; //Note: check whether set hive.mapred.mode=strict; has taken effect
FAILED: Error in semantic analysis: 1:34 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'devid'
---hive.mapred.mode set to strict, no partition specified in the query
hive> select * from tb_in_base sort by devid;
FAILED: Error in semantic analysis: No partition predicate found for Alias "tb_in_base" Table "tb_in_base"
---hive.mapred.mode set to strict, partition specified in the query
hive> select id,devid,job_time from tb_in_base where job_time=030729 sort by devid;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201307151509_15438, Tracking URL = http://mwtec-50:50030/jobdetails.jsp?jobid=job_201307151509_15438
Kill Command = /home/hadoop/hadoop-0.20.2/bin/hadoop job -Dmapred.job.tracker=mwtec-50:9002 -kill job_201307151509_15438
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-08-05 17:22:02,631 Stage-1 map = 0%, reduce = 0%
2013-08-05 17:22:05,645 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.89 sec
2013-08-05 17:22:06,651 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.89 sec
2013-08-05 17:22:07,656 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.89 sec
2013-08-05 17:22:08,662 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.89 sec
2013-08-05 17:22:09,668 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.89 sec
2013-08-05 17:22:10,673 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.89 sec
2013-08-05 17:22:11,679 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.89 sec
2013-08-05 17:22:12,685 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 0.89 sec
2013-08-05 17:22:13,691 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.61 sec
2013-08-05 17:22:14,696 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.61 sec
2013-08-05 17:22:15,704 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.61 sec
MapReduce Total cumulative CPU time: 2 seconds 610 msec
Ended Job = job_201307151509_15438
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 2.61 sec HDFS Read: 458 HDFS Write: 112 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 610 msec
OK
1 121212 030729
2 131313 030729
3 141414 030729
4 151515 030729
5 161616 030729
6 171717 030729
8 191919 030729
Time taken: 18.044 seconds
Result: in strict mode, sort by executes normally without a limit; sort by is only mildly affected by hive.mapred.mode=strict.
---set mapred.reduce.tasks=2, using sort by
hive> set mapred.reduce.tasks=2;
hive> insert overwrite local directory '/tmp/hivetest/sortby' select id,devid,job_time from tb_in_base where job_time=030729 sort by devid; //Note: write the query result into the given directory
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2 //Note: two reducers are used
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201307151509_15466, Tracking URL = http://mwtec-50:50030/jobdetails.jsp?jobid=job_201307151509_15466
Kill Command = /home/hadoop/hadoop-0.20.2/bin/hadoop job -Dmapred.job.tracker=mwtec-50:9002 -kill job_201307151509_15466
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2013-08-05 18:03:48,298 Stage-1 map = 0%, reduce = 0%
2013-08-05 18:03:50,307 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.33 sec
2013-08-05 18:03:51,312 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.33 sec
2013-08-05 18:03:52,317 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.33 sec
2013-08-05 18:03:53,322 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.33 sec
2013-08-05 18:03:54,327 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.33 sec
2013-08-05 18:03:55,333 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.33 sec
2013-08-05 18:03:56,338 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.33 sec
2013-08-05 18:03:57,343 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 1.33 sec
2013-08-05 18:03:58,351 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.57 sec
2013-08-05 18:03:59,356 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.57 sec
2013-08-05 18:04:00,362 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.57 sec
MapReduce Total cumulative CPU time: 4 seconds 570 msec
Ended Job = job_201307151509_15466
Copying data to local directory /tmp/hivetest/sortby
Copying data to local directory /tmp/hivetest/sortby
7 Rows loaded to /tmp/hivetest/sortby
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 2 Cumulative CPU: 4.57 sec HDFS Read: 458 HDFS Write: 112 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 570 msec
OK
Time taken: 16.712 seconds
Inspect the query results under /tmp/hivetest/sortby:
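From the Hive CLI the two per-reducer files can be inspected with a shell escape (a sketch; 000000_0 and 000001_0 are the standard reducer output file names):
hive> !ls /tmp/hivetest/sortby;
hive> !cat /tmp/hivetest/sortby/000000_0;
hive> !cat /tmp/hivetest/sortby/000001_0;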
Result: the reducer count for sort by can be set via mapred.reduce.tasks; the query output is written into two separate reducer files, each individually sorted.
---set mapred.reduce.tasks=2, using order by
hive> set mapred.reduce.tasks=2;
hive> insert overwrite local directory '/tmp/hivetest/orderby' select id,devid,job_time from tb_in_base where job_time=030729 order by devid;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1 //Note: only one reducer is used
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201307151509_15469, Tracking URL = http://mwtec-50:50030/jobdetails.jsp?jobid=job_201307151509_15469
Kill Command = /home/hadoop/hadoop-0.20.2/bin/hadoop job -Dmapred.job.tracker=mwtec-50:9002 -kill job_201307151509_15469
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-08-05 18:14:22,528 Stage-1 map = 0%, reduce = 0%
2013-08-05 18:14:24,538 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2013-08-05 18:14:25,548 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2013-08-05 18:14:26,562 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2013-08-05 18:14:27,568 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2013-08-05 18:14:28,574 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2013-08-05 18:14:29,581 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2013-08-05 18:14:30,587 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2013-08-05 18:14:31,592 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 1.05 sec
2013-08-05 18:14:32,598 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.56 sec
2013-08-05 18:14:33,604 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.56 sec
2013-08-05 18:14:34,611 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.56 sec
MapReduce Total cumulative CPU time: 2 seconds 560 msec
Ended Job = job_201307151509_15469
Copying data to local directory /tmp/hivetest/orderby
Copying data to local directory /tmp/hivetest/orderby
7 Rows loaded to /tmp/hivetest/orderby
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 2.56 sec HDFS Read: 458 HDFS Write: 112 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 560 msec
OK
Time taken: 16.804 seconds
Inspect the query results under /tmp/hivetest/orderby:
Result: with mapred.reduce.tasks=2 and order by, all the data is written to a single file. No matter how many reduce tasks are configured, order by always uses exactly one reducer.
Note: a limit clause can cut the data volume dramatically: with limit n, the number of records shipped to the reducer drops to n × (number of mappers); for example, with limit 100 and 10 mappers, at most 1,000 records reach the reducer. Without it, a very large dataset may never produce a result.
Open question: limit n does shrink the data, but for aggregate statistics, is a result computed over a truncated dataset still meaningful?
2.3 distribute by
distribute by controls how map output is partitioned across reducers. Rows are routed according to the distribute by column(s) and the number of reducers, using a hash by default. For example, distribute by can use length() to split rows across reducers (and hence output files) by the length of a string column; length is a built-in function, and other functions, or user-defined functions, can be used instead.
Note: to observe the effect of distribute by, be sure to run with more than one reducer; otherwise there is nothing to distribute.
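A hedged sketch of the length-based routing mentioned above (the output directory is illustrative):
hive> set mapred.reduce.tasks=2;
hive> insert overwrite local directory '/tmp/hivetest/distbylen' select devname from tb_in_base where job_time=030729 distribute by length(devname);
Rows whose devname values share the same length hash to the same reducer and land in the same output file.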
--distribute by id, with set mapred.reduce.tasks=2;
hive> set mapred.reduce.tasks=2;
hive> insert overwrite local directory '/tmp/hivetest/distributeby' select id,devid,job_time from tb_in_base where job_time=030729 and id<8 distribute by id ;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201307151509_15499, Tracking URL = http://mwtec-50:50030/jobdetails.jsp?jobid=job_201307151509_15499
Kill Command = /home/hadoop/hadoop-0.20.2/bin/hadoop job -Dmapred.job.tracker=mwtec-50:9002 -kill job_201307151509_15499
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2013-08-05 18:37:14,681 Stage-1 map = 0%, reduce = 0%
2013-08-05 18:37:16,691 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.33 sec
2013-08-05 18:37:17,697 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.33 sec
2013-08-05 18:37:18,703 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.33 sec
2013-08-05 18:37:19,710 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.33 sec
2013-08-05 18:37:20,717 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.33 sec
2013-08-05 18:37:21,727 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.33 sec
2013-08-05 18:37:22,733 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.33 sec
2013-08-05 18:37:23,739 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 3.1 sec
2013-08-05 18:37:24,745 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.89 sec
2013-08-05 18:37:25,751 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.89 sec
2013-08-05 18:37:26,757 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.89 sec
MapReduce Total cumulative CPU time: 4 seconds 890 msec
Ended Job = job_201307151509_15499
Copying data to local directory /tmp/hivetest/distributeby
Copying data to local directory /tmp/hivetest/distributeby
7 Rows loaded to /tmp/hivetest/distributeby
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 2 Cumulative CPU: 4.89 sec HDFS Read: 458 HDFS Write: 112 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 890 msec
OK
Time taken: 16.785 seconds
Inspect the data that was written:
Result: distribute by hashes the specified column and writes the query results into different reducer files accordingly; which reducer file a row goes to is decided on the map side.
--distribute by job_time, with set mapred.reduce.tasks=2;
hive> set mapred.reduce.tasks=2;
hive> insert overwrite local directory '/tmp/hivetest/distributeby' select id,devid,job_time from tb_in_base where job_time=030729 distribute by job_time;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201307151509_15500, Tracking URL = http://mwtec-50:50030/jobdetails.jsp?jobid=job_201307151509_15500
Kill Command = /home/hadoop/hadoop-0.20.2/bin/hadoop job -Dmapred.job.tracker=mwtec-50:9002 -kill job_201307151509_15500
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2013-08-05 18:42:07,764 Stage-1 map = 0%, reduce = 0%
2013-08-05 18:42:10,778 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.61 sec
2013-08-05 18:42:11,784 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.61 sec
2013-08-05 18:42:12,789 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.61 sec
2013-08-05 18:42:13,795 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.61 sec
2013-08-05 18:42:14,801 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.61 sec
2013-08-05 18:42:15,810 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.61 sec
2013-08-05 18:42:16,816 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.61 sec
2013-08-05 18:42:17,821 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 0.61 sec
2013-08-05 18:42:18,827 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.89 sec
2013-08-05 18:42:19,833 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.89 sec
2013-08-05 18:42:20,839 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.89 sec
MapReduce Total cumulative CPU time: 3 seconds 890 msec
Ended Job = job_201307151509_15500
Copying data to local directory /tmp/hivetest/distributeby
Copying data to local directory /tmp/hivetest/distributeby
7 Rows loaded to /tmp/hivetest/distributeby
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 2 Cumulative CPU: 3.89 sec HDFS Read: 458 HDFS Write: 112 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 890 msec
OK
Time taken: 16.678 seconds
Inspect the query results:
Result: rows with the same hash value are assigned to the same reducer.
Additional check:
---does sort by also distribute data by hash, the way distribute by does?
set mapred.reduce.tasks=2;
hive> insert overwrite local directory '/tmp/hivetest/sortby' select id,devid,job_time from tb_in_base where job_time=030729 sort by id;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201307151509_15501, Tracking URL = http://mwtec-50:50030/jobdetails.jsp?jobid=job_201307151509_15501
Kill Command = /home/hadoop/hadoop-0.20.2/bin/hadoop job -Dmapred.job.tracker=mwtec-50:9002 -kill job_201307151509_15501
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2013-08-05 18:57:33,616 Stage-1 map = 0%, reduce = 0%
2013-08-05 18:57:35,625 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.16 sec
2013-08-05 18:57:36,631 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.16 sec
2013-08-05 18:57:37,636 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.16 sec
2013-08-05 18:57:38,642 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.16 sec
2013-08-05 18:57:39,648 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.16 sec
2013-08-05 18:57:40,653 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.16 sec
2013-08-05 18:57:41,659 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.16 sec
2013-08-05 18:57:42,669 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.16 sec
2013-08-05 18:57:43,675 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 3.02 sec
2013-08-05 18:57:44,681 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.68 sec
2013-08-05 18:57:45,687 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.68 sec
2013-08-05 18:57:46,693 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.68 sec
MapReduce Total cumulative CPU time: 4 seconds 680 msec
Ended Job = job_201307151509_15501
Copying data to local directory /tmp/hivetest/sortby
Copying data to local directory /tmp/hivetest/sortby
7 Rows loaded to /tmp/hivetest/sortby
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 2 Cumulative CPU: 4.68 sec HDFS Read: 458 HDFS Write: 112 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 680 msec
OK
Query results:
Result: sort by does not hash rows into different reducer files the way distribute by does.
2.4 cluster by
Beyond what distribute by does, cluster by also sorts on the same column, so cluster by = distribute by + sort by.
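That is, the following two queries should behave the same (a sketch over the same table):
hive> select id,devid,job_time from tb_in_base where job_time=030729 cluster by id;
hive> select id,devid,job_time from tb_in_base where job_time=030729 distribute by id sort by id;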
---can cluster by take asc or desc?
hive> select id,devid,job_time from tb_in_base where job_time=030729 cluster by id desc;
FAILED: Parse Error: line 1:77 mismatched input 'desc' expecting EOF near 'id'
hive> select id,devid,job_time from tb_in_base where job_time=030729 cluster by id asc;
FAILED: Parse Error: line 1:77 mismatched input 'asc' expecting EOF near 'id'
Note: cluster by does not accept a sort direction; it sorts in ascending order.
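Because cluster by = distribute by + sort by, a descending per-reducer sort can still be obtained by spelling the two clauses out (a sketch):
hive> select id,devid,job_time from tb_in_base where job_time=030729 distribute by id sort by id desc;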
hive> set mapred.reduce.tasks=2;
hive> insert overwrite local directory '/tmp/hivetest/clusterby' select id,devid,job_time from tb_in_base where job_time=030729 cluster by id;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201307151509_15532, Tracking URL = http://mwtec-50:50030/jobdetails.jsp?jobid=job_201307151509_15532
Kill Command = /home/hadoop/hadoop-0.20.2/bin/hadoop job -Dmapred.job.tracker=mwtec-50:9002 -kill job_201307151509_15532
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2013-08-05 19:41:15,138 Stage-1 map = 0%, reduce = 0%
2013-08-05 19:41:17,147 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.28 sec
2013-08-05 19:41:18,153 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.28 sec
2013-08-05 19:41:19,158 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.28 sec
2013-08-05 19:41:20,163 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.28 sec
2013-08-05 19:41:21,169 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.28 sec
2013-08-05 19:41:22,174 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.28 sec
2013-08-05 19:41:23,180 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.28 sec
2013-08-05 19:41:24,186 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.28 sec
2013-08-05 19:41:25,193 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 2.5 sec
2013-08-05 19:41:26,199 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.34 sec
2013-08-05 19:41:27,205 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.34 sec
2013-08-05 19:41:28,210 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.34 sec
MapReduce Total cumulative CPU time: 4 seconds 340 msec
Ended Job = job_201307151509_15532
Copying data to local directory /tmp/hivetest/clusterby
Copying data to local directory /tmp/hivetest/clusterby
7 Rows loaded to /tmp/hivetest/clusterby
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 2 Cumulative CPU: 4.34 sec HDFS Read: 458 HDFS Write: 112 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 340 msec
OK
Time taken: 17.637 seconds
Query results after writing to the files:
Analysis: a hash is applied, and the query results are written into different reducer files according to their hash values.
hive> set mapred.reduce.tasks=2;
hive> insert overwrite local directory '/tmp/hivetest/clusterby' select id,devid,job_time from tb_in_base where job_time=030729 cluster by job_time;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201307151509_15533, Tracking URL = http://mwtec-50:50030/jobdetails.jsp?jobid=job_201307151509_15533
Kill Command = /home/hadoop/hadoop-0.20.2/bin/hadoop job -Dmapred.job.tracker=mwtec-50:9002 -kill job_201307151509_15533
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2013-08-05 19:43:38,722 Stage-1 map = 0%, reduce = 0%
2013-08-05 19:43:40,732 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2013-08-05 19:43:41,738 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2013-08-05 19:43:42,743 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2013-08-05 19:43:43,748 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2013-08-05 19:43:44,754 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2013-08-05 19:43:45,759 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2013-08-05 19:43:46,765 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2013-08-05 19:43:47,770 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 1.5 sec
2013-08-05 19:43:48,776 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 3.02 sec
2013-08-05 19:43:49,781 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.55 sec
2013-08-05 19:43:50,787 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.55 sec
MapReduce Total cumulative CPU time: 4 seconds 550 msec
Ended Job = job_201307151509_15533
Copying data to local directory /tmp/hivetest/clusterby
Copying data to local directory /tmp/hivetest/clusterby
7 Rows loaded to /tmp/hivetest/clusterby
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 2 Cumulative CPU: 4.55 sec HDFS Read: 458 HDFS Write: 112 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 550 msec
OK
Time taken: 16.613 seconds
Query results after writing to the files:
Result: cluster by hashes the specified column; rows with equal hash values are placed in the same reducer file.
3. Summary and analysis
1). order by uses a single reducer to sort all the data, which takes a long time on a large dataset; it is best reserved for small result sets.
2). order by behaviour is governed by hive.mapred.mode: under strict, a limit is required (plus a partition predicate if the table is partitioned); under nonstrict, it works much as in a relational database.
3). sort by is essentially unaffected by hive.mapred.mode; the number of reducers can be set via mapred.reduce.tasks, and the query output is spread across those reducers.
4). sort by orders the data before it enters each reducer. With mapred.reduce.tasks > 1, sort by only guarantees that each reducer's output is sorted, not that the data is globally ordered.
5). distribute by uses a hash: on the map side, rows whose distribute by columns hash to the same value are sent to the same reducer file.
6). distribute by can use length() to split rows across reducers (and output files) by string length; length is a built-in function, and other functions or user-defined functions can be used as well.
7). beyond what distribute by does, cluster by also sorts on the same column, so cluster by = distribute by + sort by.
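As a quick reference, the four clauses side by side (a hedged sketch; t and c are placeholder names):
hive> select * from t order by c limit 10; -- global order, always exactly one reducer
hive> select * from t sort by c; -- each reducer's output sorted, no global order
hive> select * from t distribute by c; -- route rows to reducers by hash(c), no sorting
hive> select * from t cluster by c; -- distribute by c + sort by c (ascending)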