Previous posts have already covered Hive's architecture, Hive data definition, and loading data into Hive. Next up is querying in Hive. Overall, the query side feels quite close to MySQL, though it is not identical; I will point out the individual differences as they come up.
Here is the information for the tables created in the earlier posts, together with a simple query so you can see part of the data.
```
hive> show tables;
OK
salaries
salaries_external
salaries_partition
wt
Time taken: 0.022 seconds, Fetched: 4 row(s)
hive> select * from salaries_external limit 10;
OK
1985    BAL     AL      murraed02       1472819.0
1985    BAL     AL      lynnfr01        1090000.0
1985    BAL     AL      ripkeca01       800000.0
1985    BAL     AL      lacyle01        725000.0
1985    BAL     AL      flanami01       641667.0
1985    BAL     AL      boddimi01       625000.0
1985    BAL     AL      stewasa01       581250.0
1985    BAL     AL      martide01       560000.0
1985    BAL     AL      roeniga01       558333.0
1985    BAL     AL      mcgresc01       547143.0
Time taken: 1.142 seconds, Fetched: 10 row(s)
hive>
```
```
hive> describe salaries_external;
OK
yearid              int         year
teamid              string      team
lgid                string
playerid            string
salary              float
Time taken: 0.148 seconds, Fetched: 5 row(s)
hive>
```
Basic queries here hardly differ from MySQL, so I will not walk through many of them; one example should be enough for reference.
```
hive>
    > select yearid, teamid from salaries_external limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475137014881_0001, Tracking URL = http://hadoopwy1:8088/proxy/application_1475137014881_0001/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1475137014881_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-29 01:38:37,119 Stage-1 map = 0%, reduce = 0%
2016-09-29 01:39:37,511 Stage-1 map = 0%, reduce = 0%
2016-09-29 01:40:37,574 Stage-1 map = 0%, reduce = 0%
2016-09-29 01:40:52,968 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.96 sec
MapReduce Total cumulative CPU time: 1 seconds 960 msec
Ended Job = job_1475137014881_0001
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.96 sec   HDFS Read: 4422 HDFS Write: 90 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 960 msec
OK
1985    BAL
1985    BAL
1985    BAL
1985    BAL
1985    BAL
1985    BAL
1985    BAL
1985    BAL
1985    BAL
1985    BAL
Time taken: 338.211 seconds, Fetched: 10 row(s)
```
Hive provides the following built-in functions:

| Return Type | Signature | Description |
|---|---|---|
| BIGINT | round(double a) | Returns the rounded BIGINT value of the double. |
| BIGINT | floor(double a) | Returns the maximum BIGINT value that is equal to or less than the double. |
| BIGINT | ceil(double a) | Returns the minimum BIGINT value that is equal to or greater than the double. |
| double | rand(), rand(int seed) | Returns a random number that changes from row to row. |
| string | concat(string A, string B, ...) | Returns the string resulting from concatenating B after A. |
| string | substr(string A, int start) | Returns the substring of A starting from the start position until the end of A. |
| string | substr(string A, int start, int length) | Returns the substring of A starting from the start position with the given length. |
| string | upper(string A) | Returns the string resulting from converting all characters of A to upper case. |
| string | ucase(string A) | Same as above. |
| string | lower(string A) | Returns the string resulting from converting all characters of A to lower case. |
| string | lcase(string A) | Same as above. |
| string | trim(string A) | Returns the string resulting from trimming spaces from both ends of A. |
| string | ltrim(string A) | Returns the string resulting from trimming spaces from the beginning (left-hand side) of A. |
| string | rtrim(string A) | Returns the string resulting from trimming spaces from the end (right-hand side) of A. |
| string | regexp_replace(string A, string B, string C) | Returns the string resulting from replacing all substrings in A that match the Java regular expression B with C. |
| int | size(Map<K.V>) | Returns the number of elements in the map type. |
| int | size(Array<T>) | Returns the number of elements in the array type. |
| value of <type> | cast(<expr> as <type>) | Converts the result of expression expr to <type>; for example, cast('1' as BIGINT) converts the string '1' to its integral representation. NULL is returned if the conversion does not succeed. |
| string | from_unixtime(int unixtime) | Converts the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing that moment in the current system time zone, in the format "1970-01-01 00:00:00". |
| string | to_date(string timestamp) | Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01". |
| int | year(string date) | Returns the year part of a date or timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970. |
| int | month(string date) | Returns the month part of a date or timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11. |
| int | day(string date) | Returns the day part of a date or timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1. |
| string | get_json_object(string json_string, string path) | Extracts a JSON object from a JSON string based on the specified JSON path, and returns the JSON string of the extracted object. Returns NULL if the input JSON string is invalid. |
```
hive> select concat(playerid, salary) from salaries_external limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475137014881_0004, Tracking URL = http://hadoopwy1:8088/proxy/application_1475137014881_0004/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1475137014881_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-29 02:03:04,828 Stage-1 map = 0%, reduce = 0%
2016-09-29 02:03:25,653 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.8 sec
MapReduce Total cumulative CPU time: 1 seconds 800 msec
Ended Job = job_1475137014881_0004
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.8 sec   HDFS Read: 4422 HDFS Write: 180 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 800 msec
OK
murraed021472819.0
lynnfr011090000.0
ripkeca01800000.0
lacyle01725000.0
flanami01641667.0
boddimi01625000.0
stewasa01581250.0
martide01560000.0
roeniga01558333.0
mcgresc01547143.0
Time taken: 42.353 seconds, Fetched: 10 row(s)
hive> select concat(playerid, concat('->', salary)) from salaries_external limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475137014881_0005, Tracking URL = http://hadoopwy1:8088/proxy/application_1475137014881_0005/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1475137014881_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-29 02:04:23,813 Stage-1 map = 0%, reduce = 0%
2016-09-29 02:04:32,562 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.41 sec
MapReduce Total cumulative CPU time: 1 seconds 410 msec
Ended Job = job_1475137014881_0005
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.41 sec   HDFS Read: 4422 HDFS Write: 200 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 410 msec
OK
murraed02->1472819.0
lynnfr01->1090000.0
ripkeca01->800000.0
lacyle01->725000.0
flanami01->641667.0
boddimi01->625000.0
stewasa01->581250.0
martide01->560000.0
roeniga01->558333.0
mcgresc01->547143.0
Time taken: 28.394 seconds, Fetched: 10 row(s)
hive>
```
Both queries use the built-in string concatenation function: the first concatenates the two fields directly, while the second first concatenates '->' with the salary field and then concatenates playerid in front of the result.
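The other built-in functions follow the same pattern. Here is a hypothetical sketch (not executed above) combining a few more of them against salaries_external:

```sql
-- Hypothetical examples (not run above): a few more of the built-in
-- functions from the table, applied to salaries_external.
select upper(teamid),              -- convert to upper case
       substr(playerid, 1, 6),     -- first six characters of the player id
       cast(salary as bigint),     -- float -> BIGINT, NULL if it fails
       concat(teamid, '-', lgid)   -- concat accepts any number of strings
from salaries_external limit 10;
```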
Hive supports the following built-in aggregate functions. They are used just like SQL aggregate functions.
| Return Type | Signature | Description |
|---|---|---|
| BIGINT | count(*), count(expr) | count(*) returns the total number of retrieved rows; count(expr) counts the rows for which expr is not NULL. |
| DOUBLE | sum(col), sum(DISTINCT col) | Returns the sum of the elements in the group, or the sum of the distinct values of the column in the group. |
| DOUBLE | avg(col), avg(DISTINCT col) | Returns the average of the elements in the group, or the average of the distinct values of the column in the group. |
| DOUBLE | min(col) | Returns the minimum value of the column in the group. |
| DOUBLE | max(col) | Returns the maximum value of the column in the group. |
```
hive>
    > select count(*) from salaries_external;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1475137014881_0002, Tracking URL = http://hadoopwy1:8088/proxy/application_1475137014881_0002/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1475137014881_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-09-29 01:43:59,754 Stage-1 map = 0%, reduce = 0%
2016-09-29 01:45:00,769 Stage-1 map = 0%, reduce = 0%
2016-09-29 01:46:01,222 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 1.87 sec
2016-09-29 01:46:28,834 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.82 sec
2016-09-29 01:46:58,562 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.19 sec
MapReduce Total cumulative CPU time: 4 seconds 190 msec
Ended Job = job_1475137014881_0002
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.43 sec   HDFS Read: 1354022 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 430 msec
OK
46284
Time taken: 242.819 seconds, Fetched: 1 row(s)
```
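The remaining aggregates work the same way; a hypothetical sketch (not run above) over the same table:

```sql
-- Hypothetical sketch (not executed above): the remaining aggregate
-- functions from the table, computed over all of salaries_external.
select min(salary),            -- smallest salary
       max(salary),            -- largest salary
       sum(salary),            -- total of all salaries
       avg(salary),            -- mean salary
       count(distinct teamid)  -- number of distinct teams
from salaries_external;
```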
Nested queries come up frequently in real-world Hive use. Here is one example for reference; design yours according to the actual business requirements.
```
hive> from (
    >   select * from salaries_external where yearid = 2012
    > ) e
    > select e.yearid as year, e.playerid as player
    > where e.salary > 10000 limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475137014881_0007, Tracking URL = http://hadoopwy1:8088/proxy/application_1475137014881_0007/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1475137014881_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-29 02:47:21,350 Stage-1 map = 0%, reduce = 0%
2016-09-29 02:47:29,194 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.78 sec
MapReduce Total cumulative CPU time: 1 seconds 780 msec
Ended Job = job_1475137014881_0007
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.78 sec   HDFS Read: 655686 HDFS Write: 149 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 780 msec
OK
2012    markani01
2012    roberbr01
2012    reynoma01
2012    hardyjj01
2012    jonesad01
2012    greggke01
2012    hammeja01
2012    lindsma01
2012    chenwe02
2012    johnsji04
Time taken: 29.807 seconds, Fetched: 10 row(s)
```
Hive also supports CASE ... WHEN ... THEN ... END expressions, which evaluate like a chain of if/else branches:

```
hive> select yearid, salary,
    > case
    >   when salary < 10000 then 'low'
    >   when salary >= 10000 and salary < 20000 then 'Mid'
    >   else 'High'
    > end as bracket from salaries_external limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475137014881_0008, Tracking URL = http://hadoopwy1:8088/proxy/application_1475137014881_0008/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1475137014881_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-29 03:03:52,323 Stage-1 map = 0%, reduce = 0%
2016-09-29 03:04:10,418 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.87 sec
MapReduce Total cumulative CPU time: 1 seconds 870 msec
Ended Job = job_1475137014881_0008
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.87 sec   HDFS Read: 4422 HDFS Write: 192 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 870 msec
OK
1985    1472819.0       High
1985    1090000.0       High
1985    800000.0        High
1985    725000.0        High
1985    641667.0        High
1985    625000.0        High
1985    581250.0        High
1985    560000.0        High
1985    558333.0        High
1985    547143.0        High
Time taken: 31.771 seconds, Fetched: 10 row(s)
```
```
hive>
    > select * from salaries_external limit 10;
OK
1985    BAL     AL      murraed02       1472819.0
1985    BAL     AL      lynnfr01        1090000.0
1985    BAL     AL      ripkeca01       800000.0
1985    BAL     AL      lacyle01        725000.0
1985    BAL     AL      flanami01       641667.0
1985    BAL     AL      boddimi01       625000.0
1985    BAL     AL      stewasa01       581250.0
1985    BAL     AL      martide01       560000.0
1985    BAL     AL      roeniga01       558333.0
1985    BAL     AL      mcgresc01       547143.0
Time taken: 0.204 seconds, Fetched: 10 row(s)
```
A simple query like the one above does not trigger a MapReduce job; the same goes for any plain select * from tableName statement. Notice that a query filtering only on a partition column also returns without launching a job:
```
hive> select * from salaries_partition where yearid = 1985 limit 10;
OK
Time taken: 0.705 seconds
```
But what about a table that is not partitioned? The answer is no: the same filter against the non-partitioned table launches a MapReduce job.
```
hive>
    > select * from salaries_external where yearid = 1985 limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475137014881_0009, Tracking URL = http://hadoopwy1:8088/proxy/application_1475137014881_0009/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1475137014881_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-29 03:13:42,533 Stage-1 map = 0%, reduce = 0%
2016-09-29 03:14:11,089 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.37 sec
MapReduce Total cumulative CPU time: 2 seconds 370 msec
Ended Job = job_1475137014881_0009
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 2.37 sec   HDFS Read: 4422 HDFS Write: 310 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 370 msec
OK
1985    BAL     AL      murraed02       1472819.0
1985    BAL     AL      lynnfr01        1090000.0
1985    BAL     AL      ripkeca01       800000.0
1985    BAL     AL      lacyle01        725000.0
1985    BAL     AL      flanami01       641667.0
1985    BAL     AL      boddimi01       625000.0
1985    BAL     AL      stewasa01       581250.0
1985    BAL     AL      martide01       560000.0
1985    BAL     AL      roeniga01       558333.0
1985    BAL     AL      mcgresc01       547143.0
Time taken: 46.694 seconds, Fetched: 10 row(s)
```
Seen from this angle, partitioning is one of the first things to consider when optimizing Hive queries; at least that is my take. That said, don't partition blindly just because something needs optimizing: choose partitions according to your actual business and query patterns. Note also that a filter on a non-partition column still launches a MapReduce job, even against a partitioned table:
```
hive> select * from salaries_partition where playerid like '%AL' limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475137014881_0010, Tracking URL = http://hadoopwy1:8088/proxy/application_1475137014881_0010/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1475137014881_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-29 03:56:20,951 Stage-1 map = 0%, reduce = 0%
2016-09-29 03:56:37,598 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.89 sec
MapReduce Total cumulative CPU time: 1 seconds 890 msec
Ended Job = job_1475137014881_0010
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.89 sec   HDFS Read: 44567 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 890 msec
OK
Time taken: 29.258 seconds
```
A GROUP BY clause is usually used together with aggregate functions: rows are grouped by the columns listed after GROUP BY, and the aggregate function is then applied to each group.
```
hive> select avg(salary) from salaries_partition where yearid = 2012 group by playerid limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1475147088438_0001, Tracking URL = http://hadoopwy1:8088/proxy/application_1475147088438_0001/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1475147088438_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-09-29 04:06:27,538 Stage-1 map = 0%, reduce = 0%
2016-09-29 04:06:43,779 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
2016-09-29 04:06:54,461 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.58 sec
MapReduce Total cumulative CPU time: 2 seconds 580 msec
Ended Job = job_1475147088438_0001
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 2.58 sec   HDFS Read: 44567 HDFS Write: 96 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 580 msec
OK
500000.0
485000.0
9000000.0
1200000.0
2100000.0
875000.0
4400000.0
5000000.0
1075000.0
495000.0
Time taken: 40.241 seconds, Fetched: 10 row(s)
```
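Note that the output above shows only the averages, with no indication of which player each belongs to. A slightly more readable variant (not run above) also selects the grouping key and aliases the aggregate:

```sql
-- Hypothetical variant: include the grouping key in the output and give
-- the aggregate column a name.
select playerid, avg(salary) as avg_salary
from salaries_partition
where yearid = 2012
group by playerid
limit 10;
```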
HAVING filters the groups after aggregation, something a WHERE clause cannot do:

```
hive> select avg(salary) from salaries_partition where yearid = 2012 group by playerid having avg(salary) > 1000000 limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1475147088438_0002, Tracking URL = http://hadoopwy1:8088/proxy/application_1475147088438_0002/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1475147088438_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-09-29 04:11:40,153 Stage-1 map = 0%, reduce = 0%
2016-09-29 04:12:07,548 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.88 sec
2016-09-29 04:12:31,211 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.07 sec
MapReduce Total cumulative CPU time: 4 seconds 70 msec
Ended Job = job_1475147088438_0002
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.07 sec   HDFS Read: 44567 HDFS Write: 100 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 70 msec
OK
9000000.0
1200000.0
2100000.0
4400000.0
5000000.0
1075000.0
1400000.0
2200000.0
3250000.0
1300000.0
Time taken: 66.952 seconds, Fetched: 10 row(s)
```
Next let's look at JOIN queries; along the way I will point out some statement forms that Hive does not support. First, a normal equi-join:
```
hive> select a.yearid, a.salary
    > from salaries_external a join salaries_partition b
    > on a.yearid = b.yearid
    > where a.yearid = 2012 limit 10;
Total jobs = 1
16/09/29 04:43:06 WARN conf.Configuration: file:/tmp/root/hive_2016-09-29_04-43-03_599_2586737001473820252-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
16/09/29 04:43:06 WARN conf.Configuration: file:/tmp/root/hive_2016-09-29_04-43-03_599_2586737001473820252-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
Execution log at: /tmp/root/root_20160929044343_55cc7606-3f32-4f0e-ac77-fc9d5049dd5a.log
2016-09-29 04:43:07     Starting to launch local task to process map join;      maximum memory = 518979584
2016-09-29 04:43:08     Dump the side-table into file: file:/tmp/root/hive_2016-09-29_04-43-03_599_2586737001473820252-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable
2016-09-29 04:43:08     Uploaded 1 File to: file:/tmp/root/hive_2016-09-29_04-43-03_599_2586737001473820252-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (7093 bytes)
2016-09-29 04:43:08     End of local task; Time Taken: 1.164 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475147088438_0004, Tracking URL = http://hadoopwy1:8088/proxy/application_1475147088438_0004/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1475147088438_0004
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2016-09-29 04:43:19,720 Stage-3 map = 0%, reduce = 0%
2016-09-29 04:43:27,527 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.71 sec
MapReduce Total cumulative CPU time: 1 seconds 710 msec
Ended Job = job_1475147088438_0004
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.71 sec   HDFS Read: 655686 HDFS Write: 130 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 710 msec
OK
2012    1.235E7
2012    1.235E7
2012    1.235E7
2012    1.235E7
2012    1.235E7
2012    1.235E7
2012    1.235E7
2012    1.235E7
2012    1.235E7
2012    1.235E7
Time taken: 25.079 seconds, Fetched: 10 row(s)
```
Can the = in the condition after ON be replaced with >=? No: Hive only supports equality conditions (equi-joins) in the ON clause, as the error below shows.
```
hive>
    > select a.yearid, a.salary
    > from salaries_external a join salaries_partition b
    > on a.yearid >= b.yearid
    > where a.yearid = 2012 limit 10;
FAILED: SemanticException [Error 10017]: Line 3:3 Both left and right aliases encountered in JOIN 'yearid'
```
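If you really do need a non-equi condition, one hedged workaround, assuming a Hive version with CROSS JOIN support and inputs small enough for a cross product to be practical, is to move the inequality into the WHERE clause:

```sql
-- Hedged sketch: emulate a non-equi join by filtering a cross product.
-- Only sensible when both inputs are small; this is not the plan Hive
-- would build for a true non-equi join.
select a.yearid, a.salary
from salaries_external a cross join salaries_partition b
where a.yearid >= b.yearid
  and a.yearid = 2012
limit 10;
```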
In a JOIN, Hive assumes the last table listed is the largest: while executing the join, it buffers the smaller tables in memory and streams the last table through. In practice, therefore, we should list tables in increasing order of size.
Of course, you can also tell Hive explicitly which table is the big one with the hint /*+ STREAMTABLE(alias) */, placed in the SELECT clause.
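For instance, to stream salaries_external (aliased a) instead of the default last table, something like the following should work; this is a sketch based on the join above, not executed here:

```sql
-- Sketch: the STREAMTABLE hint tells Hive to stream table a and buffer
-- the others, overriding the default "last table is largest" assumption.
select /*+ STREAMTABLE(a) */ a.yearid, a.salary
from salaries_external a join salaries_partition b
on a.yearid = b.yearid
where a.yearid = 2012 limit 10;
```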
ORDER BY sorts the data globally, while SORT BY sorts only locally. If you need the output in one guaranteed total order, use ORDER BY, but it is much slower because all rows pass through a single reducer. SORT BY only guarantees order within each reducer: with multiple reducers, the overall output order will be interleaved.
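A sketch of the difference, reusing the tables above (not executed here):

```sql
-- Global sort: all rows pass through one reducer, so the output is
-- totally ordered, but the query is slow on large tables.
select yearid, salary from salaries_external order by salary desc limit 10;

-- Local sort: each reducer orders only its own slice; with more than one
-- reducer the concatenated output is not globally ordered.
select yearid, salary from salaries_external sort by salary desc limit 10;
```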
DISTRIBUTE BY sends all rows with the same value of a given column to the same reducer; it corresponds to the partitioner in MapReduce. DISTRIBUTE BY must be written before SORT BY. (When the two use the same columns, the combination can be abbreviated as CLUSTER BY.)
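Putting the two together, a sketch (not executed here) that routes each player's rows to one reducer and then sorts within each reducer:

```sql
-- DISTRIBUTE BY plays the role of the MapReduce partitioner: all rows
-- with the same playerid go to the same reducer. SORT BY then orders
-- rows within each reducer; note it comes after DISTRIBUTE BY.
select playerid, salary
from salaries_external
distribute by playerid
sort by playerid asc, salary desc;
```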
Hive can also sample data with TABLESAMPLE. In tablesample(bucket 3 out of 10 on rand()), the parameter after out of says the data is divided into 10 buckets, and the parameter after bucket selects which of those buckets (the 3rd here) to return:
```
hive>
    > select count(*) from salaries tablesample(bucket 3 out of 10 on rand()) s;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1475147088438_0005, Tracking URL = http://hadoopwy1:8088/proxy/application_1475147088438_0005/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1475147088438_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-09-29 05:08:13,345 Stage-1 map = 0%, reduce = 0%
2016-09-29 05:08:54,340 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.15 sec
2016-09-29 05:09:12,109 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.62 sec
MapReduce Total cumulative CPU time: 3 seconds 620 msec
Ended Job = job_1475147088438_0005
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 3.62 sec   HDFS Read: 53025 HDFS Write: 4 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 620 msec
OK
167
Time taken: 69.265 seconds, Fetched: 1 row(s)
```
Block sampling extracts a given fraction of the data, here 0.1 percent. Note the result below: the sampling granularity is an HDFS block, so on a small table the fraction actually returned can be far larger than requested.
```
hive>
    > select count(*) from salaries_external tablesample(0.1 percent);
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1475147088438_0006, Tracking URL = http://hadoopwy1:8088/proxy/application_1475147088438_0006/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1475147088438_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-09-29 05:13:14,531 Stage-1 map = 0%, reduce = 0%
2016-09-29 05:13:35,510 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.62 sec
2016-09-29 05:13:47,117 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.0 sec
MapReduce Total cumulative CPU time: 3 seconds 0 msec
Ended Job = job_1475147088438_0006
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 3.0 sec   HDFS Read: 677182 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 0 msec
OK
23142
Time taken: 45.085 seconds, Fetched: 1 row(s)
```