Hive Data Warehouse -- HiveQL Queries

Previous posts covered Hive's architecture, data definition, and data loading. This post looks at HiveQL queries. Overall, querying in Hive feels quite close to MySQL, though it is not identical; I will point out the differences where they come up.

Please credit the source when reposting: Hive Data Warehouse -- HiveQL Queries


Below are the tables created in the earlier posts; a quick query shows a sample of their data.

hive> show tables;
OK
salaries
salaries_external
salaries_partition
wt
Time taken: 0.022 seconds, Fetched: 4 row(s)
hive> select * from salaries_external limit 10;
OK
1985    BAL     AL      murraed02       1472819.0
1985    BAL     AL      lynnfr01        1090000.0
1985    BAL     AL      ripkeca01       800000.0
1985    BAL     AL      lacyle01        725000.0
1985    BAL     AL      flanami01       641667.0
1985    BAL     AL      boddimi01       625000.0
1985    BAL     AL      stewasa01       581250.0
1985    BAL     AL      martide01       560000.0
1985    BAL     AL      roeniga01       558333.0
1985    BAL     AL      mcgresc01       547143.0
Time taken: 1.142 seconds, Fetched: 10 row(s)
hive>

The table's schema:

hive> describe salaries_external;
OK
yearid                  int                     year
teamid                  string                  team
lgid                    string
playerid                string
salary                  float
Time taken: 0.148 seconds, Fetched: 5 row(s)
hive>

Now for the query operations themselves.

Basic queries are largely the same as in MySQL, so rather than walking through many of them, here is a single example for reference.

hive>
    > select yearid, teamid from salaries_external limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475137014881_0001, Tracking URL = http://hadoopwy1:8088/proxy/application_1475137014881_0001/
Kill Command = /usr/local/hadoop2/bin/hadoop job  -kill job_1475137014881_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-29 01:38:37,119 Stage-1 map = 0%,  reduce = 0%
2016-09-29 01:39:37,511 Stage-1 map = 0%,  reduce = 0%
2016-09-29 01:40:37,574 Stage-1 map = 0%,  reduce = 0%
2016-09-29 01:40:52,968 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.96 sec
MapReduce Total cumulative CPU time: 1 seconds 960 msec
Ended Job = job_1475137014881_0001
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.96 sec   HDFS Read: 4422 HDFS Write: 90 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 960 msec
OK
1985    BAL
1985    BAL
1985    BAL
1985    BAL
1985    BAL
1985    BAL
1985    BAL
1985    BAL
1985    BAL
1985    BAL
Time taken: 338.211 seconds, Fetched: 10 row(s)


Some of Hive's built-in functions

Return type | Signature | Description
BIGINT | round(double a) | Returns the BIGINT value closest to the double.
BIGINT | floor(double a) | Returns the largest BIGINT less than or equal to the double.
BIGINT | ceil(double a) | Returns the smallest BIGINT greater than or equal to the double.
double | rand(), rand(int seed) | Returns a random number that changes from row to row.
string | concat(string A, string B, ...) | Returns the string produced by concatenating B after A.
string | substr(string A, int start) | Returns the substring of A starting at position start and running to the end of A.
string | substr(string A, int start, int length) | Returns the substring of A of the given length, starting at position start.
string | upper(string A) | Returns the string produced by converting all characters of A to upper case.
string | ucase(string A) | Same as above.
string | lower(string A) | Returns the string produced by converting all characters of A to lower case.
string | lcase(string A) | Same as above.
string | trim(string A) | Returns the string produced by trimming spaces from both ends of A.
string | ltrim(string A) | Returns the string produced by trimming spaces from the beginning (left side) of A.
string | rtrim(string A) | Returns the string produced by trimming spaces from the end (right side) of A.
string | regexp_replace(string A, string B, string C) | Returns the string produced by replacing all substrings of A that match the Java regular expression B with C.
int | size(Map<K.V>) | Returns the number of elements in the map.
int | size(Array<T>) | Returns the number of elements in the array.
value of <type> | cast(<expr> as <type>) | Converts the result of expression expr to <type>; e.g. cast('1' as BIGINT) converts the string '1' to its integral value. Returns NULL if the conversion fails.
string | from_unixtime(int unixtime) | Converts a count of seconds since the Unix epoch (1970-01-01 00:00:00 UTC) to a timestamp string for the current system time zone, in the format "1970-01-01 00:00:00".
string | to_date(string timestamp) | Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01"
int | year(string date) | Returns the year part of a date or timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970
int | month(string date) | Returns the month part of a date or timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11
int | day(string date) | Returns the day part of a date or timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1
string | get_json_object(string json_string, string path) | Extracts a JSON object from json_string at the specified JSON path and returns it as a JSON string. Returns NULL if the input JSON string is invalid.
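For intuition, here is a small Python sketch (an illustration of behavior only, not Hive code) of how a couple of these string functions act; in particular, note that Hive's substr() uses 1-based positions, unlike Python slicing:

```python
def hive_concat(*args):
    """concat(A, B, ...): Hive returns NULL if any argument is NULL."""
    if any(a is None for a in args):
        return None
    return "".join(str(a) for a in args)

def hive_substr(s, start, length=None):
    """substr(A, start[, length]) with Hive's 1-based start position."""
    # A negative start counts back from the end of the string.
    i = start - 1 if start > 0 else len(s) + start
    return s[i:] if length is None else s[i:i + length]

print(hive_concat("murraed02", "->", "1472819.0"))  # murraed02->1472819.0
print(hive_substr("1970-01-01 00:00:00", 1, 10))    # 1970-01-01
```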

hive> select concat(playerid, salary) from salaries_external limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475137014881_0004, Tracking URL = http://hadoopwy1:8088/proxy/application_1475137014881_0004/
Kill Command = /usr/local/hadoop2/bin/hadoop job  -kill job_1475137014881_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-29 02:03:04,828 Stage-1 map = 0%,  reduce = 0%
2016-09-29 02:03:25,653 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.8 sec
MapReduce Total cumulative CPU time: 1 seconds 800 msec
Ended Job = job_1475137014881_0004
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.8 sec   HDFS Read: 4422 HDFS Write: 180 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 800 msec
OK
murraed021472819.0
lynnfr011090000.0
ripkeca01800000.0
lacyle01725000.0
flanami01641667.0
boddimi01625000.0
stewasa01581250.0
martide01560000.0
roeniga01558333.0
mcgresc01547143.0
Time taken: 42.353 seconds, Fetched: 10 row(s)
hive> select concat(playerid, concat('->', salary)) from salaries_external limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475137014881_0005, Tracking URL = http://hadoopwy1:8088/proxy/application_1475137014881_0005/
Kill Command = /usr/local/hadoop2/bin/hadoop job  -kill job_1475137014881_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-29 02:04:23,813 Stage-1 map = 0%,  reduce = 0%
2016-09-29 02:04:32,562 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.41 sec
MapReduce Total cumulative CPU time: 1 seconds 410 msec
Ended Job = job_1475137014881_0005
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.41 sec   HDFS Read: 4422 HDFS Write: 200 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 410 msec
OK
murraed02->1472819.0
lynnfr01->1090000.0
ripkeca01->800000.0
lacyle01->725000.0
flanami01->641667.0
boddimi01->625000.0
stewasa01->581250.0
martide01->560000.0
roeniga01->558333.0
mcgresc01->547143.0
Time taken: 28.394 seconds, Fetched: 10 row(s)
hive>
Both queries use the built-in string-concatenation function: the first concatenates the two columns directly, while the second first joins the '->' string to the salary column and then prepends playerid.


Hive aggregate functions

Hive supports the following built-in aggregate functions, which are used much like their SQL counterparts.

Return type | Signature | Description
BIGINT | count(*), count(expr), count(DISTINCT expr) | Returns the total number of retrieved rows.
DOUBLE | sum(col), sum(DISTINCT col) | Returns the sum of the elements in the group, or the sum of the distinct values of the column in the group.
DOUBLE | avg(col), avg(DISTINCT col) | Returns the average of the elements in the group, or of the distinct values of the column in the group.
DOUBLE | min(col) | Returns the minimum value of the column in the group.
DOUBLE | max(col) | Returns the maximum value of the column in the group.

hive>
    > select count(*) from salaries_external;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1475137014881_0002, Tracking URL = http://hadoopwy1:8088/proxy/application_1475137014881_0002/
Kill Command = /usr/local/hadoop2/bin/hadoop job  -kill job_1475137014881_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-09-29 01:43:59,754 Stage-1 map = 0%,  reduce = 0%
2016-09-29 01:45:00,769 Stage-1 map = 0%,  reduce = 0%
2016-09-29 01:46:01,222 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 1.87 sec
2016-09-29 01:46:28,834 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.82 sec
2016-09-29 01:46:58,562 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.19 sec
MapReduce Total cumulative CPU time: 4 seconds 190 msec
Ended Job = job_1475137014881_0002
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.43 sec   HDFS Read: 1354022 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 430 msec
OK
46284
Time taken: 242.819 seconds, Fetched: 1 row(s)


Nested queries in Hive

Nested queries come up constantly in real work. Here is one example for reference; adapt the pattern to your own business logic.

hive> from (
    > select * from salaries_external where yearid = 2012
    > )e
    > select e.yearid as year , e.playerid as player
    > where e.salary > 10000 limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475137014881_0007, Tracking URL = http://hadoopwy1:8088/proxy/application_1475137014881_0007/
Kill Command = /usr/local/hadoop2/bin/hadoop job  -kill job_1475137014881_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-29 02:47:21,350 Stage-1 map = 0%,  reduce = 0%
2016-09-29 02:47:29,194 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.78 sec
MapReduce Total cumulative CPU time: 1 seconds 780 msec
Ended Job = job_1475137014881_0007
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.78 sec   HDFS Read: 655686 HDFS Write: 149 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 780 msec
OK
2012    markani01
2012    roberbr01
2012    reynoma01
2012    hardyjj01
2012    jonesad01
2012    greggke01
2012    hammeja01
2012    lindsma01
2012    chenwe02
2012    johnsji04
Time taken: 29.807 seconds, Fetched: 10 row(s)


The CASE WHEN statement


hive> select yearid, salary,
    > case
    > when salary < 10000 then 'low'
    > when salary >= 10000 and salary < 20000 then 'Mid'
    > else 'High'
    > end as bracket from salaries_external limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475137014881_0008, Tracking URL = http://hadoopwy1:8088/proxy/application_1475137014881_0008/
Kill Command = /usr/local/hadoop2/bin/hadoop job  -kill job_1475137014881_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-29 03:03:52,323 Stage-1 map = 0%,  reduce = 0%
2016-09-29 03:04:10,418 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.87 sec
MapReduce Total cumulative CPU time: 1 seconds 870 msec
Ended Job = job_1475137014881_0008
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.87 sec   HDFS Read: 4422 HDFS Write: 192 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 870 msec
OK
1985    1472819.0       High
1985    1090000.0       High
1985    800000.0        High
1985    725000.0        High
1985    641667.0        High
1985    625000.0        High
1985    581250.0        High
1985    560000.0        High
1985    558333.0        High
1985    547143.0        High
Time taken: 31.771 seconds, Fetched: 10 row(s)

When can Hive avoid MapReduce?

hive>
    > select * from salaries_external limit 10;
OK
1985    BAL     AL      murraed02       1472819.0
1985    BAL     AL      lynnfr01        1090000.0
1985    BAL     AL      ripkeca01       800000.0
1985    BAL     AL      lacyle01        725000.0
1985    BAL     AL      flanami01       641667.0
1985    BAL     AL      boddimi01       625000.0
1985    BAL     AL      stewasa01       581250.0
1985    BAL     AL      martide01       560000.0
1985    BAL     AL      roeniga01       558333.0
1985    BAL     AL      mcgresc01       547143.0
Time taken: 0.204 seconds, Fetched: 10 row(s)
A simple query like this does not trigger a MapReduce job; the same goes for plain select * from tableName statements.

hive> select * from salaries_partition where yearid = 1985 limit 10;
OK
Time taken: 0.705 seconds

That was a query against the partitioned table (partitioned by yearid), and it also returned without a MapReduce job.

But does the same filter avoid MapReduce on an unpartitioned table?

It does not:

hive>
    > select * from salaries_external where yearid = 1985 limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475137014881_0009, Tracking URL = http://hadoopwy1:8088/proxy/application_1475137014881_0009/
Kill Command = /usr/local/hadoop2/bin/hadoop job  -kill job_1475137014881_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-29 03:13:42,533 Stage-1 map = 0%,  reduce = 0%
2016-09-29 03:14:11,089 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.37 sec
MapReduce Total cumulative CPU time: 2 seconds 370 msec
Ended Job = job_1475137014881_0009
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 2.37 sec   HDFS Read: 4422 HDFS Write: 310 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 370 msec
OK
1985    BAL     AL      murraed02       1472819.0
1985    BAL     AL      lynnfr01        1090000.0
1985    BAL     AL      ripkeca01       800000.0
1985    BAL     AL      lacyle01        725000.0
1985    BAL     AL      flanami01       641667.0
1985    BAL     AL      boddimi01       625000.0
1985    BAL     AL      stewasa01       581250.0
1985    BAL     AL      martide01       560000.0
1985    BAL     AL      roeniga01       558333.0
1985    BAL     AL      mcgresc01       547143.0
Time taken: 46.694 seconds, Fetched: 10 row(s)

To summarize: when the WHERE clause filters only on partition columns, Hive can answer the query without launching MapReduce, while the same filter on an ordinary unpartitioned table does trigger a MapReduce job.

Seen from that angle, partitioning is a natural first lever when a Hive workload needs optimizing; at least, that is my view.

That said, do not partition blindly just because something is slow; design partitions around your actual business access patterns.
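Conceptually, each partition is a subdirectory under the table's HDFS path, and a filter on the partition column lets Hive read only the matching directories. A rough Python analogy (hypothetical toy data, for illustration only):

```python
# Toy model of partition pruning: a table partitioned by yearid maps to
# one "directory" (here: dict key) per year. A filter on the partition
# column only touches the matching directory; an unpartitioned table
# must scan every row (the MapReduce case above).
partitioned = {
    1985: [("BAL", "murraed02", 1472819.0), ("BAL", "lynnfr01", 1090000.0)],
    2012: [("BAL", "markani01", 12000000.0)],
}

def query_partitioned(yearid):
    # Prune: read only the partition for the requested year.
    return partitioned.get(yearid, [])

def query_unpartitioned(rows, yearid):
    # Full scan: every row must be examined.
    return [r for r in rows if r[0] == yearid]

all_rows = [(y, *r) for y, rs in partitioned.items() for r in rs]
print(len(query_partitioned(1985)))              # reads one partition
print(len(query_unpartitioned(all_rows, 1985)))  # scans every row
```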


The LIKE statement

hive> select * from salaries_partition where playerid like '%AL' limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475137014881_0010, Tracking URL = http://hadoopwy1:8088/proxy/application_1475137014881_0010/
Kill Command = /usr/local/hadoop2/bin/hadoop job  -kill job_1475137014881_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-29 03:56:20,951 Stage-1 map = 0%,  reduce = 0%
2016-09-29 03:56:37,598 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.89 sec
MapReduce Total cumulative CPU time: 1 seconds 890 msec
Ended Job = job_1475137014881_0010
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.89 sec   HDFS Read: 44567 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 890 msec
OK
Time taken: 29.258 seconds

The GROUP BY statement

GROUP BY is normally used together with aggregate functions: rows are grouped by the columns listed after GROUP BY, and the aggregate is then computed over each group.

hive> select avg(salary) from salaries_partition where yearid = 2012 group by playerid limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1475147088438_0001, Tracking URL = http://hadoopwy1:8088/proxy/application_1475147088438_0001/
Kill Command = /usr/local/hadoop2/bin/hadoop job  -kill job_1475147088438_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-09-29 04:06:27,538 Stage-1 map = 0%,  reduce = 0%
2016-09-29 04:06:43,779 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.22 sec
2016-09-29 04:06:54,461 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.58 sec
MapReduce Total cumulative CPU time: 2 seconds 580 msec
Ended Job = job_1475147088438_0001
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 2.58 sec   HDFS Read: 44567 HDFS Write: 96 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 580 msec
OK
500000.0
485000.0
9000000.0
1200000.0
2100000.0
875000.0
4400000.0
5000000.0
1075000.0
495000.0
Time taken: 40.241 seconds, Fetched: 10 row(s)


HAVING: filtering the groups produced by GROUP BY

hive> select avg(salary) from salaries_partition where yearid = 2012 group by playerid having avg(salary) > 1000000 limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1475147088438_0002, Tracking URL = http://hadoopwy1:8088/proxy/application_1475147088438_0002/
Kill Command = /usr/local/hadoop2/bin/hadoop job  -kill job_1475147088438_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-09-29 04:11:40,153 Stage-1 map = 0%,  reduce = 0%
2016-09-29 04:12:07,548 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.88 sec
2016-09-29 04:12:31,211 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.07 sec
MapReduce Total cumulative CPU time: 4 seconds 70 msec
Ended Job = job_1475147088438_0002
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.07 sec   HDFS Read: 44567 HDFS Write: 100 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 70 msec
OK
9000000.0
1200000.0
2100000.0
4400000.0
5000000.0
1075000.0
1400000.0
2200000.0
3250000.0
1300000.0
Time taken: 66.952 seconds, Fetched: 10 row(s)

JOIN statements

This section shows one ordinary join and then a form that Hive does not support.

hive> select a.yearid, a.salary
    > from salaries_external a join salaries_partition b
    > on a.yearid = b.yearid
    > where a.yearid = 2012 limit 10;
Total jobs = 1
16/09/29 04:43:06 WARN conf.Configuration: file:/tmp/root/hive_2016-09-29_04-43-03_599_2586737001473820252-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
16/09/29 04:43:06 WARN conf.Configuration: file:/tmp/root/hive_2016-09-29_04-43-03_599_2586737001473820252-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
Execution log at: /tmp/root/root_20160929044343_55cc7606-3f32-4f0e-ac77-fc9d5049dd5a.log
2016-09-29 04:43:07     Starting to launch local task to process map join;      maximum memory = 518979584
2016-09-29 04:43:08     Dump the side-table into file: file:/tmp/root/hive_2016-09-29_04-43-03_599_2586737001473820252-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable
2016-09-29 04:43:08     Uploaded 1 File to: file:/tmp/root/hive_2016-09-29_04-43-03_599_2586737001473820252-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (7093 bytes)
2016-09-29 04:43:08     End of local task; Time Taken: 1.164 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475147088438_0004, Tracking URL = http://hadoopwy1:8088/proxy/application_1475147088438_0004/
Kill Command = /usr/local/hadoop2/bin/hadoop job  -kill job_1475147088438_0004
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2016-09-29 04:43:19,720 Stage-3 map = 0%,  reduce = 0%
2016-09-29 04:43:27,527 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.71 sec
MapReduce Total cumulative CPU time: 1 seconds 710 msec
Ended Job = job_1475147088438_0004
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.71 sec   HDFS Read: 655686 HDFS Write: 130 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 710 msec
OK
2012    1.235E7
2012    1.235E7
2012    1.235E7
2012    1.235E7
2012    1.235E7
2012    1.235E7
2012    1.235E7
2012    1.235E7
2012    1.235E7
2012    1.235E7
Time taken: 25.079 seconds, Fetched: 10 row(s)

That is just one example; the other join forms should be familiar from ordinary databases.

Can the = in the ON condition be replaced with >=?

hive>
    > select a.yearid, a.salary
    > from salaries_external a join salaries_partition b
    > on a.yearid >= b.yearid
    > where a.yearid = 2012 limit 10;
FAILED: SemanticException [Error 10017]: Line 3:3 Both left and right aliases encountered in JOIN 'yearid'

It cannot: Hive does not support non-equi join conditions like this, mainly because such joins are hard to implement in MapReduce.


Join optimization

In a join, Hive assumes by default that the last table listed is the largest: it buffers the other, smaller tables in memory and then streams the last table through to do the computation. In practice, therefore, list the joined tables in increasing order of size.

You can also tell Hive explicitly which table to stream with the /*+ STREAMTABLE(s) */ hint (where s is the table's alias), placed right after SELECT.
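The mechanism can be sketched as a hash join: build an in-memory hash table from the small side, then stream the large side past it. A simplified illustration with made-up data (not Hive internals):

```python
# Why Hive wants the largest table last: the smaller table is loaded
# into an in-memory hash table, and the large table is streamed row by
# row against it.
small = [(2012, "AL"), (2013, "AL")]                  # buffered side
large = [(2012, "markani01"), (2012, "roberbr01"),
         (2013, "jonesad01"), (2011, "oldguy01")]     # streamed side

# Build phase: hash the small table on the join key.
hashed = {}
for yearid, lgid in small:
    hashed.setdefault(yearid, []).append(lgid)

# Probe phase: stream the large table, emitting matches.
result = [(yearid, player, lgid)
          for yearid, player in large
          for lgid in hashed.get(yearid, [])]
print(result)  # the 2011 row has no match and is dropped
```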


ORDER BY and SORT BY

ORDER BY produces a total, global ordering of the output, while SORT BY only orders data locally. If you need a guaranteed overall order, use ORDER BY, but expect it to be much slower. SORT BY guarantees order only within each reducer, so when there are multiple reducers the overall order is scrambled.
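The difference is easy to see if you model each reducer's output separately (an illustration only, with made-up reducer assignments):

```python
# ORDER BY: one global sort. SORT BY: each reducer sorts only its own
# slice, so concatenating reducer outputs is not globally ordered.
salaries = [800000.0, 1472819.0, 560000.0, 1090000.0]

order_by = sorted(salaries)  # single reducer, total order

# Pretend the rows were distributed across two reducers:
reducer_inputs = [[800000.0, 1472819.0], [560000.0, 1090000.0]]
sort_by = [x for part in reducer_inputs for x in sorted(part)]

print(order_by)  # globally sorted
print(sort_by)   # each half sorted, but not sorted overall
```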


The DISTRIBUTE BY statement

DISTRIBUTE BY routes rows to reducers by the value of a given column, so rows with the same value land on the same reducer.

DISTRIBUTE BY must be written before SORT BY.

Conceptually, DISTRIBUTE BY corresponds to the partitioner in MapReduce.
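A minimal model of that partitioner, for a hypothetical DISTRIBUTE BY yearid SORT BY salary (illustrative only):

```python
# DISTRIBUTE BY yearid, SORT BY salary, modeled as MapReduce's
# partitioner: hash the distribute-by key to pick a reducer, then sort
# within each reducer only.
rows = [(1985, 800000.0), (2012, 500000.0), (1985, 560000.0), (2012, 9000000.0)]
num_reducers = 2

partitions = [[] for _ in range(num_reducers)]
for yearid, salary in rows:
    partitions[hash(yearid) % num_reducers].append((yearid, salary))

for part in partitions:
    part.sort(key=lambda r: r[1])  # per-reducer SORT BY salary

print(partitions)  # all rows sharing a yearid sit in the same partition
```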


Sampling

In tablesample(bucket 3 out of 10 on rand()), the parameter after out of splits the data into 10 buckets, and the parameter after bucket selects which of those buckets (here the 3rd) to return.

hive>
    > select count(*) from salaries tablesample(bucket 3 out of 10 on rand()) s;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1475147088438_0005, Tracking URL = http://hadoopwy1:8088/proxy/application_1475147088438_0005/
Kill Command = /usr/local/hadoop2/bin/hadoop job  -kill job_1475147088438_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-09-29 05:08:13,345 Stage-1 map = 0%,  reduce = 0%
2016-09-29 05:08:54,340 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.15 sec
2016-09-29 05:09:12,109 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.62 sec
MapReduce Total cumulative CPU time: 3 seconds 620 msec
Ended Job = job_1475147088438_0005
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 3.62 sec   HDFS Read: 53025 HDFS Write: 4 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 620 msec
OK
167
Time taken: 69.265 seconds, Fetched: 1 row(s)
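The bucketing semantics can be modeled like this (an illustration with a made-up row count; Hive assigns buckets using the expression in the on clause):

```python
import random

# tablesample(bucket 3 out of 10 on rand()): split the rows into 10
# random buckets and keep only bucket 3 -- roughly a tenth of the data.
random.seed(0)          # fixed seed so the sketch is reproducible
rows = list(range(1680))  # stand-in for the salaries table

buckets = {}
for row in rows:
    b = random.randrange(10) + 1  # buckets numbered 1..10
    buckets.setdefault(b, []).append(row)

sample = buckets.get(3, [])
print(len(sample))  # about a tenth of 1680
```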

Block sampling

This extracts roughly 0.1 percent of the data. Note that the sampling granularity is the data block, so on a small table the fraction actually returned can be far larger than requested, as the count below shows (23142 of 46284 rows).

hive>
    > select count(*) from salaries_external tablesample(0.1 percent);
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1475147088438_0006, Tracking URL = http://hadoopwy1:8088/proxy/application_1475147088438_0006/
Kill Command = /usr/local/hadoop2/bin/hadoop job  -kill job_1475147088438_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-09-29 05:13:14,531 Stage-1 map = 0%,  reduce = 0%
2016-09-29 05:13:35,510 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.62 sec
2016-09-29 05:13:47,117 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.0 sec
MapReduce Total cumulative CPU time: 3 seconds 0 msec
Ended Job = job_1475147088438_0006
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 3.0 sec   HDFS Read: 677182 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 0 msec
OK
23142
Time taken: 45.085 seconds, Fetched: 1 row(s)




