Etl之HiveSql调优(left join where的位置)

一、前言

公司实用Hadoop构建数据仓库,期间不可避免的实用HiveSql,在Etl过程中,速度成了避无可避的问题。本人有过几个数据表关联跑1个小时的经历,你可能觉得无所谓,可是多次Etl就要多个小时,非常浪费时间,所以HiveSql优化不可避免。

注:本文只是从sql层面介绍一下日常需要注意的点,不涉及Hadoop、MapReduce等层面,关于Hive的编译过程,请参考文章:http://tech.meituan.com/hive-sql-to-mapreduce.html

二、准备数据

假设咱们有两张数据表。

景区表:sight,12W条记录,数据表结构:

hive> desc sight;
OK
area                    string                  None                
city                    string                  None                
country                 string                  None                
county                  string                  None                
id                      string                  None                
name                    string                  None                
region                  string                  None     

景区订单明细表:order_sight,1040W条记录,数据表结构:

hive> desc order_sight;    
OK
create_time             string                  None                
id                      string                  None                
order_id                string                  None                
sight_id                bigint                  None  

 三、分析

3.1 where条件

那么咱们希望看见景区id是9718,日期是2015-10-10的所有订单id,那么sql需要如下书写:

hive> select s.id,o.order_id from sight s left join order_sight o on o.sight_id=s.id where s.id=9718 and o.create_time = '2015-10-10';         
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3562174, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3562174/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3562174
Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 1
2015-10-12 22:58:22,706 Stage-1 map = 0%,  reduce = 0%
2015-10-12 22:58:29,882 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 4.73 sec
2015-10-12 22:58:30,907 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 4.73 sec
2015-10-12 22:58:31,933 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:32,968 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:33,995 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:35,020 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:36,046 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:37,070 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:38,096 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:39,121 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:40,153 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:41,182 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.22 sec
2015-10-12 22:58:42,209 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.22 sec
2015-10-12 22:58:43,236 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.22 sec
2015-10-12 22:58:44,263 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.3 sec
2015-10-12 22:58:45,289 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.3 sec
2015-10-12 22:58:46,316 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.3 sec
2015-10-12 22:58:47,344 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 21.85 sec
2015-10-12 22:58:48,370 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 21.85 sec
2015-10-12 22:58:49,397 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 21.85 sec
2015-10-12 22:58:50,424 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 21.85 sec
2015-10-12 22:58:51,452 Stage-1 map = 83%,  reduce = 17%, Cumulative CPU 37.62 sec
2015-10-12 22:58:52,478 Stage-1 map = 88%,  reduce = 17%, Cumulative CPU 38.06 sec
2015-10-12 22:58:53,506 Stage-1 map = 88%,  reduce = 17%, Cumulative CPU 38.06 sec
2015-10-12 22:58:54,534 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.17 sec
2015-10-12 22:58:55,560 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.17 sec
2015-10-12 22:58:56,587 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.17 sec
2015-10-12 22:58:57,615 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.25 sec
2015-10-12 22:58:58,642 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.25 sec
2015-10-12 22:58:59,674 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.25 sec
2015-10-12 22:59:00,708 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.32 sec
2015-10-12 22:59:01,736 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.32 sec
2015-10-12 22:59:02,763 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.32 sec
2015-10-12 22:59:03,791 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.41 sec
2015-10-12 22:59:04,817 Stage-1 map = 96%,  reduce = 29%, Cumulative CPU 49.13 sec
2015-10-12 22:59:05,843 Stage-1 map = 100%,  reduce = 29%, Cumulative CPU 49.59 sec
2015-10-12 22:59:06,870 Stage-1 map = 100%,  reduce = 41%, Cumulative CPU 49.76 sec
2015-10-12 22:59:07,897 Stage-1 map = 100%,  reduce = 41%, Cumulative CPU 49.76 sec
2015-10-12 22:59:08,922 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 52.79 sec
2015-10-12 22:59:09,947 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 52.79 sec
MapReduce Total cumulative CPU time: 52 seconds 790 msec
Ended Job = job_1434099279301_3562174
MapReduce Jobs Launched: 
Job 0: Map: 8  Reduce: 1   Cumulative CPU: 52.79 sec   HDFS Read: 371210469 HDFS Write: 330 SUCCESS
Total MapReduce CPU Time Spent: 52 seconds 790 msec
OK
9718    210294949
9718    210294421
9718    210296438
9718    210295344
9718    210297567
9718    210296076
9718    210295525
9718    210298219
9718    210295840
9718    210301363
9718    210297733
9718    210298066
9718    210295239
9718    210298328
9718    210298008
9718    210299712
9718    210295586
9718    210295050
9718    210295566
9718    210299105
9718    210296318
9718    210295277
Time taken: 52.068 seconds, Fetched: 22 row(s)

 可见需要的时间是52秒,如果咱们换一个sql的书写方式:

hive> select s.id,o.order_id from sight s left join (select order_id,sight_id from order_sight where create_time = '2015-10-10') o on o.sight_id=s.id where s.id=9718;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3562218, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3562218/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3562218
Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 1
2015-10-12 23:03:54,926 Stage-1 map = 0%,  reduce = 0%
2015-10-12 23:04:01,075 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 2.24 sec
2015-10-12 23:04:02,101 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 2.24 sec
2015-10-12 23:04:03,126 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 2.24 sec
2015-10-12 23:04:04,151 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 5.53 sec
2015-10-12 23:04:05,176 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 5.53 sec
2015-10-12 23:04:06,201 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.62 sec
2015-10-12 23:04:07,226 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 18.66 sec
2015-10-12 23:04:08,250 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 18.66 sec
2015-10-12 23:04:09,275 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 18.66 sec
2015-10-12 23:04:10,300 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 18.66 sec
2015-10-12 23:04:11,324 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 18.66 sec
2015-10-12 23:04:12,356 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.09 sec
2015-10-12 23:04:13,384 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.09 sec
2015-10-12 23:04:14,410 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.09 sec
2015-10-12 23:04:15,437 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.22 sec
2015-10-12 23:04:16,463 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.22 sec
2015-10-12 23:04:17,487 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.22 sec
2015-10-12 23:04:18,514 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.35 sec
2015-10-12 23:04:19,538 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.35 sec
2015-10-12 23:04:20,569 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.35 sec
2015-10-12 23:04:21,595 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.54 sec
2015-10-12 23:04:22,620 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.54 sec
2015-10-12 23:04:23,646 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.54 sec
2015-10-12 23:04:24,673 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.64 sec
2015-10-12 23:04:25,698 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.64 sec
2015-10-12 23:04:26,723 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.64 sec
2015-10-12 23:04:27,748 Stage-1 map = 75%,  reduce = 21%, Cumulative CPU 23.32 sec
2015-10-12 23:04:28,774 Stage-1 map = 88%,  reduce = 21%, Cumulative CPU 27.27 sec
2015-10-12 23:04:29,799 Stage-1 map = 100%,  reduce = 21%, Cumulative CPU 32.82 sec
2015-10-12 23:04:30,823 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 34.35 sec
2015-10-12 23:04:31,846 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 34.35 sec
MapReduce Total cumulative CPU time: 34 seconds 350 msec
Ended Job = job_1434099279301_3562218
MapReduce Jobs Launched: 
Job 0: Map: 8  Reduce: 1   Cumulative CPU: 34.35 sec   HDFS Read: 371210469 HDFS Write: 330 SUCCESS
Total MapReduce CPU Time Spent: 34 seconds 350 msec
OK
9718    210297733
9718    210298066
9718    210295239
9718    210298328
9718    210298008
9718    210299712
9718    210297567
9718    210296076
9718    210295525
9718    210298219
9718    210295840
9718    210301363
9718    210295586
9718    210295050
9718    210295566
9718    210299105
9718    210296318
9718    210295277
9718    210294949
9718    210294421
9718    210296438
9718    210295344
Time taken: 43.709 seconds, Fetched: 22 row(s)

实用43秒,快了一些。当然咱们并不是仅仅分析说快了20%(我还多次测试,这次的差距最小),而是分析原因!

单从两个sql的写法上看的出来,特别是第二条的红色部分,我将left的条件写到里面了。那么执行的结果随之不一样,第二条的Reduce时间明显小于第一条的Reduce时间。

原因是这两个sql都分解成8个Map任务和1个Reduce任务,如果left的条件写在后面,那么这些关联操作会放在Reduce阶段,1个Reduce操作的时间必然大于8个Map的执行时间,造成执行时间超长。

结论:当使用外关联时,如果将副表的过滤条件写在Where后面,那么就会先全表关联,之后再过滤

你可能感兴趣的:(Etl之HiveSql调优(left join where的位置))