Hive Optimization: Cascading (Cumulative) Sum

I. Requirement: compute cumulative visits from daily visit records

Input data:

Device ID   Date
10000004 20180501
10000005 20180501
10000004 20180502
10000005 20180502
10000006 20180502
10000007 20180502
10000007 20180503
10000008 20180503
10000009 20180503

Output data:

Date Cumulative
20180501 2
20180502 6
20180503 9



II. Prepare the table and data



1. Create the table

create table device(dev_id string,dt string) row format delimited fields terminated by '\t'; 


2. Prepare the data file data.txt (fields separated by tabs)

10000004    20180501
10000005    20180501
10000004    20180502
10000005    20180502
10000006    20180502
10000007    20180502
10000007    20180503
10000008    20180503
10000009    20180503


3. Load the data into the table

load data local inpath 'data.txt' overwrite into table device;
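
A quick sanity check after loading can confirm that the file was split on tabs correctly. The query below is a minimal sketch; the alias daily_cnt is introduced here for illustration and does not appear in the original. With the nine sample rows it should report 2, 4 and 3 rows for 20180501, 20180502 and 20180503.

```sql
-- Row count per day; a NULL dt here would suggest the delimiter did not match.
select dt, count(1) as daily_cnt
from device
group by dt
order by dt;
```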



III. Cumulative detail: step-by-step walkthrough



Step 1: extract the distinct dates from table device

 select dt from device group by dt
dt
20180503
20180502
20180501



Step 2: put the Step 1 query on the left as a subquery and join it with table device to produce a Cartesian product

select t1.dt,t2.dt,t2.dev_id from (select dt from device group by dt) t1,device t2
t1.dt t2.dt t2.dev_id
20180503 20180503 10000009
20180503 20180503 10000008
20180503 20180503 10000007
20180503 20180502 10000007
20180503 20180502 10000006
20180503 20180502 10000005
20180503 20180502 10000004
20180503 20180501 10000005
20180503 20180501 10000004
20180502 20180503 10000009
20180502 20180503 10000008
20180502 20180503 10000007
20180502 20180502 10000007
20180502 20180502 10000006
20180502 20180502 10000005
20180502 20180502 10000004
20180502 20180501 10000005
20180502 20180501 10000004
20180501 20180503 10000009
20180501 20180503 10000008
20180501 20180503 10000007
20180501 20180502 10000007
20180501 20180502 10000006
20180501 20180502 10000005
20180501 20180502 10000004
20180501 20180501 10000005
20180501 20180501 10000004
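
The same product can also be written with an explicit CROSS JOIN, which makes the intent easier to spot than a comma join. This is a sketch of an equivalent form, not the original author's query. Depending on the Hive version and configuration, strict mode may reject Cartesian products (hive.mapred.mode=strict in older releases, hive.strict.checks.cartesian.product in newer ones), so those settings may need to be checked first.

```sql
-- Same Cartesian product with explicit CROSS JOIN syntax.
select t1.dt, t2.dt, t2.dev_id
from (select dt from device group by dt) t1
cross join device t2;

-- Size check: 3 distinct dates x 9 device rows = 27 rows.
select count(1) as row_cnt
from (select dt from device group by dt) t1
cross join device t2;
```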



Step 3: filter with t1.dt>=t2.dt to produce the cumulative detail

select t1.dt,t2.dt,t2.dev_id from (select dt from device group by dt) t1,device t2 where t1.dt>=t2.dt
t1.dt t2.dt t2.dev_id
20180503 20180503 10000009
20180503 20180503 10000008
20180503 20180503 10000007
20180503 20180502 10000007
20180503 20180502 10000006
20180503 20180502 10000005
20180503 20180502 10000004
20180503 20180501 10000005
20180503 20180501 10000004
20180502 20180502 10000007
20180502 20180502 10000006
20180502 20180502 10000005
20180502 20180502 10000004
20180502 20180501 10000005
20180502 20180501 10000004
20180501 20180501 10000005
20180501 20180501 10000004
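
If only the per-day totals are needed, the filtered join can be aggregated directly instead of being listed row by row. The following is a minimal sketch of that idea; the alias acc matches the name used for the final result in this post. On the sample data it should give 2, 6 and 9, which are the row counts of the three t1.dt groups listed above.

```sql
-- Count the detail rows per t1.dt without materializing them.
select t1.dt as dt, count(1) as acc
from (select dt from device group by dt) t1, device t2
where t1.dt >= t2.dt
group by t1.dt
order by dt;
```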


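Final step (the most important): to avoid data skew on large data volumes, use a map join. The small table t1 is loaded entirely into memory and the join is processed on the map side, so the second job runs without a reduce phase.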
 set hive.map.aggr=true;
 set hive.optimize.skewjoin=true;
 set hive.exec.parallel=true;
 set hive.auto.convert.join=true;
 
 select /*+ MAPJOIN(t1) */ t1.dt,t2.dt,t2.dev_id from (select dt from device group by dt) t1,device t2 where t1.dt>=t2.dt
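
Whether the map join is actually chosen can be checked with EXPLAIN before running the query on a large table. The sketch below assumes a Hive version in which the MAPJOIN hint is ignored by default (hive.ignore.mapjoin.hint=true). On such versions either automatic conversion picks the map join, or the hint has to be re-enabled as shown. In the plan, a Map Join Operator in a map-only stage indicates that t1 is being broadcast.

```sql
set hive.auto.convert.join=true;
-- Assumption: only needed where hints are ignored by default.
set hive.ignore.mapjoin.hint=false;

-- Print the execution plan instead of running the query.
explain
select /*+ MAPJOIN(t1) */ t1.dt, t2.dt, t2.dev_id
from (select dt from device group by dt) t1, device t2
where t1.dt >= t2.dt;
```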



MapReduce job information

 Total jobs = 2
 Launching Job 1 out of 2
 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
 2018-05-23 20:37:14,439 Stage-1 map = 0%,  reduce = 0%
 2018-05-23 20:37:20,680 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.22 sec
 2018-05-23 20:37:26,863 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.98 sec
 MapReduce Total cumulative CPU time: 3 seconds 980 msec
 Launching Job 2 out of 2
 Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
 2018-05-23 20:37:38,493 Stage-4 map = 0%,  reduce = 0%
 2018-05-23 20:37:44,663 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 3.32 sec
 MapReduce Total cumulative CPU time: 3 seconds 320 msec
 MapReduce Jobs Launched: 
 Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.98 sec   HDFS Read: 5965 HDFS Write: 174 SUCCESS
 Stage-Stage-4: Map: 1   Cumulative CPU: 3.32 sec   HDFS Read: 6031 HDFS Write: 459 SUCCESS
 Total MapReduce CPU Time Spent: 7 seconds 300 msec


IV. Test results

Tested on a table of 108 GB with 222,612,963 rows, the whole MapReduce process consumed a total CPU time of 1 day 14 hours 49 minutes 56 seconds 470 msec.


Its MapReduce job information

 Total jobs = 6
 Launching Job 1 out of 6
 Hadoop job information for Stage-1: number of mappers: 12; number of reducers: 1
 2018-05-23 20:13:16,253 Stage-1 map = 0%,  reduce = 0%
 2018-05-23 20:13:20,426 Stage-1 map = 8%,  reduce = 0%, Cumulative CPU 1.95 sec
 2018-05-23 20:13:21,454 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 24.02 sec
 2018-05-23 20:13:22,481 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 31.04 sec
 2018-05-23 20:13:27,623 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 33.66 sec
 MapReduce Total cumulative CPU time: 33 seconds 660 msec
 Stage-13 is filtered out by condition resolver.
 Stage-14 is selected by condition resolver.
 Stage-2 is filtered out by condition resolver.
 2018-05-23 20:13:35    Starting to launch local task to process map join;  maximum memory = 508559360
 2018-05-23 20:13:36    End of local task; Time Taken: 0.771 sec.
 Execution completed successfully
 MapredLocal task succeeded
 Launching Job 3 out of 6
 Number of reduce tasks is set to 0 since there's no reduce operator
 Hadoop job information for Stage-11: number of mappers: 445; number of reducers: 0
 2018-05-23 20:13:47,631 Stage-11 map = 0%,  reduce = 0%
 2018-05-23 20:14:23,705 Stage-11 map = 1%,  reduce = 0%, Cumulative CPU 4601.59 sec
 2018-05-23 20:14:32,961 Stage-11 map = 2%,  reduce = 0%, Cumulative CPU 6296.4 sec
 2018-05-23 20:14:41,196 Stage-11 map = 3%,  reduce = 0%, Cumulative CPU 7643.17 sec
 2018-05-23 20:14:53,531 Stage-11 map = 4%,  reduce = 0%, Cumulative CPU 9436.65 sec
 2018-05-23 20:15:03,815 Stage-11 map = 5%,  reduce = 0%, Cumulative CPU 11603.96 sec
 2018-05-23 20:15:14,129 Stage-11 map = 6%,  reduce = 0%, Cumulative CPU 13287.69 sec
 2018-05-23 20:15:22,372 Stage-11 map = 7%,  reduce = 0%, Cumulative CPU 14857.01 sec
 2018-05-23 20:15:30,582 Stage-11 map = 8%,  reduce = 0%, Cumulative CPU 16324.26 sec
 2018-05-23 20:15:38,782 Stage-11 map = 9%,  reduce = 0%, Cumulative CPU 17903.47 sec
 2018-05-23 20:15:45,953 Stage-11 map = 10%,  reduce = 0%, Cumulative CPU 19114.98 sec
 2018-05-23 20:15:54,168 Stage-11 map = 11%,  reduce = 0%, Cumulative CPU 20462.52 sec
 2018-05-23 20:16:00,352 Stage-11 map = 12%,  reduce = 0%, Cumulative CPU 21664.48 sec
 2018-05-23 20:16:09,567 Stage-11 map = 13%,  reduce = 0%, Cumulative CPU 23182.39 sec
 2018-05-23 20:16:16,760 Stage-11 map = 14%,  reduce = 0%, Cumulative CPU 24985.56 sec
 2018-05-23 20:16:23,925 Stage-11 map = 15%,  reduce = 0%, Cumulative CPU 26222.6 sec
 2018-05-23 20:16:31,122 Stage-11 map = 16%,  reduce = 0%, Cumulative CPU 27689.36 sec
 2018-05-23 20:16:37,290 Stage-11 map = 17%,  reduce = 0%, Cumulative CPU 28790.46 sec
 2018-05-23 20:16:42,414 Stage-11 map = 18%,  reduce = 0%, Cumulative CPU 29595.99 sec
 2018-05-23 20:16:49,593 Stage-11 map = 19%,  reduce = 0%, Cumulative CPU 31000.63 sec
 2018-05-23 20:16:55,739 Stage-11 map = 20%,  reduce = 0%, Cumulative CPU 32066.58 sec
 2018-05-23 20:17:01,881 Stage-11 map = 21%,  reduce = 0%, Cumulative CPU 33455.96 sec
 2018-05-23 20:17:08,039 Stage-11 map = 22%,  reduce = 0%, Cumulative CPU 34469.7 sec
 2018-05-23 20:17:14,222 Stage-11 map = 23%,  reduce = 0%, Cumulative CPU 35659.33 sec
 2018-05-23 20:17:20,378 Stage-11 map = 24%,  reduce = 0%, Cumulative CPU 36752.24 sec
 2018-05-23 20:17:26,545 Stage-11 map = 25%,  reduce = 0%, Cumulative CPU 37894.69 sec
 2018-05-23 20:17:31,680 Stage-11 map = 26%,  reduce = 0%, Cumulative CPU 38814.68 sec
 2018-05-23 20:17:37,836 Stage-11 map = 27%,  reduce = 0%, Cumulative CPU 39917.47 sec
 2018-05-23 20:17:43,987 Stage-11 map = 28%,  reduce = 0%, Cumulative CPU 41044.08 sec
 2018-05-23 20:17:50,128 Stage-11 map = 29%,  reduce = 0%, Cumulative CPU 42133.94 sec
 2018-05-23 20:17:56,293 Stage-11 map = 30%,  reduce = 0%, Cumulative CPU 43210.77 sec
 2018-05-23 20:18:02,457 Stage-11 map = 31%,  reduce = 0%, Cumulative CPU 44302.57 sec
 2018-05-23 20:18:08,618 Stage-11 map = 32%,  reduce = 0%, Cumulative CPU 45353.76 sec
 2018-05-23 20:18:14,788 Stage-11 map = 33%,  reduce = 0%, Cumulative CPU 46470.85 sec
 2018-05-23 20:18:20,955 Stage-11 map = 34%,  reduce = 0%, Cumulative CPU 47600.43 sec
 2018-05-23 20:18:27,117 Stage-11 map = 35%,  reduce = 0%, Cumulative CPU 48682.28 sec
 2018-05-23 20:18:34,296 Stage-11 map = 36%,  reduce = 0%, Cumulative CPU 49934.44 sec
 2018-05-23 20:18:40,462 Stage-11 map = 37%,  reduce = 0%, Cumulative CPU 51019.56 sec
 2018-05-23 20:18:47,640 Stage-11 map = 38%,  reduce = 0%, Cumulative CPU 52278.9 sec
 2018-05-23 20:18:54,830 Stage-11 map = 39%,  reduce = 0%, Cumulative CPU 53561.27 sec
 2018-05-23 20:19:00,985 Stage-11 map = 40%,  reduce = 0%, Cumulative CPU 54670.49 sec
 2018-05-23 20:19:11,542 Stage-11 map = 41%,  reduce = 0%, Cumulative CPU 56161.62 sec
 2018-05-23 20:19:15,650 Stage-11 map = 42%,  reduce = 0%, Cumulative CPU 57228.98 sec
 2018-05-23 20:19:23,838 Stage-11 map = 43%,  reduce = 0%, Cumulative CPU 58538.29 sec
 2018-05-23 20:19:29,992 Stage-11 map = 44%,  reduce = 0%, Cumulative CPU 59752.35 sec
 2018-05-23 20:19:36,145 Stage-11 map = 45%,  reduce = 0%, Cumulative CPU 60833.42 sec
 2018-05-23 20:19:42,306 Stage-11 map = 46%,  reduce = 0%, Cumulative CPU 61952.92 sec
 2018-05-23 20:19:49,484 Stage-11 map = 47%,  reduce = 0%, Cumulative CPU 63240.14 sec
 2018-05-23 20:19:55,649 Stage-11 map = 48%,  reduce = 0%, Cumulative CPU 64412.93 sec
 2018-05-23 20:20:02,842 Stage-11 map = 49%,  reduce = 0%, Cumulative CPU 65536.58 sec
 2018-05-23 20:20:10,356 Stage-11 map = 50%,  reduce = 0%, Cumulative CPU 66877.52 sec
 2018-05-23 20:20:19,698 Stage-11 map = 51%,  reduce = 0%, Cumulative CPU 68598.9 sec
 2018-05-23 20:20:25,867 Stage-11 map = 52%,  reduce = 0%, Cumulative CPU 69693.92 sec
 2018-05-23 20:20:33,051 Stage-11 map = 53%,  reduce = 0%, Cumulative CPU 70867.07 sec
 2018-05-23 20:20:41,255 Stage-11 map = 54%,  reduce = 0%, Cumulative CPU 72342.41 sec
 2018-05-23 20:20:47,419 Stage-11 map = 55%,  reduce = 0%, Cumulative CPU 73482.16 sec
 2018-05-23 20:20:55,647 Stage-11 map = 56%,  reduce = 0%, Cumulative CPU 74751.77 sec
 2018-05-23 20:21:04,894 Stage-11 map = 57%,  reduce = 0%, Cumulative CPU 76294.79 sec
 2018-05-23 20:21:14,123 Stage-11 map = 58%,  reduce = 0%, Cumulative CPU 77849.29 sec
 2018-05-23 20:21:24,389 Stage-11 map = 59%,  reduce = 0%, Cumulative CPU 79616.28 sec
 2018-05-23 20:21:43,397 Stage-11 map = 61%,  reduce = 0%, Cumulative CPU 82838.05 sec
 2018-05-23 20:21:52,620 Stage-11 map = 62%,  reduce = 0%, Cumulative CPU 84412.33 sec
 2018-05-23 20:22:01,879 Stage-11 map = 63%,  reduce = 0%, Cumulative CPU 85991.96 sec
 2018-05-23 20:22:11,109 Stage-11 map = 64%,  reduce = 0%, Cumulative CPU 87430.72 sec
 2018-05-23 20:22:19,306 Stage-11 map = 65%,  reduce = 0%, Cumulative CPU 88764.07 sec
 2018-05-23 20:22:29,554 Stage-11 map = 66%,  reduce = 0%, Cumulative CPU 90620.16 sec
 2018-05-23 20:22:38,812 Stage-11 map = 67%,  reduce = 0%, Cumulative CPU 92066.45 sec
 2018-05-23 20:22:50,109 Stage-11 map = 68%,  reduce = 0%, Cumulative CPU 93883.67 sec
 2018-05-23 20:23:06,470 Stage-11 map = 69%,  reduce = 0%, Cumulative CPU 96169.2 sec
 2018-05-23 20:23:12,624 Stage-11 map = 70%,  reduce = 0%, Cumulative CPU 97121.51 sec
 2018-05-23 20:23:21,870 Stage-11 map = 71%,  reduce = 0%, Cumulative CPU 98447.23 sec
 2018-05-23 20:23:33,192 Stage-11 map = 72%,  reduce = 0%, Cumulative CPU 99957.59 sec
 2018-05-23 20:23:42,459 Stage-11 map = 73%,  reduce = 0%, Cumulative CPU 101274.39 sec
 2018-05-23 20:23:52,734 Stage-11 map = 74%,  reduce = 0%, Cumulative CPU 102580.07 sec
 2018-05-23 20:24:05,048 Stage-11 map = 75%,  reduce = 0%, Cumulative CPU 104293.63 sec
 2018-05-23 20:24:14,278 Stage-11 map = 76%,  reduce = 0%, Cumulative CPU 105529.64 sec
 2018-05-23 20:24:22,480 Stage-11 map = 77%,  reduce = 0%, Cumulative CPU 106748.48 sec
 2018-05-23 20:24:31,704 Stage-11 map = 78%,  reduce = 0%, Cumulative CPU 107946.75 sec
 2018-05-23 20:24:41,944 Stage-11 map = 79%,  reduce = 0%, Cumulative CPU 109133.42 sec
 2018-05-23 20:24:53,247 Stage-11 map = 80%,  reduce = 0%, Cumulative CPU 110501.49 sec
 2018-05-23 20:25:07,608 Stage-11 map = 81%,  reduce = 0%, Cumulative CPU 111875.66 sec
 2018-05-23 20:25:20,931 Stage-11 map = 82%,  reduce = 0%, Cumulative CPU 113264.94 sec
 2018-05-23 20:25:35,306 Stage-11 map = 83%,  reduce = 0%, Cumulative CPU 114595.5 sec
 2018-05-23 20:25:51,714 Stage-11 map = 84%,  reduce = 0%, Cumulative CPU 116126.83 sec
 2018-05-23 20:26:08,108 Stage-11 map = 85%,  reduce = 0%, Cumulative CPU 117537.44 sec
 2018-05-23 20:26:24,494 Stage-11 map = 86%,  reduce = 0%, Cumulative CPU 118963.82 sec
 2018-05-23 20:26:40,882 Stage-11 map = 87%,  reduce = 0%, Cumulative CPU 120285.9 sec
 2018-05-23 20:27:03,416 Stage-11 map = 88%,  reduce = 0%, Cumulative CPU 121891.57 sec
 2018-05-23 20:27:24,910 Stage-11 map = 89%,  reduce = 0%, Cumulative CPU 123494.62 sec
 2018-05-23 20:27:43,347 Stage-11 map = 90%,  reduce = 0%, Cumulative CPU 124845.46 sec
 2018-05-23 20:28:14,110 Stage-11 map = 91%,  reduce = 0%, Cumulative CPU 126327.84 sec
 2018-05-23 20:28:44,871 Stage-11 map = 92%,  reduce = 0%, Cumulative CPU 127912.78 sec
 2018-05-23 20:29:12,507 Stage-11 map = 93%,  reduce = 0%, Cumulative CPU 129345.83 sec
 2018-05-23 20:29:57,615 Stage-11 map = 94%,  reduce = 0%, Cumulative CPU 130868.17 sec
 2018-05-23 20:30:58,157 Stage-11 map = 95%,  reduce = 0%, Cumulative CPU 132488.85 sec
 2018-05-23 20:31:58,544 Stage-11 map = 95%,  reduce = 0%, Cumulative CPU 133836.96 sec
 2018-05-23 20:32:02,634 Stage-11 map = 96%,  reduce = 0%, Cumulative CPU 133997.82 sec
 2018-05-23 20:32:40,580 Stage-11 map = 97%,  reduce = 0%, Cumulative CPU 135165.46 sec
 2018-05-23 20:33:37,148 Stage-11 map = 98%,  reduce = 0%, Cumulative CPU 136322.97 sec
 2018-05-23 20:34:37,541 Stage-11 map = 98%,  reduce = 0%, Cumulative CPU 137363.62 sec
 2018-05-23 20:35:02,125 Stage-11 map = 99%,  reduce = 0%, Cumulative CPU 137742.39 sec
 2018-05-23 20:36:02,482 Stage-11 map = 99%,  reduce = 0%, Cumulative CPU 138493.75 sec
 2018-05-23 20:37:02,931 Stage-11 map = 99%,  reduce = 0%, Cumulative CPU 138966.65 sec
 2018-05-23 20:38:03,216 Stage-11 map = 99%,  reduce = 0%, Cumulative CPU 139377.26 sec
 2018-05-23 20:39:03,473 Stage-11 map = 99%,  reduce = 0%, Cumulative CPU 139657.54 sec
 2018-05-23 20:39:36,121 Stage-11 map = 100%,  reduce = 0%, Cumulative CPU 139762.81 sec
 MapReduce Total cumulative CPU time: 1 days 14 hours 49 minutes 22 seconds 810 msec
 Stage-5 is selected by condition resolver.
 Stage-4 is filtered out by condition resolver.
 Stage-6 is filtered out by condition resolver.
 MapReduce Jobs Launched: 
 Stage-Stage-1: Map: 12  Reduce: 1   Cumulative CPU: 33.66 sec   HDFS Read: 1894841 HDFS Write: 8548 SUCCESS
 Stage-Stage-11: Map: 445   Cumulative CPU: 139762.81 sec   HDFS Read: 116588319891 HDFS Write: 336399118813 SUCCESS
 Total MapReduce CPU Time Spent: 1 days 14 hours 49 minutes 56 seconds 470 msec
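
The log shows where the cost goes: Stage-11 read 116,588,319,891 bytes from HDFS and wrote 336,399,118,813 bytes, because every input row is repeated once for each date on or after its own. When only the cumulative counts are required, one way to avoid this expansion is to aggregate per day first and then join the small daily table with itself. The query below is a sketch of that alternative, not the method measured above; daily_cnt is an illustrative alias. On the sample data it should return 2, 6 and 9.

```sql
-- Aggregate first (one row per day), then self-join the small result.
select a.dt as dt, sum(b.daily_cnt) as acc
from (select dt, count(1) as daily_cnt from device group by dt) a,
     (select dt, count(1) as daily_cnt from device group by dt) b
where a.dt >= b.dt
group by a.dt
order by dt;
```

The join now operates on one row per date on each side, so its size depends on the number of distinct dates rather than on the number of visit records.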

V. Compute the result

Materialize the cumulative detail into table detail (columns: dt string, dev_id string). Note: dt here is t1.dt.
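
The post does not show the DDL for detail. A minimal sketch that matches the column list above could look like this; the tab delimiter is an assumption copied from the device table, and any storage format would work.

```sql
-- Target table for the cumulative detail (dt holds t1.dt).
create table if not exists detail(dt string, dev_id string)
row format delimited fields terminated by '\t';
```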

 set hive.map.aggr=true;
 set hive.optimize.skewjoin=true;
 set hive.exec.parallel=true;
 set hive.auto.convert.join=true;
 
 insert overwrite table detail
 select /*+ MAPJOIN(t1) */ t1.dt,t2.dev_id from (select dt from device group by dt) t1,device t2 where t1.dt>=t2.dt

Aggregate

 set hive.map.aggr=true;
 set hive.optimize.skewjoin=true;
 set hive.exec.parallel=true;
 set hive.auto.convert.join=true;
 
 select dt,count(1) as acc from detail group by dt order by dt

Final result

dt acc
20180501 2
20180502 6
20180503 9
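
For comparison, the same running total can be expressed with a window function, which Hive has supported since version 0.11. This is a different technique from the self-join used in this post, shown only as a sketch; daily_cnt is an illustrative alias. On the sample data it should produce the same 2, 6 and 9.

```sql
-- Daily counts first, then a running sum ordered by date.
select dt,
       sum(daily_cnt) over (order by dt
                            rows between unbounded preceding and current row) as acc
from (select dt, count(1) as daily_cnt from device group by dt) t
order by dt;
```

This form produces no intermediate detail table, so it only fits cases where the per-row cumulative detail itself is not needed.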
