ORC格式确实要比textFile要更适合于hive,查询速度会提高20-40%左右
youtube1的文件格式是TextFIle,youtube3的文件格式是orc
hive> select videoId,uploader,age,views from youtube1 order by views limit 10;
Query ID = hadoop_20170710085454_6768a540-a0b3-4d98-92a0-f97d4eff8b42
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
Starting Job = job_1499153664137_0070, Tracking URL = http://master:8088/proxy/application_1499153664137_0070/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0070
Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 1
2017-07-10 08:55:18,434 Stage-1 map = 0%, reduce = 0%
2017-07-10 08:55:56,924 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 11.99 sec
...
2017-07-10 08:56:33,719 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 62.7 sec
MapReduce Total cumulative CPU time: 1 minutes 2 seconds 700 msec
Ended Job = job_1499153664137_0070
MapReduce Jobs Launched:
Stage-Stage-1: Map: 6 Reduce: 1 Cumulative CPU: 62.7 sec HDFS Read: 1086704175 HDFS Write: 323 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 2 seconds 700 msec
OK
P4c-EViSRsw ERNESTINEbrowning 1240 0
woMdGKHIg3o Maxwell739 1240 0
_jVH58-X4C4 nelenajolly 1240 0
k1iTl0Kh4DQ rachellelala 1240 0
8LdPZ_n1S4c GohanxVidel21 1240 0
PSex2TAkQC8 Qingy3 1254 0
YC20zP9p_wI aimeenmegan 1254 0
3XduLiQMMTM marshallgovindan 1240 0
IjeXG6yXXZ4 SenateurDupont1973 1254 0
sw4XgF1zkXE bablooian 1240 0
Time taken: 119.739 seconds, Fetched: 10 row(s)
hive> select videoId,uploader,age,views from youtube3 order by views limit 10;
Query ID = hadoop_20170710085959_e6d66799-0a8a-4696-bf93-a0abc3f00de0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
Starting Job = job_1499153664137_0071, Tracking URL = http://master:8088/proxy/application_1499153664137_0071/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0071
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2017-07-10 08:59:43,472 Stage-1 map = 0%, reduce = 0%
2017-07-10 09:00:12,555 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 2.99 sec
...
2017-07-10 09:00:52,499 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 50.2 sec
MapReduce Total cumulative CPU time: 50 seconds 200 msec
Ended Job = job_1499153664137_0071
MapReduce Jobs Launched:
Stage-Stage-1: Map: 4 Reduce: 1 Cumulative CPU: 50.2 sec HDFS Read: 95047649 HDFS Write: 323 SUCCESS
Total MapReduce CPU Time Spent: 50 seconds 200 msec
OK
P4c-EViSRsw ERNESTINEbrowning 1240 0
woMdGKHIg3o Maxwell739 1240 0
_jVH58-X4C4 nelenajolly 1240 0
k1iTl0Kh4DQ rachellelala 1240 0
8LdPZ_n1S4c GohanxVidel21 1240 0
PSex2TAkQC8 Qingy3 1254 0
YC20zP9p_wI aimeenmegan 1254 0
3XduLiQMMTM marshallgovindan 1240 0
IjeXG6yXXZ4 SenateurDupont1973 1254 0
sw4XgF1zkXE bablooian 1240 0
Time taken: 109.776 seconds, Fetched: 10 row(s)
hive> select tagId, count(a.videoid) as sum from (select videoid,tagId from youtube1 lateral view explode(category) catetory as tagId) a group by a.tagId order by sum desc;
Query ID = hadoop_20170710090404_46a79a3d-8863-4390-8898-7f82c5a3b7ab
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 5
Starting Job = job_1499153664137_0072, Tracking URL = http://master:8088/proxy/application_1499153664137_0072/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0072
Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 5
2017-07-10 09:05:04,067 Stage-1 map = 0%, reduce = 0%
2017-07-10 09:05:40,498 Stage-1 map = 6%, reduce = 0%, Cumulative CPU 5.9 sec
...
2017-07-10 09:06:09,681 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 39.56 sec
MapReduce Total cumulative CPU time: 39 seconds 560 msec
Ended Job = job_1499153664137_0072
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
Starting Job = job_1499153664137_0073, Tracking URL = http://master:8088/proxy/application_1499153664137_0073/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0073
Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
2017-07-10 09:06:45,360 Stage-2 map = 0%, reduce = 0%
...
2017-07-10 09:07:42,083 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3.62 sec
MapReduce Total cumulative CPU time: 3 seconds 620 msec
Ended Job = job_1499153664137_0073
MapReduce Jobs Launched:
Stage-Stage-1: Map: 6 Reduce: 5 Cumulative CPU: 39.56 sec HDFS Read: 1086732303 HDFS Write: 1188 SUCCESS
Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 3.62 sec HDFS Read: 8218 HDFS Write: 356 SUCCESS
Total MapReduce CPU Time Spent: 43 seconds 180 msec
OK
Entertainment 1304724
Music 1274825
Comedy 449652
Blogs 447581
People 447581
Film 442109
Animation 442109
Sports 390619
Politics 186753
News 186753
Autos 169883
Vehicles 169883
Howto 124885
Style 124885
Pets 86444
Animals 86444
Travel 82068
Events 82068
Education 54133
Technology 50925
Science 50925
UNA 42928
Nonprofits 16925
Activism 16925
Gaming 10182
Time taken: 201.127 seconds, Fetched: 25 row(s)
hive> select tagId, count(a.videoid) as sum from (select videoid,tagId from youtube3 lateral view
explode(category) catetory as tagId) a group by a.tagId order by sum desc;
Query ID = hadoop_20170710090909_bfddc50b-665d-4296-9475-0cee55058c85
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 3
Starting Job = job_1499153664137_0074, Tracking URL = http://master:8088/proxy/application_1499153664137_0074/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0074
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 3
2017-07-10 09:09:46,075 Stage-1 map = 0%, reduce = 0%
...
2017-07-10 09:10:39,596 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 20.87 sec
MapReduce Total cumulative CPU time: 20 seconds 870 msec
Ended Job = job_1499153664137_0074
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
Starting Job = job_1499153664137_0075, Tracking URL = http://master:8088/proxy/application_1499153664137_0075/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0075
Hadoop job information for Stage-2: number of mappers: 3; number of reducers: 1
2017-07-10 09:11:16,172 Stage-2 map = 0%, reduce = 0%
...
2017-07-10 09:12:01,694 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 4.43 sec
MapReduce Total cumulative CPU time: 4 seconds 430 msec
Ended Job = job_1499153664137_0075
MapReduce Jobs Launched:
Stage-Stage-1: Map: 4 Reduce: 3 Cumulative CPU: 20.87 sec HDFS Read: 47331548 HDFS Write: 996 SUCCESS
Stage-Stage-2: Map: 3 Reduce: 1 Cumulative CPU: 4.43 sec HDFS Read: 9228 HDFS Write: 356 SUCCESS
Total MapReduce CPU Time Spent: 25 seconds 300 msec
OK
Entertainment 1304724
Music 1274825
Comedy 449652
People 447581
Blogs 447581
Film 442109
Animation 442109
Sports 390619
Politics 186753
News 186753
Autos 169883
Vehicles 169883
Style 124885
Howto 124885
Pets 86444
Animals 86444
Travel 82068
Events 82068
Education 54133
Science 50925
Technology 50925
UNA 42928
Nonprofits 16925
Activism 16925
Gaming 10182
Time taken: 177.463 seconds, Fetched: 25 row(s)
我这里的比较是从CPU的耗时进行比较,至于Job的初始化等耗时由于存在不确定性,会影响比较的准确性所以暂时忽略
语句 | youtube1耗时 | youtube3耗时 |
---|---|---|
select videoId,uploader,age,views from youtube1/youtube3 order by views limit 10; | Total MapReduce CPU Time Spent: 1 minutes 2 seconds 700 msec(63s) | Total MapReduce CPU Time Spent: 50 seconds 200 msec(51s) |
select tagId, count(a.videoid) as sum from (select videoid,tagId from youtube1/youtube3 lateral view explode(category) catetory as tagId) a group by a.tagId order by sum desc; | Total MapReduce CPU Time Spent: 43 seconds 180 msec(44s) | Total MapReduce CPU Time Spent: 25 seconds 300 msec(26s) |
…
由上面表格我们可以看出以orc作为文件格式的表格的查询速度会比以textfile为文件格式的表格要提升20-40%左右
如果在数据量比较大的情况下,也就是执行一个job时间比较长,那么在使用order by 进行topn之前先用distribute by 和 sort by 先局部进行topN,然后最终结果在使用order by效果会更好.如果你的job执行起来也就是几分钟的话,那就不建议使用,因为先使用distribute by和sort by 先 topN 再对结果order by 要启动三个 job来完成,MR的启动时间有点长,这是没必要的开销,比直接使用order by花费更多时间
hive> select * from youtube3 order by views limit 10;
Query ID = hadoop_20170710103535_4de243fb-f26b-42ab-8cc9-aef568b4a5cb
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
Starting Job = job_1499153664137_0094, Tracking URL = http://master:8088/proxy/application_1499153664137_0094/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0094
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2017-07-10 10:35:55,171 Stage-1 map = 0%, reduce = 0%
2017-07-10 10:36:33,748 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 24.59 sec
...
2017-07-10 10:37:45,584 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 114.3 sec
MapReduce Total cumulative CPU time: 1 minutes 54 seconds 300 msec
Ended Job = job_1499153664137_0094
MapReduce Jobs Launched:
Stage-Stage-1: Map: 4 Reduce: 1 Cumulative CPU: 114.3 sec HDFS Read: 574632526 HDFS Write: 2016 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 54 seconds 300 msec
OK
P4c-EViSRsw ERNESTINEbrowning 1240 ["Entertainment"] 66 0 0.0 0 0 ["b0b7yT9e7hM","UEq74FEphvc","PjQ2y_af08Y","myj6exBy5IQ","vKMljsGZluc","l4FzM1qImdg","mXw1UE2mOuk","gH_K5iqSdgA","InHt3_nkT7s","h3gMHd0XKnE","fXRt4ua3UsQ","T6-5K58vhl4","P1iwvBnOoHg","QAg12t6UvcQ","jy7zqy_Dzic","emUoDRc98ig","g66xEcC9kWY","AbYjCb1jHP8","RF0aRkdtiAM","BURRJkwFkqc"]
woMdGKHIg3o Maxwell739 1240 ["Pets","Animals"] 103 0 0.0 0 0 ["hSSqMOL6ThE","c2k9YCkcoaU","9idmedIj5pk","AUokp8o5aZc","Vhc9bGFKhds","LNQI20dunF0","BOlo7q7MSZA","9cGYnRGqgyM","_oO1WPlJUxc","XMqP8E6kfdY","qnR6BnVByVA","GkGd4RyWMYU","VHt8IZSDU1o","Z83w2lIJZVM","jmnugMwoP1Q","llfOxhnlrJU","JXUmvz4cIRA","Q8nHWUG7aAA","dqQOVI6hDCI","m80Qz8nwmHI"]
_jVH58-X4C4 nelenajolly 1240 ["Entertainment"] 205 0 0.0 0 0 ["VkNejMz7FeI","3AfW6V5EAWY","y8i8U9Ni6Dc","_I3Aeh_HHEg","sZBoBpxnoY8","fheFnYcUxLs","MzFWXWkmvz8","OQDRuqWsnp8","qGW43kf2QEU","u79DsR40LOA","X4TtK-3hryw","ANL8q16edOE","OMLKjvcs5EY","qohfnkV8J3Y","jLEFgyU6TdI","cEmyMSvPvVg","-1B8IkZkGOs","STUi3-cV4RQ","-IU8xCjTNW8","-OfnDq4HLPw"]
k1iTl0Kh4DQ rachellelala 1240 ["People","Blogs"] 48 0 0.0 0 0 []
8LdPZ_n1S4c GohanxVidel21 1240 ["Entertainment"] 271 0 0.0 0 0 ["Dt-0QZX5MII","K34GSodojaM","g6hYT815-9I","r5aYMYBEyPo","Bp_JAed6uTc","qe3PF-KyjCI","nWvwSYIUMpE","LxKS1beypxo","5ZmqPCjpmik","m_wYIbkkizc","9xDjgav1pkk","w5WJKrLR78M","nxj_3D3lIQ0","VSEV1IXg2hQ","gwEJPELAJiM","qDrBEST48_I","LymqYIU3E4U","2bhoPxDzU3w","nr_-AyOyfqk","Cg4qrZG1uVc"]
PSex2TAkQC8 Qingy3 1254 ["Entertainment"] 300 0 0.0 0 0 ["FOs2-r_Fikg","I2TrAyBGGcY","iHws_tdiK64","Yfr_wFmWHTI","FK-Z56YqTsU","Rmfy5-dPl6E","nsfOPbS2bSk","tz59v4WgXhA","ggPoCRixQvQ","hvtWJxJn_GI","qKD6VMiCSlI","O9pFAKOON7o","MZta_rn6o48","y2JV5Hs4WYg","kzP6HA-8MmA","60AR_R_UklQ","tNDVwd00Mow","N0lmZqXKJUw","dYxKJNYpjSA","EWBW5QuAedY"]
YC20zP9p_wI aimeenmegan 1254 ["Film","Animation"] 138 0 0.0 0 0 ["AuuCpfU1aNA","pfeU2WJPlnc","tdxYPHX7mhs","Vig624KxPPQ","mFWhAjQdjyI","7rbpfmqKfKI","Of8uv_etw1w","iuzYVo8L7KU","M_g220ZmoRo","oJ2YCvmg24M","PmxQP2h6Qvg","1u4HgPlGqrg","mGdoxToj5IA","hAnjWHP7KAc","rK66jzIvynM","RgK4BaBnIBU","B9Yp1lD4Ohg","vFRHyRmZ8c0","ZFoswQiXhwU","ptKLZlhz2SI"]
3XduLiQMMTM marshallgovindan 1240 ["Education"] 545 0 0.0 0 0 []
IjeXG6yXXZ4 SenateurDupont1973 1254 ["Howto","Style"] 467 0 0.0 0 0 []
sw4XgF1zkXE bablooian 1240 ["People","Blogs"] 47 0 0.0 0 0 []
Time taken: 153.26 seconds, Fetched: 10 row(s)
hive> select a.* from (select * from youtube3 distribute by views sort by views limit 10) a order by views limit 10;
Query ID = hadoop_20170710104040_52c1a6e7-0dc7-4359-82d0-e566c536947a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
Starting Job = job_1499153664137_0095, Tracking URL = http://master:8088/proxy/application_1499153664137_0095/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0095
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 2
2017-07-10 10:41:35,382 Stage-1 map = 0%, reduce = 0%
...
2017-07-10 10:44:04,783 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 116.58 sec
MapReduce Total cumulative CPU time: 1 minutes 56 seconds 580 msec
Ended Job = job_1499153664137_0095
Launching Job 2 out of 3
Number of reduce tasks determined at compile time: 1
Starting Job = job_1499153664137_0096, Tracking URL = http://master:8088/proxy/application_1499153664137_0096/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0096
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-07-10 10:44:39,421 Stage-2 map = 0%, reduce = 0%
2017-07-10 10:45:08,358 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.12 sec
2017-07-10 10:45:28,180 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.68 sec
MapReduce Total cumulative CPU time: 2 seconds 680 msec
Ended Job = job_1499153664137_0096
Launching Job 3 out of 3
Number of reduce tasks determined at compile time: 1
Starting Job = job_1499153664137_0097, Tracking URL = http://master:8088/proxy/application_1499153664137_0097/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0097
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2017-07-10 10:46:05,072 Stage-3 map = 0%, reduce = 0%
2017-07-10 10:46:33,177 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.07 sec
2017-07-10 10:47:00,963 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 2.4 sec
MapReduce Total cumulative CPU time: 2 seconds 400 msec
Ended Job = job_1499153664137_0097
MapReduce Jobs Launched:
Stage-Stage-1: Map: 4 Reduce: 2 Cumulative CPU: 116.58 sec HDFS Read: 574635698 HDFS Write: 5128 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.68 sec HDFS Read: 11354 HDFS Write: 2314 SUCCESS
Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 2.4 sec HDFS Read: 9143 HDFS Write: 1965 SUCCESS
Total MapReduce CPU Time Spent: 2 minutes 1 seconds 660 msec
OK
dTPF85D4tEs Antoinnette71 1240 ["Entertainment"] 73 0 0.0 0 0 []
5e_flEf6uoU alexandreetmartin 1254 ["Comedy"] 137 0 0.0 0 0 ["e8F9txsHzQI","nibCFP4OB74","tH-7yll1t24","T5K17x93aMM","KxSCMsBipAM","JYEsGC71bmo","PAgQcF7eOt0","-jpkLH4RQ7I","UsY-Iu8QsqA","ru48hMHNZ08","D4aHmwmEBkw","vVYTYnT8zXE","wBcaCI0UMAE","qs1Queu8tGE","zsLRrBHIlZc","8lmWTQZtOsY","ztunItmPYzM","O8KUVYl8_wI","yNYTg9aZ1FY","MPF3JwAgNTY"]
I3bMdoqH2g4 lapino50 1254 ["Comedy"] 131 0 0.0 0 0 []
AvdoZL0U0Ko kristianbreen 1240 ["Comedy"] 121 0 0.0 0 0 []
9SAe80p-U48 BobRucklepuckle 1240 ["Gaming"] 170 0 0.0 0 0 ["z04IyYFXfP0","jcaDekCI408","bV3XZe0Az34","O7aYaJ8PtCs","I07LmzF46a8","zEl2frWgqdU","zBA-h4uM3cY","0t_mN96jRb8","bahgug-PFPs","VGDOZiVRUtg","OaXE6pMEsZc","A2Is6NIEuso","hNtqpVP5st4","TxZ2lzPbuDA","UFSc3rrAYLQ","58DEKFeWh4I","YDEQIbcDO1M","o9dyNqactV8","Usb-fZm56GM","XvV1yvSPJ_s"]
P2cQ-W9J99A 636141 1254 ["Comedy"] 77 0 0.0 0 0 ["74sW9OHvkCg","onKIV_-f6ew","XtHYdkH1flg","-SM5P5lz86A","zpz3HFBQR2o","U9lOI7n8sB4","w99Q5PLtjjE","5N6Cchi9Gtw","p9Bn2NSfzHw","DR2IaxcAAUU","BiGZd4BPlqw","0GMzzp_nJbw","e2KVuzAcsv0","jEz_k247tws","OEt50_BdIpQ","MV95TjaXkX8","c8lyUi8QKWs","P1OXAQHv09E","gtp8fgYhp50","L5-0k1HHh_k"]
JobgU5-dFy4 singindancinfool 1254 ["Entertainment"] 184 0 0.0 0 0 ["7MFG_yaFlvk","c--loq6HsDQ","CyrZR3BSH5o","Wni8rmr7rzs","XvFfRpQS4GU","ENSiblZIMV0","0enG7lzuaqM","c8y-zzaI4g8","Czfk0W888V0","dc1ls-qQrYg","3ATYGqE6Hww","U9DPjKzumYw","esEiXSOjs2c","WRCpHFSYXdA","32YyZMxFrBA","Abg7uBdwGJM","BqELlVTMv8o","2tbj33kpBig","LronHxxGti8","NtSl2jdPezI"]
YC20zP9p_wI aimeenmegan 1254 ["Film","Animation"] 138 0 0.0 0 0 ["AuuCpfU1aNA","pfeU2WJPlnc","tdxYPHX7mhs","Vig624KxPPQ","mFWhAjQdjyI","7rbpfmqKfKI","Of8uv_etw1w","iuzYVo8L7KU","M_g220ZmoRo","oJ2YCvmg24M","PmxQP2h6Qvg","1u4HgPlGqrg","mGdoxToj5IA","hAnjWHP7KAc","rK66jzIvynM","RgK4BaBnIBU","B9Yp1lD4Ohg","vFRHyRmZ8c0","ZFoswQiXhwU","ptKLZlhz2SI"]
vBGEkIG8Zdk hachem44 1254 ["Entertainment"] 122 0 0.0 0 0 []
P4eiE33eqOU BsmStudio 1254 ["Comedy"] 287 0 0.0 0 0 ["1_ao3c5jicU","RRMAZj-msOc","X04hwJyaO_4","4oee8Zn3hHI","60W-uUYnj3w","FtoJoBp03r4","_IMSMYLM0zs","22Sz120MAT8","F-B35V6Sszk","1AExOXwL0Ag","owRgISGt_iw","Wz0ZfHbPC-c","b7uLUY-FdKM","yevHaMC10IE","jGvBLJfEIcs","Ma4Ozief4k8","ZwnLErETvKU","oE4oeyRMcKI","IQnEUxtk-Pw"]
Time taken: 368.489 seconds, Fetched: 10 row(s)
语句 | 直接使用order by | 先使用distribute by和sort by再使用order by |
---|---|---|
select * from youtube3 order by views limit 10; | Total MapReduce CPU Time Spent: 1 minutes 54 seconds 300 msec(115s) Time taken: 153.26 seconds, Fetched: 10 row(s) |
|
select a.* from (select * from youtube3 distribute by views sort by views limit 10) a order by views limit 10; | Total MapReduce CPU Time Spent: 2 minutes 1 seconds 660 msec(122s) Time taken: 368.489 seconds, Fetched: 10 row(s) |
…
由上面表格我们可以看出,使用distribute by和sort by 来优化order by 对于使用场景是有要求的,并不是拿来就用的,而是使用于数据量较大的场景,要不然单单是启动多两个job的时间就够你喝一壶了。数据量小直接使用order by即可。
hive> set hive.auto.convert.join = false;
hive> set hive.auto.convert.join.noconditionaltask = false;
hive> set hive.optimize.bucketmapjoin=false;
查看表格结构:
hive> desc formatted user;
# col_name data_type comment
uploader string
videos int
friends int
# Detailed Table Information
Database: default
Owner: hadoop
CreateTime: Fri Jul 07 13:46:53 CST 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://cluster1/user/hive/warehouse/user
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles 1
numRows 1192676
rawDataSize 121652944
totalSize 9783258
transient_lastDdlTime 1499406528
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: 24
Bucket Columns: [uploader]
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.087 seconds, Fetched: 34 row(s)
hive> desc formatted youtube3;
# col_name data_type comment
videoid string
uploader string
age int
category array<string>
length int
views int
rate float
ratings int
comments int
relatedid array<string>
# Detailed Table Information
Database: default
Owner: hadoop
CreateTime: Fri Jul 07 13:28:56 CST 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://cluster1/user/hive/warehouse/youtube3
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles 6
numRows 5134636
rawDataSize 7949154631
totalSize 574644356
transient_lastDdlTime 1499405660
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: 8
Bucket Columns: [uploader]
Sort Columns: []
Storage Desc Params:
colelction.delim &
field.delim \t
serialization.format \t
Time taken: 0.086 seconds, Fetched: 42 row(s)
实例测试:
hive> select a.uploader,a.friends,b.videoId from user a join youtube3 b on a.uploader = b.uploader limit 10;
Query ID = hadoop_20170710092626_a2f303f3-8ee5-4747-a6c0-96ccb5d2f0d7
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 3
Starting Job = job_1499153664137_0077, Tracking URL = http://master:8088/proxy/application_1499153664137_0077/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0077
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 3
2017-07-10 09:26:41,012 Stage-1 map = 0%, reduce = 0%
...
2017-07-10 09:27:49,553 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 66.63 sec
MapReduce Total cumulative CPU time: 1 minutes 6 seconds 630 msec
Ended Job = job_1499153664137_0077
MapReduce Jobs Launched:
Stage-Stage-1: Map: 5 Reduce: 3 Cumulative CPU: 66.63 sec HDFS Read: 86211567 HDFS Write: 698 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 6 seconds 630 msec
OK
a0000a 7 6NmdrPmWjSU
a00393977 0 _EEwnI7pCCk
a007plan 0 VlDabTbGLNc
a02030203 0 AaY3pbdfbUM
a02030203 0 P28i2Xu4WB8
a02030203 0 LFzKFOWOHBg
a042538 14 YHu_VlY_p_E
a042538 14 8ECefR4g0Tg
a04297 0 2rqgJ1FNflo
a04297 0 iT_yac8DcQk
Time taken: 109.44 seconds, Fetched: 10 row(s)
hive> set hive.auto.convert.join = true;
hive> set hive.auto.convert.join.noconditionaltask = true;
hive> select a.uploader,a.friends,b.videoId from user a join youtube3 b on a.uploader = b.uploader limit 10;
Query ID = hadoop_20170710093232_6b9bf92d-5d02-4ce6-bb35-a58625a8f03b
Total jobs = 1
17/07/10 09:32:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Execution log at: /tmp/hadoop/hadoop_20170710093232_6b9bf92d-5d02-4ce6-bb35-a58625a8f03b.log
2017-07-10 09:32:37 Starting to launch local task to process map join; maximum memory = 518979584
2017-07-10 09:32:50 Processing rows: 200000 Hashtable size: 199999 Memory usage: 76725768 percentage: 0.148
...
2017-07-10 09:32:56 Processing rows: 1100000 Hashtable size: 1099999 Memory usage: 316646056 percentage: 0.61
2017-07-10 09:32:56 Dump the side-table for tag: 0 with group count: 1192676 into file: file:/tmp/hadoop/81308db8-add0-47bf-9a70-ae8e83f4f05c/hive_2017-07-10_09-32-30_881_1416539360570079360-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile160--.hashtable
2017-07-10 09:32:59 Uploaded 1 File to: file:/tmp/hadoop/81308db8-add0-47bf-9a70-ae8e83f4f05c/hive_2017-07-10_09-32-30_881_1416539360570079360-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile160--.hashtable (36232906 bytes)
2017-07-10 09:32:59 End of local task; Time Taken: 21.563 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1499153664137_0079, Tracking URL = http://master:8088/proxy/application_1499153664137_0079/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0079
Hadoop job information for Stage-3: number of mappers: 4; number of reducers: 0
2017-07-10 09:33:42,458 Stage-3 map = 0%, reduce = 0%
2017-07-10 09:34:16,719 Stage-3 map = 50%, reduce = 0%, Cumulative CPU 13.33 sec
2017-07-10 09:34:27,090 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 27.05 sec
MapReduce Total cumulative CPU time: 27 seconds 50 msec
Ended Job = job_1499153664137_0079
MapReduce Jobs Launched:
Stage-Stage-3: Map: 4 Cumulative CPU: 27.05 sec HDFS Read: 20408350 HDFS Write: 1094 SUCCESS
Total MapReduce CPU Time Spent: 27 seconds 50 msec
OK
maxmagpies 22 9UagGiEP_kU
lilsteps97 2 hKE3F0cLl_M
cokerish 6 mZpSi9Sfs2k
mikecrazy03 17 a-nkbgF3Wcw
xoPaiigexo 21 CmCl_hSXM80
r4nd0mUs3r 133 2Yke3hdSeJQ
popkorn615 39 sc88PbADsk8
popkorn615 39 bGNzyCCK2Uo
tuxkiller007 0 s1n2y3c2z9k
popkorn615 39 norcLxmpS7o
Time taken: 118.299 seconds, Fetched: 10 row(s)
hive> set hive.optimize.bucketmapjoin=true;
hive> select a.uploader,a.friends,b.videoId from user a join youtube3 b on a.uploader = b.uploader limit 10;
Query ID = hadoop_20170710093535_dddfa9db-cae4-4a9d-bbe0-0dba73ee1cb5
Total jobs = 1
17/07/10 09:36:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Execution log at: /tmp/hadoop/hadoop_20170710093535_dddfa9db-cae4-4a9d-bbe0-0dba73ee1cb5.log
2017-07-10 09:36:05 Starting to launch local task to process map join; maximum memory = 518979584
2017-07-10 09:36:18 Processing rows: 200000 Hashtable size: 199999 Memory usage: 76897640 percentage: 0.148
...
2017-07-10 09:36:23 Processing rows: 1100000 Hashtable size: 1099999 Memory usage: 315574944 percentage: 0.608
2017-07-10 09:36:23 Dump the side-table for tag: 0 with group count: 1192676 into file: file:/tmp/hadoop/81308db8-add0-47bf-9a70-ae8e83f4f05c/hive_2017-07-10_09-35-58_131_4163936080161520499-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile170--.hashtable
2017-07-10 09:36:26 Uploaded 1 File to: file:/tmp/hadoop/81308db8-add0-47bf-9a70-ae8e83f4f05c/hive_2017-07-10_09-35-58_131_4163936080161520499-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile170--.hashtable (36232906 bytes)
2017-07-10 09:36:26 End of local task; Time Taken: 21.374 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1499153664137_0080, Tracking URL = http://master:8088/proxy/application_1499153664137_0080/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0080
Hadoop job information for Stage-3: number of mappers: 4; number of reducers: 0
2017-07-10 09:37:09,550 Stage-3 map = 0%, reduce = 0%
2017-07-10 09:37:43,845 Stage-3 map = 25%, reduce = 0%, Cumulative CPU 12.14 sec
2017-07-10 09:37:44,884 Stage-3 map = 50%, reduce = 0%, Cumulative CPU 12.58 sec
2017-07-10 09:37:54,187 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 25.41 sec
MapReduce Total cumulative CPU time: 25 seconds 410 msec
Ended Job = job_1499153664137_0080
MapReduce Jobs Launched:
Stage-Stage-3: Map: 4 Cumulative CPU: 25.41 sec HDFS Read: 20408350 HDFS Write: 1094 SUCCESS
Total MapReduce CPU Time Spent: 25 seconds 410 msec
OK
maxmagpies 22 9UagGiEP_kU
lilsteps97 2 hKE3F0cLl_M
cokerish 6 mZpSi9Sfs2k
mikecrazy03 17 a-nkbgF3Wcw
xoPaiigexo 21 CmCl_hSXM80
r4nd0mUs3r 133 2Yke3hdSeJQ
popkorn615 39 sc88PbADsk8
popkorn615 39 bGNzyCCK2Uo
tuxkiller007 0 s1n2y3c2z9k
popkorn615 39 norcLxmpS7o
Time taken: 117.119 seconds, Fetched: 10 row(s)
使用语句:select a.uploader,a.friends,b.videoId from user a join youtube3 b on a.uploader = b.uploader limit 10;
配置 | 开启 | 关闭 |
---|---|---|
map join | (将数据放到缓存中)End of local task; Time Taken: 21.563 sec. Total MapReduce CPU Time Spent: 27 seconds 50 msec CPU总耗时(22+28 = 50s) |
Total MapReduce CPU Time Spent: 1 minutes 6 seconds 630 msec CPU总耗时66s |
bucket map join | End of local task; Time Taken: 21.374 sec. Total MapReduce CPU Time Spent: 25 seconds 410 msec CPU总耗时(22+26)48s |
End of local task; Time Taken: 21.563 sec. Total MapReduce CPU Time Spent: 27 seconds 50 msec CPU总耗时(22+28)50s |
…
由上面数据我们可以看出,map join在cpu耗时方面还是有所减少,而bucket map join相对于map join的提升就不明显,可能原因还是在于数据量问题,我们可以看出map join相对于普通的reduce join执行速度还是有提升的,而且数据量越大的时候又是会更明显。但是数据量不大的情况,有文件的读取上传等操作导致他的优势不明显,可能反而最终耗时更长。
hive> set hive.merge.mapfiles=false;
hive> set hive.merge.mapredfiles = false;
hive> select tagId, count(a.videoid) as sum from (select videoid,tagId from youtube3 lateral view explode(category) catetory as tagId) a group by a.tagId order by sum desc;
Query ID = hadoop_20170710095252_9c97d548-c734-4df3-8d78-541c7e4ec66f
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1499153664137_0082, Tracking URL = http://master:8088/proxy/application_1499153664137_0082/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0082
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 3
2017-07-10 09:53:26,240 Stage-1 map = 0%, reduce = 0%
2017-07-10 09:53:54,547 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 2.05 sec
...
2017-07-10 09:54:22,238 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 21.08 sec
MapReduce Total cumulative CPU time: 21 seconds 80 msec
Ended Job = job_1499153664137_0082
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
Starting Job = job_1499153664137_0083, Tracking URL = http://master:8088/proxy/application_1499153664137_0083/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0083
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-07-10 09:54:59,413 Stage-2 map = 0%, reduce = 0%
...
2017-07-10 09:55:56,728 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.43 sec
MapReduce Total cumulative CPU time: 2 seconds 430 msec
Ended Job = job_1499153664137_0083
MapReduce Jobs Launched:
Stage-Stage-1: Map: 4 Reduce: 3 Cumulative CPU: 21.08 sec HDFS Read: 47331544 HDFS Write: 996 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.43 sec HDFS Read: 5826 HDFS Write: 356 SUCCESS
Total MapReduce CPU Time Spent: 23 seconds 510 msec
OK
Entertainment 1304724
Music 1274825
Comedy 449652
People 447581
Blogs 447581
Film 442109
Animation 442109
Sports 390619
Politics 186753
News 186753
Autos 169883
Vehicles 169883
Style 124885
Howto 124885
Animals 86444
Pets 86444
Travel 82068
Events 82068
Education 54133
Science 50925
Technology 50925
UNA 42928
Nonprofits 16925
Activism 16925
Gaming 10182
Time taken: 196.576 seconds, Fetched: 25 row(s)
hive> set hive.merge.mapfiles=true;
hive> set hive.merge.mapredfiles=true;
hive> select tagId, count(a.videoid) as sum from (select videoid,tagId from youtube3 lateral view explode(category) catetory as tagId) a group by a.tagId order by sum desc;
Query ID = hadoop_20170710090909_bfddc50b-665d-4296-9475-0cee55058c85
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 3
Starting Job = job_1499153664137_0074, Tracking URL = http://master:8088/proxy/application_1499153664137_0074/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0074
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 3
2017-07-10 09:09:46,075 Stage-1 map = 0%, reduce = 0%
2017-07-10 09:10:15,149 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 1.99 sec
...
2017-07-10 09:10:39,596 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 20.87 sec
MapReduce Total cumulative CPU time: 20 seconds 870 msec
Ended Job = job_1499153664137_0074
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
Starting Job = job_1499153664137_0075, Tracking URL = http://master:8088/proxy/application_1499153664137_0075/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0075
Hadoop job information for Stage-2: number of mappers: 3; number of reducers: 1
2017-07-10 09:11:16,172 Stage-2 map = 0%, reduce = 0%
...
2017-07-10 09:12:01,694 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 4.43 sec
MapReduce Total cumulative CPU time: 4 seconds 430 msec
Ended Job = job_1499153664137_0075
MapReduce Jobs Launched:
Stage-Stage-1: Map: 4 Reduce: 3 Cumulative CPU: 20.87 sec HDFS Read: 47331548 HDFS Write: 996 SUCCESS
Stage-Stage-2: Map: 3 Reduce: 1 Cumulative CPU: 4.43 sec HDFS Read: 9228 HDFS Write: 356 SUCCESS
Total MapReduce CPU Time Spent: 25 seconds 300 msec
OK
Entertainment 1304724
Music 1274825
Comedy 449652
People 447581
Blogs 447581
Film 442109
Animation 442109
Sports 390619
Politics 186753
News 186753
Autos 169883
Vehicles 169883
Style 124885
Howto 124885
Pets 86444
Animals 86444
Travel 82068
Events 82068
Education 54133
Science 50925
Technology 50925
UNA 42928
Nonprofits 16925
Activism 16925
Gaming 10182
Time taken: 177.463 seconds, Fetched: 25 row(s)
reduce数太多或者太少都不好,太多的话会导致后续的job的map数过多,太少会导致浪费资源
reduce数由这三个参数来决定
设置reduces
hive> set mapreduce.job.reduces;
mapreduce.job.reduces=-1
结果如下:
hive> select tagId, count(a.videoid) as sum from (select videoid,tagId from youtube3 lateral view explode(category) catetory as tagId) a group by a.tagId order by sum desc;
Query ID = hadoop_20170710095252_9c97d548-c734-4df3-8d78-541c7e4ec66f
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1499153664137_0082, Tracking URL = http://master:8088/proxy/application_1499153664137_0082/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0082
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 3
2017-07-10 09:53:26,240 Stage-1 map = 0%, reduce = 0%
2017-07-10 09:53:54,547 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 2.05 sec
...
2017-07-10 09:54:22,238 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 21.08 sec
MapReduce Total cumulative CPU time: 21 seconds 80 msec
Ended Job = job_1499153664137_0082
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1499153664137_0083, Tracking URL = http://master:8088/proxy/application_1499153664137_0083/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0083
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-07-10 09:54:59,413 Stage-2 map = 0%, reduce = 0%
2017-07-10 09:55:27,659 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.1 sec
2017-07-10 09:55:56,728 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.43 sec
MapReduce Total cumulative CPU time: 2 seconds 430 msec
Ended Job = job_1499153664137_0083
MapReduce Jobs Launched:
Stage-Stage-1: Map: 4 Reduce: 3 Cumulative CPU: 21.08 sec HDFS Read: 47331544 HDFS Write: 996 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.43 sec HDFS Read: 5826 HDFS Write: 356 SUCCESS
Total MapReduce CPU Time Spent: 23 seconds 510 msec
OK
Entertainment 1304724
Music 1274825
Comedy 449652
People 447581
Blogs 447581
Film 442109
Animation 442109
Sports 390619
Politics 186753
News 186753
Autos 169883
Vehicles 169883
Style 124885
Howto 124885
Animals 86444
Pets 86444
Travel 82068
Events 82068
Education 54133
Science 50925
Technology 50925
UNA 42928
Nonprofits 16925
Activism 16925
Gaming 10182
Time taken: 196.576 seconds, Fetched: 25 row(s)
hive> set hive.merge.mapfiles=true;
hive> set hive.merge.mapredfiles=true;
hive> select tagId, count(a.videoid) as sum from (select videoid,tagId from youtube3 lateral view explode(category) catetory as tagId) a group by a.tagId order by sum desc;
Query ID = hadoop_20170710095757_8fc6f5e5-6b3a-4618-b8b0-57ec16ea0225
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1499153664137_0084, Tracking URL = http://master:8088/proxy/application_1499153664137_0084/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0084
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 3
2017-07-10 09:58:31,781 Stage-1 map = 0%, reduce = 0%
...
2017-07-10 09:59:24,606 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 20.81 sec
MapReduce Total cumulative CPU time: 20 seconds 810 msec
Ended Job = job_1499153664137_0084
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1499153664137_0085, Tracking URL = http://master:8088/proxy/application_1499153664137_0085/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0085
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-07-10 10:00:00,596 Stage-2 map = 0%, reduce = 0%
2017-07-10 10:00:29,530 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.08 sec
2017-07-10 10:00:58,684 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.67 sec
MapReduce Total cumulative CPU time: 2 seconds 670 msec
Ended Job = job_1499153664137_0085
MapReduce Jobs Launched:
Stage-Stage-1: Map: 4 Reduce: 3 Cumulative CPU: 20.81 sec HDFS Read: 47331712 HDFS Write: 996 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.67 sec HDFS Read: 5826 HDFS Write: 356 SUCCESS
Total MapReduce CPU Time Spent: 23 seconds 480 msec
OK
Entertainment 1304724
Music 1274825
Comedy 449652
People 447581
Blogs 447581
Film 442109
Animation 442109
Sports 390619
Politics 186753
News 186753
Autos 169883
Vehicles 169883
Style 124885
Howto 124885
Animals 86444
Pets 86444
Travel 82068
Events 82068
Education 54133
Science 50925
Technology 50925
UNA 42928
Nonprofits 16925
Activism 16925
Gaming 10182
Time taken: 187.501 seconds, Fetched: 25 row(s)
结果如下:
hive> set mapreduce.job.reduces=3;
hive> select tagId, count(a.videoid) as sum from (select videoid,tagId from youtube3 lateral view explode(category) catetory as tagId) a group by a.tagId order by sum desc;
Query ID = hadoop_20170710100303_395d684c-fdd7-4ac6-a1f1-d7aa9269dd2a
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1499153664137_0086, Tracking URL = http://master:8088/proxy/application_1499153664137_0086/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0086
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 3
2017-07-10 10:04:03,317 Stage-1 map = 0%, reduce = 0%
2017-07-10 10:04:32,522 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 2.0 sec
2017-07-10 10:04:36,916 Stage-1 map = 34%, reduce = 0%, Cumulative CPU 6.11 sec
2017-07-10 10:04:41,375 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 7.62 sec
2017-07-10 10:04:48,683 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 16.05 sec
2017-07-10 10:04:50,754 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 17.43 sec
2017-07-10 10:04:53,926 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 19.1 sec
2017-07-10 10:04:57,047 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 20.75 sec
MapReduce Total cumulative CPU time: 20 seconds 750 msec
Ended Job = job_1499153664137_0086
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1499153664137_0087, Tracking URL = http://master:8088/proxy/application_1499153664137_0087/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0087
Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
2017-07-10 10:05:32,722 Stage-2 map = 0%, reduce = 0%
2017-07-10 10:05:59,576 Stage-2 map = 50%, reduce = 0%, Cumulative CPU 0.93 sec
2017-07-10 10:06:00,604 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 2.03 sec
2017-07-10 10:06:19,183 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3.61 sec
MapReduce Total cumulative CPU time: 3 seconds 610 msec
Ended Job = job_1499153664137_0087
MapReduce Jobs Launched:
Stage-Stage-1: Map: 4 Reduce: 3 Cumulative CPU: 20.75 sec HDFS Read: 47331712 HDFS Write: 996 SUCCESS
Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 3.61 sec HDFS Read: 7527 HDFS Write: 356 SUCCESS
Total MapReduce CPU Time Spent: 24 seconds 360 msec
OK
Entertainment 1304724
Music 1274825
Comedy 449652
People 447581
Blogs 447581
Film 442109
Animation 442109
Sports 390619
Politics 186753
News 186753
Autos 169883
Vehicles 169883
Style 124885
Howto 124885
Pets 86444
Animals 86444
Travel 82068
Events 82068
Education 54133
Technology 50925
Science 50925
UNA 42928
Nonprofits 16925
Activism 16925
Gaming 10182
Time taken: 177.949 seconds, Fetched: 25 row(s)
hive> set mapreduce.job.reduces=5;
hive> select tagId, count(a.videoid) as sum from (select videoid,tagId from youtube3 lateral view explode(category) catetory as tagId) a group by a.tagId order by sum desc;
Query ID = hadoop_20170710100909_13293e63-288b-43d2-aa71-b50fd3992a9d
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Defaulting to jobconf value of: 5
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1499153664137_0088, Tracking URL = http://master:8088/proxy/application_1499153664137_0088/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0088
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 5
2017-07-10 10:10:23,022 Stage-1 map = 0%, reduce = 0%
2017-07-10 10:10:54,170 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 3.81 sec
2017-07-10 10:10:56,290 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 9.27 sec
2017-07-10 10:10:58,461 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 11.33 sec
2017-07-10 10:11:05,880 Stage-1 map = 88%, reduce = 0%, Cumulative CPU 14.86 sec
2017-07-10 10:11:07,987 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 15.2 sec
2017-07-10 10:11:14,235 Stage-1 map = 100%, reduce = 20%, Cumulative CPU 16.87 sec
2017-07-10 10:11:18,373 Stage-1 map = 100%, reduce = 40%, Cumulative CPU 18.56 sec
2017-07-10 10:11:19,412 Stage-1 map = 100%, reduce = 60%, Cumulative CPU 20.24 sec
2017-07-10 10:11:21,495 Stage-1 map = 100%, reduce = 80%, Cumulative CPU 21.66 sec
2017-07-10 10:11:22,525 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 23.08 sec
MapReduce Total cumulative CPU time: 23 seconds 80 msec
Ended Job = job_1499153664137_0088
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1499153664137_0089, Tracking URL = http://master:8088/proxy/application_1499153664137_0089/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0089
Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
2017-07-10 10:12:00,374 Stage-2 map = 0%, reduce = 0%
2017-07-10 10:12:27,396 Stage-2 map = 50%, reduce = 0%, Cumulative CPU 0.91 sec
2017-07-10 10:12:28,425 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 2.03 sec
2017-07-10 10:12:46,169 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3.61 sec
MapReduce Total cumulative CPU time: 3 seconds 610 msec
Ended Job = job_1499153664137_0089
MapReduce Jobs Launched:
Stage-Stage-1: Map: 4 Reduce: 5 Cumulative CPU: 23.08 sec HDFS Read: 47339903 HDFS Write: 1188 SUCCESS
Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 3.61 sec HDFS Read: 8218 HDFS Write: 356 SUCCESS
Total MapReduce CPU Time Spent: 26 seconds 690 msec
OK
Entertainment 1304724
Music 1274825
Comedy 449652
Blogs 447581
People 447581
Film 442109
Animation 442109
Sports 390619
Politics 186753
News 186753
Autos 169883
Vehicles 169883
Style 124885
Howto 124885
Animals 86444
Pets 86444
Travel 82068
Events 82068
Education 54133
Science 50925
Technology 50925
UNA 42928
Nonprofits 16925
Activism 16925
Gaming 10182
Time taken: 184.918 seconds, Fetched: 25 row(s)
hive> set mapreduce.job.reduces=2;
hive> select tagId, count(a.videoid) as sum from (select videoid,tagId from youtube3 lateral view explode(category) catetory as tagId) a group by a.tagId order by sum desc;
Query ID = hadoop_20170710101515_d3f7eb83-0edf-463b-90d6-306b6eed2079
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1499153664137_0090, Tracking URL = http://master:8088/proxy/application_1499153664137_0090/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0090
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 2
2017-07-10 10:16:03,018 Stage-1 map = 0%, reduce = 0%
2017-07-10 10:16:31,593 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 1.99 sec
2017-07-10 10:16:34,769 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 5.8 sec
2017-07-10 10:16:51,865 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 14.03 sec
2017-07-10 10:16:52,986 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 14.28 sec
2017-07-10 10:16:54,084 Stage-1 map = 84%, reduce = 0%, Cumulative CPU 15.01 sec
2017-07-10 10:16:55,138 Stage-1 map = 84%, reduce = 13%, Cumulative CPU 15.39 sec
2017-07-10 10:16:56,210 Stage-1 map = 100%, reduce = 13%, Cumulative CPU 16.39 sec
2017-07-10 10:16:57,298 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 17.73 sec
2017-07-10 10:16:58,330 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 19.42 sec
MapReduce Total cumulative CPU time: 19 seconds 420 msec
Ended Job = job_1499153664137_0090
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1499153664137_0091, Tracking URL = http://master:8088/proxy/application_1499153664137_0091/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1499153664137_0091
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-07-10 10:17:35,020 Stage-2 map = 0%, reduce = 0%
2017-07-10 10:18:01,896 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.9 sec
2017-07-10 10:18:31,798 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.45 sec
MapReduce Total cumulative CPU time: 2 seconds 450 msec
Ended Job = job_1499153664137_0091
MapReduce Jobs Launched:
Stage-Stage-1: Map: 4 Reduce: 2 Cumulative CPU: 19.42 sec HDFS Read: 47327614 HDFS Write: 900 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.45 sec HDFS Read: 5475 HDFS Write: 356 SUCCESS
Total MapReduce CPU Time Spent: 21 seconds 870 msec
OK
Entertainment 1304724
Music 1274825
Comedy 449652
Blogs 447581
People 447581
Film 442109
Animation 442109
Sports 390619
Politics 186753
News 186753
Autos 169883
Vehicles 169883
Howto 124885
Style 124885
Animals 86444
Pets 86444
Travel 82068
Events 82068
Education 54133
Science 50925
Technology 50925
UNA 42928
Activism 16925
Nonprofits 16925
Gaming 10182
Time taken: 190.953 seconds, Fetched: 25 row(s)
使用语句:select tagId, count(a.videoid) as sum from (select videoid,tagId from youtube3 lateral view explode(category) catetory as tagId) a group by a.tagId order by sum desc;
reduce数的选择只能通过不断地调整来达到一个最优的方案
reduce数 | CPU耗时 | 详细情况 |
---|---|---|
不设置时,自动切分为3 | 24s | Stage-Stage-1: Map: 4 Reduce: 3 Cumulative CPU: 21.08 sec HDFS Read: 47331544 HDFS Write: 996 SUCCESS Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.43 sec HDFS Read: 5826 HDFS Write: 356 SUCCESS Total MapReduce CPU Time Spent: 23 seconds 510 msec |
设置为5 | 26s | MapReduce Jobs Launched: Stage-Stage-1: Map: 4 Reduce: 5 Cumulative CPU: 23.08 sec HDFS Read: 47339903 HDFS Write: 1188 SUCCESS Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 3.61 sec HDFS Read: 8218 HDFS Write: 356 SUCCESS Total MapReduce CPU Time Spent: 26 seconds 690 msec |
设置为2 | 22s | MapReduce Jobs Launched: Stage-Stage-1: Map: 4 Reduce: 2 Cumulative CPU: 19.42 sec HDFS Read: 47327614 HDFS Write: 900 SUCCESS Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.45 sec HDFS Read: 5475 HDFS Write: 356 SUCCESS Total MapReduce CPU Time Spent: 21 seconds 870 msec |
…
所以在此条查询中reduce最佳是2