Yesterday my boss sent me an HQL statement:
create table zx_car_weibo_41_tmp
as
select *
from ods_tblog_content
where dt = '20130101'
and ((content like '%4C%') or ( extend like '%4C%'))
and ((content like '%雪铁龙%') or ( extend like '%雪铁龙%'))
He asked me to figure out why such a simple HQL launches two MapReduce jobs, when in principle a single MapReduce job plus a move of the data files should be enough. Once I had the query in hand, I set to work, starting with the execution plan.
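(The plan can be reproduced by putting EXPLAIN in front of the statement; what follows is only a sketch, not necessarily the exact command that was run:)

EXPLAIN
create table zx_car_weibo_41_tmp
as
select *
from ods_tblog_content
where dt = '20130101'
and ((content like '%4C%') or (extend like '%4C%'))
and ((content like '%雪铁龙%') or (extend like '%雪铁龙%'));

Here is the plan: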
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-4 depends on stages: Stage-1 , consists of Stage-3, Stage-2
Stage-3
Stage-0 depends on stages: Stage-3, Stage-2
Stage-5 depends on stages: Stage-0
Stage-2
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
ods_tblog_content
TableScan
alias: ods_tblog_content
Filter Operator
predicate:
expr: (((content like '%4C%') or (extend like '%4C%')) and ((content like '%雪铁龙%') or (extend like '%雪铁龙%')))
type: boolean
Filter Operator
predicate:
expr: (((dt = '20130101') and ((content like '%4C%') or (extend like '%4C%'))) and ((content like '%雪铁龙%') or (extend like '%雪铁龙%')))
type: boolean
Select Operator
expressions:
expr: action
type: string
expr: mid
type: string
expr: uid
type: string
expr: extend
type: string
expr: geo
type: string
expr: huati_tag
type: string
expr: time
type: string
expr: ip
type: string
expr: oper_src_id
type: string
expr: usertype
type: string
expr: filter
type: string
expr: state
type: string
expr: content
type: string
expr: prov_id
type: string
expr: city_id
type: string
expr: location_name
type: string
expr: atusers
type: string
expr: blog_url
type: string
expr: user_type
type: string
expr: parentmid
type: string
expr: visible
type: string
expr: dt
type: string
expr: hour
type: string
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22
File Output Operator
compressed: false
GlobalTableId: 1
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-4
Conditional Operator
Stage: Stage-3
Move Operator
files:
hdfs directory: true
destination: hdfs://nn1.hadoop.data.sina.com.cn/user/liangjun/hive-liangjun/hive_2014-02-12_17-49-20_360_2720110269798053816/-ext-10001
Stage: Stage-0
Move Operator
files:
hdfs directory: true
destination: hdfs://nn1.hadoop.data.sina.com.cn/user/liangjun/warehouse/zx_car_weibo_41_tmp
Stage: Stage-5
Create Table Operator:
Create Table
columns: action string, mid string, uid string, extend string, geo string, huati_tag string, time string, ip string, oper_src_id string, usertype string, filter string, state string, content string, prov_id string, city_id string, location_name string, atusers string, blog_url string, user_type string, parentmid string, visible string, dt string, hour string
if not exists: false
input format: org.apache.hadoop.mapred.TextInputFormat
# buckets: -1
output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
name: zx_car_weibo_41_tmp
isExternal: false
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://nn1.hadoop.data.sina.com.cn/user/liangjun/hive-liangjun/hive_2014-02-12_17-49-20_360_2720110269798053816/-ext-10002
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Sure enough, the plan shows that Stage-2 runs a second MapReduce job, but what that job actually does cannot be told from the plan. What to do? It seemed the only option was to run the query in Hive's debug mode and take a look.
Below, Hive's debug mode (verbose console logging) is enabled on the command line; since it prints a lot of output, I redirected everything to a file:
hive -hiveconf hive.root.logger=INFO,console -e "create table hue_zx_car_weibo_41_tmp1 as select * from ods_tblog_content where dt = '20130101' and ((content like '%4C%') or ( extend like '%4C%')) and ((content like '%雪铁龙%') or ( extend like '%雪铁龙%'))" >> hive.query.debug 2>&1
Opening hive.query.debug and looking at the output around the launch of the second job finally revealed some clues:
Launching Job 2 out of 2
14/02/13 10:01:10 INFO ql.Driver: Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
14/02/13 10:01:10 INFO exec.MapRedTask: Number of reduce tasks is set to 0 since there's no reduce operator
14/02/13 10:01:10 INFO exec.MapRedTask: Using org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
14/02/13 10:01:11 INFO mapred.FileInputFormat: Total input paths to process : 121
14/02/13 10:01:11 INFO io.CombineHiveInputFormat: number of splits 1
2014-02-13 10:01:29,173 Stage-2 map = 0%, reduce = 0%
14/02/13 10:01:29 INFO exec.MapRedTask: 2014-02-13 10:01:29,173 Stage-2 map = 0%, reduce = 0%
14/02/13 10:01:55 INFO exec.DDLTask: Default to LazySimpleSerDe for table hue_zx_car_weibo_41_tmp1
14/02/13 10:01:55 INFO metastore.HiveMetaStore: 0: create_table: db=default tbl=hue_zx_car_weibo_41_tmp1
OK
14/02/13 10:01:55 INFO ql.Driver: OK
Time taken: 166.654 seconds
14/02/13 10:01:55 INFO CliDriver: Time taken: 166.654 seconds
The mention of io.CombineHiveInputFormat in the output above gives it away: the second MapReduce job is merging files. On reflection, this makes sense. According to the execution plan and the job logs, the first job ran 121 map tasks and no reduce tasks, leaving 121 small files on HDFS. After the first job finishes, Hive starts a conditional task (the Conditional Operator in Stage-4) to decide whether the small files need merging; if the condition is met, it launches an additional map-only job or MapReduce job to perform the merge.
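The merge step is driven by a few standard Hive settings. Below is a minimal sketch; the parameter names are the usual Hive merge options, and the values shown are just commonly cited defaults, so check hive-site.xml for what a given cluster actually uses:

-- merge small files after a map-only job (our case: the CTAS has no reducers)
set hive.merge.mapfiles=true;
-- merge small files after a job with reducers
set hive.merge.mapredfiles=false;
-- if the average output file size is below this many bytes,
-- the conditional task (Stage-4) kicks off the extra merge job
set hive.merge.smallfiles.avgsize=16000000;
-- target size of each merged file produced by the merge job
set hive.merge.size.per.task=256000000;

-- to suppress the second job entirely and accept the 121 small files:
set hive.merge.mapfiles=false;

With merging enabled, Stage-2 simply reads the 121 files from the -ext-10002 staging directory through CombineHiveInputFormat (121 input paths combined into 1 split, per the log above) and rewrites them as larger files before the final Move into the table directory.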
Reference: http://blog.csdn.net/lalaguozhe/article/details/9053645