Small file merging in HQL

Yesterday my boss sent me an HQL query:

create table zx_car_weibo_41_tmp
as
select *
from ods_tblog_content
where dt = '20130101'
  and ((content like '%4C%') or (extend like '%4C%'))
  and ((content like '%雪铁龙%') or (extend like '%雪铁龙%'))

He wanted me to figure out why such a simple HQL statement launches two MapReduce jobs, when by rights a single MapReduce job plus a move of the data files should be enough. As soon as I got it I started digging, beginning with the execution plan.
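The plan was pulled with Hive's EXPLAIN statement, which prints the plan without actually running the query; a minimal sketch, using the same CTAS as above:

EXPLAIN
create table zx_car_weibo_41_tmp
as
select *
from ods_tblog_content
where dt = '20130101'
  and ((content like '%4C%') or (extend like '%4C%'))
  and ((content like '%雪铁龙%') or (extend like '%雪铁龙%'));

Hive printed the following: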

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-4 depends on stages: Stage-1 , consists of Stage-3, Stage-2
  Stage-3
  Stage-0 depends on stages: Stage-3, Stage-2
  Stage-5 depends on stages: Stage-0
  Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        ods_tblog_content 
          TableScan
            alias: ods_tblog_content
            Filter Operator
              predicate:
                  expr: (((content like '%4C%') or (extend like '%4C%')) and ((content like '%雪铁龙%') or (extend like '%雪铁龙%')))
                  type: boolean
              Filter Operator
                predicate:
                    expr: (((dt = '20130101') and ((content like '%4C%') or (extend like '%4C%'))) and ((content like '%雪铁龙%') or (extend like '%雪铁龙%')))
                    type: boolean
                Select Operator
                  expressions:
                        expr: action
                        type: string
                        expr: mid
                        type: string
                        expr: uid
                        type: string
                        expr: extend
                        type: string
                        expr: geo
                        type: string
                        expr: huati_tag
                        type: string
                        expr: time
                        type: string
                        expr: ip
                        type: string
                        expr: oper_src_id
                        type: string
                        expr: usertype
                        type: string
                        expr: filter
                        type: string
                        expr: state
                        type: string
                        expr: content
                        type: string
                        expr: prov_id
                        type: string
                        expr: city_id
                        type: string
                        expr: location_name
                        type: string
                        expr: atusers
                        type: string
                        expr: blog_url
                        type: string
                        expr: user_type
                        type: string
                        expr: parentmid
                        type: string
                        expr: visible
                        type: string
                        expr: dt
                        type: string
                        expr: hour
                        type: string
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22
                  File Output Operator
                    compressed: false
                    GlobalTableId: 1
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-4
    Conditional Operator

  Stage: Stage-3
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://nn1.hadoop.data.sina.com.cn/user/liangjun/hive-liangjun/hive_2014-02-12_17-49-20_360_2720110269798053816/-ext-10001

  Stage: Stage-0
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://nn1.hadoop.data.sina.com.cn/user/liangjun/warehouse/zx_car_weibo_41_tmp

  Stage: Stage-5
      Create Table Operator:
        Create Table
          columns: action string, mid string, uid string, extend string, geo string, huati_tag string, time string, ip string, oper_src_id string, usertype string, filter string, state string, content string, prov_id string, city_id string, location_name string, atusers string, blog_url string, user_type string, parentmid string, visible string, dt string, hour string
          if not exists: false
          input format: org.apache.hadoop.mapred.TextInputFormat
          # buckets: -1
          output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
          name: zx_car_weibo_41_tmp
          isExternal: false

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://nn1.hadoop.data.sina.com.cn/user/liangjun/hive-liangjun/hive_2014-02-12_17-49-20_360_2720110269798053816/-ext-10002 
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Sure enough, the execution plan shows that Stage-2 runs another MapReduce job. But what does that MapReduce job actually do? The plan gives no hint. What now? The only option left seemed to be running Hive in debug mode and taking a look.

Below, I enabled Hive's debug mode on the command line; since debug mode is quite chatty, I redirected the output to a file:

hive -hiveconf hive.root.logger=INFO,console  -e "create table hue_zx_car_weibo_41_tmp1 as select * from ods_tblog_content where  dt  = '20130101'  and ((content like '%4C%')  or ( extend like '%4C%')) and ((content like '%雪铁龙%')  or ( extend like '%雪铁龙%'))" >> hive.query.debug 2>&1
Opening hive.query.debug and looking at the output around where the second job starts, I finally spotted a clue:

Launching Job 2 out of 2
14/02/13 10:01:10 INFO ql.Driver: Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
14/02/13 10:01:10 INFO exec.MapRedTask: Number of reduce tasks is set to 0 since there's no reduce operator
14/02/13 10:01:10 INFO exec.MapRedTask: Using org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
...
14/02/13 10:01:11 INFO mapred.FileInputFormat: Total input paths to process : 121
14/02/13 10:01:11 INFO io.CombineHiveInputFormat: number of splits 1
...
2014-02-13 10:01:29,173 Stage-2 map = 0%,  reduce = 0%
14/02/13 10:01:29 INFO exec.MapRedTask: 2014-02-13 10:01:29,173 Stage-2 map = 0%,  reduce = 0%
...
14/02/13 10:01:55 INFO exec.DDLTask: Default to LazySimpleSerDe for table hue_zx_car_weibo_41_tmp1
...
14/02/13 10:01:55 INFO metastore.HiveMetaStore: 0: create_table: db=default tbl=hue_zx_car_weibo_41_tmp1
...
OK
14/02/13 10:01:55 INFO ql.Driver: OK
Time taken: 166.654 seconds
14/02/13 10:01:55 INFO CliDriver: Time taken: 166.654 seconds
The mention of "io.CombineHiveInputFormat" in the log above gives the game away: the second MapReduce job is there to merge files. On reflection this makes sense. According to the execution plan and the job logs, the first job launched 121 map tasks and no reduce tasks, which left 121 small files on HDFS. After the first job finishes, Hive starts a conditional task that decides whether those small files need merging; if they do, it launches an additional map-only job (or a full MapReduce job) to perform the merge. That is exactly what the log shows here: CombineHiveInputFormat packed all 121 input paths into a single split ("number of splits 1").
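For completeness, this merge behavior is governed by a handful of Hive settings. A minimal sketch below; the values shown are the usual defaults for Hive of this era, so verify them on your own installation (running "set hive.merge.mapfiles;" in the CLI prints the current value):

-- merge small files produced by map-only jobs (usual default: true)
set hive.merge.mapfiles=true;
-- merge small files produced by map-reduce jobs (usual default: false)
set hive.merge.mapredfiles=false;
-- target size in bytes of each merged file (usual default: 256 MB)
set hive.merge.size.per.task=256000000;
-- merge only when the average output file is below this size (usual default: 16 MB)
set hive.merge.smallfiles.avgsize=16000000;

Setting hive.merge.mapfiles=false before the CTAS should make it run as a single MapReduce job plus the file move, at the cost of leaving the 121 small files in place; conversely, raising hive.merge.smallfiles.avgsize makes the conditional task more eager to merge.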

Reference: http://blog.csdn.net/lalaguozhe/article/details/9053645


