[Hive] Hive Tuning: Running Jobs in Parallel

Background

The Hive SQL for extract_trfc_page_kpi is as follows:

set mapred.job.queue.name=pms;
set hive.exec.reducers.max=8;
set mapred.reduce.tasks=8;
set mapred.job.name=extract_trfc_page_kpi;

insert overwrite table pms.extract_trfc_page_kpi partition(ds='$yesterday')
select distinct 
    page_type_id,
    pv,
    uv,
    '$yesterday' update_time 
from
(
    -- For PC and H5
    select 
        page_type_id,
        sum(pv) as pv,
        sum(uv) as uv 
    from dw.rpt_trfc_page_kpi 
    where ds = '$yesterday' and stat_type = 1 
    group by page_type_id 

union all

    -- Special handling for the PC search page
    select 
        5 as page_type_id,
        sum(pv) as pv,
        sum(uv) as uv 
    from dw.rpt_trfc_page_kpi 
    where ds = '$yesterday' and stat_type = 1 and page_type_id in (51, 52)

union all

    -- For the APP
    select 
        a.page_type_id,
        sum(pv) as pv,
        sum(uv) as uv 
    from dw.rpt_trfc_page_kpi a 
    left outer join (
        select distinct 
            page_type_id, 
            old_page_type_id 
        from tandem.mobile_backend_page_url_rule 
        where is_delete = 0
    ) b on (a.page_type_id = b.old_page_type_id)
    where a.ds = '$yesterday' and stat_type = 1 
    group by a.page_type_id 
) t;

The SQL above contains two UNION ALL operations; executed sequentially, the whole statement takes 20 minutes.
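Before tuning, it is worth confirming that the three subqueries really compile into independent stages. A minimal check (not part of the original post, and abbreviated here) is to prepend EXPLAIN to the statement and read the STAGE DEPENDENCIES section of the output; stages that do not depend on one another are the ones parallel execution can overlap:

hive> explain
    > insert overwrite table pms.extract_trfc_page_kpi partition(ds='$yesterday')
    > select distinct page_type_id, pv, uv, '$yesterday' update_time
    > from ( /* the three UNION ALL subqueries shown above */ ) t;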

Optimization Strategy

Analyzing the SQL above, the three queries joined by UNION ALL have no direct dependency on one another, so there is no need to run them sequentially. The optimization is therefore to let these three queries execute in parallel. Hive provides the following parameters to run jobs in parallel:

-- enable parallel execution of independent job stages
set hive.exec.parallel=true;
-- maximum number of jobs from one SQL statement that may run concurrently
set hive.exec.parallel.thread.number=8;
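The current values of both parameters can be checked in the session before enabling anything; as the set -v listing in Option 2 below also shows, hive.exec.parallel is off by default:

hive> set hive.exec.parallel;
hive.exec.parallel=false
hive> set hive.exec.parallel.thread.number;
hive.exec.parallel.thread.number=8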

Option 1

Add the two Hive parameters above when executing the SQL, for example:

set mapred.job.queue.name=pms;
set hive.exec.reducers.max=8;
set mapred.reduce.tasks=8;
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
set mapred.job.name=extract_trfc_page_kpi;

insert overwrite table pms.extract_trfc_page_kpi partition(ds='$yesterday')
select distinct 
    page_type_id,
    pv,
    uv,
    '$yesterday' update_time 
from
(
    -- For PC and H5
    select 
        page_type_id,
        sum(pv) as pv,
        sum(uv) as uv 
    from dw.rpt_trfc_page_kpi 
    where ds = '$yesterday' and stat_type = 1 
    group by page_type_id 

union all

    -- Special handling for the PC search page
    select 
        5 as page_type_id,
        sum(pv) as pv,
        sum(uv) as uv 
    from dw.rpt_trfc_page_kpi 
    where ds = '$yesterday' and stat_type = 1 and page_type_id in (51, 52)

union all

    -- For the APP
    select 
        a.page_type_id,
        sum(pv) as pv,
        sum(uv) as uv 
    from dw.rpt_trfc_page_kpi a 
    left outer join (
        select distinct 
            page_type_id, 
            old_page_type_id 
        from tandem.mobile_backend_page_url_rule 
        where is_delete = 0
    ) b on (a.page_type_id = b.old_page_type_id)
    where a.ds = '$yesterday' and stat_type = 1 
    group by a.page_type_id 
) t;

Option 2

Set the parameters in hive-site.xml instead. First, check the configuration parameters of the current Hive version:

hive> set -v;
...
hive.exec.orc.zerocopy=false
hive.exec.parallel=false
hive.exec.parallel.thread.number=8
hive.exec.perf.logger=org.apache.hadoop.hive.ql.log.PerfLogger
hive.exec.rcfile.use.explicit.header=true
hive.exec.rcfile.use.sync.cache=true
hive.exec.reducers.bytes.per.reducer=1000000000
hive.exec.reducers.max=999
hive.exec.rowoffset=false
hive.exec.scratchdir=/tmp/hive-pms
hive.exec.script.allow.partial.consumption=false
hive.exec.script.maxerrsize=100000
hive.exec.script.trust=false
hive.exec.show.job.failure.debug.info=true
...

These parameters live in $HIVE_HOME/conf/hive-site.xml; now add the following to that configuration file:

<property>
    <name>hive.exec.parallel</name>
    <value>true</value>
</property>
<property>
    <name>hive.exec.parallel.thread.number</name>
    <value>16</value>
</property>
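Values placed in hive-site.xml become the defaults for every new Hive session; a per-session set statement (as in Option 1) still takes precedence, so an individual job can override the site-wide default if needed:

-- override the hive-site.xml default for the current session only
set hive.exec.parallel.thread.number=8;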

Restart Hive and confirm that the newly configured parameters have taken effect:

hive> set -v;
...
hive.exec.orc.skip.corrupt.data=false
hive.exec.orc.zerocopy=false
hive.exec.parallel=true
hive.exec.parallel.thread.number=16
hive.exec.perf.logger=org.apache.hadoop.hive.ql.log.PerfLogger
hive.exec.rcfile.use.explicit.header=true
hive.exec.rcfile.use.sync.cache=true
hive.exec.reducers.bytes.per.reducer=1000000000
hive.exec.reducers.max=999
hive.exec.rowoffset=false
hive.exec.scratchdir=/tmp/hive-pms
hive.exec.script.allow.partial.consumption=false
...

Conclusion

After testing, with these two parameters added, the execution time of the extract_trfc_page_kpi script dropped from 20 minutes to 3 minutes.
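The speedup comes from running the three independent jobs at the same time, which also means the statement occupies more map and reduce slots concurrently. If a run on a busy queue needs to fall back to sequential execution, the feature can be switched off for that session alone, without touching hive-site.xml:

-- fall back to sequential stage execution for this session
set hive.exec.parallel=false;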
