Hive: speeding up specific SQL queries by disabling the CBO (Cost-Based Optimizer)

Starting with version 0.14.0, Hive ships a "Cost Based Optimizer" (CBO) for HQL execution plans, controlled by the "hive.cbo.enable" configuration property. Since Hive 1.1.0 this feature is enabled by default. It can automatically reorder multiple JOINs in an HQL statement and choose a suitable join algorithm.

Join reordering and join algorithm selection are few of the optimizations that can benefit from a cost based optimizer. Cost based optimizer would free up user from having to rearrange joins in the right order or from having to specify join algorithm by using query hints and configuration options. This can potentially free up users to model their reporting and ETL needs close to business process without having to worry about query optimizations.

In practice, however, I have run into cases where this feature actually makes a query slower; this post records one such case. Everything below applies to Hive on Tez only; I have not examined Hive on MapReduce.
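For reference, the switch (and the statistics options CBO relies on for its cost model) are ordinary session-level properties; a minimal sketch of toggling them in a Hive session, with property names as documented on the Hive wiki:

```sql
-- Inspect and toggle the cost-based optimizer for the current session
set hive.cbo.enable;                      -- prints the current value
set hive.cbo.enable=true;

-- CBO bases its cost model on table/column statistics,
-- so these are usually needed as well
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
```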

Suppose we have three tables:
1. Fact table table_fact

col_name data_type comment
field_dim_1 string
field_dim_2 string
score bigint

2. Dimension table table_dim_1

col_name data_type comment
field_dim_1 string
field_res_1 string

3. Dimension table table_dim_2

col_name data_type comment
field_dim_1 string
field_dim_2 string
field_res_2 bigint
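For reproduction purposes, the three (empty) tables can be created with DDL along these lines (a sketch matching the schemas above; storage format and table properties are omitted):

```sql
CREATE TABLE table_fact (
  field_dim_1 STRING,
  field_dim_2 STRING,
  score       BIGINT
);

CREATE TABLE table_dim_1 (
  field_dim_1 STRING,
  field_res_1 STRING
);

CREATE TABLE table_dim_2 (
  field_dim_1 STRING,
  field_dim_2 STRING,
  field_res_2 BIGINT
);
```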

This is a miniature of a typical star schema: one fact table plus two dimension tables. Queries usually join the dimension tables to pull in the attributes they need, for example:

select 
    d1.field_res_1,
    d2.field_res_2,
    sum(f.score) 
from 
    table_fact f 
left join table_dim_1 d1 on f.field_dim_1=d1.field_dim_1 
left join table_dim_2 d2 on f.field_dim_1=d2.field_dim_1 and f.field_dim_2=d2.field_dim_2 
where 
    f.field_dim_2="abc" 
group by 
    d1.field_res_1,
    d2.field_res_2;

This is a fairly common query. With CBO enabled, its execution plan looks like this:

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE), Map 5 (SIMPLE_EDGE) |
| Reducer 3 <- Reducer 2 (SIMPLE_EDGE)               |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:-1                                       |
|     Stage-1                                        |
|       Reducer 3                                    |
|       File Output Operator [FS_17]                 |
|         Group By Operator [GBY_15] (rows=1 width=0) |
|           Output:["_col0","_col1","_col2"],aggregations:["sum(VALUE._col0)"],keys:KEY._col0, KEY._col1 |
|         <-Reducer 2 [SIMPLE_EDGE]                  |
|           SHUFFLE [RS_14]                          |
|             PartitionCols:_col0, _col1             |
|             Group By Operator [GBY_13] (rows=2 width=0) |
|               Output:["_col0","_col1","_col2"],aggregations:["sum(_col2)"],keys:_col4, _col7 |
|               Select Operator [SEL_12] (rows=2 width=0) |
|                 Output:["_col4","_col7","_col2"]   |
|                 Merge Join Operator [MERGEJOIN_23] (rows=2 width=0) |
|                   Conds:RS_8._col0=RS_9._col0(Left Outer),RS_8._col0=RS_10._col0(Left Outer),Output:["_col2","_col4","_col7"] |
|                 <-Map 1 [SIMPLE_EDGE]              |
|                   SHUFFLE [RS_8]                   |
|                     PartitionCols:_col0            |
|                     Select Operator [SEL_2] (rows=1 width=0) |
|                       Output:["_col0","_col2"]     |
|                       Filter Operator [FIL_20] (rows=1 width=0) |
|                         predicate:(field_dim_2 = 'abc') |
|                         TableScan [TS_0] (rows=1 width=0) |
|                           default@table_fact,f,Tbl:PARTIAL,Col:NONE,Output:["field_dim_1","field_dim_2","score"] |
|                 <-Map 4 [SIMPLE_EDGE]              |
|                   SHUFFLE [RS_9]                   |
|                     PartitionCols:_col0            |
|                     Select Operator [SEL_4] (rows=1 width=0) |
|                       Output:["_col0","_col1"]     |
|                       TableScan [TS_3] (rows=1 width=0) |
|                         default@table_dim_1,d1,Tbl:PARTIAL,Col:NONE,Output:["field_dim_1","field_res_1"] |
|                 <-Map 5 [SIMPLE_EDGE]              |
|                   SHUFFLE [RS_10]                  |
|                     PartitionCols:_col0            |
|                     Select Operator [SEL_7] (rows=1 width=0) |
|                       Output:["_col0","_col2"]     |
|                       Filter Operator [FIL_22] (rows=1 width=0) |
|                         predicate:('abc' = field_dim_2) |
|                         TableScan [TS_5] (rows=1 width=0) |
|                           default@table_dim_2,d2,Tbl:PARTIAL,Col:NONE,Output:["field_dim_1","field_dim_2","field_res_2"] |
|                                                    |
+----------------------------------------------------+

Line 4 of this output ("Plan optimized by CBO.") confirms that the query went through CBO. Hive applied a whole series of optimizations here, not all of them due to CBO (the line numbers below refer to lines of the EXPLAIN output):

  • predicate (filter) pushdown (lines 33, 49)
  • projection pushdown (lines 31, 40, 47)
  • join-key optimization (line 26, among others)
  • merging map-reduce stages wherever possible (lines 7-8)

I won't enumerate the rest. The notable point is that this plan executes both joins in a single map-reduce round; the final Reducer 3 exists only to perform the GROUP BY.

But what happens if the column field_dim_1 in the fact table table_fact is heavily skewed? As the plan shows, every shuffle is keyed on field_dim_1, so severe skew on that column will make this query run extremely slowly.

The usual remedies for data skew boil down to a map join, salting the key, or splitting the job into pieces. The query here joins one large fact table against two dimension tables; if the dimension tables (after filtering and projection) are small enough, Hive can in fact convert the query into a map-side join automatically.
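For completeness, that automatic conversion is governed by a few settings, and a map join can also be requested explicitly with a hint. A sketch (the size threshold below is illustrative, not a recommendation):

```sql
-- Allow Hive to convert shuffle joins into map joins when the small side
-- fits under the size threshold (bytes; the value here is illustrative)
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;

-- Alternatively, request the map join explicitly; note that recent Hive
-- versions ignore the hint unless hive.ignore.mapjoin.hint=false
select /*+ MAPJOIN(d1, d2) */
    d1.field_res_1,
    d2.field_res_2,
    sum(f.score)
from table_fact f
left join table_dim_1 d1 on f.field_dim_1 = d1.field_dim_1
left join table_dim_2 d2 on f.field_dim_1 = d2.field_dim_1
                        and f.field_dim_2 = d2.field_dim_2
where f.field_dim_2 = 'abc'
group by d1.field_res_1, d2.field_res_2;
```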

Since all three tables above were created empty for this illustration, Hive should in theory have converted the query into a map-side join. So why didn't it? CBO turns out to be the reason.

Hive's CBO is built on Apache Calcite ( https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive ). I haven't dug into that code, but setting hive.cbo.enable to false changes the execution plan, as shown below:
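The toggle is session-scoped, so it can be tested without touching cluster-wide configuration:

```sql
-- Disable CBO for the current session only, then re-run EXPLAIN
-- on the same query as above
set hive.cbo.enable=false;
```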

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| Vertex dependency in root stage                    |
| Map 1 <- Map 3 (BROADCAST_EDGE), Map 4 (BROADCAST_EDGE) |
| Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:-1                                       |
|     Stage-1                                        |
|       Reducer 2                                    |
|       File Output Operator [FS_15]                 |
|         Group By Operator [GBY_13] (rows=1 width=0) |
|           Output:["_col0","_col1","_col2"],aggregations:["sum(VALUE._col0)"],keys:KEY._col0, KEY._col1 |
|         <-Map 1 [SIMPLE_EDGE]                      |
|           SHUFFLE [RS_12]                          |
|             PartitionCols:_col0, _col1             |
|             Group By Operator [GBY_11] (rows=1 width=0) |
|               Output:["_col0","_col1","_col2"],aggregations:["sum(_col2)"],keys:_col7, _col13 |
|               Select Operator [SEL_10] (rows=1 width=0) |
|                 Output:["_col7","_col13","_col2"]  |
|                 Map Join Operator [MAPJOIN_22] (rows=1 width=0) |
|                   Conds:MAPJOIN_21._col0, _col1=RS_7.field_dim_1, field_dim_2(Left Outer),HybridGraceHashJoin:true,Output:["_col2","_col7","_col13"] |
|                 <-Map 4 [BROADCAST_EDGE]           |
|                   BROADCAST [RS_7]                 |
|                     PartitionCols:field_dim_1, field_dim_2 |
|                     Filter Operator [FIL_20] (rows=1 width=0) |
|                       predicate:(field_dim_2 = 'abc') |
|                       TableScan [TS_2] (rows=1 width=0) |
|                         default@table_dim_2,d2,Tbl:PARTIAL,Col:NONE,Output:["field_dim_1","field_dim_2","field_res_2"] |
|                 <-Map Join Operator [MAPJOIN_21] (rows=1 width=0) |
|                     Conds:FIL_18.field_dim_1=RS_4.field_dim_1(Left Outer),HybridGraceHashJoin:true,Output:["_col0","_col1","_col2","_col7"] |
|                   <-Map 3 [BROADCAST_EDGE]         |
|                     BROADCAST [RS_4]               |
|                       PartitionCols:field_dim_1    |
|                       TableScan [TS_1] (rows=1 width=0) |
|                         default@table_dim_1,d1,Tbl:PARTIAL,Col:NONE,Output:["field_dim_1","field_res_1"] |
|                   <-Filter Operator [FIL_18] (rows=1 width=0) |
|                       predicate:(field_dim_2 = 'abc') |
|                       TableScan [TS_0] (rows=1 width=0) |
|                         default@table_fact,f,Tbl:PARTIAL,Col:NONE,Output:["field_dim_1","field_dim_2","score"] |
|                                                    |
+----------------------------------------------------+

First, the "Plan optimized by CBO." banner (line 4 of the previous plan) is gone. Second, in Stage-1 the Map vertices over the two dimension tables now feed BROADCAST edges (line 5), and their output is broadcast to the Map over the fact table.

So with CBO disabled, Hive executes this query with map joins. Moreover, the number of shuffles drops from two to one, so even without data skew the query would very likely run much faster.

The actual production SQL was too complex to paste here, but its bottleneck reduced to exactly the scenario above, with one value of the join key skewed so badly that the query ran for hours with no sign of finishing. Once we disabled CBO, Hive converted the query into a map-side join and it returned results quickly.

The takeaway: an optimization switch is not necessarily better left on. Beware of over-optimization.
