Skewed Join Optimization

 

Translation of the Skewed Join Optimization design doc: https://issues.apache.org/jira/browse/SPARK-29544

1. Background

Data skew occurs when a table's data is unevenly distributed across the partitions in a cluster. It can severely degrade query performance, especially for queries with joins: joins between large tables require shuffling data, and skew can lead to an extremely imbalanced workload across the cluster. There are currently three main approaches to handling a skewed join:

1. Increase the shuffle parallelism via spark.sql.shuffle.partitions so that the data is distributed more evenly;

2. Increase the broadcast hash join threshold so that, wherever possible, the sort merge join becomes the faster broadcast hash join, which removes the shuffle and with it the skewed join;

3. Add a prefix to the skewed keys so that identical keys become distinct, thereby adjusting the data distribution.

All three approaches have significant limitations, and all require manual intervention.
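As a rough illustration, the three manual workarounds look like this in practice. This is only a sketch: bigTable, smallTable, the join column "key", the 100 MB threshold, and the salt factor of 10 are assumptions for the example, not part of the design doc.

import org.apache.spark.sql.functions._

// Assumes an active SparkSession `spark` and two DataFrames,
// bigTable and smallTable, joined on the column "key".

// 1. Raise the shuffle parallelism so the data spreads over more tasks.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

// 2. Raise the broadcast threshold (here 100 MB) so the smaller table is
//    broadcast; with no shuffle there is no skewed shuffle partition.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)

// 3. Salt the skewed keys: split each hot key into 10 sub-keys on the big
//    side and replicate the small side 10 times so every sub-key still matches.
val salted     = bigTable.withColumn("salt", (rand() * 10).cast("int"))
val replicated = smallTable.crossJoin(spark.range(10).toDF("salt"))
val joined     = salted.join(replicated, Seq("key", "salt"))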

2. Goals

The goal is to add an optimization that automatically handles skewed joins based on runtime statistics, built on the new adaptive execution framework (SPARK-23128).

3. Idea

3.1 Skewed Partition Condition

Whether a join partition is skewed is determined from the data sizes and row counts in the runtime map statistics. A partition is considered skewed if its data size > the median partition data size * F and > S, or if its row count > the median partition row count * F and > R. Here F is the skew factor, and S and R are the pre-configured data size and row count thresholds.
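A minimal sketch of this condition in Scala (the function and parameter names are illustrative, not Spark internals):

// Returns true if a partition qualifies as skewed under the rule above.
// F is the skew factor; S and R are the configured size and row thresholds.
def isSkewed(
    size: Long, rowCount: Long,
    medianSize: Long, medianRowCount: Long,
    F: Double, S: Long, R: Long): Boolean =
  (size > medianSize * F && size > S) ||
  (rowCount > medianRowCount * F && rowCount > R)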

3.2 Optimization

[Figure 1: Table A joins table B; partition 0 of table B is skewed and is read from the outputs of Map0 through Map8.]

The figure above shows table A joining table B, where partition 0 of table B is skewed. When partition 0 of table A joins partition 0 of table B, suppose partition 0 of table B has to read the output of Map0, Map1 ... Map8. The "OptimizeSkewedJoin" rule divides partition 0 of table B into 3 splits and creates 3 tasks that separately read part of partition 0 of table B (Map0 to Map2, Map3 to Map5, Map6 to Map8) and join it with partition 0 of table A. Finally, the union of these 3 sub-joins forms the result of joining partition 0 of table A with partition 0 of table B. This approach introduces extra cost, because partition 0 of table A is read once per split, but the benefit of handling the skewed join may well exceed that cost.
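The equal division of the mapper outputs into splits can be sketched like this (an illustrative helper, not the actual Spark code):

// Divide mapper indices 0..numMappers-1 into contiguous ranges,
// e.g. splitMappers(9, 3) == Seq((0,2), (3,5), (6,8)) as in the figure.
def splitMappers(numMappers: Int, numSplits: Int): Seq[(Int, Int)] = {
  val step = math.ceil(numMappers.toDouble / numSplits).toInt
  (0 until numMappers by step).map(s => (s, math.min(s + step, numMappers) - 1))
}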


4. Implementation

Collect the row count statistics for each partition:

Similar to the existing collection of per-partition data sizes, we first need to collect the per-partition row counts and wrap them in MapStatus; the row counts can then be read from the returned MapStatus on the executor side. Note that collecting the row count information may affect performance.
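A minimal sketch of the shape of this statistic (MapStatus is Spark's real map-output metadata class; this simplified stand-in and its field names are illustrative):

// Simplified stand-in for the extended MapStatus: alongside the per-partition
// shuffle byte sizes that are already collected, each map task would also
// report a per-partition row count for skew detection.
case class MapOutputStats(
    bytesByPartition: Array[Long],  // existing: shuffle bytes per reduce partition
    rowsByPartition: Array[Long])   // proposed: row count per reduce partition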

 

4.1 Optimize Skewed Partitions

1. Check whether any partition is skewed, according to the rule described in "3.1 Skewed Partition Condition";

2. Suppose the skewed partition is produced by N mappers. We equally divide the N mappers into M splits, where M = min(S, min(F, N)); S is a pre-configured variable capping the number of splits, and F = max(skewedPartitionSize / median partition size, skewedPartitionRowCount / median partition row count). For example, with N = 9 mappers, S = 5, and F = 3, we get M = min(5, min(3, 9)) = 3 splits, matching the figure above. We then create a new "SkewedShuffledRowRDD" with the corresponding map-index range for each split, and finally union all M sub-joins as the final join result.

3. After creating the new "SkewedShuffledRowRDD", we also need to define a new reader API to read the map output of the specified mappers:

 



def getReaderForSpecifiedMapper[K, C](
    handle: ShuffleHandle,
    startPartition: Int,
    endPartition: Int,
    startMapId: Int,
    endMapId: Int,
    context: TaskContext,
    metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C]
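Compared with the existing ShuffleManager.getReader, which takes only a reduce-partition range, this method adds a mapper-id range so that each split task reads just its slice of the skewed partition. A hypothetical call for the middle split in the figure (Map3 to Map5 of partition 0) might look like this, assuming the handle, context, and metrics objects are already in scope and that the mapper range is inclusive:

// Hypothetical usage: read only the slice of reduce partition 0 produced by
// map tasks 3 through 5 (the second of the three splits in the figure).
val reader = shuffleManager.getReaderForSpecifiedMapper(
  handle = shuffleHandle,
  startPartition = 0,
  endPartition = 1,   // assuming an exclusive end, as in the existing getReader
  startMapId = 3,
  endMapId = 5,       // assuming an inclusive mapper range, matching Map3..Map5
  context = taskContext,
  metrics = shuffleReadMetrics)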

 

 

 
