spark 自定义partitioner分区 java版

2019独角兽企业重金招聘Python工程师标准>>>

在遍历spark dataset的时候，通常会使用 forpartition 在每个分区内进行遍历，而在默认分区（由生成dataset时的分区决定）可能因数据分布原因导致datasetc处理时的数据倾斜，造成整个dataset处理缓慢，发挥不了spark多executor（jvm 进程）多partition（线程）的并行处理能力，因此，普遍的做法是在dataset遍历之前使用repartition进行重新分区，让数据按照指定的key进行分区，充分发挥spark的并行处理能力，例如：

dataset.repartition(9,new Column("name")).foreachPartition(it -> {
			while (it.hasNext()) {
				Row row = it.next();
				....
			}
		});

先看一下准备的原始数据集：

按照上面的代码，预想的结果应该是，相同名字在记录在同个partition(分区)，不同名字在不同的partition，并且一个partition里面不会有不同名字的记录，而实际分区却是这样的

（查看分区分布情况的代码在之前一篇文章 spark sql 在mysql的应用实践有说明，如果调用reparation时未指定分区数量9，则默认为200，使用 spark.default.parallelism 配置的数量为分区数,在partitioner.scala 的 partition object 定义可以看到）

这个很囧...乍看一下，压根看不出什么情况，翻看源码发现，rdd 的partition 分区器有两种 HashPartitioner & RangePartitioner，默认情况下使用 HashPartitioner，从 repartition 源码开始入手

/**  
   * Dataset.scala 
   * Returns a new Dataset partitioned by the given partitioning expressions into
   * `numPartitions`. The resulting Dataset is hash partitioned.
   *
   * This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
   *
   * @group typedrel
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T] = withTypedPlan {
    RepartitionByExpression(partitionExprs.map(_.expr), logicalPlan, Some(numPartitions))
  }

The resulting Dataset is hash partitioned，说的很清楚，使用hash 分区，那看看hash 分区的源码，

/**
 * Partitioner.scala
 * A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using
 * Java's `Object.hashCode`.
 *
 * Java arrays have hashCodes that are based on the arrays' identities rather than their contents,
 * so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will
 * produce an unexpected or incorrect result.
 */
class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}

Utils.nonNegativeMod(key.hashCode, numPartitions) 说明在获取当前row所在分区时，用了分区key的hashCode作为实际分区的key值，在看看 nonNegativeMod

 /* Calculates 'x' modulo 'mod', takes to consideration sign of x,
  * i.e. if 'x' is negative, than 'x' % 'mod' is negative too
  * so function return (x % mod) + mod in that case.
  */
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

看到这里，前面的相同分区存在不同的 name 的记录就不难理解了，不同的name值hashCode%分区数后落到相同的分区... 简单的调整方式，在遍历分区里面用hashMap兼容不同name值的记录处理，那如果我们想自定义分区呢，自定义分组分区代码写起来就比较直观容易理解，幸好spark提供了partitioner接口，可以自定义partitioner，支持这种自定义分组分区的方式，这里我也有个简单实现类，可以支持同个分区只有相同name的记录

import org.apache.commons.collections.CollectionUtils;
import org.apache.spark.Partitioner;
import org.junit.Assert;

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Created by lesly.lai on 2018/7/25.
 */
public class CuxGroupPartitioner extends Partitioner {

	private int partitions;

	/**
	 * map
	 * 主要为了区分不同分区
	 */
	private Map hashCodePartitionIndexMap = new ConcurrentHashMap<>();

	public CuxGroupPartitioner(List

spark 自定义partitioner分区 java版

你可能感兴趣的:(大数据,java,scala)