Spark Custom Partitioner

To implement a custom partitioner, extend org.apache.spark.Partitioner and implement the following methods:

  1. numPartitions: returns the number of partitions; it must be greater than 0
  2. getPartition(key): returns the partition index for a given key (0 to numPartitions - 1)
  3. equals: the standard Java equality method; Spark uses it to decide whether two RDDs are partitioned the same way, so it can skip an unnecessary shuffle (see the sketch after the demo)
  4. hashCode: if you override equals, you should also override this method
A short demo is below. All keys are forced into partition 0, so the other three partitions stay empty:

import org.apache.spark.rdd.RDD
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

object MyPartitionerDemo {
    def main(args: Array[String]): Unit = {
        val conf: SparkConf = new SparkConf().setAppName("CreateRDD").setMaster("local[2]")
        val sc = new SparkContext(conf)
        val rdd1: RDD[(Int, String)] = sc.makeRDD(Array((10, "a"), (20, "b"), (30, "c"), (40, "d"), (50, "e"), (60, "f")))
        // Repartition with the custom partitioner into 4 partitions
        val rdd2: RDD[(Int, String)] = rdd1.partitionBy(new MyPartitioner(4))

        // Tag each element with the index of the partition it ended up in
        val rdd3: RDD[(Int, String)] = rdd2.mapPartitionsWithIndex((index, it) => it.map(x => (index, x._1 + " : " + x._2)))

        // glom turns each partition into an array, so partitions can be printed one by one
        rdd2.glom().collect().foreach(partition => {
            println("---------------------partition separator----------------------")
            partition.foreach(println)
        })

        println(rdd3.collect().mkString(" "))

        sc.stop()
    }
}

// Deliberately sends every key to partition 0 so the effect is easy to observe
class MyPartitioner(numPars: Int) extends Partitioner {
    override def getPartition(key: Any): Int = {
        0
    }

    override def numPartitions: Int = numPars
}
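
The demo's MyPartitioner omits equals and hashCode (items 3 and 4 above). Below is a minimal sketch of a fuller partitioner that implements all four methods, spreading keys by their hash code much like Spark's built-in HashPartitioner; the name KeyHashPartitioner is made up for illustration:

import org.apache.spark.Partitioner

// Sketch only: distribute keys by hash code across numPars partitions
class KeyHashPartitioner(numPars: Int) extends Partitioner {
    require(numPars > 0, "numPartitions must be greater than 0")

    override def numPartitions: Int = numPars

    override def getPartition(key: Any): Int = key match {
        case null => 0
        // floorMod keeps the result in [0, numPars) even for negative hash codes
        case _    => java.lang.Math.floorMod(key.hashCode, numPars)
    }

    // Two KeyHashPartitioners with the same partition count place keys identically,
    // so treating them as equal lets Spark avoid a needless shuffle
    override def equals(other: Any): Boolean = other match {
        case p: KeyHashPartitioner => p.numPartitions == numPartitions
        case _                     => false
    }

    override def hashCode: Int = numPartitions
}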

Output:

---------------------partition separator----------------------
(10,a)
(20,b)
(30,c)
(40,d)
(50,e)
(60,f)
---------------------partition separator----------------------
---------------------partition separator----------------------
---------------------partition separator----------------------
(0,10 : a) (0,20 : b) (0,30 : c) (0,40 : d) (0,50 : e) (0,60 : f)
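
As expected, all six pairs land in partition 0 (the first separator block), the other three partitions are empty, and rdd3 confirms that every element reports partition index 0. This is also where equals pays off: partitionBy compares the partitioner you pass in with the one the RDD already has, and when they compare equal it returns the RDD unchanged instead of shuffling again. A small usage sketch, assuming the KeyHashPartitioner from above:

val byHash = rdd1.partitionBy(new KeyHashPartitioner(4))
// KeyHashPartitioner(4) equals another KeyHashPartitioner(4),
// so this second partitionBy reuses the existing layout instead of shuffling
val same = byHash.partitionBy(new KeyHashPartitioner(4))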
