The difference between repartition and partitionBy in Spark

Today let's look at two commonly used repartitioning operators in Spark, repartition and partitionBy. Both reshuffle the data into a new set of partitions, but partitionBy is only available on pair RDDs and uses a HashPartitioner on the key by default, while repartition works on any RDD and pays no attention to keys. So even when both are applied to the same pair RDD, the records end up distributed differently, as the demo below shows.
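Before the full demo, a quick sketch of the API constraint (this fragment is not from the original post; it assumes the same imports as the demo below and an already created SparkContext named sc): partitionBy is only defined for key/value RDDs, because it is added through the PairRDDFunctions implicit conversion, so it does not even compile on a plain RDD[String], whereas repartition is available on every RDD.

val words = sc.parallelize(Seq("hello", "jason", "hello"), 2)
// words.partitionBy(new HashPartitioner(4))  // does not compile: RDD[String] is not a pair RDD
val rep0  = words.repartition(4)              // OK on any RDD; the result keeps no partitioner
val byKey = words.map((_, 1)).partitionBy(new HashPartitioner(4))  // OK on RDD[(String, Int)]
println(rep0.partitioner)   // None
println(byKey.partitioner)  // Some(...): the HashPartitioner is retained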

package test

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext, TaskContext}

/**
  * The difference between repartition and partitionBy
  */
object partitionDemo {

  Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("localTest").setMaster("local[4]")
    val sc = new SparkContext(conf)
    // create the source RDD with 4 partitions
    val rdd = sc.parallelize(List("hello", "jason", "what", "are", "you", "doing","hi","jason","do","you","eat","dinner","hello","jason","do","you","have","some","time","hello","jason","time","do","you","jason","jason"),4)
    val word_count = rdd.flatMap(_.split(",")).map((_,1))
    // repartition into 10 partitions: records are dealt out round-robin, the key is ignored
    val rep = word_count.repartition(10)
    rep.foreachPartition(pair => {
      println("repartition -> partition " + TaskContext.get.partitionId() + ":")
      pair.foreach(p => println(p))
    })
    // partitionBy into 10 partitions: records with the same key go to the same partition
    val parby = word_count.partitionBy(new HashPartitioner(10))
    parby.foreachPartition(pair => {
      println("partitionBy -> partition " + TaskContext.get.partitionId() + ":")
      pair.foreach(p => println(p))
    })
    sc.stop()
  }
}

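If you run the completed demo, the two foreachPartition printouts should differ in a predictable way: after repartition(10), records with the same key, for example ("jason", 1), usually end up in several of the 10 partitions, because repartition (internally coalesce with shuffle = true) deals records out round-robin from a random starting position and never looks at the key. After partitionBy(new HashPartitioner(10)), every record with a given key lands in exactly one partition, chosen by key.hashCode modulo the number of partitions. Another consequence is that rep.partitioner is None while parby.partitioner is Some(HashPartitioner), so key-based operations that follow partitionBy (reduceByKey, join with the same partitioner, and so on) can avoid a further shuffle, which repartition cannot guarantee.

Instead of reading the whole console output, the placement of a single key can be checked programmatically. The helper below is a minimal sketch, not part of the original post; partitionsOf is a hypothetical name, and the snippet assumes it sits inside main after rep and parby are defined.

// Collect the set of partition indices that contain a given key (sketch, not from the original post)
def partitionsOf(rdd: org.apache.spark.rdd.RDD[(String, Int)], key: String): Set[Int] =
  rdd.mapPartitionsWithIndex((idx, it) => it.filter(_._1 == key).map(_ => idx)).collect().toSet

println(partitionsOf(rep, "jason"))    // typically several partition ids: repartition ignores the key
println(partitionsOf(parby, "jason"))  // a single partition id: same key, same hash bucket

In short, use repartition when you only need to change the degree of parallelism, and partitionBy when downstream work depends on all records with the same key being co-located.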