Spark operators

// a function that takes the partition index and that partition's iterator (the val form is equivalent to the def form below)
val func1 = (index: Int, iter: Iterator[Int]) => {
  iter.toList.map(x => "partId:" + index + " val:" + x).iterator
}
def func1(index: Int, iter: Iterator[Int]): Iterator[String] = {
  iter.toList.map(x => "partId:" + index + " val:" + x).iterator
}
rdd.mapPartitionsWithIndex(func1)
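
A minimal usage sketch of the call above (assuming a Spark shell where sc is available; the 3-partition RDD is made up for illustration):

val demoRdd = sc.parallelize(1 to 9, 3)
demoRdd.mapPartitionsWithIndex(func1).collect
// e.g. Array(partId:0 val:1, partId:0 val:2, partId:0 val:3, partId:1 val:4, ..., partId:2 val:9)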

aggregate(0)(math.max(_,_), _+_) finds the maximum within each partition and then adds the per-partition maxima together. 0 is the initial value for each partition:
inside a partition the first element is compared with 0, and in the final combine the 0 is also added before the per-partition maxima are summed up.  (action)
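
A hedged numeric sketch of this max-per-partition-then-sum pattern (the 6-element RDD is made up for illustration):

val nums = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)  // partition 0: 1,2,3   partition 1: 4,5,6
nums.aggregate(0)(math.max(_, _), _ + _)
// partition 0 max = 3, partition 1 max = 6, final combine: 0 + 3 + 6 = 9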

val rdd1 = sc.parallelize(List("a","b","c","d","e","f"), 2)
rdd1.aggregate("|")(_+_, _+_)
// ||abc|def, or possibly ||def|abc (the partition tasks run in parallel)

val rdd2 = sc.parallelize(List("12","23","345","4567"), 2)
rdd2.aggregate("")((x, y) => math.max(x.length, y.length).toString, (x, y) => x + y)
// "24" or "42": the partitions run in parallel, so either result may come back first

val rdd3 = sc.parallelize(List("12","23","345",""), 2)
rdd3.aggregate("")((x, y) => math.min(x.length, y.length).toString, (x, y) => x + y)
// "10" or "01": parallel, so the order is not fixed. In the first partition the initial value is "" with length 0;
// round 1 compares "" with "12" giving 0, toString "0"; round 2 compares "0" (length 1) with "23" (length 2) giving 1, toString "1".
// The second partition ("345", "") ends up as "0" by the same process.


val arr = List("", "12", "23")
arr.reduce((x, y) => math.min(x.length, y.length).toString)
// "1"
reduce requires the result type to match the element type of arr (here String), which is why the length is turned back into a string with toString.
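
By contrast, foldLeft (and aggregate) can change the accumulator type, since the initial value fixes it; a small sketch with the same list:

arr.foldLeft(Int.MaxValue)((acc, s) => math.min(acc, s.length))
// 0 -- the accumulator stays an Int, no toString needed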

val pairRDD = sc.parallelize(List(("cat", 2), ("cat", 5), ("mouse", 4), ("cat", 12), ("dog", 12), ("mouse", 2)), 2)
pairRDD.aggregateByKey(0)(_+_, _+_).collect

// (cat,19) (mouse,6) (dog,12)

pairRDD.aggregateByKey(0)(math.max(_,_), _+_).collect

// (cat,17) (mouse,6) (dog,12)
The operation is applied locally per key within each partition first, then the per-partition results are combined.

aggregate is an action; aggregateByKey is a transformation.
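
A quick sketch of the difference in return types (the small RDDs below are only for illustration):

val total: Int = sc.parallelize(List(1, 2, 3, 4), 2).aggregate(0)(_ + _, _ + _)                      // action: a value is computed immediately
val byKey = sc.parallelize(List(("a", 1), ("a", 2), ("b", 3)), 2).aggregateByKey(0)(_ + _, _ + _)    // transformation: still an RDD[(String, Int)]
byKey.collect                                                                                        // nothing runs until an action such as collect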

pairRDD.aggregateByKey(100)(math.max(_,_), _+_).collect

// (cat,200) (mouse,200) (dog,100)
Within each partition every key starts from the initial value 100; the final combine across partitions does not apply the initial value again.
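
The keys cat and mouse appear in both partitions, so the initial value 100 is counted once per partition (100 + 100 = 200), while dog appears in only one. A sketch to inspect the partition layout of the same pairRDD:

pairRDD.mapPartitionsWithIndex((idx, it) => it.map(kv => s"part $idx: $kv")).collect
// e.g. Array(part 0: (cat,2), part 0: (cat,5), part 0: (mouse,4), part 1: (cat,12), part 1: (dog,12), part 1: (mouse,2))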

pairRDD.reduceByKey(_+_).collect

The final result is the same as the aggregateByKey(0)(_+_, _+_) call above; the difference is just that one takes two parameter lists and the other takes one.

Both methods ultimately call combineByKey.
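
A sketch of the same per-key sum written directly with combineByKey, which both of the methods above delegate to:

pairRDD.combineByKey(
  (v: Int) => v,                   // createCombiner: keep the first value of a key in a partition as-is
  (c: Int, v: Int) => c + v,       // mergeValue: fold the remaining values within the partition
  (c1: Int, c2: Int) => c1 + c2    // mergeCombiners: merge the per-partition results
).collect
// same result as pairRDD.reduceByKey(_ + _).collect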

wordcount

sc.textFile(...).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_)

groupByKey also works, but it does no local per-partition summing; everything is summed only after the shuffle.
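
A sketch of the groupByKey variant (all the 1s are shuffled first and only summed afterwards; the small word list is made up):

val words = sc.parallelize(List("cat", "dog", "cat"), 2)
words.map((_, 1)).groupByKey().mapValues(_.sum).collect
// e.g. Array((dog,1), (cat,2))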

When Spark reads from HDFS, the resulting RDD by default has as many partitions as the input files have HDFS blocks. For example, if two files each have two blocks, reading both files gives 4 partitions. A block is 128 MB.

(A file between 128 MB and 256 MB in size occupies two blocks.)
Question: how many partitions does such a file show up as?
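
One way to check from the shell is to look at the partition count of the resulting RDD (the HDFS path below is a made-up placeholder):

val fromHdfs = sc.textFile("hdfs://namenode:9000/some/dir")  // hypothetical path
fromHdfs.partitions.length  // expected to match the total block count of the files read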

combineByKey

  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
      partitioner, mapSideCombine, serializer)(null)
  }

createCombiner: V is the value of the first (key, value) record seen for a given key within a partition; this function transforms that first value into the accumulator type C.
mergeValue works inside a single partition: starting from the value createCombiner produced (type C) as the initial value, it folds in the partition's remaining records one at a time, V being the value of each (key, value) pair, so the partition ends up returning one value of type C per key.
mergeCombiners then merges and further processes the per-partition results across partitions.

val rdd10 = sc.textFile(...).flatMap(_.split(" ")).map((_, 1)).combineByKey(x => x + 10, (m: Int, n: Int) => m + n, (a: Int, b: Int) => a + b).collect

e.g. the data in rdd11 is a set of tuples

rdd11.collect = Array((1,"dog"),(1,"cat"),(2,"gnu"),(2,"salmon"),(2,"rabbit"),(1,"turkey"),(1,"wolf"),(2,"bear"),(2,"bee"))
rdd11.combineByKey(List(_),(x:List[String],y:String) =>x :+ y, (a:List[String], b:List[String]) => a ++ b).collect

List(_) means: take the value of the first tuple (for a key) in each partition and put it into a newly created List. For example, with 3 partitions, (1,"dog") is the first tuple of partition 1, so List(_) produces List("dog").

(x:List[String], y:String) => x :+ y means: List("dog") already exists, and "cat" is then appended to it, giving List("dog","cat"). So the first partition's result is
(1, List("dog","cat"))
(2, List("gnu"))
The second partition's result is
(1, List("turkey"))
(2, List("salmon","rabbit"))
The third partition's result is
(1, List("wolf"))
(2, List("bear","bee"))

(a:List[String], b:List[String]) => a ++ b merges the per-partition results by key, giving the final result:

Array[(Int, List[String])] = Array((1,List("dog","cat","turkey","wolf")),(2,List("gnu","salmon","rabbit","bear","bee")))

repartition

  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }
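
As the source shows, repartition is just coalesce with shuffle = true; a usage sketch:

val wide = sc.parallelize(1 to 100, 2)
wide.repartition(6).partitions.length  // 6, increases partitions via a shuffle
wide.coalesce(1).partitions.length     // 1, shrinks partitions without a shuffle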

collectAsMap   (action)
countByKey     (action, can be used for wordcount)
countByValue   (action)

rdd13.collect = Array(("a",1), ("b",2), ("b",2))
rdd13.countByValue() = Map(("a",1) -> 1, ("b",2) -> 2)
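
A sketch of countByKey, mentioned above, used for a word count (it counts how many times each key occurs, ignoring the values; the pairs are made up):

val wordPairs = sc.parallelize(List(("cat", 1), ("dog", 1), ("cat", 1)), 2)
wordPairs.countByKey()  // e.g. Map(cat -> 2, dog -> 1)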

val rdd14 = sc.parallelize(List(("e",5), ("c",3), ("d",4), ("c",2), ("a",1), ("b",6)))
rdd14.filterByRange("b", "d").collect = Array((c,3), (d,4), (c,2), (b,6))

val rdd15 = sc.parallelize(List(("a", "1 2"), ("v", "3 4")))
rdd15.flatMapValues(_.split(" ")).collect = Array((a,1), (a,2), (v,3), (v,4))

val rdd16 = sc.parallelize(List("dog","wolf","cat","bear"), 2).map(x => (x.length, x)).foldByKey("")(_+_).collect() = Array((4,wolfbear), (3,catdog))


val rdd17 = sc.parallelize(List("dog","wolf","cat","bear"), 2).keyBy(_.length).collect = Array((3,dog), (4,wolf), (3,cat), (4,bear))

val rdd18 = sc.parallelize(List("dog","wolf","cat","bear"), 2).map(x => (x.length, x)).keys.collect = Array(3,4,3,4)

val rdd19 = sc.parallelize(List("dog","wolf","cat","bear"), 2).map(x => (x.length, x)).values.collect = Array(dog,wolf,cat,bear)



  /**
     * Return this RDD sorted by the given key function.
   */
  def sortBy[K]( // RDD[T] is the return type; the [K] after the method name is the type parameter used inside, as in Set[T]
      f: (T) => K,
      ascending: Boolean = true,
      numPartitions: Int = this.partitions.length)
      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
    this.keyBy[K](f)
      .sortByKey(ascending, numPartitions)
      .values
  }



object OrderContext {
  implicit val girlOrdering = new Ordering[Girl] {
    override def compare(x: Girl, y: Girl): Int = {
      x.faceValue - y.faceValue
    }
  }
}

case class Girl(faceValue: Int) extends Ordered[Girl] with Serializable {
  override def compare(that: Girl): Int = {
    this.faceValue - that.faceValue
  }
}

object Test {
  def main(args: Array[String]): Unit = {
    val sparkSession: SparkSession = SparkSession.builder().master("local[*]").getOrCreate()
    val rdd = sparkSession.sparkContext.parallelize(List(("girl1", 10), ("girl3", 30), ("girl2", 20)), 3)
    import OrderContext._
    val result = rdd.sortBy(x => Girl(x._2)) // x._2 is the faceValue
    println(result.collect.toBuffer)
  }
}
