// equivalent to the def below (use one or the other)
val func1 = (index: Int, iter: Iterator[Int]) => {
iter.toList.map(x => "partId:" + index + " val:" + x).iterator
}
def func1(index: Int, iter: Iterator[Int]): Iterator[String] = {
iter.toList.map(x => "partId:" + index + " val:" + x).iterator
}
rdd.mapPartitionsWithIndex(func1)
aggregate(0)(math.max(_,_), _+_) finds the maximum inside each partition and then adds those maxima together. The 0 is the initial value for every partition:
each partition's first element is compared against 0, and in the final merge the 0 is added in again before the per-partition maxima. aggregate is an action.
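A minimal numeric sketch of this, on assumed data with 2 partitions:
val nums = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)
nums.aggregate(0)(math.max(_, _), _ + _)
// partition 1 holds 1,2,3 -> max(0,1,2,3) = 3; partition 2 holds 4,5,6 -> max(0,4,5,6) = 6
// final merge: 0 + 3 + 6 = 9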
val rdd1 = sc.parallelize(List("a","b","c","d","e","f"), 2)
rdd1.aggregate("|")(_+_,_+_)
// ||abc|def, or possibly ||def|abc -- the partition tasks run in parallel
val rdd2 = sc.parallelize(List("12","23","345","4567"),2)
rdd2.aggregate("")((x,y) =>math.max(x.length,y.length).toString,(x,y)=>x + y)
// "24" or "42" -- parallel, so either partition may finish first
val rdd3 = sc.parallelize(List("12","23","345",""),2)
rdd3.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
// "10" or "01" -- parallel, so either partition may finish first. In partition 1 the initial value is "" with length 0: round 1 compares "" and "12", min is 0, toString gives "0"; round 2 compares "0" (length 1) and "23" (length 2), min is 1, giving "1". Partition 2 ends up as "0" the same way.
val arr = List("","12","23")
arr.reduce((x, y ) => math.min(x.length, y.length).toString)
//1
reduce requires the return value to have the same type as the elements of arr
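A small sketch of the contrast on the same kind of list: when the result type has to differ from the element type, foldLeft (or aggregate) works where reduce does not:
val arr2 = List("", "12", "23")
arr2.foldLeft(Int.MaxValue)((acc, s) => math.min(acc, s.length))
// 0 -- the accumulator is an Int, which reduce would not allow here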
val pairRDD = sc.parallelize(List(("cat", 2),("cat", 5),("mouse", 4),("cat", 12),("dog", 12),("mouse", 2)), 2)
pairRDD.aggregateByKey(0)(_+_,_+_).collect
(cat,19) (mouse,6) (dog,12)
pairRDD.aggregateByKey(0)(math.max(_,_), _+_).collect
(cat,17) (mouse,6) (dog,12)
The operation is applied locally inside each partition first, then across the partition results.
aggregate is an action; aggregateByKey is a transformation.
pairRDD.aggregateByKey(100)(math.max(_,_), _+_).collect
(cat,200) (mouse,200) (dog,100)
Within each partition every key starts from the initial value 100; in the final merge across partitions there is no initial value anymore.
pairRDD.reduceByKey(_+_).collect
The final result is the same as aggregateByKey(0)(_+_, _+_) above; one just takes two argument lists and the other takes one.
Both methods ultimately call combineByKey.
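A sketch of that equivalence on the pairRDD above: reduceByKey(_ + _) behaves like this combineByKey call (identity createCombiner, the same function for mergeValue and mergeCombiners):
pairRDD.combineByKey((v: Int) => v, (c: Int, v: Int) => c + v, (c1: Int, c2: Int) => c1 + c2).collect
// same result as pairRDD.reduceByKey(_ + _).collect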
wordcount
sc.textFile().flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
groupByKey also works, but it does no per-partition local summing; everything is summed after the shuffle (see the sketch below).
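A sketch of the groupByKey variant (inputPath is a placeholder): every (word, 1) pair is shuffled first and only summed afterwards, with no map-side combine:
sc.textFile(inputPath).flatMap(_.split(" ")).map((_, 1)).groupByKey().mapValues(_.sum)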
When Spark reads data from HDFS, the number of RDD partitions equals the number of HDFS blocks behind the files. For example, if two files each have two blocks, reading both files gives 4 partitions. One block is 128 MB
(a file between 128 MB and 256 MB occupies two blocks).
Question: how many blocks does such a file show up as when listed?
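A quick way to check the partition count (the HDFS URI below is hypothetical):
val hdfsRdd = sc.textFile("hdfs://namenode:9000/path/to/dir")
hdfsRdd.getNumPartitions   // e.g. 4 when the directory holds two files of two blocks each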
combineByKey
def combineByKey[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializer: Serializer = null): RDD[(K, C)] = self.withScope {
combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
partitioner, mapSideCombine, serializer)(null)
}
createCombiner: V is the value of the first (key, value) record seen for a key inside a partition; this function turns that first value into type C.
mergeValue: the function applied inside a single partition; using the C produced by createCombiner as the initial value, it folds in the remaining values for that key one record at a time (each V is a later record's value), so each key in the partition ends up with one value of type C.
mergeCombiners: merges the per-partition results of type C for each key into the final value.
val rdd10 = sc.textFile().flatMap(_.split(" ")).map((_, 1)).combineByKey(x => x + 10, (m: Int, n: Int) => m + n, (a: Int, b: Int) => a + b).collect
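To make the effect of the x => x + 10 createCombiner visible, here is a minimal sketch on assumed in-memory data: the +10 is applied once per key per partition, not once per key overall.
val words = sc.parallelize(List("a", "b", "a", "b", "a", "a"), 2).map((_, 1))
words.combineByKey((x: Int) => x + 10, (m: Int, n: Int) => m + n, (a: Int, b: Int) => a + b).collect
// partition 1: a,b,a -> (a, 10+1+1 = 12), (b, 11)
// partition 2: b,a,a -> (b, 11), (a, 12)
// merged: Array((a,24), (b,22)) -- order may vary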
e.g. rdd11 contains a set of tuples
rdd11.collect = Array((1,"dog"),(1,"cat"),(2,"gnu"),(2,"salmon"),(2,"rabbit"),(1,"turkey"),(1,"wolf"),(2,"bear"),(2,"bee"))
rdd11.combineByKey(List(_),(x:List[String],y:String) =>x :+ y, (a:List[String], b:List[String]) => a ++ b).collect
List(_) means: take the value of the first tuple for a key in each partition and put it into a new List. For example with 3 partitions, (1,"dog") is the first tuple of partition 1, so List(_) produces List("dog").
(x:List[String], y:String) => x :+ y means: we already have List("dog") and keep appending to it, so "cat" is added next, giving List("dog","cat"). Partition 1's result is therefore
(1, List("dog","cat"))
(2, List("gnu"))
Partition 2's result is
(1, List("turkey"))
(2, List("salmon","rabbit"))
Partition 3's result is
(1, List("wolf"))
(2, List("bear","bee"))
(a:List[String], b:List[String]) => a ++ b merges the per-partition results by key to give the final result:
Array[(Int, List[String])] = Array((1,List("dog","cat","turkey","wolf")),(2,List("gnu","salmon","rabbit","bear","bee")))
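For comparison, the same grouping can be written with groupByKey, which hides the per-partition combine logic:
rdd11.groupByKey().mapValues(_.toList).collect
// Array((1,List(dog, cat, turkey, wolf)), (2,List(gnu, salmon, rabbit, bear, bee))) -- element order within a list may differ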
repartition
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
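A small sketch of the difference this shuffle flag makes, on assumed data:
val r = sc.parallelize(1 to 10, 4)
r.repartition(2).getNumPartitions   // 2, with a shuffle (repartition can also increase the partition count)
r.coalesce(2).getNumPartitions      // 2, no shuffle by default (cannot increase beyond 4 this way)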
collectAsMap action
countByKey action, can be used for wordcount
countByValue
rdd13.collect = Array(("a",1),("b",2),("b",2))
rdd13.countByValue() = Map(("a",1) -> 1,("b",2) -> 2)
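On the same rdd13, a sketch of the other two actions for contrast (the two maps happen to look alike here, but collectAsMap keeps one value per key while countByKey counts occurrences of each key):
rdd13.collectAsMap()   // Map(b -> 2, a -> 1): duplicate keys keep only one value
rdd13.countByKey()     // Map(a -> 1, b -> 2): "a" appears once, "b" twice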
val rdd14 = sc.parallelize(List(("e",5),("c",3),("d",4),("c",2),("a",1),("b",6)))
rdd14.filterByRange("b","d").collect = Array((c,3),(d,4),(c,2),(b,6))
val rdd15 = sc.parallelize(List(("a","1 2"),("v","3 4")))
rdd15.flatMapValues(_.split(" ")).collect = Array((a,1),(a,2),(v,3),(v,4))
val rdd16 = sc.parallelize(List("dog","wolf","cat","bear"),2).map(x => (x.length,x)).foldByKey("")(_+_).collect() = Array((4,wolfbear),(3,catdog))
val rdd17 = sc.parallelize(List("dog","wolf","cat","bear"),2).keyBy(_.length).collect = Array((3,dog),(4,wolf),(3,cat),(4,bear))
val rdd18 = sc.parallelize(List("dog","wolf","cat","bear"),2).map(x => (x.length,x)).keys.collect = Array(3,4,3,4)
val rdd19 = sc.parallelize(List("dog","wolf","cat","bear"),2).map(x => (x.length,x)).values.collect = Array(dog,wolf,cat,bear)
/**
* Return this RDD sorted by the given key function.
*/
def sortBy[K]( // RDD[T] is the return type; the [K] after the method name declares the type parameter used inside, like the T in Set[T]
f: (T) => K,
ascending: Boolean = true,
numPartitions: Int = this.partitions.length)
(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
this.keyBy[K](f)
.sortByKey(ascending, numPartitions)
.values
}
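A minimal usage sketch: sortBy keys the RDD with f, sorts by that key, then drops it:
sc.parallelize(List(3, 1, 2)).sortBy(x => x).collect                      // Array(1, 2, 3)
sc.parallelize(List(3, 1, 2)).sortBy(x => x, ascending = false).collect   // Array(3, 2, 1)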
object OrderContext {
implicit val girlOrdering = new Ordering[Girl] {
override def compare(x: Girl, y: Girl): Int = {
x.faceValue - y.faceValue
}
}
}
case class Girl(val faceValue:Int) extends
Ordered[Girl] with Serializable {
override def compare(that: Girl): Int = {
this.faceValue - that.faceValue
}
}
import org.apache.spark.sql.SparkSession

object Test {
def main(args: Array[String]): Unit = {
val sparkSession: SparkSession = SparkSession.builder().master("local[*]").getOrCreate()
val rdd = sparkSession.sparkContext.parallelize(List(("girl1",10),("girl3",30),("girl2", 20)),3)
import OrderContext._
val result = rdd.sortBy(x => Girl(x._2)) // x._2 is the faceValue
println(result.collect.toBuffer)
}
}
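For reference, with the ascending faceValue ordering above, the program should print something like:
// ArrayBuffer((girl1,10), (girl2,20), (girl3,30))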