1,aggregate
The aggregate operation is the confusing one.
val z = sc.parallelize(List(1,2,3,4,5,6), 2) // splits the sequence into two partitions. Note: the second parameter of parallelize is the number of partitions (slices) the RDD is divided into, not the element count. Here it is 2, yet z.count still returns 6, not 2.
aggregate computes over the two partitions separately and then combines the results, as follows:
z.aggregate(0)(math.max(_, _), _ + _) // res40: Int = 9. Computed as max(0,1,2,3) + max(0,4,5,6) + 0 = 3 + 6 + 0 = 9.
z.aggregate(2)(math.max(_, _), _ + _) // Int = 11. Computed as max(2,1,2,3) + max(2,4,5,6) + 2 = 3 + 6 + 2 = 11.
z.aggregate(5)(math.max(_, _), _ + _) // Int = 16. Computed as max(5,1,2,3) + max(5,4,5,6) + 5 = 5 + 6 + 5 = 16.
With the partition count changed to 3, the results are:
val z = sc.parallelize(List(1,2,3,4,5,6), 3)
z.aggregate(0)(math.max(_, _), _ + _) // Int = 12. Computed as max(0,1,2) + max(0,3,4) + max(0,5,6) + 0 = 2 + 4 + 6 + 0 = 12.
z.aggregate(4)(math.max(_, _), _ + _) // Int = 18. Computed as max(4,1,2) + max(4,3,4) + max(4,5,6) + 4 = 4 + 4 + 6 + 4 = 18.
The key point is that the number of partitions and the number of RDD elements are different concepts. When no partition count is given, parallelize falls back on the environment's default parallelism (spark.default.parallelism), not the element count. The elements are not reordered when partitioned: the sequence is cut into contiguous slices in order.
val z = sc.parallelize(List(1,2,3,4,5,6)) // partition count unspecified. Here the default turned out to be 4, giving the slices (1)(2,3)(4)(5,6).
z.aggregate(0)(math.max(_, _), _ + _) // Int = 14. Computed as 1 + 3 + 4 + 6 + 0 = 14.
z.aggregate(3)(math.max(_, _), _ + _) // Int = 19. Computed as 3 + 3 + 4 + 6 + 3 = 19.
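The arithmetic above can be checked without Spark by emulating aggregate over explicit partitions with plain Scala collections. This is a sketch, not Spark itself; the partition boundaries are hard-coded to the slicings described in the text:

```scala
// Emulate RDD.aggregate: fold each partition from the zero value with seqOp,
// then fold the per-partition results, again from the zero value, with combOp.
def emulateAggregate[T, U](partitions: Seq[Seq[T]], zero: U)(
    seqOp: (U, T) => U, combOp: (U, U) => U): U =
  partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

val two  = Seq(Seq(1, 2, 3), Seq(4, 5, 6))           // parallelize(..., 2)
val four = Seq(Seq(1), Seq(2, 3), Seq(4), Seq(5, 6)) // the default-4 slicing above

val r9  = emulateAggregate(two, 0)(math.max(_, _), _ + _)  // 3 + 6 + 0 = 9
val r16 = emulateAggregate(two, 5)(math.max(_, _), _ + _)  // 5 + 6 + 5 = 16
val r14 = emulateAggregate(four, 0)(math.max(_, _), _ + _) // 1 + 3 + 4 + 6 + 0 = 14
val r19 = emulateAggregate(four, 3)(math.max(_, _), _ + _) // 3 + 3 + 4 + 6 + 3 = 19
```

One simplification to note: real Spark applies combOp in whatever order partition results arrive, while this sketch folds them left to right; for a commutative combOp like + the answer is the same.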
The slicing can be observed by printing from inside the combine function:
z.aggregate(3)((a, b) => math.max(a, b), (a, b) => { println("a=" + a + ", b=" + b); a + b })
If the computations above make sense, what follows is easy.
val z = sc.parallelize(List("a","b","c","d","e","f"),2)
z.aggregate("x")(_ + _, _ + _) // "xxdefxabc" — note it may also come out as "xxabcxdef", since the combine order is not deterministic.
val z = sc.parallelize(List("12","23","345","4567"), 2)
z.aggregate("")((x, y) => math.max(x.length, y.length).toString, (x, y) => x + y) // res141: String = 42
The result may also be "24", for the same reason.
More interesting:
z.aggregate("")((x, y) => math.min(x.length, y.length).toString, (x, y) => x + y) // res142: String = 11
Look closely at this example. The two partitions are ("12","23") and ("345","4567"). In the first, the empty string's length is compared with "12" first: min(0, 2) = 0, and since the min result is passed through toString, 0 becomes "0". Then the length of "0" is compared with "23": min(1, 2) = 1, giving "1". By the same reasoning, "" against the second partition ("345","4567") also yields "1". Finally the two partial results are combined, starting again from the empty string, producing "11".
Keep the two functions apart: the first (seqOp) is a sequential computation within each partition, where each step's result is the input to the next; the second (combOp) combines the per-partition results, in no guaranteed order. Both computations start from aggregate's first argument, here the empty string "".
This also explains why:
val z = sc.parallelize(List("12","23","345",""), 2)
z.aggregate("")((x, y) => math.min(x.length, y.length).toString, (x, y) => x + y) // String = 10
val z = sc.parallelize(List("12","23","","345"), 2)
z.aggregate("")((x, y) => math.min(x.length, y.length).toString, (x, y) => x + y) // String = 11
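The string examples can be checked the same way, emulating aggregate over the two partitions in plain Scala. A sketch, not Spark itself: the combine order is fixed left-to-right here, whereas Spark's is not, which is why the real results can also come out reversed:

```scala
// seqOp folds left-to-right inside each partition; combOp concatenates results.
def emulateAggregate[T, U](partitions: Seq[Seq[T]], zero: U)(
    seqOp: (U, T) => U, combOp: (U, U) => U): U =
  partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

val minLen = (x: String, y: String) => math.min(x.length, y.length).toString

val r10 = emulateAggregate(Seq(Seq("12", "23"), Seq("345", "")), "")(minLen, _ + _)
// partition 1: min(0,2) -> "0", then min(1,2) -> "1"
// partition 2: min(0,3) -> "0", then min(1,0) -> "0"
// combined: "" + "1" + "0" = "10"
val r11 = emulateAggregate(Seq(Seq("12", "23"), Seq("", "345")), "")(minLen, _ + _)
// partition 2 here: min(0,0) -> "0", then min(1,3) -> "1"; combined "11"
```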
2,cartesian
Use this function with particular care: the result has |x| * |y| elements, so memory is easily exhausted.
val x = sc.parallelize(List(1,2,3,4,5))
val y = sc.parallelize(List(6,7,8,9,10))
x.cartesian(y).collect
res0: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10),
(2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9),(3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10),(5,9), (5,10))
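The same cross product can be written in plain Scala with a for-comprehension; the result size, |x| * |y|, is exactly why cartesian blows up memory so easily:

```scala
val x = List(1, 2, 3, 4, 5)
val y = List(6, 7, 8, 9, 10)

// Every element of x paired with every element of y: 5 * 5 = 25 pairs.
val pairs = for (a <- x; b <- y) yield (a, b)
```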
3,checkpoint
checkpoint saves an RDD to the checkpoint directory; the write actually happens when the next action (here count) runs.
sc.setCheckpointDir("my_directory_name")
val a = sc.parallelize(1 to 4)
a.checkpoint
a.count
4,coalesce, repartition
coalesce repartitions an RDD (by default only reducing the partition count) and returns a new RDD. repartition is a special case of coalesce, equivalent to coalesce(numPartitions, shuffle = true); shuffle = true is what allows the partition count to increase.
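A rough intuition for the shuffle-free case: adjacent partitions are merged locally, so no data crosses the network and element order is preserved. The sketch below is plain Scala with hypothetical grouping logic, not Spark's actual placement algorithm:

```scala
// Merge a sequence of partitions down to roughly n by grouping adjacent ones,
// mimicking what a shuffle-free coalesce does conceptually.
def coalesceSketch[T](partitions: Vector[Vector[T]], n: Int): Vector[Vector[T]] = {
  val groupSize = math.ceil(partitions.size.toDouble / n).toInt
  partitions.grouped(groupSize).map(_.flatten).toVector
}

val parts  = Vector(Vector(1), Vector(2, 3), Vector(4), Vector(5, 6))
val merged = coalesceSketch(parts, 2) // Vector(Vector(1, 2, 3), Vector(4, 5, 6))
```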
5,cogroup, groupWith
cogroup joins multiple RDDs by key:
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.map((_, "b"))
val c = a.map((_, "c"))
b.cogroup(c).collect
// Array[(Int, (Seq[String], Seq[String]))] = Array((2,(ArrayBuffer(b),ArrayBuffer(c))),(3,(ArrayBuffer(b),ArrayBuffer(c))),(1,(ArrayBuffer(b, b),ArrayBuffer(c, c))))
val d = a.map((_, "d")); b.cogroup(c, d).collect // cogroup with two other RDDs
// Array[(Int, (Seq[String], Seq[String], Seq[String]))] = Array((2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),(3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),(1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d, d))))
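cogroup's semantics can be emulated on ordinary Scala pair lists: for every key that appears in either input, collect the values from each side. A sketch only; Spark returns per-key iterables from a partitioned shuffle, here they are plain lists:

```scala
// For each key, gather the values from both inputs (empty Seq if absent).
def cogroupSketch[K, V](a: Seq[(K, V)], b: Seq[(K, V)]): Map[K, (Seq[V], Seq[V])] = {
  val keys = (a.map(_._1) ++ b.map(_._1)).distinct
  keys.map { k =>
    k -> (a.filter(_._1 == k).map(_._2), b.filter(_._1 == k).map(_._2))
  }.toMap
}

val left    = List((1, "b"), (2, "b"), (1, "b"), (3, "b"))
val right   = List((1, "c"), (2, "c"), (1, "c"), (3, "c"))
val grouped = cogroupSketch(left, right)
// grouped(1) == (List("b", "b"), List("c", "c")) — duplicates kept per side
```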
6,collect, toArray
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"),2)
c.collect // Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu, Rat)
7,collectAsMap // turns a pair RDD into a Scala Map of key-value pairs
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.zip(a)
b.collectAsMap
res1: scala.collection.Map[Int,Int] = Map(2 -> 2, 1 -> 1, 3 -> 3)
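Note that a contains the key 1 twice, yet the result has only one entry for it: collectAsMap keeps a single value per key. The plain-Scala equivalent shows the same collapse:

```scala
val pairs = List((1, 1), (2, 2), (1, 1), (3, 3)) // what a.zip(a) produces
val m = pairs.toMap // duplicate keys collapse to one entry, as with collectAsMap
```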
8,countByKey, countByValue
val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2)
c.countByKey
//: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b.countByValue
//: scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)
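Both counts can be reproduced on plain Scala collections with groupBy, which makes the semantics explicit:

```scala
// countByKey: count occurrences of each key in a pair collection.
val kvPairs = List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog"))
val byKey = kvPairs.groupBy(_._1).map { case (k, vs) => k -> vs.size.toLong }
// Map(3 -> 3, 5 -> 1)

// countByValue: count occurrences of each element.
val nums = List(1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1)
val byValue = nums.groupBy(identity).map { case (v, vs) => v -> vs.size.toLong }
// e.g. byValue(1) == 6, byValue(2) == 3
```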
9,countApproxDistinct // approximate distinct count; useful when the data volume is large. The argument is the relative accuracy: smaller values give a more accurate count at the cost of more memory.
val a = sc.parallelize(1 to 10000, 20)
val b = a ++ a ++ a ++ a ++ a
b.countApproxDistinct(0.1) // Long = 10784
b.countApproxDistinct(0.05) // res15: Long = 11055
b.countApproxDistinct(0.01) // res16: Long = 10040
b.countApproxDistinct(0.001) // res0: Long = 10001
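For comparison, the exact distinct count of b is 10000, which the approximate results above converge to as the accuracy parameter shrinks; countApproxDistinct instead uses a HyperLogLog-style sketch to keep memory bounded. The exact computation in plain Scala:

```scala
val a = (1 to 10000).toList
val b = a ++ a ++ a ++ a ++ a // 50000 elements, 10000 distinct
val exact = b.distinct.size   // the value the approximations are estimating
```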