1. map, flatMap, distinct
map: applies the function passed to map to every element of the RDD, producing a new element for each;
input partitions map one-to-one to output partitions,
i.e. the output RDD has exactly as many partitions as the input RDD.
flatMap: like map, but the results are flattened so that all elements end up in a single collection.
distinct: removes duplicate elements from the RDD.
Note: when flatMap is applied to an RDD[String], each String is itself treated as a sequence of characters (see the Array[Char] result below).
scala> val rdd =sc.textFile("/input/input1.txt")
scala> val rdd1 = rdd.map(x=>x.split(" "))
scala> rdd1.collect
Array[Array[String]] = Array(Array(hello, world), Array(how, are, you?), Array(ni, hao), Array(hello, tom))
scala> val rdd2 = rdd1.flatMap(x=>x) // flatten
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at :28
scala> rdd2.collect
res1: Array[String] = Array(hello, world, how, are, you?, ni, hao, hello, tom)
scala> rdd2.flatMap(x=>x).collect
res3: Array[Char] = Array(h, e, l, l, o, w, o, r, l, d, h, o, w, a, r, e, y, o, u, ?, n, i, h, a, o, h, e, l, l, o, t, o, m)
scala> val rdd3 = rdd2.distinct
rdd3: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at distinct at :30
scala> rdd3.collect
res4: Array[String] = Array(are, tom, how, you?, hello, hao, world, ni)
2. coalesce and repartition
coalesce: changes the number of partitions of an RDD
repartition: repartitions an RDD
coalesce: changes the partition count of an RDD and returns a new RDD.
It takes two parameters: the target number of partitions, and a Boolean shuffle flag that defaults to false.
If the new partition count is smaller than the current one, shuffle can stay false; shrinking works by default.
If the new partition count is larger than the current one, shuffle must be true.
Typical use: after a filter or similar reducing operation, the data in each partition can shrink sharply; consider repartitioning in that case.
Source of repartition:
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
Internally it calls coalesce with shuffle = true.
Check the RDD's default partition count:
scala> rdd.partitions.size
res4: Int = 2
Default is 2 partitions; decreasing works. Change the partition count and produce a new RDD:
scala> val rdd4 = rdd.coalesce(1)
rdd4: org.apache.spark.rdd.RDD[String] = CoalescedRDD[8] at coalesce at :26
scala> rdd4.partitions.size
res10: Int = 1
Default is 2 partitions; increasing without shuffle does not work:
scala> val rdd5 = rdd.coalesce(5)
rdd5: org.apache.spark.rdd.RDD[String] = CoalescedRDD[9] at coalesce at :26
scala> rdd5.partitions.size
res12: Int = 2
Default is 2 partitions; to increase the count, set shuffle to true:
scala> val rdd5 = rdd.coalesce(5,true)
rdd5: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at coalesce at :26
scala> rdd5.partitions.size
res13: Int = 5
repartition can either increase or decrease the partition count:
scala> val rdd6 = rdd5.repartition(8)
rdd6: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at repartition at :34
scala> rdd6.partitions.size
res6: Int = 8
******* Changing the number of partitions changes the number of Tasks *******
1) textFile can set the number of partitions when loading; to change the partitioning after loading, use the two methods above.
2) Typical scenario: after data cleansing the data set shrinks, and glom may reveal partitions that are now empty; repartitioning solves this (see the sketch below).
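A minimal sketch of that scenario (the data and the filter condition here are illustrative assumptions, not from the original):
// build a small RDD with 4 partitions, filter away most of the data,
// inspect the partitions with glom, then shrink them with coalesce
val data = sc.parallelize(1 to 100, 4)
val cleaned = data.filter(_ % 50 == 0)   // keeps only 50 and 100
cleaned.glom.collect                     // two of the four partitions are now empty
val compacted = cleaned.coalesce(1)      // merge everything into a single partition
compacted.glom.collect                   // Array(Array(50, 100))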
3. randomSplit
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
Splits the RDD randomly according to the given weights and returns an array with one RDD per weight.
Example application: a total-sort workflow, as in Hadoop's total order partitioning.
scala> val rdd = sc.parallelize(List(1,2,3,4,5,6,7))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at :24
0.7 + 0.1 + 0.2 = 1: the 7 elements of the RDD are distributed randomly according to these weights (weights that do not sum to 1 are normalized).
scala> val rdd1 = rdd.randomSplit(Array(0.7,0.1,0.2))
rdd1: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[1] at randomSplit at :26, MapPartitionsRDD[2] at randomSplit at :26, MapPartitionsRDD[3] at randomSplit at :26)
scala> rdd1(0).collect
res0: Array[Int] = Array(1, 5)
scala> rdd1(1).collect
res1: Array[Int] = Array()
scala> rdd1(2).collect
res2: Array[Int] = Array(2, 3, 4, 6, 7)
The RDD is split into new RDDs according to the weights.
4. glom: returns one array per partition, containing that partition's elements.
val rdd = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at :24
scala> rdd.glom.collect
res0: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5, 6, 7, 8, 9))
5. union: merges two RDDs without deduplication (set union, duplicates kept).
scala> val rdd = sc.parallelize(Array(9,8,7,6,5,4,3,2))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at :24
val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at :24
scala> rdd.union(rdd1).collect
res3: Array[Int] = Array(9, 8, 7, 6, 5, 4, 3, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> rdd1.union(rdd).collect
res4: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 8, 7, 6, 5, 4, 3, 2)
collect returns the result as an Array.
6. subtract: set difference
scala> rdd.subtract(rdd1).collect //rdd-rdd1
res8: Array[Int] = Array()
scala> rdd1.subtract(rdd).collect // rdd1 - rdd
res9: Array[Int] = Array(1)
7. intersection: returns the elements common to both RDDs, deduplicated
scala> rdd.intersection(rdd1).collect
res10: Array[Int] = Array(4, 6, 8, 7, 9, 3, 5, 2)
scala> rdd1.intersection(rdd).collect
res11: Array[Int] = Array(4, 6, 3, 7, 9, 8, 5, 2)
*********************************************************************
val list = List(1,2,3)
// :: prepends an element to the head of a list, producing a new list; x :: list puts x at the head
println(4 :: list) // prints: List(4, 1, 2, 3); equivalent to list.::(4)
// :+ appends an element to the end of a list: list :+ x
println(list :+ 6) // prints: List(1, 2, 3, 6)
// +: prepends an element to the head of a list
val list2 = "A" +: "B" +: Nil // Nil is the empty List, of type List[Nothing]
println(list2) // prints: List(A, B)
// ::: concatenates two Lists: list ::: list2
println(list ::: list2) // prints: List(1, 2, 3, A, B)
// ++ concatenates two collections: list ++ list2
println(list ++ list2) // prints: List(1, 2, 3, A, B)
*********************************************************************
8. mapPartitions
Operates on each partition as a whole.
Similar to map, but map is applied to every single element of the RDD,
while mapPartitions (like foreachPartition) is applied to the iterator of each partition.
val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at :24
import org.apache.spark.{SparkConf, SparkContext}

object MapPartitions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("MapPartitions")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val s = sc.parallelize(1 to 9, 3)
    // apply myfunc to the iterator of every partition and collect the result
    println(s.mapPartitions(myfunc).collect().mkString(", "))
    sc.stop()
  }

  // builds the list of adjacent element pairs within one partition
  def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
    var list = List[(T, T)]()
    var pre = iter.next()
    while (iter.hasNext) {
      val cur = iter.next()
      list = (pre, cur) :: list
      pre = cur
    }
    list.iterator
  }
}
9. mapPartitionsWithIndex
Similar to mapPartitions, but the function takes two parameters:
the first is the partition index, the second is an iterator over all items in that partition.
The output is an iterator containing the items after applying whatever transformation the function encodes.
val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
scala> x.glom.collect
res11: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {
iter.map(x => index + "," + x)
}
Note: the iterator's element type (here Iterator[Int]) must match the element type of the RDD.
x.mapPartitionsWithIndex(myfunc).collect()
res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9, 2,10)
10. zip: combines two RDDs into a new RDD of pairs
def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)]
Notes: 1. the two RDDs may have different element types;
2. both RDDs must have the same number of partitions;
3. each pair of corresponding partitions must contain the same number of elements.
If the element counts differ, the job fails, e.g.: TaskSetManager: Lost task 0.0 in stage 11.0 (TID 29, 192.168.179.131, executor 1):
org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition
scala> val rdd = sc.parallelize(1 to 10,3)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[34] at parallelize at :24
val rdd1 = sc.parallelize(List("a","b","c","d","e","f","g","h","i","j"),3)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[38] at parallelize at :24
scala> rdd.zip(rdd1).collect //(rdd,rdd1)
res14: Array[(Int, String)] = Array((1,a), (2,b), (3,c), (4,d), (5,e), (6,f), (7,g), (8,h), (9,i), (10,j))
scala> rdd1.zip(rdd).collect //(rdd1,rdd)
res15: Array[(String, Int)] = Array((a,1), (b,2), (c,3), (d,4), (e,5), (f,6), (g,7), (h,8), (i,9), (j,10))
11. zipPartitions
Requirement: all zipped RDDs must have the same number of partitions. The transcript below assumes rdd1, rdd2 and rdd3 already exist; a plausible setup is sketched first.
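A possible setup for the transcript (an assumption; the original does not show how rdd1, rdd2 and rdd3 were created, only that they must share the same partition count):
val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5), 2)           // Int elements
val rdd2 = sc.parallelize(List("A", "B", "C", "D", "E"), 2) // same number of partitions
val rdd3 = sc.parallelize(List("a", "b", "c", "d", "e"), 2) // same number of partitions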
scala> var rdd4 = rdd1.zipPartitions(rdd2,rdd3){
| (rdd1Iter,rdd2Iter,rdd3Iter) => {
| var result = List[String]()
| while(rdd1Iter.hasNext && rdd2Iter.hasNext && rdd3Iter.hasNext) {
| result::=(rdd1Iter.next() + "_" + rdd2Iter.next() + "_" + rdd3Iter.next())
| }
| result.iterator
| }
| }
rdd4: org.apache.spark.rdd.RDD[String] = ZippedPartitionsRDD3[33] at zipPartitions at :27
scala> rdd4.collect
res23: Array[String] = Array(2_B_b, 1_A_a, 5_E_e, 4_D_d, 3_C_c)
12. zipWithIndex
def zipWithIndex(): RDD[(T, Long)]
Pairs every element of the RDD with its index, producing a new RDD[(T, Long)].
scala> val rdd1 = sc.parallelize(List("a","b","c","d","e","f","g","h","i","j"),3)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[38] at parallelize at :24
scala> rdd1.zipWithIndex.collect
res31: Array[(String, Long)] = Array((a,0), (b,1), (c,2), (d,3), (e,4), (f,5), (g,6), (h,7), (i,8), (j,9))
scala> rdd1.zipWithIndex.glom.collect
res32: Array[Array[(String, Long)]] = Array(Array((a,0), (b,1), (c,2)), Array((d,3), (e,4), (f,5)), Array((g,6), (h,7), (i,8), (j,9)))
13. zipWithUniqueId
def zipWithUniqueId(): RDD[(T, Long)]
Pairs each element of the RDD with a unique ID (as key/value pairs). The IDs are generated as follows:
the first element of each partition gets the partition index as its ID;
the Nth element of a partition gets (the previous element's ID) + (the total number of partitions).
scala> val rdd = sc.parallelize(List(1,2,3,4,5),2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at :24
scala> rdd.glom.collect
res25: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5))
scala> val rdd2 = rdd.zipWithUniqueId()
rdd2: org.apache.spark.rdd.RDD[(Int, Long)] = MapPartitionsRDD[23] at zipWithUniqueId at :26
scala> rdd2.collect
res26: Array[(Int, Long)] = Array((1,0), (2,2), (3,1), (4,3), (5,5))
How the IDs are computed:
step 1: the first element of partition 0 gets 0; the first element of partition 1 gets 1.
step 2: the second element of partition 0 gets 0 + 2 = 2.
step 3: the second element of partition 1 gets 1 + 2 = 3; the third element of partition 1 gets 3 + 2 = 5.
Partitions: Array(Array(1, 2), Array(3, 4, 5))
IDs per element: 1 -> 0, 2 -> 2, 3 -> 1, 4 -> 3, 5 -> 5
With 3 partitions:
scala> val rdd = sc.parallelize(List(1,2,3,4,5),3)
scala> rdd.glom.collect
res37: Array[Array[Int]] = Array(Array(1), Array(2, 3), Array(4, 5))
scala> rdd.zipWithUniqueId().collect
res38: Array[(Int, Long)] = Array((1,0), (2,1), (3,4), (4,2), (5,5))
IDs per element: 1 -> 0, 2 -> 1, 3 -> 4, 4 -> 2, 5 -> 5
*************************************************************************************************************************************************
14. reduceByKey
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
Merges the values of entries that share the same key using the given function.
scala> val rdd = sc.parallelize(List("cat","dog","bear","frog","fish","chichen"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[47] at parallelize at :24
scala> rdd.map(x=>(x.length,x)).reduceByKey(_+_).collect
res35: Array[(Int, String)] = Array((4,bearfrogfish), (3,catdog), (7,chichen))
15. keyBy
def keyBy[K](f: T => K): RDD[(K, T)]
Uses the return value of f as the key and pairs it with each element of the RDD, producing a pair RDD (RDD[(K, T)]).
scala> val rdd = sc.parallelize(List("cat","dog","bear","frog","fish","chichen"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[47] at parallelize at :24
scala> rdd.keyBy(x=>x+"1").collect
res38: Array[(String, String)] = Array((cat1,cat), (dog1,dog), (bear1,bear), (frog1,frog), (fish1,fish), (chichen1,chichen))
16. groupByKey()
def groupByKey(): RDD[(K, Iterable[V])]
Groups values that share the same key; the result is RDD[(K, Iterable[V])].
scala> val rdd = sc.parallelize(List((1,"a"),(1,"f"),(2,"b"),(2,"c"),(3,"d")))
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[0] at parallelize at :24
scala> rdd.groupByKey.collect
res0: Array[(Int, Iterable[String])] = Array((1,CompactBuffer(a, f)), (3,CompactBuffer(d)), (2,CompactBuffer(b, c)))
17. keys
def keys: RDD[K]
Returns an RDD containing the keys of a pair RDD.
scala> rdd.keys.collect
res1: Array[Int] = Array(1, 1, 2, 2, 3)
18. values
def values: RDD[V]
Returns an RDD containing the values of a pair RDD.
scala> rdd.values.collect
res2: Array[String] = Array(a, f, b, c, d)
19. sortByKey
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size): RDD[P]
Sorts the RDD by key; ascending defaults to true.
scala> val rdd = sc.parallelize(List("one","two","three","four","five"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[7] at parallelize at :24
scala> val rdd1 = sc.parallelize(1 to rdd.count.toInt)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at :26
scala> rdd.zip(rdd1).collect
res4: Array[(String, Int)] = Array((one,1), (two,2), (three,3), (four,4), (five,5))
scala> rdd.zip(rdd1).map(_.swap).sortByKey().collect
res6: Array[(Int, String)] = Array((1,one), (2,two), (3,three), (4,four), (5,five))
scala> rdd.zip(rdd1).map(_.swap).sortByKey(false).collect
res7: Array[(Int, String)] = Array((5,five), (4,four), (3,three), (2,two), (1,one))
20. partitionBy
def partitionBy(partitioner: Partitioner): RDD[(K, V)]
Repartitions a pair RDD using the given Partitioner, as sketched below.
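A brief sketch (the pair RDD and the partition count of 4 are illustrative assumptions):
import org.apache.spark.HashPartitioner
val pairs = sc.parallelize(List((1,"a"), (2,"b"), (3,"c"), (4,"d")), 2)
// redistribute the pairs into 4 partitions by the hash of the key
val byHash = pairs.partitionBy(new HashPartitioner(4))
byHash.partitions.size // 4
byHash.glom.collect    // each key lands in partition key.hashCode % 4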
**************************************************************************************************************
Aggregation operations
1. mapValues[Pair]
def mapValues[U](f: V => U): RDD[(K, U)]
Turns RDD[(K, V)] into RDD[(K, U)] by applying f: V => U to every value; the keys are left unchanged.
scala> val rdd = sc.parallelize(List("one","two","three","four","five"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[23] at parallelize at
scala> val rdd2 = rdd.map(x=>(x.length,x)).mapValues(_+"_").collect
rdd2: Array[(Int, String)] = Array((3,one_), (3,two_), (5,three_), (4,four_), (4,five_))
2. flatMapValues[Pair]
def flatMapValues[U](f: V => TraversableOnce[U]): RDD[(K, U)]
Applies f to every value and flattens the result, pairing each produced element with the original key.
scala> val rdd = sc.parallelize(List("one","two","three","four","five"))
scala> rdd.map(x=>(x.length,x)).flatMapValues("_"+_+"_").collect
res18: Array[(Int, Char)] = Array((3,_), (3,o), (3,n), (3,e), (3,_), (3,_), (3,t), (3,w), (3,o), (3,_), (5,_), (5,t), (5,h), (5,r), (5,e), (5,e), (5,_), (4,_), (4,f), (4,o), (4,u), (4,r), (4,_), (4,_), (4,f), (4,i), (4,v), (4,e), (4,_))
3. subtractByKey[Pair]
def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)]
Removes from the source RDD every element whose key also appears in the other RDD.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant", "falcon", "squid"), 2)
val d = c.keyBy(_.length)
b.subtractByKey(d).collect
res15: Array[(Int, String)] = Array((4,lion))
4. combineByKey[Pair]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
The three important function parameters are:
createCombiner: V => C — takes the current value and turns it into the combined type C (e.g. a type conversion); this acts as the initialization step.
mergeValue: (C, V) => C — merges a value V into an existing combiner C; this runs within each partition.
mergeCombiners: (C, C) => C — merges two combiners; this runs across partitions.
In other words: createCombiner fires the first time a key is seen in a partition;
mergeValue fires when the same key appears again within that partition;
mergeCombiners merges the values of the same key coming from different partitions.
scala> val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[30] at parallelize at :24
scala> val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at :24
scala> val c = b.zip(a)
c: org.apache.spark.rdd.RDD[(Int, String)] = ZippedPartitionsRDD2[32] at zip at :28
scala> c.glom.collect
res19: Array[Array[(Int, String)]] = Array(Array((1,dog), (1,cat), (2,gnu)), Array((2,salmon), (2,rabbit), (1,turkey)), Array((2,wolf), (2,bear), (2,bee)))
scala> c.combineByKey(List(_),(x:List[String],y:String)=>y::x,(x:List[String],y:List[String])=>x:::y).collect
res23: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))
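A second, commented sketch (an illustrative example with assumed data, not part of the original transcript) that uses combineByKey to compute a per-key average:
// keep a (sum, count) combiner per key, then divide to get the average
val scores = sc.parallelize(List(("a", 1.0), ("a", 3.0), ("b", 4.0)), 2)
val sumCount = scores.combineByKey(
  (v: Double) => (v, 1),                                              // createCombiner: first value of a key in a partition
  (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),        // mergeValue: same key seen again in the partition
  (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2)) // mergeCombiners: merge results across partitions
sumCount.mapValues { case (sum, count) => sum / count }.collect()
// e.g. Array((a,2.0), (b,4.0)) -- the order of the pairs may differ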
5. foldByKey[Pair]
def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
Works like reduceByKey, but the curried form takes an initial zeroValue that seeds the fold for each key.
scala> val rdd = sc.parallelize(List("one","two","three","four","five"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[36] at parallelize at :24
scala> rdd.map(x=>(x.length,x)).foldByKey("@")(_+_).collect
res27: Array[(Int, String)] = Array((4,@fourfive), (3,@onetwo), (5,@three))
6. reduceByKeyLocally
This is an action.
It merges the values of each key of an RDD[(K, V)] using the given function and returns the result to the driver as a Map[K, V], rather than as an RDD[(K, V)].
def reduceByKeyLocally(func: (V, V) => V): Map[K, V]
scala> val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at :24
scala> val b = a.map(x => (x.length, x))
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at map at :26
scala> b.glom.collect
Array[Array[(Int, String)]] = Array(Array((3,dog), (3,cat)), Array((3,owl), (3,gnu), (3,ant)))
scala> b.reduceByKeyLocally(_+_)
res0: scala.collection.Map[Int,String] = Map(3 -> dogcatowlgnuant)
7.join
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
scala> val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[43] at parallelize at :24
scala> val b = a.keyBy(_.length)
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[44] at keyBy at :26
scala> val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[45] at parallelize at :24
scala> val d = c.keyBy(_.length)
d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[46] at keyBy at :26
scala> b.join(d).collect
res28: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))
scala> val b = a.keyBy(_.length).collect
b: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant))
8. rightOuterJoin
Joins two pair RDDs, keeping every key of the other (right-hand) RDD; the value from the source RDD becomes an Option.
9. leftOuterJoin
Joins two pair RDDs, keeping every key of the source (left-hand) RDD; the value from the other RDD becomes an Option.
10. cogroup
Groups the data that shares the same key in both RDDs together (a full outer grouping). A sketch of all three follows.
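A short sketch of the three operations (the RDDs and values are illustrative assumptions; collect order may vary):
val left  = sc.parallelize(List((1, "a"), (2, "b")))
val right = sc.parallelize(List((2, "x"), (3, "y")))
left.leftOuterJoin(right).collect()
// keeps all keys of left:  Array((1,(a,None)), (2,(b,Some(x))))
left.rightOuterJoin(right).collect()
// keeps all keys of right: Array((2,(Some(b),x)), (3,(None,y)))
left.cogroup(right).collect()
// all keys of both sides:  Array((1,(CompactBuffer(a),CompactBuffer())), (2,(CompactBuffer(b),CompactBuffer(x))), (3,(CompactBuffer(),CompactBuffer(y))))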
Action operations
-------------------------------------
Definition: an action triggers a Job by calling runJob(),
for example collect and count.
1. first(): T — returns the first element of the RDD, without sorting.
count(): Long — returns the number of elements in the RDD.
reduce(f: (T, T) => T): T — combines the elements of the RDD pairwise using f.
collect(): Array[T] — returns the RDD's elements as an array. A short sketch of these four actions follows.
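A minimal sketch (the values mirror the rdd1 used just below):
val nums = sc.makeRDD(Seq(10, 4, 3, 12, 3))
nums.first()       // 10
nums.count()       // 5
nums.reduce(_ + _) // 32
nums.collect()     // Array(10, 4, 3, 12, 3)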
2. take(num: Int): Array[T] — returns the elements at indices 0 through num-1, without sorting.
scala> var rdd1 = sc.makeRDD(Seq(10,4,3,12,3))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at makeRDD at :24
scala> rdd1.take(1)
res7: Array[Int] = Array(10)
3. top(num: Int): Array[T] — returns the first num elements according to the default ordering (descending) or a specified ordering.
scala> rdd1.top(3)
res8: Array[Int] = Array(12, 10, 4)
4. takeOrdered(num: Int): Array[T] — like top, but returns the elements in the opposite (ascending) order.
scala> rdd1.takeOrdered(3)
res9: Array[Int] = Array(3, 3, 4)
5. sortBy — def sortBy[K](f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length): RDD[T]; the example below uses a key-value RDD.
scala> var rdd1 = sc.makeRDD(Array(("A",2),("A",1),("B",6),("B",3),("B",7)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[7] at makeRDD at :24
Sort ascending by the whole (key, value) tuple:
scala> rdd1.sortBy(x=>x).collect
res10: Array[(String, Int)] = Array((A,1), (A,2), (B,3), (B,6), (B,7))
Sort by value, descending:
scala> rdd1.sortBy(x=>x._2,false).collect
res13: Array[(String, Int)] = Array((B,7), (B,6), (B,3), (A,2), (A,1))
6. foreach
Runs the given function on the executor nodes; any output goes to the executors, not to the driver.
In local mode:
$ > spark-shell
scala> val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[6] at parallelize at :24
The printed output:
scala> a.foreach(println)
dog
tiger
lion
cat
spider
eagle
scala> :q
$> spark-shell --master spark://master:7077
scala> val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at :24
scala> a.foreach(println)
No output appears on the driver.
Check the web UI at master:4040:
Running Applications --> Executor Summary --> Logs --> stdout; the stdout log page for app-20190727155306-0007/2 shows: cat spider eagle
Summary:
In a local-mode spark-shell all of the daemons run in a single JVM, so foreach(println) prints to the console.
Printing the elements of an RDD
Another common idiom is attempting to print out the elements of an RDD using rdd.foreach(println) or rdd.map(println).
On a single machine, this generates the expected output and prints all of the RDD's elements.
In cluster mode, however, the stdout output invoked by the executors is written to the executors' stdout,
not to the driver's, so stdout on the driver won't show these. To print all elements on the driver,
you can use the collect() method to first bring the RDD to the driver node: rdd.collect().foreach(println).
This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine;
if you only need to print a few elements of the RDD, a safer approach is to use take(): rdd.take(100).foreach(println).
7. aggregate
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
aggregate aggregates the elements of the RDD: seqOp first folds the T-typed elements of each partition into a value of type U,
then combOp merges the per-partition U results into a single U.
Note that both seqOp and combOp use zeroValue, whose type is U.
val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z.aggregate(0)(math.max(_, _), _ + _)
res40: Int = 9
Explanation:
step 1: in the first partition, math.max runs over [0, 1, 2, 3], giving 3.
step 2: in the second partition, math.max runs over [0, 4, 5, 6], giving 6.
(step n: in the Nth partition, math.max gives that partition's maximum.)
final step: combOp (_ + _) is applied over all partition results: 0 + 3 + 6 = 9.
Change the initial value to 5:
z.aggregate(5)(math.max(_, _), _ + _)
What is the result?
Explanation:
// This example returns 16 since the initial value is 5
// reduce of partition 0 will be max(5, 1, 2, 3) = 5
// reduce of partition 1 will be max(5, 4, 5, 6) = 6
// final reduce across partitions will be 5 + 5 + 6 = 16
// note the final reduce include the initial value
Another example:
scala> val z = sc.parallelize(List("12","23","345","4567"),2)
z: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at :24
scala> z.glom.collect
res11: Array[Array[String]] = Array(Array(12, 23), Array(345, 4567))
scala> z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res12: String = 42
scala> z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res13: String = 24
Explanation:
step 1: the first partition runs math.max(x.length, y.length).toString:
res0 = math.max("".length, "12".length).toString = "2"
res1 = math.max(res0.length, "23".length).toString = "2"
The first partition finally returns "2".
step 2: the second partition runs math.max(x.length, y.length).toString:
res2 = math.max("".length, "345".length).toString = "3"
res3 = math.max(res2.length, "4567".length).toString = "4"
The second partition finally returns "4".
step 3: combOp (x, y) => x + y then yields "24" or "42"
(the order depends on which partition's result arrives first, since the partitions run on different executors; check the task times in the web UI).
scala> z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res14: String = 11
step 1: the first partition runs math.min(x.length, y.length).toString:
res0 = math.min("".length, "12".length).toString = "0"
res1 = math.min(res0.length, "23".length).toString = "1"
The first partition finally returns "1".
step 2: the second partition runs math.min(x.length, y.length).toString:
res2 = math.min("".length, "345".length).toString = "0"
res3 = math.min(res2.length, "4567".length).toString = "1"
The second partition finally returns "1".
step 3: combOp (x, y) => x + y yields "11" either way.
scala> z.aggregate("12")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res15: String = 1211
step 1: the first partition runs math.min(x.length, y.length).toString:
res0 = math.min("12".length, "12".length).toString = "2"
res1 = math.min(res0.length, "23".length).toString = "1"
The first partition finally returns "1".
step 2: the second partition runs math.min(x.length, y.length).toString:
res2 = math.min("12".length, "345".length).toString = "2"
res3 = math.min(res2.length, "4567".length).toString = "1"
The second partition finally returns "1".
step 3: combOp (x, y) => x + y also starts from the zeroValue "12", so the result is "12" + "1" + "1" = "1211".
8. fold
def fold(zeroValue: T)(op: (T, T) => T): T
fold can be seen as a simplified aggregate in which seqOp and combOp are the same function op and the zero value has the element type T.
val a = sc.parallelize(List(1,2,3), 3)
a.fold(0)(_ + _)
res59: Int = 6