Transformation Operations and Transformation Operators in Spark

Recommended site for learning the operators: http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html

Table of Contents

      • Introduction to transformation operations
      • Examples of transformation operators
        • ==map, flatMap, distinct==
        • ==coalesce and repartition==: both change the number of RDD partitions (repartitioning)
        • ==randomSplit==: randomly split an RDD
        • ==glom==: return the data items of each partition
        • ==union==: union
        • ==subtract==: difference
        • ==intersection==: intersection
        • ==mapPartitions==: operate on each partition
        • ==mapPartitionsWithIndex==
        • ==zip==
        • ==zipPartitions==
        • ==zipWithIndex==
        • ==zipWithUniqueId==
        • ==join==
        • ==rightOuterJoin==
        • ==leftOuterJoin==
        • ==cogroup==
      • Transformation operators for key-value pairs
        • ==reduceByKey[Pair]==
        • ==groupByKey()[Pair]==
        • ==keyBy==: use the result of a function as the key
        • ==keys[Pair]==
        • ==values[Pair]==
        • ==sortByKey[Pair]==
        • ==partitionBy[Pair]==
        • ==mapValues[Pair]==
        • ==flatMapValues[Pair]==
        • ==subtractByKey[Pair]==
        • ==combineByKey[Pair]==
        • ==foldByKey[Pair]==

Introduction to Transformation Operations

A transformation turns the current RDD into a new RDD. Transformations are lazily evaluated: the computation only runs when an action is triggered.

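A minimal sketch of lazy evaluation in spark-shell (variable names are illustrative): the map call below only records the lineage of the new RDD, and nothing is computed until the collect action runs.

// Transformations are lazy: this line only builds the lineage, no job runs yet.
val nums = sc.parallelize(1 to 5)
val squared = nums.map(x => x * x)   // still no computation

// Only an action such as collect() triggers the actual execution.
squared.collect()                    // Array(1, 4, 9, 16, 25)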

Examples of Transformation Operators

map, flatMap, distinct

map: applies the given function to every data item of the RDD, producing one new element per input element.
        Input and output partitions correspond one to one: there are as many output partitions as input partitions (the partition count does not change).

flatMap: like map, but all resulting elements are finally flattened into a single collection (the partition count does not change).
       Note: for Array[String], each String is treated as a sequence of characters.

distinct: removes duplicate elements from the RDD.
// map example
scala> val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> a.partitions.length
res0: Int = 3

scala> a.glom.collect
res4: Array[Array[String]] = Array(Array(dog), Array(salmon, salmon), Array(rat, elephant))

scala> val b = a.map(_.length)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at map at <console>:26

scala> b.partitions.length
res5: Int = 3

scala> b.glom.collect
res6: Array[Array[Int]] = Array(Array(3), Array(6, 6), Array(3, 8))   

// flatMap example (rdd1 is an RDD[Array[String]], e.g. created with
// sc.parallelize(List(Array("hello","world"), Array("how","are","you?"), Array("ni","hao"), Array("hello","tom"))))
scala> rdd1.collect
res0: Array[Array[String]] = Array(Array(hello, world), Array(how, are, you?), Array(ni, hao), Array(hello, tom))

scala> val rdd2 = rdd1.flatMap(x=>x)
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:28

scala> rdd2.collect
res1: Array[String] = Array(hello, world, how, are, you?, ni, hao, hello, tom)  
	
scala> rdd2.flatMap(x=>x).collect
res3: Array[Char] = Array(h, e, l, l, o, w, o, r, l, d, h, o, w, a, r, e, y, o, u, ?, n, i, h, a, o, h, e, l, l, o, t, o, m)
         

// distinct: remove duplicates
scala> a.collect
res7: Array[String] = Array(dog, salmon, salmon, rat, elephant)

scala> val c = a.distinct
c: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at distinct at <console>:26

scala> c.collect
res8: Array[String] = Array(rat, salmon, elephant, dog)

coalesce and repartition: both change the number of RDD partitions (repartitioning)

def coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]
def repartition(numPartitions: Int): RDD[T]

Both change the number of partitions of an RDD and return a new RDD. coalesce takes two parameters: the target number of partitions and a shuffle flag (Boolean, default false); repartition(n) is equivalent to coalesce(n, shuffle = true).
	If the new partition count is smaller than the current one, shuffle can stay false.
	If the new partition count is larger than the current one, shuffle must be true.
Practical use: after a filter or similar reducing operation, the data left in each partition can shrink drastically, so repartitioning is worth considering (see the sketch after the example below).
// Example:
scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24

scala> rdd.partitions.length
res10: Int = 1

scala> val rdd1= rdd.coalesce(5,true)
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[15] at coalesce at <console>:26

scala> rdd1.partitions.length
res14: Int = 5

scala> rdd1.glom.collect
res15: Array[Array[Int]] = Array(Array(5, 10), Array(1, 6), Array(2, 7), Array(3, 8), Array(4, 9))
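As noted above, a filter can leave most partitions nearly empty; a minimal sketch of shrinking the partition count afterwards (the sizes and counts here are illustrative):

// 100 partitions before filtering; most become almost empty after the filter.
val big = sc.parallelize(1 to 1000000, 100)
val filtered = big.filter(_ % 1000 == 0)   // only ~1000 elements survive

// Reducing the partition count needs no shuffle, so the shuffle flag can stay at its default (false).
val compact = filtered.coalesce(4)
compact.partitions.length                  // 4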

randomSplit: randomly split an RDD

def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]

Randomly splits an RDD into several smaller RDDs according to a weights array, which specifies the fraction of the total data that goes to each smaller RDD.
Note: the actual size of each smaller RDD is only approximately the fraction given by the weights array.

Use case: data sampling, e.g. for a Hadoop-style total-order sort (a reproducible-split sketch follows the example below).
// Example:
scala> rdd.collect
res19: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> val rdd0 =rdd.randomSplit(Array(0.1,0.2,0.8))
	rdd0: Array[org.apache.spark.rdd.RDD[Int]] = Array(
	MapPartitionsRDD[17] at randomSplit at <console>:26, 
	MapPartitionsRDD[18] at randomSplit at <console>:26, 
	MapPartitionsRDD[19] at randomSplit at <console>:26)

scala> rdd0(0).collect
res16: Array[Int] = Array(9)

scala> rdd0(1).collect
res17: Array[Int] = Array(8)

scala> rdd0(2).collect
res18: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 10)
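For the sampling use case above, a common pattern is a weighted split with a fixed seed so that the assignment is reproducible; a sketch (the 0.8/0.2 weights are illustrative):

// Split the data roughly 80% / 20%; the seed makes the random assignment reproducible.
val data = sc.parallelize(1 to 100)
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

train.count()   // roughly 80; the exact number varies around the weight
test.count()    // roughly 20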

glom: return the data items of each partition

Description: returns the items of each partition collected into an array; commonly used to inspect how data is distributed when experimenting with parallelism.

scala> val z=sc.parallelize(1 to 15,3)
z: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:24

scala> z.glom.collect
res20: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5), Array(6, 7, 8, 9, 10), Array(11, 12, 13, 14, 15))

union: union

Description: merges two RDDs without removing duplicates.
Note: after union, the partition count is the sum of the two RDDs' partition counts.

scala> val x= sc.parallelize(1 to 6,2)
x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[22] at parallelize at <console>:24

scala> val y =sc.parallelize(5 to 13,3)
y: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at parallelize at <console>:24

scala> val z =x.union(y)
z: org.apache.spark.rdd.RDD[Int] = UnionRDD[24] at union at <console>:28

scala> z.glom.collect
res21: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(5, 6, 7), Array(8, 9, 10), Array(11, 12, 13))

subtract: difference

Description: returns the elements of the first RDD that do not appear in the second RDD.
Note: after subtract, the partition count equals that of the first RDD.

scala> val z1=x.subtract(y)
z1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[29] at subtract at <console>:28

scala> z1.glom.collect
res23: Array[Array[Int]] = Array(Array(2, 4), Array(1, 3))

intersection: intersection

Description: returns the intersection of two RDDs, with duplicates removed.
Note: after intersection, the partition count is the larger of the two input RDDs' partition counts.

scala> val z2 = x.intersection(y)
z2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[36] at intersection at <console>:28

scala> z2.glom.collect
res24: Array[Array[Int]] = Array(Array(6), Array(), Array(5)) 

mapPartitions: operate on each partition

def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

This is a specialized map that is called only once per partition.
The entire content of a partition is available through the input argument (Iterator[T]) as a sequential stream of values.
The custom function must return another Iterator[U]; the combined result iterators are automatically turned into a new RDD.

Practical use: when writing an RDD to a database, use mapPartitions so that a connection (conn) object is instantiated once per partition rather than once per element (a sketch follows the example below).

// Example:
val a = sc.parallelize(1 to 9, 3)

// Pair each element with its successor, working within one partition at a time.
def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
  var res = List[(T, T)]()
  var pre = iter.next
  while (iter.hasNext) {
    val cur = iter.next
    res = (pre, cur) :: res   // prepend the new pair
    pre = cur
  }
  res.iterator
}
	
a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))
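The database use case mentioned above is why mapPartitions matters in practice: one connection per partition instead of one per element. A hedged sketch, with comments standing in for the real connection API:

val records = sc.parallelize(Seq("a", "b", "c", "d"), 2)

val written = records.mapPartitions { iter =>
  // In a real job, open the database connection here, once per partition.
  val out = iter.map(r => s"inserted $r").toList   // materialize before "closing" the connection
  // conn.close() would go here, after the iterator has been fully consumed.
  out.iterator
}
written.collect()   // Array(inserted a, inserted b, inserted c, inserted d)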

mapPartitionsWithIndex

def mapPartitionsWithIndex[U: ClassTag](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

Similar to mapPartitions, but the function takes two arguments.
The first argument is the index of the partition and the second is an iterator over all items of that partition.
The output is an iterator containing the items after applying whatever transformation the function encodes.

val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)

def myfunc[T](index: Int, iter: Iterator[T]): Iterator[String] = {
  iter.map(x => index + "," + x)
}

Note: the iterator's element type (Iterator[T], here Iterator[Int]) must match the element type of the RDD.

x.mapPartitionsWithIndex(myfunc).collect()
res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9, 2,10)

zip

Description: joins two RDDs by combining the i-th elements of corresponding partitions. The resulting RDD consists of two-component tuples.
Note:
1. The two RDDs may have different element types.
2. Both RDDs must have the same number of partitions.
3. Corresponding partitions must contain the same number of elements.

val x1 = sc.parallelize(1 to 15,3)
val y1 = sc.parallelize(11 to 25,3)

x1.zip(y1).collect
res27: Array[(Int, Int)] = Array((1,11), (2,12), (3,13), (4,14), (5,15), (6,16), (7,17), (8,18), (9,19), (10,20), (11,21), (12,22), (13,23), (14,24), (15,25))

scala>  val z1 = sc.parallelize(21 to 35,3) 
z1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[44] at parallelize at <console>:24

scala> x1.zip(y1).zip(z1).map((x) => (x._1._1, x._1._2, x._2 )).collect
res28: Array[(Int, Int, Int)] = Array((1,11,21), (2,12,22), (3,13,23), (4,14,24), (5,15,25), (6,16,26), (7,17,27), (8,18,28), (9,19,29), (10,20,30), (11,21,31), (12,22,32), (13,23,33), (14,24,34), (15,25,35))

zipPartitions

Similar to zip; requires all RDDs involved to have the same number of partitions.

// Example:
val a = sc.parallelize(0 to 9, 3)
val b = sc.parallelize(10 to 19, 3)
val c = sc.parallelize(100 to 109, 3)

def myfunc(aiter: Iterator[Int], biter: Iterator[Int], citer: Iterator[Int]): Iterator[String] = {
  var res = List[String]()
  while (aiter.hasNext && biter.hasNext && citer.hasNext) {
    val x = aiter.next + " " + biter.next + " " + citer.next
    res ::= x
  }
  res.iterator
}

a.zipPartitions(b, c)(myfunc).collect
res50: Array[String] = Array(2 12 102, 1 11 101, 0 10 100, 5 15 105, 4 14 104, 3 13 103, 9 19 109, 8 18 108, 7 17 107, 6 16 106)

zipWithIndex

def zipWithIndex(): RDD[(T, Long)]
Pairs each element of the RDD with its index, producing a new RDD[(T, Long)].

// Example:
val y1 = sc.parallelize(11 to 25,3)

scala> y1.zipWithIndex.collect
res29: Array[(Int, Long)] = Array((11,0), (12,1), (13,2), (14,3), (15,4), (16,5), (17,6), (18,7), (19,8), (20,9), (21,10), (22,11), (23,12), (24,13), (25,14))

val z = sc.parallelize(Array("A", "B", "C", "D"))
val r = z.zipWithIndex
r.collect
res110: Array[(String, Long)] = Array((A,0), (B,1), (C,2), (D,3))

zipWithUniqueId

def zipWithUniqueId(): RDD[(T, Long)]
Pairs each element of the RDD with a unique Long id. Unlike zipWithIndex, the ids are not necessarily consecutive, and computing them does not trigger an extra Spark job.

// Example:
val z = sc.parallelize(100 to 120, 5)
val r = z.zipWithUniqueId
r.collect

res12: Array[(Int, Long)] = Array(
(100,0), (101,5), (102,10), (103,15),
(104,1),(105,6), (106,11), (107,16), 
(108,2), (109,7), (110,12), (111,17), 
(112,3), (113,8), (114,13), (115,18), 
(116,4), (117,9), (118,14), (119,19), (120,24))

// Rule used to compute the ids (5 partitions here):
Step 1: the first element of partition 0 gets id 0,
        the first element of partition 1 gets id 1,
        the first element of partition 2 gets id 2,
        the first element of partition 3 gets id 3,
        the first element of partition 4 gets id 4.

Step 2: within each partition, every following element adds the number of partitions (5) to the previous id:
        partition 0: 0, 5, 10, 15
        partition 1: 1, 6, 11, 16
        partition 2: 2, 7, 12, 17
        partition 3: 3, 8, 13, 18
        partition 4: 4, 9, 14, 19, 24
In general, the k-th element (counting from 0) of partition i gets id i + k * numPartitions, as the sketch below reproduces.
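A small sketch that reproduces the same ids by hand with mapPartitionsWithIndex, to confirm the i + k * numPartitions rule (variable names are illustrative):

val z = sc.parallelize(100 to 120, 5)
val n = z.partitions.length   // 5

// Element k of partition i gets id i + k * n, exactly like zipWithUniqueId.
val manual = z.mapPartitionsWithIndex { (i, iter) =>
  iter.zipWithIndex.map { case (v, k) => (v, i.toLong + k.toLong * n) }
}
manual.collect()   // same pairs as z.zipWithUniqueId.collect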

join

Description: performs an inner join of two RDDs, grouping the values that share the same key together.
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
// Example:
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)

val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)

b.join(d).collect
res0: Array[(Int, (String, String))] = Array(
	(6,(salmon,salmon)), (6,(salmon,rabbit)),(6,(salmon,turkey)), 
	(6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), 
	(3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), 
	(3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))

rightOuterJoin

Description: performs a right outer join: every key of the second (right-hand) RDD is kept in the result; values from the first RDD are wrapped in Option and become None for keys the first RDD does not contain.


val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)

b.rightOuterJoin(d).collect
res2: Array[(Int, (Option[String], String))] = Array(
	(6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), 
	(6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), 
	(3,(Some(dog),dog)), (3,(Some(dog),cat)), (3,(Some(dog),gnu)), (3,(Some(dog),bee)), 
	(3,(Some(rat),dog)), (3,(Some(rat),cat)), (3,(Some(rat),gnu)), (3,(Some(rat),bee)),
	(4,(None,wolf)), 
	(4,(None,bear))
	)

leftOuterJoin

Description: performs a left outer join: every key of the first (left-hand) RDD is kept in the result; values from the second RDD are wrapped in Option and become None for keys the second RDD does not contain.

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)

b.leftOuterJoin(d).collect
res1: Array[(Int, (String, Option[String]))] = Array(
	(6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), 
	(6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), 
	(3,(dog,Some(dog))), (3,(dog,Some(cat))), (3,(dog,Some(gnu))), (3,(dog,Some(bee))), 
	(3,(rat,Some(dog))), (3,(rat,Some(cat))), (3,(rat,Some(gnu))), (3,(rat,Some(bee))), 
	(8,(elephant,None))
	)

cogroup

Description: groups the values that share the same key from both RDDs together (a full outer grouping); up to three key-value RDDs can be combined by key.

val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.map((_, "b"))
val c = a.map((_, "c"))
b.cogroup(c).collect
res7: Array[(Int, (Iterable[String], Iterable[String]))] = Array(
(2,(ArrayBuffer(b),ArrayBuffer(c))),
(3,(ArrayBuffer(b),ArrayBuffer(c))),
(1,(ArrayBuffer(b, b),ArrayBuffer(c, c)))
)

val d = a.map((_, "d"))
b.cogroup(c, d).collect
res9: Array[(Int, (Iterable[String], Iterable[String], Iterable[String]))] = Array(
(2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),
(3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),
(1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d, d)))
)

val x = sc.parallelize(List((1, "apple"), (2, "banana"), (3, "orange"), (4, "kiwi")), 2)
val y = sc.parallelize(List((5, "computer"), (1, "laptop"), (1, "desktop"), (4, "iPad")), 2)
x.cogroup(y).collect
res23: Array[(Int, (Iterable[String], Iterable[String]))] = Array(
(4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))), 
(2,(ArrayBuffer(banana),ArrayBuffer())), 
(3,(ArrayBuffer(orange),ArrayBuffer())),
(1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))),
(5,(ArrayBuffer(),ArrayBuffer(computer))))

Transformation Operators for Key-Value Pairs

reduceByKey[Pair]

def reduceByKey(func: (V, V) => V): RDD[(K, V)]
Description: merges the values that share the same key.

// Example 1:
scala> val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[50] at parallelize at <console>:24

scala> val b = a.map(x=>(x.length,x))
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[51] at map at <console>:28

scala> b.collect
res31: Array[(Int, String)] = Array((3,dog), (3,cat), (3,owl), (3,gnu), (3,ant))

scala> b.reduceByKey((x,y)=>x+y).collect
res32: Array[(Int, String)] = Array((3,dogcatowlgnuant))  

// Example 2:
scala> val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[53] at parallelize at <console>:24

scala> val b = a.map(x=>(x.length,x))
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[54] at map at <console>:28

scala> b.collect
res33: Array[(Int, String)] = Array((3,dog), (5,tiger), (4,lion), (3,cat), (7,panther), (5,eagle))

scala> b.reduceByKey(_+_).collect
res34: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))

groupByKey()[Pair]

Description: groups the values by key; returns RDD[(K, Iterable[V])].

scala> val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[53] at parallelize at <console>:24

scala> val b = a.map(x=>(x.length,x))
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[54] at map at <console>:28

scala> b.groupByKey
res35: org.apache.spark.rdd.RDD[(Int, Iterable[String])] = ShuffledRDD[56] at groupByKey at <console>:31

scala> b.groupByKey.collect
res37: Array[(Int, Iterable[String])] = Array(
		(4,CompactBuffer(lion)), (3,CompactBuffer(dog, cat)), 
		(7,CompactBuffer(panther)), (5,CompactBuffer(tiger, eagle)))

keyBy: use the result of a function as the key

def keyBy[K](f: T => K): RDD[(K, T)]

Description: uses the return value of f as the key and pairs it with each element of the RDD, producing a pair RDD (RDD[(K, T)]).

scala> a.collect
res39: Array[String] = Array(dog, tiger, lion, cat, panther, eagle)

scala> a.keyBy(x=>x.head).collect
scala> a.keyBy(_.head).collect
// both forms produce the same result
res38: Array[(Char, String)] = Array((d,dog), (t,tiger), (l,lion), (c,cat), (p,panther), (e,eagle))

keys[Pair]

def keys: RDD[K]
Description: returns an RDD containing only the keys.

scala> val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
scala> val b = a.map(x => (x.length, x))

scala> b.keys.collect
res2: Array[Int] = Array(3, 5, 4, 3, 7, 5)

scala> val b = a.keyBy(_.head)
b: org.apache.spark.rdd.RDD[(Char, String)] = MapPartitionsRDD[63] at keyBy at <console>:26

scala> b.keys.collect
res46: Array[Char] = Array(d, t, l, c, p, e)

values[Pair]

def values: RDD[V]
Description: returns an RDD containing only the values.

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))

b.values.collect
res3: Array[String] = Array(dog, tiger, lion, cat, panther, eagle)

sortByKey[Pair]

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size): RDD[P]
Description: sorts the RDD by key; ascending defaults to true.

val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = sc.parallelize(1 to a.count.toInt, 2)
val c = a.zip(b)

c.sortByKey(true).collect
res74: Array[(String, Int)] = Array((ant,5), (cat,2), (dog,1), (gnu,4), (owl,3))

c.sortByKey(false).collect
res75: Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))

partitionBy[Pair]

def partitionBy(partitioner: Partitioner): RDD[(K, V)]
Description: repartitions the RDD according to the given Partitioner (a custom Partitioner sketch follows the example below).

	scala> val rdd = sc.parallelize(List((1,"a"),(2,"b"),(3,"c"),(4,"d")),2)
	rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[26] at parallelize at <console>:24

	scala> rdd.glom.collect
	res28: Array[Array[(Int, String)]] = Array(Array((1,a), (2,b)), Array((3,c), (4,d)))
	
	scala> val rdd1=rdd.partitionBy(new org.apache.spark.HashPartitioner(2))
	rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[28] at partitionBy at <console>:26

	scala> rdd1.glom.collect
	res29: Array[Array[(Int, String)]] = Array(Array((4,d), (2,b)), Array((1,a), (3,c)))
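Beyond HashPartitioner, any custom Partitioner can be passed to partitionBy. A minimal sketch that sends even integer keys to partition 0 and odd keys to partition 1 (the class name EvenOddPartitioner is illustrative):

import org.apache.spark.Partitioner

// Illustrative partitioner: even integer keys go to partition 0, odd keys to partition 1.
class EvenOddPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = key.asInstanceOf[Int] % 2
}

val rdd = sc.parallelize(List((1, "a"), (2, "b"), (3, "c"), (4, "d")), 2)
rdd.partitionBy(new EvenOddPartitioner).glom.collect
// e.g. Array(Array((2,b), (4,d)), Array((1,a), (3,c)))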

mapValues[Pair]

Takes the values of an RDD of two-component tuples and applies the provided function to transform each value.
It then forms new two-component tuples from the original key and the transformed value and stores them in a new RDD.
def mapValues[U](f: V => U): RDD[(K, U)]
Description: turns RDD[(K, V)] into RDD[(K, U)] by applying (f: V => U) to each value.

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))

b.mapValues("x" + _ + "x").collect
res5: Array[(Int, String)] = Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx), (5,xeaglex))

flatMapValues[Pair]

def flatMapValues[U](f: V => TraversableOnce[U]): RDD[(K, U)]
Description: like mapValues, but f returns a collection for each value; the results are flattened, with each element keeping the original key.

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))

b.flatMapValues("x" + _ + "x").collect
res6: Array[(Int, Char)] = Array(
	(3,x), (3,d), (3,o), (3,g), (3,x),
 	(5,x), (5,t), (5,i), (5,g), (5,e), (5,r), (5,x), 
 	(4,x), (4,l), (4,i), (4,o), (4,n), (4,x), 
 	(3,x), (3,c), (3,a), (3,t), (3,x),
  	(7,x), (7,p), (7,a), (7,n), (7,t), (7,h), (7,e), (7,r), (7,x),
   	(5,x), (5,e), (5,a), (5,g), (5,l), (5,e), (5,x)) 

subtractByKey[Pair]

def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)]
Description: removes the elements whose key also appears in the other RDD.

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant", "falcon", "squid"), 2)
val d = c.keyBy(_.length)

b.subtractByKey(d).collect
res15: Array[(Int, String)] = Array((4,lion))

combineByKey[Pair]

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
Description:
createCombiner: called the first time a key is encountered within a partition.
mergeValue: called when the same key is encountered again within that partition.
mergeCombiners: called to merge the combined values of the same key across different partitions.
(A per-key average sketch follows the examples below.)

With a single partition:
	scala> var rdd1 = sc.makeRDD(Array(("A",1),("A",2),("B",1),("B",2),("C",1)))  
	rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at makeRDD at <console>:24

	scala> rdd1.combineByKey(x=>x+"_",(x:String,y:Int)=>x+"@"+y,(x:String,y:String)=>x+"$"+y)
	res0: org.apache.spark.rdd.RDD[(String, String)] = ShuffledRDD[1] at combineByKey at <console>:27

	scala> res0.collect
	res1: Array[(String, String)] = Array((B,1_@2), (A,1_@2), (C,1_))
	
With two partitions:
    scala> val rdd2 = rdd1.repartition(2)
	rdd2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[5] at repartition at <console>:26

	scala> rdd2.partitions.size
	res2: Int = 2

	scala> rdd2.glom.collect
	res3: Array[Array[(String, Int)]] = Array(Array((A,1), (B,1), (C,1)), Array((A,2), (B,2)))

	scala> rdd2.combineByKey(x=>x+"_",(x:String,y:Int)=>x+"@"+y,(x:String,y:String)=>x+"$"+y)
	res4: org.apache.spark.rdd.RDD[(String, String)] = ShuffledRDD[7] at combineByKey at <console>:29

    scala> res4.collect
	res6: Array[(String, String)] = Array((B,1_$2_), (A,1_$2_), (C,1_))
       
With three partitions:
	scala> val rdd3 = rdd1.partitionBy(new org.apache.spark.HashPartitioner(3))
	rdd3: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at partitionBy at <console>:26

	scala> rdd3.partitions.size
	res7: Int = 3

	scala> rdd3.glom.collect
	res8: Array[Array[(String, Int)]] = Array(Array((B,1), (B,2)), Array((C,1)), Array((A,1), (A,2)))

	scala> rdd3.combineByKey(x=>x+"_",(x:String,y:Int)=>x+"@"+y,(x:String,y:String)=>x+"$"+y)
	res9: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[10] at combineByKey at <console>:29

	scala> res9.collect
	res10: Array[(String, String)] = Array((B,1_@2), (C,1_), (A,1_@2))

// Example: put animals with the same count into the same cage
val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)
val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)
d.collect
res16: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))
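A classic use of the three functions above is a per-key average, where the combiner carries a (sum, count) pair so the mean is only computed at the end; a minimal sketch:

// Per-key average with combineByKey; the combiner type C is (sum, count).
val scores = sc.parallelize(List(("a", 90), ("a", 70), ("b", 80), ("b", 100), ("b", 60)), 2)

val avg = scores.combineByKey(
  (v: Int)                       => (v, 1),                      // createCombiner: first value seen in a partition
  (acc: (Int, Int), v: Int)      => (acc._1 + v, acc._2 + 1),    // mergeValue: fold another value into the combiner
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)   // mergeCombiners: merge partial results across partitions
).mapValues { case (sum, cnt) => sum.toDouble / cnt }

avg.collect()   // Array((a,80.0), (b,80.0))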

foldByKey[Pair]

def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
Description: similar to reduceByKey (merges the values that share the same key), but as a curried function it first takes an initial zeroValue.

val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = a.map(x => (x.length, x))

b.foldByKey("")(_ + _).collect
res84: Array[(Int, String)] = Array((3,dogcatowlgnuant))

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))

b.foldByKey("")(_ + _).collect
res85: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))
