map(func):
Maps every element of the RDD to a new element via the function passed to map. Input and output partitions correspond one to one: however many input partitions there are, that is how many output partitions there will be.
flatMap(func):
Works much like map, but flattens all of the results together into a single collection.
Note: flatMap only flattens a String into an array of characters; it does not also flatten an Array[String] into an array of characters.
scala> var data=sc.textFile("/test/datas")
data: org.apache.spark.rdd.RDD[String] = /test/datas MapPartitionsRDD[7] at textFile at :24
scala> data.flatMap(x=>x.split(" ")).collect
res9: Array[String] = Array(hello, word, hello, spark, hello, hadoop)
scala> data.map(x=>x.split(" ")).collect
res10: Array[Array[String]] = Array(Array(hello, word), Array(hello, spark), Array(hello, hadoop))
...
scala> data.map(_.toUpperCase).collect
res32: Array[String] = Array(HELLO WORLD, HELLO SPARK, HELLO HIVE, HI SPARK)
scala> data.flatMap(_.toUpperCase).collect
res33: Array[Char] = Array(H, E, L, L, O, , W, O, R, L, D, H, E, L, L, O, , S, P, A, R, K, H, E, L, L, O, , H, I, V, E, H, I, , S, P, A, R, K)
distinct:
Removes duplicate elements from the RDD.
scala> var data=sc.textFile("/test/datas")
data: org.apache.spark.rdd.RDD[String] = /test/datas MapPartitionsRDD[1] at textFile at :24
scala> data.collect
res2: Array[String] = Array("hello word ", hello spark, hello hadoop)
scala> data.distinct.collect
res3: Array[String] = Array("hello word ", hello spark, hello hadoop)
scala> data.flatMap(_.split(" ")).distinct.collect
res4: Array[String] = Array(word, hello, spark, hadoop)
coalesce:
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]
This function repartitions the RDD, using a HashPartitioner.
The first parameter is the target number of partitions; the second indicates whether to shuffle and defaults to false.
If the requested number of partitions is greater than the current number, the second parameter must be set to true; otherwise the partition count stays unchanged.
scala> var data = sc.textFile("/tmp/lxw1234/1.txt")
data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[53] at textFile at :21
scala> data.collect
res37: Array[String] = Array(hello world, hello spark, hello hive, hi spark)
scala> data.partitions.size
res38: Int = 2 //the RDD data has two partitions by default
scala> var rdd1 = data.coalesce(1)
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[2] at coalesce at :23
scala> rdd1.partitions.size
res1: Int = 1 //rdd1 has one partition
scala> var rdd1 = data.coalesce(4)
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[3] at coalesce at :23
scala> rdd1.partitions.size
res2: Int = 2 //if the target number of partitions is greater than the current number, the shuffle parameter must be set to true; otherwise the partition count does not change
scala> var rdd1 = data.coalesce(4,true)
rdd1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at coalesce at :23
scala> rdd1.partitions.size
res3: Int = 4
repartition
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]
This function is simply coalesce with its second parameter (shuffle) set to true.
scala> var rdd2 = data.repartition(1)
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at repartition at :23
scala> rdd2.partitions.size
res4: Int = 1
scala> var rdd2 = data.repartition(4)
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[15] at repartition at :23
scala> rdd2.partitions.size
res5: Int = 4
randomSplit
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
This function splits one RDD into several RDDs according to the given weights.
The weights parameter is an Array[Double].
The second parameter is the random seed and can usually be left at its default.
scala> var rdd = sc.makeRDD(1 to 10,10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at makeRDD at :21
scala> rdd.collect
res6: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> var splitRDD = rdd.randomSplit(Array(1.0,2.0,3.0,4.0))
splitRDD: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[17] at randomSplit at :23,
MapPartitionsRDD[18] at randomSplit at :23,
MapPartitionsRDD[19] at randomSplit at :23,
MapPartitionsRDD[20] at randomSplit at :23)
//Note: the result of randomSplit is an array of RDDs
scala> splitRDD.size
res8: Int = 4
//Because the weights array passed as the first argument has 4 entries, the RDD is split into 4 RDDs.
//The elements of the original rdd are randomly assigned to these 4 RDDs according to the weights 1.0, 2.0, 3.0 and 4.0; an RDD with a higher weight receives elements with higher probability.
//Note: the weights do not have to add up to 1; they are normalized if they do not (here they sum to 10).
scala> splitRDD(0).collect
res10: Array[Int] = Array(1, 4)
scala> splitRDD(1).collect
res11: Array[Int] = Array(3)
scala> splitRDD(2).collect
res12: Array[Int] = Array(5, 9)
scala> splitRDD(3).collect
res13: Array[Int] = Array(2, 6, 7, 8, 10)
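Although the seed can usually be left alone, passing a fixed seed makes the split reproducible across runs. A minimal sketch (not from the original), reusing the rdd above:
//the same weights and the same seed always yield the same split of the same RDD
var splitRDD2 = rdd.randomSplit(Array(0.7, 0.3), seed = 11L)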
glom
def glom(): RDD[Array[T]]
This function turns the type-T elements of each partition of the RDD into an Array[T], so that each partition contributes exactly one array element.
scala> var rdd = sc.makeRDD(1 to 10,3)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[38] at makeRDD at :21
scala> rdd.partitions.size
res33: Int = 3 //this RDD has 3 partitions
scala> rdd.glom().collect
res35: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
//glom puts the elements of each partition into an array, so the result becomes 3 arrays
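As an extra illustration (not in the original), glom combines naturally with map to compute a per-partition aggregate; a sketch using the rdd above:
//sum the elements of each partition; with the partitioning shown above this yields Array(6, 15, 34)
rdd.glom().map(_.sum).collect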
union
def union(other: RDD[T]): RDD[T]
This function is straightforward: it merges two RDDs without removing duplicates.
scala> var rdd1 = sc.makeRDD(1 to 2,1)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at makeRDD at :21
scala> rdd1.collect
res42: Array[Int] = Array(1, 2)
scala> var rdd2 = sc.makeRDD(2 to 3,1)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at :21
scala> rdd2.collect
res43: Array[Int] = Array(2, 3)
scala> rdd1.union(rdd2).collect
res44: Array[Int] = Array(1, 2, 2, 3)
intersection
def intersection(other: RDD[T]): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
This function returns the intersection of the two RDDs, with duplicates removed.
The numPartitions parameter specifies the number of partitions of the returned RDD.
The partitioner parameter specifies the partitioning function.
scala> var rdd1 = sc.makeRDD(1 to 2,1)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at makeRDD at :21
scala> rdd1.collect
res42: Array[Int] = Array(1, 2)
scala> var rdd2 = sc.makeRDD(2 to 3,1)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at :21
scala> rdd2.collect
res43: Array[Int] = Array(2, 3)
scala> rdd1.intersection(rdd2).collect
res45: Array[Int] = Array(2)
scala> var rdd3 = rdd1.intersection(rdd2)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[59] at intersection at :25
scala> rdd3.partitions.size
res46: Int = 1
scala> var rdd3 = rdd1.intersection(rdd2,2)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[65] at intersection at :25
scala> rdd3.partitions.size
res47: Int = 2
subtract
def subtract(other: RDD[T]): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
This function is similar to intersection, but it returns the elements that appear in this RDD and not in otherRDD, without removing duplicates.
The parameters have the same meaning as in intersection.
scala> var rdd1 = sc.makeRDD(Seq(1,2,2,3))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[66] at makeRDD at :21
scala> rdd1.collect
res48: Array[Int] = Array(1, 2, 2, 3)
scala> var rdd2 = sc.makeRDD(3 to 4)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[67] at makeRDD at :21
scala> rdd2.collect
res49: Array[Int] = Array(3, 4)
scala> rdd1.subtract(rdd2).collect
res50: Array[Int] = Array(1, 2, 2)
mapPartitions
def mapPartitions[U](f: (Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]
This function is similar to map, except that the mapping function operates on an iterator over each partition of the RDD instead of on each individual element. When the mapping needs to create extra objects repeatedly, mapPartitions is considerably more efficient than map.
For example, when writing all the data of an RDD to a database over JDBC, using map could mean creating one connection per element, which is very expensive; with mapPartitions, only one connection per partition is needed.
The preservesPartitioning parameter indicates whether to keep the parent RDD's partitioner information.
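A minimal sketch of the one-connection-per-partition pattern described above, here using foreachPartition (a closely related action, since writing is a side effect); the rdd, JDBC URL, credentials and table name are hypothetical, not from the original:
import java.sql.DriverManager

rdd.foreachPartition { iter =>
  //one connection and one prepared statement per partition, reused for every element
  val conn = DriverManager.getConnection("jdbc:mysql://dbhost:3306/test", "user", "pass") //hypothetical URL/credentials
  val stmt = conn.prepareStatement("INSERT INTO t(value) VALUES (?)")                     //hypothetical table
  try {
    iter.foreach { x =>
      stmt.setString(1, x.toString)
      stmt.executeUpdate()
    }
  } finally {
    stmt.close()
    conn.close()
  }
}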
var rdd1 = sc.makeRDD(1 to 5,2)
//rdd1 has two partitions
scala> var rdd3 = rdd1.mapPartitions{ x => {
| var result = List[Int]()
| var i = 0
| while(x.hasNext){
| i += x.next()
| }
| result.::(i).iterator
| }}
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[84] at mapPartitions at :23
//rdd3 sums the values within each partition of rdd1
scala> rdd3.collect
res65: Array[Int] = Array(3, 12)
scala> rdd3.partitions.size
res66: Int = 2
mapPartitionsWithIndex
def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]
This works like mapPartitions, except that the mapping function takes two arguments, the first of which is the partition index.
var rdd1 = sc.makeRDD(1 to 5,2)
//rdd1 has two partitions
var rdd2 = rdd1.mapPartitionsWithIndex{
(x,iter) => {
var result = List[String]()
var i = 0
while(iter.hasNext){
i += iter.next()
}
result.::(x + "|" + i).iterator
}
}
//rdd2 sums the numbers in each partition of rdd1 and prefixes each partition's sum with the partition index
scala> rdd2.collect
res13: Array[String] = Array(0|3, 1|12)
zip
def zip[U](other: RDD[U])(implicit arg0: ClassTag[U]): RDD[(T, U)]
The zip function combines two RDDs into an RDD of key/value pairs. It requires both RDDs to have the same number of partitions and the same number of elements; otherwise an exception is thrown.
scala> var rdd1 = sc.makeRDD(1 to 10,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at :21
scala> var rdd1 = sc.makeRDD(1 to 5,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at :21
scala> var rdd2 = sc.makeRDD(Seq("A","B","C","D","E"),2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at makeRDD at :21
scala> rdd1.zip(rdd2).collect
res0: Array[(Int, String)] = Array((1,A), (2,B), (3,C), (4,D), (5,E))
scala> rdd2.zip(rdd1).collect
res1: Array[(String, Int)] = Array((A,1), (B,2), (C,3), (D,4), (E,5))
scala> var rdd3 = sc.makeRDD(Seq("A","B","C","D","E"),3)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[5] at makeRDD at :21
scala> rdd1.zip(rdd3).collect
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
//if the two RDDs have different numbers of partitions, an exception is thrown
zipPartitions
The zipPartitions function combines several RDDs partition by partition into a new RDD. The RDDs being combined must have the same number of partitions, but there is no requirement on the number of elements within each partition.
With a single other RDD as parameter:
def zipPartitions[B, V](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]
def zipPartitions[B, V](rdd2: RDD[B], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]
The only difference between these two overloads is the preservesPartitioning parameter, which indicates whether to keep the parent RDD's partitioner information.
The mapping function f takes the iterators of the two RDDs as its arguments.
scala> var rdd1 = sc.makeRDD(1 to 5,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[22] at makeRDD at :21
scala> var rdd2 = sc.makeRDD(Seq("A","B","C","D","E"),2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[23] at makeRDD at :21
//element distribution across rdd1's two partitions:
scala> rdd1.mapPartitionsWithIndex{
| (x,iter) => {
| var result = List[String]()
| while(iter.hasNext){
| result ::= ("part_" + x + "|" + iter.next())
| }
| result.iterator
|
| }
| }.collect
res17: Array[String] = Array(part_0|2, part_0|1, part_1|5, part_1|4, part_1|3)
//element distribution across rdd2's two partitions
scala> rdd2.mapPartitionsWithIndex{
| (x,iter) => {
| var result = List[String]()
| while(iter.hasNext){
| result ::= ("part_" + x + "|" + iter.next())
| }
| result.iterator
|
| }
| }.collect
res18: Array[String] = Array(part_0|B, part_0|A, part_1|E, part_1|D, part_1|C)
//zipPartitions on rdd1 and rdd2
scala> rdd1.zipPartitions(rdd2){
| (rdd1Iter,rdd2Iter) => {
| var result = List[String]()
| while(rdd1Iter.hasNext && rdd2Iter.hasNext) {
| result::=(rdd1Iter.next() + "_" + rdd2Iter.next())
| }
| result.iterator
| }
| }.collect
res19: Array[String] = Array(2_B, 1_A, 5_E, 4_D, 3_C)
Spark RDDs are partitioned. When an RDD is created, the number of partitions can usually be specified. If it is not, an RDD created from a collection defaults to the number of CPU cores allocated to the application, and an RDD created from an HDFS file defaults to the number of blocks of that file.
The RDD's mapPartitionsWithIndex method can be used to list the elements and their count in each partition.
A concrete example:
//Create an RDD; it defaults to 15 partitions because my spark-shell was given 15 CPU cores in total
//--total-executor-cores 15
scala> var rdd1 = sc.makeRDD(1 to 50)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at makeRDD at :21
scala> rdd1.partitions.size
res15: Int = 15
//count the number of elements in each partition of rdd1
rdd1.mapPartitionsWithIndex{
(partIdx,iter) => {
var part_map = scala.collection.mutable.Map[String,Int]()
while(iter.hasNext){
var part_name = "part_" + partIdx;
if(part_map.contains(part_name)) {
var ele_cnt = part_map(part_name)
part_map(part_name) = ele_cnt + 1
} else {
part_map(part_name) = 1
}
iter.next()
}
part_map.iterator
}
}.collect
res16: Array[(String, Int)] = Array((part_0,3), (part_1,3), (part_2,4), (part_3,3), (part_4,3), (part_5,4), (part_6,3),
(part_7,3), (part_8,4), (part_9,3), (part_10,3), (part_11,4), (part_12,3), (part_13,3), (part_14,4))
//the element count of each partition, from part_0 to part_14
//list the elements contained in each partition of rdd1
rdd1.mapPartitionsWithIndex{
(partIdx,iter) => {
var part_map = scala.collection.mutable.Map[String,List[Int]]()
while(iter.hasNext){
var part_name = "part_" + partIdx;
var elem = iter.next()
if(part_map.contains(part_name)) {
var elems = part_map(part_name)
elems ::= elem
part_map(part_name) = elems
} else {
part_map(part_name) = List[Int]{elem}
}
}
part_map.iterator
}
}.collect
res17: Array[(String, List[Int])] = Array((part_0,List(3, 2, 1)), (part_1,List(6, 5, 4)), (part_2,List(10, 9, 8, 7)), (part_3,List(13, 12, 11)),
(part_4,List(16, 15, 14)), (part_5,List(20, 19, 18, 17)), (part_6,List(23, 22, 21)), (part_7,List(26, 25, 24)), (part_8,List(30, 29, 28, 27)),
(part_9,List(33, 32, 31)), (part_10,List(36, 35, 34)), (part_11,List(40, 39, 38, 37)), (part_12,List(43, 42, 41)), (part_13,List(46, 45, 44)),
(part_14,List(50, 49, 48, 47)))
//the elements contained in each partition, from part_0 to part_14
mapValues
def mapValues[U](f: (V) => U): RDD[(K, U)]
Like map among the basic transformations, except that mapValues applies the function only to the V of each [K,V] pair.
scala> var rdd1 = sc.makeRDD(Array((1,"A"),(2,"B"),(3,"C"),(4,"D")),2)
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[27] at makeRDD at :21
scala> rdd1.mapValues(x => x + "_").collect
res26: Array[(Int, String)] = Array((1,A_), (2,B_), (3,C_), (4,D_))
flatMapValues
def flatMapValues[U](f: (V) => TraversableOnce[U]): RDD[(K, U)]
Like flatMap among the basic transformations, except that flatMapValues applies the function only to the V of each [K,V] pair.
scala> rdd1.flatMapValues(x => x + "_").collect
res36: Array[(Int, Char)] = Array((1,A), (1,_), (2,B), (2,_), (3,C), (3,_), (4,D), (4,_))
partitionBy
def partitionBy(partitioner: Partitioner): RDD[(K, V)]
This function generates a new ShuffledRDD according to the given partitioner, repartitioning the original RDD.
scala> var rdd1 = sc.makeRDD(Array((1,"A"),(2,"B"),(3,"C"),(4,"D")),2)
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[23] at makeRDD at :21
scala> rdd1.partitions.size
res20: Int = 2
//inspect the elements in each partition of rdd1
scala> rdd1.mapPartitionsWithIndex{
| (partIdx,iter) => {
| var part_map = scala.collection.mutable.Map[String,List[(Int,String)]]()
| while(iter.hasNext){
| var part_name = "part_" + partIdx;
| var elem = iter.next()
| if(part_map.contains(part_name)) {
| var elems = part_map(part_name)
| elems ::= elem
| part_map(part_name) = elems
| } else {
| part_map(part_name) = List[(Int,String)]{elem}
| }
| }
| part_map.iterator
|
| }
| }.collect
res22: Array[(String, List[(Int, String)])] = Array((part_0,List((2,B), (1,A))), (part_1,List((4,D), (3,C))))
//(2,B) and (1,A) are in part_0; (4,D) and (3,C) are in part_1
//repartition using partitionBy
scala> var rdd2 = rdd1.partitionBy(new org.apache.spark.HashPartitioner(2))
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[25] at partitionBy at :23
scala> rdd2.partitions.size
res23: Int = 2
//inspect the elements in each partition of rdd2
scala> rdd2.mapPartitionsWithIndex{
| (partIdx,iter) => {
| var part_map = scala.collection.mutable.Map[String,List[(Int,String)]]()
| while(iter.hasNext){
| var part_name = "part_" + partIdx;
| var elem = iter.next()
| if(part_map.contains(part_name)) {
| var elems = part_map(part_name)
| elems ::= elem
| part_map(part_name) = elems
| } else {
| part_map(part_name) = List[(Int,String)]{elem}
| }
| }
| part_map.iterator
| }
| }.collect
res24: Array[(String, List[(Int, String)])] = Array((part_0,List((4,D), (2,B))), (part_1,List((3,C), (1,A))))
//(4,D) and (2,B) are in part_0; (3,C) and (1,A) are in part_1
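Besides the built-in HashPartitioner, partitionBy accepts any Partitioner. A minimal sketch (not from the original) of a custom one that sends even Int keys to partition 0 and odd keys to partition 1:
import org.apache.spark.Partitioner

//hypothetical partitioner: even keys go to partition 0, odd keys to partition 1
class EvenOddPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = key.asInstanceOf[Int] % 2
}

//usage with the rdd1 of (Int, String) pairs from above
var rdd4 = rdd1.partitionBy(new EvenOddPartitioner)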
combineByKey
def combineByKey[C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
def combineByKey[C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]
def combineByKey[C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializer: Serializer = null): RDD[(K, C)]
This function converts an RDD[K,V] into an RDD[K,C]; the types V and C may be the same or different.
The parameters:
createCombiner: the combiner-creation function, which converts a V into a C. Its input is a V from the RDD[K,V] and its output is a C.
mergeValue: the value-merging function, which merges a C and a V into a C. Its input is (C,V) and its output is a C.
mergeCombiners: the combiner-merging function, which merges two Cs into one C. Its input is (C,C) and its output is a C.
numPartitions: the number of partitions of the result RDD; by default the existing number of partitions is kept.
partitioner: the partitioning function; defaults to HashPartitioner.
mapSideCombine: whether to combine on the map side, similar to the combiner in MapReduce; defaults to true.
An example:
scala> var rdd1 = sc.makeRDD(Array(("A",1),("A",2),("B",1),("B",2),("C",1)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[64] at makeRDD at :21
scala> rdd1.combineByKey(
| (v : Int) => v + "_",
| (c : String, v : Int) => c + "@" + v,
| (c1 : String, c2 : String) => c1 + "$" + c2
| ).collect
res60: Array[(String, String)] = Array((A,2_$1_), (B,1_$2_), (C,1_))
The three mapping functions are:
createCombiner: (V) => C
(v : Int) => v + "_"   //append "_" after each V, returning a C (String)
mergeValue: (C, V) => C
(c : String, v : Int) => c + "@" + v   //merge a C and a V with "@" in between, returning a C (String)
mergeCombiners: (C, C) => C
(c1 : String, c2 : String) => c1 + "$" + c2   //merge two Cs with "$" in between, returning a C (String)
The other parameters keep their default values.
In the end, the RDD[String,Int] is converted into an RDD[String,String].
Another example:
rdd1.combineByKey(
(v : Int) => List(v),
(c : List[Int], v : Int) => v :: c,
(c1 : List[Int], c2 : List[Int]) => c1 ::: c2
).collect
res65: Array[(String, List[Int])] = Array((A,List(2, 1)), (B,List(2, 1)), (C,List(1)))
In the end, the RDD[String,Int] is converted into an RDD[String,List[Int]].
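A further sketch (not from the original) of a typical combineByKey use: computing the per-key average by carrying a (sum, count) pair as the C type, again on rdd1:
rdd1.combineByKey(
  (v : Int) => (v, 1),                                                  //createCombiner: V -> (sum, count)
  (c : (Int, Int), v : Int) => (c._1 + v, c._2 + 1),                    //mergeValue: fold one more V into (sum, count)
  (c1 : (Int, Int), c2 : (Int, Int)) => (c1._1 + c2._1, c1._2 + c2._2)  //mergeCombiners: merge two partial (sum, count) pairs
).mapValues { case (sum, cnt) => sum.toDouble / cnt }.collect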
foldByKey
def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)]
This function folds the V values of an RDD[K,V] by key K. The zeroValue parameter means that each V is first initialized by applying the folding function to zeroValue and V (zeroValue is applied once per key within each partition), and the folding function is then applied across the initialized values.
A direct example:
scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
scala> rdd1.foldByKey(0)(_+_).collect
res75: Array[(String, Int)] = Array((A,2), (B,3), (C,1))
//foldByKey sums the V values for each key in rdd1. Note zeroValue=0: each V is first initialized with the + function.
//For example, for ("A",0) and ("A",2), applying zeroValue to each V gives ("A",0+0) and ("A",2+0), i.e.
//("A",0) and ("A",2); the folding function is then applied to the initialized values, finally giving (A,0+2), i.e. (A,2).
Next:
scala> rdd1.foldByKey(2)(_+_).collect
res76: Array[(String, Int)] = Array((A,6), (B,7), (C,3))
//zeroValue=2 is first applied to each V, giving ("A",0+2) and ("A",2+2), i.e. ("A",2) and ("A",4); the folding
//function is then applied to the initialized values, finally giving (A,2+4), i.e. (A,6).
Now look at multiplication:
scala> rdd1.foldByKey(0)(_*_).collect
res77: Array[(String, Int)] = Array((A,0), (B,0), (C,0))
//zeroValue=0 is first applied to each V; note that the folding function is now multiplication, giving ("A",0*0) and ("A",2*0),
//i.e. ("A",0) and ("A",0); the folding function is then applied to the initialized values, finally giving (A,0*0), i.e. (A,0).
//The other keys work the same way, so every V ends up as 0.
scala> rdd1.foldByKey(1)(_*_).collect
res78: Array[(String, Int)] = Array((A,0), (B,2), (C,1))
//When the folding function is multiplication, zeroValue must be set to 1 to get the expected result.
When using foldByKey, pay close attention to the folding function and the choice of zeroValue.
groupByKey
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
This function collects the V values for each key K of an RDD[K,V] into a single Iterable[V].
The numPartitions parameter specifies the number of partitions;
the partitioner parameter specifies the partitioning function.
scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[89] at makeRDD at :21
scala> rdd1.groupByKey().collect
res81: Array[(String, Iterable[Int])] = Array((A,CompactBuffer(0, 2)), (B,CompactBuffer(2, 1)), (C,CompactBuffer(1)))
reduceByKey
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
This function combines the V values for each key K of an RDD[K,V] using the given function.
The numPartitions parameter specifies the number of partitions;
the partitioner parameter specifies the partitioning function.
scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[91] at makeRDD at :21
scala> rdd1.partitions.size
res82: Int = 15
scala> var rdd2 = rdd1.reduceByKey((x,y) => x + y)
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[94] at reduceByKey at :23
scala> rdd2.collect
res85: Array[(String, Int)] = Array((A,2), (B,3), (C,1))
scala> rdd2.partitions.size
res86: Int = 15
scala> var rdd2 = rdd1.reduceByKey(new org.apache.spark.HashPartitioner(2),(x,y) => x + y)
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[95] at reduceByKey at :23
scala> rdd2.collect
res87: Array[(String, Int)] = Array((B,3), (A,2), (C,1))
scala> rdd2.partitions.size
res88: Int = 2
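Putting reduceByKey together with the flatMap and map shown earlier gives the classic word count; a sketch using the /test/datas file from the opening examples:
//split lines into words, pair each word with 1, then sum the counts per word
sc.textFile("/test/datas")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect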
cogroup
##With one other RDD as parameter
def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W]))]
##With two other RDDs as parameters
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
##With three other RDDs as parameters
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
cogroup behaves like a full outer join in SQL: it returns the records of both the left and right RDDs, with an empty collection on the side where a key has no match.
The numPartitions parameter specifies the number of partitions of the result.
The partitioner parameter specifies the partitioning function.
##Example with one other RDD
var rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
var rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
scala> var rdd3 = rdd1.cogroup(rdd2)
rdd3: org.apache.spark.rdd.RDD[(String, (Iterable[String], Iterable[String]))] = MapPartitionsRDD[12] at cogroup at :25
scala> rdd3.partitions.size
res3: Int = 2
scala> rdd3.collect
res1: Array[(String, (Iterable[String], Iterable[String]))] = Array(
(B,(CompactBuffer(2),CompactBuffer())),
(D,(CompactBuffer(),CompactBuffer(d))),
(A,(CompactBuffer(1),CompactBuffer(a))),
(C,(CompactBuffer(3),CompactBuffer(c)))
)
scala> var rdd4 = rdd1.cogroup(rdd2,3)
rdd4: org.apache.spark.rdd.RDD[(String, (Iterable[String], Iterable[String]))] = MapPartitionsRDD[14] at cogroup at :25
scala> rdd4.partitions.size
res5: Int = 3
scala> rdd4.collect
res6: Array[(String, (Iterable[String], Iterable[String]))] = Array(
(B,(CompactBuffer(2),CompactBuffer())),
(C,(CompactBuffer(3),CompactBuffer(c))),
(A,(CompactBuffer(1),CompactBuffer(a))),
(D,(CompactBuffer(),CompactBuffer(d))))
##Example with two other RDDs
var rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
var rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
var rdd3 = sc.makeRDD(Array(("A","A"),("E","E")),2)
scala> var rdd4 = rdd1.cogroup(rdd2,rdd3)
rdd4: org.apache.spark.rdd.RDD[(String, (Iterable[String], Iterable[String], Iterable[String]))] =
MapPartitionsRDD[17] at cogroup at :27
scala> rdd4.partitions.size
res7: Int = 2
scala> rdd4.collect
res9: Array[(String, (Iterable[String], Iterable[String], Iterable[String]))] = Array(
(B,(CompactBuffer(2),CompactBuffer(),CompactBuffer())),
(D,(CompactBuffer(),CompactBuffer(d),CompactBuffer())),
(A,(CompactBuffer(1),CompactBuffer(a),CompactBuffer(A))),
(C,(CompactBuffer(3),CompactBuffer(c),CompactBuffer())),
(E,(CompactBuffer(),CompactBuffer(),CompactBuffer(E))))
##The example with three other RDDs is omitted; it works the same way as above.
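As an extra sketch (not from the original), the full-outer-join behavior described above can be made explicit by flattening the cogroup result into Option pairs, using the rdd1 and rdd2 defined above:
//turn (Iterable[V], Iterable[W]) groups into full-outer-join style (Option[V], Option[W]) pairs
val fullOuter = rdd1.cogroup(rdd2).flatMap { case (k, (vs, ws)) =>
  val pairs: Iterable[(Option[String], Option[String])] =
    if (ws.isEmpty)      vs.map(v => (Some(v), None))
    else if (vs.isEmpty) ws.map(w => (None, Some(w)))
    else for (v <- vs; w <- ws) yield (Some(v), Some(w))
  pairs.map(p => (k, p))
}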
join
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
join behaves like an inner join in SQL: it returns only the results where the two RDDs match on K. join works on exactly two RDDs; to relate more RDDs, simply join repeatedly (see the sketch at the end of this section).
The numPartitions parameter specifies the number of partitions of the result.
The partitioner parameter specifies the partitioning function.
var rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
var rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
scala> rdd1.join(rdd2).collect
res10: Array[(String, (String, String))] = Array((A,(1,a)), (C,(3,c)))
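For the "join repeatedly" remark above, a minimal sketch (rdd3 here is hypothetical, not from the original) chaining two joins; note how the value type nests:
//a hypothetical third RDD keyed the same way
var rdd3 = sc.makeRDD(Array(("A","x"),("C","y")),2)

//chaining joins gives RDD[(String, ((String, String), String))]
rdd1.join(rdd2).join(rdd3).collect
//only keys present in all three RDDs remain, e.g. (A,((1,a),x)) and (C,((3,c),y))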