combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)
Definition:
def combineByKey[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializer: Serializer = null): RDD[(K, C)] = self.withScope {}
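From the definition we can see that the function returns an RDD[(K, C)], where C is the type constructed and returned by createCombiner. The official explanation: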
* Generic function to combine the elements for each key using a custom set of aggregation
* functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
*
* Users provide three functions:
*
* - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
* - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
* - `mergeCombiners`, to combine two C's into a single one.
*
* In addition, users can control the partitioning of the output RDD, and whether to perform
* map-side aggregation (if a mapper can produce multiple items with the same key).
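In plainer terms:
The purpose of combineByKey is: Combine values with the same key using a different result type.
createCombiner builds a new value of type C from a value; C is also the value type of the RDD returned by combineByKey (the key type is unchanged).
mergeValue merges a value with the same key into a C, which acts as an accumulator here (within a single partition).
mergeCombiners merges two Cs into one C (across partitions).
An example (textRDD holds (String, String) pairs):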
scala> val textRDD = sc.parallelize(List(("A", "aa"), ("B","bb"),("C","cc"),("C","cc"), ("D","dd"), ("D","dd")))
textRDD: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at :24
scala> val combinedRDD = textRDD.combineByKey(
| value => (1, value),
| (c:(Int, String), value) => (c._1+1, c._2),
| (c1:(Int, String), c2:(Int, String)) => (c1._1+c2._1, c1._2)
| )
combinedRDD: org.apache.spark.rdd.RDD[(String, (Int, String))] = ShuffledRDD[1] at combineByKey at :26
scala>
scala> combinedRDD.collect.foreach(x=>{
| println(x._1+","+x._2._1+","+x._2._2)
| })
D,2,dd
A,1,aa
B,1,bb
C,2,cc
Second example:
scala> val textRDD = sc.parallelize(List(("A", "aa"), ("B","bb"),("C","cc"),("C","cc"), ("D","dd"), ("D","dd")))
textRDD: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at :24
scala> val combinedRDD2 = textRDD.combineByKey(
| value => 1,
| (c:Int, value:String) => (c+1),
| (c1:Int, c2:Int) => (c1+c2)
| )
combinedRDD2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[2] at combineByKey at :26
scala> combinedRDD2.collect.foreach(x=>{
| println(x._1+","+x._2)
| })
D,2
A,1
B,1
C,2
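Both calls above do the same job, counting how many times each key occurs; only the result types differ. The first yields elements of type (String, (Int, String)), the second (String, Int).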
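To make the division of labour between the three functions concrete, here is a minimal sketch in plain Scala collections. It is a model for illustration, not Spark's actual implementation: the CombineByKeySketch object, its combineByKey helper and the sample partitions are all made up, and each inner Seq stands in for one partition. It builds a (count, sum) pair per key, so the combined type C differs from the value type V.

```scala
object CombineByKeySketch {
  // Plain-Scala model of combineByKey; each inner Seq plays the role of one partition.
  def combineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Within a partition: createCombiner for the first value of a key, mergeValue afterwards.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          case None    => acc.updated(k, createCombiner(v))
          case Some(c) => acc.updated(k, mergeValue(c, v))
        }
      }
    }
    // Across partitions: mergeCombiners joins the partial results of the same key.
    perPartition.foldLeft(Map.empty[K, C]) { (merged, partial) =>
      partial.foldLeft(merged) { case (acc, (k, c)) =>
        acc.updated(k, acc.get(k).map(mergeCombiners(_, c)).getOrElse(c))
      }
    }
  }

  // Hypothetical data: two partitions of (key, score) pairs.
  val partitions: Seq[Seq[(String, Int)]] =
    Seq(Seq(("A", 1), ("C", 2), ("C", 4)), Seq(("C", 6), ("D", 5)))

  // C = (count, sum), while V = Int.
  val countAndSum: Map[String, (Int, Int)] = combineByKey(partitions)(
    (v: Int) => (1, v),
    (c: (Int, Int), v: Int) => (c._1 + 1, c._2 + v),
    (c1: (Int, Int), c2: (Int, Int)) => (c1._1 + c2._1, c1._2 + c2._2))
}
```

In the Scala REPL, CombineByKeySketch.countAndSum maps C to (3, 12): mergeValue produced (2, 6) inside the first partition, createCombiner produced (1, 6) in the second, and mergeCombiners added the two.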
aggregate
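aggregate is used to aggregate the elements of an RDD. It first uses seqOp to fold the T-typed elements of each partition into a U, then uses combOp to merge the per-partition U results into a single U. Note that both seqOp and combOp use zeroValue, whose type is U. The parameters are similar to those of combineByKey. Keep in mind that aggregate first computes within each partition and then across partitions.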
/**
* Aggregate the elements of each partition, and then the results for all the partitions, using
* given combine functions and a neutral "zero value". This function can return a different result
* type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U
* and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are
* allowed to modify and return their first argument instead of creating a new U to avoid memory
* allocation.
*
* @param zeroValue the initial value for the accumulated result of each partition for the
* `seqOp` operator, and also the initial value for the combine results from
* different partitions for the `combOp` operator - this will typically be the
* neutral element (e.g. `Nil` for list concatenation or `0` for summation)
* @param seqOp an operator used to accumulate results within a partition
* @param combOp an associative operator used to combine results from different partitions
*/
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = withScope {
// Clone the zero value since we will also be serializing it as part of tasks
var jobResult = Utils.clone(zeroValue, sc.env.serializer.newInstance())
val cleanSeqOp = sc.clean(seqOp)
val cleanCombOp = sc.clean(combOp)
val aggregatePartition = (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult)
sc.runJob(this, aggregatePartition, mergeResult)
jobResult
}
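Example: enter the following code in the Spark shell. Note that the initial value here is a tuple, which is also the output type of aggregate. The call counts the letters and concatenates them at the same time. The order of the letters and the extra colons in the result depend on how the elements are spread over the partitions.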
scala> val textRDD = sc.parallelize(List("A", "B", "C", "D", "D", "E"))
textRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at :24
scala> val resultRDD = textRDD.aggregate((0, ""))((acc, value)=>{(acc._1+1, acc._2+":"+value)}, (acc1, acc2)=> {(acc1._1+acc2._1, acc1._2+":"+acc2._2)})
resultRDD: (Int, String) = (6,::D:E::D::A::B:C)
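Second example: the initial value is 20000, an Int, so the output type is also Int. The call adds the ASCII codes of all the letters (403 in total) to the initial value. Because zeroValue is applied once in every partition and once more in the final merge, it is counted several times: the result 100403 = 403 + 5 × 20000 is consistent with an RDD of 4 partitions.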
scala> val textRDD = sc.parallelize(List('A', 'B', 'C', 'D', 'D', 'E'))
textRDD: org.apache.spark.rdd.RDD[Char] = ParallelCollectionRDD[4] at parallelize at :24
scala> val resultRDD2 = textRDD.aggregate[Int](20000)((acc, cha) => {acc+cha}, (acc1, acc2)=>{acc1+acc2})
resultRDD2: Int = 100403
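The way zeroValue enters the result can be checked with a small plain-Scala model. This is only a sketch under stated assumptions: AggregateSketch and its aggregate helper are hypothetical, each inner Seq stands in for one partition, and the split of the six letters into four partitions is a guess chosen to match the output above. Real Spark also merges partition results in whatever order the tasks finish.

```scala
object AggregateSketch {
  // Plain-Scala model of aggregate; each inner Seq plays the role of one partition.
  def aggregate[T, U](partitions: Seq[Seq[T]])(zeroValue: U)(
      seqOp: (U, T) => U,
      combOp: (U, U) => U): U = {
    // seqOp folds every partition, starting from zeroValue each time.
    val perPartition: Seq[U] = partitions.map(_.foldLeft(zeroValue)(seqOp))
    // combOp merges the partition results, again starting from zeroValue.
    perPartition.foldLeft(zeroValue)(combOp)
  }

  // Assumed layout: the six letters spread over four partitions.
  val partitions: Seq[Seq[Char]] =
    Seq(Seq('A'), Seq('B', 'C'), Seq('D'), Seq('D', 'E'))

  // With a neutral zero the result is just the sum of the character codes.
  val sumFrom0: Int =
    aggregate(partitions)(0)((acc: Int, c: Char) => acc + c, (a: Int, b: Int) => a + b)

  // With zero = 20000 the zero is counted 4 + 1 times.
  val sumFrom20000: Int =
    aggregate(partitions)(20000)((acc: Int, c: Char) => acc + c, (a: Int, b: Int) => a + b)
}
```

Here AggregateSketch.sumFrom0 is 403 and AggregateSketch.sumFrom20000 is 100403, which is why a neutral element such as 0 or Nil is the usual choice for zeroValue.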
collect()
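Returns all the elements of the RDD. Note that this method brings the data of every partition back to the driver, so if the data set is large (more than a single node can hold), calling it may cause problems.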
countByValue()
The method is defined as:
def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long] = withScope {
map(value => (value, null)).countByKey()
}
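The RDD it is called on does not have to be a pair RDD. It returns a Map whose keys are the distinct elements and whose values, of type Long, are the number of times each element occurs.
An example: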
scala> val textRDD = sc.parallelize(List('A', 'B', 'C', 'D', 'D', 'E'))
textRDD: org.apache.spark.rdd.RDD[Char] = ParallelCollectionRDD[4] at parallelize at :24
scala> textRDD.countByValue()
res7: scala.collection.Map[Char,Long] = Map(E -> 1, A -> 1, B -> 1, C -> 1, D -> 2)
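The same counting can be sketched on a plain Scala Seq, which may help when reasoning about the result without a cluster. CountByValueSketch and its countByValue helper are hypothetical names for this illustration; the real method pairs every element with null and delegates to countByKey, as the definition above shows.

```scala
object CountByValueSketch {
  // Group equal elements together, then count the size of each group.
  def countByValue[T](elems: Seq[T]): Map[T, Long] =
    elems.groupBy(identity).map { case (value, occurrences) =>
      (value, occurrences.size.toLong)
    }

  val counts: Map[Char, Long] = countByValue(Seq('A', 'B', 'C', 'D', 'D', 'E'))
}
```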
mapValues(func)
Description: Apply a function to each value of a pair RDD without changing the key.
Example (rdd is {(1, 2), (3, 4), (3, 6)}): rdd.mapValues(x => x+1)
Result: {(1, 3), (3, 5), (3, 7)}
flatMapValues(func)
Definition:
/**
* Pass each value in the key-value pair RDD through a flatMap function without changing the
* keys; this also retains the original RDD's partitioning.
*/
def flatMapValues[U](f: V => TraversableOnce[U]): RDD[(K, U)] = self.withScope {}
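From the definition we can see that both the input and the result of flatMapValues are key-value RDDs with the same key type K; only the value type may change, from V to U. The function takes a single argument, which is itself a function (call it method). method takes one parameter and returns a TraversableOnce[U]. What is TraversableOnce[U]? The official explanation is quoted below; put simply, TraversableOnce is an interface for collections that can be traversed (iterated over).
A template trait for collections which can be traversed either once only or one or more times.
flatMapValues passes each value of a key-value RDD to a function that returns a TraversableOnce. The key stays unchanged, and every element produced by traversing the returned collection becomes a value paired with that same key.
Example:
rdd is {(1, 2), (3, 4), (3, 6)}
rdd.flatMapValues(x => (x to 5))
Here x is the value of each pair in rdd, i.e. 2, 4 and 6. Since 6 to 5 is an empty range, the pair (3, 6) contributes nothing. Result:
{(1, 2), (1, 3), (1, 4), (1, 5), (3, 4), (3, 5)}
Another example: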
val a = sc.parallelize(List((1,2),(3,4),(5,6)))
val b = a.flatMapValues(x=>1 to x)
b.collect.foreach(println(_))
/*
(1,1)
(1,2)
(3,1)
(3,2)
(3,3)
(3,4)
(5,1)
(5,2)
(5,3)
(5,4)
(5,5)
(5,6)
*/
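The behaviour of flatMapValues can be modelled on an ordinary Seq of pairs, as a rough sketch rather than Spark's code. FlatMapValuesSketch and its flatMapValues helper are hypothetical, and Iterable is used in place of TraversableOnce to keep the snippet independent of the Scala version.

```scala
object FlatMapValuesSketch {
  // Expand every value into zero or more values, repeating the original key for each.
  def flatMapValues[K, V, U](pairs: Seq[(K, V)])(f: V => Iterable[U]): Seq[(K, U)] =
    pairs.flatMap { case (k, v) => f(v).map(u => (k, u)) }

  val rdd: Seq[(Int, Int)] = Seq((1, 2), (3, 4), (3, 6))

  // Same function as in the first example above: x to 5.
  val expanded: Seq[(Int, Int)] = flatMapValues(rdd)((x: Int) => x to 5)
}
```

FlatMapValuesSketch.expanded contains six pairs; the input pair (3, 6) disappears because 6 to 5 yields no elements.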
fold(zero)(func)
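This method works like reduce, except that fold takes a "zero" value as a parameter. It first computes within each partition and then across partitions. The "zero" value is folded in once for every partition and once more when the partition results are merged, so the more partitions the data is stored in, the more times it is included.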
scala> val textRDD = sc.parallelize(List("A", "B", "C", "D", "D", "E"))
textRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[9] at parallelize at :24
scala> textRDD.reduce((a, b)=> (a+b))
res11: String = DBCADE
scala> textRDD.fold("")((a, b)=>(a+b))
res12: String = BCDEDA
scala> var rdd = sc.parallelize(1 to 10, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at :24
scala> rdd.fold(0)((a,b)=>(a+b))
res36: Int = 55
scala> rdd.partitions.length
res38: Int = 2
scala> rdd.fold(1)((a,b)=>(a+b))
res37: Int = 58
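The second example above has two partitions in total, so why is the result 58 (55+3) rather than 57? Because partition 1 and partition 2 each include the zero value once, and merging the results of the two partitions includes it one more time.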
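The arithmetic behind 58 can be reproduced with a few lines of plain Scala. This is a sketch under assumptions: FoldSketch and its fold helper are hypothetical, and the split of 1 to 10 into the halves 1..5 and 6..10 is assumed to mirror the two partitions in the transcript.

```scala
object FoldSketch {
  // Plain-Scala model of fold; each inner Seq plays the role of one partition.
  def fold[T](partitions: Seq[Seq[T]])(zero: T)(op: (T, T) => T): T = {
    // The zero value starts the fold inside every partition ...
    val perPartition: Seq[T] = partitions.map(_.foldLeft(zero)(op))
    // ... and starts the merge of the partition results once more.
    perPartition.foldLeft(zero)(op)
  }

  // Assumed layout of sc.parallelize(1 to 10, 2).
  val twoPartitions: Seq[Seq[Int]] = Seq((1 to 5).toList, (6 to 10).toList)

  val sumWithZero0: Int = fold(twoPartitions)(0)(_ + _)
  val sumWithZero1: Int = fold(twoPartitions)(1)(_ + _)
}
```

With zero = 1 the partitions give 16 and 41, and the final merge gives 1 + 16 + 41 = 58; with zero = 0 the result stays 55.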
mapValues(func)
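This function is applied to the values of a key-value RDD while the keys stay unchanged. The result is still a key-value RDD; only the values (and possibly their type) may differ from the original ones.
The following example adds 1 to every value of the RDD.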
scala> val textRDD = sc.parallelize(List((1, 3), (3, 5), (3, 7)))
textRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[10] at parallelize at :24
scala> val mappedRDD = textRDD.mapValues(value => {value+1})
mappedRDD: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[11] at mapValues at :26
scala> mappedRDD.collect.foreach(println)
(1,4)
(3,6)
(3,8)
keys()
Description: Return an RDD of just the keys.
Example:
rdd.keys
Result:
{1, 3, 3}
values()
Description: Return an RDD of just the values.
Example:
rdd.values
Result:
{2, 4, 6}
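For readers without a Spark shell at hand, mapValues, keys and values can be imitated on a plain Seq of pairs. The snippet below is only an analogy using standard Scala collections; the PairBasicsSketch object is a made-up name, and rdd is the sample data {(1, 2), (3, 4), (3, 6)} used in the descriptions above.

```scala
object PairBasicsSketch {
  val rdd: Seq[(Int, Int)] = Seq((1, 2), (3, 4), (3, 6))

  // mapValues(x => x + 1): transform the value, keep the key.
  val plusOne: Seq[(Int, Int)] = rdd.map { case (k, v) => (k, v + 1) }

  // keys: only the first component of every pair (duplicates are kept).
  val keys: Seq[Int] = rdd.map(_._1)

  // values: only the second component of every pair.
  val values: Seq[Int] = rdd.map(_._2)
}
```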
groupByKey()
Description:
Group values with the same key.
Example:
rdd.groupByKey()
Input:
{(1, 2), (3, 4), (3, 6)}
Result:
{(1, [2]), (3, [4, 6])}
scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)))
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[3] at parallelize at :24
scala> val groupRDD = rdd.groupByKey
groupRDD: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[4] at groupByKey at :26
scala> groupRDD.collect.foreach(print)
(1,CompactBuffer(2))(3,CompactBuffer(4, 6))
The element type of groupRDD above is (Int, Iterable[Int]).
reduceByKey(func)
Purpose: applies to a key-value RDD and combines the values that share the same key.
An example: concatenate the values that have the same key, separated by semicolons.
scala> val textRDD = sc.parallelize(List(("A", "aa"), ("B","bb"),("C","cc"),("C","cc"), ("D","dd"), ("D","dd")))
textRDD: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[7] at parallelize at :24
scala> val reducedRDD = textRDD.reduceByKey((value1,value2) => {value1+";"+value2})
reducedRDD: org.apache.spark.rdd.RDD[(String, String)] = ShuffledRDD[9] at reduceByKey at :26
scala> reducedRDD.collect.foreach(println)
(D,dd;dd)
(A,aa)
(B,bb)
(C,cc;cc)
scala> sc.parallelize(List((1,2),(3,4),(3,6)))
res0: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at :25
scala> res0.reduceByKey(_+_)
res1: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[1] at reduceByKey at :27
scala> res1.collect.foreach(println)
(1,2)
(3,10)
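The relationship between groupByKey and reduceByKey can be sketched with plain Scala collections: reducing by key gives the same result as grouping first and then reducing each group. ReduceByKeySketch and its two helpers are hypothetical names, and the sketch models only the result; Spark's reduceByKey combines values inside each partition before shuffling instead of materialising the groups.

```scala
object ReduceByKeySketch {
  // Collect all values that share a key, keeping their original order.
  def groupByKey[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // Reduce the values of every key to a single value with f.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    groupByKey(pairs).map { case (k, vs) => (k, vs.reduce(f)) }

  val pairs: Seq[(Int, Int)] = Seq((1, 2), (3, 4), (3, 6))

  val grouped: Map[Int, Seq[Int]] = groupByKey(pairs)
  val summed: Map[Int, Int] = reduceByKey(pairs)(_ + _)
}
```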
sortByKey()
Description: Return an RDD sorted by the key.
Example:
rdd.sortByKey()
Result:
{(1, 2), (3, 4), (3, 6)}
reduce(func)
The function is defined as:
/**
* Reduces the elements of this RDD using the specified commutative and
* associative binary operator.
*/
def reduce(f: (T, T) => T): T = withScope {}
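Its parameter is a function (methodA) that takes two values of the same type and returns a single value of that type, which is exactly why the operation is called "reduce". Note that the return type of reduce is the same as the parameter type of methodA.
Let's run an example: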
scala> val textRDD = sc.parallelize(List("A", "B", "C", "D", "D", "E"))
textRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[9] at parallelize at :24
scala> textRDD.reduce((a, b)=> (a+b))
res11: String = DBCADE
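The result above is DBCADE rather than ABCDDE because string concatenation is associative but not commutative, while the Scaladoc asks for an operator that is both. The following plain-Scala sketch shows the effect; ReduceOrderSketch, its reduce helper and the partition layout are assumptions for illustration, with the order of the outer Seq standing in for the order in which partition results arrive.

```scala
object ReduceOrderSketch {
  // Reduce every non-empty partition, then reduce the partial results in arrival order.
  def reduce[T](partitions: Seq[Seq[T]])(f: (T, T) => T): T =
    partitions.filter(_.nonEmpty).map(_.reduce(f)).reduce(f)

  // Hypothetical layout of the six letters over three partitions.
  val letters: Seq[Seq[String]] = Seq(Seq("A", "B"), Seq("C", "D"), Seq("D", "E"))

  // Concatenation depends on the arrival order of the partitions.
  val inOrder: String = reduce(letters)(_ + _)
  val reversedArrival: String = reduce(letters.reverse)(_ + _)

  // Addition is commutative, so the arrival order does not matter.
  val numbers: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3, 4))
  val sumInOrder: Int = reduce(numbers)(_ + _)
  val sumReversed: Int = reduce(numbers.reverse)(_ + _)
}
```

Here inOrder is ABCDDE and reversedArrival is DECDAB, while both sums are 10.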
subtractByKey
Definition:
def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)] = self.withScope {}
Purpose: Return an RDD with the pairs from this whose keys are not in other.
scala> val textRDD = sc.parallelize(List((1, 3), (3, 5), (3, 7)))
textRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[12] at parallelize at :24
scala> val textRDD2 = sc.parallelize(List((3,9)))
textRDD2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[13] at parallelize at :24
scala> val subtractRDD = textRDD.subtractByKey(textRDD2)
subtractRDD: org.apache.spark.rdd.RDD[(Int, Int)] = SubtractedRDD[18] at subtractByKey at :28
scala> subtractRDD.collect.foreach(println)
(1,3)
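In plain Scala the same idea is a filter on the key set of the other collection. SubtractByKeySketch and its helper are hypothetical names; note that only the keys of the second collection matter, its values (type W) are ignored.

```scala
object SubtractByKeySketch {
  // Keep the pairs of `left` whose key does not occur in `right`.
  def subtractByKey[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, V)] = {
    val rightKeys: Set[K] = right.map(_._1).toSet
    left.filterNot { case (k, _) => rightKeys.contains(k) }
  }

  val remaining: Seq[(Int, Int)] =
    subtractByKey(Seq((1, 3), (3, 5), (3, 7)), Seq((3, 9)))
}
```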
scala>
join – inner join
Definition:
/**
* Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
* pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
* (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
*/
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {}
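From the definition above we can see that join takes another RDD as its argument (plus a Partitioner in this overload) and also returns an RDD. Each element of the returned RDD is a tuple: its key has the key type K shared by the two RDDs, and its value is itself a tuple. For RDD1.join(RDD2), V is the value type of RDD1 and W is the value type of RDD2. With this analysis the purpose of the function is already fairly clear: for every key present in both RDDs, it pairs each value from RDD1 with each value from RDD2.
An example: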
scala> val textRDD = sc.parallelize(List((1, 3), (3, 5), (3, 7), (3, 8), (3, 9)))
textRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[25] at parallelize at :24
scala> val textRDD2 = sc.parallelize(List((3,9), (3,4)))
textRDD2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[30] at parallelize at :24
scala> val joinRDD = textRDD.join(textRDD2)
joinRDD: org.apache.spark.rdd.RDD[(Int, (Int, Int))] = MapPartitionsRDD[33] at join at :28
scala> joinRDD.collect.foreach(println)
(3,(5,9))
(3,(5,4))
(3,(7,9))
(3,(7,4))
(3,(8,9))
(3,(8,4))
(3,(9,9))
(3,(9,4))
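The pairing rule of an inner join can be written down in a few lines of plain Scala. This is a naive nested-loop sketch for illustration, not how Spark executes a join; JoinSketch and its join helper are hypothetical names.

```scala
object JoinSketch {
  // For every left pair, emit one result per right pair with the same key.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    left.flatMap { case (k, v) =>
      right.collect { case (k2, w) if k2 == k => (k, (v, w)) }
    }

  val left: Seq[(Int, Int)] = Seq((1, 3), (3, 5), (3, 7), (3, 8), (3, 9))
  val right: Seq[(Int, Int)] = Seq((3, 9), (3, 4))

  val joined: Seq[(Int, (Int, Int))] = join(left, right)
}
```

Key 3 has four values on the left and two on the right, so it yields 4 × 2 = 8 pairs; key 1 has no partner and is dropped.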
leftOuterJoin
Similar to join, with one difference. First, an example:
scala> val textRDD = sc.parallelize(List((1, 3), (3, 5), (3, 7), (3, 8), (3, 9)))
textRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[25] at parallelize at :24
scala> val textRDD2 = sc.parallelize(List((3,9), (3,4)))
textRDD2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[30] at parallelize at :24
scala> val joinRDD = textRDD.leftOuterJoin(textRDD2)
joinRDD: org.apache.spark.rdd.RDD[(Int, (Int, Option[Int]))] = MapPartitionsRDD[36] at leftOuterJoin at :28
scala> joinRDD.collect.foreach(println)
(1,(3,None))
(3,(5,Some(9)))
(3,(5,Some(4)))
(3,(7,Some(9)))
(3,(7,Some(4)))
(3,(8,Some(9)))
(3,(8,Some(4)))
(3,(9,Some(9)))
(3,(9,Some(4)))
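As this example shows, every key of textRDD (the left side) is guaranteed to appear in the result. If a key of textRDD has no match in textRDD2, the right-hand value is None (key 1 above); otherwise it is wrapped in Some.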
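The Option wrapping can be sketched on plain Scala collections as follows. LeftOuterJoinSketch and its helper are hypothetical names, and the nested-loop approach is chosen for readability only.

```scala
object LeftOuterJoinSketch {
  // Every left pair survives; the right value is None when no key matches.
  def leftOuterJoin[K, V, W](
      left: Seq[(K, V)],
      right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    left.flatMap { case (k, v) =>
      val matches: Seq[W] = right.collect { case (k2, w) if k2 == k => w }
      if (matches.isEmpty) Seq((k, (v, Option.empty[W])))
      else matches.map(w => (k, (v, Option(w))))
    }

  val left: Seq[(Int, Int)] = Seq((1, 3), (3, 5), (3, 7), (3, 8), (3, 9))
  val right: Seq[(Int, Int)] = Seq((3, 9), (3, 4))

  val joined: Seq[(Int, (Int, Option[Int]))] = leftOuterJoin(left, right)
}
```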
rightOuterJoin
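This method is the mirror image of leftOuterJoin: every key of the right-hand RDD is kept, and the left-hand value is wrapped in an Option. Key 1, which exists only in textRDD, does not appear in the result below.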
scala> val textRDD = sc.parallelize(List((1, 3), (3, 5), (3, 7), (3, 8), (3, 9)))
textRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[25] at parallelize at :24
scala> val textRDD2 = sc.parallelize(List((3,9), (3,4)))
textRDD2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[30] at parallelize at :24
scala> val joinRDD = textRDD.rightOuterJoin(textRDD2)
joinRDD: org.apache.spark.rdd.RDD[(Int, (Option[Int], Int))] = MapPartitionsRDD[39] at rightOuterJoin at :28
scala> joinRDD.collect.foreach(println)
(3,(Some(5),9))
(3,(Some(5),4))
(3,(Some(7),9))
(3,(Some(7),4))
(3,(Some(8),9))
(3,(Some(8),4))
(3,(Some(9),9))
(3,(Some(9),4))
scala>
cogroup
First, an example:
scala> val textRDD = sc.parallelize(List((1, 3), (3, 5), (3, 7), (3, 8), (3, 9)))
textRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[25] at parallelize at :24
scala> val textRDD2 = sc.parallelize(List((3,9), (3,4)))
textRDD2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[30] at parallelize at :24
scala> val cogroupRDD = textRDD.cogroup(textRDD2)
cogroupRDD: org.apache.spark.rdd.RDD[(Int, (Iterable[Int], Iterable[Int]))] = MapPartitionsRDD[41] at cogroup at :28
scala> cogroupRDD.collect.foreach(println)
(1,(CompactBuffer(3),CompactBuffer()))
(3,(CompactBuffer(5, 7, 8, 9),CompactBuffer(9, 4)))
Below is the definition of the function (the overload that takes three other RDDs):
/**
* For each key k in `this` or `other1` or `other2` or `other3`,
* return a resulting RDD that contains a tuple with the list of values
* for that key in `this`, `other1`, `other2` and `other3`.
*/
def cogroup[W1, W2, W3](other1: RDD[(K, W1)],
other2: RDD[(K, W2)],
other3: RDD[(K, W3)],
partitioner: Partitioner)
: RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = self.withScope {}
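With the example and the definition above, the purpose of cogroup should be easy to understand: for each key it collects the values from every RDD into one Iterable per RDD.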
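A two-collection version of cogroup is easy to model in plain Scala, and the join variants can then be derived from its output. The sketch below is an illustration under assumptions: CogroupSketch and both helpers are hypothetical names, and rightOuterJoin is shown only as one operation that can be expressed on top of the cogrouped data.

```scala
object CogroupSketch {
  // For every key on either side, gather the left values and the right values.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys: Seq[K] = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val vs = left.collect { case (k2, v) if k2 == k => v }
      val ws = right.collect { case (k2, w) if k2 == k => w }
      (k, (vs, ws))
    }.toMap
  }

  // A right outer join derived from the cogrouped data.
  def rightOuterJoin[K, V, W](
      left: Seq[(K, V)],
      right: Seq[(K, W)]): Seq[(K, (Option[V], W))] =
    cogroup(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
      if (vs.isEmpty) ws.map(w => (k, (Option.empty[V], w)))
      else for (v <- vs; w <- ws) yield (k, (Option(v), w))
    }

  val left: Seq[(Int, Int)] = Seq((1, 3), (3, 5), (3, 7), (3, 8), (3, 9))
  val right: Seq[(Int, Int)] = Seq((3, 9), (3, 4))

  val cogrouped: Map[Int, (Seq[Int], Seq[Int])] = cogroup(left, right)
  val rightJoined: Seq[(Int, (Option[Int], Int))] = rightOuterJoin(left, right)
}
```

Key 1 shows up in cogrouped with an empty right-hand Seq, which is exactly why it produces no rows in rightJoined.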
countByKey() – action
For a key-value RDD, counts how many times each key occurs.
scala> val textRDD = sc.parallelize(List((1, 3), (3, 5), (3, 7), (3, 8), (3, 9)))
textRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[25] at parallelize at :24
scala> val countRDD = textRDD.countByKey()
countRDD: scala.collection.Map[Int,Long] = Map(1 -> 1, 3 -> 4)
collectAsMap() – action
For a key-value RDD, first collects the pairs and then converts them into a Map, which makes lookups convenient.
scala> val textRDD = sc.parallelize(List((1, 3), (3, 5), (3, 7), (3, 8), (3, 9)))
textRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[25] at parallelize at :24
scala> val countRDD = textRDD.collectAsMap()
countRDD: scala.collection.Map[Int,Int] = Map(1 -> 3, 3 -> 9)
Note: if several pairs share the same key, a later value overwrites the earlier one, so only one value per key is kept.
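The difference between the two actions can be seen side by side on a plain Scala Seq. This is an analogy with standard collections only; the CollectAsMapSketch object is a made-up name, and in plain Scala it is toMap that lets the last pair for a key win.

```scala
object CollectAsMapSketch {
  val pairs: Seq[(Int, Int)] = Seq((1, 3), (3, 5), (3, 7), (3, 8), (3, 9))

  // Like collectAsMap: one entry per key, later pairs replace earlier ones.
  val asMap: Map[Int, Int] = pairs.toMap

  // Like countByKey: how many pairs carry each key.
  val keyCounts: Map[Int, Long] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.size.toLong) }
}
```

If all values per key are needed, groupByKey is the safer choice, since collectAsMap silently keeps a single one.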