Spark RDD Operations: Transformations and Actions

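RDDs support two types of operations: transformations, which create a new RDD from an existing one, and actions, which generally return a result to the Driver once the computation finishes. All transformations in Spark are lazy: they do not execute immediately but only record the transformation logic applied to the current RDD. The transformations are actually computed only when an action requires a result to be returned to the Driver program. This design lets Spark run more efficiently. By default, a transformed RDD may be recomputed every time an action runs on it; calling cache (or persist) keeps it around so that later actions can reuse it.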
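The lazy behaviour described above can be observed with a small sketch. It assumes Spark is on the classpath and runs as a standalone script with a local master; the accumulator name `mapCalls` and the app name are illustrative choices, not part of the original text. In spark-shell the predefined `sc` can be used instead of creating a context.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("lazy-demo"))

// Counts how many times the function passed to map actually runs.
val calls = sc.longAccumulator("mapCalls")

// Transformation only: the map logic is recorded, nothing is computed yet.
val doubled = sc.makeRDD(List(1, 2, 3)).map { n => calls.add(1); n * 2 }
val callsBeforeAction = calls.sum

// Action: triggers the recorded map and returns the result to the Driver.
val result = doubled.collect()
val callsAfterAction = calls.sum

sc.stop()
```

Before `collect` the counter is still zero; after it, the function has run once per element.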


scala> var rdd1=sc.textFile("hdfs:///demo/words/t_word",1).map(line=>line.split("
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[117] at map at <console>:24

scala> rdd1.cache
res54: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[117] at map at <console>:24

scala> rdd1.reduce(_+_)
res55: Int = 15

scala> rdd1.reduce(_+_)
res56: Int = 15

Spark also supports persisting RDDs on disk or replicating them across multiple nodes. For example, calling persist(StorageLevel.DISK_ONLY_2) stores the RDD on disk and keeps two copies.
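As a hedged illustration of the storage level mentioned above, the sketch below persists a small RDD with `StorageLevel.DISK_ONLY_2` and inspects the level that was set. The local master and the sample data are assumptions made for the example; on a single local node the second replica cannot actually be placed, so this only demonstrates the API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Local context for illustration only (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("persist-demo"))

val numbers = sc.makeRDD(List(1, 2, 3, 4, 5))

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
// DISK_ONLY_2 asks for disk storage with two replicas instead.
numbers.persist(StorageLevel.DISK_ONLY_2)
val level = numbers.getStorageLevel

val firstSum = numbers.reduce(_ + _)   // first action computes and stores the partitions
val secondSum = numbers.reduce(_ + _)  // later actions can read the stored partitions

numbers.unpersist()
sc.stop()
```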

Transformations

1.map( func ): transform

Return a new distributed dataset formed by passing each element of the
source through a function func

Converts an RDD[U] into an RDD[T]. The user supplies an anonymous function func: U => T for the conversion.

    scala> var rdd:RDD[String]=sc.makeRDD(List("a","b","c","a"))
    rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[120] at makeRDD at
    scala> val mapRDD:RDD[(String,Int)] = rdd.map(w => (w, 1))
    mapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[121] at map at

2.filter( func ): filter

Return a new dataset formed by selecting those elements of the source
on which func returns true.

Filters the elements of an RDD[U] and produces a new RDD[U]. The user supplies func: U => Boolean, and only the elements for which it returns true are kept.

    scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5))
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[122] at makeRDD at
    scala> val mapRDD:RDD[Int]=rdd.filter(num=> num %2 == 0)
    mapRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[123] at filter at
    scala> mapRDD.collect
    res63: Array[Int] = Array(2, 4)

3.flatMap( func ): map and flatten

Similar to map, but each input item can be mapped to 0 or more output
items (so func should return a Seq rather than a single item).

Like map, it converts an RDD[U] into an RDD[T], but the user supplies a function func: U => Seq[T].

scala> var rdd:RDD[String]=sc.makeRDD(List("this is","good good"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[124] at makeRDD at

scala> var flatMapRDD:RDD[(String,Int)]=rdd.flatMap(line=> for(i<- line.split("\\s+")) yield (i,1))
flatMapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[125] at flatMap at <console>:26

scala> var flatMapRDD:RDD[(String,Int)]=rdd.flatMap( line=>
flatMapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[126] at flatMap at <console>:26

scala> flatMapRDD.collect
res64: Array[(String, Int)] = Array((this,1), (is,1), (good,1), (good,1))

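4.mapPartitions( func ): transform one partition at a time

Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

Like map, but the input of the function is the full data of one partition, so the user supplies a per-partition conversion function func: Iterator[T] => Iterator[U]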

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[128] at makeRDD at

scala> var mapPartitionsRDD=rdd.mapPartitions(values => values.map(n=>(n,n%2==0)))
mapPartitionsRDD: org.apache.spark.rdd.RDD[(Int, Boolean)] = MapPartitionsRDD[129] at mapPartitions at <console>:26

scala> mapPartitionsRDD.collect
res70: Array[(Int, Boolean)] = Array((1,false), (2,true), (3,false), (4,true), (5,false))

5.mapPartitionsWithIndex( func ): elements with their partition index

Similar to mapPartitions, but also provides func with an integer
value representing the index of the partition, so func must be of type
(Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.

Like mapPartitions, but the function also receives the index of the partition that holds the elements, so func: (Int, Iterator[T]) => Iterator[U]

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6),2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[139] at makeRDD at

scala> var mapPartitionsWithIndexRDD=rdd.mapPartitionsWithIndex((p,values) => values.map(n=>(n,p)))
mapPartitionsWithIndexRDD: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[140] at mapPartitionsWithIndex at <console>:26

scala> mapPartitionsWithIndexRDD.collect
res77: Array[(Int, Int)] = Array((1,0), (2,0), (3,0), (4,1), (5,1), (6,1))

6.sample( withReplacement , fraction , seed ): random sampling

Sample a fraction fraction of the data, with or without replacement,
using a given random number generator seed.

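withReplacement: whether an element may be sampled more than once.
fraction: the approximate fraction of the data to sample.
seed: the seed of the random number generator used during sampling.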

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[150] at makeRDD at

scala> var simpleRDD:RDD[Int]=rdd.sample(false,0.5d,1L)
simpleRDD: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[151] at sample at

scala> simpleRDD.collect
res91: Array[Int] = Array(1, 5, 6)
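To make the role of the three parameters concrete, here is a minimal sketch, assuming a local master and a two-partition input (both are choices made for this example). It shows that a fixed seed makes the sample reproducible, while the sample size is only approximately `fraction` of the data.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("sample-demo"))

val numbers = sc.makeRDD(List(1, 2, 3, 4, 5, 6), 2)

// Same RDD and same seed: the two samples are identical.
val sampleA = numbers.sample(withReplacement = false, fraction = 0.5, seed = 1L).collect()
val sampleB = numbers.sample(withReplacement = false, fraction = 0.5, seed = 1L).collect()

// With replacement an element may appear several times, and fraction may exceed 1.
val withRepl = numbers.sample(withReplacement = true, fraction = 2.0, seed = 1L).collect()

sc.stop()
```

Without replacement every element is picked at most once, so `sampleA` contains no duplicates here; its length is not guaranteed to be exactly 3.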


7.union( otherDataset ): union

Return a new dataset that contains the union of the elements in the
source dataset and the argument.


scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at
scala> var rdd2:RDD[Int]=sc.makeRDD(List(6,7))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[155] at makeRDD at
scala> rdd.union(rdd2).collect
res95: Array[Int] = Array(1, 2, 3, 4, 5, 6, 6, 7)

8.intersection( otherDataset ): intersection

Return a new RDD that contains the intersection of elements in the
source dataset and the argument.


scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at
scala> var rdd2:RDD[Int]=sc.makeRDD(List(6,7))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[155] at makeRDD at
scala> rdd.intersection(rdd2).collect
res100: Array[Int] = Array(6)

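9.distinct([ numPartitions ]): deduplicate

Return a new dataset that contains the distinct elements of the source
dataset.

Removes duplicate elements from the RDD. numPartitions is an optional parameter that sets the number of partitions of the result. It is typically passed when deduplication shrinks the data set substantially, so that the partition count can be reduced as well.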

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at
scala> rdd.distinct(3).collect
res106: Array[Int] = Array(6, 3, 4, 1, 5, 2)

10.join( otherDataset , [ numPartitions ]): join

When called on datasets of type (K, V) and (K, W), returns a dataset
of (K, (V, W)) pairs with all pairs of elements for each key. Outer
joins are supported through leftOuterJoin, rightOuterJoin, and
fullOuterJoin.

When join is called on an RDD[(K,V)] and an RDD[(K,W)], it returns a new RDD[(K,(V,W))] (an inner join by default). leftOuterJoin, rightOuterJoin, and fullOuterJoin are also supported. For example, joining RDD[(1,2)] with RDD[(1,3)] yields RDD[(1,(2,3))].

scala> var userRDD:RDD[(Int,String)]=sc.makeRDD(List((1,"zhangsan"),(2,"lisi")))
userRDD: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[204] at makeRDD at <console>:25
scala> case class OrderItem(name:String,price:Double,count:Int)
defined class OrderItem
scala> var
orderItemRDD: org.apache.spark.rdd.RDD[(Int, OrderItem)] = ParallelCollectionRDD[206] at makeRDD at <console>:27
scala> userRDD.join(orderItemRDD).collect
res107: Array[(Int, (String, OrderItem))] = Array((1,
scala> userRDD.leftOuterJoin(orderItemRDD).collect
res108: Array[(Int, (String, Option[OrderItem]))] = Array((1,
(zhangsan,Some(OrderItem(apple,4.5,2)))), (2,(lisi,None)))
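Because parts of the session above are cut off, the following self-contained sketch may help. It assumes a local master and order data in which only user 1 has an order (one apple item); that data is an assumption chosen to be consistent with the visible leftOuterJoin output, not a copy of the original definition.

```scala
import org.apache.spark.{SparkConf, SparkContext}

case class OrderItem(name: String, price: Double, count: Int)

// Local context for illustration only (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("join-demo"))

val userRDD = sc.makeRDD(List((1, "zhangsan"), (2, "lisi")))
// Assumed order data: only user 1 has an order.
val orderItemRDD = sc.makeRDD(List((1, OrderItem("apple", 4.5, 2))))

// Inner join keeps only keys present on both sides.
val inner = userRDD.join(orderItemRDD).collect().sortBy(_._1)
// Left outer join keeps every user; a missing order becomes None.
val left = userRDD.leftOuterJoin(orderItemRDD).collect().sortBy(_._1)
// Full outer join wraps both sides in Option.
val full = userRDD.fullOuterJoin(orderItemRDD).collect().sortBy(_._1)

sc.stop()
```

The results are sorted on the Driver only to make the output order predictable; the join itself does not guarantee an order.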

11.cogroup( otherDataset , [ numPartitions ]): for reference

When called on datasets of type (K, V) and (K, W), returns a dataset
of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also
called groupWith.

scala> var userRDD:RDD[(Int,String)]=sc.makeRDD(List((1,"zhangsan"),(2,"lisi")))
userRDD: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[204] at makeRDD at <console>:25
scala> var
orderItemRDD: org.apache.spark.rdd.RDD[(Int, OrderItem)] = ParallelCollectionRDD[215] at makeRDD at <console>:27
scala> userRDD.cogroup(orderItemRDD).collect
res110: Array[(Int, (Iterable[String], Iterable[OrderItem]))] = Array((1,
OrderItem(pear,1.5,2)))), (2,(CompactBuffer(lisi),CompactBuffer())))
scala> userRDD.groupWith(orderItemRDD).collect
res119: Array[(Int, (Iterable[String], Iterable[OrderItem]))] = Array((1,
OrderItem(pear,1.5,2)))), (2,(CompactBuffer(lisi),CompactBuffer())))

12.cartesian( otherDataset ): Cartesian product (for reference)

When called on datasets of types T and U, returns a dataset of (T, U)
pairs (all pairs of elements).


scala> var rdd1:RDD[Int]=sc.makeRDD(List(1,2,4))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[238] at makeRDD at
scala> var rdd2:RDD[String]=sc.makeRDD(List("a","b","c"))
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[239] at makeRDD at
scala> rdd1.cartesian(rdd2).collect
res120: Array[(Int, String)] = Array((1,a), (1,b), (1,c), (2,a), (2,b), (2,c), (4,a),
(4,b), (4,c))

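13.coalesce( numPartitions ): shrink the partition count (cannot grow)

Decrease the number of partitions in the RDD to numPartitions. Useful
for running operations more efficiently after filtering down a large
dataset.

After filtering out a large share of the data, coalesce can be used to reduce the number of partitions of the RDD (by default it can only decrease the partition count, not increase it).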

scala> var rdd1:RDD[Int]=sc.makeRDD(0 to 100)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[252] at makeRDD at
scala> rdd1.getNumPartitions
res129: Int = 6
scala> rdd1.filter(n=> n%2 == 0).coalesce(3).getNumPartitions
res127: Int = 3
scala> rdd1.filter(n=> n%2 == 0).coalesce(12).getNumPartitions
res128: Int = 6
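The limit shown above applies to coalesce without a shuffle. As a hedged sketch (local master and six initial partitions are assumptions that mirror the session), the optional `shuffle` flag of `coalesce` lifts that limit, which is also how `repartition` is able to grow the partition count.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("coalesce-demo"))

// Six partitions are requested explicitly so the numbers do not depend on the machine.
val evens = sc.makeRDD(0 to 100, 6).filter(_ % 2 == 0)

val shrunk = evens.coalesce(3).getNumPartitions                    // fewer partitions, no shuffle
val notGrown = evens.coalesce(12).getNumPartitions                 // request ignored, stays at 6
val grown = evens.coalesce(12, shuffle = true).getNumPartitions    // shuffle allows growth
val repartitioned = evens.repartition(12).getNumPartitions         // always shuffles

sc.stop()
```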

14.repartition( numPartitions ): resize the partition count (can grow)

Reshuffle the data in the RDD randomly to create either more or fewer
partitions and balance it across them. This always shuffles all data
over the network.

Similar to coalesce, but this operator can either increase or decrease the number of partitions of the RDD.

scala> var rdd1:RDD[Int]=sc.makeRDD(0 to 100)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[252] at makeRDD at
scala> rdd1.getNumPartitions
res129: Int = 6
scala> rdd1.filter(n=> n%2 == 0).repartition(12).getNumPartitions
res130: Int = 12
scala> rdd1.filter(n=> n%2 == 0).repartition(3).getNumPartitions
res131: Int = 3

15.repartitionAndSortWithinPartitions( partitioner ): for reference

Repartition the RDD according to the given partitioner and, within
each resulting partition, sort records by their keys. This is more
efficient than calling repartition and then sorting within each
partition because it can push the sorting down into the shuffle
machinery.

This operator partitions the RDD with the user-supplied partitioner and then sorts the records in each partition by key.

scala> case class User(name:String,deptNo:Int)
defined class User
var empRDD:RDD[User]= sc.parallelize(List(User("张三",1),User("lisi",2),User("wangwu",1))) => (t.deptNo, Partitioner
override def numPartitions: Int = 4
override def getPartition(key: Any): Int = {
(key.hashCode() & Integer.MAX_VALUE) % numPartitions
}).mapPartitionsWithIndex((p,values)=> {
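The session above is truncated, so here is a complete sketch of the same idea under stated assumptions: a local master, the hypothetical class name `DeptPartitioner`, and a mapping of each user to a (deptNo, name) pair (the original value type is not recoverable). The hash is masked before the modulo; in Scala `%` binds tighter than `&`, so the parentheses are needed for the intended result.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

case class User(name: String, deptNo: Int)

// Hypothetical partitioner: non-negative hash of the key modulo the partition count.
class DeptPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int =
    (key.hashCode() & Integer.MAX_VALUE) % numPartitions
}

// Local context for illustration only (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("repartition-sort-demo"))

val empRDD = sc.parallelize(List(User("张三", 1), User("lisi", 2), User("wangwu", 1)))

// Key by department, repartition, sort by key inside each partition,
// then tag every record with the index of the partition it landed in.
val placed = empRDD
  .map(u => (u.deptNo, u.name))
  .repartitionAndSortWithinPartitions(new DeptPartitioner(4))
  .mapPartitionsWithIndex((p, values) =>
    values.map { case (deptNo, name) => (p, deptNo, name) })
  .collect()

sc.stop()
```

With four partitions, department 1 lands in partition 1 and department 2 in partition 2; the relative order of records that share a key is not specified.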





I.groupByKey([ numPartitions ]): values grouped into a collection

When called on a dataset of (K, V) pairs, returns a dataset of (K,
Iterable<V>) pairs.

Similar to the MapReduce computation model. Converts an RDD[(K, V)] into an RDD[(K, Iterable[V])].

scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at
scala> lines.flatMap(_.split("\\s+")).map((_,1)).groupByKey.collect
res3: Array[(String, Iterable[Int])] = Array((this,CompactBuffer(1)), (is,CompactBuffer(1)), (good,CompactBuffer(1, 1)))
  • groupBy(f:(k,v)=> T)

    scala> var lines=sc.parallelize(List("this is good good"))
    lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at
    scala> lines.flatMap(_.split("\\s+")).map((_,1)).groupBy(t=>t._1)
    res5: org.apache.spark.rdd.RDD[(String, Iterable[(String, Int)])] = ShuffledRDD[18] at groupBy at <console>:26
    scala> lines.flatMap(_.split("\\s+")).map((_,1)).groupBy(t=>t._1).map(t=>
    res6: Array[(String, Int)] = Array((this,1), (is,1), (good,2))

II.reduceByKey( func , [ numPartitions ]): numPartitions may be omitted

When called on a dataset of (K, V) pairs, returns a dataset of (K, V)
pairs where the values for each key are aggregated using the given
reduce function func, which must be of type (V,V) => V. Like in
groupByKey, the number of reduce tasks is configurable through an
optional second argument.


scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at

scala> lines.flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_).collect
res8: Array[(String, Int)] = Array((this,1), (is,1), (good,2))

III.aggregateByKey( zeroValue )( seqOp , combOp , [ numPartitions ]): curried; zeroValue is the initial value, seqOp aggregates within a partition, combOp merges the partial results

When called on a dataset of (K, V) pairs, returns a dataset of (K, U)
pairs where the values for each key are aggregated using the given
combine functions and a neutral "zero" value. Allows an aggregated
value type that is different than the input value type, while avoiding
unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at
scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)(_+_,_+_).collect
res9: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
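The word-count call above uses the same type for values and result. To illustrate the case where the aggregated type differs from the value type, the sketch below computes a per-key average by aggregating Int values into a (sum, count) pair. The local master and the sample data are assumptions made for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("aggregate-demo"))

// Assumed sample data: (key, score) pairs spread over two partitions.
val scores = sc.makeRDD(List(("a", 1), ("b", 4), ("a", 3), ("b", 6), ("a", 5)), 2)

// zeroValue (0, 0) is the initial (sum, count) for each key.
val sumAndCount = scores.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold one value into a partition-local accumulator
  (x, y) => (x._1 + y._1, x._2 + y._2))   // combOp: merge accumulators from different partitions

val averages = sumAndCount
  .mapValues { case (sum, n) => sum.toDouble / n }
  .collect()
  .toMap

sc.stop()
```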

IV.sortByKey([ ascending ], [ numPartitions ])

When ascending is true (the default) the result is sorted by key in ascending order; when it is false, in descending order.

When called on a dataset of (K, V) pairs where K implements Ordered,
returns a dataset of (K, V) pairs sorted by keys in ascending or
descending order, as specified in the boolean ascending argument.


scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at

scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)
res13: Array[(String, Int)] = Array((good,2), (is,1), (this,1))

scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)
res14: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
  • sortBy(T=>U,ascending,[ numPartitions ])

    scala> var lines=sc.parallelize(List("this is good good"))
    lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at
    scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)
    res18: Array[(String, Int)] = Array((good,2), (this,1), (is,1))
    scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)
    res19: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
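Since the sorting calls in the sessions above are cut off, the following sketch shows one way the two operators can be used on the same word counts. It assumes a local master and uses reduceByKey for the counting step; it is not a reconstruction of the truncated lines.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("sort-demo"))

val counts = sc.parallelize(List("this is good good"))
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)

// sortByKey orders by the key (the word).
val byKeyAsc = counts.sortByKey(ascending = true).collect()
val byKeyDesc = counts.sortByKey(ascending = false).collect()

// sortBy orders by any derived value, here the count.
val byCountDesc = counts.sortBy(_._2, ascending = false).collect()

sc.stop()
```

Words with the same count may appear in either order after `sortBy`, so only the position of the most frequent word is predictable here.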



Actions

1.reduce( func )

Aggregate the elements of the dataset using a function func (which
takes two arguments and returns one). The function should be
commutative and associative so that it can be computed correctly in
parallel.

scala> sc.textFile("file:///root/t_word").map(_.length).reduce(_+_)
res3: Int = 64
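Why the function should be commutative and associative can be seen in a small sketch, assuming a local master and explicit partition counts chosen for the example: addition gives the same answer however the data is partitioned, while subtraction does not.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("reduce-demo"))

val onePartition = sc.makeRDD(1 to 4, 1)
val twoPartitions = sc.makeRDD(1 to 4, 2)

// Addition is commutative and associative: partitioning does not matter.
val sum1 = onePartition.reduce(_ + _)
val sum2 = twoPartitions.reduce(_ + _)

// Subtraction is neither, so the result depends on how the elements are split.
val diff1 = onePartition.reduce(_ - _)   // ((1 - 2) - 3) - 4
val diff2 = twoPartitions.reduce(_ - _)  // per-partition results are combined afterwards

sc.stop()
```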


2.collect()

Return all the elements of the dataset as an array at the driver
program. This is usually useful after a filter or other operation that
returns a sufficiently small subset of the data.


scala> sc.textFile("file:///root/t_word").collect
res4: Array[String] = Array(this is a demo, hello spark, "good good study ", "day day up ", come on baby)

3.foreach( func )

Run a function func on each element of the dataset. This is usually
done for side effects such as updating an Accumulator or interacting
with external storage systems.

scala> sc.textFile("file:///root/t_word").foreach(line=>println(line))


4.count()

Return the number of elements in the dataset.


scala> sc.textFile("file:///root/t_word").count()
res7: Long = 5

5.first(): return the first element | 6.take( n ): return the first n elements

Return the first element of the dataset (similar to take(1)). take(n)
returns an array with the first n elements of the dataset.

scala> sc.textFile("file:///root/t_word").first
res9: String = this is a demo

scala> sc.textFile("file:///root/t_word").take(1)
res10: Array[String] = Array(this is a demo)

scala> sc.textFile("file:///root/t_word").take(2)
res11: Array[String] = Array(this is a demo, hello spark)

7.takeSample( withReplacement , num , [ seed ]): random sample

Return an array with a random sample of num elements of the dataset,
with or without replacement, optionally pre-specifying a random number
generator seed.


scala> sc.textFile("file:///root/t_word").takeSample(false,2)
res20: Array[String] = Array("good good study ", hello spark)

8.takeOrdered( n , [ordering] )

Return the first n elements of the RDD using either their natural
order or a custom comparator.

For a user-defined type the comparison rule is supplied as an implicit Ordering, as the session below shows.

scala> case class User(name:String,deptNo:Int,salary:Double)
defined class User

scala> var
userRDD: org.apache.spark.rdd.RDD[User] = ParallelCollectionRDD[51] at parallelize at

scala> userRDD.takeOrdered
	def takeOrdered(num: Int)(implicit ord: Ordering[User]): Array[User]

scala> userRDD.takeOrdered(3)
<console>:26: error: No implicit Ordering defined for User.

scala> implicit var userOrder=new Ordering[User]{
| override def compare(x: User, y: User): Int = {
| if(x.deptNo!=y.deptNo){
| x.deptNo.compareTo(y.deptNo)
| }else{
| x.salary.compareTo(y.salary) * -1
| }
| }
| }
userOrder: Ordering[User] = $anon$1@7066f4bc

scala> userRDD.takeOrdered(3)
res23: Array[User] = Array(User(zs,1,1000.0), User(ls,2,1500.0), User(ww,2,1000.0))
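The definition of userRDD is missing from the session above, so the sketch below uses assumed data (four users, three of them matching the visible output) together with the same comparison rule: department ascending, then salary descending. The local master is also an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}

case class User(name: String, deptNo: Int, salary: Double)

// Department ascending; within a department, higher salary first.
implicit val userOrder: Ordering[User] = new Ordering[User] {
  override def compare(x: User, y: User): Int =
    if (x.deptNo != y.deptNo) Integer.compare(x.deptNo, y.deptNo)
    else java.lang.Double.compare(y.salary, x.salary)
}

// Local context for illustration only (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("take-ordered-demo"))

// Assumed sample data; the original list is not recoverable.
val userRDD = sc.parallelize(List(
  User("ww", 2, 1000.0), User("zl", 3, 2000.0),
  User("zs", 1, 1000.0), User("ls", 2, 1500.0)))

val first3 = userRDD.takeOrdered(3)  // smallest three under userOrder
val last1 = userRDD.top(1)           // top uses the reverse of the same ordering

sc.stop()
```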

9.saveAsTextFile( path ): save as text

Write the elements of the dataset as a text file (or set of text
files) in a given directory in the local filesystem, HDFS or any other
Hadoop-supported file system. Spark will call toString on each element
to convert it to a line of text in the file.


scala> sc.textFile("file:///root/t_word").flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_).sortBy(_._1,true,1).map(t=>t._1+"\t"+t._2).saveAsTextFile("hdfs:///demo/results02")

10.saveAsSequenceFile( path ): save as binary

Write the elements of the dataset as a Hadoop SequenceFile in a given
path in the local filesystem, HDFS or any other Hadoop-supported file
system. This is available on RDDs of key-value pairs that implement
Hadoop's Writable interface. In Scala, it is also available on types
that are implicitly convertible to Writable (Spark includes
conversions for basic types like Int, Double, String, etc).

This method is available only on RDD[(K, V)], and both K and V must implement the Writable interface. In Scala, Spark provides implicit conversions that automatically turn types such as Int, Double, and String into Writable.

scala> sc.textFile("file:///root/t_word").flatMap(_.split("
scala> sc.sequenceFile[String,Int]("hdfs:///demo/results03").collect
res29: Array[(String, Int)] = Array((a,1), (baby,1), (come,1), (day,2), (demo,1),
(good,2), (hello,1), (is,1), (on,1), (spark,1), (study,1), (this,1), (up,1))
