What is an RDD
RDD is Spark's computation model. An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark: it represents an immutable, read-only, partitioned collection of elements.
Working with an RDD feels like working with a local collection; it offers many convenient methods and you do not need to care about the underlying scheduling details.
Wide dependency: a partition of the parent RDD is used by multiple partitions of the child RDD. Operations such as groupByKey, reduceByKey, and sortByKey create wide dependencies and therefore trigger a shuffle.
Narrow dependency: each partition of the parent RDD is used by only one partition of the child RDD. Operations such as map, filter, and union create narrow dependencies (a short sketch follows).
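A minimal sketch of the two dependency types, assuming a running spark-shell with SparkContext sc (the variable names are made up for illustration):
scala> val nums = sc.parallelize(1 to 10, 2)
scala> val doubled = nums.map(_ * 2).filter(_ > 5)             // narrow: each child partition reads a single parent partition, no shuffle
scala> val sums = nums.map(n => (n % 3, n)).reduceByKey(_ + _) // wide: child partitions read from several parent partitions, shuffle by key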
Three ways to create an RDD
An RDD can be created by parallelizing an in-memory collection (sc.parallelize), by loading an external dataset (e.g. sc.textFile), or by transforming an existing RDD.
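A short sketch of the three forms, assuming a running spark-shell; the HDFS path is the one used in the word-count example further below:
scala> val fromCollection = sc.parallelize(List(1, 2, 3, 4, 5))          // 1. parallelize a local collection
scala> val fromFile = sc.textFile("hdfs://bigdata01:9000/input/words")   // 2. load an external dataset
scala> val fromTransformation = fromCollection.map(_ * 10)               // 3. transform an existing RDD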
Each transformation operator produces a new RDD.
Action: triggers the computation immediately.
Every action that is encountered produces a job.
An action either writes the result to a file system, collects it back to the driver, or prints it.
Application:
Within one application, there are as many jobs as there are actions.
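For example (a sketch, assuming a running spark-shell), this application fires two actions, so it runs two jobs:
scala> val nums = sc.parallelize(1 to 100)
scala> nums.count()   // action -> job 1
scala> nums.take(5)   // action -> job 2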
Common operators
Transformation: lazily evaluated; the operation is not submitted to the cluster until an action is called (a combined sketch follows the lists below).
map: returns a new RDD formed by passing each input element through the function func
flatMap: flattens the result; conceptually a map followed by a flatten
filter: returns a new RDD containing only the input elements for which func returns true
union: returns a new RDD that is the union of the source RDD and the argument RDD
intersection: returns a new RDD that is the intersection of the source RDD and the argument RDD
distinct: returns a new RDD with duplicate elements of the source RDD removed
groupBy: groups the elements
sortBy: sorts the elements
Action
collect: collects the data back to the Driver
saveAsTextFile: saves the RDD as a text file
count: returns the number of elements
first: returns the first element
take: returns the first n elements
foreach: applies the function func to every element of the RDD
Differences between foreach and map:
1. foreach has no return value; map returns a new RDD
2. foreach is an action and triggers the computation; map is a transformation and is evaluated lazily
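A combined sketch of a few of the operators above, assuming a running spark-shell; nothing is computed until an action (collect/foreach) is called:
scala> val a = sc.parallelize(List(1, 2, 3, 4, 5, 5))
scala> val b = a.map(_ * 2).filter(_ > 4).distinct()   // transformations only: lazy, each returns a new RDD
scala> b.collect                                       // action: triggers a job, returns an Array to the driver
scala> b.foreach(println)                              // action: runs on the executors, returns Unit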
Advanced operators
mapPartitions
Can be seen as a variant of map that processes the data one partition at a time. The difference is the granularity of the call: map's input function is applied to every element of the RDD, while mapPartitions' input function is applied to each partition of the RDD.
Why does mapPartitions work with an iterator? A partition may contain too much data to materialize in memory at once (which could cause an out-of-memory error), so an iterator is used to pull the elements one by one.
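A minimal mapPartitions sketch, assuming a running spark-shell; the function receives each partition as an Iterator and must return an Iterator:
scala> val nums = sc.parallelize(1 to 10, 3)
scala> nums.mapPartitions(it => Iterator(it.sum)).collect   // one result per partition (here, the sum of that partition)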
mapPartitionsWithIndex
Operates on each partition of the RDD, additionally passing the partition index to the function.
scala> val list = sc.parallelize(1 to 10, 3)
list: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[82] at parallelize at <console>:24
scala> val func = (index:Int, it:Iterator[Int]) => it.map(s"index:$index, ele:"+_)
func: (Int, Iterator[Int]) => Iterator[String] = <function2>
The elements are spread fairly evenly across the three partitions:
scala> list.mapPartitionsWithIndex(func).collect
res38: Array[String] = Array(
index:0, ele:1, index:0, ele:2, index:0, ele:3,
index:1, ele:4, index:1, ele:5, index:1, ele:6,
index:2, ele:7, index:2, ele:8, index:2, ele:9, index:2, ele:10
)
aggregate: aggregates first locally within each partition, then globally across the partition results
Example 1:
scala> val a= sc.parallelize(List(1,2,3,4,5), 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[87] at parallelize at <console>:24
scala> val func = (index:Int, it:Iterator[Int]) => it.map(s"index:$index, ele:"+_)
func: (Int, Iterator[Int]) => Iterator[String] = <function2>
scala> a.mapPartitionsWithIndex(func).collect
res43: Array[String] = Array(
index:0, ele:1, index:0, ele:2,
index:1, ele:3, index:1, ele:4, index:1, ele:5
)
def aggregate[U](zeroValue: U)(seqOp: (U, Int) => U,combOp: (U, U) => U)(implicit evidence$30: scala.reflect.ClassTag[U]): U
(zeroValue: U): the initial value
seqOp: (U, Int) => U: the function applied locally within each partition
combOp: (U, U) => U: the function that combines the partition results globally
a.aggregate(0)(_+_, _+_) => 15
Explanation:
Partition 0: 0+1+2 = 3
Partition 1: 0+3+4+5 = 12
Final combine: 0+3+12 = 15
scala> a.aggregate(100)(_+_, _+_)
res45: Int = 315
Explanation:
Partition 0: 100+1+2 = 103
Partition 1: 100+3+4+5 = 112
Final combine: 100+103+112 = 315
scala> a.aggregate(2)(_+_, _*_)
res47: Int = 140
Explanation:
Partition 0: 2+1+2 = 5
Partition 1: 2+3+4+5 = 14
Final combine: 2*5*14 = 140
Take the maximum within each partition, then add the maxima together:
scala> a.aggregate(0)(Math.max(_,_), _+_)
res49: Int = 7
Explanation: the partition maxima are 2 and 5; combine: 0+2+5 = 7
scala> a.aggregate(10)(Math.max(_,_), _+_)
res50: Int = 30
Explanation: the zeroValue 10 is larger than every element, so both partition maxima are 10, and the combine adds the zeroValue once more: 10+10+10 = 30
Example 2:
scala> val b = sc.parallelize(List("a", "b", "c", "d", "e", "f"), 2)
b: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[90] at parallelize at <console>:24
scala> val func = (index:Int, it:Iterator[String]) => it.map(s"index:$index, ele:"+_)
func: (Int, Iterator[String]) => Iterator[String] = <function2>
scala> b.mapPartitionsWithIndex(func).collect
res52: Array[String] = Array(index:0, ele:a, index:0, ele:b, index:0, ele:c, index:1, ele:d, index:1, ele:e, index:1, ele:f)
scala> b.aggregate
def aggregate[U](zeroValue: U)(seqOp: (U, String) => U,combOp: (U, U) => U)(implicit evidence$30: scala.reflect.ClassTag[U]): U
Call it:
scala> b.aggregate("*")(_+_, _+_)
// The partitions are processed in parallel, so either one may finish first;
the result is **abc*def
or **def*abc
foldByKey
Folds the values of each key with the given function, starting from zeroValue.
scala> val rdd1 = sc.parallelize(List("dog", "cat", "wolf", "bear"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val rdd2 = rdd1.map(x=>(x.length, x))
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at map at <console>:26
scala> rdd2.collect
res0: Array[(Int, String)] = Array((3,dog), (3,cat), (4,wolf), (4,bear))
scala> val rdd3 = rdd2.foldByKey
def foldByKey(zeroValue: String)(func: (String, String) => String): org.apache.spark.rdd.RDD[(Int, String)]
def foldByKey(zeroValue: String,numPartitions: Int)(func: (String, String) => String): org.apache.spark.rdd.RDD[(Int, String)]
def foldByKey(zeroValue: String, partitioner: org.apache.spark.Partitioner)(func: (String, String) => String): org.apache.spark.rdd.RDD[(Int, String)]
scala> val rdd3 = rdd2.foldByKey("")(_+_)
rdd3: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[2] at foldByKey at <console>:28
scala> rdd3.collect
res1: Array[(Int, String)] = Array((4,wolfbear), (3,dogcat))
Implementing a word-count with foldByKey
scala> val lines = sc.textFile("hdfs://bigdata01:9000/input/words")
lines: org.apache.spark.rdd.RDD[String] = hdfs://bigdata01:9000/input/words MapPartitionsRDD[4] at textFile at <console>:24
scala> val wordAndOne = lines.flatMap(_.split(" ")).map((_, 1))
wordAndOne: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at map at <console>:26
scala> wordAndOne.foldByKey(0)(_+_).collect
res3: Array[(String, Int)] = Array((scala,4), (hive,5), ("",1), (java,1), (spark,5), (hadoop,5), (hbase,2))
keyBy
scala> val rdd1 = sc.parallelize(List("dog", "salmon", "salmon", "cat"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val rdd2 = rdd1.keyBy(_.length)
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at keyBy at <console>:26
scala> rdd2.collect
res0: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,cat))
scala> val rdd3 = rdd2.keyBy(_._2.length)
rdd3: org.apache.spark.rdd.RDD[(Int, (Int, String))] = MapPartitionsRDD[2] at keyBy at <console>:28
scala> rdd3.collect
res1: Array[(Int, (Int, String))] = Array((3,(3,dog)), (6,(6,salmon)), (6,(6,salmon)), (3,(3,cat)))
keys and values
scala> val rdd1 = sc.parallelize(List("dog", "salmon", "salmon", "cat"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> val rdd2 = rdd1.map(x=>(x.length, x))
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[4] at map at <console>:26
scala> rdd2.keys.collect
res3: Array[Int] = Array(3, 6, 6, 3)
scala> rdd2.values.collect
res4: Array[String] = Array(dog, salmon, salmon, cat)
join
scala> val a = sc.parallelize(List("dog", "cat", "salmon", "salmon"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> val b = a.keyBy(_.length)
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[8] at keyBy at <console>:26
scala> b.collect
res5: Array[(Int, String)] = Array((3,dog), (3,cat), (6,salmon), (6,salmon))
scala> val c = sc.parallelize(List("cat", "salmon", "dog", "wolf", "bear"), 3)
c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[9] at parallelize at <console>:24
scala> val d = c.keyBy(_.length)
d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[10] at keyBy at <console>:26
scala> d.collect
res7: Array[(Int, String)] = Array((3,cat), (6,salmon), (3,dog), (4,wolf), (4,bear))
scala> b.join(d).collect
res8: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,salmon)), (3,(dog,dog)), (3,(dog,cat)), (3,(cat,dog)), (3,(cat,cat)))
leftOuterJoin
scala> val a = sc.parallelize(List("dog", "cat", "salmon", "salmon", "rabbit", "turkey"), 3)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[14] at parallelize at <console>:24
scala> val b = a.keyBy(_.length)
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[15] at keyBy at <console>:26
scala> val c = sc.parallelize(List("cat", "salmon", "dog", "wolf", "bear"), 3)
c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[16] at parallelize at <console>:24
scala> val d = c.keyBy(_.length)
d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[17] at keyBy at <console>:26
scala> d.collect
res9: Array[(Int, String)] = Array((3,cat), (6,salmon), (3,dog), (4,wolf), (4,bear))
scala> d.leftOuterJoin(b).collect
res14: Array[(Int, (String, Option[String]))] = Array((6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), (6,(salmon,Some(salmon))), (6,(salmon,Some(salmon))), (3,(cat,Some(dog))), (3,(cat,Some(cat))), (3,(dog,Some(dog))), (3,(dog,Some(cat))), (4,(wolf,None)), (4,(bear,None)))