Spark Basics (RDD): Common Operators

What is an RDD
RDD is Spark's computation model. An RDD (Resilient Distributed Dataset) is a resilient, distributed collection of data. It is the most fundamental data abstraction in Spark and represents an immutable, read-only, partitioned dataset.
Operating on an RDD feels much like operating on a local collection: there are many methods to call and they are convenient to use, with no need to worry about the underlying scheduling details.

Wide dependency: a partition of the parent RDD is used by multiple partitions of the child RDD. Operations such as groupByKey, reduceByKey, and sortByKey produce wide dependencies and trigger a shuffle.

Narrow dependency: each partition of the parent RDD is used by only one partition of the child RDD. Operations such as map, filter, and union produce narrow dependencies.
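
A minimal sketch of the two kinds of dependencies, assuming a SparkContext named sc is already available (as in the spark-shell sessions later in this section):

    // Narrow dependency: each output partition depends on exactly one parent partition
    val nums = sc.parallelize(1 to 10, 2)
    val doubled = nums.map(_ * 2)                              // map -> narrow, no shuffle

    // Wide dependency: records with the same key may sit in different parent partitions,
    // so reduceByKey must shuffle them together before summing
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)), 2)
    val counts = pairs.reduceByKey(_ + _)                      // reduceByKey -> wide, shuffle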

Three ways to create an RDD

  1. Parallelizing a collection (creating from a Scala collection): local Scala collection -> Spark RDD
    val arr = Array(1, 2, 3, 4, 5)
    val rdd = sc.parallelize(arr)
    val rdd = sc.makeRDD(arr)
    Creating an RDD by parallelizing a collection is suitable for local testing and experiments.
  2. Reading from an external file system, such as HDFS
    val rdd2 = sc.textFile("hdfs://hdp-01:9000/words.txt")
    // Read a local file
    val rdd2 = sc.textFile("file:///root/words.txt") // the file:// prefix is optional for local files
  3. Transforming an existing RDD
    Calling a transformation operator on an existing RDD produces a new RDD.

Operators on RDDs
They fall into two main categories.

Transformation: lazily evaluated; it does not trigger computation by itself.
Transformations are executed lazily rather than immediately: a method such as map merely records the function that was passed in. Only when the chain of transformations meets an action does the job actually start executing.

Each transformation operator produces a new RDD.

Action: triggers computation immediately.

Every action encountered produces a job.
The result is either written to a file system, collected to the driver, or printed.
Application:
Within each application, there are as many jobs as there are actions.
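
A small illustration of the lazy behaviour described above (a sketch, assuming sc from the spark-shell): the map call only builds the lineage, and nothing runs until an action such as count is called.

    val lines = sc.parallelize(Seq("spark", "hadoop", "hive"))
    // Transformation: nothing is executed yet, the function is only recorded
    val upper = lines.map(_.toUpperCase)
    // Action: this is the point where a job is submitted and actually runs
    val n = upper.count()      // n = 3
    upper.collect()            // Array(SPARK, HADOOP, HIVE)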
Common Operators

  1. Transformation: lazily evaluated; it does not submit the work to the cluster on its own
    map: returns a new RDD formed by passing each input element through the function func
    flatMap: flatten after mapping, i.e. a map followed by a flatten
    filter: returns a new RDD made up of the input elements for which func returns true
    union: returns a new RDD that is the union of the source RDD and the argument RDD
    intersection: returns a new RDD that is the intersection of the source RDD and the argument RDD
    distinct: returns a new RDD with the duplicates of the source RDD removed
    groupBy: grouping
    sortBy: sorting
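    A quick sketch chaining a few of these transformations on a made-up local collection (nothing here executes until the final collect):

    val words = sc.parallelize(Seq("spark hive", "spark hadoop", "hive"))
    val result = words
      .flatMap(_.split(" "))       // flatMap: split each line into words and flatten
      .filter(_.nonEmpty)          // filter: keep non-empty words
      .distinct()                  // distinct: remove duplicates
      .sortBy(w => w)              // sortBy: alphabetical order
    result.collect()               // Array(hadoop, hive, spark)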

  2. Action
    collect: collects the data back to the driver
    saveAsTextFile: saves the RDD as text files
    count: returns the number of elements
    first: returns the first element
    take(n): returns the first n elements of the RDD
    foreach: applies the given function to every element of the RDD
    Differences between foreach and map:
    1. foreach has no return value, while map returns a new RDD.
    2. foreach is an action and triggers computation; map is a transformation and is evaluated lazily.
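    A short sketch of these actions on a small RDD (again assuming sc from the spark-shell):

    val nums = sc.parallelize(1 to 5)
    nums.count()                   // 5
    nums.first()                   // 1
    nums.take(3)                   // Array(1, 2, 3)
    nums.collect()                 // Array(1, 2, 3, 4, 5), pulled back to the driver
    nums.foreach(x => println(x))  // runs on the executors; output appears in executor logs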
Advanced Operators

  3. mapPartitions
    It can be seen as a variant of map that processes one partition at a time; the difference is the granularity of the call. map's input function is applied to every element of the RDD, while mapPartitions' input function is applied to each partition of the RDD.
    Why does mapPartitions work with an iterator? A partition may contain too much data to materialize in memory at once without risking an out-of-memory error, so an iterator is used to pull the elements one by one.
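    A minimal sketch of mapPartitions versus map (hypothetical data; the function receives an Iterator over the whole partition and must return an Iterator):

    val nums = sc.parallelize(1 to 10, 2)
    // map: the function is called once per element
    val viaMap = nums.map(_ * 2)
    // mapPartitions: the function is called once per partition and works on its iterator,
    // useful for setting up a per-partition resource (e.g. a connection) only once
    val viaPartitions = nums.mapPartitions(iter => iter.map(_ * 2))
    viaPartitions.collect()        // Array(2, 4, 6, ..., 20)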

  4. mapPartitionsWithIndex
    Operates on each partition of the RDD and also exposes the partition index to the function.
    scala> val list = sc.parallelize(1 to 10, 3)
    list: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[82] at parallelize at <console>:24

    scala> val func = (index:Int, it:Iterator[Int]) => it.map(s"index:$index, ele:"+_)
    func: (Int, Iterator[Int]) => Iterator[String] = <function2>

    The data is distributed evenly across the partitions:
    scala> list.mapPartitionsWithIndex(func).collect
    res38: Array[String] = Array(
        index:0, ele:1, index:0, ele:2, index:0, ele:3,
        index:1, ele:4, index:1, ele:5, index:1, ele:6,
        index:2, ele:7, index:2, ele:8, index:2, ele:9, index:2, ele:10
    )
    
  5. aggregate: aggregation (local within each partition, then global across partitions)
    Example 1:
    scala> val a = sc.parallelize(List(1,2,3,4,5), 2)
    a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[87] at parallelize at <console>:24

    scala> val func = (index:Int, it:Iterator[Int]) => it.map(s"index:$index, ele:"+_)
    func: (Int, Iterator[Int]) => Iterator[String] = <function2>

    scala> a.mapPartitionsWithIndex(func).collect
    res43: Array[String] = Array(
        index:0, ele:1, index:0, ele:2,
        index:1, ele:3, index:1, ele:4, index:1, ele:5
    )

    def aggregate[U](zeroValue: U)(seqOp: (U, Int) => U, combOp: (U, U) => U)(implicit evidence$30: scala.reflect.ClassTag[U]): U
    zeroValue: U          -- the initial value
    seqOp: (U, Int) => U  -- the local (per-partition) function
    combOp: (U, U) => U   -- the global (cross-partition) function

    a.aggregate(0)(_+_, _+_)    // => 15

    Explanation:
        first partition: 1 + 2 = 3
        second partition: 3 + 4 + 5 = 12
        result: 3 + 12 = 15

    scala> a.aggregate(100)(_+_, _+_)
    res45: Int = 315
    Explanation:
        first partition: 100 + 1 + 2 = 103
        second partition: 100 + 3 + 4 + 5 = 112
        result: 100 + 103 + 112 = 315

    scala> a.aggregate(2)(_+_, _*_)
    res47: Int = 140
    Explanation: the partition sums are 2 + 1 + 2 = 5 and 2 + 3 + 4 + 5 = 14; the global step is 2 * 5 * 14 = 140.

    Take the maximum value within each partition (then sum the maxima globally):
    scala> a.aggregate(0)(Math.max(_,_), _+_)
    res49: Int = 7

    scala> a.aggregate(10)(Math.max(_,_), _+_)
    res50: Int = 30
    Example 2:
    scala> val b = sc.parallelize(List("a", "b", "c", "d", "e", "f"), 2)
    b: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[90] at parallelize at <console>:24

    scala> val func = (index:Int, it:Iterator[String]) => it.map(s"index:$index, ele:"+_)
    func: (Int, Iterator[String]) => Iterator[String] = <function2>

    scala> b.mapPartitionsWithIndex(func).collect
    res52: Array[String] = Array(index:0, ele:a, index:0, ele:b, index:0, ele:c, index:1, ele:d, index:1, ele:e, index:1, ele:f)

    scala> b.aggregate
       def aggregate[U](zeroValue: U)(seqOp: (U, String) => U, combOp: (U, U) => U)(implicit evidence$30: scala.reflect.ClassTag[U]): U

    Call:
    scala> b.aggregate("*")(_+_, _+_)

    // The partitions are processed in parallel, so which one finishes first is not deterministic
    Result: **abc*def
        or  **def*abc
    
  6. foldByKey

    scala> val rdd1 = sc.parallelize(List("dog", "cat", "wolf", "bear"))
    rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

    scala> val rdd2 = rdd1.map(x=>(x.length, x))
    rdd2: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at map at <console>:26

    scala> rdd2.collect
    res0: Array[(Int, String)] = Array((3,dog), (3,cat), (4,wolf), (4,bear))

    scala> val rdd3 = rdd2.foldByKey
    def foldByKey(zeroValue: String)(func: (String, String) => String): org.apache.spark.rdd.RDD[(Int, String)]
    def foldByKey(zeroValue: String, numPartitions: Int)(func: (String, String) => String): org.apache.spark.rdd.RDD[(Int, String)]
    def foldByKey(zeroValue: String, partitioner: org.apache.spark.Partitioner)(func: (String, String) => String): org.apache.spark.rdd.RDD[(Int, String)]

    scala> val rdd3 = rdd2.foldByKey("")(_+_)
    rdd3: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[2] at foldByKey at <console>:28

    scala> rdd3.collect
    res1: Array[(Int, String)] = Array((4,wolfbear), (3,dogcat))

    Implementing word count with foldByKey:
    scala> val lines = sc.textFile("hdfs://bigdata01:9000/input/words")
    lines: org.apache.spark.rdd.RDD[String] = hdfs://bigdata01:9000/input/words MapPartitionsRDD[4] at textFile at <console>:24

    scala> val wordAndOne = lines.flatMap(_.split(" ")).map((_, 1))
    wordAndOne: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at map at <console>:26

    scala> wordAndOne.foldByKey(0)(_+_).collect
    res3: Array[(String, Int)] = Array((scala,4), (hive,5), ("",1), (java,1), (spark,5), (hadoop,5), (hbase,2))
  7. foreachPartition
    Like foreach, but the function is called once per partition with an iterator over that partition's elements; a typical use is opening one database connection per partition instead of one per element.
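    A small sketch of foreachPartition (the println stands in for real per-partition work such as writing a batch to an external store):

    val nums = sc.parallelize(1 to 10, 2)
    nums.foreachPartition { iter =>
      // this block runs once per partition, on an executor
      val batch = iter.toList
      println(s"writing ${batch.size} records: $batch")
    }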
  8. keyBy
    scala> val rdd1 = sc.parallelize(List("dog", "salmon", "salmon", "cat"))
    rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

    scala> val rdd2 = rdd1.keyBy(_.length)
    rdd2: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at keyBy at <console>:26

    scala> rdd2.collect
    res0: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,cat))

    scala> val rdd3 = rdd2.keyBy(_._2.length)
    rdd3: org.apache.spark.rdd.RDD[(Int, (Int, String))] = MapPartitionsRDD[2] at keyBy at <console>:28

    scala> rdd3.collect
    res1: Array[(Int, (Int, String))] = Array((3,(3,dog)), (6,(6,salmon)), (6,(6,salmon)), (3,(3,cat)))
  9. keys, values
    scala> val rdd1 = sc.parallelize(List("dog", "salmon", "salmon", "cat"))
    rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:24

    scala> val rdd2 = rdd1.map(x=>(x.length, x))
    rdd2: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[4] at map at <console>:26

    scala> rdd2.keys.collect
    res3: Array[Int] = Array(3, 6, 6, 3)

    scala> rdd2.values.collect
    res4: Array[String] = Array(dog, salmon, salmon, cat)
  10. join
    scala> val a = sc.parallelize(List("dog", "cat", "salmon", "salmon"), 2)
    a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[7] at parallelize at <console>:24

    scala> val b = a.keyBy(_.length)
    b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[8] at keyBy at <console>:26

    scala> b.collect
    res5: Array[(Int, String)] = Array((3,dog), (3,cat), (6,salmon), (6,salmon))

    scala> val c = sc.parallelize(List("cat", "salmon", "dog", "wolf", "bear"), 3)
    c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[9] at parallelize at <console>:24

    scala> val d = c.keyBy(_.length)
    d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[10] at keyBy at <console>:26

    scala> d.collect
    res7: Array[(Int, String)] = Array((3,cat), (6,salmon), (3,dog), (4,wolf), (4,bear))

    scala> b.join(d).collect
    res8: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,salmon)), (3,(dog,dog)), (3,(dog,cat)), (3,(cat,dog)), (3,(cat,cat)))
  11. leftOuterJoin
    scala> val a = sc.parallelize(List("dog", "cat", "salmon", "salmon", "rabbit", "turkey"), 3)
    a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[14] at parallelize at <console>:24

    scala> val b = a.keyBy(_.length)
    b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[15] at keyBy at <console>:26

    scala> val c = sc.parallelize(List("cat", "salmon", "dog", "wolf", "bear"), 3)
    c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[16] at parallelize at <console>:24

    scala> val d = c.keyBy(_.length)
    d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[17] at keyBy at <console>:26

    scala> d.collect
    res9: Array[(Int, String)] = Array((3,cat), (6,salmon), (3,dog), (4,wolf), (4,bear))

    scala> d.leftOuterJoin(b).collect
    res14: Array[(Int, (String, Option[String]))] = Array((6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), (6,(salmon,Some(salmon))), (6,(salmon,Some(salmon))), (3,(cat,Some(dog))), (3,(cat,Some(cat))), (3,(dog,Some(dog))), (3,(dog,Some(cat))), (4,(wolf,None)), (4,(bear,None)))
  12. groupByKey: groups the values for each key
  13. cogroup: for each key, groups together the values from several RDDs
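    A short sketch of both on hypothetical pair RDDs:

    val sales = sc.parallelize(Seq(("apple", 3), ("pear", 1), ("apple", 2)))
    val stock = sc.parallelize(Seq(("apple", 10), ("banana", 7)))

    // groupByKey: all values of the same key end up in one iterable
    sales.groupByKey().collect()
    // e.g. Array((apple,CompactBuffer(3, 2)), (pear,CompactBuffer(1)))

    // cogroup: for every key, the values from both RDDs are grouped side by side
    sales.cogroup(stock).collect()
    // e.g. Array((apple,(CompactBuffer(3, 2),CompactBuffer(10))),
    //            (pear,(CompactBuffer(1),CompactBuffer())),
    //            (banana,(CompactBuffer(),CompactBuffer(7))))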
