[Spark] Basic RDD Operations

1. Basic RDD Operations

RDDs support three types of operations:

1) transformations

transformations: convert one RDD into another RDD (RDDs are immutable, so a transformation always produces a new RDD).

For example, the map function applies the same operation to every element of an RDD, turning one RDD into another:

          RDDA(1,2,3,4,5)        map( +1 )        RDDB(2,3,4,5,6)

2) actions

actions: run a computation on the dataset and return a value to the driver program (the spark-shell console is itself a driver).

For example, the reduce function applies an aggregation and returns a single result to the client:

          RDDB(2,3,4,5,6)        reduce(a+b)        ==> returns the sum of all elements of RDDB to the client

3) cache

cache: a caching strategy. Both in-memory and on-disk caching are supported; it is used to make repeated computations more efficient.
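A minimal sketch of requesting a cache in the spark-shell (the file path is just the sample file reused later in this article; any RDD works the same way). cache() stores the RDD in memory only, persist(StorageLevel) lets you pick the level, and note that a storage level can only be assigned to an RDD once:

scala> import org.apache.spark.storage.StorageLevel
scala> val lines = sc.textFile("file:///home/hadoop/data/input.txt")
scala> lines.persist(StorageLevel.MEMORY_AND_DISK)    // keep in memory, spill to disk if it does not fit
scala> lines.count    // the first action materializes the RDD and fills the cache
scala> lines.count    // later actions read the cached data instead of re-reading the file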

Important: RDD transformations are lazy. Spark only records the chain of transformations; it does not compute the result right away. The result is computed only when an action is invoked. This design makes Spark run more efficiently. For example, rdda.map().reduce(_+_) returns a result because it ends with an action.
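A minimal sketch of this laziness in the spark-shell (using the sc provided by the shell):

scala> val rdda = sc.parallelize(1 to 5)
scala> val rddb = rdda.map(_ + 1)    // transformation: only the lineage is recorded, nothing is computed yet
scala> rddb.reduce(_ + _)            // action: the map is actually executed now; returns 20 (2+3+4+5+6)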


2. Common RDD transformations

2.1 map: apply the same operation to every element

[hadoop@hadoop01 ~]$ spark-shell --master local[2]
scala> val a = sc.parallelize(1 to 9)	// create rdda; the element type is inferred here
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> a.map(x=>x*2)	// transformation
res0: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:27

scala> 

At this point, with only a transformation applied, no job shows up in the Spark web UI.

scala> val b = a.map(x=>x*2)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at map at <console>:26

scala> b.collect	// action: collect returns all elements of the dataset to the driver as an array
res1: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)

scala> 

When the action runs, a job appears in the Spark web UI.

(Screenshot: the job triggered by collect, as shown in the Spark web UI)


scala> val a = sc.parallelize(List("scala","java","python"))     // now rdda is an RDD[String] (type inference again)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> val b = a.map(x=>(x,1))    // turn each element of rdda into a tuple; every element of rddb is a (String, Int) tuple
b: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:26

scala> 

This pattern is the starting point for word count and for sorting.

Note: programming against the Spark RDD API looks exactly the same as programming against Scala's own collections. The difference is that Scala collection code runs on a single machine, while Spark code can run either on a single machine or distributed across a cluster. That is why Scala collection operations are essential to master.
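A small illustration of that parallel (a sketch, same spark-shell session assumed): the Scala collection version and the RDD version of map read identically.

scala> List(1, 2, 3).map(_ * 2)                             // plain Scala collection, computed in the local JVM: List(2, 4, 6)
scala> sc.parallelize(List(1, 2, 3)).map(_ * 2).collect     // the same logic on an RDD: Array(2, 4, 6)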


2.2 filter: keep only the elements that satisfy a predicate

scala> val a = sc.parallelize(1 to 10)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> a.filter(_ % 2 == 0).collect	// keep the even numbers in rdda
res2: Array[Int] = Array(2, 4, 6, 8, 10)

scala> a.filter(_ < 4).collect	// keep the elements of rdda that are less than 4
res3: Array[Int] = Array(1, 2, 3)

scala> val mapRdd = a.map(_ * 2)	// step 1: multiply every element of rdda by 2 to get mapRdd
mapRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[8] at map at <console>:26

scala> mapRdd.collect	
res4: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

scala> val filterRdd = mapRdd.filter(_ > 5)	// step 2: keep the elements of mapRdd that are greater than 5
filterRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[9] at filter at <console>:28

scala> filterRdd.collect
res5: Array[Int] = Array(6, 8, 10, 12, 14, 16, 18, 20)

scala> 

Chaining map and filter: the rdda => mapRdd => filterRdd pipeline above can be collapsed into a single chained expression.

scala> val c = sc.parallelize(1 to 10).map(_ * 2).filter(_ > 5)
c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[15] at filter at <console>:24

scala> c.collect
res7: Array[Int] = Array(6, 8, 10, 12, 14, 16, 18, 20)

scala> 


2.3 flatMap: map each element to a sequence and flatten the results

scala> val nums = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
nums: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> nums.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)                             

scala> nums.flatMap(x => 1 to x).collect
res2: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9)

scala> 

Here the result Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, ...) comes from expanding every element x of nums into the sequence 1 to x and then flattening everything into one collection: 1 => 1; 2 => 1,2; 3 => 1,2,3; and so on.

Key question: what is the difference between flatMap and map?
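A quick way to see it on the nums RDD above (a sketch): map produces exactly one output element per input element, so the ranges stay nested, while flatMap flattens them into a single collection.

scala> nums.map(x => 1 to x).collect       // one Range per input element: the results stay nested
scala> nums.flatMap(x => 1 to x).collect   // the ranges are flattened into a single Array(1, 1, 2, 1, 2, 3, ...)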

Word count with RDDs:

scala> val log = sc.textFile("file:///home/hadoop/data/input.txt")
log: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/data/input.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> log.collect
res0: Array[String] = Array(this	is	test	data, this	is	test	sample)

scala> val splits = log.map(x => x.split("\t"))
splits: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:26

scala> splits.collect
res1: Array[Array[String]] = Array(Array(this, is, test, data), Array(this, is, test, sample))

scala> val worda = splits.map(x => (x,1))
worda: org.apache.spark.rdd.RDD[(Array[String], Int)] = MapPartitionsRDD[3] at map at <console>:28

scala> worda.collect
res2: Array[(Array[String], Int)] = Array((Array(this, is, test, data),1), (Array(this, is, test, sample),1))

scala>

This is not the result we want: the words are not split out and counted individually; instead each whole line is counted as one unit.

Let's try flatMap instead, flattening all the words into one collection before applying map:

scala> val log = sc.textFile("file:///home/hadoop/data/input.txt")
log: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/data/input.txt MapPartitionsRDD[5] at textFile at <console>:24

scala> val splits = log.flatMap(x => x.split("\t")) 	// flatten every word into a single flat collection
splits: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[6] at flatMap at <console>:26

scala> splits.collect
res4: Array[String] = Array(this, is, test, data, this, is, test, sample)

scala> val worda = splits.map(x => (x,1))
worda: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[7] at map at <console>:28

scala> worda.collect
res5: Array[(String, Int)] = Array((this,1), (is,1), (test,1), (data,1), (this,1), (is,1), (test,1), (sample,1))

scala>

Now we essentially have what we want: every word has been flattened into a single collection and paired with a count of 1.

The last step uses reduceByKey to finish the word count:

scala> worda.reduceByKey(_+_).collect	
res7: Array[(String, Int)] = Array((this,2), (is,2), (data,1), (sample,1), (test,2))

scala> 

reduceByKey aggregates values by key. It involves a shuffle: records with the same key are sent to the same reducer, and their values are then added up one by one. The shuffle works like this:

(this,1)	(this,1)	=> (this,2)
(is,1)		(is,1)		=> (is,2)
(test,1)	(test,1)	=> (test,2)
(data,1)			=> (data,1)
(sample,1)			=> (sample,1)
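For comparison, a sketch: the same counts could be produced with groupByKey followed by a per-key sum, but reduceByKey is usually preferred because it combines values on the map side before the shuffle, so less data is moved across the network.

scala> worda.groupByKey().mapValues(_.sum).collect    // same counts as above, but every single (word,1) pair is shuffled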

Finally, we can also try sorting the words by how many times each one appears, in ascending or descending order:

// ascending order (d here is the reduceByKey result, e.g. val d = worda.reduceByKey(_+_))
scala> d.sortBy(f=>f._2).collect
res22: Array[(String, Int)] = Array((data,1), (sample,1), (this,2), (is,2), (test,2))

scala> 
// descending order
scala>  TO BE UPDATED LATER ON...
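As a sketch of the descending case (sortBy accepts an ascending flag as its second argument; d is again the reduceByKey result):

scala> d.sortBy(f => f._2, false).collect    // largest counts first, e.g. Array((this,2), (is,2), (test,2), (data,1), (sample,1)); ties may come back in any order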

2.4 mapValues: leave the key alone and transform only the value (a handy trick)

scala> val a = sc.parallelize(List("scala","java","python")) 
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[16] at parallelize at <console>:24

scala> val b = a.map(x=>(x.length,x))    // turn every element of rdda into a (length, word) tuple
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[17] at map at <console>:26

scala> b.collect
res8: Array[(Int, String)] = Array((5,scala), (4,java), (6,python))

scala> b.mapValues("a" + _ + "a").collect    // add the character "a" before and after the value of every (key, value) tuple in rddb
res9: Array[(Int, String)] = Array((5,ascalaa), (4,ajavaa), (6,apythona))

scala> 

2.5 subtract: set difference

Remove from dataset a every element that also appears in dataset b, and return the resulting dataset.

scala> val a = sc.parallelize(1 to 5)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val b = sc.parallelize(2 to 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> val c = a.subtract(b)
c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at subtract at <console>:28

scala> c.collect
res0: Array[Int] = Array(4, 1, 5)

scala> 

2.6 intersection: set intersection

Return the elements that appear in both dataset a and dataset b.

scala> val a = sc.parallelize(1 to 5)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> val b = sc.parallelize(2 to 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24

scala> val c = a.intersection(b)
c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[13] at intersection at <console>:28

scala> c.collect
res1: Array[Int] = Array(2, 3)

scala> 

2.7 cartesian: Cartesian product

Return the Cartesian product of dataset a and dataset b.

scala> val a = sc.parallelize(1 to 5)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at parallelize at <console>:24

scala> val b = sc.parallelize(2 to 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:24

scala> val c = a.cartesian(b)
c: org.apache.spark.rdd.RDD[(Int, Int)] = CartesianRDD[16] at cartesian at <console>:28

scala> c.collect
res2: Array[(Int, Int)] = Array((1,2), (2,2), (1,3), (2,3), (3,2), (4,2), (5,2), (3,3), (4,3), (5,3))

scala> 


3. Common RDD actions

3.1 count: return the number of elements in the dataset

scala> val a = sc.parallelize(1 to 100)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[19] at parallelize at <console>:24

scala> a.count    // count the elements in rdda and return that number
res10: Long = 100

scala> 

3.2 sum: return the sum of the elements in the dataset

scala> a.sum    // sum the elements of rdda; reduce (see 3.5) is another way to do this
res11: Double = 5050.0    // the result is a Double; to convert it to another type, do the following

scala> a.sum.toInt    // convert the sum to an Int
res12: Int = 5050

scala> 

3.3 max: return the maximum element

scala> a.max
res13: Int = 100

scala> 

3.4 min: return the minimum element

scala> a.min
res14: Int = 1

scala> 

3.5 reduce(func): combine the elements pairwise with the function func and return the result

scala> a.reduce((x,y)=>x+y)
res16: Int = 5050

scala> a.reduce(_ + _)
res17: Int = 5050

scala> 
The two ways of summing above are equivalent; the second, placeholder form (_ + _) is the more common way to write it.
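reduce is not limited to sums: any function that combines two elements into one works. A small sketch on the same a:

scala> a.reduce((x, y) => if (x > y) x else y)    // pairwise maximum: 100, equivalent to a.max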


3.6 first: return the first element of the dataset

scala> val a = sc.parallelize(List("scala","java","python")) 
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[22] at parallelize at <console>:24

scala> a.first    // return the first element of rdda; take, shown below, can do the same
res20: String = scala

scala> 

3.7 take(n): return the first n elements of the dataset

scala> a.take(1)
res22: Array[String] = Array(scala)

scala> a.take(2)
res23: Array[String] = Array(scala, java)

scala> 

In take(n), n is a count of elements starting from the head of the dataset, not a zero-based index: take(1) returns the first element, take(2) the first two.

3.8 top(n): return the n largest elements, in descending order

scala> sc.parallelize(Array(6,9,4,7,5,8)).top(2)
res29: Array[Int] = Array(9, 8)    // sorted in descending order; the first two, i.e. the two largest, are returned

scala> sc.parallelize(List("scala","java","python")).top(2)
res30: Array[String] = Array(scala, python)    // strings sorted in descending lexicographic order; the first two are returned

scala> 

To have top return the smallest elements instead, define a custom reversed ordering as follows:

scala> implicit val myOrder = implicitly[Ordering[Int]].reverse    // custom ordering rule: reverse the default Int ordering
myOrder: scala.math.Ordering[Int] = scala.math.Ordering$$anon$4@6bcf5b19

scala> sc.parallelize(Array(6,9,4,7,5,8)).top(2)
res31: Array[Int] = Array(4, 5)    // now the two smallest elements are returned

scala> 

3.9 takeSample: return a random sample of elements as an array
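A sketch of takeSample usage: takeSample(withReplacement, num, [seed]) returns an array of num randomly chosen elements to the driver.

scala> val a = sc.parallelize(1 to 100)
scala> a.takeSample(false, 3)         // 3 distinct elements chosen at random; the result varies per run
scala> a.takeSample(true, 3, 99L)     // sampling with replacement, made reproducible by the seed 99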


3.10 takeOrdered: return the n smallest elements according to an ordering
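A sketch of takeOrdered usage: takeOrdered(n) returns the n smallest elements under the natural (or a supplied) ordering, i.e. the mirror image of top. This assumes a fresh spark-shell session where the reversed implicit ordering from 3.8 is no longer in scope.

scala> sc.parallelize(Array(6,9,4,7,5,8)).takeOrdered(2)                          // the two smallest: Array(4, 5)
scala> sc.parallelize(Array(6,9,4,7,5,8)).takeOrdered(2)(Ordering[Int].reverse)   // explicit reversed ordering: Array(9, 8), same as top(2)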

Comparing takeSample and takeOrdered: takeSample returns a random selection of elements (with or without replacement), while takeOrdered returns a deterministic result, the n elements that come first under the given ordering (by default the smallest, the mirror image of top).
