RDDs support three types of operations:
1) transformations
Transformations: turn one RDD into another RDD (RDDs are immutable).
For example, the map function applies the same operation to every element of an RDD, turning one RDD into another:
RDDA(1,2,3,4,5) -- map(_ + 1) --> RDDB(2,3,4,5,6)
2) actions
Actions: run a computation on the dataset and return a value to the driver program (the spark-shell console is itself a driver).
For example, the reduce function applies an aggregation and returns a single result to the client:
RDDB(2,3,4,5,6) -- reduce(_ + _) ==> returns the sum of all elements of RDDB to the client
3) cache
Cache: a caching strategy. Both in-memory and on-disk caching are supported, which is used to speed up repeated computation.
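A minimal sketch of what caching looks like in spark-shell (assuming the same sc used in the sessions below; cache, persist and unpersist are standard RDD API calls):

import org.apache.spark.storage.StorageLevel

val a = sc.parallelize(1 to 100)
a.cache()                                // shorthand for persist(StorageLevel.MEMORY_ONLY)
a.count()                                // the first action materializes and caches the data
a.unpersist()                            // drop the cached data when it is no longer needed

val b = sc.parallelize(1 to 100)
b.persist(StorageLevel.MEMORY_AND_DISK)  // spill partitions that do not fit in memory to disk

Note that caching itself is also lazy: nothing is stored until an action runs.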
Important: transformations on an RDD are lazy. Spark only records the transformation lineage and does not compute the result right away; the result is computed only when an action is applied. This design makes Spark run more efficiently. For example, rdda.map().reduce(_+_) returns a result because the chain ends with an action.
[hadoop@hadoop01 ~]$ spark-shell --master local[2]
scala> val a = sc.parallelize(1 to 9) // create rdda; the element type is inferred here
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> a.map(x=>x*2) // transformation
res0: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:27
scala>
While only transformations are applied, no job shows up in the Spark web UI.
scala> val b = a.map(x=>x*2)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at map at <console>:26
scala> b.collect // action; collect returns all elements of the dataset as an array
res1: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)
scala>
When an action runs, a job appears in the Spark web UI.
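To see what the recorded transformation lineage looks like, the RDD's toDebugString can be printed; this is just an inspection sketch in the same spark-shell session, not part of the original transcript:

val a = sc.parallelize(1 to 9)
val b = a.map(x => x * 2)     // nothing is computed yet
println(b.toDebugString)      // prints the lineage: MapPartitionsRDD <- ParallelCollectionRDD
b.collect                     // only now does a job run and show up in the web UI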
scala> val a = sc.parallelize(List("scala","java","python")) // this time rdda holds String elements (type inference again)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> val b = a.map(x=>(x,1)) // turn every element of rdda into a tuple; each element of rddb is a (String, Int) tuple
b: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:26
scala>
This pattern is the basis for word count and for sorting.
Note: programming against the Spark RDD API looks exactly like programming with Scala collections. The difference is that Scala collection code runs on a single machine, while Spark code can run either on a single machine or distributed across a cluster. So the Scala collection operations are well worth mastering.
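As a quick comparison (a sketch, assuming a plain Scala REPL or the same spark-shell), the identical chain of operations works on a local Scala collection, just without laziness or distribution:

// local Scala collection: same operators, evaluated eagerly on one JVM
val local = (1 to 10).toList.map(_ * 2).filter(_ > 5)   // List(6, 8, 10, ..., 20)

// RDD version: same shape, but lazy and potentially distributed
val distributed = sc.parallelize(1 to 10).map(_ * 2).filter(_ > 5).collect()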
scala> val a = sc.parallelize(1 to 10)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:24
scala> a.filter(_ % 2 == 0).collect // keep the even elements of rdda
res2: Array[Int] = Array(2, 4, 6, 8, 10)
scala> a.filter(_ < 4).collect // keep the elements of rdda that are less than 4
res3: Array[Int] = Array(1, 2, 3)
scala> val mapRdd = a.map(_ * 2) // step 1: multiply every element of rdda by 2 to get mapRdd
mapRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[8] at map at <console>:26
scala> mapRdd.collect
res4: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
scala> val filterRdd = mapRdd.filter(_ > 5) // step 2: keep the elements of mapRdd that are greater than 5
filterRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[9] at filter at <console>:28
scala> filterRdd.collect
res5: Array[Int] = Array(6, 8, 10, 12, 14, 16, 18, 20)
scala>
Chaining map and filter: the rdda => mapRdd => filterRdd steps above can be combined into a single chained expression.
scala> val c = sc.parallelize(1 to 10).map(_ * 2).filter(_ > 5)
c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[15] at filter at <console>:24
scala> c.collect
res7: Array[Int] = Array(6, 8, 10, 12, 14, 16, 18, 20)
scala>
2.3 flatMap:
scala> val nums = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
nums: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> nums.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> nums.flatMap(x => 1 to x).collect
res2: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9)
scala>
The result set above is obtained by expanding every element x of nums into the sequence 1 to x, i.e. 1 => 1, 2 => 1,2, 3 => 1,2,3, and so on.
Key point: what is the difference between flatMap and map?
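A small sketch of the answer (same spark-shell session assumed): map produces exactly one output element per input element, while flatMap may produce zero or more per input and flattens the results into a single collection:

val lines = sc.parallelize(List("this is", "a test"))

lines.map(_.split(" ")).collect()      // Array(Array(this, is), Array(a, test)) -- one array per line
lines.flatMap(_.split(" ")).collect()  // Array(this, is, a, test)               -- flattened into words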
Word count with RDDs:
scala> val log = sc.textFile("file:///home/hadoop/data/input.txt")
log: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/data/input.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> log.collect
res0: Array[String] = Array(this is test data, this is test sample)
scala> val splits = log.map(x => x.split("\t"))
splits: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:26
scala> splits.collect
res1: Array[Array[String]] = Array(Array(this, is, test, data), Array(this, is, test, sample))
scala> val worda = splits.map(x => (x,1))
worda: org.apache.spark.rdd.RDD[(Array[String], Int)] = MapPartitionsRDD[3] at map at <console>:28
scala> worda.collect
res2: Array[(Array[String], Int)] = Array((Array(this, is, test, data),1), (Array(this, is, test, sample),1))
scala>
This is not the result we want: the words were not flattened out and counted individually; instead each whole line was counted.
Try flatMap instead, flattening all the words into one collection before applying map:
scala> val log = sc.textFile("file:///home/hadoop/data/input.txt")
log: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/data/input.txt MapPartitionsRDD[5] at textFile at <console>:24
scala> val splits = log.flatMap(x => x.split("\t")) // flatten every word into one flat collection
splits: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[6] at flatMap at <console>:26
scala> splits.collect
res4: Array[String] = Array(this, is, test, data, this, is, test, sample)
scala> val worda = splits.map(x => (x,1))
worda: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[7] at map at <console>:28
scala> worda.collect
res5: Array[(String, Int)] = Array((this,1), (is,1), (test,1), (data,1), (this,1), (is,1), (test,1), (sample,1))
scala>
This is basically the result we want: every word has been flattened out and paired with a count of 1.
The last step uses reduceByKey to finish the word count:
scala> worda.reduceByKey(_+_).collect
res7: Array[(String, Int)] = Array((this,2), (is,2), (data,1), (sample,1), (test,2))
scala>
reduceByKey aggregates values by key. It involves a shuffle: records with the same key are sent to the same reducer, and their values are then added up one by one (a combined one-liner follows the listing below). The shuffle works roughly like this:
(this,1) (this,1) => (this,2)
(is,1) (is,1) => (is,2)
(test,1) (test,1) => (test,2)
(data,1) => (data,1)
(sample,1) => (sample,1)
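Putting the steps together, the whole word count can be written as a single chain (a sketch over the same input file path used above):

sc.textFile("file:///home/hadoop/data/input.txt")
  .flatMap(line => line.split("\t"))   // split every line into words
  .map(word => (word, 1))              // word -> (word, 1)
  .reduceByKey(_ + _)                  // sum the counts per key; this is where the shuffle happens
  .collect()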
Finally, we can also try sorting the words by how many times each one occurs, in ascending or descending order:
// ascending
scala> d.sortBy(f=>f._2).collect // d here is the (word, count) RDD produced by reduceByKey above
res22: Array[(String, Int)] = Array((data,1), (sample,1), (this,2), (is,2), (test,2))
scala>
// descending
scala> TO BE UPDATED LATER ON...
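Until that gap is filled in, here is a sketch of what the descending sort could look like (assuming d is the reduceByKey result from above); either flip sortBy's ascending flag or negate the count:

val d = worda.reduceByKey(_ + _)

d.sortBy(_._2, ascending = false).collect()   // sort by count, descending
d.sortBy(t => -t._2).collect()                // same result: negate the count and sort ascending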
2.4 mapValues:
scala> val a = sc.parallelize(List("scala","java","python"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[16] at parallelize at <console>:24
scala> val b = a.map(x=>(x.length,x)) // turn every element of rdda into a tuple
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[17] at map at <console>:26
scala> b.collect
res8: Array[(Int, String)] = Array((5,scala), (4,java), (6,python))
scala> b.mapValues("a" + _ + "a").collect // for each (key,value) tuple in rddb, add a leading and trailing "a" to the value
res9: Array[(Int, String)] = Array((5,ascalaa), (4,ajavaa), (6,apythona))
scala>
2.5 subtract: set difference
Remove from dataset a every element that also appears in dataset b, and return the result.
scala> val a = sc.parallelize(1 to 5)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val b = sc.parallelize(2 to 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> val c = a.subtract(b)
c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at subtract at <console>:28
scala> c.collect
res0: Array[Int] = Array(4, 1, 5)
scala>
2.6 intersection:
Return the elements that exist in both dataset a and dataset b.
scala> val a = sc.parallelize(1 to 5)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24
scala> val b = sc.parallelize(2 to 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> val c = a.intersection(b)
c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[13] at intersection at <console>:28
scala> c.collect
res1: Array[Int] = Array(2, 3)
scala>
2.7 cartesian:
Return the Cartesian product of dataset a and dataset b.
scala> val a = sc.parallelize(1 to 5)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at parallelize at <console>:24
scala> val b = sc.parallelize(2 to 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:24
scala> val c = a.cartesian(b)
c: org.apache.spark.rdd.RDD[(Int, Int)] = CartesianRDD[16] at cartesian at <console>:28
scala> c.collect
res2: Array[(Int, Int)] = Array((1,2), (2,2), (1,3), (2,3), (3,2), (4,2), (5,2), (3,3), (4,3), (5,3))
scala>
3. Common RDD actions
3.1 count: returns the number of elements in the dataset
scala> val a = sc.parallelize(1 to 100)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[19] at parallelize at <console>:24
scala> a.count // count the elements of rdda and return the number
res10: Long = 100
scala>
3.2 sum: returns the sum of the elements in the dataset
scala> a.sum // sum the elements of rdda; reduce (see 3.5 below) is another way to do this
res11: Double = 5050.0 // the result is a Double; to convert it to another type, do the following
scala> a.sum.toInt // convert the sum to an Int
res12: Int = 5050
scala>
3.3 max: returns the largest element in the dataset
scala> a.max
res13: Int = 100
scala>
3.4 min: returns the smallest element in the dataset
scala> a.min
res14: Int = 1
scala>
3.5 reduce: aggregates the elements of the dataset with the given function
scala> a.reduce((x,y)=>x+y)
res16: Int = 5050
scala> a.reduce(_ + _)
res17: Int = 5050
scala>
The two ways of summing above are equivalent; the second, underscore style is the more commonly used one.
3.6 first: returns the first element of the dataset
scala> val a = sc.parallelize(List("scala","java","python"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[22] at parallelize at <console>:24
scala> a.first // return the first element of rdda; the take method below can do the same
res20: String = scala
scala>
3.7 take(n): returns the first n elements of the dataset
scala> a.take(1)
res22: Array[String] = Array(scala)
scala> a.take(2)
res23: Array[String] = Array(scala, java)
scala>
Note that take(n) returns the first n elements as an array: take(1) gives the first element and take(2) the first two.
3.8 top(n): returns the n largest elements, in descending order
scala> sc.parallelize(Array(6,9,4,7,5,8)).top(2)
res29: Array[Int] = Array(9, 8) // sorted in descending order; the top two (the two largest) are returned
scala> sc.parallelize(List("scala","java","python")).top(2)
res30: Array[String] = Array(scala, python) // strings are sorted in descending lexicographic order; the top two are returned
scala>
To get the smallest n elements instead (ascending order), define a custom ordering as follows:
scala> implicit val myOrder = implicitly[Ordering[Int]].reverse // custom ordering rule: reverse the default Ordering[Int]
myOrder: scala.math.Ordering[Int] = scala.math.Ordering$$anon$4@6bcf5b19
scala> sc.parallelize(Array(6,9,4,7,5,8)).top(2)
res31: Array[Int] = Array(4, 5) // with the reversed ordering in scope, the two smallest elements are returned
scala>
3.9 takeSample:
3.10 takeOrdered:
Compare the difference between takeSample and takeOrdered.
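These two entries are still stubs in the notes; a minimal sketch of both calls (same spark-shell session assumed) to anchor the comparison:

val nums = sc.parallelize(Array(6, 9, 4, 7, 5, 8))

// takeSample returns a random sample of num elements; without a fixed seed the result changes between runs
nums.takeSample(withReplacement = false, num = 3)
nums.takeSample(withReplacement = false, num = 3, seed = 42L)

// takeOrdered returns the num smallest elements under the Ordering in scope
// (with the default Ordering[Int] these are the 3 smallest; if the reversed myOrder
// implicit from 3.8 is still in the session, the 3 largest come back instead)
nums.takeOrdered(3)

In short: takeSample draws a random subset, while takeOrdered returns a deterministic, sorted prefix.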