Characteristics of RDDs
SparkContext
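SparkContext is the entry point for all RDD operations. In spark-shell it is created automatically and exposed as the variable sc (used throughout the examples below); in a standalone application you create it yourself. A minimal sketch, where the application name and the local[*] master URL are illustrative choices:
import org.apache.spark.{SparkConf, SparkContext}

// Configure and create the context (app name and master are placeholders)
val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

// ... build and operate on RDDs through sc ...

// Release resources when the application finishes
sc.stop()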
Two ways to create an RDD
Creating an RDD from a parallelized collection: call SparkContext's parallelize method on an existing collection in the driver program. Each element of the collection is copied to form a distributed dataset that can be operated on in parallel.
scala> val info = sc.parallelize(Array(1,3,8,9))
info: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at parallelize at <console>:24
scala> info.collect.foreach(println)
1
3
8
9
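parallelize also accepts an optional second argument that sets the number of partitions (slices) of the resulting RDD. A small sketch reusing the same data, with an arbitrary partition count of 4:
// Explicitly request 4 partitions for the parallelized collection
val info4 = sc.parallelize(Array(1, 3, 8, 9), 4)
info4.getNumPartitions   // 4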
Creating an RDD from an external dataset:
In Spark, distributed datasets can be created from any storage source supported by Hadoop, such as HDFS, Cassandra, HBase, or even the local file system. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
SparkContext's textFile() method creates an RDD from a text file. It takes the file's URI as its argument (a local path on the machine, an hdfs:// path, etc.) and reads the file's data:
// Load data from the local file system
val lines = sc.textFile("file:///root/data/wc.txt")
// Load data from a distributed file system (HDFS)
val lines = sc.textFile("hdfs://linux121:9000/user/root/data/uaction.dat")
RDDs support two types of operations: Transformations and Actions.
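Transformations (such as map and filter) are lazy: they only record the lineage and return a new RDD without running anything, while Actions (such as count and collect) trigger an actual job. A small sketch illustrating the distinction:
// No job runs here: map is a Transformation and is only recorded
val doubled = sc.parallelize(1 to 5).map(_ * 2)

// count is an Action: it launches a job and returns a result to the Driver
val n = doubled.count()   // 5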
Narrow dependencies
map(func): applies func to every element of the dataset and returns a new RDD.
scala> val rdd1 = sc.parallelize(1 to 10).map(_*2).collect.foreach(println)
2
4
6
8
10
12
14
16
18
20
rdd1: Unit = ()
filter(func): applies func to every element of the dataset and returns a new RDD containing only the elements for which func returns true.
// collect is an Action operator: it triggers job execution and gathers all of the RDD's elements from the Executors to the Driver. Do not use it in production.
scala> val rdd1 = sc.parallelize(1 to 10).map(_*2).filter(_>10).collect.foreach(println)
12
14
16
18
20
rdd1: Unit = ()
flatMap(func): similar to map, but each input element is mapped to 0 or more output elements.
scala> val rdd4 = sc.textFile("data/wc.txt").flatMap(_.split("\\s+")).collect.foreach(println)
hadoop
mapreduce
yarn
hdfs
hadoop
mapreduce
mapreduce
yarn
lagou
lagou
lagou
rdd4: Unit = ()
mapPartitions(func): similar to map, but where map applies func to each element, mapPartitions applies func to an entire partition. If an RDD has N elements and M partitions (N >> M), map's function is called N times, while the function passed to mapPartitions is called only M times, processing all elements of a partition at once.
scala> val rdd4 = sc.textFile("data/wc.txt")
rdd4: org.apache.spark.rdd.RDD[String] = data/wc.txt MapPartitionsRDD[51] at textFile at <console>:24
scala> rdd4.getNumPartitions
res7: Int = 2
scala> rdd4.partitions.length
res8: Int = 2
scala> rdd4.mapPartitions{iter =>Iterator(s"${iter.toList}")}.collect
res9: Array[String] = Array(List(hadoop mapreduce yarn, hdfs hadoop mapreduce), List(mapreduce yarn lagou, lagou, lagou))
scala> rdd4.mapPartitions{iter =>Iterator(s"${iter.toArray.mkString("-")}")}.collect
res10: Array[String] = Array(hadoop mapreduce yarn-hdfs hadoop mapreduce, mapreduce yarn lagou-lagou-lagou)
mapPartitionsWithIndex(func): similar to mapPartitions, but the function also receives the partition index.
scala> rdd4.mapPartitionsWithIndex{(idx,iter) =>Iterator(s"$idx:${iter.toArray.mkString("-")}")}.collect
res11: Array[String] = Array(0:hadoop mapreduce yarn-hdfs hadoop mapreduce, 1:mapreduce yarn lagou-lagou-lagou)
scala> rdd4.mapPartitions(iter=> iter.map(_*2)).collect
res12: Array[String] = Array(hadoop mapreduce yarnhadoop mapreduce yarn, hdfs hadoop mapreducehdfs hadoop mapreduce, mapreduce yarn lagoumapreduce yarn lagou, lagoulagou, lagoulagou)
The difference between map and mapPartitions: map invokes its function once per element, whereas mapPartitions invokes it once per partition on an iterator over that partition's elements, which reduces function-call overhead and lets per-partition setup work be shared (see the sketch below).
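A common reason to prefer mapPartitions is that per-partition setup, for example opening a database connection, runs once per partition instead of once per element. A hedged sketch, where openConnection/lookup are hypothetical user functions shown only as comments:
val nums = sc.parallelize(1 to 10, 2)

// map: the function is invoked once per element (10 times here)
nums.map(_ * 2)

// mapPartitions: the function is invoked once per partition (2 times here)
// and receives an iterator over that partition's elements, so expensive
// setup can be shared across the whole partition
nums.mapPartitions { iter =>
  // val conn = openConnection()                          // hypothetical: once per partition
  // val results = iter.map(x => lookup(conn, x)).toList  // materialize before closing
  // conn.close()
  // results.iterator
  iter.map(_ * 2)   // simple runnable stand-in
}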
groupBy(func): groups elements by the return value of the given function; values with the same key are placed in one iterator.
scala> val rdd = sc.parallelize(1 to 10).groupBy(_%3)
rdd: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[61] at groupBy at <console>:24
scala> rdd.collect
res18: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(9, 6, 3)), (1,CompactBuffer(7, 4, 10, 1)), (2,CompactBuffer(5, 2, 8)))
glom(): turns each partition into an array, producing a new RDD of type RDD[Array[T]].
// Split the RDD's elements into groups of 10 (within each partition)
scala> val rdd = sc.parallelize(1 to 102)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:24
scala> rdd.glom.map(_.sliding(10, 10).toArray).collect
res3: Array[Array[Array[Int]]] = Array(Array(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), Array(11, 12, 13, 14, 15, 16, 17)), Array(Array(18, 19, 20, 21, 22, 23, 24, 25, 26, 27), Array(28, 29, 30, 31, 32, 33, 34)), Array(Array(35, 36, 37, 38, 39, 40, 41, 42, 43, 44), Array(45, 46, 47, 48, 49, 50, 51)), Array(Array(52, 53, 54, 55, 56, 57, 58, 59, 60, 61), Array(62, 63, 64, 65, 66, 67, 68)), Array(Array(69, 70, 71, 72, 73, 74, 75, 76, 77, 78), Array(79, 80, 81, 82, 83, 84, 85)), Array(Array(86, 87, 88, 89, 90, 91, 92, 93, 94, 95), Array(96, 97, 98, 99, 100, 101, 102)))
sample(withReplacement, fraction, seed): sampling operator. Randomly samples a fraction of the data using the specified random seed; withReplacement indicates whether sampling is done with replacement (true) or without replacement (false).
// Sample the data. fraction is the expected sampling ratio; the sample size is approximate
// Sampling with replacement, with a fixed seed
scala> rdd.sample(true, 0.2, 2).collect
res4: Array[Int] = Array(2, 4, 5, 7, 9, 15, 34, 38, 38, 39, 40, 42, 45, 45, 52, 53, 58, 62, 70, 70, 71, 82, 87)
// Sampling without replacement, with a fixed seed
scala> rdd.sample(false, 0.2, 2).collect
res5: Array[Int] = Array(1, 4, 11, 12, 15, 17, 32, 38, 42, 43, 45, 46, 51, 52, 57, 59, 60, 70, 71, 72, 75, 76, 79, 85, 87, 90, 93)
// Sampling without replacement, without a fixed seed
scala> rdd.sample(false, 0.2).collect
res6: Array[Int] = Array(1, 18, 20, 23, 29, 30, 39, 42, 55, 76, 85, 86, 90, 94, 95)
distinct([numTasks]): removes duplicate elements and returns a new RDD. An optional numTasks argument changes the number of partitions of the resulting RDD.
scala> val random = scala.util.Random
random: util.Random.type = scala.util.Random$@4bdc0fb0
scala> val arr = (1 to 20).map(x => random.nextInt(10))
arr: scala.collection.immutable.IndexedSeq[Int] = Vector(0, 7, 7, 8, 8, 2, 8, 1, 3, 9, 1, 6, 8, 2, 9, 8, 9, 9, 4, 1)
scala> val rdd = sc.makeRDD(arr)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:26
scala> rdd.distinct.collect
res1: Array[Int] = Array(4, 0, 6, 8, 2, 1, 3, 7, 9)
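As noted above, distinct can take a partition count for the deduplicated result. A small sketch continuing with the rdd built from arr (the value 2 is arbitrary):
// Deduplicate and place the result in 2 partitions
val distinct2 = rdd.distinct(2)
distinct2.getNumPartitions   // 2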
coalesce(numPartitions): reduces the number of partitions, without a shuffle (by default).
// Repartitioning an RDD
scala> val rdd1 = sc.range(1, 10000, numSlices=10)
rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[8] at range at <console>:24
scala> val rdd2 = rdd1.filter(_%2==0)
rdd2: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[9] at filter at <console>:25
scala> rdd2.getNumPartitions
res2: Int = 10
// Reduce the number of partitions; both approaches take effect
scala> val rdd3 = rdd2.repartition(5)
rdd3: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[13] at repartition at <console>:25
scala> rdd3.getNumPartitions
res3: Int = 5
scala> val rdd4 = rdd2.coalesce(5)
rdd4: org.apache.spark.rdd.RDD[Long] = CoalescedRDD[14] at coalesce at <console>:25
scala> rdd4.getNumPartitions
res4: Int = 5
repartition(numPartitions): increases or decreases the number of partitions, with a shuffle.
// Increase the number of partitions
scala> val rdd5 = rdd2.repartition(20)
rdd5: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[18] at repartition at <console>:25
scala> rdd5.getNumPartitions
res5: Int = 20
// Increasing the number of partitions this way has no effect
scala> val rdd6 = rdd2.coalesce(20)
rdd6: org.apache.spark.rdd.RDD[Long] = CoalescedRDD[19] at coalesce at <console>:25
scala> rdd6.getNumPartitions
res6: Int = 10
// The correct way to increase the number of partitions with coalesce
val rdd6 = rdd2.coalesce(20, true)
scala> rdd6.getNumPartitions
res7: Int = 20
sortBy(func, [ascending], [numTasks]): applies func to each element and sorts the RDD by the resulting keys.
// Sorting RDD elements
scala> val random = scala.util.Random
random: util.Random.type = scala.util.Random$@4bdc0fb0
scala> val arr = (1 to 20).map(x => random.nextInt(10))
arr: scala.collection.immutable.IndexedSeq[Int] = Vector(7, 2, 3, 5, 6, 5, 0, 1, 2, 3, 4, 8, 5, 1, 6, 2, 3, 9, 4, 5)
scala> val rdd = sc.makeRDD(arr)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at makeRDD at <console>:26
scala> rdd.collect
res7: Array[Int] = Array(7, 2, 3, 5, 6, 5, 0, 1, 2, 3, 4, 8, 5, 1, 6, 2, 3, 9, 4, 5)
// The data is globally ordered, ascending by default
scala> rdd.sortBy(x=>x).collect
res8: Array[Int] = Array(0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8, 9)
// Descending order
scala> rdd.sortBy(x=>x,false).collect
res9: Array[Int] = Array(9, 8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 0)
The difference between coalesce and repartition: repartition always shuffles and can either increase or decrease the partition count, while coalesce defaults to shuffle = false and can therefore only reduce it (which is why coalesce(20) above left the partition count at 10).
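In the Spark source, repartition(n) is simply coalesce(n, shuffle = true), so the two calls below are equivalent. A sketch continuing with rdd2 from the example above:
// Both calls shuffle the data of rdd2 into 20 partitions
val viaRepartition = rdd2.repartition(20)
val viaCoalesce    = rdd2.coalesce(20, shuffle = true)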
Wide-dependency (shuffle) operators: groupBy, distinct, repartition, sortBy, intersection, subtract.
Intersection, union, and difference of RDDs: intersection(otherRDD), union(otherRDD), subtract(otherRDD).
scala> val rdd1 = sc.range(1, 21)
rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[32] at range at <console>:24
scala> val rdd2 = sc.range(10, 31)
rdd2: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[34] at range at <console>:24
// Intersection
scala> rdd1.intersection(rdd2).sortBy(x=>x).collect
res10: Array[Long] = Array(10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
// Union
scala> rdd1.union(rdd2).sortBy(x=>x).collect
res11: Array[Long] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 20, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)
// Difference
scala> rdd1.subtract(rdd2).sortBy(x=>x).collect
res12: Array[Long] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
// Check the number of partitions
scala> rdd1.intersection(rdd2).getNumPartitions
res13: Int = 6
scala> rdd1.union(rdd2).getNumPartitions
res14: Int = 12
scala> rdd1.subtract(rdd2).getNumPartitions
res15: Int = 6
cartesian(otherRDD): Cartesian product.
// Cartesian product
scala> val rdd1 = sc.range(1, 5)
rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[73] at range at <console>:24
scala> val rdd2 = sc.range(6, 10)
rdd2: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[75] at range at <console>:24
scala> rdd1.cartesian(rdd2).collect
res16: Array[(Long, Long)] = Array((1,6), (1,7), (1,8), (1,9), (2,6), (2,7), (2,8), (2,9), (3,6), (3,7), (3,8), (3,9), (4,6), (4,7), (4,8), (4,9))
// Check the number of partitions
scala> rdd1.cartesian(rdd2).getNumPartitions
res17: Int = 36
zip(otherRDD): combines two RDDs into a key-value RDD. The two RDDs must have the same number of partitions and the same number of elements in each partition; otherwise an exception is thrown.
// Zip operation
scala> rdd1.zip(rdd2).collect
res18: Array[(Long, Long)] = Array((1,6), (2,7), (3,8), (4,9))
scala> rdd1.zip(rdd2).getNumPartitions
res19: Int = 6
// zip requires that the two RDDs have the same number of partitions and the same number of elements in each partition; otherwise an exception is thrown
scala> val rdd2 = sc.range(6, 20)
rdd2: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[81] at range at <console>:24
scala> rdd1.zip(rdd2).collect
23/02/23 13:42:56 WARN scheduler.TaskSetManager: Lost task 4.0 in stage 30.0 (TID 192, 192.168.88.121, executor 1): org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition
union is a narrow dependency. The resulting RDD's partition count is the sum of the two RDDs' partition counts.
cartesian is a narrow dependency. The resulting RDD's partition count is the product of the two RDDs' partition counts (6 × 6 = 36 above).