RDD Partitions and Repartitioning

RDD Partitions

An RDD is divided into many partitions that are distributed across the nodes of the cluster, and the number of partitions determines the granularity of parallel computation on that RDD. A partition is a logical concept: the old and new partitions before and after a transformation may physically be the same block of memory or storage, an optimization that prevents the unbounded growth in memory demand that functional immutability would otherwise cause. Users can obtain the number of partitions of an RDD with the partitions method, and they can also set the number of partitions themselves. If none is specified, a default is used: the number of CPU cores allocated to the program, or, when the RDD is created from an HDFS file, the number of data blocks in the file.

scala> val part=sc.textFile("file:/hadoop/spark/README.md")
part: org.apache.spark.rdd.RDD[String] = file:/hadoop/spark/README.md MapPartitionsRDD[5] at textFile at <console>:24
scala> part.partitions.size
res2: Int = 2

scala> val part=sc.textFile("file:/hadoop/spark/README.md",4)
part: org.apache.spark.rdd.RDD[String] = file:/hadoop/spark/README.md MapPartitionsRDD[7] at textFile at <console>:24
scala> part.partitions.size
res3: Int = 4
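
When an RDD is created from a local collection and no partition count is passed, the default is sc.defaultParallelism, i.e. the number of cores allocated to the application. A minimal sketch in the spark-shell (the actual counts depend on your core configuration):

scala> sc.defaultParallelism
scala> val nums=sc.parallelize(1 to 100)
scala> nums.partitions.size      // equals sc.defaultParallelism when no count is given
scala> val nums8=sc.parallelize(1 to 100,8)
scala> nums8.partitions.size     // 8, the explicitly requested number of partitions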

RDD Partition Computation (Iterator)

In Spark, RDD computation is carried out per partition, and the compute functions are composed over iterators, so intermediate results do not have to be stored. Partition-level computation is usually performed with operations such as mapPartitions: the input function of mapPartitions is applied to each partition, i.e., the contents of each partition are processed as a whole:

def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

In the example below, the function iterfunc pairs each element of a partition with the element that follows it to form a tuple. Because the last element of a partition has no successor, (3,4) and (6,7) do not appear in the result. (The tuples of each partition also come out in reverse order, since the result list is built by prepending with ::=.)

val a=sc.parallelize(1 to 9,3)
#view the contents of each partition
scala> a.mapPartitionsWithIndex{(partid,iter)=>{
     | var part_map=scala.collection.mutable.Map[String,List[Int]]()
     | var part_name="part_"+partid
     | part_map(part_name)=List[Int]()
     | while(iter.hasNext){
     | part_map(part_name):+=iter.next()}
     | part_map.iterator}}.collect
res9: Array[(String, List[Int])] = Array((part_0,List(1, 2, 3)), (part_1,List(4, 5, 6)), (part_2,List(7, 8, 9)))

scala> def iterfunc [T](iter:Iterator[T]):Iterator[(T,T)]={
     | var res=List[(T,T)]()
     | var pre=iter.next
     | while(iter.hasNext){
     | val cur=iter.next
     | res::=(pre,cur)
     | pre=cur}
     | res.iterator}
iterfunc: [T](iter: Iterator[T])Iterator[(T, T)]
scala> a.mapPartitions(iterfunc).collect
res10: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))

RDD Partitioners

How an RDD is partitioned is critical for shuffle-type operations, because it determines the kind of dependency between the operation's parent RDDs and its child RDD. Take join as an example: if the inputs are co-partitioned, the two parent RDDs, and the parents and the child, share a consistent partition arrangement, i.e., the same key is guaranteed to map to the same partition, which yields a narrow dependency. Without co-partitioning, a wide dependency results. Co-partitioning here means specifying the same partitioner so that a consistent partition arrangement is produced before and after the operation.
Spark provides two partitioners by default: the hash partitioner (HashPartitioner) and the range partitioner (RangePartitioner). A partitioner exists only for RDDs of (K, V) type; for non-(K, V) RDDs the partitioner is None.
In the program below, a MapPartitionsRDD whose partitioner is None is constructed first; then groupByKey is applied to it to produce the group_rdd variable. For the groupByKey operation, a new HashPartitioner object is created here.

scala> var part=sc.textFile("file:/hadoop/spark/README.md")
part: org.apache.spark.rdd.RDD[String] = file:/hadoop/spark/README.md MapPartitionsRDD[12] at textFile at <console>:24
scala> part.partitioner
res11: Option[org.apache.spark.Partitioner] = None
scala> val group_rdd=part.map(x=>(x,x)).groupByKey(new org.apache.spark.HashPartitioner(4))
group_rdd: org.apache.spark.rdd.RDD[(String, Iterable[String])] = ShuffledRDD[16] at groupByKey at <console>:26
scala> group_rdd.partitioner
res14: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@4)
#view the contents of each partition
scala> part.mapPartitionsWithIndex{(partid,iter)=>{
     |  var part_map=scala.collection.mutable.Map[String,List[String]]()
     |  var part_name="part_"+partid
     |  part_map(part_name)=List[String]()
     | while(iter.hasNext){
     | part_map(part_name):+=iter.next()}
     |  part_map.iterator}}.collect
res19: Array[(String, List[String])] = Array((part_0,List(# Apache Spark, "", Spark is a fast and general cluster computing system for Big Data. It provides, high-level APIs in Scala, Java, Python, and R, and an optimized engine that, supports general computation graphs for data analysis. It also supports a, rich set of higher-level tools including Spark SQL for SQL and DataFrames,, MLlib for machine learning, GraphX for graph processing,, and Spark Streaming for stream processing., "", , "", "", ## Online Documentation, "", You can find the latest Spark documentation, including a programming, guide, on the [project web page](http://spark.apache.org/documentation.html)., This README file only contains basic setup instructions., "", ## Building Spark, "", Spark ...
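
To make the co-partitioning described above concrete, here is a minimal sketch (with made-up key/value data) that partitions two pair RDDs with the same HashPartitioner before joining them. Because both parents already share the partitioner, the join can be served by a narrow dependency with no extra shuffle:

scala> import org.apache.spark.HashPartitioner
scala> val p=new HashPartitioner(4)
scala> val left=sc.parallelize(Seq((1,"a"),(2,"b"),(3,"c"))).partitionBy(p)
scala> val right=sc.parallelize(Seq((1,"x"),(2,"y"),(4,"z"))).partitionBy(p)
scala> val joined=left.join(right)
scala> joined.partitioner     // Some(HashPartitioner) -- the parents' partitioner is reused
scala> joined.collect         // only keys present on both sides, e.g. (1,(a,x)) and (2,(b,y))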

Partitioning Functions

coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]
repartition(numPartitions: Int): RDD[T]

Both coalesce and repartition repartition an RDD. The coalesce operation uses a HashPartitioner to repartition; its first parameter is the target number of partitions and its second indicates whether to shuffle, which defaults to false. repartition is implemented as coalesce with the second parameter set to true. If the target number of partitions is larger than the original number, the shuffle parameter must be set to true, otherwise the number of partitions remains unchanged.
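
A minimal sketch of this behaviour in the spark-shell:

scala> val rdd=sc.parallelize(1 to 100,4)
scala> rdd.coalesce(2).partitions.size        // 2: shrinking works without a shuffle
scala> rdd.coalesce(8).partitions.size        // still 4: growing has no effect unless shuffle=true
scala> rdd.coalesce(8,true).partitions.size   // 8
scala> rdd.repartition(8).partitions.size     // 8: repartition = coalesce with shuffle=true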

glom(): RDD[Array[T]]

The glom operation turns all of the data of type T in each partition of the RDD into an array of element type T, producing an RDD[Array[T]].
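
For example, with the same three-partition RDD of 1 to 9 used earlier:

scala> val a=sc.parallelize(1 to 9,3)
scala> a.glom().collect       // Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9))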

The mapPartitions operation is similar to map, except that the function it maps receives an iterator over each partition of the RDD rather than each individual element; mapPartitionsWithIndex works like mapPartitions but takes one additional input parameter, the partition index.

scala> var rdd1=sc.makeRDD(1 to 5,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at makeRDD at <console>:24
#use mapPartitions to sum the numbers in each partition
scala> var rdd3=rdd1.mapPartitions{x=>{
     | var result=List[Int]()
     | var i=0
     | while(x.hasNext){
     | i+=x.next()}
     | result.::(i).iterator}}
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[24] at mapPartitions at <console>:26
scala> rdd3.collect
res20: Array[Int] = Array(3, 12)
scala> var rdd2=rdd1.mapPartitionsWithIndex{
     | (x,iter)=>{
     |  var result=List[String]()
     | var i=0
     | while(iter.hasNext){
     | i+=iter.next()}
     | result.::(x+"|"+i).iterator}}
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[25] at mapPartitionsWithIndex at <console>:26
scala> rdd2.collect
res21: Array[String] = Array(0|3, 1|12)

partitionBy(partitioner: Partitioner): RDD[(K, V)]

The partitionBy operation uses the given partitioner to generate a new ShuffledRDD, repartitioning the original RDD.

scala> var rdd1=sc.makeRDD(Array((1,"A"),(2,"B"),(3,"C"),(4,"D")),2)
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[28] at makeRDD at <console>:24
scala> var rdd2=rdd1.partitionBy(new org.apache.spark.HashPartitioner(2))
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[29] at partitionBy at <console>:26
#view the elements in each partition
scala> rdd2.mapPartitionsWithIndex{
     | (partIdx,iter)=>{
     | var part_map=scala.collection.mutable.Map[String,List[(Int,String)]]()
     | while(iter.hasNext){
     | var part_name="part_"+partIdx
     | var elem=iter.next()
     | if(part_map.contains(part_name)){
     | var elems=part_map(part_name)
     |  elems::=elem
     | part_map(part_name)=elems
     | }else{
     | part_map(part_name)=List[(Int,String)]{elem}
     | }}
     | part_map.iterator}}.collect
res23: Array[(String, List[(Int, String)])] = Array((part_0,List((4,D), (2,B))), (part_1,List((3,C), (1,A))))
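
The examples above all use HashPartitioner; the RangePartitioner mentioned earlier can be passed to partitionBy in the same way. It samples the keys and assigns contiguous key ranges to partitions. A minimal sketch (the exact range boundaries depend on the sampling):

scala> import org.apache.spark.RangePartitioner
scala> var pairs=sc.makeRDD(Array((1,"A"),(5,"B"),(9,"C"),(13,"D")),2)
scala> var ranged=pairs.partitionBy(new RangePartitioner(2,pairs))
scala> ranged.partitioner     // Some(org.apache.spark.RangePartitioner@...)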
