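Spark repartitioning functions: the difference between coalesce and repartition, how they are implemented, and how they can be used to tune Spark program performance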

Source package path: org.apache.spark.rdd.RDD




Return a new RDD that is reduced into numPartitions partitions. This results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions. However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).

Note: With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large. Calling coalesce(1000, shuffle = true) will result in 1000 partitions with the data distributed using a hash partitioner.
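
The two behaviours described above can be tried out with a small sketch like the one below. It is only an illustration under assumptions: the object name CoalesceSketch, the app name, the local[4] master and the sample data are all made up here and are not part of the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceSketch {
  // Returns (partition count after a narrow coalesce,
  //          partition count after a shuffled coalesce to 1).
  def run(): (Int, Int) = {
    // Local mode and the app name are assumptions made for illustration.
    val conf = new SparkConf().setAppName("coalesce-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data spread over 1000 partitions.
      val rdd = sc.parallelize(1 to 100000, numSlices = 1000)

      // shuffle defaults to false: each new partition claims a group of
      // parent partitions, so no shuffle stage is added to the lineage.
      val narrowed = rdd.coalesce(100)
      println(narrowed.toDebugString)

      // Drastic coalesce with shuffle = true: the upstream map still runs
      // over the 1000 original partitions, and only the shuffled output
      // is merged into a single partition.
      val single = rdd.map(_ * 2).coalesce(1, shuffle = true)

      (narrowed.getNumPartitions, single.getNumPartitions)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Printing toDebugString is one way to check whether a shuffle was introduced: the lineage of the narrow coalesce should not contain a ShuffledRDD.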





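Note: setting the second parameter, shuffle = true, makes it possible to end up with more partitions than before. For example, if you have a small number of partitions, say 100, calling coalesce(1000, shuffle = true) uses a HashPartitioner to produce 1000 partitions distributed across the cluster nodes. This is very useful for increasing the level of parallelism.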
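
A minimal sketch of this point, assuming a local Spark setup: without a shuffle, asking coalesce for more partitions leaves the count unchanged, while shuffle = true actually grows it. The object name GrowPartitionsSketch, the master setting and the sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GrowPartitionsSketch {
  // Returns (partition count without shuffle, partition count with shuffle).
  def run(): (Int, Int) = {
    // Local mode and the app name are assumptions made for illustration.
    val conf = new SparkConf().setAppName("grow-partitions-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data in 100 partitions.
      val rdd = sc.parallelize(1 to 100000, numSlices = 100)

      // Without a shuffle, coalesce cannot increase the partition count;
      // the RDD stays at its current 100 partitions.
      val unchanged = rdd.coalesce(1000)

      // With shuffle = true the data is redistributed into 1000 partitions.
      val grown = rdd.coalesce(1000, shuffle = true)

      (unchanged.getNumPartitions, grown.getNumPartitions)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```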


Return a new RDD that has exactly numPartitions partitions. Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.
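
As a rough usage sketch of repartition, the snippet below changes the partition count in both directions. In the Spark RDD source, repartition(numPartitions) is implemented by calling coalesce(numPartitions, shuffle = true), which is why it always shuffles. The object name RepartitionSketch, the local master and the sample data are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  // Returns (partitions after increasing, partitions after decreasing,
  //          element count after repartitioning).
  def run(): (Int, Int, Long) = {
    // Local mode and the app name are assumptions made for illustration.
    val conf = new SparkConf().setAppName("repartition-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data in 4 partitions.
      val rdd = sc.parallelize(1 to 10000, numSlices = 4)

      // repartition always shuffles, so it can increase the partition count...
      val more = rdd.repartition(16)

      // ...or decrease it. When only decreasing, rdd.coalesce(2) would
      // avoid the shuffle.
      val fewer = rdd.repartition(2)

      // Repartitioning moves records around but does not add or drop any.
      (more.getNumPartitions, fewer.getNumPartitions, more.count())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```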


