In Spark 1.2, the default shuffle implementation was switched from hash-based to sort-based. So how exactly do the two behave during a shuffle, what problems does hash-based shuffle have, and what problems does sort-based shuffle have?
Let's first walk through the hash-based shuffle flow in the source code, and then step back for the big picture. Reading code is, after all, seeing the trees but not the forest; once we have seen the trees, we can look at what the forest is like.
1. Hash Shuffle Overall Architecture Diagram
2. Example Program
package spark.examples
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object SparkWordCountHashShuffle {
def main(args: Array[String]) {
System.setProperty("hadoop.home.dir", "E:\\devsoftware\\hadoop-2.5.2\\hadoop-2.5.2");
val conf = new SparkConf()
conf.setAppName("SparkWordCount")
conf.setMaster("local[3]")
//Hash based Shuffle;
conf.set("spark.shuffle.manager", "hash");
val sc = new SparkContext(conf)
val rdd = sc.textFile("file:///D:/word.in.3",4); // produce at least 4 partitions
val rdd1 = rdd.flatMap(_.split(" "))
val rdd2 = rdd1.map((_, 1))
val rdd3 = rdd2.reduceByKey(_ + _, 3); /// 3 partitions correspond to 3 ResultTasks
rdd3.saveAsTextFile("file:///D:/wordout" + System.currentTimeMillis());
sc.stop
}
}
Calling rdd3.toDebugString yields the RDD lineage below. (In fact, inside saveAsTextFile rdd3 is transformed further after the ShuffledRDD; we ignore that here. ShuffledRDD is the RDD produced by the shuffle.)
(3) ShuffledRDD[4] at reduceByKey at SparkWordCountHashShuffle.scala:18 []
+-(5) MappedRDD[3] at map at SparkWordCountHashShuffle.scala:17 []
| FlatMappedRDD[2] at flatMap at SparkWordCountHashShuffle.scala:16 []
| file:///D:/word.in.3 MappedRDD[1] at textFile at SparkWordCountHashShuffle.scala:15 []
| file:///D:/word.in.3 HadoopRDD[0] at textFile at SparkWordCountHashShuffle.scala:15 []
The shuffle write happens in ShuffleMapTask and the shuffle read happens in ResultTask. A ResultTask learns where the ShuffleMapTasks wrote their data from MapOutputTrackerMaster; therefore, when a ShuffleMapTask finishes it updates MapOutputTrackerMaster with the location of its shuffle output, and the ResultTask then consults MapOutputTrackerMaster to fetch that data.
3. Hash Shuffle Write
3.1 ShuffleMapTask's runTask method
override def runTask(context: TaskContext): MapStatus = {
// Deserialize the RDD using the broadcast variable.
val ser = SparkEnv.get.closureSerializer.newInstance()
/// Deserialize taskBinary to get rdd and dep. rdd is the last RDD before the shuffle, i.e. MappedRDD[3] in the wordcount job
/// dep is the ShuffleDependency
val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
metrics = Some(context.taskMetrics)
var writer: ShuffleWriter[Any, Any] = null
try {
/// Get the shuffleManager; here it is HashShuffleManager
val manager = SparkEnv.get.shuffleManager
/// Get a HashShuffleWriter from dep.shuffleHandle and partitionId.
/// A ShuffleWriter is associated with one partition of the RDD, so M ShuffleMapTasks (for M partitions) produce M writers.
/// What dep.shuffleHandle actually is will be analyzed below
writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
//// Call HashShuffleWriter's write method; the argument is the data of this partition of the RDD (an Iterator)
writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
/// What does stop do? Calling get on stop's return value yields a MapStatus object; what a MapStatus contains is analyzed later
return writer.stop(success = true).get
} catch {
case e: Exception =>
try {
if (writer != null) {
writer.stop(success = false)
}
} catch {
case e: Exception =>
log.debug("Could not stop writer", e)
}
throw e
}
}
3.2 Deserializing taskBinary
val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
The question is what rdd and dep refer to. rdd is the last RDD of the ShuffleMapStage, and dep is of type ShuffleDependency: it describes the shuffle dependency through which the next stage consumes this ShuffleMapStage's output.
rdd and dep are serialized in DAGScheduler's submitMissingTasks; the relevant code fragment is:
var taskBinary: Broadcast[Array[Byte]] = null
try {
// For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
// For ResultTask, serialize and broadcast (rdd, func).
val taskBinaryBytes: Array[Byte] =
if (stage.isShuffleMap) { /// rdd comes from stage.rdd and dep from stage.shuffleDep.get; this stage is a ShuffleMapStage
closureSerializer.serialize((stage.rdd, stage.shuffleDep.get) : AnyRef).array()
} else {
closureSerializer.serialize((stage.rdd, stage.resultOfJob.get.func) : AnyRef).array()
}
taskBinary = sc.broadcast(taskBinaryBytes) /// broadcast from the driver to the workers
} catch {
// In the case of a failure during serialization, abort the stage.
case e: NotSerializableException =>
abortStage(stage, "Task not serializable: " + e.toString)
runningStages -= stage
return
case NonFatal(e) =>
abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}")
runningStages -= stage
return
}
3.3 dep.shuffleHandle
dep is the ShuffleDependency object; dep.shuffleHandle is declared as ShuffleHandle and its concrete type is BaseShuffleHandle. shuffleHandle is a member of ShuffleDependency and is assigned when the ShuffleDependency is instantiated, by calling HashShuffleManager's registerShuffle method. registerShuffle takes three parameters: the shuffleId, the number of partitions of the last RDD of the ShuffleMapStage (here MappedRDD[3]), and the ShuffleDependency object itself.
val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
shuffleId, _rdd.partitions.size, this)
_rdd is a member of ShuffleDependency; it is the RDD passed in when the ShuffledRDD is constructed. Below is ShuffledRDD's getDependencies method, where prev is the RDD the ShuffledRDD depends on, i.e. this _rdd.
So registerShuffle records the number of partitions of the RDD that the ShuffledRDD depends on.
override def getDependencies: Seq[Dependency[_]] = {
List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
}
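As a side note, the aggregator that reduceByKey(_ + _) contributes to this ShuffleDependency in the wordcount job would look roughly like the sketch below. org.apache.spark.Aggregator is the public case class holding the three combine functions; the exact wiring inside PairRDDFunctions is paraphrased from memory, not quoted.
import org.apache.spark.Aggregator

// Rough sketch of the aggregator behind reduceByKey(_ + _) in the wordcount job:
// the same summing function merges a value into a combiner and merges two combiners.
val wordCountAggregator = new Aggregator[String, Int, Int](
  (v: Int) => v,                  // createCombiner: the first value for a key
  (c: Int, v: Int) => c + v,      // mergeValue: map-side combine, i.e. _ + _
  (c1: Int, c2: Int) => c1 + c2)  // mergeCombiners: reduce-side merge, i.e. _ + _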
3.4 HashShuffleManager's registerShuffle method
/* Register a shuffle with the manager and obtain a handle for it to pass to tasks. */
override def registerShuffle[K, V, C](
shuffleId: Int,
numMaps: Int, //// as you can see, this is the number of partitions of the mapper-side RDD
dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
new BaseShuffleHandle(shuffleId, numMaps, dependency)
}
3.4.2 About BaseShuffleHandle
/**
* A basic ShuffleHandle implementation that just captures registerShuffle's parameters.
*/
private[spark] class BaseShuffleHandle[K, V, C](
shuffleId: Int,
val numMaps: Int,
val dependency: ShuffleDependency[K, V, C])
extends ShuffleHandle(shuffleId)
/**
* An opaque handle to a shuffle, used by a ShuffleManager to pass information about it to tasks.
*
* @param shuffleId ID of the shuffle
*/
private[spark] abstract class ShuffleHandle(val shuffleId: Int) extends Serializable {}
BaseShuffleHandle is essentially a case class; note that it is serializable. As its doc comment says, it just captures registerShuffle's parameters.
3.5 manager.getWriter — here the manager is HashShuffleManager
/** Get a writer for a given partition. Called on executors by map tasks. */
override def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext)
: ShuffleWriter[K, V] = {
new HashShuffleWriter(
shuffleBlockManager, handle.asInstanceOf[BaseShuffleHandle[K, V, _]], mapId, context)
}
So getWriter returns a HashShuffleWriter. It is created for one of the mapper partitions (mapId is the index of a mapper partition) and it carries a BaseShuffleHandle (which in turn carries the shuffleId, the total number of mapper partitions, and the ShuffleDependency). Constructing the HashShuffleWriter involves a shuffleBlockManager object; since getWriter is defined in HashShuffleManager, this shuffleBlockManager is a member of HashShuffleManager, defined as shown below. In other words, for hash shuffle the shuffle block manager is a FileShuffleBlockManager, and this class defines the on-disk files that ShuffleMapTasks write during a hash shuffle, which we will look at shortly.
override def shuffleBlockManager: FileShuffleBlockManager = {
fileShuffleBlockManager
}
3.6 Once the HashShuffleWriter is instantiated, its write method is called (note that HashShuffleWriter's actual storage backend is FileShuffleBlockManager):
writer.write performs the actual data writing:
writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
The argument to write is the data of one partition (an Iterator); partition here identifies one of the mapper's partitions by its index.
/** Write a bunch of records to this task's output */
override def write(records: Iterator[_ <: Product2[K, V]]): Unit = {
/// As seen above, the ShuffleDependency was constructed with:
/// List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
/// Depending on how dep.aggregator and dep.mapSideCombine are set, the partition data may or may not be combined by key on the map side
val iter =
if (dep.aggregator.isDefined) {
if (dep.mapSideCombine) { /// if both dep.aggregator and dep.mapSideCombine are defined, combine by key on the map side; for wordcount this is the _ + _ operation
dep.aggregator.get.combineValuesByKey(records, context) //// internally uses a hash map to combine
} else { //// dep.aggregator is defined but map-side combine is not
records
}
}
else if (dep.aggregator.isEmpty && dep.mapSideCombine) { /// mapSideCombine without an aggregator is illegal, so throw
throw new IllegalStateException("Aggregator is empty for map-side combine")
}
else { // return the records as-is, without map-side combining by key
records
}
//// Iterate over iter. Does each partition produce one file? No! The output file is chosen per key (there are partitioner.numPartitions output files per map task)
for (elem <- iter) {
/// Compute the bucketId from the element's key. The key question is whether dep.partitioner is the partitioner of the last RDD before the shuffle or of the first RDD after the shuffle
val bucketId = dep.partitioner.getPartition(elem._1) /// get the bucketId from the key
//// Get a writer by bucketId; different bucketIds map to different writers, so different (key, value) pairs go to different files (based on the element's bucketId)
/// writers is a member of shuffle, indexed by bucketId
shuffle.writers(bucketId).write(elem)
}
}
3.7 Aggregator.combineValuesByKey
Aggregator.combineValuesByKey (the map-side combine) is one of the more involved steps. Depending on whether spilling to disk is allowed, it performs the combine either with an AppendOnlyMap or with an ExternalAppendOnlyMap; the result is the same in both cases, namely an iterable collection of combined records (it is fairly long, so it will be expanded on later).
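Before expanding on it, here is a minimal sketch (not the Spark implementation) of what the map-side combine achieves for the wordcount job: folding (word, 1) records into per-key partial sums with a hash map, which is roughly what combineValuesByKey does with an AppendOnlyMap when no spilling is needed.
import scala.collection.mutable

// Hypothetical stand-in for Aggregator.combineValuesByKey, specialized to wordcount.
def combineValuesByKeySketch(records: Iterator[(String, Int)]): Iterator[(String, Int)] = {
  val combiners = mutable.HashMap.empty[String, Int]
  for ((k, v) <- records) {
    // createCombiner and mergeValue collapse into a running sum per key
    combiners(k) = combiners.getOrElse(k, 0) + v
  }
  combiners.iterator
}

// combineValuesByKeySketch(Iterator(("a", 1), ("b", 1), ("a", 1))) yields ("a", 2), ("b", 1)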
3.8 Iterating over the elements and calling dep.partitioner.getPartition(elem._1) to get the bucketId
Is dep.partitioner the partitioner defined for the last RDD before the shuffle (MappedRDD[3]), or for the first RDD after the shuffle (ShuffledRDD)?
The partitioner is passed into ShuffleDependency as a constructor parameter, and its doc comment says it is used to partition the shuffle output. Debugging confirms this: it is the partitioner of the first RDD after the shuffle (ShuffledRDD), i.e. it determines the ShuffledRDD's partitioning.
Debugging shows that dep.partitioner is a HashPartitioner with 3 partitions.
It is therefore not hard to see that dep.partitioner.getPartition(elem._1) places the element according to the ShuffledRDD's partitioning scheme, so bucketId is the index of a ShuffledRDD partition. A quick check from the driver is sketched below.
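A sketch of that check, a few lines that could be dropped into the example job's main right after rdd3 is defined (it relies on ShuffleDependency's public partitioner and rdd fields; the concrete numbers are what the debug output above showed):
import org.apache.spark.ShuffleDependency

// Inspect the shuffle dependency hanging off ShuffledRDD[4] (rdd3 in the example job).
val dep = rdd3.dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]
println(dep.partitioner.numPartitions) // 3: the ShuffledRDD's HashPartitioner, i.e. the reduce side
println(dep.rdd.partitions.length)     // the map-side partition count (numMaps), 5 here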
3.8.2 HashPartitioner's getPartition method:
def getPartition(key: Any): Int = key match {
case null => 0
//// hash the key and take it modulo numPartitions via Utils.nonNegativeMod
case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
}
3.8.3 The Utils.nonNegativeMod method
/* Calculates 'x' modulo 'mod', takes to consideration sign of x,
* i.e. if 'x' is negative, than 'x' % 'mod' is negative too
* so function return (x % mod) + mod in that case.
*/
def nonNegativeMod(x: Int, mod: Int): Int = {
val rawMod = x % mod
rawMod + (if (rawMod < 0) mod else 0)
}
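As a quick worked example of the bucket assignment (a standalone sketch reusing the function above; the key is arbitrary):
// Same logic as Utils.nonNegativeMod, applied to one key with 3 reduce partitions.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

val numPartitions = 3 // matches reduceByKey(_ + _, 3) in the example job
val bucketId = nonNegativeMod("spark".hashCode, numPartitions)
println(s"key 'spark' -> bucket $bucketId") // a value in 0..2; negative hashCodes still land in range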
3.9 Call shuffle.writers(bucketId) to get the writer for the target bucket (the source from which the ResultTask of the corresponding ShuffledRDD partition will fetch its data), then call write to write the element
3.10 First, the definition of the shuffle variable in HashShuffleWriter
/// shuffleBlockManager is of type FileShuffleBlockManager
private val shuffle = shuffleBlockManager.forMapTask(dep.shuffleId, mapId, numOutputSplits, ser,
writeMetrics)
3.11 The shuffleBlockManager.forMapTask method
/**
* Get a ShuffleWriterGroup for the given map task, which will register it as complete
* when the writers are closed successfully
*/
/// mapId is the map-side partitionId; numBuckets is the number of ResultTasks, i.e. the number of ShuffledRDD partitions
/// forMapTask creates, for each mapId, numBuckets (one per reducer) files
def forMapTask(shuffleId: Int, mapId: Int, numBuckets: Int, serializer: Serializer,
writeMetrics: ShuffleWriteMetrics) = {
new ShuffleWriterGroup {
shuffleStates.putIfAbsent(shuffleId, new ShuffleState(numBuckets))
private val shuffleState = shuffleStates(shuffleId)
private var fileGroup: ShuffleFileGroup = null
val writers: Array[BlockObjectWriter] = if (consolidateShuffleFiles) { /// if consolidateShuffleFiles is enabled, shuffle outputs are consolidated
fileGroup = getUnusedFileGroup()
Array.tabulate[BlockObjectWriter](numBuckets) { bucketId =>
val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
blockManager.getDiskWriter(blockId, fileGroup(bucketId), serializer, bufferSize,
writeMetrics)
}
} else {
Array.tabulate[BlockObjectWriter](numBuckets) { bucketId => /// create an array of numBuckets elements of type BlockObjectWriter
val blockId = ShuffleBlockId(shuffleId, mapId, bucketId) // a ShuffleBlockId built from shuffleId, mapId and the reducer-side partitionId
// blockManager is an org.apache.spark.storage.BlockManager
// BlockManager's class comment: Manager running on every node (driver and executors) which provides interfaces for putting and retrieving blocks both locally and remotely into various stores (memory, disk, and off-heap).
// diskBlockManager is a DiskBlockManager
// DiskBlockManager:
/* Creates and maintains the logical mapping between logical blocks and physical on-disk
 * locations. By default, one block is mapped to one file with a name given by its BlockId.
 * However, it is also possible to have a block map to only a segment of a file, by calling
 * mapBlockToFileSegment(). */
val blockFile = blockManager.diskBlockManager.getFile(blockId) /// from the three pieces of information above, obtain a file blockFile; there are M*N files in total
// Because of previous failures, the shuffle file may already exist on this machine.
// If so, remove it.
if (blockFile.exists) {
if (blockFile.delete()) {
logInfo(s"Removed existing shuffle file $blockFile")
} else {
logWarning(s"Failed to remove existing shuffle file $blockFile")
}
}
/// Get a BlockObjectWriter from blockId and blockFile; blockId and blockFile are somewhat redundant, since blockFile already encodes the blockId
/// bufferSize comes from the spark.shuffle.file.buffer.kb setting in SparkConf (in KB, default 32, i.e. 32 KB), used as the write buffer
blockManager.getDiskWriter(blockId, blockFile, serializer, bufferSize, writeMetrics)
}
}
Since forMapTask returns a ShuffleWriterGroup, the shuffle variable is of type ShuffleWriterGroup, which has a writers member.
3.11.1 ShuffleBlockId
This class is like a JavaBean; it has a single name used to identify this ShuffleBlockId, and its reduceId is the bucketId passed in during construction above:
name = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
3.11.2 blockManager.diskBlockManager.getFile(blockId)
Gets a File from the blockId. Note that the file is not created at this point; if it already exists, it is deleted first.
It calls DiskBlockManager's getFile method:
def getFile(blockId: BlockId): File = getFile(blockId.name)
getFile then calls the overloaded getFile(filename):
def getFile(filename: String): File = {
// Figure out which local directory it hashes to, and which subdirectory in that
// hash the file name
// filename is e.g. shuffle_0_0_0
val hash = Utils.nonNegativeHash(filename)
// First we need to know what localDirs means; from it we derive dirId (an index into the directories, i.e. localDirs(dirId) is a concrete directory)
// localDirs are the local directories used to store map output data; multiple directories can be specified, separated by commas
// In the wordcount example spark.local.dir is not set, so the java.io.tmpdir directory is used and localDirs has length 1
// so dirId is 0 here
val dirId = hash % localDirs.length
// What is subDirsPerLocalDir? It comes from the spark.diskStore.subDirectories setting in SparkConf, default 64
// Since localDirs.length is 1, subDirId = hash % subDirsPerLocalDir, a number between 0 and 63
val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
// Create the subdirectory if it doesn't already exist
// subDirs is a two-dimensional array:
// private val subDirs = Array.fill(localDirs.length)(new Array[File](subDirsPerLocalDir))
/// fill takes two arguments: the first is n, and each of the n elements is initialized with the second argument; so subDirs is a 2-D array with subDirsPerLocalDir sub-directory slots per local dir
/// Look up the sub-directory File for dirId and subDirId; it should still be null at this point (verified to be null)
var subDir = subDirs(dirId)(subDirId)
// the sub-directory does not exist yet
if (subDir == null) {
subDir = subDirs(dirId).synchronized { /// subDirs(dirId) is a one-dimensional array
val old = subDirs(dirId)(subDirId) /// double-checked locking
if (old != null) {
old
} else {
val newDir = new File(localDirs(dirId), "%02x".format(subDirId)) /// format subDirId as two hex digits
newDir.mkdir()
subDirs(dirId)(subDirId) = newDir /// store it into the 2-D array
newDir /// becomes the value of subDir
}
}
}
/// Build a File from the directory and the file name; the file itself is not created here (File.createNewFile is not called)
/// subDir is e.g. ${java.io.tmp}/spark-local-20150219132253-c917/0c (or 0d, etc., a two-digit hex sub-directory name)
new File(subDir, filename)
}
localDirs:
/* Create one local directory for each path mentioned in spark.local.dir; then, inside this
* directory, create multiple subdirectories that we will hash files into, in order to avoid
* having really large inodes at the top level. */
//Gets or creates the directories listed in spark.local.dir or SPARK_LOCAL_DIRS,
private[spark] val localDirs: Array[File] = createLocalDirs(conf)
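The directory hashing can be reproduced with a small standalone sketch. math.abs of hashCode is used as a stand-in for Utils.nonNegativeHash (ignoring the Int.MinValue corner case that the real method handles), and the directory counts match the wordcount run: one local dir, 64 sub-directories.
// Where a block file named "shuffle_0_0_0" would land, following getFile's hashing.
val localDirsLength = 1                        // spark.local.dir not set: a single temp dir
val subDirsPerLocalDir = 64                    // spark.diskStore.subDirectories default
val hash = math.abs("shuffle_0_0_0".hashCode)  // stand-in for Utils.nonNegativeHash(filename)
val dirId = hash % localDirsLength             // always 0 with a single local dir
val subDirId = (hash / localDirsLength) % subDirsPerLocalDir
println(f"localDirs($dirId)/$subDirId%02x/shuffle_0_0_0") // prints the two-hex-digit sub-directory the file hashes to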
Annoyingly, local breakpoints cannot step into this code (the source no longer matches the class files), so let's first write down the map output produced by running the wordcount ShuffleMapTasks, and then work backwards to the meaning of the code.
C:\Users\hadoop\AppData\Local\Temp\spark-local-20150219132253-c917>tree /f
Folder PATH listing
Volume serial number is 4E9D-390C
C:.
├─0c
│ shuffle_0_0_0
│
├─0d
│ shuffle_0_0_1
│
├─0e
│ shuffle_0_0_2
│ shuffle_0_2_0
│
├─0f
│ shuffle_0_2_1
│ shuffle_0_3_0
│
├─10
│ shuffle_0_2_2
│ shuffle_0_3_1
│
├─11
│ shuffle_0_3_2
│
├─12
└─13
The above confirms that localDirs is C:\Users\hadoop\AppData\Local\Temp\spark-local-20150219132253-c917, and that 0c, 0d, ..., 13 underneath it are the hex-named sub-directories; each local dir holds at most 64 of them.
3.11.3 With blockFile in hand, the following statement obtains the writer, of type BlockObjectWriter:
blockManager.getDiskWriter(blockId, blockFile, serializer, bufferSize, writeMetrics)
The statement above is implemented as follows:
def getDiskWriter(
blockId: BlockId,
file: File,
serializer: Serializer,
bufferSize: Int,
writeMetrics: ShuffleWriteMetrics): BlockObjectWriter = {
val compressStream: OutputStream => OutputStream = wrapForCompression(blockId, _)
val syncWrites = conf.getBoolean("spark.shuffle.sync", false)
new DiskBlockObjectWriter(blockId, file, serializer, bufferSize, compressStream, syncWrites,
writeMetrics)
}
So it returns a DiskBlockObjectWriter, wrapping the output stream for compression (compressStream) and using the given serializer.
This completes the analysis of FileShuffleBlockManager's forMapTask.
3.12 shuffle.writers(bucketId) returns the DiskBlockObjectWriter created in FileShuffleBlockManager's forMapTask; its write method is then called
override def write(value: Any) {
if (!initialized) {
open()
}
objOut.writeObject(value) /// write the object to the binary stream
if (writesSinceMetricsUpdate == 32) {
writesSinceMetricsUpdate = 0
updateBytesWritten()
} else {
writesSinceMetricsUpdate += 1
}
}
3.13 When the data of this RDD partition has been written, control returns to ShuffleMapTask's runTask, which executes the final step:
return writer.stop(success = true).get
Two things happen here: first, the writers opened above are closed (R files were opened during the write); second, the location of the written data must eventually be reported to MapOutputTrackerMaster.
/** Close this writer, passing along whether the map completed */
override def stop(initiallySuccess: Boolean): Option[MapStatus] = {
var success = initiallySuccess
try {
if (stopping) {
return None
}
stopping = true
if (success) {
try {
Some(commitWritesAndBuildStatus()) /// what does this do? It builds the return value, wrapped in Some
} catch {
case e: Exception =>
success = false
revertWrites()
throw e
}
} else {
revertWrites()
None
}
} finally {
// Release the writers back to the shuffle block manager.
if (shuffle != null && shuffle.writers != null) { /// commitWritesAndBuildStatus in the try block already closed all open writers; why release them again here?
try {
shuffle.releaseWriters(success)
} catch {
case e: Exception => logError("Failed to release shuffle writers", e)
}
}
}
}
3.13.1 commitWritesAndBuildStatus
private def commitWritesAndBuildStatus(): MapStatus = {
// Commit the writes. Get the size of each bucket block (total block size).
// every writer has written data
val sizes: Array[Long] = shuffle.writers.map { writer: BlockObjectWriter =>
writer.commitAndClose() /// commit and close
writer.fileSegment().length /// how is the fileSegment() length computed? It is the number of bytes this writer wrote
}
/// sizes is an array: all of this map task's output for all reducers has now been produced, one file per (mapper, reducer) pair
MapStatus(blockManager.shuffleServerId, sizes)
}
3.13.2 DiskBlockObjectWriter's fileSegment() method
override def fileSegment(): FileSegment = {
/// three parameters:
// initialPosition is where the content starts in the file, and finalPosition - initialPosition is the segment's length; when consolidation is not enabled, each segment is a complete file
new FileSegment(file, initialPosition, finalPosition - initialPosition)
}
3.14 commitWritesAndBuildStatus above returns a MapStatus object; at this point the shuffle output location has not yet been registered with MapOutputTrackerMaster.
Since the Spark source and the binaries are out of sync, the code cannot be traced further, so we stop here for now and move on to analyzing the hash-based shuffle read.
The above dissected the source of the hash-based shuffle write. One part was not covered, the map-side combine (Aggregator.combineValuesByKey); it will be written up separately.
Miscellaneous (not covered in the analysis above)
The relationship between the requested number of partitions and the actual number of partitions
conf.set("spark.shuffle.manager", "hash");
1. textFile with a specified number of partitions
val rdd = sc.textFile("file:///D:/word.in.3",4); // 4 is the minimum number of partitions
2. The following code is from HadoopRDD.scala. With minPartitions = 4, getSplits returns 5 input splits, so the number of partitions is 5.
override def getPartitions: Array[Partition] = {
val jobConf = getJobConf()
// add the credentials here as this can be called before SparkContext initialized
SparkHadoopUtil.get.addCredentials(jobConf)
val inputFormat = getInputFormat(jobConf)
if (inputFormat.isInstanceOf[Configurable]) {
inputFormat.asInstanceOf[Configurable].setConf(jobConf)
}
val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
val array = new Array[Partition](inputSplits.size)
for (i <- 0 until inputSplits.size) {
array(i) = new HadoopPartition(id, i, inputSplits(i))
}
array
}
The relationship between the number of ResultTasks and the number of map partitions:
1. If the number of ResultTasks is not specified, it defaults to the number of map partitions; if it is specified, that many ResultTask instances are created.
package spark.examples
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object SparkWordCount {
def main(args: Array[String]) {
System.setProperty("hadoop.home.dir", "E:\\devsoftware\\hadoop-2.5.2\\hadoop-2.5.2");
val conf = new SparkConf()
conf.setAppName("SparkWordCount")
conf.setMaster("local")
//Hash based Shuffle;
conf.set("spark.shuffle.manager", "hash");
val sc = new SparkContext(conf)
val rdd = sc.textFile("file:///D:/word.in.3",4); //4表示最小Partition书目
println(rdd.toDebugString)
val rdd1 = rdd.flatMap(_.split(" "))
println("rdd1:" + rdd1.toDebugString)
val rdd2 = rdd1.map((_, 1))
println("rdd2:" + rdd2.toDebugString)
val rdd3 = rdd2.reduceByKey(_ + _, 3); /// 3 is the number of ReduceTasks; if not specified it equals the number of map partitions
println("rdd3:" + rdd3.toDebugString)
rdd3.saveAsTextFile("file:///D:/wordout" + System.currentTimeMillis());
sc.stop
}
}
The number of map-side files produced by hash-based shuffle, vs. the number of map partitions and ReduceTasks
1. The map-side intermediate results are stored under java.io.tmpdir by default, or under the configured directory if one is specified.
2. If an RDD has M partitions, M ShuffleMapTasks are created.
3. With 1 ResultTask, one result file (part-00000) is produced; with R ReduceTasks (i.e. ResultTasks), R result files are produced.
4. With M map partitions and R ReduceTasks, how many map output files are produced? M*R (a small enumeration sketch follows the file-name explanation below). For example:
/tmp/0c/shuffle_0_0_0
/tmp/0d/shuffle_0_0_1
/tmp/0d/shuffle_0_0_2
/tmp/0e/shuffle_0_2_0
/tmp/0f/shuffle_0_2_1
/tmp/0f/shuffle_0_3_0
The meaning of the three numbers after shuffle:
- shuffleId
- map-side partition ID (mapId)
- reduce-side partition ID, indicating which ReduceTask will process this bucket; its maximum value here is 2, since there are 3 ReduceTasks
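As a sketch of the M*R rule (assuming every map partition writes a bucket for every reducer; the wordcount run above had 5 map partitions and 3 reducers). Note that DiskBlockObjectWriter only opens, and thus creates, its file on the first write, which would explain why the tree above lists fewer than 15 files if some of the small input's splits were empty.
// Enumerate the shuffle block names an un-consolidated hash shuffle would use.
val shuffleId = 0
val numMapPartitions = 5 // the wordcount input produced 5 splits
val numReducers = 3      // reduceByKey(_ + _, 3)
val blockNames =
  for (mapId <- 0 until numMapPartitions; reduceId <- 0 until numReducers)
    yield s"shuffle_${shuffleId}_${mapId}_${reduceId}"
println(s"${blockNames.size} possible files") // 5 * 3 = 15
blockNames.foreach(println)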
Parallelism
Parallelism is the number of cores executing the ReduceTasks, i.e. how many run concurrently. (Besides the local[4]-style master URL, there is also a configuration parameter for setting the parallelism; see the sketch after the setMaster excerpt below.) Setting the parallelism does not change the number of files above.
/**
* The master URL to connect to, such as "local" to run locally with one thread, "local[4]" to
* run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.
*/
def setMaster(master: String): SparkConf = {
set("spark.master", master)
}
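The parallelism parameter alluded to above is presumably spark.default.parallelism, which controls the default number of partitions when reduceByKey is called without an explicit count; a sketch:
import org.apache.spark.SparkConf

// local[3]: three cores run tasks concurrently; spark.default.parallelism: default reduce-side
// partition count when reduceByKey(_ + _) is called without a numPartitions argument.
val conf = new SparkConf()
  .setAppName("SparkWordCount")
  .setMaster("local[3]")
  .set("spark.default.parallelism", "6")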
The spark.shuffle.consolidateFiles option
Example source code:
package spark.examples.shuffle
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
object SparkHashShuffleConsolidationFile {
def main(args: Array[String]) {
System.setProperty("hadoop.home.dir", "E:\\devsoftware\\hadoop-2.5.2\\hadoop-2.5.2");
val conf = new SparkConf()
conf.setAppName("SparkWordCount")
conf.setMaster("local[3]")
//Hash based Shuffle;
conf.set("spark.shuffle.manager", "hash");
// enable file consolidation
conf.set("spark.shuffle.consolidateFiles", "true");
val sc = new SparkContext(conf)
// 10 or more partitions, one Map Task per partition
// read a 1 MB file
val rdd = sc.textFile("file:///D:/server.log", 10);
val rdd1 = rdd.flatMap(_.split(" "))
val rdd2 = rdd1.map((_, 1))
// 6 reducers
val rdd3 = rdd2.reduceByKey(_ + _, 6);
rdd3.saveAsTextFile("file:///D:/wordcount" + System.currentTimeMillis());
println(rdd3.toDebugString)
sc.stop
}
}
The map tasks produced 13 directories with the following contents:
C:.
├─00
│ merged_shuffle_0_5_2
│
├─01
│ merged_shuffle_0_4_2
│ merged_shuffle_0_5_1
│
├─02
│ merged_shuffle_0_3_2
│ merged_shuffle_0_4_1
│ merged_shuffle_0_5_0
│
├─03
│ merged_shuffle_0_2_2
│ merged_shuffle_0_3_1
│ merged_shuffle_0_4_0
│
├─04
│ merged_shuffle_0_1_2
│ merged_shuffle_0_2_1
│ merged_shuffle_0_3_0
│
├─05
│ merged_shuffle_0_0_2
│ merged_shuffle_0_1_1
│ merged_shuffle_0_2_0
│
├─06
│ merged_shuffle_0_0_1
│ merged_shuffle_0_1_0
│
├─07
│ merged_shuffle_0_0_0
│
├─0c
├─0d
├─0e
├─11
└─13
1. The result shows what looks like 6 mappers, 3 reducers, and 18 files in total; why?
2. Each file name has a merged prefix; what does it mean?
Increasing the size of the input file gives the same result. Why do there appear to be only 6 mappers and only 3 reducers (is the 3 related to the parallelism of local[3])?
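A hedged reading of the 18 files, based on how FileShuffleBlockManager names consolidated output (if memory serves, the physical file name is merged_shuffle_<shuffleId>_<bucketId>_<fileId>): the middle number would then be the reducer bucket (0..5 for the 6 reducers) and the last number the file group id. File groups are created per concurrently running map task and reused by later map tasks on the same core, which would also explain why the count does not grow with the input size.
// Expected consolidated file count under that reading: one file per (file group, bucket) pair.
val concurrentMapTasks = 3 // local[3]: at most 3 map tasks run at once, so 3 file groups
val reducers = 6           // reduceByKey(_ + _, 6)
println(concurrentMapTasks * reducers) // 18, matching the merged_shuffle_* files listed above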