1. SparkEnv.create() initializes the MapOutputTrackerMaster (which records the output information of the ShuffleMapTasks)
val mapOutputTracker = if (isDriver) {
/* MapOutputTrackerMaster lives in the driver. It uses a TimeStampedHashMap to track the map
 * output information, so stale entries can also be cleaned out.
 * I. What MapOutputTracker does
 * 1. Records where the map outputs are, so that each reducer can fetch the data it needs.
 * 2. Every mapper and reducer has its own unique id (mapId, reduceId).
 * 3. A reducer may read from several map outputs; fetching the blocks of each map is the shuffle,
 *    and each shuffle has its own shuffleId.
 */
new MapOutputTrackerMaster(conf)
} else {
// Runs inside an executor
new MapOutputTrackerWorker(conf)
}
// Have to assign trackerActor after initialization as MapOutputTrackerActor
// requires the MapOutputTracker itself
// Assign a MapOutputTrackerMasterEndpoint to the tracker's trackerEndpoint member;
// MapOutputTracker.ENDPOINT_NAME is "MapOutputTracker"
mapOutputTracker.trackerEndpoint = registerOrLookupEndpoint(MapOutputTracker.ENDPOINT_NAME,
new MapOutputTrackerMasterEndpoint(
rpcEnv, mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], conf))
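For reference, the registerOrLookupEndpoint helper used above behaves roughly as follows inside SparkEnv.create(): the driver actually registers the endpoint with its RpcEnv, while executors only look up a reference to the driver-side endpoint. A condensed sketch, not a verbatim copy of the source:

// Condensed sketch of SparkEnv.create()'s helper (treat as illustrative):
// on the driver the endpoint is registered, on executors only a reference
// to the driver-side endpoint is obtained.
def registerOrLookupEndpoint(name: String, endpointCreator: => RpcEndpoint): RpcEndpointRef = {
  if (isDriver) {
    logInfo("Registering " + name)
    rpcEnv.setupEndpoint(name, endpointCreator)
  } else {
    RpcUtils.makeDriverRef(name, conf, rpcEnv)
  }
}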
2. What MapOutputTrackerMaster's initialization does
a. Reduce-task data locality is enabled by default: spark.shuffle.reduceLocality.enabled = true
b. Locality preferences are only computed when the RDD's total partition count stays below SHUFFLE_PREF_MAP_THRESHOLD (1000); beyond that, computing them would cost more than it saves
c. A tunable spot: the fraction of map output that must sit at one location before it counts as a preferred location for a reduce task, private val REDUCER_PREF_LOCS_FRACTION = 0.2 (see the sketch after this list)
d. On the driver: mapStatuses stores, per shuffle, the block manager address each task ran on and the output sizes it reported for each reducer; both mapStatuses and the cached serialized statuses (TimeStampedHashMap[Int, Array[Byte]]()) are kept in timestamp-based hash maps
e. A tunable spot: a new MetadataCleaner() is created to clean mapStatuses and the cached serialized statuses; by default nothing is cleaned unless spark.cleaner.ttl is set
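As a rough illustration of how the thresholds in b and c are used, MapOutputTrackerMaster's preferred-location logic for a reduce partition looks approximately like this (a condensed sketch of getPreferredLocationsForShuffle; the names exist in this version, but the body is simplified):

// Condensed sketch of MapOutputTrackerMaster.getPreferredLocationsForShuffle (simplified).
def getPreferredLocationsForShuffle(dep: ShuffleDependency[_, _, _], partitionId: Int): Seq[String] = {
  if (shuffleLocalityEnabled &&
      dep.rdd.partitions.length < SHUFFLE_PREF_MAP_THRESHOLD &&
      dep.partitioner.numPartitions < SHUFFLE_PREF_REDUCE_THRESHOLD) {
    // Keep only the block managers that hold at least REDUCER_PREF_LOCS_FRACTION (20%) of the
    // bytes destined for this reduce partition.
    val blockManagerIds = getLocationsWithLargestOutputs(
      dep.shuffleId, partitionId, dep.partitioner.numPartitions, REDUCER_PREF_LOCS_FRACTION)
    if (blockManagerIds.nonEmpty) blockManagerIds.get.map(_.host) else Nil
  } else {
    Nil
  }
}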
/**
 * MapOutputTracker for the driver. This uses TimeStampedHashMap to keep track of map
 * output information, which allows old output information to be dropped based on a TTL.
 *
 * It also prepares all the map outputs each shuffle needs, which speeds up handing the map
 * outputs to the shuffle readers.
 *
 * I. What MapOutputTracker (the parent of MapOutputTrackerMaster and MapOutputTrackerWorker) does
 *  1. Records where the map outputs are, so that each reducer can fetch the data it needs.
 *  2. Every mapper and reducer has its own unique id (mapId, reduceId).
 *  3. A reducer may read from several map outputs; fetching the blocks of each map is the
 *     shuffle, and each shuffle has its own shuffleId.
 *
 * II. MapOutputTrackerMaster and MapOutputTrackerWorker (which runs in the executors) both
 *     extend MapOutputTracker.
 *  1. MapOutputTrackerMaster records the map outputs of the ShuffleMapTasks of each stage.
 *   a. Before a shuffle reader reads the shuffle files, it asks MapOutputTrackerMaster where
 *      the data it has to process lives.
 *   b. The tracker replies with a list of map output locations (addresses, ports and so on).
 *  2. MapOutputTrackerWorker is merely a cache used while executing shuffle reads.
 */
private[spark] class MapOutputTrackerMaster(conf: SparkConf)
extends MapOutputTracker(conf) {
/** Cache a serialized version of the output statuses for each shuffle to send them out faster */
private var cacheEpoch = epoch
/** Whether to compute locality preferences for reduce tasks */
private val shuffleLocalityEnabled = conf.getBoolean("spark.shuffle.reduceLocality.enabled", true)
// Number of map and reduce tasks above which we do not assign preferred locations based on map
// output sizes. We limit the size of jobs for which we assign preferred locations, as computing
// the top locations by size becomes expensive.
private val SHUFFLE_PREF_MAP_THRESHOLD = 1000
// NOTE: This should be less than 2000 as we use HighlyCompressedMapStatus beyond that
private val SHUFFLE_PREF_REDUCE_THRESHOLD = 1000
// Fraction of total map output that must be at a location for it to be considered as a preferred
// location for a reduce task. Making this larger will focus on fewer locations where most data
// can be read locally, but may lead to more delay in scheduling if those locations are busy.
private val REDUCER_PREF_LOCS_FRACTION = 0.2
/**
 * Timestamp based HashMap for storing mapStatuses and cached serialized statuses in the driver,
 * so that statuses are dropped only by explicit de-registering or by TTL-based cleaning (if set).
 * Other than these two scenarios, nothing should be dropped from this HashMap.
 *
 * The key is the shuffle id. A MapStatus is the object a ShuffleMapTask returns to the
 * DAGScheduler: it holds the block manager address the task ran on and the output sizes for
 * each reducer, which are later passed on to the reduce tasks.
 */
protected val mapStatuses = new TimeStampedHashMap[Int, Array[MapStatus]]()
private val cachedSerializedStatuses = new TimeStampedHashMap[Int, Array[Byte]]()
// For cleaning up TimeStampedHashMaps: periodically clears the entries of mapStatuses and
// cachedSerializedStatuses. If spark.cleaner.ttl is not set, nothing is ever cleaned (another tunable spot).
private val metadataCleaner =
new MetadataCleaner(MetadataCleanerType.MAP_OUTPUT_TRACKER, this.cleanup, conf)
// Register a new shuffle in the map-output collection mapStatuses, given the shuffle id and the number of maps
def registerShuffle(shuffleId: Int, numMaps: Int) {
if (mapStatuses.put(shuffleId, new Array[MapStatus](numMaps)).isDefined) {
throw new IllegalArgumentException("Shuffle ID " + shuffleId + " registered twice")
}
}
// Look up the Array[MapStatus] for this shuffle id in TimeStampedHashMap[Int, Array[MapStatus]]
// and store the given MapStatus at the corresponding index
def registerMapOutput(shuffleId: Int, mapId: Int, status: MapStatus) {
//mapStatuses: TimeStampedHashMap[Int, Array[MapStatus]]()
val array = mapStatuses(shuffleId)
array.synchronized {
array(mapId) = status
}
}
...
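For context, the two methods above are driven from the driver side roughly as follows: the DAGScheduler calls registerShuffle when it creates a shuffle map stage, and registerMapOutput once per completed ShuffleMapTask. A condensed, hypothetical illustration (these are private[spark] APIs; the helper name and parameters below are mine, not Spark's):

import org.apache.spark.MapOutputTrackerMaster
import org.apache.spark.scheduler.MapStatus

// Hypothetical helper, only to show the call order; not actual DAGScheduler code.
def recordShuffleOutputs(
    tracker: MapOutputTrackerMaster,
    shuffleId: Int,
    numMaps: Int,
    finished: Seq[(Int, MapStatus)]): Unit = {
  // When the ShuffleMapStage is created: reserve one MapStatus slot per map task.
  tracker.registerShuffle(shuffleId, numMaps)
  // As each ShuffleMapTask (mapId) completes, record where it ran and its per-reducer sizes.
  finished.foreach { case (mapId, status) =>
    tracker.registerMapOutput(shuffleId, mapId, status)
  }
}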
3. What the MapOutputTracker base class sets up:
a. It exposes the trackerEndpoint member so that SparkEnv can assign the MapOutputTrackerMasterEndpoint to it during initialization:
mapOutputTracker.trackerEndpoint = registerOrLookupEndpoint(MapOutputTracker.ENDPOINT_NAME,
new MapOutputTrackerMasterEndpoint(
rpcEnv, mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], conf))
b. mapStatuses behaves differently on the driver and on the executors:
1) on the driver it records the map outputs of the ShuffleMapTasks;
2) on the executors it is merely a cache, and a miss triggers a fetch of the HashMap data from the driver (see the sketch after this list)
c. epoch is incremented every time a fetch fails, so that clients know to clear their cached map output locations
d. private val fetching = new HashSet[Int] records the shuffle ids whose map outputs are currently being fetched, so the same statuses are not fetched twice at once
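The executor-side path that uses fetching looks roughly like the following (a simplified sketch of the idea behind MapOutputTracker.getStatuses; error handling, logging and the epoch checks are omitted, and mapStatuses, fetching and askTracker are members of the class):

// Simplified sketch of the executor-side lookup (not a verbatim copy of getStatuses).
def getStatuses(shuffleId: Int): Array[MapStatus] = {
  var statuses = mapStatuses.get(shuffleId).orNull
  if (statuses == null) {
    fetching.synchronized {
      // If another thread is already fetching this shuffle, wait for it to finish.
      while (fetching.contains(shuffleId)) {
        fetching.wait()
      }
      // Either the other thread succeeded (cache hit now) or we claim the fetch ourselves.
      statuses = mapStatuses.get(shuffleId).orNull
      if (statuses == null) {
        fetching += shuffleId
      }
    }
    if (statuses == null) {
      try {
        // Ask the driver-side MapOutputTrackerMasterEndpoint for the serialized statuses.
        val bytes = askTracker[Array[Byte]](GetMapOutputStatuses(shuffleId))
        statuses = MapOutputTracker.deserializeMapStatuses(bytes)
        mapStatuses.put(shuffleId, statuses)
      } finally {
        fetching.synchronized {
          fetching -= shuffleId
          fetching.notifyAll()
        }
      }
    }
  }
  statuses
}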
/**
 * Class that keeps track of the location of the map output of a stage. This is abstract because
 * different versions of MapOutputTracker (driver and executor) use different HashMaps to store
 * their metadata.
 *
 * On the master it records where the map outputs of the ShuffleMapTasks live; on the workers it
 * is merely a cache used while executing shuffle reads.
 * ...
5. MapOutputTrackerMasterEndpoint does little beyond wiring things up at initialization; when serving a request it only checks that mapOutputStatuses.length does not exceed the 128 MB frame limit and fails otherwise. (Spark 2.2 no longer works this way: the endpoint simply builds a GetMapOutputMessage(shuffleId, RpcCallContext) and hands it to the MapOutputTrackerMaster.)
/** RpcEndpoint class for MapOutputTrackerMaster.
 * A subclass of RpcEndpoint that may be used by multiple threads; it is constructed inside
 * SparkEnv.create() with the MapOutputTrackerMaster passed to its constructor. */
private[spark] class MapOutputTrackerMasterEndpoint(
override val rpcEnv: RpcEnv, tracker: MapOutputTrackerMaster, conf: SparkConf)
extends RpcEndpoint with Logging {
// The configured maximum Akka frame size in bytes; maxFrameSizeBytes returns 128 MB here
val maxAkkaFrameSize = AkkaUtils.maxFrameSizeBytes(conf)
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
case GetMapOutputStatuses(shuffleId: Int) =>
val hostPort = context.senderAddress.hostPort
logInfo("Asked to send map output locations for shuffle " + shuffleId + " to " + hostPort)
val mapOutputStatuses = tracker.getSerializedMapOutputStatuses(shuffleId)
val serializedSize = mapOutputStatuses.length
if (serializedSize > maxAkkaFrameSize) {
val msg = s"Map output statuses were $serializedSize bytes which " +
s"exceeds spark.akka.frameSize ($maxAkkaFrameSize bytes)."
/* For SPARK-1244 we'll opt for just logging an error and then sending it to the sender.
 * A bigger refactoring (SPARK-1239) will ultimately remove this entire code path. */
val exception = new SparkException(msg)
logError(msg, exception)
context.sendFailure(exception)
} else {
context.reply(mapOutputStatuses)
}
6. What MetadataCleaner does at initialization; this class is also used by BlockManager
/**
 * Runs a timer task to periodically clean up metadata (e.g. old files or hashtable entries).
 * When created from SparkEnv initialization, the cleaner type is MetadataCleanerType.MAP_OUTPUT_TRACKER.
 */
private[spark] class MetadataCleaner(
cleanerType: MetadataCleanerType.MetadataCleanerType,
cleanupFunc: (Long) => Unit,
conf: SparkConf)
extends Logging
{
val name = cleanerType.toString
// When initialized from here, getDelaySeconds returns -1 because spark.cleaner.ttl.MAP_OUTPUT_TRACKER resolves to -1
private val delaySeconds = MetadataCleaner.getDelaySeconds(conf, cleanerType)
// math.max(10, -1 / 10) ==> 10s
private val periodSeconds = math.max(10, delaySeconds / 10)
private val timer = new Timer(name + " cleanup timer", true)
private val task = new TimerTask {
override def run() {
try {
cleanupFunc(System.currentTimeMillis() - (delaySeconds * 1000))
logInfo("Ran metadata cleaner for " + name)
} catch {
case e: Exception => logError("Error running cleanup task for " + name, e)
}
}
}
// spark.cleaner.ttl defaults to -1, so spark.cleaner.ttl.MAP_OUTPUT_TRACKER resolves to -1 as well
// and delaySeconds is -1: by default the timer never cleans anything.
// The per-cleaner property name comes from MetadataCleanerType (here it is spark.cleaner.ttl.MAP_OUTPUT_TRACKER):
def systemProperty(which: MetadataCleanerType.MetadataCleanerType): String = { ... }
===> If spark.cleaner.ttl is set, the cleanup method below is invoked periodically (see the example that follows).
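For example, if the TTL were turned on (a hypothetical setting, just to make the arithmetic in the constructor above concrete):

// Hypothetical configuration to illustrate the delaySeconds / periodSeconds arithmetic.
val conf = new org.apache.spark.SparkConf()
conf.set("spark.cleaner.ttl", "3600") // keep shuffle metadata for one hour

// getDelaySeconds falls back to spark.cleaner.ttl when spark.cleaner.ttl.MAP_OUTPUT_TRACKER
// is not set explicitly, so for MAP_OUTPUT_TRACKER:
//   delaySeconds  = 3600
//   periodSeconds = math.max(10, 3600 / 10) = 360
// i.e. every 360 seconds the timer drops entries older than one hour.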
/**
 * Clears, at the given threshold time, the entries of mapStatuses: TimeStampedHashMap[Int, Array[MapStatus]]
 * and cachedSerializedStatuses: TimeStampedHashMap[Int, Array[Byte]].
 * Since spark.cleaner.ttl defaults to -1, the derived spark.cleaner.ttl.MAP_OUTPUT_TRACKER value
 * and delaySeconds are -1, so by default the timer never cleans anything.
 */
private def cleanup(cleanupTime: Long) {
mapStatuses.clearOldValues(cleanupTime)
cachedSerializedStatuses.clearOldValues(cleanupTime)
}
===> The cleanup itself is simple: take the iterator of the underlying ConcurrentHashMap, walk the entries, and remove every entry whose timestamp is older than the given threshold.
private[spark] case class TimeStampedValue[V](value: V, timestamp: Long)
/**
 * A custom implementation of scala.collection.mutable.Map that stores the insertion timestamp
 * along with each key-value pair. If specified, the timestamp of each pair can be updated on
 * every access. Key-value pairs whose timestamps are older than a particular threshold can then
 * be removed with the clearOldValues method.
 * The key-value pairs behave like those of Scala's mutable map, scala.collection.mutable.HashMap.
 */
private[spark] class TimeStampedHashMap[A, B](updateTimeStampOnGet: Boolean = false)
extends mutable.Map[A, B]() with Logging {
// Backed by a concurrent ConcurrentHashMap
private val internalMap = new ConcurrentHashMap[A, TimeStampedValue[B]]()
…
def getEntrySet: Set[Entry[A, TimeStampedValue[B]]] = internalMap.entrySet
…
override def size: Int = internalMap.size
override def foreach[U](f: ((A, B)) => U) {
// The Set[Entry[A, TimeStampedValue[B]]] view of the underlying ConcurrentHashMap[A, TimeStampedValue[B]]
val it = getEntrySet.iterator
while (it.hasNext) {
val entry = it.next()
val kv = (entry.getKey, entry.getValue.value)
f(kv)
}
}
….
def clearOldValues(threshTime: Long, f: (A, B) => Unit) {
// The Set[Entry[A, TimeStampedValue[B]]] view of the underlying ConcurrentHashMap[A, TimeStampedValue[B]]
val it = getEntrySet.iterator
while (it.hasNext) {
val entry = it.next()
// clear every entry older than threshTime
if (entry.getValue.timestamp < threshTime) {
f(entry.getKey, entry.getValue.value)
logDebug("Removing key " + entry.getKey)
it.remove() // remove() may only be called once per next() on the iterator
}
}
}
/** Removes old key-value pairs that have timestamp earlier than `threshTime`. */
def clearOldValues(threshTime: Long) {
clearOldValues(threshTime, (_, _) => ())
} ...
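A quick, hypothetical illustration of these semantics (TimeStampedHashMap is private[spark], so this only works from within an org.apache.spark package; the keys and values are made up):

import org.apache.spark.util.TimeStampedHashMap

val map = new TimeStampedHashMap[Int, String]()
map.put(1, "old-entry")
Thread.sleep(10)
val cutoff = System.currentTimeMillis() // entries inserted before this instant count as old
map.put(2, "new-entry")

map.clearOldValues(cutoff)              // drops key 1 (timestamp < cutoff), keeps key 2
assert(map.get(1).isEmpty)
assert(map.get(2) == Some("new-entry"))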
7. Finally, look at MapStatus: depending on the number of partitions it picks a different subclass to store a ShuffleMapTask's output
/**
 * Result returned by a ShuffleMapTask to a scheduler. Includes the block manager address that the
 * task ran on as well as the sizes of outputs for each reducer, for passing on to the reduce tasks.
 *
 * A MapStatus is the object a ShuffleMapTask returns after being scheduled by the DAGScheduler:
 * the block manager address and the per-reducer output sizes of the task, passed on to the reduce tasks.
 *
 * The mapStatuses map that holds these objects behaves differently on the driver and the executors:
 * 1) on the driver it records the map outputs of the ShuffleMapTasks;
 * 2) on the executors it is merely a cache, and a miss triggers a fetch of the HashMap data from the driver.
 */
private[spark] sealed trait MapStatus {
/** Location where this task was run. */
def location: BlockManagerId
/**
 * Estimated size for the reduce block, in bytes.
 * If a block is non-empty, then this method MUST return a non-zero size. This invariant is
 * necessary for correctness, since block fetchers are allowed to skip zero-size blocks.
 */
def getSizeForBlock(reduceId: Int): Long
}
private[spark] object MapStatus {
/**
 * Spark uses different data structures to record shuffle information depending on whether the
 * number of partitions is below or above 2000.
 * With more than 2000 partitions it uses HighlyCompressedMapStatus, a more efficient [compressed]
 * structure; below that, CompressedMapStatus is used.
 * So if your partition count falls just short of 2000, you can safely bump it above 2000 to get
 * the more compact encoding.
 */
def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
if (uncompressedSizes.length > 2000) {
HighlyCompressedMapStatus(loc, uncompressedSizes)
} else {
new CompressedMapStatus(loc, uncompressedSizes)
}
}
...
}
}
/**
 * A [[MapStatus]] implementation that tracks the size of each block. Size for each block is
 * represented using a single byte.
 *
 * @param loc location where the task is being executed.
 * @param compressedSizes size of the blocks, indexed by reduce partition id.
 */
private[spark] class CompressedMapStatus(
private[this] var loc: BlockManagerId,
private[this] var compressedSizes: Array[Byte])
extends MapStatus with Externalizable {
protected def this() = this(null, null.asInstanceOf[Array[Byte]]) // For deserialization only
def this(loc: BlockManagerId, uncompressedSizes: Array[Long]) {
this(loc, uncompressedSizes.map(MapStatus.compressSize))
}
override def location: BlockManagerId = loc
...
}
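The secondary constructor above calls MapStatus.compressSize, which packs each block size into one byte on a log base 1.1 scale, so the stored sizes are lossy estimates. A sketch of that encoding, reconstructed from the companion object (sizes above roughly 35 GB saturate at 255):

// Sketch of the one-byte log-1.1 encoding used by CompressedMapStatus
// (reconstructed from MapStatus.compressSize / decompressSize; treat as illustrative).
object MapSizeCodec { // hypothetical wrapper name, only to keep the snippet self-contained
  private val LOG_BASE = 1.1

  def compressSize(size: Long): Byte = {
    if (size == 0) {
      0
    } else if (size <= 1L) {
      1
    } else {
      // ceil(log_{1.1}(size)), capped at 255; lossy, but a non-empty block never encodes to 0.
      math.min(255, math.ceil(math.log(size) / math.log(LOG_BASE)).toInt).toByte
    }
  }

  def decompressSize(compressedSize: Byte): Long = {
    if (compressedSize == 0) 0L else math.pow(LOG_BASE, compressedSize & 0xFF).toLong
  }
}

// e.g. MapSizeCodec.compressSize(1000L) == 73 and decompressSize(73) is roughly 1051 bytes:
// block fetchers only need an estimate, and a non-empty block is never rounded down to zero.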
/**
 * A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
 * plus a bitmap for tracking which blocks are empty.
 *
 * @param loc location where the task is being executed
 * @param numNonEmptyBlocks the number of non-empty blocks
 * @param emptyBlocks a bitmap tracking which blocks are empty
 * @param avgSize average size of the non-empty blocks
 */
private[spark] class HighlyCompressedMapStatus private (
private[this] var loc: BlockManagerId,
private[this] var numNonEmptyBlocks: Int,
private[this] var emptyBlocks: RoaringBitmap,
private[this] var avgSize: Long)
extends MapStatus with Externalizable {
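// Its size lookup is correspondingly coarse. Conceptually (a sketch of the idea behind its
// getSizeForBlock, continuing the class above): a block marked empty in the bitmap reports 0,
// and every other block reports the stored average size.
override def getSizeForBlock(reduceId: Int): Long = {
  if (emptyBlocks.contains(reduceId)) 0 else avgSize
}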