spark-core_20: Source code analysis of MapOutputTrackerMaster, MapOutputTracker, MapOutputTrackerMasterEndpoint, and related classes

1. SparkEnv.create() initializes MapOutputTrackerMaster (which records the output information of ShuffleMapTasks)

val mapOutputTracker = if (isDriver) {
  /* MapOutputTrackerMaster lives on the driver. It uses a TimeStampedHashMap to track map
   * output information, which also allows old entries to be cleaned up.
   * I. What MapOutputTracker does
   * 1. Tracks the output information of mappers so that reducers can fetch the data they need.
   * 2. Every mapper and reducer has its own unique identifier: mapId and reduceId.
   * 3. Each reducer may consume the output of several maps and fetches its block from each of
   *    them. This process is called shuffle, and each shuffle has its own shuffleId.
   */
  new MapOutputTrackerMaster(conf)
} else {
  // Runs inside an executor
  new MapOutputTrackerWorker(conf)
}

// Have to assign trackerActor after initialization as MapOutputTrackerActor
// requires the MapOutputTracker itself
// Once the MapOutputTracker exists, its trackerEndpoint member is set to a reference to
// MapOutputTrackerMasterEndpoint. MapOutputTracker.ENDPOINT_NAME has the value "MapOutputTracker".
mapOutputTracker.trackerEndpoint = registerOrLookupEndpoint(MapOutputTracker.ENDPOINT_NAME,
  new MapOutputTrackerMasterEndpoint(
    rpcEnv, mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], conf))
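2. What MapOutputTrackerMaster does during initialization

a. Enables data locality for reduce tasks by default: spark.shuffle.reduceLocality.enabled defaults to true

b. Limits the job size for which preferred locations are computed: locality preferences based on map output sizes are only assigned while the numbers of map and reduce tasks stay below 1000 (SHUFFLE_PREF_MAP_THRESHOLD and SHUFFLE_PREF_REDUCE_THRESHOLD), because computing the top locations by size becomes expensive for larger jobs

c. A possible tuning point: the fraction of the total map output that must sit at one location for it to count as a preferred location for a reduce task: private val REDUCER_PREF_LOCS_FRACTION = 0.2

d. On the driver, mapStatuses holds the MapStatus of every map task of each shuffle (the address of the block manager that stores the map output, plus the output size for each reducer). Timestamp-based hash maps store both mapStatuses and the cached serialized statuses: TimeStampedHashMap[Int, Array[Byte]]()

e. A possible tuning point: a MetadataCleaner is created to clean up mapStatuses and the cached serialized statuses; by default, if spark.cleaner.ttl is not set, nothing is ever cleaned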
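To make item c more concrete, here is a minimal sketch of how a fraction threshold can select preferred locations for one reducer. It is not Spark's implementation: the object `ReduceLocalitySketch`, the function `preferredLocations`, and the host names are hypothetical, and only the value 0.2 comes from the source.

```scala
object ReduceLocalitySketch {
  // Same value as REDUCER_PREF_LOCS_FRACTION; everything else here is illustrative.
  val Fraction = 0.2

  /** bytesByHost: bytes of map output destined for one reducer, grouped by the host storing them. */
  def preferredLocations(bytesByHost: Map[String, Long], fraction: Double = Fraction): Seq[String] = {
    val total = bytesByHost.values.sum
    if (total == 0L) Seq.empty
    else bytesByHost.collect {
      // keep a host only if it holds at least `fraction` of this reducer's input
      case (host, bytes) if bytes.toDouble / total >= fraction => host
    }.toSeq.sorted
  }
}
```

With a larger fraction fewer hosts qualify, which matches the trade-off described in the source comment: more data is read locally, but scheduling may wait longer if those hosts are busy.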

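/**
 * MapOutputTracker for the driver. This uses TimeStampedHashMap to keep track of map
 * output information, which allows old output information based on a TTL.
 *
 * MapOutputTrackerMaster lives on the driver. It uses TimeStampedHashMap to track map output
 * information, so stored entries that have become too old can be cleaned up.
 * It keeps all the map outputs that each shuffle needs ready, which speeds up handing the
 * map output information to the reducers of that shuffle.
 *
 * I. What MapOutputTracker (the parent class of MapOutputTrackerMaster and
 *    MapOutputTrackerWorker) does
 * 1. Tracks the output information of mappers so that reducers can fetch the data they need.
 * 2. Every mapper and reducer has its own unique identifier: mapId and reduceId.
 * 3. Each reducer may consume the output of several maps and fetches its block from each of
 *    them. This process is called shuffle, and each shuffle has its own shuffleId.
 *
 * II. MapOutputTrackerMaster and MapOutputTrackerWorker (which runs in the executor) both
 *     extend MapOutputTracker
 * 1. MapOutputTrackerMaster records the map output of the ShuffleMapTasks of every stage.
 *    a. Before a shuffle reader reads shuffle files, it asks MapOutputTrackerMaster where the
 *       data it has to process is located.
 *    b. The tracker replies with the locations of those map outputs (block manager address,
 *       port, and so on).
 * 2. MapOutputTrackerWorker only serves as a cache of that information while the shuffle runs.
 */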
private[spark] class MapOutputTrackerMaster(conf: SparkConf)
  extends MapOutputTracker(conf) {

  /** Cache a serialized version of the output statuses for each shuffle to send them out faster */
  private var cacheEpoch = epoch

  /** Whether to compute locality preferences for reduce tasks */
  private val shuffleLocalityEnabled = conf.getBoolean("spark.shuffle.reduceLocality.enabled", true)

  // Number of map and reduce tasks above which we do not assign preferred locations based on map
  // output sizes. We limit the size of jobs for which assign preferred locations as computing the
  // top locations by size becomes expensive.
  private val SHUFFLE_PREF_MAP_THRESHOLD = 1000
  // NOTE: This should be less than 2000 as we use HighlyCompressedMapStatus beyond that
  private val SHUFFLE_PREF_REDUCE_THRESHOLD = 1000

  // Fraction of total map output that must be at a location for it to considered as a preferred
  // location for a reduce task. Making this larger will focus on fewer locations where most data
  // can be read locally, but may lead to more delay in scheduling if those locations are busy.
  private val REDUCER_PREF_LOCS_FRACTION = 0.2

  /**
   * Timestamp based HashMap for storing mapStatuses and cached serialized statuses in the driver,
   * so that statuses are dropped only by explicit de-registering or by TTL-based cleaning (if set).
   * Other than these two scenarios, nothing should be dropped from this HashMap.
   *
   * The key of each TimeStampedHashMap is the shuffle ID; every value is stored together with
   * its insertion timestamp. A MapStatus is the result that a ShuffleMapTask returns to the
   * scheduler: it contains the address of the block manager the task ran on and the output
   * size for each reducer, and it is passed on to the reduce tasks.
   */
  protected val mapStatuses = new TimeStampedHashMap[Int, Array[MapStatus]]()
  private val cachedSerializedStatuses = new TimeStampedHashMap[Int, Array[Byte]]()

  // For cleaning up TimeStampedHashMaps. Periodically removes old entries from mapStatuses and
  // cachedSerializedStatuses. If spark.cleaner.ttl is not set, nothing is cleaned (a possible
  // tuning point).
  private val metadataCleaner =
    new MetadataCleaner(MetadataCleanerType.MAP_OUTPUT_TRACKER, this.cleanup, conf)
  // Registers a new shuffle in mapStatuses; the parameters are the shuffle ID and the number of maps
  def registerShuffle(shuffleId: Int, numMaps: Int) {
    if (mapStatuses.put(shuffleId, new Array[MapStatus](numMaps)).isDefined) {
      throw new IllegalArgumentException("Shuffle ID " + shuffleId + " registered twice")
    }
  }
  // Looks up the Array[MapStatus] for the shuffle ID in mapStatuses and stores the MapStatus
  // at the index given by mapId
  def registerMapOutput(shuffleId: Int, mapId: Int, status: MapStatus) {
    // mapStatuses: TimeStampedHashMap[Int, Array[MapStatus]]()
    val array = mapStatuses(shuffleId)
    array.synchronized {
      array(mapId) = status
    }
  }

。。。。。
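The two registration methods above can be tried out in isolation with a small stand-in. This is only a sketch under assumptions: `SimpleStatus`, `SimpleTrackerMaster`, and `locations` are hypothetical names, a plain ConcurrentHashMap replaces TimeStampedHashMap, and `putIfAbsent` is used so that a duplicate registration leaves the existing array untouched (the original `put` overwrites it before throwing).

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for MapStatus: where the map task ran and the bytes for each reducer.
final case class SimpleStatus(host: String, sizes: Array[Long])

class SimpleTrackerMaster {
  private val statuses = new ConcurrentHashMap[Int, Array[SimpleStatus]]()

  def registerShuffle(shuffleId: Int, numMaps: Int): Unit = {
    // putIfAbsent returns the previous value, or null if the key was free
    if (statuses.putIfAbsent(shuffleId, new Array[SimpleStatus](numMaps)) != null) {
      throw new IllegalArgumentException(s"Shuffle ID $shuffleId registered twice")
    }
  }

  // Assumes registerShuffle was called for this shuffleId beforehand.
  def registerMapOutput(shuffleId: Int, mapId: Int, status: SimpleStatus): Unit = {
    val array = statuses.get(shuffleId)
    array.synchronized { array(mapId) = status }
  }

  /** Hosts holding output for the shuffle; None marks a map task that has not reported yet. */
  def locations(shuffleId: Int): Seq[Option[String]] = {
    val array = statuses.get(shuffleId)
    array.synchronized { array.toSeq.map(s => Option(s).map(_.host)) }
  }
}
```

The array is pre-sized with numMaps slots, so each map task writes its status at its own mapId index and unreported tasks remain null.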

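3. What MapOutputTracker itself does:

a. Exposes the trackerEndpoint member so that SparkEnv can assign the MapOutputTrackerMasterEndpoint reference to it during initialization

mapOutputTracker.trackerEndpoint = registerOrLookupEndpoint(MapOutputTracker.ENDPOINT_NAME,
  new MapOutputTrackerMasterEndpoint(
    rpcEnv, mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], conf))

b. Declares mapStatuses, which behaves differently on the driver and on the executors
1) On the driver, it is the source of the map outputs recorded from ShuffleMapTasks
2) On the executors, it is simply a cache; a cache miss triggers a fetch from the driver's corresponding HashMap

c. Initializes epoch, which is incremented every time a fetch fails so that client nodes know to clear their cached map output locations

d. private val fetching = new HashSet[Int] remembers the shuffle IDs whose map output locations are currently being fetched on an executor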
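The role of epoch in item c can be illustrated with a small executor-side cache that is emptied whenever a newer epoch arrives. This is a simplified sketch, not the Spark class: `EpochCache` and its methods are hypothetical, and locations are plain strings.

```scala
import scala.collection.mutable

class EpochCache {
  private val epochLock = new AnyRef
  private var epoch: Long = 0L
  // shuffleId -> cached locations (plain strings keep the sketch small)
  private val cache = mutable.HashMap[Int, Seq[String]]()

  def put(shuffleId: Int, locations: Seq[String]): Unit =
    epochLock.synchronized { cache(shuffleId) = locations }

  def get(shuffleId: Int): Option[Seq[String]] =
    epochLock.synchronized { cache.get(shuffleId) }

  /** Called with the epoch announced by the driver; a newer epoch invalidates the whole cache. */
  def updateEpoch(newEpoch: Long): Unit = epochLock.synchronized {
    if (newEpoch > epoch) {
      epoch = newEpoch
      cache.clear()
    }
  }
}
```

Because the driver bumps the epoch after a fetch failure, an executor that sees the higher value drops possibly stale locations and asks the driver again on the next lookup.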

/**
 * Class that keeps track of the location of the map output of
 * a stage. This is abstract because different versions of MapOutputTracker
 * (driver and executor) use different HashMap to store its metadata.
 *
 * 1. On the master, the map records where the map outputs of the ShuffleMapTasks are located;
 *    on the workers, it only serves as a cache used while the shuffle runs.
 *
 * 2. MapOutputTrackerMaster and MapOutputTrackerWorker both extend MapOutputTracker.
 *
 * A summary from another reader:
 * MapOutputTracker is one of the important components initialized by SparkEnv and has a
 * master-slave structure.
 * It tracks the output locations of ShuffleMapTasks (where each ShuffleMapTask writes its output).
 * Before a shuffle reader reads shuffle files, it asks MapOutputTrackerMaster where the data it
 * has to process is located.
 * The tracker answers with the locations of the map outputs (address, port, and so on).
 * The shuffle reader then starts reading the files for further processing.
 */

private[spark] abstract class MapOutputTracker(conf: SparkConf) extends Logging {

  /** Set to the MapOutputTrackerMasterEndpoint living on the driver.
    * trackerEndpoint holds a reference to MapOutputTrackerMasterEndpoint. It is assigned in
    * SparkEnv.create, right after the tracker has been instantiated:
    * mapOutputTracker.trackerEndpoint = registerOrLookupEndpoint(.. new MapOutputTrackerMasterEndpoint(
    *     ..., mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], ..) */
  var trackerEndpoint: RpcEndpointRef = _

  /**
   * This HashMap has different behavior for the driver and the executors.
   *
   * On the driver, it serves as the source of map outputs recorded from ShuffleMapTasks.
   * On the executors, it simply serves as a cache, in which a miss triggers a fetch from the
   * driver's corresponding HashMap.
   *
   * Note: because mapStatuses is accessed concurrently, subclasses should make sure it's a
   * thread-safe map.
   */
  protected val mapStatuses: Map[Int, Array[MapStatus]]

  /**
   * Incremented every time a fetch fails so that client nodes know to clear
   * their cache of map output locations if this happens.
   */
  protected var epoch: Long = 0
  protected val epochLock = new AnyRef

  /** Remembers which map output locations are currently being fetched on an executor. */
  private val fetching = new HashSet[Int]

。。。。。

 

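5. MapOutputTrackerMasterEndpoint does very little at initialization: it only reads the maximum frame size, so that mapOutputStatuses.length may not exceed 128 MB (the default); a larger reply is reported as an error. (In version 2.2 this is, as far as I can see, implemented differently: the endpoint simply builds a GetMapOutputMessage(shuffleId, RpcCallContext) and hands it to MapOutputTrackerMaster.)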

/** RpcEndpoint class for MapOutputTrackerMaster.
  * A subclass of RpcEndpoint that may be called from multiple threads. It takes the
  * MapOutputTrackerMaster as a constructor argument and is created in SparkEnv.create. */
private[spark] class MapOutputTrackerMasterEndpoint(
    override val rpcEnv: RpcEnv, tracker: MapOutputTrackerMaster, conf: SparkConf)
  extends RpcEndpoint with Logging {
  // Returns the configured maximum frame size for Akka messages in bytes; the default is 128 MB
  val maxAkkaFrameSize = AkkaUtils.maxFrameSizeBytes(conf)

  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case GetMapOutputStatuses(shuffleId: Int) =>
      val hostPort = context.senderAddress.hostPort
      logInfo("Asked to send map output locations for shuffle " + shuffleId + " to " + hostPort)
      val mapOutputStatuses = tracker.getSerializedMapOutputStatuses(shuffleId)
      val serializedSize = mapOutputStatuses.length
      if (serializedSize > maxAkkaFrameSize) {
        val msg = s"Map output statuses were $serializedSize bytes which " +
          s"exceeds spark.akka.frameSize ($maxAkkaFrameSize bytes)."

        /* For SPARK-1244 we'll opt for just logging an error and then sending it to the sender.
         * A bigger refactoring (SPARK-1239) will ultimately remove this entire code path. */
        val exception = new SparkException(msg)
        logError(msg, exception)
        context.sendFailure(exception)
      } else {
        context.reply(mapOutputStatuses)
      }
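The size check in receiveAndReply boils down to comparing the serialized length with a limit and answering with either the payload or an error. A minimal sketch of that decision, assuming a hypothetical `FrameSizeCheck.reply` helper that returns an Either instead of calling an RPC context:

```scala
object FrameSizeCheck {
  /** Right(bytes) if the payload fits into one frame, Left(error message) otherwise. */
  def reply(serialized: Array[Byte], maxFrameSizeBytes: Int): Either[String, Array[Byte]] = {
    val size = serialized.length
    if (size > maxFrameSizeBytes) {
      Left(s"Map output statuses were $size bytes which exceeds the frame size ($maxFrameSizeBytes bytes).")
    } else {
      Right(serialized)
    }
  }
}
```

With the 128 MB default mentioned above, the limit would be 128 * 1024 * 1024 = 134217728 bytes; a payload of exactly that size still passes because the comparison is strict.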

6. What MetadataCleaner does at initialization; this class is also used by BlockManager

/**
 * Runs a timer task to periodically clean up metadata (e.g. old files or hashtable entries)
 * On the path from SparkEnv initialization, cleanerType is MetadataCleanerType.MAP_OUTPUT_TRACKER.
 */
private[spark] class MetadataCleaner(
    cleanerType: MetadataCleanerType.MetadataCleanerType,
    cleanupFunc: (Long) => Unit,
    conf: SparkConf)
  extends Logging
{
  val name = cleanerType.toString
  // With default settings, getDelaySeconds resolves spark.cleaner.ttl.MAP_OUTPUT_TRACKER to -1
  private val delaySeconds = MetadataCleaner.getDelaySeconds(conf, cleanerType)
  // math.max(10, -1 / 10) ==> 10 seconds
  private val periodSeconds = math.max(10, delaySeconds / 10)
  private val timer = new Timer(name + " cleanup timer", true)

  private val task = new TimerTask {
    override def run() {
      try {
        cleanupFunc(System.currentTimeMillis() - (delaySeconds * 1000))
        logInfo("Ran metadata cleaner for " + name)
      } catch {
        case e: Exception => logError("Error running cleanup task for " + name, e)
      }
    }
  }

  // spark.cleaner.ttl defaults to -1, so the delaySeconds resolved for
  // spark.cleaner.ttl.MAP_OUTPUT_TRACKER is -1 and by default the timer never cleans anything
  if (delaySeconds > 0) {
    logDebug(
      "Starting metadata cleaner for " + name + " with delay of " + delaySeconds + " seconds " +
      "and period of " + periodSeconds + " secs")
    timer.schedule(task, delaySeconds * 1000, periodSeconds * 1000)
  }

  def cancel() {
    timer.cancel()
  }
}

private[spark] object MetadataCleanerType extends Enumeration {

  val MAP_OUTPUT_TRACKER, SPARK_CONTEXT, HTTP_BROADCAST, BLOCK_MANAGER,
  SHUFFLE_BLOCK_MANAGER, BROADCAST_VARS = Value

  type MetadataCleanerType = Value

  // On the path from SparkEnv initialization the type is MetadataCleanerType.MAP_OUTPUT_TRACKER,
  // so the resulting key is spark.cleaner.ttl.MAP_OUTPUT_TRACKER
  def systemProperty(which: MetadataCleanerType.MetadataCleanerType): String = {
    "spark.cleaner.ttl." + which.toString
  }
}

// TODO: This mutates a Conf to set properties right now, which is kind of ugly when used in the
// initialization of StreamingContext. It's okay for users trying to configure stuff themselves.
private[spark] object MetadataCleaner {
  /** spark.cleaner.ttl:
    * Duration (in seconds) for which Spark remembers any metadata (stages generated, tasks
    * generated, and so on). Periodic cleanups ensure that metadata older than this duration
    * is forgotten.
    * This is useful when Spark runs for many hours or days (for example, a Spark Streaming
    * application running 24/7). Note that any RDD that stays in memory for longer than this
    * duration is cleared as well.
    * The default is -1.
    */
  def getDelaySeconds(conf: SparkConf): Int = {
    conf.getTimeAsSeconds("spark.cleaner.ttl", "-1").toInt
  }

  def getDelaySeconds(
      conf: SparkConf,
      cleanerType: MetadataCleanerType.MetadataCleanerType): Int = {
    // With default settings, the lookup of spark.cleaner.ttl.MAP_OUTPUT_TRACKER falls back to
    // spark.cleaner.ttl and returns -1
    conf.get(MetadataCleanerType.systemProperty(cleanerType), getDelaySeconds(conf).toString).toInt
  }

  def setDelaySeconds(
      conf: SparkConf,
      cleanerType: MetadataCleanerType.MetadataCleanerType,
      delay: Int) {
    conf.set(MetadataCleanerType.systemProperty(cleanerType), delay.toString)
  }
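The way the delay and period are derived can be reproduced with a plain Map in place of SparkConf. This is a sketch under assumptions: `CleanerTtlSketch` and its function names are hypothetical, and unlike getTimeAsSeconds it only accepts plain integer strings.

```scala
object CleanerTtlSketch {
  /** The per-type key wins; otherwise fall back to the global key; otherwise -1 (disabled). */
  def delaySeconds(conf: Map[String, String], cleanerType: String): Int =
    conf.get(s"spark.cleaner.ttl.$cleanerType")
      .orElse(conf.get("spark.cleaner.ttl"))
      .getOrElse("-1").toInt

  /** Period between runs, mirroring math.max(10, delaySeconds / 10) from the listing. */
  def periodSeconds(delay: Int): Int = math.max(10, delay / 10)

  /** The timer is only scheduled for a positive delay. */
  def isScheduled(delay: Int): Boolean = delay > 0
}
```

For example, an unset configuration yields a delay of -1 and no scheduled timer, while spark.cleaner.ttl=3600 yields a delay of 3600 seconds and a period of 360 seconds.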

===> If spark.cleaner.ttl is set, the cleanup method is called

/**
  * Removes, at the scheduled time, the key-value entries of
  * mapStatuses: TimeStampedHashMap[Int, Array[MapStatus]] and
  * cachedSerializedStatuses: TimeStampedHashMap[Int, Array[Byte]].
  * spark.cleaner.ttl defaults to -1, so the delaySeconds resolved for
  * spark.cleaner.ttl.MAP_OUTPUT_TRACKER is also -1 and by default the timer cleans nothing.
  */
private def cleanup(cleanupTime: Long) {
  mapStatuses.clearOldValues(cleanupTime)
  cachedSerializedStatuses.clearOldValues(cleanupTime)
}

===> The cleanup itself is simple: iterate over the entries of the underlying ConcurrentHashMap and remove every entry whose timestamp is earlier than the given time

 

private[spark] case class TimeStampedValue[V](value: V, timestamp: Long)

/**
 * This is a custom implementation of scala.collection.mutable.Map which stores the insertion
 * timestamp along with each key-value pair. If specified, the timestamp of each pair can be
 * updated every time it is accessed. Key-value pairs whose timestamp is older than a particular
 * threshold time can then be removed using the clearOldValues method.
 * The pairs themselves are held in a java.util.concurrent.ConcurrentHashMap.
 */
private[spark] class TimeStampedHashMap[A, B](updateTimeStampOnGet: Boolean = false)
  extends mutable.Map[A, B]() with Logging {
  // The backing store is a ConcurrentHashMap
  private val internalMap = new ConcurrentHashMap[A, TimeStampedValue[B]]()

  def getEntrySet: Set[Entry[A, TimeStampedValue[B]]] = internalMap.entrySet

  override def size: Int = internalMap.size

  override def foreach[U](f: ((A, B)) => U) {
    // The Set[Entry[A, TimeStampedValue[B]]] view of the ConcurrentHashMap
    val it = getEntrySet.iterator
    while (it.hasNext) {
      val entry = it.next()
      val kv = (entry.getKey, entry.getValue.value)
      f(kv)
    }
  }
...

  def clearOldValues(threshTime: Long, f: (A, B) => Unit) {
    // The Set[Entry[A, TimeStampedValue[B]]] view of the ConcurrentHashMap
    val it = getEntrySet.iterator
    while (it.hasNext) {
      val entry = it.next()
      // Remove every pair whose timestamp is earlier than threshTime
      if (entry.getValue.timestamp < threshTime) {
        f(entry.getKey, entry.getValue.value)
        logDebug("Removing key " + entry.getKey)
        it.remove() // remove() may be called at most once per call to next()
      }
    }
  }

  /** Removes old key-value pairs that have timestamp earlier than `threshTime`. */
  def clearOldValues(threshTime: Long) {
    clearOldValues(threshTime, (_, _) => ())
  }
...
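The timestamp-and-sweep idea can be reduced to a few lines. The following is an illustrative sketch, not the Spark class: `Stamped` and `StampedMap` are hypothetical names, and the timestamp is passed in explicitly instead of being read from the system clock so that the behaviour is easy to check.

```scala
import java.util.concurrent.ConcurrentHashMap

final case class Stamped[V](value: V, timestamp: Long)

class StampedMap[K, V] {
  private val internal = new ConcurrentHashMap[K, Stamped[V]]()

  // The caller supplies the insertion time (an assumption made for testability).
  def put(key: K, value: V, now: Long): Unit = {
    internal.put(key, Stamped(value, now))
  }

  def get(key: K): Option[V] = Option(internal.get(key)).map(_.value)

  def size: Int = internal.size()

  /** Removes every entry whose timestamp is strictly earlier than threshTime. */
  def clearOldValues(threshTime: Long): Unit = {
    val it = internal.entrySet().iterator()
    while (it.hasNext) {
      val entry = it.next()
      // Iterator.remove() deletes the entry returned by the preceding next()
      if (entry.getValue.timestamp < threshTime) it.remove()
    }
  }
}
```

A cleaner would call `clearOldValues(now - ttlMillis)`, which is what the MetadataCleaner task does with `System.currentTimeMillis() - (delaySeconds * 1000)`.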

7. Finally, a look at MapStatus: depending on the number of partitions, it picks a different implementation class to store the output of a ShuffleMapTask

/**
 * Result returned by a ShuffleMapTask to a scheduler. Includes the block manager address that the
 * task ran on as well as the sizes of outputs for each reducer, for passing on to the reduce tasks.
 *
 * mapStatuses behaves differently on the driver and on the executors:
 * 1) On the driver, it is the source of the map outputs recorded from ShuffleMapTasks.
 * 2) On the executors, it is simply a cache; a miss triggers a fetch from the driver's HashMap.
 */
private[spark] sealed trait MapStatus {
  /** Location where this task was run. */
  def location: BlockManagerId

  /**
   * Estimated size for the reduce block, in bytes.
   *
   * If a block is non-empty, then this method MUST return a non-zero size.  This invariant is
   * necessary for correctness, since block fetchers are allowed to skip zero-size blocks.
   */
  def getSizeForBlock(reduceId: Int): Long
}

private[spark] object MapStatus {
  /**
   * Spark uses different data structures to record shuffle information depending on whether
   * there are more than 2000 partitions.
   * Above 2000 partitions it uses HighlyCompressedMapStatus, a more compact structure.
   * At 2000 partitions or fewer it uses CompressedMapStatus.
   * So if your partition count is close to but below 2000, you can safely raise it above 2000
   * to get the more compact representation.
   */
  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
...
}


/**
 * A [[MapStatus]] implementation that tracks the size of each block. Size for each block is
 * represented using a single byte.
 *
 * @param loc location where the task is being executed.
 * @param compressedSizes size of the blocks, indexed by reduce partition id.
 */
private[spark] class CompressedMapStatus(
    private[this] var loc: BlockManagerId,
    private[this] var compressedSizes: Array[Byte])
  extends MapStatus with Externalizable {

  protected def this() = this(null, null.asInstanceOf[Array[Byte]])  // For deserialization only

  def this(loc: BlockManagerId, uncompressedSizes: Array[Long]) {
    this(loc, uncompressedSizes.map(MapStatus.compressSize))
  }

  override def location: BlockManagerId = loc
...
}
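How a Long size can fit into a single byte is not shown in the excerpt, since the body of MapStatus.compressSize is elided. One common approach is a logarithmic encoding; the sketch below assumes base 1.1 and hypothetical names (`SizeCodecSketch`, `compress`, `decompress`). To my knowledge Spark uses a similar scheme, but treat the details as an assumption rather than as the actual source.

```scala
object SizeCodecSketch {
  // Assumed base; a smaller base gives finer resolution but a lower maximum size.
  private val LogBase = 1.1

  /** 0 stays 0, anything else becomes ceil(log_1.1(size)), capped at 255. */
  def compress(size: Long): Byte =
    if (size == 0L) 0.toByte
    else if (size <= 1L) 1.toByte
    else math.min(255, math.ceil(math.log(size.toDouble) / math.log(LogBase)).toInt).toByte

  /** Decodes to an approximation of the original size; the byte is read as unsigned. */
  def decompress(b: Byte): Long =
    if (b == 0) 0L
    else math.pow(LogBase, (b & 0xFF).toDouble).toLong
}
```

The encoding is lossy, but a non-empty block never encodes to 0, which preserves the invariant that getSizeForBlock returns a non-zero size for non-empty blocks.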

/**
 * A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
 * plus a bitmap for tracking which blocks are empty.
 *
 * @param loc location where the task is being executed
 * @param numNonEmptyBlocks the number of non-empty blocks
 * @param emptyBlocks a bitmap tracking which blocks are empty
 * @param avgSize average size of the non-empty blocks
 */
private[spark] class HighlyCompressedMapStatus private (
    private[this] var loc: BlockManagerId,
    private[this] var numNonEmptyBlocks: Int,
    private[this] var emptyBlocks: RoaringBitmap,
    private[this] var avgSize: Long)
  extends MapStatus with Externalizable {
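The average-plus-bitmap idea can be sketched without any external dependency. In this illustration java.util.BitSet stands in for RoaringBitmap, and `AvgSizeStatus` is a hypothetical name; it is not the Spark class, only a small model of the same trade-off.

```scala
import java.util.BitSet

// Assumption: java.util.BitSet replaces RoaringBitmap to keep the sketch self-contained.
final class AvgSizeStatus private (emptyBlocks: BitSet, avgSize: Long) {
  /** Empty blocks report 0; every non-empty block reports the shared average. */
  def getSizeForBlock(reduceId: Int): Long =
    if (emptyBlocks.get(reduceId)) 0L else avgSize
}

object AvgSizeStatus {
  def apply(uncompressedSizes: Array[Long]): AvgSizeStatus = {
    val empty = new BitSet(uncompressedSizes.length)
    var total = 0L
    var nonEmpty = 0
    for (i <- uncompressedSizes.indices) {
      if (uncompressedSizes(i) > 0L) {
        total += uncompressedSizes(i)
        nonEmpty += 1
      } else {
        empty.set(i) // remember that block i is empty
      }
    }
    val avg = if (nonEmpty > 0) total / nonEmpty else 0L
    new AvgSizeStatus(empty, avg)
  }
}
```

The memory cost no longer grows by one byte per reducer, at the price of losing the individual block sizes: sizes 10 and 30 are both reported as 20.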

