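For a long time I didn't know how the driver receives heartbeats from executors; after reading the source code I finally understand.
First, SparkContext initializes HeartbeatReceiver. HeartbeatReceiver runs in the driver and receives heartbeats from executors…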
/**
* rpcEnv is an abstract class;
* env.rpcEnv.setupEndpoint actually calls NettyRpcEnv's setupEndpoint method.
* This registers the endpoint that receives executor heartbeats on the driver side.
*/
_heartbeatReceiver = env.rpcEnv.setupEndpoint(
HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))
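// What does this line do? ask sends the message only once and never retries. _heartbeatReceiver was initialized by the code above.
// TaskSchedulerIsSet is just an object that signals a "task scheduler has been set" event; it is handled by HeartbeatReceiver's receiveAndReply method.
// The net effect is that HeartbeatReceiver stores the SparkContext's _taskScheduler in its own scheduler field.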
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
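Below is the method that sends heartbeats. From the source you can see that it is called once the Executor has been created successfully.
// Start the thread that sends heartbeats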
startDriverHeartbeater()
/**
* Schedules a task to report heartbeat and partial metrics for active tasks to driver.
*
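* The executor heartbeat interval is configured by spark.executor.heartbeatInterval and defaults to 10s.
* (Notes for older, Akka-based releases also mention a 30-second timeout, 3 retries with a 3000 ms
* retry interval, and looking up the receiver via actorSystem.actorSelection(url) with url
* akka.tcp://sparkDriver@$driverHost:$driverPort/user/heartbeatReceiver; the code below goes through
* RpcEnv instead.) With the default interval the first heartbeat is delayed by a random 10000-20000 ms,
* after which the task runs once per interval. Each run collects the latest metrics of the tasks in
* runningTasks, wraps them together with executorId and blockManagerId into a Heartbeat message, and
* sends that message to HeartbeatReceiver.
*
* What is this heartbeat thread for? Two things:
* 1. Update the metrics of the tasks currently being processed;
* 2. Tell the BlockManagerMaster that the BlockManager on this Executor is still alive.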
*
*/
private def startDriverHeartbeater(): Unit = {
val intervalMs = conf.getTimeAsMs("spark.executor.heartbeatInterval", "10s")
// Wait a random interval so the heartbeats don't end up in sync
val initialDelay = intervalMs + (math.random * intervalMs).asInstanceOf[Int]
val heartbeatTask = new Runnable() {
// reportHeartBeat() reports heartbeat and metrics for active tasks to the driver.
override def run(): Unit = Utils.logUncaughtExceptions(reportHeartBeat())
}
// A periodic task; with the default interval it runs every 10 seconds
heartbeater.scheduleAtFixedRate(heartbeatTask, initialDelay, intervalMs, TimeUnit.MILLISECONDS)
}
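To convince myself of what the random initial delay does, here is a minimal sketch outside of Spark. The object name HeartbeatJitterSketch, the helper initialDelayMs and the 200 ms interval are my own assumptions for illustration; only scheduleAtFixedRate and the "interval plus a random fraction of the interval" idea come from the code above.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.util.Random

object HeartbeatJitterSketch {
  // Hypothetical helper: one full interval plus a random fraction of an
  // interval, so the result lies in [intervalMs, 2 * intervalMs).
  def initialDelayMs(intervalMs: Long, rng: Random): Long =
    intervalMs + (rng.nextDouble() * intervalMs).toLong

  def main(args: Array[String]): Unit = {
    val intervalMs = 200L // stand-in for spark.executor.heartbeatInterval
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val task = new Runnable {
      override def run(): Unit =
        println(s"heartbeat at ${System.currentTimeMillis()}")
    }
    // First run after the jittered delay, then once per interval.
    scheduler.scheduleAtFixedRate(
      task, initialDelayMs(intervalMs, new Random()), intervalMs, TimeUnit.MILLISECONDS)
    Thread.sleep(5 * intervalMs)
    scheduler.shutdownNow()
  }
}
```

Because every executor draws its own random offset, executors that start at the same moment should not all hit the driver in the same instant.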
/** Reports heartbeat and metrics for active tasks to the driver.
* */
private def reportHeartBeat(): Unit = {
// list of (task id, accumUpdates) to send back to the driver
val accumUpdates = new ArrayBuffer[(Long, Seq[AccumulatorV2[_, _]])]()
// Total time this JVM process has spent in garbage collection.
val curGCTime = computeTotalGcTime()
// Loop over each running task and pair its task id with its metrics
for (taskRunner <- runningTasks.values().asScala) {
if (taskRunner.task != null) {
// Merge the values of all temporary [[ShuffleReadMetrics]] into `_shuffleReadMetrics`.
taskRunner.task.metrics.mergeShuffleReadMetrics()
// GC time elapsed since this task started
taskRunner.task.metrics.setJvmGCTime(curGCTime - taskRunner.startGCTime)
accumUpdates += ((taskRunner.taskId, taskRunner.task.metrics.accumulators()))
}
}
// Build a Heartbeat message
val message = Heartbeat(executorId, accumUpdates.toArray, env.blockManager.blockManagerId)
try {
// Send the Heartbeat message to HeartbeatReceiver's receiveAndReply() method and wait for the HeartbeatResponse
val response = heartbeatReceiverRef.askSync[HeartbeatResponse](
message, RpcTimeout(conf, "spark.executor.heartbeatInterval", "10s"))
// Heartbeat succeeded; re-register the BlockManager if the driver asks for it
if (response.reregisterBlockManager) {
logInfo("Told to re-register on heartbeat")
env.blockManager.reregister()
}
// Heartbeat succeeded, so reset the failure count to zero
heartbeatFailures = 0
} catch {
// Heartbeat failed, so increment the failure count
case NonFatal(e) =>
logWarning("Issue communicating with driver in heartbeater", e)
heartbeatFailures += 1
if (heartbeatFailures >= HEARTBEAT_MAX_FAILURES) {
logError(s"Exit as unable to send heartbeats to driver " +
s"more than $HEARTBEAT_MAX_FAILURES times")
System.exit(ExecutorExitCode.HEARTBEAT_FAILURE)
}
}
}
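The failure handling in reportHeartBeat boils down to a counter of consecutive failures that any success resets. A minimal sketch of just that bookkeeping follows, assuming a hypothetical FailureTracker class of my own (Spark keeps the counter as a plain field on Executor, and it calls System.exit where this sketch only returns true).

```scala
import scala.util.control.NonFatal

object HeartbeatFailureSketch {
  // Hypothetical tracker; maxFailures plays the role of HEARTBEAT_MAX_FAILURES.
  final class FailureTracker(maxFailures: Int) {
    private var failures = 0

    def currentFailures: Int = failures

    /** Runs one heartbeat attempt; returns true when the caller should give up. */
    def attempt(send: () => Unit): Boolean =
      try {
        send()
        failures = 0 // any success clears the streak
        false
      } catch {
        case NonFatal(_) =>
          failures += 1
          failures >= maxFailures
      }
  }
}
```

Note that only consecutive failures count: two failures followed by one success start again from zero.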
// Send the Heartbeat message to HeartbeatReceiver's receiveAndReply() method and wait for the HeartbeatResponse
val response = heartbeatReceiverRef.askSync[HeartbeatResponse](
message, RpcTimeout(conf, "spark.executor.heartbeatInterval", "10s"))
This message is handled in HeartbeatReceiver's receiveAndReply() method:
// Messages received from executors
case heartbeat @ Heartbeat(executorId, accumUpdates, blockManagerId) =>
// If the scheduler has been set, accept the heartbeat
if (scheduler != null) {
// executorLastSeen maps executor id -> timestamp of the last heartbeat received from that executor; if it contains this executor, the executor has been healthy since its previous heartbeat
if (executorLastSeen.contains(executorId)) {
// Update this executor's last heartbeat time
executorLastSeen(executorId) = clock.getTimeMillis()
eventLoopThread.submit(new Runnable {
// Run the given block; log non-fatal errors and only rethrow fatal ones
override def run(): Unit = Utils.tryLogNonFatalError {
// Update metrics for in-progress tasks and let the master know that the BlockManager is still alive.
val unknownExecutor = !scheduler.executorHeartbeatReceived(
executorId, accumUpdates, blockManagerId)
// Build the HeartbeatResponse
val response = HeartbeatResponse(reregisterBlockManager = unknownExecutor)
// Who handles this response?
context.reply(response)
}
})
// If executorLastSeen does not contain this executor, it may not exist yet,
// may not have registered, or may already be dead
} else {
// This may happen if we get an executor's in-flight heartbeat immediately
// after we just removed it. It's not really an error condition so we should
// not log warning here. Otherwise there may be a lot of noise especially if
// we explicitly remove executors (SPARK-4134).
logDebug(s"Received heartbeat from unknown executor $executorId")
context.reply(HeartbeatResponse(reregisterBlockManager = true))
}
// If the scheduler has not been set yet, nobody can accept the heartbeat
} else {
// Because Executor will sleep several seconds before sending the first "Heartbeat", this
// case rarely happens. However, if it really happens, log it and ask the executor to
// register itself again.
logWarning(s"Dropping $heartbeat because TaskScheduler is not ready yet")
context.reply(HeartbeatResponse(reregisterBlockManager = true))
}
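The three branches above are easier to see once the RPC and threading details are stripped away. This is only a sketch under my own assumptions: Receiver, Response, register and the knownToBlockManagerMaster callback are hypothetical names, and the real code does the scheduler call on a separate event loop thread rather than inline.

```scala
import scala.collection.mutable

object HeartbeatReceiverSketch {
  final case class Response(reregisterBlockManager: Boolean)

  // Hypothetical receiver; `now` stands in for clock.getTimeMillis().
  final class Receiver(now: () => Long) {
    private val executorLastSeen = mutable.HashMap.empty[String, Long]
    private var schedulerReady = false

    def setSchedulerReady(): Unit = { schedulerReady = true }
    def register(executorId: String): Unit = { executorLastSeen(executorId) = now() }
    def lastSeen(executorId: String): Option[Long] = executorLastSeen.get(executorId)

    def onHeartbeat(
        executorId: String,
        knownToBlockManagerMaster: String => Boolean): Response =
      if (!schedulerReady) {
        // Scheduler not set yet: drop the heartbeat and ask for re-registration.
        Response(reregisterBlockManager = true)
      } else if (executorLastSeen.contains(executorId)) {
        // Known executor: refresh its timestamp, then let the scheduler decide.
        executorLastSeen(executorId) = now()
        Response(reregisterBlockManager = !knownToBlockManagerMaster(executorId))
      } else {
        // Unknown executor, e.g. a heartbeat in flight right after removal.
        Response(reregisterBlockManager = true)
      }
  }
}
```

So the executor is told to re-register in every case except the one where both HeartbeatReceiver and the block manager master already know it.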
// Update metrics for in-progress tasks and let the master know that the BlockManager is still alive.
val unknownExecutor = !scheduler.executorHeartbeatReceived(
executorId, accumUpdates, blockManagerId)
This calls a method declared on the TaskScheduler trait:
/**
* Update metrics for in-progress tasks and let the master know that the BlockManager is still
* alive. Return true if the driver knows about the given block manager. Otherwise, return false,
* indicating that the block manager should re-register.
*/
def executorHeartbeatReceived(
execId: String,
accumUpdates: Array[(Long, Seq[AccumulatorV2[_, _]])],
blockManagerId: BlockManagerId): Boolean
So we need to look at the executorHeartbeatReceived method of the implementing class, TaskSchedulerImpl.
/**
* Update metrics for in-progress tasks and let the master know that the BlockManager is still
* alive. Return true if the driver knows about the given block manager. Otherwise, return false,
* indicating that the block manager should re-register.
*
* In this version the code walks accumUpdates, looks each task id up in taskIdToTaskSetManager to find
* its TaskSetManager, and packs (taskId, TaskSetManager.stageId, TaskSetManager.taskSet.stageAttemptId,
* accInfos) into the array accumUpdatesWithTaskIds of type Array[(Long, Int, Int, Seq[AccumulableInfo])].
* (Older notes describe the same step in terms of taskMetrics, taskIdToTaskSetId, activeTaskSets and a
* metricsWithStageIds array of type Array[(Long, Int, Int, TaskMetrics)].)
*/
override def executorHeartbeatReceived(
execId: String,
accumUpdates: Array[(Long, Seq[AccumulatorV2[_, _]])],
blockManagerId: BlockManagerId): Boolean = {
// (taskId, stageId, stageAttemptId, accumUpdates)
val accumUpdatesWithTaskIds: Array[(Long, Int, Int, Seq[AccumulableInfo])] = synchronized {
accumUpdates.flatMap { case (id, updates) =>
val accInfos = updates.map(acc => acc.toInfo(Some(acc.value), None))
taskIdToTaskSetManager.get(id).map { taskSetMgr =>
(id, taskSetMgr.stageId, taskSetMgr.taskSet.stageAttemptId, accInfos)
}
}
}
// Finally, delegate to dagScheduler's executorHeartbeatReceived method
dagScheduler.executorHeartbeatReceived(execId, accumUpdatesWithTaskIds, blockManagerId)
}
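The flatMap over an Option-returning lookup is what silently drops updates for task ids the scheduler no longer tracks. Here is a small sketch of that idiom on plain collections; StageRef and attachStageInfo are hypothetical names of mine, standing in for the fields read from TaskSetManager.

```scala
object AccumUpdateJoinSketch {
  // Hypothetical stand-in for the two fields read from a TaskSetManager.
  final case class StageRef(stageId: Int, stageAttemptId: Int)

  // For each (taskId, payload), attach stage info when the task is known;
  // Map.get returns None for unknown ids, and flatMap discards those entries.
  def attachStageInfo[A](
      updates: Seq[(Long, A)],
      taskToStage: Map[Long, StageRef]): Seq[(Long, Int, Int, A)] =
    updates.flatMap { case (taskId, payload) =>
      taskToStage.get(taskId).map { s =>
        (taskId, s.stageId, s.stageAttemptId, payload)
      }
    }
}
```

For example, with only task 1 present in the map, an update for task 2 simply disappears from the result instead of raising an error.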
The key line above is dagScheduler.executorHeartbeatReceived(execId, accumUpdatesWithTaskIds, blockManagerId), which hands off to DAGScheduler's executorHeartbeatReceived method:
/**
* Update metrics for in-progress tasks and let the master know that the BlockManager is still
* alive. Return true if the driver knows about the given block manager. Otherwise, return false,
* indicating that the block manager should re-register.
*/
def executorHeartbeatReceived(
execId: String,
// (taskId, stageId, stageAttemptId, accumUpdates)
accumUpdates: Array[(Long, Int, Int, Seq[AccumulableInfo])],
blockManagerId: BlockManagerId): Boolean = {
// DAGScheduler wraps execId and accumUpdates in a SparkListenerExecutorMetricsUpdate event and posts it to the ListenerBus. The event updates
// the metrics of each Stage and is dispatched by SparkListenerBus's doPostEvent method.
listenerBus.post(SparkListenerExecutorMetricsUpdate(execId, accumUpdates))
// Finally, send a BlockManagerHeartbeat message to the BlockManagerMasterEndpoint held by BlockManagerMaster. On receipt,
// BlockManagerMasterEndpoint's receiveAndReply method matches the message and calls heartbeatReceived.
blockManagerMaster.driverEndpoint.askSync[Boolean](
BlockManagerHeartbeat(blockManagerId), new RpcTimeout(600 seconds, "BlockManagerHeartbeat"))
}
The first statement ends up in this case of doPostEvent:
// Metrics update event
case metricsUpdate: SparkListenerExecutorMetricsUpdate =>
listener.onExecutorMetricsUpdate(metricsUpdate)
The second statement ends up calling the heartbeatReceived method:
case BlockManagerHeartbeat(blockManagerId) =>
context.reply(heartbeatReceived(blockManagerId))
The heartbeatReceived method:
/**
* Return true if the driver knows about the given block manager. Otherwise, return false,
* indicating that the block manager should re-register.
*/
private def heartbeatReceived(blockManagerId: BlockManagerId): Boolean = {
if (!blockManagerInfo.contains(blockManagerId)) {
blockManagerId.isDriver && !isLocal
} else {
blockManagerInfo(blockManagerId).updateLastSeenMs()
true
}
}
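The return value of heartbeatReceived is what eventually becomes reregisterBlockManager on the executor, so it is worth spelling out the cases. A minimal sketch on a plain map follows; Id, Master, register and lastSeen are hypothetical names, and treating the executor id "driver" as the driver's block manager is an assumption made for this illustration.

```scala
import scala.collection.mutable

object BlockManagerHeartbeatSketch {
  // Hypothetical id type; "driver" marks the driver's own block manager here.
  final case class Id(executorId: String) {
    def isDriver: Boolean = executorId == "driver"
  }

  final class Master(isLocal: Boolean, now: () => Long) {
    private val lastSeenMs = mutable.HashMap.empty[Id, Long]

    def register(id: Id): Unit = { lastSeenMs(id) = now() }
    def lastSeen(id: Id): Option[Long] = lastSeenMs.get(id)

    // true  -> the master knows this block manager, nothing to do
    // false -> the block manager should re-register
    def heartbeatReceived(id: Id): Boolean =
      if (!lastSeenMs.contains(id)) {
        // Unknown id: only the driver in non-local mode is treated as known.
        id.isDriver && !isLocal
      } else {
        lastSeenMs(id) = now() // refresh the liveness timestamp
        true
      }
  }
}
```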
One question is still unresolved:
// Who handles this response?
context.reply(response)
This calls the method declared on the RpcCallContext trait:
/**
* Reply a message to the sender. If the sender is [[RpcEndpoint]], its [[RpcEndpoint.receive]]
* will be called.
*/
def reply(response: Any): Unit
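So we have to look at its implementing class, but I don't know which one it is.
Let's look at this code again:
// Send the Heartbeat message to HeartbeatReceiver's receiveAndReply() method and wait for the HeartbeatResponse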
val response = heartbeatReceiverRef.askSync[HeartbeatResponse](
message, RpcTimeout(conf, "spark.executor.heartbeatInterval", "10s"))
It turns out heartbeatReceiverRef is constructed like this:
// must be initialized before running startDriverHeartbeat()
private val heartbeatReceiverRef =
RpcUtils.makeDriverRef(HeartbeatReceiver.ENDPOINT_NAME, conf, env.rpcEnv)
Then look at the makeDriverRef method:
/**
* Retrieve a `RpcEndpointRef` which is located in the driver via its name.
*/
def makeDriverRef(name: String, conf: SparkConf, rpcEnv: RpcEnv): RpcEndpointRef = {
val driverHost: String = conf.get("spark.driver.host", "localhost")
val driverPort: Int = conf.getInt("spark.driver.port", 7077)
Utils.checkHost(driverHost, "Expected hostname")
rpcEnv.setupEndpointRef(RpcAddress(driverHost, driverPort), name)
}
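Then I looked at RpcEndpointRef. RpcEndpointRef is a remote reference to an RpcEndpoint and is thread-safe. The notes I had list two implementing subclasses, AkkaRpcEndpointRef and NettyRpcEndpointRef; as far as I can tell the Akka one belongs to older releases, so with the NettyRpcEnv used here the one that matters is NettyRpcEndpointRef.
The rest is hard to trace, so I'll come back to it later.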
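Until I trace the real classes, this is the mental model I am working with for "who handles context.reply": the caller's ask creates a promise, the call context handed to receiveAndReply completes that promise, and askSync just blocks on it. This is a deliberately simplified, in-process sketch with hypothetical names (CallContext, Endpoint, LocalRef); it is not Spark's NettyRpcEnv code, which additionally has to serialize the reply and send it back over the network.

```scala
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._

object AskReplySketch {
  // Hypothetical, simplified analogue of RpcCallContext.
  trait CallContext { def reply(response: Any): Unit }

  // Hypothetical endpoint: gets a message plus a context to answer on.
  trait Endpoint { def receiveAndReply(message: Any, context: CallContext): Unit }

  // Hypothetical in-process reference to an endpoint.
  final class LocalRef(endpoint: Endpoint) {
    def ask[T](message: Any): Future[T] = {
      val promise = Promise[Any]()
      val context = new CallContext {
        // reply() completes the promise the caller is waiting on.
        override def reply(response: Any): Unit = { promise.trySuccess(response); () }
      }
      endpoint.receiveAndReply(message, context)
      promise.future.map(_.asInstanceOf[T])(ExecutionContext.global)
    }

    // Blocking variant: wait for the reply or fail after the timeout.
    def askSync[T](message: Any, timeout: FiniteDuration): T =
      Await.result(ask[T](message), timeout)
  }
}
```

Under this model, the HeartbeatResponse is not "handled" by any receive method on the executor side; it simply becomes the return value of askSync in reportHeartBeat.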