In this post we walk through how Receivers are started in Spark Streaming. As we all know, StreamingContext is the main entry point of a Spark Streaming program, so let's start with part of its source code:
class StreamingContext private[streaming] (
    sc_ : SparkContext,
    cp_ : Checkpoint,
    batchDur_ : Duration
  ) extends Logging {

  // The DStreamGraph holds the DStreams and the dependencies between them
  private[streaming] val graph: DStreamGraph = {
    if (isCheckpointPresent) {
      cp_.graph.setContext(this)
      cp_.graph.restoreCheckpointData()
      cp_.graph
    } else {
      require(batchDur_ != null, "Batch duration for StreamingContext cannot be null")
      val newGraph = new DStreamGraph()
      newGraph.setBatchDuration(batchDur_)
      newGraph
    }
  }

  private[streaming] val scheduler = new JobScheduler(this)
}
Here we can see that constructing a StreamingContext creates a DStreamGraph and a JobScheduler; the DStreamGraph stores the DStreams and the dependencies between them.
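For reference, here is a minimal driver program, a sketch built only from the standard API (the app name, host and port are placeholder values); constructing the StreamingContext on the driver is exactly what creates the DStreamGraph and JobScheduler above.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MinimalStreamingApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MinimalStreamingApp").setMaster("local[2]")
    // Constructing the StreamingContext builds the DStreamGraph and the JobScheduler
    val ssc = new StreamingContext(conf, Seconds(1))

    // socketTextStream registers a ReceiverInputDStream backed by a SocketReceiver
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print()

    // start() triggers JobScheduler.start(), and with it the Receiver startup analyzed below
    ssc.start()
    ssc.awaitTermination()
  }
}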
Now look at the following line in JobScheduler:
// The JobGenerator is created as soon as the JobScheduler is constructed
private val jobGenerator = new JobGenerator(this)
So a JobGenerator is created right along with the JobScheduler; what it is for will be discussed later.
To start a StreamingContext, we first call StreamingContext#start:
def start(): Unit = synchronized {
  ...
  scheduler.start()
  ...
}
Its start method essentially delegates to JobScheduler#start, part of which looks like this:
...
// Create the ReceiverTracker component, which handles data reception, and start it
receiverTracker = new ReceiverTracker(ssc)
inputInfoTracker = new InputInfoTracker(ssc)
receiverTracker.start()
// At this point all of the StreamingContext components we have mentioned exist.
// Starting the Receivers associated with the input DStreams happens inside
// ReceiverTracker.start(); after that the JobGenerator is started as well.
jobGenerator.start()
...
In JobScheduler#start, a ReceiverTracker object is created and its start method is called:
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }
  if (!receiverInputStreams.isEmpty) {
    // endpoint is a ReceiverTrackerEndpoint, an RPC endpoint that receives and handles
    // messages sent by the ReceiverTracker itself and by the Receivers
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    // The core of this start method is the call to launchReceivers(): the ReceiverTracker's
    // main job is to distribute the receivers to executors and start them there
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}
The method first checks whether receiverInputStreams is empty: besides ReceiverInputDStream, Spark Streaming also has plain InputDStreams (for example, a stream that just reads new files from a Hadoop directory) which need no Receiver at all, so those are filtered out here. It then sets up a ReceiverTrackerEndpoint, whose job is to handle the messages sent to the ReceiverTracker. Most importantly, it calls launchReceivers() at the end. Before looking at that method, the short sketch below makes the distinction between receiver-based and receiver-less input streams concrete.
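Both lines use standard StreamingContext APIs and reuse the ssc from the driver sketch above; the port and directory are placeholder values.

// socketTextStream returns a ReceiverInputDStream: a SocketReceiver has to run on an executor
val withReceiver = ssc.socketTextStream("localhost", 9999)
// textFileStream is backed by a file-based InputDStream: the driver just lists new files in the
// directory each batch interval, so no Receiver is involved
val withoutReceiver = ssc.textFileStream("hdfs:///tmp/streaming-input")

Now, launchReceivers itself: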
private def launchReceivers(): Unit = {
  // Collect the receivers from DStreamGraph.inputStreams, i.e. the data receivers
  val receivers = receiverInputStreams.map(nis => {
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  })
  runDummySparkJob()
  // Send a StartAllReceivers message to the endpoint; send() is fire-and-forget and
  // returns immediately without waiting for the message to be processed
  endpoint.send(StartAllReceivers(receivers))
}
launchReceivers first obtains the Receiver of each ReceiverInputDStream, then sends a StartAllReceivers message to the ReceiverTrackerEndpoint. The endpoint handles the message as follows:
case StartAllReceivers(receivers) =>
  // Match receivers to executors according to the receiver scheduling policy
  val scheduledExecutors = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
  // For every receiver, call startReceiver with the executors chosen above
  for (receiver <- receivers) {
    val executors = scheduledExecutors(receiver.streamId)
    updateReceiverScheduledExecutors(receiver.streamId, executors)
    // Remember each receiver's preferred location
    receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
    // Start the receiver on the designated executors
    startReceiver(receiver, executors)
  }
The endpoint first assigns executors to each receiver according to a scheduling policy (a simplified sketch of such an assignment follows below), and then calls startReceiver(receiver, executors) to launch every receiver on its designated executors.
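The real assignment is done by ReceiverSchedulingPolicy.scheduleReceivers, which spreads receivers across executors and honours each receiver's preferredLocation. Purely as an illustration (a hypothetical simplification, not the actual policy), a bare-bones round-robin assignment could look like this:

// Hypothetical, simplified sketch of assigning receivers to executors round-robin.
// The real ReceiverSchedulingPolicy also considers preferredLocation and load balancing.
def roundRobinSchedule(
    receiverIds: Seq[Int],
    executors: Seq[String]): Map[Int, Seq[String]] = {
  if (executors.isEmpty) {
    receiverIds.map(_ -> Seq.empty[String]).toMap
  } else {
    receiverIds.zipWithIndex.map { case (id, i) =>
      id -> Seq(executors(i % executors.size))
    }.toMap
  }
}

// roundRobinSchedule(Seq(0, 1, 2), Seq("exec-1", "exec-2"))
// => Map(0 -> Seq("exec-1"), 1 -> Seq("exec-2"), 2 -> Seq("exec-1"))

With the target executors chosen, startReceiver launches the receiver on one of them: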
private def startReceiver(receiver: Receiver[_], scheduledExecutors: Seq[String]): Unit = {
  ...
  val serializableHadoopConf =
    new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

  // Function to start the receiver on the worker node
  // This function will be shipped to an executor, where it starts the receiver
  val startReceiverFunc: Iterator[Receiver[_]] => Unit =
    (iterator: Iterator[Receiver[_]]) => {
      if (!iterator.hasNext) {
        throw new SparkException(
          "Could not start receiver as object not found.")
      }
      if (TaskContext.get().attemptNumber() == 0) {
        val receiver = iterator.next()
        assert(iterator.hasNext == false)
        // Create a ReceiverSupervisorImpl for this receiver
        val supervisor = new ReceiverSupervisorImpl(
          receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
        // Start the receiver via the supervisor
        supervisor.start()
        // awaitTermination blocks the task thread until the receiver terminates
        supervisor.awaitTermination()
      } else {
        // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
      }
    }

  // Create the RDD using the scheduledExecutors to run the receiver in a Spark job
  // i.e. wrap the receiver and its target executors into a single-element RDD
  val receiverRDD: RDD[Receiver[_]] =
    if (scheduledExecutors.isEmpty) {
      ssc.sc.makeRDD(Seq(receiver), 1)
    } else {
      ssc.sc.makeRDD(Seq(receiver -> scheduledExecutors))
    }
  receiverRDD.setName(s"Receiver $receiverId")
  ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
  ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

  // Submit the receiver RDD together with startReceiverFunc as a Spark job;
  // this is what actually starts the receiver on an executor
  val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
    receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
  // We will keep restarting the receiver job until ReceiverTracker is stopped
  future.onComplete {
    case Success(_) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        logInfo(s"Restarting Receiver $receiverId")
        self.send(RestartReceiver(receiver))
      }
    case Failure(e) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        logError("Receiver has been stopped. Try to restart it.", e)
        logInfo(s"Restarting Receiver $receiverId")
        self.send(RestartReceiver(receiver))
      }
  }(submitJobThreadPool)
  logInfo(s"Receiver ${receiver.streamId} started")
}
Here we come to one of the most elegant tricks in Spark Streaming's design. The receiver, together with its target executors, is wrapped into a single-element RDD; that RDD, plus the startReceiverFunc function to run over it, is then submitted as an ordinary Spark job. When the job is executed, the function runs against the RDD's single partition on an executor, which is how the receiver ends up running there. Inside startReceiverFunc, as we saw, supervisor.start() is called.
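The same trick can be demonstrated outside of Spark Streaming. The sketch below is my own illustration, not ReceiverTracker code: it wraps one object into a one-partition RDD and submits a job whose single "task" is a long-running function on an executor (the sleep merely stands in for supervisor.awaitTermination()).

import scala.concurrent.Await
import scala.concurrent.duration.Duration

import org.apache.spark.{SparkConf, SparkContext}

object OneTaskJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("OneTaskJobSketch").setMaster("local[2]"))

    // One element, one partition => the submitted job consists of exactly one task
    val oneElementRDD = sc.makeRDD(Seq("my-receiver"), 1)

    // Plays the role of startReceiverFunc: it runs inside the task and keeps it alive,
    // much like supervisor.awaitTermination() keeps the receiver task alive
    val runOnExecutor: Iterator[String] => Unit = (it: Iterator[String]) => {
      val name = it.next()
      println(s"$name is now running inside an executor")
      Thread.sleep(10000)
    }

    // submitJob returns immediately with a future; the driver thread is not blocked
    val future = sc.submitJob[String, Unit, Unit](
      oneElementRDD, runOnExecutor, Seq(0), (_, _) => (), ())
    Await.ready(future, Duration.Inf)
    sc.stop()
  }
}

Back in startReceiverFunc, the supervisor.start() that gets called looks like this: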
def start() {
  onStart()
  startReceiver()
}
The start method above is defined in the parent class ReceiverSupervisor, and it goes on to call the parent's startReceiver method:
def startReceiver(): Unit = synchronized {
  try {
    // First call ReceiverSupervisorImpl.onReceiverStart() to register the receiver with
    // the driver; only if registration succeeds is the receiver actually started
    if (onReceiverStart()) {
      logInfo("Starting receiver")
      receiverState = Started
      // receiver.onStart() is where the actual data reception happens
      receiver.onStart()
      logInfo("Called receiver onStart")
    } else {
      // The driver refused us
      // If the ReceiverTracker on the driver refuses (or fails) the registration, stop the
      // receiver; stopping sends a DeregisterReceiver message back to the driver
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}
In this method receiver.onStart() is called to start receiving data. An onStart implementation usually spawns its own thread; for example, this is what SocketReceiver does:
def onStart() {
  // Start the thread that receives data over a connection
  new Thread("Socket Receiver") {
    setDaemon(true)
    override def run() { receive() }
  }.start()
}

def receive() {
  var socket: Socket = null
  try {
    logInfo("Connecting to " + host + ":" + port)
    socket = new Socket(host, port)
    logInfo("Connected to " + host + ":" + port)
    val iterator = bytesToObjects(socket.getInputStream())
    while(!isStopped && iterator.hasNext) {
      store(iterator.next)
    }
    if (!isStopped()) {
      restart("Socket data stream had no more data")
    } else {
      logInfo("Stopped receiving")
    }
  } catch {
    case e: java.net.ConnectException =>
      restart("Error connecting to " + host + ":" + port, e)
    case NonFatal(e) =>
      logWarning("Error receiving data", e)
      restart("Error receiving data", e)
  } finally {
    if (socket != null) {
      socket.close()
      logInfo("Closed socket to " + host + ":" + port)
    }
  }
}
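A user-defined receiver follows exactly the same contract as SocketReceiver: extend Receiver, spawn your own thread in onStart, hand records to Spark with store(), and call restart() on recoverable errors. Below is a minimal sketch; the class name and the random-number "data source" are made up for illustration.

import java.util.concurrent.ThreadLocalRandom

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A hypothetical custom receiver that "receives" random numbers
class RandomNumberReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Like SocketReceiver, do the actual work on a separate daemon thread
    new Thread("Random Number Receiver") {
      setDaemon(true)
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to clean up: the receiving thread checks isStopped() and exits on its own
  }

  private def receive(): Unit = {
    try {
      while (!isStopped()) {
        // store() hands the record over to the ReceiverSupervisor, which buffers it into blocks
        store(ThreadLocalRandom.current().nextInt(100).toString)
        Thread.sleep(500)
      }
    } catch {
      case t: Throwable =>
        // Ask the supervisor to restart the receiver after an error
        restart("Error receiving data", t)
    }
  }
}

// It would be plugged into a job with: ssc.receiverStream(new RandomNumberReceiver())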
This concludes our analysis of how Receivers are started in Spark Streaming.