Spark Source Code Analysis: Starting Receivers in Spark Streaming

In this post we analyze how receivers are started in Spark Streaming. StreamingContext is the main entry point of a Spark Streaming program, so let's begin with part of its source:

class StreamingContext private[streaming] (
    sc_ : SparkContext,
    cp_ : Checkpoint,
    batchDur_ : Duration
  ) extends Logging {

 // DStreamGraph holds the DStreams and the dependencies between them
 private[streaming] val graph: DStreamGraph = {
   if (isCheckpointPresent) {
     cp_.graph.setContext(this)
     cp_.graph.restoreCheckpointData()
     cp_.graph
   } else {
     require(batchDur_ != null, "Batch duration for StreamingContext cannot be null")
     val newGraph = new DStreamGraph()
     newGraph.setBatchDuration(batchDur_)
     newGraph
   }
 }

 private[streaming] val scheduler = new JobScheduler(this)
}

Here we can see that initializing a StreamingContext creates both a DStreamGraph and a JobScheduler; the DStreamGraph stores the DStreams and the dependencies between them.
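For reference, a minimal driver program that exercises the whole startup path discussed below might look like this (the master URL, host, port and batch interval are illustrative values, not taken from the original post):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountReceiverDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ReceiverDemo")
    // Creating the StreamingContext builds the DStreamGraph and the JobScheduler
    val ssc = new StreamingContext(conf, Seconds(5))

    // socketTextStream registers a ReceiverInputDStream in the DStreamGraph;
    // its receiver is what the ReceiverTracker will later launch on an executor
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    // start() triggers scheduler.start(), which is where receiver startup begins
    ssc.start()
    ssc.awaitTermination()
  }
}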

Let's look at the following code in JobScheduler:

// A JobGenerator is created as soon as the JobScheduler is created
private val jobGenerator = new JobGenerator(this)

So a JobGenerator is created right after the JobScheduler; what it is used for will be discussed later.

To launch a StreamingContext, we first call StreamingContext#start:

def start(): Unit = synchronized {
 ...
 scheduler.start()
 ...
}

The start method essentially just calls jobScheduler.start(), part of which is shown below:

...
 // Create and start the receiverTracker component, which is responsible for data receiving
 receiverTracker = new ReceiverTracker(ssc)
 inputInfoTracker = new InputInfoTracker(ssc)
 receiverTracker.start()
 // Start the JobGenerator.
 // At this point all of the StreamingContext components mentioned above have been created;
 // next comes starting the receivers associated with the input DStreams,
 // and that logic lives in ReceiverTracker.start()
 jobGenerator.start()
...

In jobScheduler.start(), a ReceiverTracker object is created and its start method is then called:

def start(): Unit = synchronized {
   if (isTrackerStarted) {
     throw new SparkException("ReceiverTracker already started")
   }

   if (!receiverInputStreams.isEmpty) {
      // endpoint is a ReceiverTrackerEndpoint,
      // set up to receive and handle messages sent by the ReceiverTracker and the receivers
     endpoint = ssc.env.rpcEnv.setupEndpoint(
       "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))

      // The main job of start() is to call launchReceivers(); the whole point of the
      // ReceiverTracker is to distribute the receivers to executors and start them
     if (!skipReceiverLaunch) launchReceivers()
     logInfo("ReceiverTracker started")
     trackerState = Started
   }
}

It first checks whether receiverInputStreams is empty, because besides ReceiverInputDStream, Spark Streaming also has input streams that need no receiver at all (reading files from a Hadoop directory, for example), so only receiver-based input streams are considered here. It then sets up a ReceiverTrackerEndpoint, whose main job is to handle messages sent to the ReceiverTracker, and most importantly it then calls launchReceivers(), shown after the short example below.
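To make the distinction concrete, a small hedged example (the port and path are made up): socketTextStream yields a ReceiverInputDStream and therefore needs a receiver, while textFileStream yields a plain InputDStream that just polls the directory on the driver and needs none.

// Receiver-based: a long-running SocketReceiver will be launched on an executor
val socketLines = ssc.socketTextStream("localhost", 9999)

// Receiver-less: each batch simply lists new files under the directory, no receiver involved
val fileLines = ssc.textFileStream("hdfs:///tmp/streaming-input")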

private def launchReceivers(): Unit = {
   // Get the receivers from DStreamGraph.inputStreams, i.e. the data receivers
   val receivers = receiverInputStreams.map(nis => {
     val rcvr = nis.getReceiver()
     rcvr.setReceiverId(nis.id)
     rcvr
   })

   // Run a dummy Spark job to make sure all executors have registered,
   // so that the receivers are not all scheduled onto the same node
   runDummySparkJob()

   // Send a StartAllReceivers message to the endpoint; send() returns immediately
   // without waiting for the message to be processed
   endpoint.send(StartAllReceivers(receivers))
}

It first obtains the receivers corresponding to receiverInputStreams and then sends a StartAllReceivers message to the ReceiverTrackerEndpoint, which handles the message as follows:

case StartAllReceivers(receivers) =>
  // Match receivers to executors according to the receiver scheduling policy
  val scheduledExecutors = schedulingPolicy.scheduleReceivers(receivers, getExecutors)

  // For each receiver, call startReceiver with the target executors computed above
  for (receiver <- receivers) {
    val executors = scheduledExecutors(receiver.streamId)
    updateReceiverScheduledExecutors(receiver.streamId, executors)
    // Remember the receiver's preferred location
    receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
    // Start the receiver on the designated executors
    startReceiver(receiver, executors)
  }

Each receiver is first assigned candidate executors according to a scheduling policy, and then startReceiver(receiver, executors) is called to launch it on one of them. A simplified sketch of the scheduling idea follows, and after it the startReceiver method itself.
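A minimal sketch of what such a policy conceptually does, assuming plain round-robin assignment (the real ReceiverSchedulingPolicy is more elaborate: it honors each receiver's preferredLocation and balances receivers across executors); the helper name is hypothetical:

// Hypothetical, simplified round-robin assignment: streamId -> candidate executors
def roundRobinSchedule(
    streamIds: Seq[Int],
    executors: Seq[String]): Map[Int, Seq[String]] = {
  streamIds.zipWithIndex.map { case (streamId, i) =>
    val targets =
      if (executors.isEmpty) Seq.empty[String]
      else Seq(executors(i % executors.size))
    streamId -> targets
  }.toMap
}

// roundRobinSchedule(Seq(0, 1, 2), Seq("host1:5555", "host2:5555"))
//   => Map(0 -> Seq("host1:5555"), 1 -> Seq("host2:5555"), 2 -> Seq("host1:5555"))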

 private def startReceiver(receiver: Receiver[_], scheduledExecutors: Seq[String]): Unit = {
   ...
   val serializableHadoopConf =
     new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

   // Function to start the receiver on the worker node
    // This function will be shipped to an executor and run there to start the receiver
   val startReceiverFunc: Iterator[Receiver[_]] => Unit =
     (iterator: Iterator[Receiver[_]]) => {
       if (!iterator.hasNext) {
         throw new SparkException(
           "Could not start receiver as object not found.")
       }
       if (TaskContext.get().attemptNumber() == 0) {
         val receiver = iterator.next()
         assert(iterator.hasNext == false)

          // Create a ReceiverSupervisorImpl for this receiver
         val supervisor = new ReceiverSupervisorImpl(
           receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)

          // Call start() to launch the receiver
         supervisor.start()

          // Block this task thread until the receiver is terminated
         supervisor.awaitTermination()
       } else {
         // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
       }
     }

   // Create the RDD using the scheduledExecutors to run the receiver in a Spark job
    // Wrap the receiver together with its target executors into an RDD
   val receiverRDD: RDD[Receiver[_]] =
     if (scheduledExecutors.isEmpty) {
       ssc.sc.makeRDD(Seq(receiver), 1)
     } else {
       ssc.sc.makeRDD(Seq(receiver -> scheduledExecutors))
     }

   receiverRDD.setName(s"Receiver $receiverId")
   ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
   ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

    // Submit a job consisting of the receiver RDD and the function to run on it,
    // which is what actually starts the receiver on the executor
   val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
     receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())

   // We will keep restarting the receiver job until ReceiverTracker is stopped
   future.onComplete {
     case Success(_) =>
       if (!shouldStartReceiver) {
         onReceiverJobFinish(receiverId)
       } else {
         logInfo(s"Restarting Receiver $receiverId")
         self.send(RestartReceiver(receiver))
       }
     case Failure(e) =>
       if (!shouldStartReceiver) {
         onReceiverJobFinish(receiverId)
       } else {
         logError("Receiver has been stopped. Try to restart it.", e)
         logInfo(s"Restarting Receiver $receiverId")
         self.send(RestartReceiver(receiver))
       }
   }(submitJobThreadPool)
   logInfo(s"Receiver ${receiver.streamId} started")
}

Here we see one of the most elegant parts of Spark Streaming's design. The receiver and its target executors are first wrapped into an RDD; that RDD, together with the function startReceiverFunc to be executed on it, is then submitted as an ordinary Spark job. When the job runs, the function is invoked on an executor with the RDD's single element, the receiver, as input.
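The same trick can be demonstrated in isolation. The sketch below is purely illustrative (the payload string and object name are made up): it wraps a value into a one-partition RDD and uses SparkContext.submitJob to run a function on an executor, just as the ReceiverTracker does with the receiver.

import scala.concurrent.Await
import scala.concurrent.duration._

import org.apache.spark.{SparkConf, SparkContext}

object SubmitJobPatternDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("submitJob-pattern"))

    // One element, one partition: the "payload" we want to run somewhere in the cluster
    val payloadRDD = sc.makeRDD(Seq("pretend-this-is-a-receiver"), 1)

    // The function shipped to the executor; a real receiver job would block in
    // supervisor.awaitTermination() instead of returning immediately
    val runOnExecutor: Iterator[String] => Unit = { iter =>
      iter.foreach(payload => println(s"running $payload on an executor"))
    }

    val future = sc.submitJob[String, Unit, Unit](
      payloadRDD, runOnExecutor, Seq(0), (_, _) => (), ())

    // ReceiverTracker uses the returned future to restart the receiver job when it ends
    Await.ready(future, 1.minute)
    sc.stop()
  }
}

Back inside startReceiverFunc, the key call is supervisor.start():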

def start() {
    onStart()
    startReceiver()
}

The method above is defined on the parent class ReceiverSupervisor, and it goes on to call the parent's startReceiver method:

def startReceiver(): Unit = synchronized {
  try {
     // First call ReceiverSupervisorImpl.onReceiverStart to register with the driver;
     // only if registration succeeds do we go on to start the receiver
     if (onReceiverStart()) {
       logInfo("Starting receiver")
       receiverState = Started
       // receiver.onStart() is where the actual data receiving happens
       receiver.onStart()
       logInfo("Called receiver onStart")
     } else {
       // The driver refused us
       // If the ReceiverTracker on the driver refuses the registration (or it fails),
       // stop the receiver and send a DeregisterReceiver message
       stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
     }
   } catch {
     case NonFatal(t) =>
       stop("Error starting receiver " + streamId, Some(t))
   }
}
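The registration step referenced in the comments, onReceiverStart, essentially sends a RegisterReceiver message to the ReceiverTracker endpoint on the driver and uses the boolean reply to decide whether to continue. A simplified sketch of that exchange (the exact signature and the ask API vary across Spark versions):

// In ReceiverSupervisorImpl (executor side): register this receiver with the driver
override protected def onReceiverStart(): Boolean = {
  val msg = RegisterReceiver(
    streamId, receiver.getClass.getSimpleName, host, executorId, endpoint)
  // The reply is the driver's decision; false means the driver refused the receiver
  trackerEndpoint.askWithRetry[Boolean](msg)
}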

Inside this method, receiver.onStart() is called and data reception begins. An onStart implementation usually just spawns a worker thread; for example, here is the source of SocketReceiver:

def onStart() {
  // Start the thread that receives data over a connection
  new Thread("Socket Receiver") {
    setDaemon(true)
    override def run() { receive() }
  }.start()
}

def receive() {
   var socket: Socket = null
   try {
     logInfo("Connecting to " + host + ":" + port)
     socket = new Socket(host, port)
     logInfo("Connected to " + host + ":" + port)
     val iterator = bytesToObjects(socket.getInputStream())
     while(!isStopped && iterator.hasNext) {
       store(iterator.next)
     }
     if (!isStopped()) {
       restart("Socket data stream had no more data")
     } else {
       logInfo("Stopped receiving")
     }
   } catch {
     case e: java.net.ConnectException =>
       restart("Error connecting to " + host + ":" + port, e)
     case NonFatal(e) =>
       logWarning("Error receiving data", e)
       restart("Error receiving data", e)
   } finally {
     if (socket != null) {
       socket.close()
       logInfo("Closed socket to " + host + ":" + port)
     }
   }
}
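The same pattern applies when you write your own receiver: extend Receiver, start a background thread in onStart, hand records to Spark with store(), and call restart() or stop() on errors. A minimal hypothetical sketch (CounterReceiver is a made-up class, not part of Spark):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A toy receiver that emits a counter value once per second, purely for illustration
class CounterReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  override def onStart(): Unit = {
    // onStart must not block: do the work on a separate daemon thread
    val thread = new Thread("Counter Receiver") {
      override def run(): Unit = {
        var i = 0L
        try {
          while (!isStopped()) {
            store(s"event-$i")   // hand the record to the ReceiverSupervisor
            i += 1
            Thread.sleep(1000)
          }
        } catch {
          case e: Exception => restart("Error in counter loop", e)
        }
      }
    }
    thread.setDaemon(true)
    thread.start()
  }

  // Nothing to clean up: the loop above exits once isStopped() becomes true
  override def onStop(): Unit = {}
}

// Usage: ssc.receiverStream(new CounterReceiver).print()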

This completes our analysis of how Spark Streaming starts its receivers.
