In Spark Streaming, a StreamingContext is used to work with DStreams, much as a SparkContext works with RDDs in Spark Core; a DStream can be regarded as a further layer of encapsulation over RDDs, and the operations applied to a DStream fall roughly into two categories: transformations and output operations.
Spark Streaming itself is built from three main modules, and the easiest way to see how they cooperate is to walk through a simple example and trace its execution.
import org.apache.spark.streaming.{Seconds, StreamingContext}

// `sc` is an existing SparkContext; each batch covers 2 seconds of data
val ssc = new StreamingContext(sc, Seconds(2))
val lines = ssc.socketTextStream("localhost", 8888)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordcount = pairs.reduceByKey(_ + _)
wordcount.print()
ssc.start()
ssc.awaitTermination()
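To try this out locally, something has to write text to port 8888, for example a netcat listener started with `nc -lk 8888`; every line typed there then shows up in the word counts printed for each 2-second batch.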
Before digging in, let us first look at a few of the member fields inside StreamingContext and how they are initialized:
private[streaming] val conf = sc.conf
private[streaming] val env = SparkEnv.get
private[streaming] val graph: DStreamGraph = {
  if (isCheckpointPresent) {
    cp_.graph.setContext(this)
    cp_.graph.restoreCheckpointData()
    cp_.graph
  } else {
    assert(batchDur_ != null, "Batch duration for streaming context cannot be null")
    val newGraph = new DStreamGraph()
    newGraph.setBatchDuration(batchDur_)
    newGraph
  }
}
private val nextReceiverInputStreamId = new AtomicInteger(0)
private[streaming] var checkpointDir: String = {
  if (isCheckpointPresent) {
    sc.setCheckpointDir(cp_.checkpointDir)
    cp_.checkpointDir
  } else {
    null
  }
}
private[streaming] val checkpointDuration: Duration = {
  if (isCheckpointPresent) cp_.checkpointDuration else graph.batchDuration
}
private[streaming] val scheduler = new JobScheduler(this)
private[streaming] val waiter = new ContextWaiter
private[streaming] val progressListener = new StreamingJobProgressListener(this)
private[streaming] val uiTab: Option[StreamingTab] =
  if (conf.getBoolean("spark.ui.enabled", true)) {
    Some(new StreamingTab(this))
  } else {
    None
  }
/** Register streaming source to metrics system */
private val streamingSource = new StreamingSource(this)
SparkEnv.get.metricsSystem.registerSource(streamingSource)
/** Enumeration to identify current state of the StreamingContext */
private[streaming] object StreamingContextState extends Enumeration {
  type CheckpointState = Value
  val Initialized, Started, Stopped = Value
}
import StreamingContextState._
private[streaming] var state = Initialized
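The checkpoint branches above (isCheckpointPresent) are what make driver recovery work. A minimal sketch of how user code typically exercises them, assuming a hypothetical checkpoint path and an existing SparkContext `sc`:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/streaming-checkpoint"   // hypothetical path

def createContext(): StreamingContext = {
  // runs only on a fresh start, when no checkpoint exists yet
  val ssc = new StreamingContext(sc, Seconds(2))
  ssc.checkpoint(checkpointDir)
  // define the DStream graph here before returning
  ssc
}

// On restart, the context and its DStreamGraph are rebuilt from the checkpoint,
// which is the isCheckpointPresent branch seen in the fields above.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)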
Once the StreamingContext has been initialized, socketTextStream is called to create a SocketInputDStream:
def socketTextStream(
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[String] = {
  socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}
def socketStream[T: ClassTag](
    hostname: String,
    port: Int,
    converter: (InputStream) => Iterator[T],
    storageLevel: StorageLevel
  ): ReceiverInputDStream[T] = {
  new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
}
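socketTextStream is just socketStream with a built-in bytes-to-lines converter. A small sketch of calling socketStream directly with a hypothetical converter of our own (the port and the parsing logic are only illustrative):

import java.io.InputStream
import scala.io.Source
import org.apache.spark.storage.StorageLevel

// hypothetical converter: treat every incoming line as an integer
def linesToInts(in: InputStream): Iterator[Int] =
  Source.fromInputStream(in, "UTF-8").getLines().map(_.trim.toInt)

val ints = ssc.socketStream[Int]("localhost", 8888, linesToInts, StorageLevel.MEMORY_AND_DISK_SER_2)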
In its parent class InputDStream, the stream registers itself with the DStreamGraph; the relevant snippet from InputDStream is:
// the InputDStream adds itself to the DStreamGraph
ssc.graph.addInputStream(this)
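For reference, addInputStream on DStreamGraph roughly does the following (a from-memory sketch of the 1.x source, details may differ): it records the graph on the stream and appends the stream to an internal list under a lock.

def addInputStream(inputStream: InputDStream[_]) {
  this.synchronized {
    inputStream.setGraph(this)
    inputStreams += inputStream
  }
}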
The SocketInputDStream class defines getReceiver, which creates a new SocketReceiver:
// create a new SocketReceiver
def getReceiver(): Receiver[T] = {
  new SocketReceiver(host, port, bytesToObjects, storageLevel)
}
After the SocketInputDStream is created, the user-defined transformations such as flatMap and reduceByKey are declared on it. So how does Spark Streaming actually receive the data sent by the client? How does it open the socket underneath and turn the stream into Spark jobs? Let us work through these questions.
Once the user-defined operations have been declared, StreamingContext.start is called to start the application, and that method in turn calls JobScheduler.start. Its source is as follows:
/**
* Start the execution of the streams.
*
* @throws SparkException if the context has already been started or stopped.
*/
def start(): Unit = synchronized {
  if (state == Started) {
    throw new SparkException("StreamingContext has already been started")
  }
  if (state == Stopped) {
    throw new SparkException("StreamingContext has already been stopped")
  }
  validate()
  sparkContext.setCallSite(DStream.getCreationSite())
  // call JobScheduler.start
  scheduler.start()
  state = Started
}
JobScheduler.start, shown below, creates a ReceiverTracker and calls its start method:
def start(): Unit = synchronized {
  if (eventActor != null) return // scheduler has already been started
  logDebug("Starting JobScheduler")
  eventActor = ssc.env.actorSystem.actorOf(Props(new Actor {
    def receive = {
      case event: JobSchedulerEvent => processEvent(event)
    }
  }), "JobScheduler")
  listenerBus.start()
  receiverTracker = new ReceiverTracker(ssc)
  // call ReceiverTracker.start
  receiverTracker.start()
  jobGenerator.start()
  logInfo("Started JobScheduler")
}
ReceiverTracker.start checks whether any ReceiverInputDStreams have been registered; if so, it creates its actor and then launches the receivers on the executors by calling ReceiverLauncher.start. Its source is as follows:
/** Start the actor and receiver execution thread. */
def start() = synchronized {
  if (actor != null) {
    throw new SparkException("ReceiverTracker already started")
  }
  if (!receiverInputStreams.isEmpty) {
    actor = ssc.env.actorSystem.actorOf(Props(new ReceiverTrackerActor),
      "ReceiverTracker")
    // receiverExecutor is declared as: private val receiverExecutor = new ReceiverLauncher()
    // call ReceiverLauncher.start
    if (!skipReceiverLaunch) receiverExecutor.start()
    logInfo("ReceiverTracker started")
  }
}
The ReceiverLauncher thread then calls the startReceivers method:
/** This thread class runs all the receivers on the cluster. */
class ReceiverLauncher {
  @transient val env = ssc.env
  @volatile @transient private var running = false
  @transient val thread = new Thread() {
    override def run() {
      try {
        SparkEnv.set(env)
        startReceivers()
      } catch {
        case ie: InterruptedException => logInfo("ReceiverLauncher interrupted")
      }
    }
  }
  def start() {
    thread.start()
  }
startReceivers wraps the receivers themselves into an RDD and distributes them to the worker nodes as a parallel collection; on each worker it starts a ReceiverSupervisor, and the supervisor in turn triggers the Receiver. Its source is as follows (a simplified sketch of the underlying runJob trick follows the listing):
/**
* Get the receivers from the ReceiverInputDStreams, distributes them to the
* worker nodes as a parallel collection, and runs them.
*/
private def startReceivers() {
  val receivers = receiverInputStreams.map(nis => {
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  })
  // Right now, we only honor preferences if all receivers have them
  val hasLocationPreferences = receivers.map(_.preferredLocation.isDefined).reduce(_ && _)
  // Create the parallel collection of receivers to distribute them on the worker nodes
  // i.e. wrap the receivers in an RDD and spread them across the workers
  val tempRDD =
    if (hasLocationPreferences) {
      val receiversWithPreferences = receivers.map(r => (r, Seq(r.preferredLocation.get)))
      ssc.sc.makeRDD[Receiver[_]](receiversWithPreferences)
    } else {
      ssc.sc.makeRDD(receivers, receivers.size)
    }
  val checkpointDirOption = Option(ssc.checkpointDir)
  val serializableHadoopConf = new SerializableWritable(ssc.sparkContext.hadoopConfiguration)
  // Function to start the receiver on the worker node
  val startReceiver = (iterator: Iterator[Receiver[_]]) => {
    if (!iterator.hasNext) {
      throw new SparkException(
        "Could not start receiver as object not found.")
    }
    val receiver = iterator.next()
    val supervisor = new ReceiverSupervisorImpl(
      receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
    // start the ReceiverSupervisor
    supervisor.start()
    supervisor.awaitTermination()
  }
  // Run the dummy Spark job to ensure that all slaves have registered.
  // This avoids all the receivers to be scheduled on the same node.
  if (!ssc.sparkContext.isLocal) {
    ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
  }
  // Distribute the receivers and start them
  logInfo("Starting " + receivers.length + " receivers")
  running = true
  ssc.sparkContext.runJob(tempRDD, ssc.sparkContext.clean(startReceiver))
  running = false
  logInfo("All of the receivers have been terminated")
}
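The key trick above is that runJob is used not to compute data but to pin one long-running task per receiver on the cluster. A minimal standalone sketch of the same idea, with hypothetical names and purely illustrative work inside the task:

import org.apache.spark.{SparkConf, SparkContext}

object LongRunningTaskSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("long-task-sketch").setMaster("local[4]"))
    // one element per partition, so every "worker object" becomes exactly one task
    val workers = Seq("receiver-0", "receiver-1", "receiver-2")
    val rdd = sc.makeRDD(workers, workers.size)
    // runJob blocks until every task returns; in startReceivers each task instead
    // runs supervisor.awaitTermination(), keeping the receiver alive for the job's lifetime
    sc.runJob(rdd, (it: Iterator[String]) => {
      val name = it.next()
      println(s"$name would now run its receive loop until stopped")
    })
    sc.stop()
  }
}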
The start method of ReceiverSupervisor is as follows:
/** Start the supervisor */
def start() {
  onStart()
  startReceiver()
}
/** Start receiver */
def startReceiver(): Unit = synchronized {
  try {
    logInfo("Starting receiver")
    // start the Receiver; Receiver is an abstract class, and SocketReceiver is
    // the concrete implementation used by SocketInputDStream
    receiver.onStart()
    logInfo("Called receiver onStart")
    onReceiverStart()
    receiverState = Started
  } catch {
    case t: Throwable =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}
start calls startReceiver, which invokes onStart on the receiver. Receiver is an abstract class, and the concrete receiver here is the SocketReceiver returned by SocketInputDStream.getReceiver, so SocketReceiver.onStart is what actually runs. Its source is as follows:
// implementation of the abstract Receiver.onStart method
def onStart() {
  // Start the thread that receives data over a connection
  new Thread("Socket Receiver") {
    setDaemon(true)
    override def run() { receive() }
  }.start()
}
/** Create a socket connection and receive data until receiver is stopped */
// create the real socket connection and receive data
def receive() {
  var socket: Socket = null
  try {
    logInfo("Connecting to " + host + ":" + port)
    socket = new Socket(host, port)
    logInfo("Connected to " + host + ":" + port)
    val iterator = bytesToObjects(socket.getInputStream())
    while(!isStopped && iterator.hasNext) {
      store(iterator.next)
    }
    logInfo("Stopped receiving")
    restart("Retrying connecting to " + host + ":" + port)
  } catch {
    case e: java.net.ConnectException =>
      restart("Error connecting to " + host + ":" + port, e)
    case t: Throwable =>
      restart("Error receiving data", t)
  } finally {
    if (socket != null) {
      socket.close()
      logInfo("Closed socket to " + host + ":" + port)
    }
  }
}
onStart spawns a thread that runs receive, and it is in receive that the socket is actually created and incoming data is read continuously.
Note that the Receiver runs inside an Executor JVM on a worker machine, not inside the driver application's JVM.
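The onStart/store/restart pattern above is exactly what a user-defined receiver follows as well. A minimal sketch of a custom receiver, assuming hypothetical host/port values, plugged in via ssc.receiverStream:

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class LineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  def onStart(): Unit = {
    // onStart must return quickly, so spawn the blocking work on our own thread
    new Thread("Line Receiver") {
      setDaemon(true)
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = { /* the receive loop checks isStopped, nothing else to clean up */ }

  private def receive(): Unit = {
    var socket: Socket = null
    try {
      socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)                     // hand each record to Spark Streaming
        line = reader.readLine()
      }
      restart("Connection closed, retrying")
    } catch {
      case e: java.io.IOException => restart("Error receiving data", e)
    } finally {
      if (socket != null) socket.close()
    }
  }
}

// Usage, analogous to socketTextStream:
// val lines = ssc.receiverStream(new LineReceiver("localhost", 8888))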
We have now seen that Spark Streaming ultimately opens the socket and receives data in SocketReceiver.receive. How is that data stored, and how is it processed afterwards?
Each record handed to store ends up in a BlockGenerator, which batches records into blocks and pushes them to the BlockManager. Let us look at the member fields of BlockGenerator:
private case class Block(id: StreamBlockId, buffer: ArrayBuffer[Any])
private val clock = new SystemClock()
private val blockInterval = conf.getLong("spark.streaming.blockInterval", 200)
private val blockIntervalTimer =
  new RecurringTimer(clock, blockInterval, updateCurrentBuffer, "BlockGenerator")
// maximum number of blocks waiting to be pushed
private val blockQueueSize = conf.getInt("spark.streaming.blockQueueSize", 10)
private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)
// this thread calls keepPushingBlocks to hand the blocks in blocksForPushing to the BlockManager
private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }
@volatile private var currentBuffer = new ArrayBuffer[Any]
@volatile private var stopped = false
When a block is added to the blocksForPushing queue, the put call blocks once the queue is full. The queue capacity is controlled by spark.streaming.blockQueueSize (default 10), which can be a tuning knob in practice: a larger queue lets the receiver keep generating blocks while the push thread catches up, but the queued blocks also occupy more memory and may push the job toward an OOM, so size it according to the workload; in most cases the default is fine. (The block generation interval itself is controlled by spark.streaming.blockInterval, 200 ms by default.)
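A small sketch of setting these two knobs up front, with illustrative values only (in this Spark version blockInterval is read as a plain millisecond number):

import org.apache.spark.SparkConf

// hypothetical tuning: a shorter block interval means more, smaller blocks (and more tasks
// per batch); a larger queue lets more blocks wait before being handed to the BlockManager
val conf = new SparkConf()
  .setAppName("streaming-tuning-sketch")
  .set("spark.streaming.blockInterval", "200")
  .set("spark.streaming.blockQueueSize", "10")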
The updateCurrentBuffer function is as follows:
/** Change the buffer to which single records are added to. */
private def updateCurrentBuffer(time: Long): Unit = synchronized {
  try {
    val newBlockBuffer = currentBuffer
    currentBuffer = new ArrayBuffer[Any]
    if (newBlockBuffer.size > 0) {
      val blockId = StreamBlockId(receiverId, time - blockInterval)
      // wrap the current buffer into a new Block destined for the blocksForPushing queue
      val newBlock = new Block(blockId, newBlockBuffer)
      listener.onGenerateBlock(blockId)
      // enqueue the block; put blocks when the queue is full
      blocksForPushing.put(newBlock) // put is blocking when queue is full
      logDebug("Last element in " + blockId + " is " + newBlockBuffer.last)
    }
  } catch {
    case ie: InterruptedException =>
      logInfo("Block updating timer thread was interrupted")
    case e: Exception =>
      reportError("Error in block updating thread", e)
  }
}
The keepPushingBlocks function is as follows:
/** Keep pushing blocks to the BlockManager. */
private def keepPushingBlocks() {
  logInfo("Started block pushing thread")
  try {
    while(!stopped) {
      Option(blocksForPushing.poll(100, TimeUnit.MILLISECONDS)) match {
        case Some(block) => pushBlock(block)
        case None =>
      }
    }
    // Push out the blocks that are still left
    logInfo("Pushing out the last " + blocksForPushing.size() + " blocks")
    while (!blocksForPushing.isEmpty) {
      logDebug("Getting block ")
      val block = blocksForPushing.take()
      pushBlock(block)
      logInfo("Blocks left to push " + blocksForPushing.size())
    }
    logInfo("Stopped block pushing thread")
  } catch {
    case ie: InterruptedException =>
      logInfo("Block pushing thread was interrupted")
    case e: Exception =>
      reportError("Error in block pushing thread", e)
  }
}
In the code above, pushBlock is called to hand the block to the BlockManager:
private def pushBlock(block: Block) {
  listener.onPushBlock(block.id, block.buffer)
  logInfo("Pushed block " + block.id)
}
pushBlock invokes onPushBlock on the BlockGeneratorListener interface. The listener here is an anonymous implementation created inside ReceiverSupervisorImpl, and its onPushBlock calls pushArrayBuffer to hand the data to the BlockManager.
def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
  pushArrayBuffer(arrayBuffer, None, Some(blockId))
}
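For context, a rough, from-memory sketch of how ReceiverSupervisorImpl wires up this anonymous listener when it constructs the BlockGenerator (the exact set of listener methods varies across 1.x versions, so treat this as an approximation rather than verbatim source):

private val blockGenerator = new BlockGenerator(new BlockGeneratorListener {
  def onGenerateBlock(blockId: StreamBlockId): Unit = { }
  def onError(message: String, throwable: Throwable): Unit = {
    reportError(message, throwable)
  }
  def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]): Unit = {
    pushArrayBuffer(arrayBuffer, None, Some(blockId))
  }
}, streamId, env.conf)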
Finally, pushArrayBuffer ends up calling pushAndReportBlock to do the actual work:
/** Store block and report it to driver */
def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val numRecords = receivedBlock match {
    case ArrayBufferBlock(arrayBuffer) => arrayBuffer.size
    case _ => -1
  }
  val time = System.currentTimeMillis
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, blockStoreResult)
  val future = trackerActor.ask(AddBlock(blockInfo))(askTimeout)
  Await.result(future, askTimeout)
  logDebug(s"Reported block $blockId")
}
The trackerActor.ask call shows that, after the block itself is stored through the ReceivedBlockHandler, its metadata is wrapped in a ReceivedBlockInfo and reported to the ReceiverTracker on the driver, which tracks the blocks of each stream (internally in a HashMap keyed by stream id):
/** Remote Akka actor for the ReceiverTracker */
private val trackerActor = {
  val ip = env.conf.get("spark.driver.host", "localhost")
  val port = env.conf.getInt("spark.driver.port", 7077)
  val url = AkkaUtils.address(
    AkkaUtils.protocol(env.actorSystem),
    SparkEnv.driverActorSystemName,
    ip,
    port,
    "ReceiverTracker")
  env.actorSystem.actorSelection(url)
}
Having analyzed how the received data is stored, let us now look at how it is turned into Spark jobs for processing.
The call chain for data processing is as follows:
JobGenerator.generateJobs ---> DStreamGraph.generateJobs ---> DStream.generateJob ---> getOrCompute ---> compute and generate the RDD
The generateJobs function in JobGenerator is shown below:
/** Generate jobs and perform checkpoint for the given `time`. */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    // allocate all blocks received up to `time` to this batch
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    // call DStreamGraph.generateJobs
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      val receivedBlockInfos =
        jobScheduler.receiverTracker.getBlocksOfBatch(time).mapValues { _.toArray }
      jobScheduler.submitJobSet(JobSet(time, jobs, receivedBlockInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventActor ! DoCheckpoint(time, clearCheckpointDataLater = false)
}
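The middle step of the chain, DStreamGraph.generateJobs, is not listed above; roughly (a from-memory sketch of the 1.x source, details may differ), it simply asks every registered output stream to generate a job for the batch time:

def generateJobs(time: Time): Seq[Job] = this.synchronized {
  // output streams are the DStreams on which an output operation such as print() was called
  outputStreams.flatMap(outputStream => outputStream.generateJob(time))
}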
Finally, DStream.generateJob calls getOrCompute to materialize the RDD for this batch:
/**
* Generate a SparkStreaming job for the given time. This is an internal method that
* should not be called directly. This default implementation creates a job
* that materializes the corresponding RDD. Subclasses of DStream may override this
* to generate their own jobs.
*/
private[streaming] def generateJob(time: Time): Option[Job] = {
  // call getOrCompute to obtain the RDD for this batch
  getOrCompute(time) match {
    case Some(rdd) => {
      val jobFunc = () => {
        val emptyFunc = { (iterator: Iterator[T]) => {} }
        // run the materialized RDD through SparkContext.runJob
        context.sparkContext.runJob(rdd, emptyFunc)
      }
      Some(new Job(time, jobFunc))
    }
    case None => None
  }
}
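getOrCompute itself is not listed above; roughly (reconstructed from memory of the 1.x source, not verbatim), it reuses an RDD that was already generated for this batch time, and otherwise calls the stream's compute method, applying persistence and checkpointing where configured:

private[streaming] def getOrCompute(time: Time): Option[RDD[T]] = {
  generatedRDDs.get(time).orElse {
    if (isTimeValid(time)) {
      compute(time).map { newRDD =>
        if (storageLevel != StorageLevel.NONE) newRDD.persist(storageLevel)
        if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
          newRDD.checkpoint()
        }
        generatedRDDs.put(time, newRDD)
        newRDD
      }
    } else {
      None
    }
  }
}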
In DStream.generateJob we finally see SparkContext.runJob being invoked to execute a Spark job, so the whole path is now clear: data is received, stored as blocks, turned into an RDD for each batch, and executed as a Spark job.
This concludes the source-level walkthrough of the Spark Streaming execution flow; questions and comments are welcome.