First, the example source code:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Durations, StreamingContext}
/**
 * Thanks to teacher Wang Jialin (DT大数据梦工厂) for sharing this knowledge.
 * Blog: http://blog.sina.com.cn/ilovepains
 */
object StreamingWordCountSelfScala {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("spark://master:7077").setAppName("StreamingWordCountSelfScala")
    val ssc = new StreamingContext(sparkConf, Durations.seconds(5)) // harvest the data every 5 seconds
    val lines = ssc.socketTextStream("localhost", 9999) // listen on local socket port 9999
    val words = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _) // flatMap, then reduce
    words.print() // print the results
    ssc.start() // start the context
    ssc.awaitTermination()
    ssc.stop(true)
  }
}
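A side note, not part of the original example: to try this out on a single machine, a common variant swaps the cluster master for a local one, reusing the imports above. At least two threads are needed locally, because the receiver permanently occupies one; the data source can be faked with netcat (nc -lk 9999).

// A minimal local-mode sketch of the same job (assumption: netcat feeding port 9999)
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCountSelfScala")
val ssc = new StreamingContext(conf, Durations.seconds(5))
ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .print()
ssc.start()
ssc.awaitTermination()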
The previous post analyzed the instantiation of InputDStream from the source code. The next step is
ssc.start()
Again we stay grounded in the source, just as teacher Wang Jialin says:
every question originates in the source code, and every question is settled in the source code
At this moment the context's state is INITIALIZED.
start() spawns a new thread, and in that thread calls JobScheduler's start.
// StreamingContext.scala line 60
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()
          // Start the streaming scheduler in a new thread, so that thread local properties
          // like call sites and job groups can be reset without affecting those of the
          // current thread.
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            scheduler.start()
          }
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
      shutdownHookRef = ShutdownHookManager.addShutdownHook(
        StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
      // Registering Streaming Metrics at the start of the StreamingContext
      assert(env.metricsSystem != null)
      env.metricsSystem.registerSource(streamingSource)
      uiTab.foreach(_.attach())
      logInfo("StreamingContext started")
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}
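To make the state machine concrete, here is a small sketch of what the three match arms imply for a caller (behavior as described by the code above, not additional API):

// Sketch: the effect of calling start() in each state
ssc.start()    // INITIALIZED -> ACTIVE: scheduler.start() runs in a new thread
ssc.start()    // ACTIVE: only logs "StreamingContext has already been started"
ssc.stop()
// ssc.start() // STOPPED: would throw IllegalStateException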
Now trace into JobScheduler.start. It does five things:
creates a new EventLoop, overriding its onReceive and onError methods
calls eventLoop.start
instantiates a ReceiverTracker
calls receiverTracker.start
calls jobGenerator.start
def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started
  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    // onReceive simply delegates to processEvent; explained in detail below
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()

  // attach rate controllers of input streams to receive batch completion updates
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)

  listenerBus.start(ssc.sparkContext)
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
  logInfo("Started JobScheduler")
}
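A quick aside on the for-comprehension in the middle: rateController is an Option, so the loop registers a listener only for those input streams that actually define one. A desugared sketch of the equivalent code:

// Equivalent to the for-comprehension above (sketch)
ssc.graph.getInputStreams.foreach { inputDStream =>
  inputDStream.rateController.foreach { rateController =>
    ssc.addStreamingListener(rateController)
  }
}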
Let's first analyze how EventLoop is instantiated.
On instantiation it creates a LinkedBlockingDeque (used as a plain BlockingQueue) to hold the events, and defines a thread whose run method takes one message from the queue and passes it to onReceive.
In EventLoop itself, onReceive is abstract; the concrete implementation was supplied above, when JobScheduler instantiated the EventLoop.
// EventLoop.scala line 34
private[spark] abstract class EventLoop[E](name: String) extends Logging {

  private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()

  private val stopped = new AtomicBoolean(false)

  private val eventThread = new Thread(name) {
    setDaemon(true)

    override def run(): Unit = {
      try {
        while (!stopped.get) {
          val event = eventQueue.take() // take one message out of eventQueue; blocks if empty
          try {
            onReceive(event) // dispatch to the subclass's onReceive
          } catch {
            case NonFatal(e) => {
              try {
                onError(e)
              } catch {
                case NonFatal(e) => logError("Unexpected error in " + name, e)
              }
            }
          }
        }
      } catch {
        case ie: InterruptedException => // exit even if eventQueue is not empty
        case NonFatal(e) => logError("Unexpected error in " + name, e)
      }
    }
  }

  // the concrete implementation is left to subclasses
  protected def onReceive(event: E): Unit

  ...
  // other methods
}
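To see the pattern in isolation, here is a self-contained sketch. Spark's EventLoop is private[spark], so this re-implements the idea (one queue, one daemon thread, a blocking take) rather than reusing Spark's class; all names here are made up for illustration.

import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.atomic.AtomicBoolean

// A stripped-down event loop in the style of EventLoop.scala
object EventLoopSketch {
  def main(args: Array[String]): Unit = {
    val queue = new LinkedBlockingQueue[String]()
    val stopped = new AtomicBoolean(false)

    val eventThread = new Thread("sketch-event-loop") {
      setDaemon(true)
      override def run(): Unit = try {
        while (!stopped.get) {
          val event = queue.take() // blocks here while the queue is empty
          println(s"onReceive: $event")
        }
      } catch {
        case _: InterruptedException => // exit even if the queue is not empty
      }
    }

    eventThread.start()     // the thread starts, then blocks on take()
    queue.put("JobStarted") // producer side: putting an event unblocks take()
    queue.put("JobCompleted")

    Thread.sleep(100)       // give the consumer a moment to drain the queue
    stopped.set(true)
    eventThread.interrupt() // wake the blocked take() so the loop can exit
  }
}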
Now look at processEvent, the method that onReceive delegates to:
// JobScheduler.scala line 145
private def processEvent(event: JobSchedulerEvent) {
  try {
    event match {
      case JobStarted(job, startTime) => handleJobStart(job, startTime)
      case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
      case ErrorReported(m, e) => handleError(m, e)
    }
  } catch {
    case e: Throwable =>
      reportError("Error in job scheduler", e)
  }
}
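Where do these JobSchedulerEvents come from? They are put into the queue through the EventLoop's post method. For instance, still inside JobScheduler.scala, the JobHandler brackets each job run roughly like this (a paraphrased sketch, not a verbatim quote):

// JobHandler (sketch): posting events around a job run
_eventLoop.post(JobStarted(job, clock.getTimeMillis()))
job.run()
_eventLoop.post(JobCompleted(job, clock.getTimeMillis()))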
We will not dig into those handler methods for now; we will go deeper when execution actually reaches them.
Now back to eventLoop.start(). It
calls the onStart method, and then
starts the thread held inside EventLoop.
// EventLoop.scala line 67
def start(): Unit = {
  if (stopped.get) {
    throw new IllegalStateException(name + " has already been stopped")
  }
  onStart()
  eventThread.start()
}
The default implementation of onStart is an empty method. From its comment we know it is invoked before the event thread starts; it is a reserved hook that custom subclasses may override.
The EventLoop instantiated in JobScheduler does not override onStart.
// EventLoop.scala line 111
/**
 * Invoked when `start()` is called but before the event thread starts.
 */
protected def onStart(): Unit = {}
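If a subclass did need setup before the first event is consumed, it could override the hook. A hypothetical example, for illustration only (this does not appear in Spark's code):

// Hypothetical subclass using the onStart hook
val loop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
  override protected def onStart(): Unit = logDebug("event thread about to start")
  override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
  override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
}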
Up to this moment, not a single message has been produced or consumed. From the next moment on, the data starts to flow.
eventThread.start means the run method shown above gets invoked: it first takes a message from the queue.
But hold on: can there be a message at this point? Can there? No. Nothing has ever been put into eventQueue, so where would one come from?
The thread therefore blocks on take() until a message arrives. So let's keep tracing forward.
// JobScheduler.scala line 80
receiverTracker = new ReceiverTracker(ssc)
Tracing in, the class comment tells us that ReceiverTracker manages the execution of data receiving for the ReceiverInputDStream instances. On construction it:
obtains all DStreams that receive data through a receiver, i.e. the ReceiverInputDStreams, along with their assigned IDs
instantiates a ReceivedBlockTracker
// ReceiverTracker.scala line 101
private[streaming]
class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false) extends Logging {

  private val receiverInputStreams = ssc.graph.getReceiverInputStreams()
  private val receiverInputStreamIds = receiverInputStreams.map { _.id }
  private val receivedBlockTracker = new ReceivedBlockTracker(
    ssc.sparkContext.conf,
    ssc.sparkContext.hadoopConfiguration,
    receiverInputStreamIds,
    ssc.scheduler.clock,
    ssc.isCheckpointPresent,
    Option(ssc.checkpointDir)
  )
  private val listenerBus = ssc.scheduler.listenerBus

  /** Enumeration to identify current state of the ReceiverTracker */
  object TrackerState extends Enumeration {
    type TrackerState = Value
    val Initialized, Started, Stopping, Stopped = Value
  }
  import TrackerState._

  /** State of the tracker. Protected by "trackerStateLock" */
  @volatile private var trackerState = Initialized

  // endpoint is created when generator starts.
  // This not being null means the tracker has been started and not stopped
  private var endpoint: RpcEndpointRef = null

  private val schedulingPolicy = new ReceiverSchedulingPolicy()

  private val receiverJobExitLatch = new CountDownLatch(receiverInputStreams.size)

  /**
   * Track all receivers' information. The key is the receiver id, the value is the receiver info.
   * It's only accessed in ReceiverTrackerEndpoint.
   */
  private val receiverTrackingInfos = new HashMap[Int, ReceiverTrackingInfo]

  /**
   * Store all preferred locations for all receivers. We need this information to schedule
   * receivers. It's only accessed in ReceiverTrackerEndpoint.
   */
  private val receiverPreferredLocations = new HashMap[Int, Option[String]]
First, look at ssc.graph.getReceiverInputStreams().
It filters out every InputDStream that is not a ReceiverInputDStream, casts each remaining one to ReceiverInputDStream, and returns the result.
Some readers may ask: why an array? Because there can be more than one kind of data receiver at the same time; data may flow in from a socket, and also from Kafka via the receiver-based approach.
// DStreamGraph.scala line 101
def getReceiverInputStreams(): Array[ReceiverInputDStream[_]] = this.synchronized {
  inputStreams.filter(_.isInstanceOf[ReceiverInputDStream[_]])
    .map(_.asInstanceOf[ReceiverInputDStream[_]])
    .toArray
}
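A sketch to make the filtering concrete (the directory path is an assumption made up for this example): if a graph mixes a receiver-based stream with a non-receiver one, only the former is returned.

// Sketch: two input streams, only one backed by a receiver
val socketStream = ssc.socketTextStream("localhost", 9999) // a ReceiverInputDStream
val fileStream   = ssc.textFileStream("/tmp/streaming-in") // no receiver: the driver polls the directory
// getReceiverInputStreams() would return only socketStream here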
Next, ReceivedBlockTracker: a tracker created to keep track of ReceivedBlocks, i.e. the data blocks that have been received.
Its construction merely initializes a few data structures and, when the feature is enabled, creates the write-ahead log (WAL).
// ReceivedBlockTracker.scala line 62
private[streaming] class ReceivedBlockTracker(
    conf: SparkConf,
    hadoopConf: Configuration,
    streamIds: Seq[Int],
    clock: Clock,
    recoverFromWriteAheadLog: Boolean,
    checkpointDirOption: Option[String])
  extends Logging {

  private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]

  private val streamIdToUnallocatedBlockQueues = new mutable.HashMap[Int, ReceivedBlockQueue]
  private val timeToAllocatedBlocks = new mutable.HashMap[Time, AllocatedBlocks]
  private val writeAheadLogOption = createWriteAheadLog()

  private var lastAllocatedBatchTime: Time = null

  // line 251
  /** Optionally create the write ahead log manager only if the feature is enabled */
  private def createWriteAheadLog(): Option[WriteAheadLog] = {
    checkpointDirOption.map { checkpointDir =>
      val logDir = ReceivedBlockTracker.checkpointDirToLogDir(checkpointDirOption.get)
      WriteAheadLogUtils.createLogForDriver(conf, logDir, hadoopConf)
    }
  }

  ...
  // other methods
}
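Note that createWriteAheadLog is driven entirely by checkpointDirOption: no checkpoint directory, no WAL. A sketch of the driver-side setup that enables it (the HDFS path is an assumption; the config key is the standard one for receiver-side WAL writing):

// Sketch: enabling checkpointing, and the receiver WAL, for a streaming app
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("StreamingWordCountSelfScala")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Durations.seconds(5))
ssc.checkpoint("hdfs://master:9000/spark/checkpoint") // makes checkpointDirOption non-empty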
At this point the data-receiving side has been instantiated as well. But has it started receiving data? No.
So, as of now, the event queue is ready, and the receiver machinery is ready too.
What else must happen before data can actually flow in?
See the next post.