Example code:
OnlineTheTop3ItemForEachCategory2DB.scala
package com.dt.spark.sparkstreaming
import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
 * Use Spark Streaming together with Spark SQL to compute, online and continuously, the most popular
 * items in each category of an e-commerce site, e.g. the three hottest phones in the phone category
 * and the three hottest TVs in the TV category. This example is of great practical value in real
 * production environments.
 *
 * Sina Weibo: http://weibo.com/ilovepains/
 *
 * Implementation: Spark Streaming + Spark SQL. The reason Spark Streaming can use ML, SQL, GraphX
 * and the rest of Spark is that it exposes interfaces such as foreachRDD and transform, which operate
 * on RDDs. With the RDD as the common foundation, all other Spark functionality can be used directly,
 * as simply as calling an API.
 * The assumed input format is: user item category, e.g. Rocky Samsung Android
 */
object OnlineTheTop3ItemForEachCategory2DB {
def main(args: Array[String]){
/**
 * Step 1: Create the Spark configuration object SparkConf and set the runtime configuration of the
 * Spark application. For example, setMaster sets the URL of the Master of the Spark cluster the
 * program connects to; if it is set to local, the program runs locally, which is especially suitable
 * for beginners whose machines are very limited (e.g. only 1 GB of memory).
 */
val conf = new SparkConf() // create the SparkConf object
conf.setAppName("OnlineTheTop3ItemForEachCategory2DB") // set the application name, shown in the monitoring UI while the program runs
// conf.setMaster("spark://Master:7077") // in this case the program runs on the Spark cluster
conf.setMaster("local[6]")
// set the batchDuration interval, which controls how often Jobs are generated, and create the entry point of Spark Streaming execution
val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("/root/Documents/SparkApps/checkpoint")
val userClickLogsDStream = ssc.socketTextStream("Master", 9999)
val formattedUserClickLogsDStream = userClickLogsDStream.map(clickLog =>
(clickLog.split(" ")(2) + "_" + clickLog.split(" ")(1), 1))
// val categoryUserClickLogsDStream = formattedUserClickLogsDStream.reduceByKeyAndWindow((v1:Int, v2: Int) => v1 + v2,
// (v1:Int, v2: Int) => v1 - v2, Seconds(60), Seconds(20))
val categoryUserClickLogsDStream = formattedUserClickLogsDStream.reduceByKeyAndWindow(_+_,
_-_, Seconds(60), Seconds(20))
categoryUserClickLogsDStream.foreachRDD { rdd =>
  if (rdd.isEmpty()) {
    println("No data inputted!!!")
  } else {
    val categoryItemRow = rdd.map(reducedItem => {
      val category = reducedItem._1.split("_")(0)
      val item = reducedItem._1.split("_")(1)
      val click_count = reducedItem._2
      Row(category, item, click_count)
    })
    val structType = StructType(Array(
      StructField("category", StringType, true),
      StructField("item", StringType, true),
      StructField("click_count", IntegerType, true)
    ))
    val hiveContext = new HiveContext(rdd.context)
    val categoryItemDF = hiveContext.createDataFrame(categoryItemRow, structType)
    categoryItemDF.registerTempTable("categoryItemTable")
    val resultDataFrame = hiveContext.sql("SELECT category,item,click_count FROM (SELECT category,item,click_count,row_number()" +
      " OVER (PARTITION BY category ORDER BY click_count DESC) rank FROM categoryItemTable) subquery " +
      " WHERE rank <= 3")
    resultDataFrame.show()
    val resultRowRDD = resultDataFrame.rdd
    resultRowRDD.foreachPartition { partitionOfRecords =>
      if (partitionOfRecords.isEmpty) {
        println("This RDD is not null but partition is null")
      } else {
        // ConnectionPool is a static, lazily initialized pool of connections
        val connection = ConnectionPool.getConnection()
        partitionOfRecords.foreach(record => {
          val sql = "insert into categorytop3(category,item,client_count) values('" + record.getAs("category") + "','" +
            record.getAs("item") + "'," + record.getAs("click_count") + ")"
          val stmt = connection.createStatement()
          stmt.executeUpdate(sql)
        })
        ConnectionPool.returnConnection(connection) // return to the pool for future reuse
      }
    }
  }
}
/**
 * When StreamingContext.start is called, it internally starts the JobScheduler's start method, which
 * runs the message loop. Inside JobScheduler.start, a JobGenerator and a ReceiverTracker are
 * constructed and their start methods are called:
 * 1. Once started, JobGenerator keeps generating Jobs according to the batchDuration.
 * 2. Once started, ReceiverTracker first launches the Receivers in the Spark cluster (actually by
 * starting a ReceiverSupervisor on an Executor first). After a Receiver receives data, the
 * ReceiverSupervisor stores it on the Executor and sends the metadata of the data to the
 * ReceiverTracker on the Driver, which manages the received metadata through a ReceivedBlockTracker.
 * Each batchInterval produces a concrete Job. This Job is not a Job in the Spark Core sense; it is
 * only the DAG of RDDs generated from the DStreamGraph. From a Java point of view it is like an
 * instance of Runnable. To actually run, the Job has to be submitted to the JobScheduler, which uses
 * a thread pool to pick a separate thread and submit the Job to the cluster (the real work is
 * triggered inside that thread by an action on the RDD). Why a thread pool?
 * 1. Jobs are generated continuously, so a thread pool improves efficiency; this mirrors how Tasks
 * are executed through a thread pool on an Executor.
 * 2. The FAIR scheduling mode may be configured for Jobs, which also requires multi-threading.
 */
ssc.start()
ssc.awaitTermination()
}
}
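The core of the example is reduceByKeyAndWindow with both a reduce function (_ + _) and an inverse function (_ - _): instead of re-aggregating the whole 60-second window every 20 seconds, Spark adds the counts of the batch that slides into the window and subtracts the counts of the batch that slides out (this incremental form is also why ssc.checkpoint(...) is set above). Below is a minimal sketch of that incremental update using plain Scala collections, independent of Spark; all names and sample keys are illustrative only.

object WindowedCountSketch {
  type Counts = Map[String, Int]

  // combine two per-key count maps with a given operation (addition or subtraction)
  def merge(a: Counts, b: Counts, op: (Int, Int) => Int): Counts =
    (a.keySet ++ b.keySet).map(k => k -> op(a.getOrElse(k, 0), b.getOrElse(k, 0))).toMap

  def main(args: Array[String]): Unit = {
    val oldWindow: Counts      = Map("android_samsung" -> 5, "apple_applephone" -> 2) // last 60s
    val leavingBatch: Counts   = Map("android_samsung" -> 1)                          // oldest 20s, slides out
    val enteringBatch: Counts  = Map("android_samsung" -> 3, "android_xiaomi" -> 2)   // newest 20s, slides in

    // new window = old window + entering batch - leaving batch
    val newWindow = merge(merge(oldWindow, enteringBatch, _ + _), leavingBatch, _ - _)
    println(newWindow) // e.g. Map(android_samsung -> 7, apple_applephone -> 2, android_xiaomi -> 2)
  }
}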
ConnectionPool.java
package com.dt.spark.sparkstreaming;
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.LinkedList;
public class ConnectionPool {
static LinkedList<Connection> connectionQueue = null;
static {
try {
Class.forName("com.mysql.jdbc.Driver");
} catch (ClassNotFoundException e) {
e.printStackTrace();
}
}
public synchronized static Connection getConnection() {
try {
if(connectionQueue == null) {
connectionQueue = new LinkedList<Connection>();
for(int i = 0 ; i < 5 ; i++) {
Connection conn = DriverManager.getConnection("jdbc:mysql://192.168.110.237:3306/sparkstreaming","root","123456");
connectionQueue.push(conn);
}
}
} catch (Exception e) {
e.printStackTrace();
}
return connectionQueue.poll();
}
public static void returnConnection(Connection conn) {
connectionQueue.push(conn);
}
}
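The insert in foreachPartition above builds its SQL by string concatenation, which breaks if an item name ever contains a quote. A hedged alternative sketch of the same write path using a PreparedStatement is shown below; it assumes the same categorytop3 table and the ConnectionPool class above, and the writePartition helper and tuple record type are illustrative stand-ins for the Row-based records of the example.

import java.sql.Connection

object SafeWriteSketch {
  // records are (category, item, clickCount) triples for one partition
  def writePartition(records: Iterator[(String, String, Int)]): Unit = {
    if (records.nonEmpty) {
      val connection: Connection = ConnectionPool.getConnection()
      // parameters are bound instead of concatenated into the SQL string
      val ps = connection.prepareStatement(
        "insert into categorytop3(category,item,client_count) values(?,?,?)")
      try {
        records.foreach { case (category, item, clickCount) =>
          ps.setString(1, category)
          ps.setString(2, item)
          ps.setInt(3, clickCount)
          ps.executeUpdate()
        }
      } finally {
        ps.close()
        ConnectionPool.returnConnection(connection) // return to the pool for future reuse
      }
    }
  }
}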
1. Package the program and copy it to the server.
2. Run nc first. If you run the packaged program directly, it will fail with a connection-refused error, because nothing is listening on port 9999 yet.
$ nc -lk 9999
peter samsung androidphone
mike huawei androidphone
jim xiaomi
jaker apple applephone
lili samsung androidphone
zhangsan samsung androidpad
lisi samsung androidpad
peter samsung androidpad
jack apple applephone
3. Submit the packaged code to the cluster.
$ vim OnlineTheTop3ItemForEachCategory2DB.sh
Contents:
/usr/local/spark-1.6.1-bin-hadoop2.6/bin/spark-submit --class com.dt.spark.sparkstreaming.OnlineTheTop3ItemForEachCategory2DB --jars /home/SparkApps/mysql-connector-java-5.1.26-bin.jar --master spark://Master:7077 /home/SparkApps/OnlineTheTop3ItemForEachCategory2DB.jar
Run the shell script:
$ sh OnlineTheTop3ItemForEachCategory2DB.sh
Execution walkthrough
1. Open the application with App ID app-20160506093713-0009 in the Spark UI.
2. You will see that many Jobs have been generated.
3. Click Job Id 1, "Streaming job running receiver 0", start at OnlineTheTop3ItemForEachCategory2DB.scala:118. This is the Receiver that receives the data.
4. Open Job Id 2, "Streaming job from [output operation 0, batch time 09:37:50]", isEmpty at OnlineTheTop3ItemForEachCategory2DB.scala:51.
StreamingContext.scala
class StreamingContext private[streaming] (
sc_ : SparkContext,
cp_ : Checkpoint,
batchDur_ : Duration
) extends Logging {
private[streaming] def createNewSparkContext(conf: SparkConf): SparkContext = {
new SparkContext(conf)
}
private[streaming] def createNewSparkContext(
master: String,
appName: String,
sparkHome: String,
jars: Seq[String],
environment: Map[String, String]
): SparkContext = {
val conf = SparkContext.updatedConf(
new SparkConf(), master, appName, sparkHome, jars, environment)
new SparkContext(conf)
}
def socketTextStream(
hostname: String,
port: Int,
storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}
def socketStream[T: ClassTag](
hostname: String,
port: Int,
converter: (InputStream) => Iterator[T],
storageLevel: StorageLevel
): ReceiverInputDStream[T] = {
new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
}
/**
 * Start the execution of the streams.
 *
 * @throws IllegalStateException if the StreamingContext is already stopped.
 */
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()
          // Start the streaming scheduler in a new thread, so that thread local properties
          // like call sites and job groups can be reset without affecting those of the
          // current thread.
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            scheduler.start()
          }
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
      shutdownHookRef = ShutdownHookManager.addShutdownHook(
        StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
      // Registering Streaming Metrics at the start of the StreamingContext
      assert(env.metricsSystem != null)
      env.metricsSystem.registerSource(streamingSource)
      uiTab.foreach(_.attach())
      logInfo("StreamingContext started")
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}
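The comment in start() explains why the scheduler is started in a new thread: thread-local properties such as the call site and the job group can be set on the "streaming-start" thread without touching the caller's thread. A small self-contained sketch of that isolation, using plain Scala/JDK constructs rather than Spark code:

object ThreadLocalIsolationSketch {
  private val jobGroup = new ThreadLocal[String] { override def initialValue(): String = "default" }

  def main(args: Array[String]): Unit = {
    jobGroup.set("user-thread-group")            // property of the calling thread
    val t = new Thread("streaming-start") {
      override def run(): Unit = {
        jobGroup.set("streaming-start-group")    // reset only for this thread
        println(s"${Thread.currentThread().getName}: ${jobGroup.get()}")
      }
    }
    t.start()
    t.join()
    // the caller's value is untouched
    println(s"${Thread.currentThread().getName}: ${jobGroup.get()}") // prints user-thread-group
  }
}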
SocketInputDStream.scala
private[streaming]
class SocketInputDStream[T: ClassTag](
ssc_ : StreamingContext,
host: String,
port: Int,
bytesToObjects: InputStream => Iterator[T],
storageLevel: StorageLevel
) extends ReceiverInputDStream[T](ssc_) {
def getReceiver(): Receiver[T] = {
new SocketReceiver(host, port, bytesToObjects, storageLevel)
}
}
/* In SocketReceiver, onStart starts a thread; when that thread runs it calls the receive method, which connects to the socket and keeps looping to receive data. */
class SocketReceiver[T: ClassTag](
host: String,
port: Int,
bytesToObjects: InputStream => Iterator[T],
storageLevel: StorageLevel
) extends Receiver[T](storageLevel) with Logging {
def onStart() {
// Start the thread that receives data over a connection
new Thread("Socket Receiver") {
setDaemon(true)
override def run() { receive() }
}.start()
}
/** Create a socket connection and receive data until receiver is stopped */
def receive() {
var socket: Socket = null
try {
logInfo("Connecting to " + host + ":" + port)
socket = new Socket(host, port)
logInfo("Connected to " + host + ":" + port)
val iterator = bytesToObjects(socket.getInputStream())
while(!isStopped && iterator.hasNext) {
store(iterator.next)
}
if (!isStopped()) {
restart("Socket data stream had no more data")
} else {
logInfo("Stopped receiving")
}
} catch {
case e: java.net.ConnectException =>
restart("Error connecting to " + host + ":" + port, e)
case NonFatal(e) =>
logWarning("Error receiving data", e)
restart("Error receiving data", e)
} finally {
if (socket != null) {
socket.close()
logInfo("Closed socket to " + host + ":" + port)
}
}
}
}
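SocketReceiver follows the general Receiver contract: onStart() starts a non-blocking thread, that thread calls store() for every element, and restart()/stop() drive the lifecycle. A minimal hedged sketch of a custom receiver built on the same contract; the CounterReceiver class and its data source are illustrative and not part of Spark:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A toy receiver that emits a counter once per second; a real receiver would read from an external source.
class CounterReceiver(storageLevel: StorageLevel) extends Receiver[String](storageLevel) {

  @volatile private var thread: Thread = _

  override def onStart(): Unit = {
    // onStart must not block: start a daemon thread and return immediately
    thread = new Thread("Counter Receiver") {
      setDaemon(true)
      override def run(): Unit = {
        var i = 0L
        while (!isStopped()) {
          store(s"tick-$i")   // hand each record to Spark Streaming for storage
          i += 1
          Thread.sleep(1000)
        }
      }
    }
    thread.start()
  }

  override def onStop(): Unit = {
    // nothing to clean up: the loop above exits once isStopped() returns true
  }
}

// usage (assuming an existing StreamingContext `ssc`):
//   val lines = ssc.receiverStream(new CounterReceiver(StorageLevel.MEMORY_AND_DISK_SER_2))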
ReceiverInputDStream.scala
/*
 * This is the base class for all input streams that receive data through a Receiver. It provides
 * start and stop methods, which the Spark Streaming system calls back when it starts and stops
 * receiving data.
 */
abstract class ReceiverInputDStream[T: ClassTag](ssc_ : StreamingContext)
  extends InputDStream[T](ssc_) {

  def getReceiver(): Receiver[T]

  def start() {}
  def stop() {}
InputDStream.scala
abstract class InputDStream[T: ClassTag] (ssc_ : StreamingContext)
  extends DStream[T](ssc_) {
DStream.scala
/* A DStream is the template from which the RDD of each batch interval is generated. */
abstract class DStream[T: ClassTag] (
    @transient private[streaming] var ssc: StreamingContext
  ) extends Serializable with Logging {

  private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()

  /**
   * Get the RDD corresponding to the given batch time, either from the cache or by
   * computing and caching it.
   */
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  // If RDD was already generated, then retrieve it from HashMap,
  // or else compute the RDD
  generatedRDDs.get(time).orElse {
    // Compute the RDD if time is valid (e.g. correct time in a sliding window)
    // of RDD generation, else generate nothing.
    if (isTimeValid(time)) {

      val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
        // Disable checks for existing output directories in jobs launched by the streaming
        // scheduler, since we may need to write output to an existing directory during checkpoint
        // recovery; see SPARK-4835 for more details. We need to have this call here because
        // compute() might cause Spark jobs to be launched.
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          compute(time)
        }
      }

      rddOption.foreach { case newRDD =>
        // Register the generated RDD for caching and checkpointing
        if (storageLevel != StorageLevel.NONE) {
          newRDD.persist(storageLevel)
          logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
        }
        if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
          newRDD.checkpoint()
          logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
        }
        generatedRDDs.put(time, newRDD)
      }
      rddOption
    } else {
      None
    }
  }
}
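getOrCompute is essentially per-batch-time memoization: look the RDD up in generatedRDDs, otherwise compute it, optionally persist/checkpoint it, and cache it under its Time. A stripped-down sketch of that cache-or-compute pattern in plain Scala, with Time simplified to a Long and the "RDD" to a plain value:

import scala.collection.mutable

object GetOrComputeSketch {
  private val generated = new mutable.HashMap[Long, Vector[Int]]()  // time -> "RDD"

  private def isTimeValid(time: Long): Boolean = time % 5000 == 0   // e.g. aligned to a 5s batch

  def getOrCompute(time: Long)(compute: Long => Vector[Int]): Option[Vector[Int]] =
    generated.get(time).orElse {
      if (isTimeValid(time)) {
        val rdd = compute(time)      // only computed on a cache miss
        generated.put(time, rdd)     // remember it for later lookups of the same batch time
        Some(rdd)
      } else {
        None
      }
    }

  def main(args: Array[String]): Unit = {
    println(getOrCompute(10000L)(t => Vector(1, 2, 3)))               // Some(Vector(1, 2, 3)), computed
    println(getOrCompute(10000L)(t => sys.error("not recomputed")))   // Some(...), served from cache
    println(getOrCompute(10001L)(t => Vector(9)))                     // None, time not valid
  }
}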
Class hierarchy of DStream (diagram)
Class hierarchy of InputDStream (diagram)
Class hierarchy of ReceiverInputDStream (diagram)
ForEachDStream.scala
package org.apache.spark.streaming.dstream
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Duration, Time}
import org.apache.spark.streaming.scheduler.Job
import scala.reflect.ClassTag
/**
 * Similar to an output DStream: it wraps the output operation (foreachFunc) applied to its parent DStream.
 */
private[streaming]
class ForEachDStream[T: ClassTag] (
    parent: DStream[T],
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean
  ) extends DStream[Unit](parent.ssc) {

  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  override def compute(validTime: Time): Option[RDD[Unit]] = None

  override def generateJob(time: Time): Option[Job] = {
    parent.getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
          foreachFunc(rdd, time)
        }
        Some(new Job(time, jobFunc))
      case None => None
    }
  }
}
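generateJob does not run anything by itself: it wraps the user's foreachFunc and the batch's RDD into a Job, which is later handed to JobScheduler and executed on a thread from its pool, as described in the long comment in the example code above. A hedged sketch of that "wrap a closure into a job and submit it to a pool" pattern, using plain Scala/JDK constructs with illustrative names:

import java.util.concurrent.Executors

object JobSubmissionSketch {
  // a "job" is just a batch time plus a function to run, much like a Runnable
  final case class SimpleJob(time: Long, run: () => Unit)

  def main(args: Array[String]): Unit = {
    // JobScheduler uses a similar fixed pool, sized by spark.streaming.concurrentJobs (default 1)
    val jobExecutor = Executors.newFixedThreadPool(4)

    def generateJob(time: Long): SimpleJob =
      SimpleJob(time, () => println(s"running output operation for batch $time on ${Thread.currentThread().getName}"))

    // every batch interval a new job is generated and submitted
    (1 to 3).foreach { i =>
      val job = generateJob(i * 5000L)
      jobExecutor.submit(new Runnable { override def run(): Unit = job.run() })
    }
    jobExecutor.shutdown()
  }
}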
Now step into the scheduler.start() call made inside StreamingContext.start() (shown above) on the "streaming-start" thread.
The start method in JobScheduler.scala:
def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started

  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()

  // attach rate controllers of input streams to receive batch completion updates
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)

  listenerBus.start(ssc.sparkContext)
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
  logInfo("Started JobScheduler")
}
EventLoop.scala (its onReceive hook is what JobScheduler overrides above):
package org.apache.spark.util

import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.{BlockingQueue, LinkedBlockingDeque}

import scala.util.control.NonFatal

import org.apache.spark.Logging

/**
 * An event loop to receive events from the caller and process all events in the event thread. It
 * will start an exclusive event thread to process all events.
 *
 * Note: The event queue will grow indefinitely. So subclasses should make sure `onReceive` can
 * handle events in time to avoid the potential OOM.
 */
private[spark] abstract class EventLoop[E](name: String) extends Logging {

  private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()

  private val stopped = new AtomicBoolean(false)

  private val eventThread = new Thread(name) {
    setDaemon(true)

    override def run(): Unit = {
      try {
        while (!stopped.get) {
          // events are taken from the queue in FIFO (first-in, first-out) order
          val event = eventQueue.take()
          try {
            onReceive(event)
          } catch {
            case NonFatal(e) => {
              try {
                onError(e)
              } catch {
                case NonFatal(e) => logError("Unexpected error in " + name, e)
              }
            }
          }
        }
      } catch {
        case ie: InterruptedException => // exit even if eventQueue is not empty
        case NonFatal(e) => logError("Unexpected error in " + name, e)
      }
    }
  }

  def start(): Unit = {
    if (stopped.get) {
      throw new IllegalStateException(name + " has already been stopped")
    }
    // Call onStart before starting the event thread to make sure it happens before onReceive
    onStart()
    eventThread.start()
  }

  def stop(): Unit = {
    if (stopped.compareAndSet(false, true)) {
      eventThread.interrupt()
      var onStopCalled = false
      try {
        eventThread.join()
        // Call onStop after the event thread exits to make sure onReceive happens before onStop
        onStopCalled = true
        onStop()
      } catch {
        case ie: InterruptedException =>
          Thread.currentThread().interrupt()
          if (!onStopCalled) {
            // ie is thrown from `eventThread.join()`. Otherwise, we should not call `onStop` since
            // it's already called.
            onStop()
          }
      }
    } else {
      // Keep quiet to allow calling `stop` multiple times.
    }
  }

  /**
   * Put the event into the event queue. The event thread will process it later.
   */
  def post(event: E): Unit = {
    eventQueue.put(event)
  }

  /**
   * Return if the event thread has already been started but not yet stopped.
   */
  def isActive: Boolean = eventThread.isAlive

  /**
   * Invoked when `start()` is called but before the event thread starts.
   */
  protected def onStart(): Unit = {}

  /**
   * Invoked when `stop()` is called and the event thread exits.
   */
  protected def onStop(): Unit = {}

  /**
   * Invoked in the event thread when polling events from the event queue.
   *
   * Note: Should avoid calling blocking actions in `onReceive`, or the event thread will be blocked
   * and cannot process events in time. If you want to call some blocking actions, run them in
   * another thread.
   */
  protected def onReceive(event: E): Unit

  /**
   * Invoked if `onReceive` throws any non fatal error. Any non fatal error thrown from `onError`
   * will be ignored.
   */
  protected def onError(e: Throwable): Unit
}
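EventLoop is private[spark], so it cannot be subclassed from user code, but the pattern itself is simple: a daemon thread takes events from a blocking queue in FIFO order and dispatches them to onReceive. A self-contained sketch of that pattern outside Spark (MiniEventLoop is an illustrative name, not a Spark class):

import java.util.concurrent.LinkedBlockingDeque
import java.util.concurrent.atomic.AtomicBoolean

// A stripped-down event loop in the style of org.apache.spark.util.EventLoop (illustrative only).
class MiniEventLoop[E](name: String)(handle: E => Unit) {
  private val queue = new LinkedBlockingDeque[E]()
  private val stopped = new AtomicBoolean(false)

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped.get) {
          val event = queue.take()   // blocks until an event is posted; FIFO order
          handle(event)
        }
      } catch {
        case _: InterruptedException => // stop() interrupts the thread to shut it down
      }
    }
  }

  def start(): Unit = eventThread.start()
  def post(event: E): Unit = queue.put(event)
  def stop(): Unit = { stopped.set(true); eventThread.interrupt(); eventThread.join() }
}

object MiniEventLoopDemo {
  def main(args: Array[String]): Unit = {
    val loop = new MiniEventLoop[String]("mini-loop")(e => println(s"processing $e"))
    loop.start()
    loop.post("GenerateJobs(5000)")
    loop.post("JobCompleted(5000)")
    Thread.sleep(200)  // give the daemon thread time to drain the queue (sketch only)
    loop.stop()
  }
}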
The JobScheduler itself is created inside StreamingContext. JobScheduler.scala:
class JobScheduler(val ssc: StreamingContext) extends Logging {
Its start() method is the one shown above; note that the listenerBus it starts is a StreamingListenerBus.
The processEvent(event) method:
private def processEvent(event: JobSchedulerEvent) {
  try {
    event match {
      case JobStarted(job, startTime) => handleJobStart(job, startTime)
      case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
      case ErrorReported(m, e) => handleError(m, e)
    }
  } catch {
    case e: Throwable =>
      reportError("Error in job scheduler", e)
  }
}
The start method in ReceiverTracker.scala:
/** Start the endpoint and receiver execution thread. */
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }

  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}
private class ReceiverTrackerEndpoint(override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint {

  override def receive: PartialFunction[Any, Unit] = {
    // Local messages
    case StartAllReceivers(receivers) =>
      val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
      for (receiver <- receivers) {
        val executors = scheduledLocations(receiver.streamId)
        updateReceiverScheduledExecutors(receiver.streamId, executors)
        receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
        startReceiver(receiver, executors)
      }
    case RestartReceiver(receiver) =>
      // Old scheduled executors minus the ones that are not active any more
      val oldScheduledExecutors = getStoredScheduledExecutors(receiver.streamId)
      val scheduledLocations = if (oldScheduledExecutors.nonEmpty) {
          // Try global scheduling again
          oldScheduledExecutors
        } else {
          val oldReceiverInfo = receiverTrackingInfos(receiver.streamId)
          // Clear "scheduledLocations" to indicate we are going to do local scheduling
          val newReceiverInfo = oldReceiverInfo.copy(
            state = ReceiverState.INACTIVE, scheduledLocations = None)
          receiverTrackingInfos(receiver.streamId) = newReceiverInfo
          schedulingPolicy.rescheduleReceiver(
            receiver.streamId,
            receiver.preferredLocation,
            receiverTrackingInfos,
            getExecutors)
        }
      // Assume there is one receiver restarting at one time, so we don't need to update
      // receiverTrackingInfos
      startReceiver(receiver, scheduledLocations)
    case c: CleanupOldBlocks =>
      receiverTrackingInfos.values.flatMap(_.endpoint).foreach(_.send(c))
    case UpdateReceiverRateLimit(streamUID, newRate) =>
      for (info <- receiverTrackingInfos.get(streamUID); eP <- info.endpoint) {
        eP.send(UpdateRateLimit(newRate))
      }
    // Remote messages
    case ReportError(streamId, message, error) =>
      reportError(streamId, message, error)
  }
/**
 * Start a receiver along with its scheduled executors
 */
private def startReceiver(
    receiver: Receiver[_],
    scheduledLocations: Seq[TaskLocation]): Unit = {
  def shouldStartReceiver: Boolean = {
    // It's okay to start when trackerState is Initialized or Started
    !(isTrackerStopping || isTrackerStopped)
  }

  val receiverId = receiver.streamId
  if (!shouldStartReceiver) {
    onReceiverJobFinish(receiverId)
    return
  }

  val checkpointDirOption = Option(ssc.checkpointDir)
  val serializableHadoopConf =
    new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

  // Function to start the receiver on the worker node
  val startReceiverFunc: Iterator[Receiver[_]] => Unit =
    (iterator: Iterator[Receiver[_]]) => {
      if (!iterator.hasNext) {
        throw new SparkException("Could not start receiver as object not found.")
      }
      if (TaskContext.get().attemptNumber() == 0) {
        val receiver = iterator.next()
        assert(iterator.hasNext == false)
        val supervisor = new ReceiverSupervisorImpl(
          receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
        supervisor.start()
        supervisor.awaitTermination()
      } else {
        // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
      }
    }

  // Create the RDD using the scheduledLocations to run the receiver in a Spark job
  val receiverRDD: RDD[Receiver[_]] =
    if (scheduledLocations.isEmpty) {
      ssc.sc.makeRDD(Seq(receiver), 1)
    } else {
      val preferredLocations = scheduledLocations.map(_.toString).distinct
      ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
    }
  receiverRDD.setName(s"Receiver $receiverId")
  ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
  ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

  val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
    receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
  // We will keep restarting the receiver job until ReceiverTracker is stopped
  future.onComplete {
    case Success(_) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        logInfo(s"Restarting Receiver $receiverId")
        self.send(RestartReceiver(receiver))
      }
    case Failure(e) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        logError("Receiver has been stopped. Try to restart it.", e)
        logInfo(s"Restarting Receiver $receiverId")
        self.send(RestartReceiver(receiver))
      }
  }(submitJobThreadPool)
  logInfo(s"Receiver ${receiver.streamId} started")
}
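The receiver therefore runs as a normal Spark job: a one-element RDD carrying the Receiver object is built (with preferred locations when the scheduling policy supplies them), and submitJob ships startReceiverFunc to an executor, where it blocks for the life of the receiver. A hedged sketch of the same submitJob mechanics on a trivial payload, assuming a local SparkContext; the payload string and handlers are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

object SubmitJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("SubmitJobSketch"))

    // a one-element RDD; ReceiverTracker builds the same shape around the Receiver object,
    // optionally with preferred hosts via makeRDD(Seq(elem -> Seq("host")))
    val oneElementRDD = sc.makeRDD(Seq("payload"), 1)

    val future = sc.submitJob[String, Unit, Unit](
      oneElementRDD,
      (it: Iterator[String]) => it.foreach(p => println(s"task running with $p")), // runs on an executor
      Seq(0),           // submit only partition 0
      (_, _) => (),     // per-partition result handler (nothing to collect)
      ())               // overall result
    future.onComplete {
      case Success(_) => println("task completed")   // ReceiverTracker restarts the receiver at this point
      case Failure(e) => println(s"task failed: $e")
    }

    Thread.sleep(3000) // sketch only: keep the driver alive until the async callback fires
    sc.stop()
  }
}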
Inside startReceiverFunc above, the executor creates val supervisor = new ReceiverSupervisorImpl(...) and calls supervisor.start():
ReceiverSupervisor.scala
/** Start the supervisor */
def start() {
  onStart()
  startReceiver()
}

/** Start receiver */
def startReceiver(): Unit = synchronized {
  try {
    if (onReceiverStart()) {
      logInfo("Starting receiver")
      receiverState = Started
      receiver.onStart()
      logInfo("Called receiver onStart")
    } else {
      // The driver refused us
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}
ReceiverSupervisorImpl.scala
override protected def onStart() {
registeredBlockGenerators.foreach { _.start() }
}
override protected def onReceiverStart(): Boolean = {
  val msg = RegisterReceiver(
    streamId, receiver.getClass.getSimpleName, host, executorId, endpoint)
  trackerEndpoint.askWithRetry[Boolean](msg)
}
ReceiverTracker.scala
/**
 * Get the receivers from the ReceiverInputDStreams, distributes them to the
 * worker nodes as a parallel collection, and runs them.
 */
private def launchReceivers(): Unit = {
  val receivers = receiverInputStreams.map(nis => {
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  })

  runDummySparkJob()

  logInfo("Starting " + receivers.length + " receivers")
  endpoint.send(StartAllReceivers(receivers))
}

/**
 * Run the dummy Spark job to ensure that all slaves have registered. This avoids all the
 * receivers to be scheduled on the same node.
 *
 * TODO Should poll the executor number and wait for executors according to
 * "spark.scheduler.minRegisteredResourcesRatio" and
 * "spark.scheduler.maxRegisteredResourcesWaitingTime" rather than running a dummy job.
 */
private def runDummySparkJob(): Unit = {
  if (!ssc.sparkContext.isLocal) {
    ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
  }
  assert(getExecutors.nonEmpty)
}
private val receivedBlockTracker = new ReceivedBlockTracker(
  ssc.sparkContext.conf,
  ssc.sparkContext.hadoopConfiguration,
  receiverInputStreamIds,
  ssc.scheduler.clock,
  ssc.isCheckpointPresent,
  Option(ssc.checkpointDir)
)
ReceivedBlockTracker.scala
/**
 * Whenever data is received, the ReceivedBlockTracker keeps its metadata. When the Driver schedules a
 * batch and decides which data to hand to a concrete Job, it also asks the ReceivedBlockTracker for
 * the data belonging to that batch duration.
 */
private[streaming] class ReceivedBlockTracker(
    conf: SparkConf,
    hadoopConf: Configuration,
    streamIds: Seq[Int],
    clock: Clock,
    recoverFromWriteAheadLog: Boolean,
    checkpointDirOption: Option[String])
  extends Logging {
JobGenerator.scala
class JobGenerator(jobScheduler: JobScheduler) extends Logging {

  private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
    longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
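RecurringTimer is what turns batchDuration into a stream of GenerateJobs events: every batchDuration milliseconds it posts an event onto the JobGenerator's event loop, which then asks the DStreamGraph for that batch's jobs. A hedged sketch of a recurring timer posting to a queue, using a JDK scheduled executor instead of Spark's RecurringTimer (all names are illustrative):

import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}

object RecurringTimerSketch {
  // stand-in for the GenerateJobs(time) message posted by JobGenerator's timer
  final case class GenerateJobs(time: Long)

  def main(args: Array[String]): Unit = {
    val batchDurationMs = 1000L
    val eventQueue = new LinkedBlockingQueue[GenerateJobs]()

    // fire once per batchDuration, like RecurringTimer(clock, batchDuration, callback, "JobGenerator")
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = eventQueue.put(GenerateJobs(System.currentTimeMillis()))
    }, 0, batchDurationMs, TimeUnit.MILLISECONDS)

    // consumer side: what the event loop's onReceive would do with each message
    (1 to 3).foreach { _ =>
      val event = eventQueue.take()
      println(s"generating jobs for batch ${event.time}")
    }
    scheduler.shutdownNow()
  }
}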
Code flow diagram (omitted):
Class hierarchy of DStream (diagram)
Class hierarchy of InputDStream (diagram)
Class hierarchy of ReceiverInputDStream (diagram)
Note: technical source:
Big Data Spark expert: Wang Jialin (王家林)
QQ: 1740415547