Contents of this session:
1. The mechanism by which Spark Streaming generates Jobs
2. Other ways Spark Streaming generates Jobs

1. The mechanism by which Spark Streaming generates Jobs
In a Scala program, a function can be passed as a parameter because a function is also an object. Holding a function object does not mean the function runs immediately. In Spark Streaming, a function is typically invoked from a thread's run method, and that is what finally causes the function to execute.
In Spark Streaming, a Job object holds such a function as a member.
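A minimal sketch of this idea, not Spark's actual Job class (whose exact signature may differ): the function is stored as an ordinary member and only executes when some thread calls run():

// Sketch of a Job-like wrapper (not Spark's actual Job class): the function is
// held as a plain object member and only runs when run() is invoked.
class SimpleJob(val time: Long, func: () => Unit) {
  def run(): Unit = func()   // the wrapped function finally executes here
}

object SimpleJobDemo {
  def main(args: Array[String]): Unit = {
    // Creating the job does NOT execute the function.
    val job = new SimpleJob(System.currentTimeMillis(), () => println("batch work runs now"))
    // Handing it to a thread -- as the JobScheduler does via a thread pool -- runs it.
    new Thread(new Runnable { override def run(): Unit = job.run() }).start()
  }
}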
In the NetworkWordCount example, it is the call to DStream.print that leads to Job generation.
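For context, a minimal NetworkWordCount-style driver looks roughly like this (host, port and batch interval are placeholder values); the print() at the end is the output operation that leads to Job generation:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal NetworkWordCount-style driver. print() is the output operation that
// registers a ForEachDStream and therefore leads to Job generation each batch.
object NetworkWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}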
DStream.print:
def print(num: Int): Unit = ssc.withScope {
  def foreachFunc: (RDD[T], Time) => Unit = {
    (rdd: RDD[T], time: Time) => {
      val firstNum = rdd.take(num + 1)
      // scalastyle:off println
      println("-------------------------------------------")
      println("Time: " + time)
      println("-------------------------------------------")
      firstNum.take(num).foreach(println)
      if (firstNum.length > num) println("...")
      println()
      // scalastyle:on println
    }
  }
  foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
}
In a Spark Streaming application, besides print, output operations such as saveAsObjectFiles and saveAsTextFiles also call foreachRDD to create a ForEachDStream; only then can Jobs be generated later.
private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}
Through register, the newly created ForEachDStream is added to the outputStreams member of DStreamGraph.
If there is no code such as print, count, saveAsObjectFiles or saveAsTextFiles, outputStreams in the DStreamGraph is empty. What, then, does DStreamGraph.generateJobs produce?
DStreamGraph.generateJobs:
def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  val jobs = this.synchronized {
    outputStreams.flatMap { outputStream =>
      val jobOption = outputStream.generateJob(time)
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}
In that case, DStreamGraph.generateJobs produces an empty sequence of Jobs.
By defining a custom method on DStream (or one of its subclasses), the foreachFunc definition can avoid statements such as RDD.take.
In that case, the foreachFunc passed to foreachRDD does not necessarily produce a Spark job: if foreachFunc contains no action, no job is triggered, as the sketch below illustrates.
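A rough sketch of this point, reusing the wordCounts DStream from the earlier sketch: foreachRDD still registers an output operation, so a streaming Job object is generated for every batch, but because the function body contains no action, Spark core never actually runs a job for it:

// Sketch: an output operation whose per-batch function contains no RDD action.
// A streaming Job is still generated each batch interval, but since nothing like
// take/count/collect is called, no Spark core job is ever submitted.
wordCounts.foreachRDD { (rdd, time) =>
  val filtered = rdd.filter(_._2 > 1)   // transformation only -- lazy, nothing computed
  println(s"Batch $time: defined a filtered RDD without triggering any action")
}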
2. Other ways Spark Streaming generates Jobs
Is an action always required before there is a Job? No. DStream.transform can also lead to a Job. transform has two overloaded definitions, one of which calls the other.
DStream.transform:
/**
 * Return a new DStream in which each RDD is generated by applying a function
 * on each RDD of 'this' DStream.
 */
def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U] = ssc.withScope {
  // because the DStream is reachable from the outer object here, and because
  // DStreams can't be serialized with closures, we can't proactively check
  // it for serializability and so we pass the optional false to SparkContext.clean
  val cleanedF = context.sparkContext.clean(transformFunc, false)
  transform((r: RDD[T], t: Time) => cleanedF(r))
}
/**
 * Return a new DStream in which each RDD is generated by applying a function
 * on each RDD of 'this' DStream.
 */
def transform[U: ClassTag](transformFunc: (RDD[T], Time) => RDD[U]): DStream[U] = ssc.withScope {
  // because the DStream is reachable from the outer object here, and because
  // DStreams can't be serialized with closures, we can't proactively check
  // it for serializability and so we pass the optional false to SparkContext.clean
  val cleanedF = context.sparkContext.clean(transformFunc, false)
  val realTransformFunc = (rdds: Seq[RDD[_]], time: Time) => {
    assert(rdds.length == 1)
    cleanedF(rdds.head.asInstanceOf[RDD[T]], time)
  }
  new TransformedDStream[U](Seq(this), realTransformFunc)
}
The function-typed parameter transformFunc takes an RDD as input and produces a new RDD. In the end, a TransformedDStream object is created.
As mentioned in Lesson 8, the compute method of an ordinary DStream subclass merely calls getOrCompute on its parent DStream; TransformedDStream's compute method, however, is different.
TransformedDStream.compute:
override def compute(validTime: Time): Option[RDD[U]] = {
  val parentRDDs = parents.map { parent => parent.getOrCompute(validTime).getOrElse(
    // Guard out against parent DStream that return None instead of Some(rdd) to avoid NPE
    throw new SparkException(s"Couldn't generate RDD from parent at time $validTime"))
  }
  val transformedRDD = transformFunc(parentRDDs, validTime)
  if (transformedRDD == null) {
    throw new SparkException("Transform function must not return null. " +
      "Return SparkContext.emptyRDD() instead to represent no element " +
      "as the result of transformation.")
  }
  Some(transformedRDD)
}
Unlike other DStream subclasses, TransformedDStream's compute method also invokes transformFunc, and the function is executed right away rather than waiting to be scheduled by the JobScheduler.
If transformFunc contains actions such as count or print, they too trigger job execution at that point. This can in fact be regarded as a loophole.
The operations discussed earlier are all lazy, so their results cannot be obtained immediately. Because transformFunc is not subject to Spark's unified scheduling, you can inspect its computed result and decide what to do next, instead of being prevented by laziness from basing the subsequent transform on an actual result.
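A minimal sketch of this behavior, reusing the lines DStream from the earlier NetworkWordCount sketch (the threshold and sampling logic are made up for illustration): because the count() inside transform is an action, it runs eagerly when TransformedDStream.compute invokes transformFunc for each batch, and its result can steer the rest of the pipeline:

// Sketch: an action inside transform runs as soon as TransformedDStream.compute
// invokes transformFunc for a batch, rather than when the streaming Job is run.
// "lines" is the socket DStream from the earlier sketch; the threshold is arbitrary.
val adjusted = lines.transform { rdd =>
  val n = rdd.count()   // action: triggers a Spark core job right here, eagerly
  if (n > 1000) {
    rdd.sample(withReplacement = false, fraction = 0.1)   // decide based on the actual result
  } else {
    rdd
  }
}
adjusted.print()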