First, the job is submitted with the spark-submit script. Under the hood, this shell script simply invokes the java command to run the main method of the SparkSubmit class, so the natural next step is to look at what SparkSubmit's main method does:
/**
 * Submit the application.
 * @param args command-line arguments
 */
def main(args: Array[String]): Unit = {
  val appArgs = new SparkSubmitArguments(args) // parse and wrap the command-line arguments
  if (appArgs.verbose) {
    // scalastyle:off println
    printStream.println(appArgs)
    // scalastyle:on println
  }
  appArgs.action match { // pattern-match on the requested action
    case SparkSubmitAction.SUBMIT => submit(appArgs)
    case SparkSubmitAction.KILL => kill(appArgs)
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
  }
}
The method simply pattern-matches on the action the user requested. When we submit a job, the SUBMIT case fires and the submit method is called; let's look at submit next:
@tailrec
private def submit(args: SparkSubmitArguments): Unit = {
  // non-essential code omitted
  if (args.isStandaloneCluster && args.useRest) {
    try {
      // scalastyle:off println
      printStream.println("Running Spark using the REST application submission protocol.")
      // scalastyle:on println
      doRunMain()
    } catch {
      // Fail over to use the legacy submission gateway
      case e: SubmitRestConnectionException =>
        printWarning(s"Master endpoint ${args.master} was not a REST server. " +
          "Falling back to legacy submission gateway instead.")
        args.useRest = false
        submit(args)
    }
  // In all other modes, just run the main class as prepared
  } else {
    doRunMain()
  }
}
Inside submit, doRunMain() is called, which in turn invokes runMain:
private def runMain(
    childArgs: Seq[String],
    childClasspath: Seq[String],
    sysProps: Map[String, String],
    childMainClass: String,
    verbose: Boolean): Unit = {
  var mainClass: Class[_] = null
  // non-essential code omitted
  mainClass = Utils.classForName(childMainClass)
  // non-essential code omitted
  if (classOf[scala.App].isAssignableFrom(mainClass)) {
    printWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
  }
  // non-essential code omitted
  val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
  if (!Modifier.isStatic(mainMethod.getModifiers)) {
    throw new IllegalStateException("The main method in the given main class must be static")
  }
  // non-essential code omitted
  try {
    mainMethod.invoke(null, childArgs.toArray)
  } catch {
    case t: Throwable =>
      findCause(t) match {
        case SparkUserAppException(exitCode) =>
          System.exit(exitCode)
        case t: Throwable =>
          throw t
      }
  }
}
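To make the reflection step concrete, here is a minimal sketch of what runMain boils down to: load the user's main class by name and invoke its static main method via reflection. The class name and argument below are purely illustrative, not taken from Spark itself.
// Minimal sketch of runMain's core idea: reflective invocation of a user class's static main.
// "example.WordCount" and the "input.txt" argument are illustrative placeholders.
object ReflectiveLaunchSketch {
  def main(args: Array[String]): Unit = {
    val mainClass  = Class.forName("example.WordCount")                 // plays the role of childMainClass
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    mainMethod.invoke(null, Array("input.txt"))                         // null receiver: main is static
  }
}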
Next, let's walk through a program that counts the lines of a file; the whole job consists of a single count action.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: WordCount <file>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val rdd = sc.textFile(args(0))
    rdd.count()
    sc.stop()
  }
}
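Note that the SparkConf above carries no master URL; spark-submit supplies it at launch time. For quick local testing without spark-submit, a variant might set the master explicitly. This is just a sketch; the app name, master and input path are illustrative.
import org.apache.spark.{SparkConf, SparkContext}

// Local-mode sketch for testing without spark-submit.
object LineCountLocal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LineCountLocal").setMaster("local[*]") // illustrative settings
    val sc = new SparkContext(conf)
    println(sc.textFile("input.txt").count()) // prints the number of lines in the (illustrative) file
    sc.stop()
  }
}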
In this line-counting program, once the SparkContext (sc) has been created, we call textFile to create an RDD. Let's look at the source of textFile:
/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 */
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
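A quick way to see the result of this definition is to inspect the RDD's lineage: toDebugString shows a MapPartitionsRDD sitting on top of a HadoopRDD, which matches the source above. A small sketch, assuming an existing SparkContext sc and an illustrative path:
// Inspect the lineage produced by textFile: MapPartitionsRDD on top of HadoopRDD.
val rdd = sc.textFile("hdfs:///tmp/input.txt", 4) // illustrative path, minimum of 4 partitions
println(rdd.toDebugString)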
textFile builds its RDD via hadoopFile, so let's look at hadoopFile's source:
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  FileSystem.getLocal(hadoopConfiguration)
  // ... non-essential code omitted
  // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
  val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration)) // broadcast the Hadoop configuration
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  new HadoopRDD(this, confBroadcast, Some(setInputPathsFunc), inputFormatClass, // `this` (the SparkContext) is the key point here
    keyClass, valueClass, minPartitions).setName(path)
}
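For comparison, here is a sketch of calling hadoopFile directly, which is essentially what sc.textFile(path) does: build a HadoopRDD of (byte offset, line) pairs and map each pair to the line text. It assumes an existing SparkContext sc; the path and partition count are illustrative.
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Roughly equivalent to sc.textFile("hdfs:///tmp/input.txt", 4)
val lines = sc
  .hadoopFile("hdfs:///tmp/input.txt", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], minPartitions = 4)
  .map(_._2.toString) // keep only the line text, dropping the byte-offset key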
Now the RDD has been created, and to count the lines we call the count operator. count is an action, and as we know only actions trigger a job to run on the cluster; transformations do not. So how exactly is the job triggered, and who calls the runJob method to kick off execution? So far we have only seen a TaskScheduler started inside SparkContext, with no other call or startup in sight. The answer lies in the fact that every RDD holds a reference to its SparkContext.
Let's see what org.apache.spark.rdd.RDD#count does:
/**
* Return the number of elements in the RDD.
*/
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
Now it is clear: inside the count operator, the RDD's own SparkContext reference (sc) is used to call runJob, which submits the job to run on the cluster. That resolves the earlier puzzle.
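We can mimic this ourselves: an equivalent sketch of count is to call runJob on the RDD's SparkContext and sum the per-partition sizes. It assumes an existing SparkContext sc and an illustrative input path.
// Sketch: what count() does, spelled out via the RDD's SparkContext reference.
val rdd = sc.textFile("hdfs:///tmp/input.txt")
val partitionSizes: Array[Long] =
  sc.runJob(rdd, (iter: Iterator[String]) => iter.size.toLong) // one element count per partition
val total = partitionSizes.sum // same result as rdd.count()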
But this raises another question: why does the map operator not trigger a job on the cluster? The best way to find out is, once again, to read the source. With any open-source project, once you understand the basic principles and usage, the best way to answer such questions is to read and trace the source code. So let's look at the map operator.
org.apache.spark.rdd.RDD#map:
/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
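As the source shows, map only wraps this RDD in a new MapPartitionsRDD; it never touches sc.runJob, so nothing executes until an action such as count comes along. A small sketch of this laziness, assuming an existing SparkContext sc and an illustrative path:
// Transformations are lazy: building the map does not submit a job.
val lengths = sc.textFile("hdfs:///tmp/input.txt").map(_.length) // just a new MapPartitionsRDD, no job yet
val n = lengths.count() // count is an action: sc.runJob is called here and the job runs on the cluster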