How Spark submits jobs on a Mesos cluster

Our company recently deployed Mesos. We ran into a few problems during testing, so I took the opportunity to study how Spark jobs are submitted. Here is what I found.

At the moment we submit jobs in two ways: command-line mode and programmatic submission through the Java API (the mode 魔盒 uses). Based on what I have found so far, no matter which mode is used, the submission ultimately goes through the same API.
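For reference, here is a minimal sketch of what the API-driven mode can look like, assuming it is built on org.apache.spark.launcher.SparkLauncher (I have not confirmed that this is exactly what 魔盒 uses); the values are borrowed from the curl example near the end of this post:

import org.apache.spark.launcher.SparkLauncher

object ApiSubmitSketch {
  def main(args: Array[String]): Unit = {
    // builds and launches a spark-submit under the hood; SPARK_HOME must be visible to this JVM
    val handle = new SparkLauncher()
      .setMaster("mesos://192.168.23.7:7077")
      .setDeployMode("cluster")
      .setAppResource("hdfs://hadoopha/datacenter/jar/spark_test.jar")
      .setMainClass("com.yunzongnet.datacenter.spark.main.SparkForTest")
      .addAppArgs("20180315")
      .setConf("spark.executor.memory", "12G")
      .startApplication()
    println(handle.getState) // the handle can be used to track the submission state
  }
}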

Let's first look at how the command-line mode works.

Normally we call spark-submit to submit a job.

The spark-submit script is essentially:

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

Setting aside what exec does for now (it replaces the current shell process instead of forking a child), spark-submit invokes the spark-class script and passes all of its arguments through.

Now look at the spark-class script:

build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}
# if [ $LAUNCHER_EXIT_CODE != 0 ]; then
#   exit $LAUNCHER_EXIT_CODE
# fi
CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"

From the build_command function we can see that the first program the script actually runs is the org.apache.spark.launcher.Main class, with the original arguments passed along.

The while loop below then reads what Main prints (a NUL-separated command line) into the CMD array, and the final exec runs that command. Since spark-submit passed org.apache.spark.deploy.SparkSubmit in as an argument, the command that ends up being executed launches that class.

Next, open the Main class.

The net effect of Main is that org.apache.spark.deploy.SparkSubmit's main method gets run with our arguments (I did not trace every detail of how this happens; this is based on material found online plus my own debugging of the code).

override def main(args: Array[String]): Unit = {
  val appArgs = new SparkSubmitArguments(args)
  appArgs.action match {
    case SparkSubmitAction.SUBMIT => submit(appArgs)
    case SparkSubmitAction.KILL => kill(appArgs)
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
  }
}

This main method calls submit, passing in appArgs.

Next, let's look at the submit method.

private def submit(args: SparkSubmitArguments): Unit = {
  val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)

  def doRunMain(): Unit = {
    if (args.proxyUser != null) {
      val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
        UserGroupInformation.getCurrentUser())
      try {
        proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
          override def run(): Unit = {
            runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
          }
        })
      } catch { /* ... error handling omitted in this excerpt ... */ }
    } else {
      runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
    }
  }
}

From this code we can see that submit does two main things: first it packages the incoming arguments and the runtime environment, and then it hands the packaged values to runMain for execution.

While packaging the arguments, prepareSubmitEnvironment also produces childMainClass. Its value depends on the environment and the arguments passed in; when submitting to a Mesos cluster, the value is hard-coded:

if (isMesosCluster) {
  childMainClass = "org.apache.spark.deploy.rest.RestSubmissionClient"
}
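As a rough illustration (this is a condensed sketch, not the actual prepareSubmitEnvironment code), the decision boils down to the master URL plus the deploy mode:

object ChildMainClassSketch {
  def main(args: Array[String]): Unit = {
    val master = "mesos://192.168.23.7:7077"   // --master / spark.master
    val deployMode = "cluster"                 // --deploy-mode / spark.submit.deployMode
    val isMesosCluster = master.startsWith("mesos://") && deployMode == "cluster"
    val childMainClass =
      if (isMesosCluster) "org.apache.spark.deploy.rest.RestSubmissionClient"
      else "com.example.YourMainClass"         // hypothetical: in plain client mode it is simply the user's own main class
    println(childMainClass)
  }
}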

Next, the runMain method.

private def runMain(
    childArgs: Seq[String],
    childClasspath: Seq[String],
    sysProps: Map[String, String],
    childMainClass: String,
    verbose: Boolean): Unit = {
  try {
    mainClass = Utils.classForName(childMainClass)
  } catch { /* ... error handling omitted in this excerpt ... */ }

  val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)

  try {
    mainMethod.invoke(null, childArgs.toArray)
  } catch {
    case t: Throwable =>
      findCause(t) match {
        case SparkUserAppException(exitCode) =>
          System.exit(exitCode)
        case t: Throwable =>
          throw t
      }
  }
}

From this code we can see that runMain loads childMainClass reflectively and invokes its main method; in our Mesos cluster case that is RestSubmissionClient.main, which in turn calls run.
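The reflective invocation above is a generic pattern. Here is a self-contained sketch of the same idea, where DemoApp is just a stand-in for whatever childMainClass happens to resolve to:

object DemoApp {
  def main(args: Array[String]): Unit = println("DemoApp got: " + args.mkString(", "))
}

object ReflectiveLaunch {
  def main(args: Array[String]): Unit = {
    val cls = Class.forName("DemoApp")                   // same idea as Utils.classForName(childMainClass)
    val mainMethod = cls.getMethod("main", classOf[Array[String]])
    mainMethod.invoke(null, Array[String]("hello"))      // main is static-style, so the receiver is null
  }
}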

The run method is where the crucial part happens.

def run(
    appResource: String,
    mainClass: String,
    appArgs: Array[String],
    conf: SparkConf,
    env: Map[String, String] = Map()): SubmitRestProtocolResponse = {
  val master = conf.getOption("spark.master").getOrElse {
    throw new IllegalArgumentException("'spark.master' must be set.")
  }
  val sparkProperties = conf.getAll.toMap
  val client = new RestSubmissionClient(master)
  val submitRequest = client.constructSubmitRequest(
    appResource, mainClass, appArgs, sparkProperties, env)
  val createSubmissionResponse = client.createSubmission(submitRequest)
  createSubmissionResponse
}

The run method first creates a REST client, then packages the information needed for the request and calls createSubmission. Now let's look at the code inside createSubmission.

def createSubmission(request: CreateSubmissionRequest): SubmitRestProtocolResponse = {
  var handled: Boolean = false
  var response: SubmitRestProtocolResponse = null
  for (m <- masters if !handled) {
    validateMaster(m)
    val url = getSubmitUrl(m)
    try {
      response = postJson(url, request.toJson)
      response match {
        case s: CreateSubmissionResponse =>
          if (s.success) {
            reportSubmissionStatus(s)
            handleRestResponse(s)
            handled = true
          }
        case unexpected =>
          handleUnexpectedRestResponse(unexpected)
      }
    } catch {
      case e: SubmitRestConnectionException =>
        if (handleConnectionException(m)) {
          throw new SubmitRestConnectionException("Unable to connect to server", e)
        }
    }
  }
  response
}

Once this method finishes, our Spark job has been submitted. The code makes it very clear that the submission process is really just a REST request, and the response gets wrapped in a CreateSubmissionResponse object.

private[spark] class CreateSubmissionResponse extends SubmitRestProtocolResponse {
  var submissionId: String = null

  protected override def doValidate(): Unit = {
    super.doValidate()
    assertFieldIsSet(success, "success")
  }
}

The CreateSubmissionResponse class exposes a submissionId field, and testing shows that it is simply the job ID we normally work with.

So far, though, we still do not know the execution state of the job. Looking further through the RestSubmissionClient class, there is also a requestSubmissionStatus method:

def requestSubmissionStatus(
    submissionId: String,
    quiet: Boolean = false): SubmitRestProtocolResponse = {
  logInfo(s"Submitting a request for the status of submission $submissionId in $master.")
  var handled: Boolean = false
  var response: SubmitRestProtocolResponse = null
  for (m <- masters if !handled) {
    validateMaster(m)
    val url = getStatusUrl(m, submissionId)
    try {
      response = get(url)
      response match {
        case s: SubmissionStatusResponse if s.success =>
          if (!quiet) {
            handleRestResponse(s)
          }
          handled = true
        case unexpected =>
          handleUnexpectedRestResponse(unexpected)
      }
    } catch { /* ... error handling omitted in this excerpt ... */ }
  }
  response
}

From the documentation and the code, this method takes a submissionId and returns a SubmissionStatusResponse object. Let's look at that class:

private[spark] class SubmissionStatusResponse extends SubmitRestProtocolResponse {
  var submissionId: String = null
  var driverState: String = null
  var workerId: String = null
  var workerHostPort: String = null

  protected override def doValidate(): Unit = {
    super.doValidate()
    assertFieldIsSet(submissionId, "submissionId")
    assertFieldIsSet(success, "success")
  }
}

Sure enough, there is a driverState field. At this point we can briefly summarize the spark-submit flow:

When the spark-submit script runs, it calls the spark-class script.

The spark-class script then runs the org.apache.spark.launcher.Main class.

org.apache.spark.launcher.Main leads to org.apache.spark.deploy.SparkSubmit being launched.

org.apache.spark.deploy.SparkSubmit then calls org.apache.spark.deploy.rest.RestSubmissionClient.

org.apache.spark.deploy.rest.RestSubmissionClient creates the job through a REST request, i.e. via the API.

Note that this only creates the job. Once the job is created, the command-mode submission is considered finished; whether the job then runs successfully is entirely up to Spark.

After all this detour, it boils down to sending one REST request. Next, let's send a request ourselves from the command line and look at the result.

curl -XPOST 'http://192.168.23.7:7077/v1/submissions/create' -d '{
  "action" : "CreateSubmissionRequest",
  "appArgs" : [ "20180315" ],
  "appResource" : "hdfs://hadoopha/datacenter/jar/spark_test.jar",
  "clientSparkVersion" : "2.2.0",
  "environmentVariables" : {
    "SPARK_SCALA_VERSION" : "2.10"
  },
  "mainClass" : "com.yunzongnet.datacenter.spark.main.SparkForTest",
  "sparkProperties" : {
    "spark.sql.ui.retainedExecutions" : "2000",
    "spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY" : "/usr/local/lib/libmesos.so",
    "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout" : "5",
    "spark.history.fs.logDirectory" : "hdfs://hadoopha/spark/eventlog",
    "spark.eventLog.enabled" : "true",
    "spark.streaming.ui.retainedBatches" : "2000",
    "spark.shuffle.service.enabled" : "true",
    "spark.jars" : "hdfs://hadoopha/datacenter/jar/spark_test.jar",
    "spark.mesos.executor.docker.volumes" : "/spark_local_dir:/spark_local_dir:rw",
    "spark.driver.supervise" : "false",
    "spark.app.name" : "sparkjob5",
    "spark.cores.max" : "6",
    "spark.dynamicAllocation.schedulerBacklogTimeout" : "1",
    "spark.mesos.principal" : "admin",
    "spark.worker.ui.retainedDrivers" : "2000",
    "spark.driver.memory" : "4G",
    "spark.files.fetchTimeout" : "900s",
    "spark.mesos.uris" : "/etc/docker.tar.gz",
    "spark.mesos.secret" : "admin",
    "spark.deploy.retainedDrivers" : "2000",
    "spark.mesos.role" : "root",
    "spark.files" : "file:///usr/local/hadoop-2.6.0/etc/hadoop/hdfs-site.xml,file:///usr/local/hadoop-2.6.0/etc/hadoop/core-site.xml",
    "spark.mesos.executor.docker.image" : "http://registry.seagle.me:443/spark-2-base:v1",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "mesos://192.168.23.7:7077",
    "spark.executor.memory" : "12G",
    "spark.driver.extraClassPath" : "/usr/local/alluxio/core/client/target/alluxio-core-client-1.1.0-SNAPSHOT-jar-with-dependencies.jar,/usr/local/spark/jars/*",
    "spark.local.dir" : "/spark_local_dir",
    "spark.eventLog.dir" : "hdfs://hadoopha/spark/eventlog",
    "spark.dynamicAllocation.enabled" : "true",
    "spark.executor.cores" : "2",
    "spark.deploy.retainedApplications" : "2000",
    "spark.worker.ui.retainedExecutors" : "2000",
    "spark.dynamicAllocation.executorIdleTimeout" : "60",
    "spark.mesos.executor.home" : "/usr/local/spark"
  }
}'

The REST request returns immediately with:

{
  "action" : "CreateSubmissionResponse",
  "serverSparkVersion" : "2.0.0",
  "submissionId" : "driver-20170425164456-271697",
  "success" : true
}
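The same POST can of course be issued from code. Below is a JDK-only sketch that reuses the endpoint above; the payload is trimmed down to a handful of fields, so treat it as an illustration of the mechanics rather than a submission that is guaranteed to be accepted as-is:

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object RestCreateSubmission {
  def main(args: Array[String]): Unit = {
    val json =
      """{
        |  "action" : "CreateSubmissionRequest",
        |  "appArgs" : [ "20180315" ],
        |  "appResource" : "hdfs://hadoopha/datacenter/jar/spark_test.jar",
        |  "clientSparkVersion" : "2.2.0",
        |  "mainClass" : "com.yunzongnet.datacenter.spark.main.SparkForTest",
        |  "environmentVariables" : { "SPARK_SCALA_VERSION" : "2.10" },
        |  "sparkProperties" : {
        |    "spark.app.name" : "sparkjob5",
        |    "spark.master" : "mesos://192.168.23.7:7077",
        |    "spark.submit.deployMode" : "cluster",
        |    "spark.jars" : "hdfs://hadoopha/datacenter/jar/spark_test.jar"
        |  }
        |}""".stripMargin

    // POST the JSON to the same /v1/submissions/create endpoint used by the curl call above
    val conn = new URL("http://192.168.23.7:7077/v1/submissions/create")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(json.getBytes(StandardCharsets.UTF_8))
    // the server answers with a CreateSubmissionResponse containing the submissionId
    println(scala.io.Source.fromInputStream(conn.getInputStream).mkString)
  }
}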

Next, send a GET request: curl -XGET 'http://192.168.23.7:7077/v1/submissions/status/driver-20170425164456-271697'

{
  "action" : "SubmissionStatusResponse",
  "driverState" : "RUNNING",
  "message" : "task_id {\n value: \"driver-20170425164456-271697\"\n}\nstate: TASK_RUNNING\n"
}

We can see that the job is currently in the RUNNING state.

Keep sending the same GET request over time; it eventually returns:

{
  "action" : "SubmissionStatusResponse",
  "driverState" : "FINISHED",
  "message" : "task_id {\n value: \"driver-20170425164456-271697\"\n}\nstate:TASK_FAILED\nmessage: \"Container exited with status 1\"\nslave_id {\n value: \"0600da48-750d-48e2-ba79-b78936224c83-S2\"\n}\ntimestamp: 1.493109963886255E9\nexecutor_id {\n value: \"driver-20170425164456-271697\"\n}\nsource: SOURCE_EXECUTOR\n11: \"4\\222O\\215\\201\\334L^\\232\\303\\313:j&\\004\\'\"\n13: \"\\n\\017*\\r\\022\\v192.168.23.1\"\n",
  "serverSparkVersion" : "2.0.0",
  "submissionId" : "driver-20170425164456-271697",
  "success" : true
}

The job has finished running, and it actually failed (driverState is FINISHED while the embedded Mesos task state is TASK_FAILED).
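If you want to do the polling from code instead of curl, a minimal sketch looks like this (the URL and submission ID are the ones from the example above, and the set of states treated as terminal is my own assumption):

import scala.io.Source

object PollDriverState {
  def main(args: Array[String]): Unit = {
    val statusUrl = "http://192.168.23.7:7077/v1/submissions/status/driver-20170425164456-271697"
    val terminal = Set("FINISHED", "FAILED", "KILLED", "ERROR")   // assumed terminal driver states
    val stateRegex = "\"driverState\"\\s*:\\s*\"([A-Z_]+)\"".r
    var state = "UNKNOWN"
    while (!terminal.contains(state)) {
      val src = Source.fromURL(statusUrl)
      val body = try src.mkString finally src.close()
      // crude extraction of the driverState field; a real client would parse the JSON properly
      state = stateRegex.findFirstMatchIn(body).map(_.group(1)).getOrElse(state)
      println(s"driverState = $state")
      if (!terminal.contains(state)) Thread.sleep(5000)
    }
  }
}

That, in essence, is all the submission path amounts to: one REST call to create the driver, and another to check on it.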
