Submitting Spark Jobs via the Hidden REST API

When developing a Spark application, there are two ways to submit a job to the cluster for execution: the spark-submit script, which is the method documented on the official Spark site, and Spark's hidden REST API.

I. Spark Submit

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.12/simple-project_2.12-1.0.jar
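
When a job has to be launched from another program rather than a shell, the same command can be wrapped in a short script. A minimal Python sketch, reusing the placeholder paths and class name from the command above:

import subprocess

# Minimal sketch: wrap the spark-submit call shown above.
# YOUR_SPARK_HOME, the class name, and the jar path are the same
# placeholders used in the command above.
result = subprocess.run(
    [
        "YOUR_SPARK_HOME/bin/spark-submit",
        "--class", "SimpleApp",
        "--master", "local[4]",
        "target/scala-2.12/simple-project_2.12-1.0.jar",
    ],
    capture_output=True,
    text=True,
)
print(result.returncode)
print(result.stdout)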

II. REST API from outside the Spark cluster
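
Note: since Spark 2.4 the standalone master's REST submission server is disabled by default; set spark.master.rest.enabled=true in the master's configuration, otherwise the endpoints below will not respond.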

1. Submitting a job to the Spark cluster

curl -X POST http://spark-cluster-ip:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "action" : "CreateSubmissionRequest",
  "appArgs" : [ "myAppArgument1" ],
  "appResource" : "file:/myfilepath/spark-job-1.0.jar",
  "clientSparkVersion" : "2.4.4",
  "environmentVariables" : {
    "SPARK_ENV_LOADED" : "1"
  },
  "mainClass" : "com.mycompany.MyJob",
  "sparkProperties" : {
    "spark.jars" : "file:/myfilepath/spark-job-1.0.jar",
    "spark.driver.supervise" : "false",
    "spark.app.name" : "MyJob",
    "spark.eventLog.enabled": "true",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://spark-cluster-ip:7077"
  }
}'

Parameter notes:

spark-cluster-ip: the Spark master address. The REST server listens on port 6066 by default; if that port is taken, it falls back to 6067, 6068, and so on.
"action" : "CreateSubmissionRequest": identifies the request as a job submission; this is a fixed value.
"appArgs" : [ "args1", "args2", ... ]: the arguments the application jar expects, such as a Kafka topic or the model to use. (Note: if the program takes no arguments, write "appArgs": []; the field cannot be omitted, otherwise the field after appResource is parsed as appArgs and causes obscure errors.)
"appResource" : "file:/spark.jar": path to the application jar.
"clientSparkVersion" : "2.4.4": the Spark version.
"environmentVariables" : {"SPARK_ENV_LOADED" : "1"}: whether to load the Spark environment variables (this field is required; without it the server throws a NullPointerException).
"mainClass" : "mainClass": the application's main class, i.e. the class containing the main method.
"sparkProperties" : {...}: Spark configuration properties.

Response:

{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20200115102452-0000",
  "serverSparkVersion" : "2.4.4",
  "submissionId" : "driver-20200115102452-0000",
  "success" : true
}
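
The same request can also be issued from code. A minimal Python sketch using only the standard library, with the placeholder host, jar path, and main class taken from the curl example above:

import json
import urllib.request

# Placeholder values from the curl example above; replace with your own.
MASTER_REST_URL = "http://spark-cluster-ip:6066"

payload = {
    "action": "CreateSubmissionRequest",
    "appArgs": ["myAppArgument1"],  # use [] if the program takes no arguments
    "appResource": "file:/myfilepath/spark-job-1.0.jar",
    "clientSparkVersion": "2.4.4",
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},  # required, see notes above
    "mainClass": "com.mycompany.MyJob",
    "sparkProperties": {
        "spark.jars": "file:/myfilepath/spark-job-1.0.jar",
        "spark.driver.supervise": "false",
        "spark.app.name": "MyJob",
        "spark.eventLog.enabled": "true",
        "spark.submit.deployMode": "cluster",
        "spark.master": "spark://spark-cluster-ip:7077",
    },
}

req = urllib.request.Request(
    MASTER_REST_URL + "/v1/submissions/create",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json;charset=UTF-8"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    response = json.load(resp)

# Keep the submissionId: the status and kill requests below need it.
print(response["submissionId"], response["success"])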

2. Checking the execution status of a submitted application

curl http://spark-cluster-ip:6066/v1/submissions/status/driver-20200115102452-0000

Here driver-20200115102452-0000 is the submissionId returned when the job was submitted; it can also be found on the Spark web UI as the submission Id of a running or completed driver.

Response:

{
  "action" : "SubmissionStatusResponse",
  "driverState" : "FINISHED",
  "serverSparkVersion" : "2.4.4",
  "submissionId" : "driver-20200115102452-0000",
  "success" : true,
  "workerHostPort" : "128.96.104.10:37588",
  "workerId" : "worker-20201016084158-128.96.104.10-37588"
}

driverState is the run state of the application and takes one of the following values (a polling sketch follows the list):

ERROR (the submission failed because of an error; the error message is included),
SUBMITTED (submitted but not yet running),
RUNNING (currently running),
FAILED (execution failed; an exception was thrown),
FINISHED (completed successfully)
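
In practice the status endpoint is usually polled until the driver reaches a terminal state. A minimal Python sketch, using the placeholder host and the submissionId from the examples above:

import json
import time
import urllib.request

MASTER_REST_URL = "http://spark-cluster-ip:6066"
SUBMISSION_ID = "driver-20200115102452-0000"  # returned by the create request

# Terminal states from the list above, plus KILLED, which a driver
# enters after a kill request (see step 3).
TERMINAL_STATES = {"ERROR", "FAILED", "FINISHED", "KILLED"}

while True:
    url = MASTER_REST_URL + "/v1/submissions/status/" + SUBMISSION_ID
    with urllib.request.urlopen(url) as resp:
        status = json.load(resp)
    state = status.get("driverState")
    print("driverState:", state)
    if state in TERMINAL_STATES:
        break
    time.sleep(5)  # poll every 5 seconds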

3. Killing a submitted application

curl -X POST http://spark-cluster-ip:6066/v1/submissions/kill/driver-20200115102452-0000

Response:

{
  "action" : "KillSubmissionResponse",
  "message" : "Kill request for driver-20181016102452-0000 submitted",
  "serverSparkVersion" : "2.4.4",
  "submissionId" : "driver-20200115102452-0000",
  "success" : true
}
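
From code, the kill endpoint is just another POST with an empty body. A minimal Python sketch with the same placeholder host and submissionId:

import json
import urllib.request

MASTER_REST_URL = "http://spark-cluster-ip:6066"
SUBMISSION_ID = "driver-20200115102452-0000"

# The kill endpoint takes no request body; an empty POST is enough.
req = urllib.request.Request(
    MASTER_REST_URL + "/v1/submissions/kill/" + SUBMISSION_ID,
    data=b"",
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))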

4. Viewing the Spark cluster's worker information

curl http://spark-cluster-ip:8080/json/
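
This is the JSON view of the master web UI (port 8080 by default). A minimal Python sketch that prints per-worker information; the field names used below ("workers", "id", "state", "cores", "memory") are what the standalone master's JSON endpoint returns, but check the output of your version if they differ:

import json
import urllib.request

# The standalone master web UI (port 8080 by default) exposes its data as JSON.
with urllib.request.urlopen("http://spark-cluster-ip:8080/json/") as resp:
    cluster = json.load(resp)

for worker in cluster.get("workers", []):
    print(worker["id"], worker["state"], worker["cores"], worker["memory"])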
