Checkpointing Fault Tolerance for Spark Streaming on YARN and in Standalone Mode

Spark on YARN:

When deploying Spark Streaming in Spark on YARN mode, we need to create the StreamingContext instance with StreamingContext.getOrCreate and pass in our own checkpoint directory, which is used to store the checkpoint data.
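
Below is a minimal sketch of how getOrCreate is typically used; the checkpoint path, application name, and batch interval here are assumptions for illustration, not values from the original job:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumed checkpoint directory; replace with your own HDFS path.
val checkpointDir = "hdfs:///user/spark/streaming-checkpoint"

// Invoked only when no valid checkpoint exists under checkpointDir.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedStreamingApp")
  val ssc = new StreamingContext(conf, Seconds(10))
  // Define the DStream lineage here, e.g. ssc.socketTextStream(...).map(...)
  ssc.checkpoint(checkpointDir)
  ssc
}

// On a restart, getOrCreate rebuilds the StreamingContext (including the
// DStream lineage) from the checkpoint data instead of calling createContext.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()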

Fault tolerance case 1:

After successfully submitting a program with spark-submit, we can see the CoarseGrainedExecutorBackend process with jps. If we kill that process directly, you will find that a brand-new CoarseGrainedExecutorBackend process is started immediately. This is the automatic-restart fault tolerance provided by YARN: the restarted CoarseGrainedExecutorBackend reads the checkpoint data and continues the computation.


Fault tolerance case 2:

However, if we kill the job directly with yarn application -kill <applicationId>, the program will no longer start on the next launch and immediately throws an exception:

Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)

The exception message is quite clear: Yarn application has already ended! It might have been killed or unable to launch application master. To pin down the cause, the only option is to go through the source code; look at the waitForApplication method of YarnClientSchedulerBackend:

  /**
   * Report the state of the application until it is running.
   * If the application has finished, failed or been killed in the process, throw an exception.
   * This assumes both `client` and `appId` have already been set.
   */
  private def waitForApplication(): Unit = {
    assert(client != null && appId.isDefined, "Application has not been submitted yet!")
    val (state, _) = client.monitorApplication(appId.get, returnOnRunning = true) // blocking
    if (state == YarnApplicationState.FINISHED ||  // if the state is FINISHED, FAILED, or KILLED, throw an exception directly
      state == YarnApplicationState.FAILED ||
      state == YarnApplicationState.KILLED) {
      throw new SparkException("Yarn application has already ended! " +
        "It might have been killed or unable to launch application master.")
    }
    if (state == YarnApplicationState.RUNNING) {
      logInfo(s"Application ${appId.get} has started running.")
    }
  }
Judging from this code, the Spark designers' intent appears to be: if one of the program's processes is killed abnormally, then YARN is responsible for the automatic-restart fault tolerance, as described in fault tolerance case 1 above. If the application was killed deliberately, ran to completion, or failed, then it is assumed that the program either finished and does not need to run again, or is faulty and should not be re-executed.


So in YARN mode, fault tolerance is handled by YARN itself.


Extension: since YARN does not allow us to resubmit the killed job and recover from the checkpoint, if our streaming job consumes from Kafka we have to save the Kafka consumer offsets ourselves, so that the job can resume consuming from those offsets when it is launched again; a sketch follows below.
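
A minimal sketch of that idea, using the spark-streaming-kafka-0-10 direct API. The broker address, group id, and the loadOffsets/saveOffsets helpers (which would read and write offsets in external storage such as ZooKeeper, HBase, or a database) are assumptions for illustration; it also assumes the StreamingContext ssc from the earlier sketch:

import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, KafkaUtils, OffsetRange}

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafka-host:9092",           // assumed broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "streaming-group",                     // assumed group id
  "enable.auto.commit" -> (false: java.lang.Boolean)   // we manage offsets ourselves
)

// Hypothetical helpers: persist offsets in external storage so a resubmitted
// application can pick up where the killed one stopped.
def loadOffsets(): Map[TopicPartition, Long] = ???
def saveOffsets(ranges: Array[OffsetRange]): Unit = ???

val fromOffsets = loadOffsets()
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,                                   // the StreamingContext created above
  PreferConsistent,
  Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
)

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch ...
  // Only after processing succeeds do we persist the end offsets.
  saveOffsets(offsetRanges)
}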


Spark Standalone:

If the application is submitted in Spark standalone mode and we kill the job directly from the web UI, on restart it reloads the previous checkpoint directory and recovers.

Spark standalone mode also supports automatically starting a new CoarseGrainedExecutorBackend after a CoarseGrainedExecutorBackend is killed.


Conclusion: fault tolerance behaves differently under different deployment modes, so it has to be handled according to the specific situation.




