Spark Streaming Zero Data Loss

https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html

A Spark Streaming application runs 24/7, so there needs to be a mechanism that guarantees zero data loss while processing the stream.

Write Ahead Logs

These two together can ensure that there is zero data loss – all data is either recovered from the logs or resent by the source.
This is the key point: every record is either recovered from the write ahead logs or resent by the source (for example, Kafka).

Configuration

1. Call checkpoint(path-to-directory) on the StreamingContext to set the directory where the logs are saved (e.g., a directory on HDFS).
2. Set the SparkConf property spark.streaming.receiver.writeAheadLog.enable to true (see the sketch after this list).
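
A minimal sketch of both settings, assuming a Scala application using the receiver-based Kafka integration (spark-streaming-kafka); the checkpoint path, ZooKeeper address, group id, and topic name are hypothetical:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object WalEnabledApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WalEnabledApp")
      // Step 2: turn on the write ahead log for receiver-based input streams
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    // Step 1: checkpoint directory on a fault-tolerant file system (e.g. HDFS);
    // received blocks and metadata are logged under this directory
    ssc.checkpoint("hdfs:///checkpoints/wal-enabled-app")   // hypothetical path

    // Receiver-based Kafka stream: with the WAL enabled, received records are
    // written to the log before being acknowledged, so records not yet logged
    // can be resent by Kafka after a failure
    val kafkaStream = KafkaUtils.createStream(
      ssc,
      "zk-host:2181",                    // ZooKeeper quorum (assumed address)
      "zero-data-loss-demo",             // consumer group id
      Map("events" -> 1),                // topic -> number of receiver threads
      StorageLevel.MEMORY_AND_DISK_SER)  // no in-memory replication with WAL

    kafkaStream.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The storage level without the replicated `_2` suffix reflects the blog's point that once the log already provides fault tolerance, in-memory replication of received data can be dropped.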

Implementation Details

1. Lifecycle of data handling on the driver and the executors


[Figure 1: how received data flows through the driver and executors during normal operation]

2. Failure recovery
When the failed driver is restarted, it triggers the process shown in the figure below (a recovery sketch follows the figure):


[Figure 2: recovering computation, block metadata, and buffered data after the driver restarts]
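
Recovery on driver restart hinges on recreating the StreamingContext from the checkpoint directory. A sketch of the usual StreamingContext.getOrCreate pattern, reusing the same assumed checkpoint path as above:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableApp {
  // Must match the checkpoint directory used on the first run (assumed path)
  val checkpointDir = "hdfs:///checkpoints/wal-enabled-app"

  // Builds a fresh context; only invoked when no checkpoint exists yet
  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("RecoverableApp")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // ... define input streams and transformations here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // After a driver failure, getOrCreate rebuilds the context from the
    // checkpoint: the DStream graph and unfinished batches are restored, block
    // metadata is read back, and unprocessed data is replayed from the WAL
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```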
