1. Spark Performance Tuning: Using checkpoint
https://blog.csdn.net/leen0304/article/details/78718346
A checkpoint establishes a save point, similar to a snapshot. In a Spark computation the DAG can be very long, and the job must execute the whole DAG to produce a result. If intermediate data is lost partway through, Spark recomputes it from the beginning by following the RDD lineage, which is very expensive. We can keep intermediate results in memory or on disk with cache or persist, but that still does not guarantee the data survives: if the memory holding it has a problem or the disk fails, Spark again recomputes the whole lineage from the start. This is what checkpoint is for: it takes the important intermediate data in the DAG, creates a checkpoint, and writes the result to a highly available location (usually HDFS).
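For comparison, here is a minimal sketch of the cache/persist approach just mentioned (the app name, data, and transformations are all illustrative); it speeds up reuse of an intermediate result but does not survive executor or disk failure:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("PersistDemo").setMaster("local[2]"))

// Every transformation adds a step that would be replayed from the start on data loss.
val derived = sc.parallelize(1 to 1000000)
  .map(_ * 2)
  .filter(_ % 3 == 0)
  .map(_.toString)

// Keeps results in memory, spilling to disk when needed, but offers no durability:
// if the executor or its disk dies, Spark replays the whole lineage.
derived.persist(StorageLevel.MEMORY_AND_DISK)
derived.count() // first action materializes and stores the partitions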
To use checkpoint, you first need to set the checkpoint directory, for example:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConf = new SparkConf()
sparkConf
  .setAppName("JOINSkewedData")
  .set("spark.sql.autoBroadcastJoinThreshold", "1048576")     // 1 MB broadcast join threshold
  //.set("spark.sql.autoBroadcastJoinThreshold", "104857600") // 100 MB broadcast join threshold
  .set("spark.sql.shuffle.partitions", "3")
if (args.length > 0 && args(0).equals("ide")) {
  sparkConf.setMaster("local[3]")
}
val spark = SparkSession.builder()
  .config(sparkConf)
  .getOrCreate()
val sparkContext = spark.sparkContext
sparkContext.setLogLevel("WARN")
sparkContext.setCheckpointDir("file:///D:/checkpoint/")
Sometimes you need to debug locally, in which case the directory can point at a local Windows or Linux path:
Windows:
sparkContext.setCheckpointDir("file:///D:/checkpoint/")
Linux:
sparkContext.setCheckpointDir("file:///tmp/checkpoint")
HDFS:
sparkContext.setCheckpointDir("hdfs://leen:8020/checkPointDir")
To create the checkpoint, simply call the method on the RDD you want to checkpoint:
rdd.checkpoint()
Note:
When using checkpoint, it is recommended to cache the RDD first. checkpoint itself is lazy: it only marks the RDD, and after the first action on it completes, Spark launches a separate job that recomputes the RDD from its lineage and writes it to the checkpoint directory, so the computation effectively runs twice. If you cache the RDD before checkpointing, the pipeline runs only once: the checkpoint job reads the freshly cached data from memory and writes it to HDFS instead of recomputing, as follows:
rdd.cache()
rdd.checkpoint()
rdd.collect()
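Putting it together, a minimal sketch reusing the sparkContext configured above (the data and transformations are arbitrary); isCheckpointed and getCheckpointFile confirm that the write happened:

// reusing the sparkContext configured above, with its checkpoint directory already set
val rdd = sparkContext.parallelize(1 to 100).map(_ * 2)
rdd.cache()       // keep the computed partitions in memory
rdd.checkpoint()  // only marks the RDD; nothing is written yet
rdd.collect()     // first action: computes once, then the checkpoint job writes the cached data

println(rdd.isCheckpointed)    // true once the checkpoint job has finished
println(rdd.getCheckpointFile) // Some(...) path of the written checkpoint data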
Using checkpoint in Spark Streaming mainly involves two points: setting the checkpoint directory, and initializing the StreamingContext through getOrCreate. When the checkpoint directory holds no data, a new StreamingContext instance is created and the checkpoint directory is set on it; otherwise the configuration and data are read back from the checkpoint directory to recreate the StreamingContext.
// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...)   // new context
  val lines = ssc.socketTextStream(...) // create DStreams
  ...
  ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
  ssc
}

// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
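For illustration, a filled-in, runnable variant of the skeleton above; the host, port, batch interval, and checkpoint path are all hypothetical values, and the word count is an arbitrary workload:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDirectory = "hdfs://leen:8020/streamingCheckpoint" // hypothetical path

def functionToCreateContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("StreamingCheckpointDemo")
  val ssc = new StreamingContext(conf, Seconds(5))     // 5-second batches
  ssc.checkpoint(checkpointDirectory)                  // enable metadata and data checkpointing
  val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source
  lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()
  ssc
}

// Rebuilds from checkpoint data if present, otherwise calls the function above
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
context.start()
context.awaitTermination()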
checkpoint and cache are not the same thing: checkpoint severs the RDD's dependencies on upstream operators (truncating the lineage), whereas cache simply stores the data in a specific location and leaves the lineage intact.
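The lineage truncation is easy to observe with toDebugString. A minimal sketch, reusing the sparkContext and checkpoint directory from above (the transformations are arbitrary):

val rdd = sparkContext.parallelize(1 to 10).map(_ + 1).filter(_ % 2 == 0)
println(rdd.toDebugString) // full chain back to the ParallelCollectionRDD

rdd.cache()
rdd.checkpoint()
rdd.count() // action triggers the checkpoint write

println(rdd.toDebugString) // now rooted at a ReliableCheckpointRDD; upstream lineage is gone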
The RDD checkpoint implementation:
/**
* Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
* directory set with `SparkContext#setCheckpointDir` and all references to its parent
* RDDs will be removed. This function must be called before any job has been
* executed on this RDD. It is strongly recommended that this RDD is persisted in
* memory, otherwise saving it on a file will require recomputation.
*/
def checkpoint(): Unit = RDDCheckpointData.synchronized {
  // NOTE: we use a global lock here due to complexities downstream with ensuring
  // children RDD partitions point to the correct parent partitions. In the future
  // we should revisit this consideration.
  if (context.checkpointDir.isEmpty) {
    throw new SparkException("Checkpoint directory has not been set in the SparkContext")
  } else if (checkpointData.isEmpty) {
    checkpointData = Some(new ReliableRDDCheckpointData(this))
  }
}
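As the guard clause above shows, calling checkpoint() without a directory fails fast. A minimal sketch demonstrating the ordering requirement, on a fresh SparkContext with no checkpoint directory (the path is hypothetical):

import org.apache.spark.{SparkConf, SparkContext, SparkException}

val sc = new SparkContext(new SparkConf().setAppName("GuardDemo").setMaster("local[2]"))
val rdd = sc.parallelize(1 to 10)

try {
  rdd.checkpoint() // throws: "Checkpoint directory has not been set in the SparkContext"
} catch {
  case e: SparkException => println(e.getMessage)
}

sc.setCheckpointDir("file:///tmp/checkpoint")
rdd.checkpoint() // now registers ReliableRDDCheckpointData; written when the next job runs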
The DataFrame checkpoint implementation:
/**
* Eagerly checkpoint a Dataset and return the new Dataset. Checkpointing can be used to truncate
* the logical plan of this Dataset, which is especially useful in iterative algorithms where the
* plan may grow exponentially. It will be saved to files inside the checkpoint
* directory set with `SparkContext#setCheckpointDir`.
*
* @group basic
* @since 2.1.0
*/
@Experimental
@InterfaceStability.Evolving
def checkpoint(): Dataset[T] = checkpoint(eager = true)
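Note that unlike RDD.checkpoint, Dataset.checkpoint() is eager by default and returns a new Dataset, so the result must be reassigned. A minimal sketch of the iterative use case the doc comment describes, assuming the spark session from earlier (the loop body and iteration counts are arbitrary):

import org.apache.spark.sql.functions.col

spark.sparkContext.setCheckpointDir("file:///tmp/checkpoint")

var df = spark.range(0, 1000).toDF("id")
for (i <- 1 to 10) {
  df = df.withColumn("id", col("id") + 1)
  if (i % 5 == 0) {
    // truncate the logical plan that grows with every iteration
    df = df.checkpoint() // eager: runs a job now, returns a Dataset backed by checkpoint files
  }
}
df.count()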