The Databricks knowledge base on GitHub Pages collects answers to problems that frequently come up when running Spark jobs; this post translates the chapter on task and object serialization.
Original link: Job aborted due to stage failure: Task not serializable
If you see this error after submitting a Spark job:
org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: ...
The above error is triggered when you initialize a variable on the driver (master) but then try to use it on one of the workers (executors). In that case, Spark will try to serialize the object so it can be sent over to the worker, and will fail if the object is not serializable. Consider the following code snippet:
NotSerializable notSerializable = new NotSerializable();
JavaRDD<String> rdd = sc.textFile("/tmp/myfile");
rdd.map(s -> notSerializable.doSomething(s)).collect();
This will trigger the error. Here are some ways to avoid it:
Call forEachPartition (instead of map) and create the object inside it, so the instance is constructed on each executor rather than serialized from the driver:
rdd.forEachPartition(iter -> {
NotSerializable notSerializable = new NotSerializable();
// ...Now process iter
});
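Another common workaround, not spelled out in the snippet above, is to construct the non-serializable object inside the closure itself, so the driver never has to serialize it at all. Below is a minimal sketch in Scala; the NotSerializable class and its doSomething method are hypothetical stand-ins mirroring the Java example:

class NotSerializable {
  def doSomething(s: String): String = s.toUpperCase  // stand-in for the real work
}

val rdd = sc.textFile("/tmp/myfile")

// The instance is created inside the closure, on the executors,
// so nothing non-serializable is captured from the driver.
val result = rdd.map { s =>
  val notSerializable = new NotSerializable()
  notSerializable.doSomething(s)
}.collect()

The trade-off is that the object is built once per record; forEachPartition (or mapPartitions) keeps construction to once per partition, which matters when it is expensive to create, e.g. a database connection.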
1) A problem encountered in practice
val path = new Path("GOALPATH")
val pathStr = path.toString
// Version that takes a Hadoop Path
def readData(sqlContext : SQLContext, path : Path) : DataFrame = {
  sqlContext.read.format("...").load(path.toString)
  ...
}
// Version that takes a plain String
def readData(sqlContext : SQLContext, pathStr : String) : DataFrame = {
  sqlContext.read.format("...").load(pathStr)
  ...
}
val bString = sc.broadcast("broadcast")
/* fails: Task not serializable */
val data = readData(sqlContext, path)
data.map { value =>
  val str = bString.value
  ...
}
/* works */
val dataStr = readData(sqlContext, pathStr)
dataStr.map { value =>
  val str = bString.value
  ...
}
The first version, which passes the Path, fails with an error saying that the Path class cannot be serialized. My reading is that SQLContext's read loads the target file in a distributed fashion, so the Path would have to be serialized and shipped to the executors; the stock org.apache.hadoop.fs.Path class only implements the Comparable interface and is not Serializable, hence the error.
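A quick way to double-check that claim is to test for the interfaces at runtime. This is just a sanity check: on the older Hadoop 2.x releases this post targets, the first line prints false, while newer Hadoop releases have since made Path serializable, so the result depends on the Hadoop version on your classpath:

import org.apache.hadoop.fs.Path

// Does Path implement java.io.Serializable? false on older Hadoop releases.
println(classOf[java.io.Serializable].isAssignableFrom(classOf[Path]))
// Does Path implement Comparable? true.
println(classOf[Comparable[_]].isAssignableFrom(classOf[Path]))

Either way, the practical takeaway is the one used above: convert non-serializable driver-side objects such as Path into serializable values (here a plain String) on the driver, before they can end up in anything Spark has to ship to the executors.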