[Spark] java.io.NotSerializableException: org.apache.spark.unsafe.types.UTF8String$IntWrapper

I'm still a beginner! If I've gotten anything wrong, please point it out and I'll correct it.

 

I ran into a rather strange error, and I could not find a solution for it on any Chinese-language site, so I'm recording it here.

1. Environment:

This is a Spark "Task not serializable" problem, so the only thing that matters is the Spark version. Mine is Spark 2.2.0.

2. The error:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:840)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:389)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
	at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:88)
	at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:124)
	at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:115)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
	at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
	at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:252)
	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:386)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
	at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:228)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:311)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2853)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2153)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2153)
	at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2153)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2366)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:245)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:644)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:603)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:612)
	at com.test.XXX.getXXXX(App.java:473)
	at com.test.BaseTest.main(BaseTest.java:66)
Caused by: java.io.NotSerializableException: org.apache.spark.unsafe.types.UTF8String$IntWrapper
Serialization stack:

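The full trace is longer than this, but the trick with any error is to pull out the key part. Here it is simply that an object of this class cannot be serialized: org.apache.spark.unsafe.types.UTF8String$IntWrapper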
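To make the class name in the message easier to read: the `$` means IntWrapper is a class nested inside UTF8String. The sketch below reproduces the same kind of failure with plain Java serialization and no Spark at all. `NestedNameDemo` and its `IntWrapper` are hypothetical stand-ins written for this illustration, not Spark's code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;

class NestedNameDemo {
    // Hypothetical stand-in: a nested static class that does NOT implement Serializable.
    static class IntWrapper {
        public int value = 0;
    }

    // Tries to Java-serialize obj. Returns the exception message if the object
    // is not serializable, or null if serialization succeeds.
    static String failureMessage(Object obj) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream());
        try {
            out.writeObject(obj);
            return null;
        } catch (NotSerializableException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // The message names the offending class in Outer$Inner form,
        // the same form as UTF8String$IntWrapper in the trace above.
        System.out.println(failureMessage(new IntWrapper()));
        System.out.println(IntWrapper.class.getName());
    }
}
```

So when a "Caused by: java.io.NotSerializableException" line appears, the name after the colon is the class to look at, and the part after `$` is the nested class inside it.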

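3. Cause (see the figure):

[Figure 1: screenshot of the DSL code, with field names masked]

The query is written with the Spark DSL. It joins table A with table B, and the select list uses CASE WHEN expressions that wrap SQL window functions (the SQL is a little complex, but not too bad). If you are not sure how to write this, you can refer to my figure, although the masking is messy; please bear with me, since the real field names cannot be shown. The error appears when I show the t1 and t2 columns (if those two columns are not displayed, there is no error).

The equivalent SQL: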

SELECT query_field_5, query_field_6, query_field_7, query_field_8
FROM (
    SELECT
        query_field_1,
        query_field_2,
        query_field_3,
        query_field_4,
        CASE WHEN (MAX(XXX) OVER w) = 1 THEN 1 ELSE 0 END AS t1,
        CASE WHEN (MIN(XXX) OVER w) = 0 THEN 1 ELSE 0 END AS t2
    FROM (
        SELECT Z, ZZ, ZZZ, ZZZZ
        FROM table_A
        WHERE filter_condition
    ) info
    INNER JOIN
    (
        SELECT X, XX, XXX, XXXX
        FROM table_B
        WHERE filter_condition
    ) hav
    ON info.join_key = hav.join_key
    WINDOW w AS (PARTITION BY field_A, field_B, field_C ORDER BY field_D RANGE BETWEEN window_interval PRECEDING AND window_interval FOLLOWING)
) db1
WHERE filter_condition
;

Reading the DSL code and the SQL side by side should make it clear what I was writing (sorry again that all the fields are masked).

 

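4. Solutions:

Method 1:

Testing showed that if I write the whole query as SQL and run it with spark.sql("select XXXX"), there is no error. Frustrating, but this approach does work. I am a bit of a perfectionist, though (if your manager asks you to write code and you hand in a single SQL string, it can look like you are cutting corners), so I did not want to settle for it. I wanted to fix the error properly.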

Method 2:

After a lot of searching, I finally found a solution.

Searching on Google, I found a pull request (PR) submitted to Spark by the user debugger87.

Link: https://github.com/debugger87/spark/commit/2bd33d819137220c55e8ddf7b6df7b98945046aa

If it is not clear what the pull request does, here is my explanation:

[Figure 2: the two static classes changed by the pull request]

The change makes two static classes implement Serializable. Both are nested inside the class org.apache.spark.unsafe.types.UTF8String.
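As a minimal sketch of what this kind of fix does, the example below adds `implements Serializable` to a wrapper of the same shape and round-trips it through Java serialization. `WrapperFixDemo`, its `IntWrapper`, and `roundTrip` are hypothetical names for this illustration; the actual Spark source may differ in detail (for example in field modifiers).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class WrapperFixDemo {
    // Hypothetical wrapper with the one-line change applied:
    // the nested static class now implements Serializable.
    static class IntWrapper implements Serializable {
        private static final long serialVersionUID = 1L;
        public int value = 0;
    }

    // Serializes obj to bytes and reads it back, returning the deserialized copy.
    static <T extends Serializable> T roundTrip(T obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            @SuppressWarnings("unchecked")
            T copy = (T) in.readObject();
            return copy;
        }
    }

    public static void main(String[] args) throws Exception {
        IntWrapper original = new IntWrapper();
        original.value = 42;
        IntWrapper copy = roundTrip(original);
        // The copy is a distinct object carrying the same field value.
        System.out.println(copy.value + " " + (copy != original));
    }
}
```

Once the class is serializable, an object of that type that ends up inside a task closure no longer makes the closure fail the serializability check.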

 

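How to apply the change:

You cannot edit the class directly, because it ships already compiled.

If you only want to run your code locally:

Copy the source of the class you need to change into your own project, under the same package. At run time your copy shadows the class from the Spark jar, so the change takes effect, as shown in the figures:

[Figure 3: the copied class placed in the project]

[Figure 4: the copied class under the same package as the original]

If this is still unclear, leave me a comment, for example about how to copy the code out or how to create an identical package.

However, when I tried this on the server, it made no difference whether I built a fat jar or a thin jar: Spark never loaded my modified UTF8String class. It only loaded the class from its own local jar (this class ships with Spark itself, in spark-unsafe_2.11-2.2.0.jar; it is not an external dependency). So the approach above works locally but not on the server.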
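One way to check which copy of a class actually wins is to ask the JVM where it loaded the class from. This is a sketch using only standard JDK calls; `ClassOriginDemo` and `originOf` are names made up for this illustration, and the post itself did not use this check.

```java
import java.security.CodeSource;

class ClassOriginDemo {
    // Returns the location (jar or directory URL) a class was loaded from,
    // or a placeholder for classes that carry no code source,
    // such as JDK classes loaded by the bootstrap loader.
    static String originOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        // A JDK class has no code source; an application class reports its jar or directory.
        System.out.println(originOf(String.class));
        System.out.println(originOf(ClassOriginDemo.class));
    }
}
```

In a Spark job, passing `org.apache.spark.unsafe.types.UTF8String.class` to a helper like this should show whether the class came from your own jar or from Spark's spark-unsafe jar, which would confirm the local-versus-server difference described above.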

 

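If you want your code to run on the server:

The laborious way, which teaches you more:

Download the Spark source, find this class, add implements Serializable to the two static classes, then build Spark and run against your own build. (This is tedious if you have never built Spark, but I wrote a post on how to build it, and I have verified that it works.) Build post: https://blog.csdn.net/lsr40/article/details/80116235

The relatively simple and quick way:

Find a Spark release later than 2.2.0, but not too much later. I used 2.2.2 (I built it myself, but you can look for a prebuilt distribution). Take spark-unsafe_2.11-2.2.2.jar from its jars and put it into Spark's jar directory on the server, replacing the old file (I also renamed it to spark-unsafe_2.11-2.2.0.jar). After that, as long as Spark loads its jars from that directory, the problem no longer appears.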

 

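Later I also ran into a data skew problem. I cover the solution in my next post (the technique itself is old; the ideas for handling data skew in big data are important and widely applicable).

Post link: https://blog.csdn.net/lsr40/article/details/80322553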

 

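Aside:

Here is a post on solving the serialization problems most people run into. (For this kind of error, find out which class failed to serialize, then either make it implement the serialization interface, or avoid referencing outside object variables inside closure operators such as map. You can create the object inside the map function itself, or consider a broadcast variable.) I'm sorry that I could not find the original author: the post I found was itself a repost with no source given. If I've offended the original author, I apologize; if you are the original author, message me and I will gladly add your ID.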
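The difference between referencing an outside object and creating it inside the function can be shown without Spark. The sketch below uses a serializable lambda type of my own (`SerFunction`, hypothetical, though similar in spirit to the function interfaces in Spark's Java API, which also extend Serializable) and a hypothetical non-serializable helper, `Formatter`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

class ClosureCaptureDemo {
    // Hypothetical serializable function type, so lambdas of this type can be serialized.
    interface SerFunction<A, B> extends Function<A, B>, Serializable {}

    // Hypothetical helper that is NOT Serializable.
    static class Formatter {
        String format(int x) { return "v" + x; }
    }

    // Returns true if obj can be written with Java serialization.
    static boolean isSerializable(Object obj) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(obj);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Captures an outside object: serializing the lambda must also serialize
    // the captured Formatter, which fails.
    static SerFunction<Integer, String> capturing() {
        Formatter outer = new Formatter();
        return x -> outer.format(x);
    }

    // Creates the helper inside the function body: nothing is captured,
    // so the lambda serializes fine.
    static SerFunction<Integer, String> selfContained() {
        return x -> new Formatter().format(x);
    }

    public static void main(String[] args) {
        System.out.println("capturing: " + isSerializable(capturing()));
        System.out.println("self-contained: " + isSerializable(selfContained()));
    }
}
```

Creating the helper per call can be wasteful for expensive objects. In that case, making the class Serializable or using a broadcast variable, as mentioned above, may be the better choice.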

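Serialization problems: https://blog.csdn.net/javastart/article/details/51206715

Since I just mentioned closures, some readers may ask what a closure is. (Again, this is a reposted article with no source given, sorry.)

Closures: https://blog.csdn.net/wy_blog/article/details/57130702

That's it. As I said, I'm still a beginner, so if anything here is wrong, please point it out!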

 
