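Contents
I. Introduction
II. Problem Analysis and Diagnosis
1. Problem description
2. Tracing the code
2.1 asJava
2.2 Decorators
2.3 mutableSetAsJavaSetConverter
2.4 MutableSetWrapper
III. Attempted Fixes
1. Adding a constructor ❌
2. Nesting the wrapper ❌
3. JavaConversions ❌
4. Plain conversion to java.util.Set
IV. Summary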
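A Spark project needed Google Guava's utility library. Guava is written in Java, so Scala collections have to be converted to their Java counterparts. Converting a Scala mutable.HashSet[T] to a Java util.Set[T] failed with java.io.InvalidClassException: scala.collection.convert.Wrappers$MutableSetWrapper; no valid constructor. What follows is the familiar pitfall-hunting routine.
To use Guava's com.google.common.collect.Sets, a Scala Array, Set, mutable.Set, or mutable.HashSet has to be converted to java.util.Set, so I created a singleton object (Scala's counterpart of a static utility class):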
import scala.collection.JavaConverters._
// An object declaration takes no parameter list, so there are no parentheses here.
object converUtil {
  def converToUtilSet(array: Array[String]): java.util.Set[String] = {
    collection.mutable.Set.apply(array:_*).asJava // convert to java.util.Set
  }
}
The method works in ordinary tests, but calling it inside an RDD operation fails:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 3, localhost, executor driver): java.io.InvalidClassException: scala.collection.convert.Wrappers$MutableSetWrapper; no valid constructor
at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:169)
at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:874)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2043)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at scala.collection.mutable.HashMap$$anonfun$readObject$1.apply(HashMap.scala:143)
at scala.collection.mutable.HashMap$$anonfun$readObject$1.apply(HashMap.scala:143)
at scala.collection.mutable.HashTable$class.init(HashTable.scala:106)
at scala.collection.mutable.HashMap.init(HashMap.scala:40)
at scala.collection.mutable.HashMap.readObject(HashMap.scala:143)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1170)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2178)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:582)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1170)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2178)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:452)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
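The stack trace is dominated by deserialize and SerialData frames. In the actual job log, the "After MapPartition" message was never printed: the task exited abnormally once it reached mapPartition. The exception is therefore most likely related to serialization: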
java.io.InvalidClassException: scala.collection.convert.Wrappers$MutableSetWrapper;
no valid constructor
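The stack trace is very long, but the message itself is clear: MutableSetWrapper has no valid constructor. So let's follow asJava down to the class it ends up instantiating, MutableSetWrapper.
The asJava method is defined in Decorators: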
private[collection] trait Decorators {
  /** Generic class containing the `asJava` converter method */
  class AsJava[A](op: => A) {
    /** Converts a Scala collection to the corresponding Java collection */
    def asJava: A = op
  }
  // (rest of the trait omitted)
}
Converting a Scala Set to a Java Set goes through the implicit converter mutableSetAsJavaSetConverter, which wraps its result in the AsJava class from Decorators:
/**
* Adds an `asJava` method that implicitly converts a Scala mutable `Set`
* to a Java `Set`.
*
* The returned Java `Set` is backed by the provided Scala `Set` and any
* side-effects of using it via the Java interface will be visible via
* the Scala interface and vice versa.
*
* If the Scala `Set` was previously obtained from an implicit or explicit
* call of `asSet(java.util.Set)` then the original Java `Set` will be
* returned.
*
* @param s The `Set` to be converted.
* @return An object with an `asJava` method that returns a Java `Set` view
* of the argument.
*/
implicit def mutableSetAsJavaSetConverter[A](s : mutable.Set[A]): AsJava[ju.Set[A]] =
  new AsJava(mutableSetAsJavaSet(s))
Following the call further leads to the WrapAsJava trait, where we finally meet our protagonist, MutableSetWrapper:
/**
* Implicitly converts a Scala mutable Set to a Java Set.
* The returned Java Set is backed by the provided Scala
* Set and any side-effects of using it via the Java interface will
* be visible via the Scala interface and vice versa.
*
* If the Scala Set was previously obtained from an implicit or
* explicit call of `asSet(java.util.Set)` then the original
* Java Set will be returned.
*
* @param s The Set to be converted.
* @return A Java Set view of the argument.
*/
implicit def mutableSetAsJavaSet[A](s: mutable.Set[A]): ju.Set[A] = s match {
  case JSetWrapper(wrapped) => wrapped
  case _ => new MutableSetWrapper(s)
}
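The doc comment above says two things: the returned Java Set is a view backed by the Scala Set, and a Scala Set that was itself obtained from a Java Set is unwrapped rather than wrapped again. A minimal sketch of both behaviours follows; ViewDemo and its member names are made up for illustration, and it assumes a Scala version in which scala.collection.JavaConverters is still available.

```scala
import scala.collection.JavaConverters._
import scala.collection.mutable

// Hypothetical demo object, not part of the original project.
object ViewDemo {
  // Returns three observations about the asJava conversion.
  def observe(): (Boolean, Boolean, Boolean) = {
    val scalaSet = mutable.Set("a", "b")
    val javaView: java.util.Set[String] = scalaSet.asJava

    scalaSet += "c"                            // write through the Scala side
    val seenFromJava = javaView.contains("c")

    javaView.add("d")                          // write through the Java side
    val seenFromScala = scalaSet.contains("d")

    // A Java set converted to Scala and back comes out as the same instance.
    val original = new java.util.HashSet[String]()
    val unwrapped = original.asScala.asJava eq original

    (seenFromJava, seenFromScala, unwrapped)
  }

  def main(args: Array[String]): Unit = println(observe())
}
```

All three observations should be true, which is the point to keep in mind later: asJava does not copy anything, it hands back a wrapper object around the Scala collection.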
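MutableSetWrapper is implemented as a case class with no explicitly declared constructor. (The listing below is actually its sibling MutableSeqWrapper from the same Wrappers file, which follows the same pattern.)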
case class MutableSeqWrapper[A](underlying: mutable.Seq[A]) extends ju.AbstractList[A] with IterableWrapperTrait[A] {
  def get(i: Int) = underlying(i)
  override def set(i: Int, elem: A) = {
    val p = underlying(i)
    underlying(i) = elem
    p
  }
}
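Since the complaint is "no valid constructor", many people online report fixing it by adding an empty (no-argument) constructor to the parent class and making it extend java.io.Serializable. Here, however, MutableSetWrapper belongs to the Scala standard library, so we cannot change its source, and this option is out.
Others nest the wrapper: they put the MutableSetWrapper conversion inside a custom class that extends Serializable: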
import scala.collection.JavaConverters._
class MySerializableClass extends Serializable {
  // scala Set to Java Set Converters
  def scalaToJavaSetConverter(arr: Array[String]): java.util.Set[String] = {
    collection.mutable.Set.apply(arr:_*).asJava
  }
}
The result is the same as calling the singleton object directly: no valid constructor.
Besides scala.collection.JavaConverters._, scala.collection.JavaConversions can also convert Scala collections to Java ones, so let's swap the method inside the object:
def converToUtilSet(array: Array[String]): java.util.Set[String] = {
  val mutableSet = collection.mutable.Set.apply(array:_*)
  scala.collection.JavaConversions.setAsJavaSet(mutableSet)
}
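Alas, same old story: it still cannot be serialized.
Skipping the intermediate steps and going straight to the class that does the work at the bottom: SetWrapper is in fact the parent class of MutableSetWrapper, so JavaConverters and JavaConversions should run into the same problem.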
class SetWrapper[A](underlying: Set[A]) extends ju.AbstractSet[A] {
  self =>
  override def contains(o: Object): Boolean = {
    try { underlying.contains(o.asInstanceOf[A]) }
    catch { case cce: ClassCastException => false }
  }
  override def isEmpty = underlying.isEmpty
  def size = underlying.size
  def iterator = new ju.Iterator[A] {
    val ui = underlying.iterator
    var prev: Option[A] = None
    def hasNext = ui.hasNext
    def next = { val e = ui.next(); prev = Some(e); e }
    def remove = prev match {
      case Some(e) =>
        underlying match {
          case ms: mutable.Set[a] =>
            ms remove e
            prev = None
          case _ =>
            throw new UnsupportedOperationException("remove")
        }
      case _ =>
        throw new IllegalStateException("next must be called at least once before remove")
    }
  }
}
None of the approaches above worked, so it is back to the most primitive option: create a java.util.Set by hand and add the elements one by one:
def converToUtilSet(array: Array[String]): java.util.Set[String] = {
  val javaSet = new java.util.HashSet[String]()
  array.foreach(javaSet.add)
  javaSet
}
It looks less elegant than the earlier attempts, but it solves the actual problem, and that makes it elegant enough!
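The same copy can also be written with the java.util.HashSet constructor that accepts a collection. This is only a sketch of an alternative spelling (CopyConvert and toJavaSet are hypothetical names). Like the version above, it produces a plain java.util.HashSet, which implements java.io.Serializable, rather than a Scala wrapper class.

```scala
import java.util.{Arrays, HashSet => JHashSet, Set => JSet}

// Hypothetical helper, shown only as an alternative to the foreach/add loop.
object CopyConvert {
  // Copies the array into a new java.util.HashSet in one expression.
  def toJavaSet(array: Array[String]): JSet[String] =
    new JHashSet[String](Arrays.asList(array: _*))
}
```

Either spelling copies the elements, so later changes to the source array are not reflected in the resulting set.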
Why would anyone use asJava or setAsJavaSet instead of the basic new + add? A quick test: run each of four ways of converting a Scala Array to a Java Set 1000 times and compare the elapsed time:
import scala.collection.JavaConverters._ // provides .asJava

def convertSet(arr: Array[String], format: String): Unit = {
  val st = System.currentTimeMillis()
  var epoch = 0
  while (epoch < 1000) {
    if (format.equals("Conversion")) {
      // JavaConversions: explicit call to the conversion method
      val mutableSet = collection.mutable.Set.apply(arr:_*)
      scala.collection.JavaConversions.setAsJavaSet(mutableSet)
    } else if (format.equals("Converter")) {
      // JavaConverters: explicit call to the decorator, then asJava
      val mutableSet = collection.mutable.Set.apply(arr:_*)
      scala.collection.JavaConverters.mutableSetAsJavaSetConverter(mutableSet).asJava
    } else if (format.equals("AsJava")) {
      // JavaConverters: the same decorator, applied implicitly
      collection.mutable.Set.apply(arr:_*).asJava
    } else {
      // Plain copy into a java.util.HashSet
      val javaSet = new java.util.HashSet[String]()
      arr.foreach(javaSet.add)
    }
    epoch += 1
  }
  val end = System.currentTimeMillis()
  println(s"Epoch: $epoch Format: $format Cost: ${end - st}")
}
Tested with Array[String] inputs of length 5 and 500:
| Cost (ms) | Short Array (length 5) | Long Array (length 500) |
| --- | --- | --- |
| Conversion | 58 | 162 |
| Converter | 19 | 78 |
| AsJava | 1 | 56 |
| Common | 8 | 75 |
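The timing loop above runs 1000 iterations with no warm-up, so JIT compilation and class loading may account for part of the gaps between rows, especially for whichever format was measured first. A rough sketch of a warmed-up measurement is below. ConvertBench and its member names are hypothetical, only the AsJava and Common strategies are included, and a harness such as JMH would be the more reliable choice for real numbers.

```scala
import scala.collection.JavaConverters._
import scala.collection.mutable

// Hypothetical names; a rough sketch, not a rigorous benchmark.
object ConvertBench {
  // Wrap: build a Scala mutable.Set, then expose it as a Java view.
  def viaAsJava(arr: Array[String]): java.util.Set[String] =
    mutable.Set(arr: _*).asJava

  // Copy: add every element to a fresh java.util.HashSet.
  def viaCopy(arr: Array[String]): java.util.Set[String] = {
    val javaSet = new java.util.HashSet[String]()
    arr.foreach(javaSet.add)
    javaSet
  }

  // Average time per call in nanoseconds, measured after `warmup` untimed calls.
  def averageNanos(rounds: Int, warmup: Int)(body: => Any): Double = {
    var i = 0
    while (i < warmup) { body; i += 1 }
    val start = System.nanoTime()
    i = 0
    while (i < rounds) { body; i += 1 }
    (System.nanoTime() - start).toDouble / rounds
  }

  def main(args: Array[String]): Unit = {
    val arr = Array.tabulate(500)(i => s"item-$i")
    println(s"AsJava: ${averageNanos(1000, 5000)(viaAsJava(arr))} ns/call")
    println(s"Common: ${averageNanos(1000, 5000)(viaCopy(arr))} ns/call")
  }
}
```

Both strategies have to hash every element once (into the Scala set or into the Java set), so similar figures after warm-up would not be surprising.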
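In this test AsJava came out fastest, though the difference is only one of syntax: AsJava ultimately goes through JavaConverters as well. Overall the problem is solved, but why a case class with no explicit constructor cannot be serialized still puzzles me.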
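One plausible explanation for that last question, offered as a sketch rather than a confirmed diagnosis of the library: Java serialization requires the nearest non-serializable superclass of a Serializable class to have an accessible no-argument constructor, because that constructor is run to initialise the superclass part during deserialization. In the SetWrapper listing above, the parent class neither extends Serializable nor declares such a constructor, while the case class MutableSetWrapper is Serializable. The hypothetical classes below (none of them from the Scala library) reproduce the same message with plain Java serialization:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.util.Try

// All names here are hypothetical stand-ins, not classes from the Scala library.
object NoValidConstructorDemo {
  // Like SetWrapper in the listing: not Serializable, no no-arg constructor.
  class PlainBase(label: String) {
    def describe: String = s"wrapper of $label"
  }
  // Case classes are Serializable, but the parent part cannot be re-created on read.
  case class BrokenChild(label: String) extends PlainBase(label)

  // Same shape, except that the parent is Serializable too.
  class SerializableBase(label: String) extends Serializable {
    def describe: String = s"wrapper of $label"
  }
  case class WorkingChild(label: String) extends SerializableBase(label)

  // Serialize to bytes and read back with plain Java serialization.
  def roundTrip[T <: AnyRef](value: T): Try[T] = Try {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    println(roundTrip(BrokenChild("x")))  // fails on read: no valid constructor
    println(roundTrip(WorkingChild("x"))) // succeeds
  }
}
```

Writing BrokenChild succeeds and only reading it back fails, which matches the stack trace: the exception surfaces in ObjectInputStream on the executor side. It would also explain why hosting the conversion method in a Serializable class did not help, since the object being shipped is still the wrapper itself.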