Spark Java sortByKey Secondary Sort and the Task not serializable Exception

Compared with Scala, writing a secondary sort in Java is somewhat more verbose. For reference:
Spark Java secondary sort: http://blog.csdn.net/leen0304/article/details/78280282
Spark Scala secondary sort: http://blog.csdn.net/leen0304/article/details/78280282

Below, the secondary sort is implemented with sortByKey.
To illustrate, here is a simple example in which the key consists of two parts: we sort ascending on the first part of the key and descending on the second part, as follows:

import java.io.Serializable;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SecondarySortByKey implements Serializable {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SecondarySortByKey").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<Tuple2<String, Integer>> list = Arrays.asList(
                new Tuple2<>("A", 10),
                new Tuple2<>("D", 20),
                new Tuple2<>("D", 6),
                new Tuple2<>("B", 6),
                new Tuple2<>("C", 12),
                new Tuple2<>("B", 2),
                new Tuple2<>("A", 3)
        );

        JavaRDD<Tuple2<String, Integer>> rdd1 = sc.parallelize(list);
        // Encode both parts into one composite string key; the value is a dummy 1.
        JavaPairRDD<String, Integer> pairRdd =
                rdd1.mapToPair(x -> new Tuple2<>(x._1() + " " + x._2(), 1));

        // Custom comparator: ascending on the first part, descending on the second.
        Comparator<String> comparator = new Comparator<String>() {
            @Override
            public int compare(String o1, String o2) {
                String[] oo1 = o1.split(" ");
                String[] oo2 = o2.split(" ");
                if (oo1[0].equals(oo2[0])) {
                    return -Integer.valueOf(oo1[1]).compareTo(Integer.valueOf(oo2[1]));
                } else {
                    return oo1[0].compareTo(oo2[0]);
                }
            }
        };

        JavaPairRDD<String, Integer> res = pairRdd.sortByKey(comparator);
        res.foreach(x -> System.out.println(x._1()));
    }
}

The code above compiles and looks correct, but at runtime it fails with the following error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: GCore.SecondarySortByKey$1
Serialization stack:

...

    at org.apache.spark.rdd.RDD.foreach(RDD.scala:916)
    at org.apache.spark.api.java.JavaRDDLike$class.foreach(JavaRDDLike.scala:351)
    at org.apache.spark.api.java.AbstractJavaRDDLike.foreach(JavaRDDLike.scala:45)
    at GCore.SecondarySortByKey.main(SecondarySortByKey.java:52)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.io.NotSerializableException: GCore.SecondarySortByKey$1

The gist of the exception: Task not serializable, because the anonymous comparator class (GCore.SecondarySortByKey$1) cannot be serialized.
Look at the Spark source for this sortByKey overload (JavaPairRDD.scala):

def sortByKey(comp: Comparator[K], ascending: Boolean): JavaPairRDD[K, V] = {
  implicit val ordering = comp // Allow implicit conversion of Comparator to Ordering.
  fromRDD(new OrderedRDDFunctions[K, V, (K, V)](rdd).sortByKey(ascending))
}

The line "implicit val ordering = comp" binds our Comparator as the implicit Ordering that OrderedRDDFunctions picks up, so our comp replaces the default sort rule.
So far nothing looks wrong. OrderedRDDFunctions itself extends Logging with Serializable, so it serializes fine. Scanning the error message again for "serializable" points back at our own code: the failing class is GCore.SecondarySortByKey$1, the anonymous inner class implementing Comparator. java.util.Comparator does not extend Serializable, yet the comparator is captured in the task closure and shipped to the executors, which requires it to be serializable. The fix is simply to supply a Comparator that also implements Serializable.
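You can confirm this diagnosis without Spark at all: plain Java serialization rejects the anonymous comparator in exactly the same way. A minimal standalone sketch (the class name and the trivial comparator body are illustrative):

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.util.Comparator;

public class SerializabilityCheck {
    public static void main(String[] args) throws Exception {
        // Same shape as above: an anonymous class that does not implement Serializable.
        Comparator<String> comparator = new Comparator<String>() {
            @Override
            public int compare(String o1, String o2) {
                return o1.compareTo(o2);
            }
        };
        // Spark serializes task closures with Java serialization by default; this line
        // throws java.io.NotSerializableException, just like the Spark job does.
        new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(comparator);
    }
}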
The fixed comparator looks like this:

import java.io.Serializable;
import java.util.Comparator;

// A comparator that is explicitly Serializable, so Spark can ship it to executors.
public class SecondaryComparator implements Comparator<String>, Serializable {
    @Override
    public int compare(String o1, String o2) {
        String[] oo1 = o1.split(" ");
        String[] oo2 = o2.split(" ");
        if (oo1[0].equals(oo2[0])) {
            // Descending on the second part of the key.
            return -Integer.valueOf(oo1[1]).compareTo(Integer.valueOf(oo2[1]));
        } else {
            // Ascending on the first part of the key.
            return oo1[0].compareTo(oo2[0]);
        }
    }
}

JavaPairRDD<String, Integer> res = pairRdd.sortByKey(new SecondaryComparator());
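An alternative on Java 8+, if you prefer not to define a named class, is to cast a lambda to an intersection type so that it implements both Comparator and Serializable. A sketch, assuming pairRdd is the composite-keyed RDD from the first listing:

// The intersection cast makes the lambda serializable, so the task closure can ship it.
JavaPairRDD<String, Integer> res2 = pairRdd.sortByKey(
        (Comparator<String> & Serializable) (o1, o2) -> {
            String[] oo1 = o1.split(" ");
            String[] oo2 = o2.split(" ");
            if (oo1[0].equals(oo2[0])) {
                return -Integer.valueOf(oo1[1]).compareTo(Integer.valueOf(oo2[1]));
            }
            return oo1[0].compareTo(oo2[0]);
        });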

Output:
A 10
A 3
B 6
B 2
C 12
D 20
D 6
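Note that after the sort the keys are the composite strings and the values are the dummy 1s. If downstream code needs the original (String, Integer) pairs back, one option is to decode the composite key. A sketch, assuming res is the sorted RDD from above:

// Split the composite key "A 10" back into the original (String, Integer) pair.
JavaPairRDD<String, Integer> restored = res.mapToPair(x -> {
    String[] parts = x._1().split(" ");
    return new Tuple2<>(parts[0], Integer.valueOf(parts[1]));
});
restored.foreach(t -> System.out.println(t._1() + " " + t._2()));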


For an analysis of the Spark "Task not serializable" problem, see: http://blog.csdn.net/javastart/article/details/51206715
