Compared with Scala, writing a secondary sort in Java is somewhat more tedious. For reference:
Spark Java secondary sort: http://blog.csdn.net/leen0304/article/details/78280282
Spark Scala secondary sort: http://blog.csdn.net/leen0304/article/details/78280282
Below, the secondary sort is implemented with sortByKey.
To illustrate, here is a simple example: each key consists of two parts, and we sort by the first part of the key in ascending order and by the second part in descending order. The code:
import java.io.Serializable;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SecondarySortByKey implements Serializable {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SecondarySortByKey").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        List<Tuple2<String, Integer>> list = Arrays.asList(
                new Tuple2<>("A", 10),
                new Tuple2<>("D", 20),
                new Tuple2<>("D", 6),
                new Tuple2<>("B", 6),
                new Tuple2<>("C", 12),
                new Tuple2<>("B", 2),
                new Tuple2<>("A", 3)
        );
        JavaRDD<Tuple2<String, Integer>> rdd1 = sc.parallelize(list);
        // fold both parts into one composite string key, e.g. "A 10"
        JavaPairRDD<String, Integer> pairRdd =
                rdd1.mapToPair(x -> new Tuple2<>(x._1() + " " + x._2(), 1));
        // custom comparator: first part ascending, second part descending
        Comparator<String> comparator = new Comparator<String>() {
            @Override
            public int compare(String o1, String o2) {
                String[] oo1 = o1.split(" ");
                String[] oo2 = o2.split(" ");
                if (oo1[0].equals(oo2[0])) {
                    // same first part: numeric comparison, negated for descending order
                    return -Integer.valueOf(oo1[1]).compareTo(Integer.valueOf(oo2[1]));
                } else {
                    // different first parts: ascending string order
                    return oo1[0].compareTo(oo2[0]);
                }
            }
        };
        JavaPairRDD<String, Integer> res = pairRdd.sortByKey(comparator);
        res.foreach(x -> System.out.println(x._1()));
    }
}
The logic of the code above is correct, but running it fails with the following error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: GCore.SecondarySortByKey$1
Serialization stack:
...
at org.apache.spark.rdd.RDD.foreach(RDD.scala:916)
at org.apache.spark.api.java.JavaRDDLike$class.foreach(JavaRDDLike.scala:351)
at org.apache.spark.api.java.AbstractJavaRDDLike.foreach(JavaRDDLike.scala:45)
at GCore.SecondarySortByKey.main(SecondarySortByKey.java:52)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.io.NotSerializableException: GCore.SecondarySortByKey$1
The gist of this exception is: Task not serializable. Take a look at the relevant Spark source:
def sortByKey(comp: Comparator[K], ascending: Boolean): JavaPairRDD[K, V] = {
  implicit val ordering = comp // Allow implicit conversion of Comparator to Ordering.
  fromRDD(new OrderedRDDFunctions[K, V, (K, V)](rdd).sortByKey(ascending))
}
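As an aside, this two-argument overload is also available directly on JavaPairRDD, so the overall direction can be flipped without rewriting the comparator. A minimal sketch, reusing pairRdd and comparator from above (the variable name reversed is mine, and the comparator must of course be serializable for the job to run):

// sortByKey(comp) is shorthand for sortByKey(comp, true);
// ascending = false inverts every comparison: first part descending, second part ascending.
JavaPairRDD<String, Integer> reversed = pairRdd.sortByKey(comparator, false);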
Inside sortByKey, the line implicit val ordering = comp puts the comparator into implicit scope, and OrderedRDDFunctions picks it up as its Ordering. That ordering is the sort rule actually applied, so the comp we pass in replaces the keys' natural ordering.
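To see why replacing the natural ordering matters here, consider what the no-argument form would do; a quick sketch with the same pairRdd:

// Without a custom comparator, sortByKey() falls back to the keys' natural
// (lexicographic) String order, e.g. "B 2" < "B 6" and "D 20" < "D 6",
// so the numeric second part would not be sorted in descending order.
pairRdd.sortByKey().foreach(x -> System.out.println(x._1()));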
Up to this point nothing looks wrong. But notice that OrderedRDDFunctions extends Logging with Serializable; going back to the error message, the word "serializable" stands out. Returning to the code above and checking the Comparator implementation reveals the real cause: the anonymous comparator does not implement Serializable, yet Spark must serialize it and ship it to the executors as part of the sort task. The fix is therefore simply to create a serializable comparator, as follows:
import java.io.Serializable;
import java.util.Comparator;

public class SecondaryComparator implements Comparator<String>, Serializable {
    @Override
    public int compare(String o1, String o2) {
        String[] oo1 = o1.split(" ");
        String[] oo2 = o2.split(" ");
        if (oo1[0].equals(oo2[0])) {
            // same first part: numeric comparison, negated for descending order
            return -Integer.valueOf(oo1[1]).compareTo(Integer.valueOf(oo2[1]));
        } else {
            // different first parts: ascending string order
            return oo1[0].compareTo(oo2[0]);
        }
    }
}
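Before wiring it back into the job, the comparator can be sanity-checked with a plain java.util sort, no cluster needed; a minimal sketch with the composite keys from the example (assumes java.util.ArrayList, Arrays and List are imported):

List<String> keys = new ArrayList<>(Arrays.asList(
        "A 10", "D 20", "D 6", "B 6", "C 12", "B 2", "A 3"));
keys.sort(new SecondaryComparator());   // same rules as the Spark job
keys.forEach(System.out::println);      // prints the keys in the expected order

Plugging the serializable comparator into sortByKey: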
JavaPairRDD<String, Integer> res = pairRdd.sortByKey(new SecondaryComparator());
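Since Java 8 there is also a lighter alternative: cast a lambda to an intersection type so it implements Serializable without a named class. A sketch under the same composite-key assumption (the variable name res2 is mine):

// A lambda cast to Comparator<String> & Serializable is serializable by the JVM,
// so Spark can ship it with the sort task just like the named class above.
JavaPairRDD<String, Integer> res2 = pairRdd.sortByKey(
        (Comparator<String> & Serializable) (o1, o2) -> {
            String[] a = o1.split(" ");
            String[] b = o2.split(" ");
            return a[0].equals(b[0])
                    ? -Integer.valueOf(a[1]).compareTo(Integer.valueOf(b[1]))
                    : a[0].compareTo(b[0]);
        });

Either variant yields the same ordering.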
The output:
A 10
A 3
B 6
B 2
C 12
D 20
D 6
For an analysis of the Spark "Task not serializable" problem, see: http://blog.csdn.net/javastart/article/details/51206715