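In MapReduce programs we often need to sort the values that belong to the same key. This is known as "secondary sort": the key and the value are combined into a new composite key, and the framework's sort then orders the records by that composite key. In Hadoop 1.0.4 the ordering is configured with setSortComparatorClass(), which requires you to implement the comparator yourself. Below is an example of such a comparator.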
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public static class SortComparator extends WritableComparator {

    protected SortComparator() {
        // true: let WritableComparator create Text instances for the keys it deserializes
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // The composite key consists of three ':'-separated fields.
        String[] strs_a = ((Text) a).toString().split(":");
        String[] strs_b = ((Text) b).toString().split(":");
        if ((strs_a.length != 3) || (strs_b.length != 3)) {
            // log is a logger defined in the enclosing class
            log.error("Error: dimension error 1 in SortComparator!");
            System.exit(1);
        }
        // Order by the first field (integer), then by the second field (double).
        int cmp = Integer.compare(Integer.parseInt(strs_a[0]), Integer.parseInt(strs_b[0]));
        if (cmp != 0) {
            return cmp;
        }
        // Returns 0 when both fields are equal, as the comparator contract requires.
        return Double.compare(Double.parseDouble(strs_a[1]), Double.parseDouble(strs_b[1]));
    }
}
Then register it on the job:
job.setSortComparatorClass(SortComparator.class);
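To see the ordering rule in isolation, here is a minimal plain-Java sketch that applies the same comparison (first field as an integer, then second field as a double) to ordinary strings, without any Hadoop types. The class name CompositeKeySortDemo and the sample keys are made up for illustration. It only models the comparison logic, not how Hadoop actually invokes the comparator during the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration (no Hadoop types): orders composite keys of the
// assumed form "field0:field1:field2" by field 0, then by field 1.
class CompositeKeySortDemo {

    // Same rule as the sort comparator: field 0 compared as an int,
    // then field 1 compared as a double. Equal keys compare as 0.
    static final Comparator<String> SORT_ORDER = (a, b) -> {
        String[] fa = a.split(":");
        String[] fb = b.split(":");
        int byFirst = Integer.compare(Integer.parseInt(fa[0]), Integer.parseInt(fb[0]));
        if (byFirst != 0) {
            return byFirst;
        }
        return Double.compare(Double.parseDouble(fa[1]), Double.parseDouble(fb[1]));
    };

    // Returns a sorted copy; the input list is left untouched.
    static List<String> sortKeys(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(SORT_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical sample keys, invented for this example.
        List<String> keys = Arrays.asList("2:0.5:x", "1:3.2:x", "1:0.7:x", "2:0.1:x");
        System.out.println(sortKeys(keys)); // [1:0.7:x, 1:3.2:x, 2:0.1:x, 2:0.5:x]
    }
}
```

Keys that agree on the first two fields compare as equal here, which is what lets the sort treat them consistently regardless of the order in which they are compared.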
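Because we are using secondary sort, the key is now the composite key (as noted above, the original key and the value merged into a new key). We therefore also need to define a grouping comparator. Its job is to decide which composite keys count as the same key (that is, share the same key from before the merge), so that their records are passed together to a single reduce() call. The official documentation describes it as: "Define the comparator that controls which keys are grouped together for a single call to Reducer.reduce(Object, Iterable, org.apache.hadoop.mapreduce.Reducer.Context)". Below is an example of a grouping comparator.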
public static class GroupComparator extends WritableComparator {

    protected GroupComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String[] strs_a = ((Text) a).toString().split(":");
        String[] strs_b = ((Text) b).toString().split(":");
        if ((strs_a.length != 3) || (strs_b.length != 3)) {
            log.error("Error: dimension error 1 in GroupComparator!");
            System.exit(1);
        }
        // Group on the first and third fields only; the second field is ignored.
        // The ':' separator keeps pairs such as ("1", "23") and ("12", "3") distinct.
        String new_key_a = strs_a[0] + ":" + strs_a[2];
        String new_key_b = strs_b[0] + ":" + strs_b[2];
        return new_key_a.compareTo(new_key_b);
    }
}

Then register it on the job:
job.setGroupingComparatorClass(GroupComparator.class);
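The effect of the grouping comparator can be pictured with a simplified plain-Java model: walk the keys in sorted order and start a new group whenever the grouping comparison between neighbouring keys is non-zero. The sketch below is only an approximation of what the framework does on the reduce side. The class name GroupingDemo and the sample keys are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of reduce-side grouping, using plain strings instead of Text.
class GroupingDemo {

    // Two composite keys fall into the same group when fields 0 and 2 match;
    // field 1 is ignored, just as in the grouping comparator.
    static boolean sameGroup(String a, String b) {
        String[] fa = a.split(":");
        String[] fb = b.split(":");
        return fa[0].equals(fb[0]) && fa[2].equals(fb[2]);
    }

    // Expects keys that are already sorted. Each returned inner list stands
    // for the keys that would be handled by one reduce() call in this model.
    static List<List<String>> group(List<String> sortedKeys) {
        List<List<String>> groups = new ArrayList<>();
        String previous = null;
        for (String key : sortedKeys) {
            if (previous == null || !sameGroup(previous, key)) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical keys, already in sort-comparator order.
        List<String> sorted = Arrays.asList("1:0.7:x", "1:3.2:x", "2:0.1:x", "2:0.5:x");
        System.out.println(group(sorted)); // [[1:0.7:x, 1:3.2:x], [2:0.1:x, 2:0.5:x]]
    }
}
```

Because only neighbouring keys are compared, keys of one group end up in the same reduce() call only if the sort order places them next to each other.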
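Finally, define a partitioner that partitions on the first field of the composite key only, so that all records sharing that field are sent to the same reducer.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public static class Patitioner extends HashPartitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String[] new_key = key.toString().split(":");
        if (new_key.length != 3) {
            log.error("Error: dimension error in partitioner!");
            System.exit(1);
        }
        // Hash only the first field, so every composite key derived from the
        // same first field lands in the same partition.
        return super.getPartition(new Text(new_key[0]), value, numReduceTasks);
    }
}

Then register it on the job: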
job.setPartitionerClass(Patitioner.class);
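As a rough illustration of why partitioning on the first field keeps related records together, the following plain-Java sketch applies the usual hash-partitioning formula, (hash & Integer.MAX_VALUE) % numReduceTasks, to the first field of a composite key. It assumes String.hashCode() as a stand-in for Text.hashCode(), so the partition numbers it produces will not match those on a real cluster. The class name PartitionDemo and the sample keys are invented for this example.

```java
// Plain-Java sketch of hash partitioning on the first field of a composite key.
class PartitionDemo {

    // Assumption: String.hashCode() replaces Text.hashCode(), so the concrete
    // partition numbers are illustrative only.
    static int partitionFor(String compositeKey, int numReduceTasks) {
        String firstField = compositeKey.split(":")[0];
        // Masking with Integer.MAX_VALUE clears the sign bit, so the result
        // of the modulo is never negative.
        return (firstField.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Both keys share the first field "1", so they get the same partition.
        System.out.println(partitionFor("1:0.7:x", 4));
        System.out.println(partitionFor("1:3.2:y", 4));
    }
}
```

Whatever the other two fields contain, keys with the same first field always map to the same partition, which is the property the grouping step relies on.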