Problem: sort the following data, first by key in ascending order; when keys are equal, sort by value in ascending order.
key  value
30   20
40   20
40   10
40   5
30   30
30   10
50   10
40   30
30   40
50   20
50   60
Expected output:
----------------------------------
30   10
30   20
30   30
30   40
----------------------------------
40   5
40   10
40   20
40   30
----------------------------------
50   10
50   20
50   60
The code below is organized as follows: in this tutorial, SecondSortMapReduce is the outer class, and all the other classes are its static inner classes.
In MapReduce, the <key, value> pairs passed through the framework are sorted by key, so the final output is ordered by key. Sometimes, on top of the key ordering, we also want the values sorted within each key. This requirement is called secondary sort.
Let's first walk through how a Mapper task processes data. The processing breaks down into five stages:
(1) The Mapper task receives its input split and repeatedly calls the map function on each record; every call can emit new <key, value> pairs.
(2) The partition function is applied to each <key, value> emitted by map, assigning it to a partition. Pairs in different partitions are sent to different Reducer tasks.
(3) Within each partition, the pairs are sorted by key. The key must implement the WritableComparable interface, which extends Comparable, so keys can be compared and therefore sorted.
(4) The sorted <key, value> pairs are then grouped by key: pairs with equal keys fall into the same group, and each group triggers exactly one call to the reduce function.
(5) The sorted, grouped data is sent to the Reducer node.
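For orientation, stages (2) through (4) each correspond to a configuration hook on the Job object. Here is a minimal sketch of the wiring, using the class names defined later in this tutorial:

// stage (2): decide which Reducer each <key, value> pair is sent to
job.setPartitionerClass(FirstPartitioner.class);
// stage (3): by default the sort uses the key's own compareTo; a custom
// comparator can be plugged in with job.setSortComparatorClass(...)
// stage (4): decide which consecutive keys share a single reduce call
job.setGroupingComparatorClass(GroupingComparator.class);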
Nowhere in this pipeline does MapReduce sort the values. So how do we get the values sorted as well? We need a workaround.
The workaround: combine the key and the value into a new composite key, call it newkey. newkey then has two fields, say k and v, which are the original key and value; the original value is still emitted unchanged, so the value appears both inside newkey and in the value position. We then define the comparison rule for newkey: sort by k first, and by v when k is equal. For grouping, we group by the original key k alone, so the original grouping logic is unaffected. Finally, at output time we write out only the original key and value. This achieves secondary sort without any value-sorting support from the framework itself.
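For example, the input record <40, 5> is emitted as <(40, 5), 5>. During the sort, (40, 5) comes before (40, 10), which comes before (40, 20), so within key 40 the values arrive in ascending order. The grouping comparator then treats (40, 5), (40, 10), (40, 20), and (40, 30) as one group, because only the first field is compared, and the single reduce call for key 40 sees the values 5, 10, 20, 30 in order.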
With the analysis done, let's see how to implement this on Hadoop 2.x.
In Hadoop, the Mapper's output key must implement the WritableComparable interface. Our composite key looks like this:
public static class IntPair implements WritableComparable<IntPair> {
    private int first = 0;
    private int second = 0;

    public void set(int left, int right) {
        first = left;
        second = right;
    }

    public int getFirst() { return first; }

    public int getSecond() { return second; }

    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    @Override
    public int hashCode() {
        // Combine both fields so that pairs differing only in 'second'
        // still hash differently.
        return first * 157 + second;
    }

    @Override
    public boolean equals(Object right) {
        if (right instanceof IntPair) {
            IntPair r = (IntPair) right;
            return r.first == first && r.second == second;
        }
        return false;
    }

    // This method is the key part: when the framework sorts the keys,
    // it calls this compareTo method, so ordering by first and then by
    // second is exactly what produces the secondary sort.
    @Override
    public int compareTo(IntPair o) {
        if (first != o.first) {
            return Integer.compare(first, o.first); // avoids overflow of first - o.first
        }
        return Integer.compare(second, o.second);
    }
}
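A quick way to convince yourself of the ordering rule is to compare a few pairs directly. This is a standalone sketch, not part of the job; the pairs come from the sample data above:

IntPair a = new IntPair(); a.set(40, 5);
IntPair b = new IntPair(); b.set(40, 10);
IntPair c = new IntPair(); c.set(30, 40);
System.out.println(a.compareTo(b)); // negative: same first field, 5 < 10
System.out.println(c.compareTo(a)); // negative: 30 < 40, second field not consulted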
public static class SortMapp extends Mapper<Object, Text, IntPair, IntWritable> {
    private final IntPair mapKey = new IntPair();
    private final IntWritable mapValue = new IntWritable();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line holds "key value"; parse both integers.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        int first = 0;
        int second = 0;
        if (tokenizer.hasMoreTokens()) {
            first = Integer.parseInt(tokenizer.nextToken());
            if (tokenizer.hasMoreTokens()) {
                second = Integer.parseInt(tokenizer.nextToken());
            }
            // Emit the composite (first, second) as the key and the
            // original value again as the map output value.
            mapKey.set(first, second);
            mapValue.set(second);
            System.out.println("key:" + first + " value:" + second); // debug trace
            context.write(mapKey, mapValue);
        }
    }
}
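Note that mapKey and mapValue are deliberately reused across map() calls: context.write serializes the pair immediately, so reuse is safe and avoids allocating two fresh objects per input record.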
/**
 * Partitions map output by the first field only, so every record with
 * the same original key is sent to the same Reducer.
 * @author king-pan
 */
public static class FirstPartitioner extends Partitioner<IntPair, IntWritable> {
    @Override
    public int getPartition(IntPair key, IntWritable value, int numPartitions) {
        return Math.abs(key.getFirst() * 127) % numPartitions;
    }
}
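Only getFirst() participates in partitioning. If the second field were included, records sharing the same original key could be routed to different Reducers, and no grouping comparator could bring them back together.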
After partitioning and sorting, the Reducer side must group records by the original key alone; otherwise every distinct (key, value) pair would form its own group and trigger its own reduce call. The grouping comparator below therefore compares only the first field:
public static class GroupingComparator extends WritableComparator {
    protected GroupingComparator() {
        super(IntPair.class, true);
    }

    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        IntPair i1 = (IntPair) a;
        IntPair i2 = (IntPair) b;
        // Compare the first field only: pairs with the same original key
        // are considered equal here and therefore grouped together.
        return Integer.compare(i1.getFirst(), i2.getFirst());
    }
}
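Passing true to the WritableComparator constructor tells it to create IntPair instances internally, so compare() receives fully deserialized keys rather than raw bytes.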
Finally, the Reducer receives the grouped data and writes the output:
public static class SortReduce extends Reducer<IntPair, IntWritable, Text, IntWritable> {
    private final Text outputKey = new Text();
    private static final Text SEPARATOR = new Text("----------------------------------");

    @Override
    protected void reduce(IntPair key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // One reduce call per original key (thanks to GroupingComparator);
        // the values iterator already yields the values in ascending order.
        context.write(SEPARATOR, null);
        outputKey.set(Integer.toString(key.getFirst()) + "\t");
        for (IntWritable value : values) {
            context.write(outputKey, value);
        }
    }
}
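Writing null as the value makes TextOutputFormat print only the key, so the SEPARATOR line appears by itself, reproducing the dashed dividers in the expected output above.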
package com.mscncn.hadoop.sort;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SecondSortMapReduce extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // 1. Load the configuration
        Configuration conf = getConf();
        // 2. Create the job
        Job job = Job.getInstance(conf);
        job.setJobName("secondary-sort"); // job name shown in the UI
        job.setJarByClass(SecondSortMapReduce.class); // locate the job jar on a cluster
        // 3. Configure the job
        // a. Input format
        job.setInputFormatClass(TextInputFormat.class);
        // b. Mapper
        job.setMapperClass(SortMapp.class);
        // c. Map output types
        job.setMapOutputKeyClass(IntPair.class);
        job.setMapOutputValueClass(IntWritable.class);
        // d. Partitioner
        job.setPartitionerClass(FirstPartitioner.class);
        // e. Grouping comparator
        job.setGroupingComparatorClass(GroupingComparator.class);
        // f. Reduce output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // g. Input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setReducerClass(SortReduce.class);
        // 4. Submit the job and wait for completion
        job.waitForCompletion(true);
        // 5. Return the job status: 0 on success, 1 otherwise
        return job.isSuccessful() ? 0 : 1;
    }

    public static class SortMapp extends Mapper<Object, Text, IntPair, IntWritable> {
        private final IntPair mapKey = new IntPair();
        private final IntWritable mapValue = new IntWritable();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line holds "key value"; parse both integers.
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            int first = 0;
            int second = 0;
            if (tokenizer.hasMoreTokens()) {
                first = Integer.parseInt(tokenizer.nextToken());
                if (tokenizer.hasMoreTokens()) {
                    second = Integer.parseInt(tokenizer.nextToken());
                }
                // Emit the composite (first, second) as the key and the
                // original value again as the map output value.
                mapKey.set(first, second);
                mapValue.set(second);
                System.out.println("key:" + first + " value:" + second); // debug trace
                context.write(mapKey, mapValue);
            }
        }
    }

    public static class SortReduce extends Reducer<IntPair, IntWritable, Text, IntWritable> {
        private final Text outputKey = new Text();
        private static final Text SEPARATOR = new Text("----------------------------------");

        @Override
        protected void reduce(IntPair key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // One reduce call per original key; values arrive in ascending order.
            context.write(SEPARATOR, null);
            outputKey.set(Integer.toString(key.getFirst()) + "\t");
            for (IntWritable value : values) {
                context.write(outputKey, value);
            }
        }
    }

    public static class IntPair implements WritableComparable<IntPair> {
        private int first = 0;
        private int second = 0;

        public void set(int left, int right) {
            first = left;
            second = right;
        }

        public int getFirst() { return first; }

        public int getSecond() { return second; }

        public void readFields(DataInput in) throws IOException {
            first = in.readInt();
            second = in.readInt();
        }

        public void write(DataOutput out) throws IOException {
            out.writeInt(first);
            out.writeInt(second);
        }

        @Override
        public int hashCode() {
            // Combine both fields so that pairs differing only in 'second'
            // still hash differently.
            return first * 157 + second;
        }

        @Override
        public boolean equals(Object right) {
            if (right instanceof IntPair) {
                IntPair r = (IntPair) right;
                return r.first == first && r.second == second;
            }
            return false;
        }

        // This method is the key part: when the framework sorts the keys,
        // it calls this compareTo method.
        @Override
        public int compareTo(IntPair o) {
            if (first != o.first) {
                return Integer.compare(first, o.first);
            }
            return Integer.compare(second, o.second);
        }
    }

    /**
     * Partitions map output by the first field only, so every record with
     * the same original key is sent to the same Reducer.
     * @author king-pan
     */
    public static class FirstPartitioner extends Partitioner<IntPair, IntWritable> {
        @Override
        public int getPartition(IntPair key, IntWritable value, int numPartitions) {
            return Math.abs(key.getFirst() * 127) % numPartitions;
        }
    }

    public static class GroupingComparator extends WritableComparator {
        protected GroupingComparator() {
            super(IntPair.class, true);
        }

        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            IntPair i1 = (IntPair) a;
            IntPair i2 = (IntPair) b;
            // Group by the first field only.
            return Integer.compare(i1.getFirst(), i2.getFirst());
        }
    }

    public static void main(String[] args) {
        try {
            int result = ToolRunner.run(new Configuration(), new SecondSortMapReduce(), args);
            System.exit(result);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
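Once compiled and packaged, the job can be launched through ToolRunner from the command line. A sketch, where secondsort.jar and the two HDFS paths are placeholders of my choosing:

hadoop jar secondsort.jar com.mscncn.hadoop.sort.SecondSortMapReduce /input/secondsort /output/secondsort

As usual with FileOutputFormat, the output directory must not exist before the run.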
Note that this program doubles as a general template for MapReduce jobs. Due to length limits, this tutorial ends here.