I recently read the book Data-Intensive Text Processing with MapReduce, which is about how to design MapReduce programs. One of its examples is the design pattern for Combiners, so I tried implementing it myself. The specific problem is as follows:
The input data is:
one 3.9
one 4.0
one 3.8
two 44
two 44
two 44
three 9898
four 2323
four 2323
five 2323
six 23
six 2323
four 232
five 2323
The book's basic pseudocode for computing the per-key mean, without a combiner, is:

class Mapper
    method Map(string t, integer r)
        Emit(string t, integer r)

class Reducer
    method Reduce(string t, integers [r1, r2, ...])
        sum ← 0
        cnt ← 0
        for all integer r ∈ integers [r1, r2, ...] do
            sum ← sum + r
            cnt ← cnt + 1
        ravg ← sum / cnt
        Emit(string t, integer ravg)

How would a combiner be added here? Is the Combiner simply the same class as the Reducer? (In the maximum-temperature example it happens to be, but here it is not, and many real-world cases are not either.) If the Reducer were reused as the Combiner, the computation would degenerate into the following incorrect operation:
Mean(1, 2, 3, 4, 5) = Mean(Mean(1, 2), Mean(3, 4, 5))

This equality does not hold: the left side is 3, while the right side is Mean(1.5, 4) = 2.75, because the mean of means weights the two groups equally regardless of their sizes. The correct pseudocode is as follows (excerpted from the book):
class Mapper
    method Map(string t, integer r)
        Emit(string t, pair (r, 1))

class Combiner
    method Combine(string t, pairs [(s1, c1), (s2, c2), ...])
        sum ← 0
        cnt ← 0
        for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), ...] do
            sum ← sum + s
            cnt ← cnt + c
        Emit(string t, pair (sum, cnt))

class Reducer
    method Reduce(string t, pairs [(s1, c1), (s2, c2), ...])
        sum ← 0
        cnt ← 0
        for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), ...] do
            sum ← sum + s
            cnt ← cnt + c
        ravg ← sum / cnt
        Emit(string t, integer ravg)

Because a Combiner's input and output must have the same format — its input must match the Mapper's output format, and its output must match the Reducer's input format — the pseudocode above carries (sum, count) pairs all the way through. That is why pairs appear everywhere. The Java code written from this pseudocode is given below:
Driver:
package org.fansy.date922;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class AverageDriver3 {
    public static void main(String[] args) throws Exception {
        Configuration conf1 = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf1, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: AverageDriver <in> <out>");
            System.exit(2);
        }
        Job job1 = new Job(conf1, "AverageDriver job");
        // Each input line is split into a Text key and a Text value (by default at the first tab).
        job1.setInputFormatClass(KeyValueTextInputFormat.class);
        job1.setNumReduceTasks(1);
        job1.setJarByClass(AverageDriver3.class);
        job1.setMapperClass(AverageM3.class);      // Mapper emits (Text, TextPair)
        job1.setMapOutputKeyClass(Text.class);
        job1.setMapOutputValueClass(TextPair.class);
        job1.setCombinerClass(AverageC3.class);    // Combiner keeps the (sum, count) pair format
        job1.setReducerClass(AverageR3.class);     // Reducer emits (Text, DoubleWritable)
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(DoubleWritable.class);
        KeyValueTextInputFormat.addInputPath(job1, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job1, new Path(otherArgs[1]));
        if (!job1.waitForCompletion(true)) {
            System.exit(1); // exit on job failure
        }
        System.out.println("************************");
    }
}

Mapper:
package org.fansy.date922;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AverageM3 extends Mapper<Text, Text, Text, TextPair> {

    private TextPair newvalue = new TextPair();
    private DoubleWritable r = new DoubleWritable();
    private IntWritable number = new IntWritable(1);

    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        System.out.println(key.toString()); // debug output
        // Parse the numeric value and emit (key, (value, 1)) so counts can be merged later.
        double shuzhi = Double.parseDouble(value.toString());
        r.set(shuzhi);
        newvalue.set(r, number);
        context.write(key, newvalue);
    }
}

Combiner:
package org.fansy.date922;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageC3 extends Reducer<Text, TextPair, Text, TextPair> {

    private DoubleWritable newvalued = new DoubleWritable();
    private IntWritable newvaluei = new IntWritable();
    private TextPair newvalue = new TextPair();

    public void reduce(Text key, Iterable<TextPair> values, Context context)
            throws IOException, InterruptedException {
        // Pre-aggregate: add up the partial sums and counts, but do NOT divide here.
        double sum = 0.0;
        int num = 0;
        for (TextPair val : values) {
            sum += val.getFirst().get();
            num += val.getSecond().get();
        }
        newvalued.set(sum);
        newvaluei.set(num);
        newvalue.set(newvalued, newvaluei);
        context.write(key, newvalue);
    }
}

Reducer:
package org.fansy.date922;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageR3 extends Reducer<Text, TextPair, Text, DoubleWritable> {

    private DoubleWritable newvalue = new DoubleWritable();

    public void reduce(Text key, Iterable<TextPair> values, Context context)
            throws IOException, InterruptedException {
        // Merge the (sum, count) pairs and divide only here, once everything is aggregated.
        double sum = 0.0;
        int num = 0;
        for (TextPair val : values) {
            sum += val.getFirst().get();
            num += val.getSecond().get();
        }
        double aver = sum / num;
        newvalue.set(aver);
        context.write(key, newvalue);
    }
}
TextPair:

package org.fansy.date922;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;

public class TextPair implements WritableComparable<TextPair> {

    private DoubleWritable first;  // partial sum
    private IntWritable second;    // partial count

    public TextPair() {
        set(new DoubleWritable(), new IntWritable());
    }

    public void set(DoubleWritable first, IntWritable second) {
        this.first = first;
        this.second = second;
    }

    public DoubleWritable getFirst() {
        return first;
    }

    public IntWritable getSecond() {
        return second;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override
    public int compareTo(TextPair o) {
        int cmp = first.compareTo(o.first);
        if (cmp != 0) {
            return cmp;
        }
        return second.compareTo(o.second);
    }
}
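With the five classes above in place, the job can be packaged into a jar and submitted roughly like this (the jar name and HDFS paths are only placeholders for illustration; the run shown below actually used the local job runner):

hadoop jar average.jar org.fansy.date922.AverageDriver3 /user/fansy/input /user/fansy/output

For the 14 sample records listed earlier, the final output should contain one average per key, approximately:

five    2323.0
four    1626.0
one     3.9
six     1173.0
three   9898.0
two     44.0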
Running the job locally completes successfully, and the counters confirm that the combiner actually ran: Combine input records=14, Combine output records=6, so the reducer only receives 6 records.

12/09/22 15:55:45 INFO mapred.JobClient: Job complete: job_local_0001
12/09/22 15:55:45 INFO mapred.JobClient: Counters: 22
12/09/22 15:55:45 INFO mapred.JobClient:   File Output Format Counters
12/09/22 15:55:45 INFO mapred.JobClient:     Bytes Written=65
12/09/22 15:55:45 INFO mapred.JobClient:   FileSystemCounters
12/09/22 15:55:45 INFO mapred.JobClient:     FILE_BYTES_READ=466
12/09/22 15:55:45 INFO mapred.JobClient:     HDFS_BYTES_READ=244
12/09/22 15:55:45 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=82758
12/09/22 15:55:45 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=65
12/09/22 15:55:45 INFO mapred.JobClient:   File Input Format Counters
12/09/22 15:55:45 INFO mapred.JobClient:     Bytes Read=122
12/09/22 15:55:45 INFO mapred.JobClient:   Map-Reduce Framework
12/09/22 15:55:45 INFO mapred.JobClient:     Map output materialized bytes=118
12/09/22 15:55:45 INFO mapred.JobClient:     Map input records=14
12/09/22 15:55:45 INFO mapred.JobClient:     Reduce shuffle bytes=0
12/09/22 15:55:45 INFO mapred.JobClient:     Spilled Records=12
12/09/22 15:55:45 INFO mapred.JobClient:     Map output bytes=231
12/09/22 15:55:45 INFO mapred.JobClient:     Total committed heap usage (bytes)=301727744
12/09/22 15:55:45 INFO mapred.JobClient:     CPU time spent (ms)=0
12/09/22 15:55:45 INFO mapred.JobClient:     SPLIT_RAW_BYTES=108
12/09/22 15:55:45 INFO mapred.JobClient:     Combine input records=14
12/09/22 15:55:45 INFO mapred.JobClient:     Reduce input records=6
12/09/22 15:55:45 INFO mapred.JobClient:     Reduce input groups=6
12/09/22 15:55:45 INFO mapred.JobClient:     Combine output records=6
12/09/22 15:55:45 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
12/09/22 15:55:45 INFO mapred.JobClient:     Reduce output records=6
12/09/22 15:55:45 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
12/09/22 15:55:45 INFO mapred.JobClient:     Map output records=14
************************

The book actually goes on to describe one more pattern at the end, in-mapper combining, which I did not fully understand. Its pseudocode is:
class Mapper
    method Initialize
        S ← new AssociativeArray
        C ← new AssociativeArray
    method Map(string t, integer r)
        S{t} ← S{t} + r
        C{t} ← C{t} + 1
    method Close
        for all term t ∈ S do
            Emit(term t, pair (S{t}, C{t}))
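The following is not from the book or the code above, just a minimal sketch of how that in-mapper combining pseudocode could be written against the same TextPair type and job setup, assuming the per-key state fits in the mapper's memory (the class name AverageInMapperM3 is made up for illustration):

package org.fansy.date922;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical in-mapper combining version of AverageM3.
// Instead of emitting (value, 1) for every record, the mapper keeps per-key running sums and
// counts in memory (the S and C associative arrays of the pseudocode) and emits them once,
// in cleanup(), which plays the role of the pseudocode's Close method.
public class AverageInMapperM3 extends Mapper<Text, Text, Text, TextPair> {

    private Map<String, Double> sums;
    private Map<String, Integer> counts;

    @Override
    protected void setup(Context context) {
        // corresponds to "method Initialize"
        sums = new HashMap<String, Double>();
        counts = new HashMap<String, Integer>();
    }

    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // corresponds to "method Map": only update the in-memory aggregates, emit nothing yet
        String t = key.toString();
        double r = Double.parseDouble(value.toString());
        Double s = sums.get(t);
        Integer c = counts.get(t);
        sums.put(t, s == null ? r : s + r);
        counts.put(t, c == null ? 1 : c + 1);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // corresponds to "method Close": emit one (sum, count) pair per distinct key
        for (Map.Entry<String, Double> entry : sums.entrySet()) {
            String t = entry.getKey();
            TextPair pair = new TextPair();
            pair.set(new DoubleWritable(entry.getValue()), new IntWritable(counts.get(t)));
            context.write(new Text(t), pair);
        }
    }
}

With a mapper like this, setCombinerClass(AverageC3.class) would no longer be needed and AverageR3 could stay unchanged; the trade-off is that the mapper must hold one entry per distinct key in memory until the task finishes.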