Hadoop's ChainMapper/ChainReducer
ChainMapper/ChainReducer were introduced mainly to support linear chains of Mappers.
ChainMapper:
/**
 * The ChainMapper class allows to use multiple Mapper classes within a single
 * Map task.
 */
public class ChainMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
    extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  /**
   * @param job              The job.
   * @param klass            the Mapper class to add.
   * @param inputKeyClass    mapper input key class.
   * @param inputValueClass  mapper input value class.
   * @param outputKeyClass   mapper output key class.
   * @param outputValueClass mapper output value class.
   * @param mapperConf
   */
  public static void addMapper(Job job, Class<? extends Mapper> klass,
      Class<?> inputKeyClass, Class<?> inputValueClass,
      Class<?> outputKeyClass, Class<?> outputValueClass,
      Configuration mapperConf) throws IOException {
    job.setMapperClass(ChainMapper.class);
    job.setMapOutputKeyClass(outputKeyClass);
    job.setMapOutputValueClass(outputValueClass);
    Chain.addMapper(true, job, klass, inputKeyClass, inputValueClass,
        outputKeyClass, outputValueClass, mapperConf);
  }
}
ChainReducer:
public class ChainReducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
    extends Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  /**
   * @param job              the job
   * @param klass            the Reducer class to add.
   * @param inputKeyClass    reducer input key class.
   * @param inputValueClass  reducer input value class.
   * @param outputKeyClass   reducer output key class.
   * @param outputValueClass reducer output value class.
   * @param reducerConf
   */
  public static void setReducer(Job job, Class<? extends Reducer> klass,
      Class<?> inputKeyClass, Class<?> inputValueClass,
      Class<?> outputKeyClass, Class<?> outputValueClass,
      Configuration reducerConf) {
    job.setReducerClass(ChainReducer.class);
    job.setOutputKeyClass(outputKeyClass);
    job.setOutputValueClass(outputValueClass);
    Chain.setReducer(job, klass, inputKeyClass, inputValueClass,
        outputKeyClass, outputValueClass, reducerConf);
  }

  public static void addMapper(Job job, Class<? extends Mapper> klass,
      Class<?> inputKeyClass, Class<?> inputValueClass,
      Class<?> outputKeyClass, Class<?> outputValueClass,
      Configuration mapperConf) throws IOException {
    job.setOutputKeyClass(outputKeyClass);
    job.setOutputValueClass(outputValueClass);
    Chain.addMapper(false, job, klass, inputKeyClass, inputValueClass,
        outputKeyClass, outputValueClass, mapperConf);
  }
}
In other words, multiple Mappers can exist within the Map or Reduce phase. Like a Linux pipe, the output of one Mapper is redirected directly into the input of the next, forming a pipeline of the form [MAP+ REDUCE MAP*].
A typical ChainMapper/ChainReducer scenario works as follows:
In the Map phase, the data is processed first by Mapper1 and then by Mapper2; in the Reduce phase, after shuffle and sort, the data is handed to the corresponding Reducer. The Reducer's output, however, is not written directly to HDFS; it is handed to yet another Mapper, and that Mapper's output is what gets written to the final HDFS output directory.
For any single MapReduce job, any number of Mappers can be chained into the Map and Reduce phases, but there can be at most one Reducer.
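For instance, a chain following the [MAP+ REDUCE MAP*] pattern might be assembled in the driver roughly as follows. This is a minimal sketch using the new-API ChainMapper/ChainReducer shown above; AMap, BMap, SumReduce, and CMap are placeholder classes, not part of the original example.

// Sketch of a [MAP+ REDUCE MAP*] chain. AMap, BMap, SumReduce and CMap are
// hypothetical Mapper/Reducer implementations; the type arguments of each call
// must match the output types of the previous stage.
ChainMapper.addMapper(job, AMap.class, LongWritable.class, Text.class,
    Text.class, Text.class, new Configuration(false));
ChainMapper.addMapper(job, BMap.class, Text.class, Text.class,
    Text.class, IntWritable.class, new Configuration(false));
ChainReducer.setReducer(job, SumReduce.class, Text.class, IntWritable.class,
    Text.class, IntWritable.class, new Configuration(false));
// Post-reduce mapper: runs on the Reducer's output inside the same reduce task.
ChainReducer.addMapper(job, CMap.class, Text.class, IntWritable.class,
    Text.class, IntWritable.class, new Configuration(false));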
Users add Mappers to the Map/Reduce phase by calling addMapper. In the old mapred API, this method takes eight parameters: the job configuration, the Mapper class, the Mapper's input key type, input value type, output key type, output value type, a flag indicating whether key/value pairs are passed by value, and the Mapper's own configuration. For example:
ChainMapper.addMapper(job, AMap.class, LongWritable.class, Text.class, Text.class, Text.class, true, mapAConf);
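The call above uses the old org.apache.hadoop.mapred.lib API, which is where the boolean byValue flag lives. With the new org.apache.hadoop.mapreduce.lib.chain.ChainMapper shown earlier, the same Mapper would presumably be added without that flag (a sketch; mapAConf would then be a Configuration rather than a JobConf):

// New-API form of the same call: seven parameters, no byValue flag.
ChainMapper.addMapper(job, AMap.class, LongWritable.class, Text.class,
    Text.class, Text.class, mapAConf);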
This flag exists mainly because, after Mapper.map() calls OutputCollector.collect(key, value), it may continue to reuse the key and value objects; if they were modified downstream, subtle bugs could result. To guard against the OutputCollector (that is, the next Mapper in the chain) modifying key/value directly, ChainMapper lets the user choose how key/value pairs are passed. If the user is certain that key/value will not be modified, pass-by-reference can be used; otherwise pass-by-value should be chosen. Note that pass-by-reference avoids object copies and is therefore faster, but it is only safe when key/value are guaranteed never to be modified.
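A minimal sketch of the hazard, using the old mapred API that the byValue flag belongs to (the mapper below is a hypothetical example, not from the original post): the value object is reused after collect(), so a downstream mapper that received it by reference would see it change.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical upstream mapper, illustrating why pass-by-value is the safe
// default: the same IntWritable instance is mutated after every collect().
public class ReusingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final IntWritable count = new IntWritable();

  @Override
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    count.set(1);
    output.collect(value, count);  // handed on to the next chained mapper
    count.set(99);                 // reuse after collect(): with pass-by-reference,
                                   // the downstream mapper could observe 99, not 1
  }
}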
Implementation analysis
The key technique behind ChainMapper/ChainReducer is to swap out the Mapper's and Reducer's output stream, redirecting results that would otherwise be written to a file into the next Mapper. Because output is managed by an OutputCollector, ChainMapper/ChainReducer has to supply its own OutputCollector implementation that performs this redirection.
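To illustrate the idea (a simplified sketch only, not Hadoop's actual Chain internals), such a collector merely has to forward every collect() call to the next Mapper's map() method:

import java.io.IOException;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch: an OutputCollector that, instead of writing to the task output,
// feeds each record straight into the next Mapper in the chain.
public class ChainingCollector<K1, V1, K2, V2> implements OutputCollector<K1, V1> {
  private final Mapper<K1, V1, K2, V2> nextMapper;   // next link in the chain
  private final OutputCollector<K2, V2> nextOutput;  // where the next mapper writes
  private final Reporter reporter;

  public ChainingCollector(Mapper<K1, V1, K2, V2> nextMapper,
      OutputCollector<K2, V2> nextOutput, Reporter reporter) {
    this.nextMapper = nextMapper;
    this.nextOutput = nextOutput;
    this.reporter = reporter;
  }

  @Override
  public void collect(K1 key, V1 value) throws IOException {
    // Redirection: what would have been written out becomes the next map() call.
    nextMapper.map(key, value, nextOutput, reporter);
  }
}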
Although a chained job adds several Mappers to the Map and Reduce phases, it is still a single MapReduce job and therefore has exactly one corresponding JobConf object. When the user calls addMapper, however, each newly added Mapper may be given its own dedicated JobConf. ChainMapper/ChainReducer therefore serializes these JobConf objects and stores them inside the job's JobConf. When the chained job starts executing, it first deserializes each Mapper's JobConf, constructs the corresponding Mapper and Reducer objects, and adds them to the data structures mappers (of type List<Mapper>) and reducer (of type Reducer).
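As a rough illustration of this approach (hypothetical property name and helper class, not Hadoop's actual internal code), a per-Mapper Configuration can be stashed inside the job's Configuration because Configuration itself is Writable:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Base64;
import org.apache.hadoop.conf.Configuration;

// Sketch: serialize each chained mapper's Configuration under an indexed,
// hypothetical key in the job Configuration, and read it back at task startup.
public class ChainConfSketch {
  private static final String KEY_FMT = "example.chain.mapper.conf.%d"; // hypothetical key

  public static void saveMapperConf(Configuration jobConf, int index,
      Configuration mapperConf) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    mapperConf.write(new DataOutputStream(bytes));  // Configuration is Writable
    jobConf.set(String.format(KEY_FMT, index),
        Base64.getEncoder().encodeToString(bytes.toByteArray()));
  }

  public static Configuration loadMapperConf(Configuration jobConf, int index)
      throws IOException {
    byte[] raw = Base64.getDecoder()
        .decode(jobConf.get(String.format(KEY_FMT, index)));
    Configuration mapperConf = new Configuration(false);
    mapperConf.readFields(new DataInputStream(new ByteArrayInputStream(raw)));
    return mapperConf;
  }
}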
Test data:
hadoop|9
spark|2
storm|4
spark|1
kafka|2
tachyon|2
flume|2
flume|2
redis|4
spark|4
hive|3
hbase|4
hbase|2
zookeeper|2
oozie|3
mongodb|3
With the first chain configuration, the result is:
flume 4
hadoop 9
hbase 6
hive 3
kafka 2
mongodb 3
oozie 3
redis 4
spark 7
storm 4
tachyon 2
zookeeper 2
With the second chain configuration (the one implemented in the code below), the result is:
hadoop 9
As you can see, hadoop, hbase, and spark should all have satisfied the condition, yet only hadoop was output. That is because hadoop|9 is the only record in the original input data that satisfies the condition on its own: in this configuration the filtering Mapper runs in the map-side chain, before the values are aggregated, and the key/value pairs are not modified along the way.
Code:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ChainMapperChainReducer {

  public static void main(String[] args)
      throws IOException, ClassNotFoundException, InterruptedException {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage <Input> <Output>");
      System.exit(2);
    }
    Job job = Job.getInstance(conf, ChainMapperChainReducer.class.getSimpleName());
    job.setJarByClass(ChainMapperChainReducer.class);

    // First mapper in the Map phase: parse "name|count" records.
    ChainMapper.addMapper(job, MyMapper1.class, LongWritable.class, Text.class,
        Text.class, IntWritable.class, new Configuration(false));
    // Reducer: sum the counts for each key.
    ChainReducer.setReducer(job, MyReducer1.class, Text.class, IntWritable.class,
        Text.class, IntWritable.class, new Configuration(false));
    // Note: ChainMapper.addMapper appends MyMapper2 to the map-side chain (after
    // MyMapper1), so the >= 5 filter runs on the raw records, before aggregation.
    ChainMapper.addMapper(job, MyMapper2.class, Text.class, IntWritable.class,
        Text.class, IntWritable.class, new Configuration(false));

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }

  // Splits "name|count" records into (name, count) pairs.
  public static class MyMapper1 extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final IntWritable in = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] spl = value.toString().split("\\|");
      if (spl.length == 2) {
        in.set(Integer.parseInt(spl[1].trim()));
        context.write(new Text(spl[0].trim()), in);
      }
    }
  }

  // Sums the counts for each key.
  public static class MyReducer1 extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable in = new IntWritable();

    @Override
    protected void reduce(Text k2, Iterable<IntWritable> v2s, Context context)
        throws IOException, InterruptedException {
      int uv = 0;
      for (IntWritable v2 : v2s) {
        uv += v2.get();
      }
      in.set(uv);
      context.write(k2, in);
    }
  }

  // Keeps only pairs whose count is at least 5.
  public static class MyMapper2 extends Mapper<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void map(Text key, IntWritable value, Context context)
        throws IOException, InterruptedException {
      if (value.get() >= 5) {
        context.write(key, value);
      }
    }
  }
}
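If the intent is to filter the aggregated totals instead, MyMapper2 would presumably be attached after the reducer with ChainReducer.addMapper rather than ChainMapper.addMapper, so that it sees the summed values (hadoop 9, hbase 6, spark 7) rather than the raw records. A sketch of the alternative wiring in the driver:

// Alternative wiring (sketch): run MyMapper2 on the Reducer's output within the
// same reduce task, so the >= 5 filter applies to the aggregated sums.
ChainReducer.setReducer(job, MyReducer1.class, Text.class, IntWritable.class,
    Text.class, IntWritable.class, new Configuration(false));
ChainReducer.addMapper(job, MyMapper2.class, Text.class, IntWritable.class,
    Text.class, IntWritable.class, new Configuration(false));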