[Hadoop Source Code Walkthrough] (6) MapReduce: The MapTask Class

The MapTask class extends the Task class. Its most important method is run(), which executes the map task.

run() first creates and starts a TaskReporter, then calls JobConf.getUseNewMapper() to determine whether the new API is in use (how this flag gets set was covered in [Hadoop Source Code Walkthrough] (3) MapReduce: The Job Class). It then calls the initialize() method inherited from Task to initialize the task. Depending on the task type, it runs runJobCleanupTask(), runJobSetupTask(), runTaskCleanupTask(), or the actual Mapper; when running the Mapper it chooses between the new and old MapReduce APIs based on that configuration flag.

  @Override
  public void run(final JobConf job, final TaskUmbilicalProtocol umbilical) 
    throws IOException, ClassNotFoundException, InterruptedException {
    this.umbilical = umbilical;

    // start thread that will handle communication with parent
    TaskReporter reporter = new TaskReporter(getProgress(), umbilical,
        jvmContext);
    reporter.startCommunicationThread();
    boolean useNewApi = job.getUseNewMapper();  // comes from the JobConf; the new-API JobContext wraps a JobConf.
                                                // Job.setUseNewAPI() is called from Job.submit(), so a job submitted
                                                // via waitForCompletion()/submit() has this flag set to use the new API.
    initialize(job, getJobID(), reporter, useNewApi);  // per-task initialization (jobContext, taskContext,
                                                       // output path, etc.), done in Task.initialize()

    // check if it is a cleanupJobTask
    if (jobCleanup) {
      runJobCleanupTask(umbilical, reporter);
      return;
    }
    if (jobSetup) {
      runJobSetupTask(umbilical, reporter);
      return;
    }
    if (taskCleanup) {
      runTaskCleanupTask(umbilical, reporter);
      return;
    }

    if (useNewApi) {  // run the Mapper with either the new or the old MapReduce API
      runNewMapper(job, splitMetaInfo, umbilical, reporter);
    } else {
      runOldMapper(job, splitMetaInfo, umbilical, reporter);
    }
    done(umbilical, reporter);
  }

runNewMapper handles the new-API MapReduce path, while runOldMapper handles the old API.

 

runNewMapper first creates the TaskAttemptContext, the Mapper, the InputFormat, the InputSplit, and the RecordReader. It then creates an output collector depending on whether the job has reduce tasks: NewDirectOutputCollector when there are no reducers, NewOutputCollector when there are. Next it calls input.initialize() to prepare the RecordReader for reading (input path and so on). Finally it reaches the key call, mapper.run(), which is the Mapper.run() method discussed in [Hadoop Source Code Walkthrough] (2) MapReduce: The Mapper Class; that method reads the K/V pairs one by one, and this is where the pieces connect.

  @SuppressWarnings("unchecked")
  private <INKEY,INVALUE,OUTKEY,OUTVALUE>
  void runNewMapper(final JobConf job,
                    final TaskSplitIndex splitIndex,
                    final TaskUmbilicalProtocol umbilical,
                    TaskReporter reporter
                    ) throws IOException, ClassNotFoundException,
                             InterruptedException {
    // make a task context so we can get the classes
    org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
      new org.apache.hadoop.mapreduce.TaskAttemptContext(job, getTaskID());
    // make a mapper
    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
      (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
        ReflectionUtils.newInstance(taskContext.getMapperClass(), job);
    // make the input format
    org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
      (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
        ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
    // rebuild the input split
    org.apache.hadoop.mapreduce.InputSplit split = null;
    split = getSplitDetails(new Path(splitIndex.getSplitLocation()),
        splitIndex.getStartOffset());

    org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
      new NewTrackingRecordReader<INKEY,INVALUE>
          (split, inputFormat, reporter, job, taskContext);

    job.setBoolean("mapred.skip.on", isSkipping());
    org.apache.hadoop.mapreduce.RecordWriter output = null;
    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context 
         mapperContext = null;
    try {
      Constructor<org.apache.hadoop.mapreduce.Mapper.Context> contextConstructor =
        org.apache.hadoop.mapreduce.Mapper.Context.class.getConstructor
        (new Class[]{org.apache.hadoop.mapreduce.Mapper.class,
                     Configuration.class,
                     org.apache.hadoop.mapreduce.TaskAttemptID.class,
                     org.apache.hadoop.mapreduce.RecordReader.class,
                     org.apache.hadoop.mapreduce.RecordWriter.class,
                     org.apache.hadoop.mapreduce.OutputCommitter.class,
                     org.apache.hadoop.mapreduce.StatusReporter.class,
                     org.apache.hadoop.mapreduce.InputSplit.class});

      // get an output object
      if (job.getNumReduceTasks() == 0) {
         output =
           new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
      } else {
        output = new NewOutputCollector(taskContext, job, umbilical, reporter);
      }

      mapperContext = contextConstructor.newInstance(mapper, job, getTaskID(),
                                                     input, output, committer,
                                                     reporter, split);

      input.initialize(split, mapperContext);
      mapper.run(mapperContext);
      input.close();
      output.close(mapperContext);
    } catch (NoSuchMethodException e) {
      throw new IOException("Can't find Context constructor", e);
    } catch (InstantiationException e) {
      throw new IOException("Can't create Context", e);
    } catch (InvocationTargetException e) {
      throw new IOException("Can't invoke Context constructor", e);
    } catch (IllegalAccessException e) {
      throw new IOException("Can't invoke Context constructor", e);
    }
  }

Which Mapper class actually runs is normally the one we configure with job.setMapperClass(SelectGradeMapper.class). How is that setting retrieved, and what is the default? Trace the calls below.

MapTask.runNewMapper()
  => taskContext.getMapperClass()                    // taskContext is a TaskAttemptContext; used when runNewMapper creates the mapper
  => JobContext.getMapperClass()
  => JobConf.getClass(MAP_CLASS_ATTR, Mapper.class)
  => Configuration.getClass(name, defaultValue)

Following this call chain, the default turns out to be Mapper.class, and the lookup of the configured class is clear at a glance.
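For completeness, JobContext.getMapperClass() itself is just a thin wrapper over that configuration lookup; in the 1.x-era source it reads roughly as follows (a sketch from memory; MAP_CLASS_ATTR corresponds to the "mapreduce.map.class" attribute):

  @SuppressWarnings("unchecked")
  public Class<? extends Mapper<?,?,?,?>> getMapperClass()
      throws ClassNotFoundException {
    // delegate straight to Configuration.getClass() with Mapper.class as the default
    return (Class<? extends Mapper<?,?,?,?>>)
      conf.getClass(MAP_CLASS_ATTR, Mapper.class);
  }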

Now take a closer look at Configuration.getClass():

  public Class<?> getClass(String name, Class<?> defaultValue) {
    String valueString = get(name);
    if (valueString == null)
      return defaultValue;
    try {
      return getClassByName(valueString);
    } catch (ClassNotFoundException e) {
      throw new RuntimeException(e);
    }
  }

It first checks whether the property has been set; if so, it calls getClassByName() to load and return the corresponding class, otherwise it returns the default.
Once the Mapper finishes, the task closes the RecordReader, the output collector, and the other resources, and it is done.
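To make that behavior concrete, here is a small standalone sketch; the property name "demo.mapper.class" and the GetClassDemo class are made up for illustration, and TokenCounterMapper is just a convenient Mapper subclass that ships with Hadoop:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;

public class GetClassDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Nothing stored under "demo.mapper.class": the default is returned.
    Class<?> c1 = conf.getClass("demo.mapper.class", Mapper.class);
    System.out.println(c1);  // class org.apache.hadoop.mapreduce.Mapper

    // After setClass(), getClass() resolves the stored class name via getClassByName().
    conf.setClass("demo.mapper.class", TokenCounterMapper.class, Mapper.class);
    Class<?> c2 = conf.getClass("demo.mapper.class", Mapper.class);
    System.out.println(c2);  // class org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper
  }
}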

 

Let's also focus on mapper.run(mapperContext) in runNewMapper() above. As noted earlier for Mapper.class, this mapperContext is used both to read K/V pairs from the input split and to write out the resulting K/V pairs. From

      mapperContext = contextConstructor.newInstance(mapper, job, getTaskID(),
                                                     input, output, committer,
                                                     reporter, split);

we can see that this Context is wired up with the Mapper, RecordReader, RecordWriter, committer, and split we configured.

 

The map() method of a Mapper emits output by repeatedly calling context.write(K, V). Let's see how that call works; first, the Context class hierarchy:

[Figure: Context class hierarchy]

The write() method comes from TaskInputOutputContext:

  public void write(KEYOUT key, VALUEOUT value
                    ) throws IOException, InterruptedException {
    output.write(key, value);
  }

It delegates to RecordWriter.write(). RecordWriter is an abstract class that mainly defines the write() and close() contract:

public abstract class RecordWriter<K, V> {

  public abstract void write(K key, V value
                             ) throws IOException, InterruptedException;

  public abstract void close(TaskAttemptContext context
                             ) throws IOException, InterruptedException;
}
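To make the contract concrete, a minimal, hypothetical RecordWriter might look like the sketch below; it is not part of Hadoop and just prints each pair to standard output:

import java.io.IOException;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical illustration: a RecordWriter that prints "key<TAB>value" to stdout.
public class StdoutRecordWriter<K, V> extends RecordWriter<K, V> {

  @Override
  public void write(K key, V value) throws IOException, InterruptedException {
    System.out.println(key + "\t" + value);   // a real writer would stream to HDFS
  }

  @Override
  public void close(TaskAttemptContext context)
      throws IOException, InterruptedException {
    System.out.flush();                       // release any underlying resources here
  }
}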

Now look at one RecordWriter implementation used by MapTask itself: NewOutputCollector, an inner class of MapTask:

  private class NewOutputCollector<K,V>
    extends org.apache.hadoop.mapreduce.RecordWriter<K,V> {
    private final MapOutputCollector<K,V> collector;
    private final org.apache.hadoop.mapreduce.Partitioner<K,V> partitioner;
    private final int partitions;

    @SuppressWarnings("unchecked")
    NewOutputCollector(org.apache.hadoop.mapreduce.JobContext jobContext,
                       JobConf job,
                       TaskUmbilicalProtocol umbilical,
                       TaskReporter reporter
                       ) throws IOException, ClassNotFoundException {
      collector = new MapOutputBuffer<K,V>(umbilical, job, reporter);
      partitions = jobContext.getNumReduceTasks();
      if (partitions > 0) {
        partitioner = (org.apache.hadoop.mapreduce.Partitioner<K,V>)
          ReflectionUtils.newInstance(jobContext.getPartitionerClass(), job);
      } else {
        partitioner = new org.apache.hadoop.mapreduce.Partitioner<K,V>() {
          @Override
          public int getPartition(K key, V value, int numPartitions) {
            return -1;
          }
        };
      }
    }

    @Override
    public void write(K key, V value) throws IOException, InterruptedException {
      collector.collect(key, value,
                        partitioner.getPartition(key, value, partitions));
    }

    @Override
    public void close(TaskAttemptContext context
                      ) throws IOException,InterruptedException {
      try {
        collector.flush();
      } catch (ClassNotFoundException cnf) {
        throw new IOException("can't find class ", cnf);
      }
      collector.close();
    }
  }

Its write() method shows how context.write(K, V) ends up at collector.collect(K, V, partition). Note that the output needs a Partitioner's getPartition() to decide which partition the current K/V pair belongs to, because pairs in different partitions go to different reducers. The default Partitioner is HashPartitioner, it can be overridden, and the number of reduce tasks determines the number of partitions.
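For reference, the stock HashPartitioner is essentially a one-liner over Object.hashCode(); a sketch of it:

// Sketch of org.apache.hadoop.mapreduce.lib.partition.HashPartitioner
import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitioner<K, V> extends Partitioner<K, V> {

  // Mask off the sign bit so negative hash codes still map to a valid partition,
  // then spread keys across the reduce tasks by modulo.
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}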

 

From NewOutputCollector we can see that it is essentially a wrapper around MapOutputBuffer. MapOutputBuffer already existed in the old API; it is complex but crucial, so we set it aside for now — its job is to collect the output K/V pairs. It implements the MapOutputCollector interface:

  interface MapOutputCollector<K, V> {
    public void collect(K key, V value, int partition
                        ) throws IOException, InterruptedException;
    public void close() throws IOException, InterruptedException;
    public void flush() throws IOException, InterruptedException, 
                               ClassNotFoundException;
  }

This interface tells us that any collector must implement collect(), close(), and flush().
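As a toy illustration of that contract (hypothetical, and nothing like the real MapOutputBuffer, which uses an in-memory byte buffer, sorting, and spills to disk), a collector could simply bucket pairs by partition:

import java.io.IOException;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory collector: one list of key/value pairs per partition.
public class InMemoryCollector<K, V> {

  private final Map<Integer, List<Map.Entry<K, V>>> partitions = new HashMap<>();

  public void collect(K key, V value, int partition)
      throws IOException, InterruptedException {
    partitions.computeIfAbsent(partition, p -> new ArrayList<>())
              .add(new AbstractMap.SimpleEntry<>(key, value));
  }

  public void flush() throws IOException, InterruptedException {
    // The real collector sorts, optionally combines, and spills each partition to disk here.
  }

  public void close() throws IOException, InterruptedException {
    partitions.clear();
  }
}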

Let's look at the simpler one first: NewDirectOutputCollector. It is used when there are no reduce tasks. It mainly obtains the RecordWriter from the job's OutputFormat and then writes records out directly with that RecordWriter's write() method, which is how the output format we configured comes into play.

  private class NewDirectOutputCollector<K,V>
  extends org.apache.hadoop.mapreduce.RecordWriter<K,V> {
    private final org.apache.hadoop.mapreduce.RecordWriter out;

    private final TaskReporter reporter;

    private final Counters.Counter mapOutputRecordCounter;
    private final Counters.Counter fileOutputByteCounter; 
    private final Statistics fsStats;

    @SuppressWarnings("unchecked")
    NewDirectOutputCollector(org.apache.hadoop.mapreduce.JobContext jobContext,
        JobConf job, TaskUmbilicalProtocol umbilical, TaskReporter reporter) 
    throws IOException, ClassNotFoundException, InterruptedException {
      this.reporter = reporter;
      Statistics matchedStats = null;
      if (outputFormat instanceof org.apache.hadoop.mapreduce.lib.output.FileOutputFormat) { 
        // outputFormat is a field inherited from Task (inner class accessing the outer class's member)
        matchedStats = getFsStatistics(org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
            .getOutputPath(jobContext), job);
      }
      fsStats = matchedStats;
      mapOutputRecordCounter = 
        reporter.getCounter(MAP_OUTPUT_RECORDS);
      fileOutputByteCounter = reporter
          .getCounter(org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.Counter.BYTES_WRITTEN);

      long bytesOutPrev = getOutputBytes(fsStats);
      out = outputFormat.getRecordWriter(taskContext); // the key line: get the RecordWriter from the configured OutputFormat
      long bytesOutCurr = getOutputBytes(fsStats);
      fileOutputByteCounter.increment(bytesOutCurr - bytesOutPrev);
    }

    @Override
    @SuppressWarnings("unchecked")
    public void write(K key, V value) 
    throws IOException, InterruptedException {
      reporter.progress();  // report progress
      long bytesOutPrev = getOutputBytes(fsStats);
      out.write(key, value);  // write one record via the RecordWriter obtained from the configured OutputFormat
      long bytesOutCurr = getOutputBytes(fsStats);
      fileOutputByteCounter.increment(bytesOutCurr - bytesOutPrev);  // update output byte count
      mapOutputRecordCounter.increment(1);      // update output K/V record count
    }

    @Override
    public void close(TaskAttemptContext context) 
    throws IOException,InterruptedException {
      reporter.progress();
      if (out != null) {
        long bytesOutPrev = getOutputBytes(fsStats);
        out.close(context);
        long bytesOutCurr = getOutputBytes(fsStats);
        fileOutputByteCounter.increment(bytesOutCurr - bytesOutPrev);
      }
    }

    private long getOutputBytes(Statistics stats) {
      return stats == null ? 0 : stats.getBytesWritten();
    }
  }

There is also the old MapReduce API path driven by runOldMapper(), which we will not discuss here.

from:  http://blog.csdn.net/posa88/article/details/7956767
