Hadoop的Job提交流程简析（一）

Hadoop为使用者提供了三种提交作业的方法，提供了三种这样的API，之所以有三种不同的方法，是因为Hadoop在其历史上有新老两个API，以及一个变通的的方法，这三种方式分别是：
1、JobClient.runJob(): 调用由JobClient类提供的方法runJob()，这是老的API。
2、Job.waitForCompletion(): 调用由Job类提供的方法waitForCompletion()，属于新API。
3、ToolRunner.run()：调用由ToolRunner类提供的run()方法，这属于变通的方法。

由于老API在现在已经很少使用了，只是为了维护老的版本所以代码中还保留着老的API。所以我们目前主要使用的是第2中API，Hadoop源码中也给出了它的示例程序：hadoop-mapreduce-project下的hadoop-mapreduce-examples下的WordCount.java

关于WordCount类的具体分析可以参考另一篇文章：(https://www.jianshu.com/p/3136a9fa84ed)

WordCount类中定义了两个类：TokenizerMapper 和 IntSumReducer。前者是对Mapper类的map方法进行了覆写，后者是对reducer类的reduce方法进行了覆写。如果提交的代码中没有覆写map和reduce方法，那么hadoop会自动使用自己默认的map reduce方法。

省略具体的map reduce的实现过程

以下的WordCount的main方法的代码：

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount  [...] ");
      System.exit(2);
    }
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job,
      new Path(otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }

可以看到，程序新建了一个Job对象，调用其waitForCompletion方法进行作业的提交，该调用要等到方法的返回true才结束。System.exit(job.waitForCompletion(true) ? 0 : 1);的意思是如果返回true，那么程序返回0，System.exit。如果返回false，那么程序返回1，此时系统就知道了程序运行失败。

所以我们大致知道了Hadoop的mapreduce的job是怎么提交的，下面让我们看看Job类的具体实现：

public class Job extends JobContextImpl implements JobContext{}

在Job类里面定义了许多静态属性，
例如：public static enum JobState {DEFINE, RUNNING}定义了一个枚举属性JobState，DEFINE和RUNNING分别表示了job处于准备阶段和运行阶段。
private static final long MAX_JOBSTATUS_AGE = 1000 * 2 MAX_JOBSTATUS_AGE设置为2000，表示最多2000毫秒就要刷新这个作业的状态。
public static final String OUTPUT_FILTER = "org.apache.hadoop.mapreduce.client.output.filter" OUTPUT_FILTER表示在mapreduce的输出是否还需要一层过滤，如果需要的话以org.apache.hadoop.mapreduce.client.output.filter为键名去XML配置文件中查询。

其中我们最为关心的是waitForCompletion方法：

  public boolean waitForCompletion(boolean verbose
                                   ) throws IOException, InterruptedException,
                                            ClassNotFoundException {
    if (state == JobState.DEFINE) {//确认Job的状态，提交后变为Running
      submit();//通过Job.submit()提交流程
    }
    if (verbose) {//提交之后监控流程，直到结束
      monitorAndPrintJob();//周期性的报告作业进展
    } else {
      // get the completion poll interval from the client.
      int completionPollIntervalMillis = //周期性的询问作业是否已完成
        Job.getCompletionPollInterval(cluster.getConf());
      while (!isComplete()) {//询问作业是否完成
        try {
          Thread.sleep(completionPollIntervalMillis);//未完成，线程休眠一会
        } catch (InterruptedException ie) {//
        }
      }
    }
    return isSuccessful();
  }

该代码实际上就是对Job.submit的调用，只是在调用之前检查一下本作业是否处于DEFINE状态，以确保作业不会被提交多次。在作业提交成功之后就将其改为RUNNING。

在正常情况下，Job.submit()很快就会返回，因为这个方法的作用只是把作业提交上去，而无需等待作业的执行和完成，但是Job.waitForCompletion()则到等到程序完成以后才会返回。在等待期间，如果参数verbose为true，就要周期的报告作业执行的进展，或者就是周期的检测作业是否已经完成。

而后，就是对Job.sumbit()以后的第二段操作了。

Hadoop的Job提交流程简析（一）

你可能感兴趣的:(Hadoop的Job提交流程简析（一）)