《Hadoop The Definitive Guide》ch06 How MapReduce Works

1. How MapReduce Works

Four independent entities are involved:

1) The client, which submits the MapReduce job (a minimal client-side driver sketch is given after the figure below).

2) The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.

3) The tasktrackers, which run the tasks that the job has been split into. A tasktracker is a Java application whose main class is TaskTracker.

4) The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.


[Figure 1]
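To make the client's role concrete, here is a minimal sketch of a driver program using the classic org.apache.hadoop.mapred API covered in this chapter. The class name and the input/output paths are hypothetical; no mapper or reducer is set, so Hadoop's identity mapper and reducer are used.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitExample {
    public static void main(String[] args) throws Exception {
        // Step 1: the client builds the job configuration.
        // No mapper or reducer is set, so the identity mapper and
        // reducer are used and the job simply copies its input.
        JobConf conf = new JobConf(SubmitExample.class);
        conf.setJobName("submit-example");

        // Hypothetical HDFS paths.
        FileInputFormat.addInputPath(conf, new Path("/user/hadoop/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/hadoop/output"));

        // runJob() calls JobClient.submitJob() internally and then polls
        // the jobtracker for progress until the job completes.
        JobClient.runJob(conf);
    }
}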

2. The job submission process implemented by JobClient's submitJob() method is as follows:

a.  Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).
b.  Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
c.  Computes the input splits for the job. If the splits cannot be computed, because the input paths don’t exist, for example, then the job is not submitted and an error is thrown to the MapReduce program.
d.  Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker’s filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3).
e.  Tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker) (step 4).
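The same submission can also be driven asynchronously. The sketch below, again with hypothetical paths, uses JobClient.submitJob() directly, which carries out steps a–e and returns a RunningJob handle without waiting for completion; it also shows where the mapred.submit.replication property from step d would be set (10 is already the default).

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class AsyncSubmitExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(AsyncSubmitExample.class);
        conf.setJobName("async-submit-example");

        // Hypothetical paths; step b fails the submission if the output
        // directory already exists.
        FileInputFormat.addInputPath(conf, new Path("/user/hadoop/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/hadoop/output"));

        // Replication factor used for the job JAR copied in step d;
        // 10 is already the default value.
        conf.setInt("mapred.submit.replication", 10);

        // submitJob() performs steps a-e and returns immediately with a
        // handle to the running job (unlike runJob(), which also waits
        // for the job to finish).
        JobClient client = new JobClient(conf);
        RunningJob job = client.submitJob(conf);
        System.out.println("Submitted job " + job.getID());
    }
}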

3.  The relationship of the Streaming and Pipes executables to the tasktracker and its child process

[Figure 2]
