Once the program reaches Job.submit(), job submission enters its second phase. If we picture every node in a Hadoop cluster as an island, this is the moment the job "sets out to sea": cross-node operations begin.
The code of Job.submit():
public void submit()
    throws IOException, InterruptedException, ClassNotFoundException {
  ensureState(JobState.DEFINE);  // make sure the job has not been submitted already
  setUseNewAPI();                // decide from the configuration whether to use the new API
  connect();                     // connect to the cluster, creating the Cluster object "cluster"
  final JobSubmitter submitter =
      getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
  status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
    public JobStatus run() throws IOException, InterruptedException,
        ClassNotFoundException {
      return submitter.submitJobInternal(Job.this, cluster);
    }
  });
  state = JobState.RUNNING;
  LOG.info("The url to track the job: " + getTrackingURL());
}
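The submission itself runs inside ugi.doAs(...), so it executes under the submitting user's security context. The callback shape is the JDK's java.security.PrivilegedExceptionAction. Below is a minimal, self-contained sketch of that shape; the doAs() stand-in here is hypothetical and only mimics the calling convention of UserGroupInformation.doAs, without any actual credential switching:

```java
import java.security.PrivilegedExceptionAction;

// Minimal illustration of the callback shape used by ugi.doAs(...): the work is
// wrapped in a java.security.PrivilegedExceptionAction, and its run() method
// returns the result (in the real code, a JobStatus from submitJobInternal).
public class DoAsShapeSketch {
    // Stand-in for UserGroupInformation.doAs: the real method switches to the
    // submitting user's security context before invoking the action.
    static <T> T doAs(PrivilegedExceptionAction<T> action) throws Exception {
        return action.run();
    }

    public static void main(String[] args) throws Exception {
        // Real code returns submitter.submitJobInternal(Job.this, cluster) here.
        String status = doAs(() -> "RUNNING");
        System.out.println(status);
    }
}
```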
Job.setUseNewAPI() examines several configuration properties to decide whether this job uses the new MapReduce API or the old one, then generates the explicit properties "mapred.mapper.new-api" and "mapred.reducer.new-api" and writes them into the in-memory configuration.
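The decision logic can be sketched as follows. This is a simplified, hypothetical reconstruction, not Hadoop's actual code: the Map-based conf is a stand-in for org.apache.hadoop.conf.Configuration, and the assumption is that the presence of an old-API class key such as "mapred.mapper.class" signals that the old API should be kept:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of setUseNewAPI(): inspect the job configuration and
// record an explicit choice under "mapred.mapper.new-api" / "mapred.reducer.new-api".
public class NewApiConfigSketch {
    // Stand-in for the in-memory configuration block.
    static Map<String, String> conf = new HashMap<>();

    // If the user configured a mapper/reducer via the old-API keys, keep the
    // old API; otherwise mark the job as using the new API.
    static void setUseNewAPI() {
        boolean oldMapper = conf.containsKey("mapred.mapper.class");
        conf.put("mapred.mapper.new-api", Boolean.toString(!oldMapper));
        boolean oldReducer = conf.containsKey("mapred.reducer.class");
        conf.put("mapred.reducer.new-api", Boolean.toString(!oldReducer));
    }

    public static void main(String[] args) {
        setUseNewAPI(); // no old-API classes configured, so the new API is chosen
        System.out.println("mapper new-api = " + conf.get("mapred.mapper.new-api"));
    }
}
```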
connect() is what establishes the cross-node link to the outside world: it creates the Cluster object. The actual submission is carried out by JobSubmitter.submitJobInternal(); by the time that call returns, the job has been submitted, so the job state is then set to RUNNING.
Job.connect() code:
private synchronized void connect()
    throws IOException, InterruptedException, ClassNotFoundException {
  if (cluster == null) {  // only if the Cluster object has not been created yet
    cluster =
        ugi.doAs(new PrivilegedExceptionAction<Cluster>() {
          public Cluster run()
              throws IOException, InterruptedException,
                     ClassNotFoundException {
            return new Cluster(getConfiguration());
          }
        });
  }
}
So the job of connect() is simply to guarantee that the node holds a Cluster object, creating one if none exists yet. As its name suggests, the Cluster class should hold information about the cluster and know how to talk to it. Here is (part of) the Cluster class:
public class Cluster {
  private ClientProtocolProvider clientProtocolProvider; // in a YARN cluster, a YarnClientProtocolProvider
  private ClientProtocol client; // in cluster mode, the protocol for communicating with the outside world
  static {
    ConfigUtil.loadResources();  // static initialization of the class
  }
  private void initialize(InetSocketAddress jobTrackAddr, Configuration conf)
      throws IOException {}
  ...... omitted ......
}
The static initializer `static { ConfigUtil.loadResources(); }` loads a batch of .xml configuration files when the class is first loaded, mainly mapred-default.xml, mapred-site.xml, yarn-default.xml, and yarn-site.xml.
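The pattern at work here is Java's static initializer: the block runs exactly once, when the class is first touched. A self-contained sketch (the register() helper is a stand-in for what ConfigUtil.loadResources() does through Configuration; it is not a Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the static-initializer pattern used by Cluster: the first time the
// class is loaded, the block runs once and registers the default config files.
public class StaticInitSketch {
    static final List<String> resources = new ArrayList<>();

    // Stand-in for registering a default configuration resource.
    static void register(String name) {
        resources.add(name);
    }

    static { // runs exactly once, at class load time
        register("mapred-default.xml");
        register("mapred-site.xml");
        register("yarn-default.xml");
        register("yarn-site.xml");
    }

    public static void main(String[] args) {
        System.out.println(resources.size() + " resources registered");
    }
}
```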
The whole flow can be simplified to:
Job.submit() -> Job.connect() -> new Cluster() (the Cluster constructor) -> Cluster.initialize()
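The last step of that chain, Cluster.initialize(), tries each available ClientProtocolProvider in turn (Hadoop discovers them via the ServiceLoader mechanism) and keeps the first one that can create a client for the given configuration. A simplified, self-contained sketch of that selection loop; the two nested interfaces are stand-ins for the real Hadoop types, and the String conf parameter replaces a full Configuration:

```java
import java.io.IOException;
import java.util.List;

// Hedged sketch of Cluster.initialize(): ask each provider to create a client;
// the first provider that returns non-null wins, otherwise fail.
public class ClusterInitSketch {
    interface ClientProtocol {}              // stand-in for the RPC client protocol
    interface ClientProtocolProvider {
        ClientProtocol create(String conf);  // returns null if it cannot serve this conf
    }

    static ClientProtocol initialize(List<ClientProtocolProvider> providers, String conf)
            throws IOException {
        for (ClientProtocolProvider p : providers) {
            ClientProtocol client = p.create(conf);
            if (client != null) {
                return client;               // first matching provider wins
            }
        }
        throw new IOException("Cannot initialize Cluster: no provider accepted " + conf);
    }

    public static void main(String[] args) throws IOException {
        // A provider that only serves "yarn" configurations.
        ClientProtocolProvider yarn =
            conf -> "yarn".equals(conf) ? new ClientProtocol() {} : null;
        System.out.println(initialize(List.of(yarn), "yarn") != null);
    }
}
```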
How Hadoop runs a MapReduce job: