As mentioned earlier, in cluster mode submitJob ends up calling JobTracker's submitJob.
JobTracker.submitJob() starts a new Job. Internally, the method creates a JobInProgress object, which holds two objects: a JobProfile and a JobStatus. JobProfile's members carry the MapReduce job's attribute information:
String user;          // user who submitted the job
final JobID jobid;    // unique job identifier
String jobFile;       // path to the job's configuration file
String url;           // URL where the job's details can be viewed
String name;          // user-visible job name
String queueName;     // queue the job was submitted to
JobStatus holds information about the Job's runtime state:
// run-state constants for the runState field below
public static final int RUNNING = 1;
public static final int SUCCEEDED = 2;
public static final int FAILED = 3;
public static final int PREP = 4;
public static final int KILLED = 5;

private JobID jobid;
private float mapProgress;      // fraction of map tasks completed, 0.0f to 1.0f
private float reduceProgress;   // fraction of reduce tasks completed
private float cleanupProgress;  // progress of the cleanup task
private float setupProgress;    // progress of the setup task
private int runState;           // one of the constants above
private long startTime;
private String user;
private JobPriority priority;
private String schedulingInfo = "NA";
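These progress fields are what a client sees when it polls a submitted job. As a minimal sketch (hypothetical usage, assuming the classic org.apache.hadoop.mapred API and an already-submitted job whose id is passed as args[0]):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class PollJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();             // picks up mapred.job.tracker etc.
    JobClient client = new JobClient(conf);
    // args[0] is a job id string such as "job_200901010000_0001" (hypothetical)
    RunningJob running = client.getJob(JobID.forName(args[0]));
    while (!running.isComplete()) {
      // mapProgress()/reduceProgress() surface JobStatus's progress fields
      System.out.printf("map %3.0f%%  reduce %3.0f%%%n",
          running.mapProgress() * 100, running.reduceProgress() * 100);
      Thread.sleep(1000);
    }
  }
}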
Now let's look at JobTracker's submitJob. Because it is accessed concurrently by multiple threads, almost all of JobTracker's methods are declared synchronized.
public synchronized JobStatus submitJob(JobID jobId) throws IOException {
  // jobs is a Map<JobID, JobInProgress> holding all currently running Jobs
  if (jobs.containsKey(jobId)) {
    // the job is already running; return its status directly
    return jobs.get(jobId).getStatus();
  }
  JobInProgress job = new JobInProgress(jobId, this, this.conf);
  String queue = job.getProfile().getQueueName();
  // QueueManager provides the MapReduce framework's queue information
  if (!(queueManager.getQueues().contains(queue))) {
    new CleanupQueue().addToQueue(conf, getSystemDirectoryForJob(jobId));
    throw new IOException("Queue \"" + queue + "\" does not exist");
  }
  // check for access
  try {
    checkAccess(job, QueueManager.QueueOperation.SUBMIT_JOB);
  } catch (IOException ioe) {
    LOG.warn("Access denied for user " + job.getJobConf().getUser()
             + ". Ignoring job " + jobId, ioe);
    new CleanupQueue().addToQueue(conf, getSystemDirectoryForJob(jobId));
    throw ioe;
  }
  // Check the job if it cannot run in the cluster because of invalid memory
  // requirements.
  try {
    checkMemoryRequirements(job);
  } catch (IOException ioe) {
    new CleanupQueue().addToQueue(conf, getSystemDirectoryForJob(jobId));
    throw ioe;
  }
  // Add the job to the JobTracker; this is the core logic of job submission.
  return addJob(jobId, job);
}
One interesting detail: on every failure path above, new CleanupQueue().addToQueue(conf, getSystemDirectoryForJob(jobId)) is called. CleanupQueue deletes files and directories (MapReduce creates many temporary files); internally, the queue is implemented as a singleton.
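A simplified sketch of that pattern (my own illustration, not the actual Hadoop source): a single daemon thread drains a blocking queue of paths and deletes them, so callers such as submitJob can hand off cleanup work without blocking on filesystem I/O.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class PathCleanupQueue {
  // the singleton instance owns the queue and its cleaner thread
  private static final PathCleanupQueue INSTANCE = new PathCleanupQueue();
  private final BlockingQueue<java.io.File> paths = new LinkedBlockingQueue<>();

  private PathCleanupQueue() {
    Thread cleaner = new Thread(() -> {
      try {
        while (true) {
          delete(paths.take());   // block until a path is enqueued
        }
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    }, "path-cleanup");
    cleaner.setDaemon(true);
    cleaner.start();
  }

  static void addToQueue(java.io.File path) {
    INSTANCE.paths.add(path);     // returns immediately; deletion is asynchronous
  }

  private static void delete(java.io.File f) {
    java.io.File[] children = f.listFiles();
    if (children != null) {
      for (java.io.File c : children) delete(c);
    }
    f.delete();
  }
}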
private synchronized JobStatus addJob(JobID jobId, JobInProgress job) {
  totalSubmissions++;
  // Do you know why another synchronized is needed here? (Hint: jobs is also
  // locked on directly by other threads, outside the JobTracker monitor.)
  synchronized (jobs) {
    synchronized (taskScheduler) {
      jobs.put(job.getProfile().getJobID(), job);
      // JobInProgressListeners observe a JobInProgress over its lifecycle;
      // schedulers register listeners so jobAdded() feeds them new jobs
      for (JobInProgressListener listener : jobInProgressListeners) {
        try {
          listener.jobAdded(job);
        } catch (IOException ioe) {
          LOG.warn("Failed to add and so skipping the job : "
                   + job.getJobID() + ". Exception : " + ioe);
        }
      }
    }
  }
  /* JobTrackerInstrumentation is like a dashboard of metrics; the class is
   * closer to an interface, as its methods are mostly empty. Hadoop provides
   * JobTrackerMetricsInst, which extends JobTrackerInstrumentation; its
   * submitJob simply increments the submitted-jobs counter.
   */
  myInstrumentation.submitJob(job.getJobConf(), jobId);
  return job.getStatus();
}
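To make the listener mechanism concrete, here is a hedged sketch of a custom JobInProgressListener that merely logs lifecycle events. The LoggingJobListener class is my illustration; note that JobInProgressListener and friends are package-private, which is why real schedulers (such as the fair scheduler) live in the org.apache.hadoop.mapred package.

package org.apache.hadoop.mapred;  // required: the listener types are package-private

import java.io.IOException;

// Hypothetical listener, for illustration only
class LoggingJobListener extends JobInProgressListener {
  @Override
  public void jobAdded(JobInProgress job) throws IOException {
    System.out.println("job added: " + job.getJobID());
  }
  @Override
  public void jobRemoved(JobInProgress job) {
    System.out.println("job removed: " + job.getJobID());
  }
  @Override
  public void jobUpdated(JobChangeEvent event) {
    System.out.println("job updated: " + event.getJobInProgress().getJobID());
  }
}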
As described earlier, when submitJobInternal (in JobClient) submits a Job, it does so through the submitJob method of jobSubmitClient (of type JobSubmissionProtocol). jobSubmitClient differs between local mode and cluster mode. In cluster mode, JobClient establishes an RPC connection to the JobTracker:
this.jobSubmitClient = createRPCProxy(JobTracker.getAddress(conf), conf); // address configured via mapred.job.tracker
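For reference, a hypothetical snippet showing where JobTracker.getAddress(conf) gets its value (the host and port are placeholders):

JobConf conf = new JobConf();
// mapred.job.tracker holds the JobTracker's host:port; the value "local" selects local mode instead
conf.set("mapred.job.tracker", "jobtracker.example.com:9001");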
createRPCProxy obtains a proxy object through Hadoop's RPC mechanism. Communication between JobClient and JobTracker requires a protocol: JobSubmissionProtocol.
private JobSubmissionProtocol createRPCProxy(InetSocketAddress addr,
    Configuration conf) throws IOException {
  return (JobSubmissionProtocol) RPC.getProxy(JobSubmissionProtocol.class,
      JobSubmissionProtocol.versionID, addr, getUGI(conf), conf,
      NetUtils.getSocketFactory(conf, JobSubmissionProtocol.class));
}
getProxy builds a client-side proxy for talking to the JobTracker; the proxy implements the specified protocol. The method also performs version negotiation:
public static VersionedProtocol getProxy(Class<?> protocol,
    long clientVersion, InetSocketAddress addr, UserGroupInformation ticket,
    Configuration conf, SocketFactory factory) throws IOException {
  // a dynamic proxy: every call on the returned object goes through Invoker,
  // which serializes the call and sends it to the server
  VersionedProtocol proxy =
      (VersionedProtocol) Proxy.newProxyInstance(
          protocol.getClassLoader(), new Class[] { protocol },
          new Invoker(addr, ticket, conf, factory));
  // version negotiation: the first RPC asks the server for its protocol version
  long serverVersion = proxy.getProtocolVersion(protocol.getName(),
      clientVersion);
  if (serverVersion == clientVersion) {
    return proxy;
  } else {
    throw new VersionMismatch(protocol.getName(), clientVersion,
        serverVersion);
  }
}
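To see what Proxy.newProxyInstance is doing here, consider this self-contained sketch (my own illustration, not Hadoop code): an InvocationHandler plays the role of Hadoop's Invoker and intercepts every call made through the interface. A real Invoker would serialize the method name and arguments, send them over the socket, and block for the server's reply instead of answering locally.

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

interface Greeter {
  String greet(String name);
}

public class ProxyDemo {
  public static void main(String[] args) {
    Greeter g = (Greeter) Proxy.newProxyInstance(
        Greeter.class.getClassLoader(),
        new Class<?>[] { Greeter.class },
        new InvocationHandler() {
          @Override
          public Object invoke(Object proxy, Method method, Object[] a) {
            // Hadoop's Invoker would marshal method + args into an RPC
            // request here and wait for the server's response
            return "hello, " + a[0] + " (via " + method.getName() + ")";
          }
        });
    System.out.println(g.greet("world"));  // prints: hello, world (via greet)
  }
}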