Hive on Spark Source Code Analysis (1): SparkTask
Hive on Spark Source Code Analysis (2): SparkSession and HiveSparkClient
Hive on Spark Source Code Analysis (3): SparkClient and SparkClientImpl (Part 1)
Hive on Spark Source Code Analysis (4): SparkClient and SparkClientImpl (Part 2)
Hive on Spark Source Code Analysis (5): RemoteDriver
Hive on Spark Source Code Analysis (6): RemoteSparkJobMonitor and JobHandle
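RemoteSparkJobMonitor monitors the execution state of a remote RSC job. It polls the job's status in a loop until the job succeeds, fails, or is killed, and prints the current status to the console. Its main method is startMonitor, which sets the return code according to how the job ends. The call chain is: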
SparkTask.execute => rc=jobRef.monitorJob() => RemoteSparkJobRef.monitorJob() => return remoteSparkJobMonitor.startMonitor()
The main parts of this method are as follows:
1. RemoteSparkJobMonitor polls the job's execution state in a loop:
JobHandle.State state = sparkJobStatus.getRemoteJobState();
public JobHandle.State getRemoteJobState() {
  return jobHandle.getState();
}
2. A job has six states: SENT, QUEUED, STARTED, SUCCEEDED, FAILED, and CANCELLED. Each state is handled differently, except that SENT and QUEUED share the same logic:
    switch (state) {
    case SENT:
    case QUEUED:
      long timeCount = (System.currentTimeMillis() - startTime) / 1000;
      // monitorTimeoutInteval holds the value of hive.spark.job.monitor.timeout from the configuration
      if ((timeCount > monitorTimeoutInteval)) {
        console.printError("Job hasn't been submitted after " + timeCount + "s." +
            " Aborting it.\nPossible reasons include network issues, " +
            "errors in remote driver or the cluster has no available resources, etc.\n" +
            "Please check YARN or Spark driver's logs for further information.\nReason1 from RemoteSparkJobMonitor");
        console.printError("Status: " + state);
        running = false;
        done = true;
        rc = 2;
      }
      break;
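If the job has still not been submitted once the configured timeout has passed, the return code is set to 2 and a message reports that the job was not submitted after timeCount seconds; otherwise the monitor keeps waiting.
If the state is STARTED, the job information and execution progress are printed: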
    case STARTED:
      JobExecutionStatus sparkJobState = sparkJobStatus.getState();
      if (sparkJobState == JobExecutionStatus.RUNNING) {
        Map<String, SparkStageProgress> progressMap = sparkJobStatus.getSparkStageProgress();
        if (!running) {
          perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.SPARK_SUBMIT_TO_RUNNING);
          printAppInfo();
          // print job stages.
          console.printInfo("\nQuery Hive on Spark job["
              + sparkJobStatus.getJobId() + "] stages:");
          for (int stageId : sparkJobStatus.getStageIds()) {
            console.printInfo(Integer.toString(stageId));
          }
          console.printInfo("\nStatus: Running (Hive on Spark job["
              + sparkJobStatus.getJobId() + "])");
          running = true;
          console.printInfo("Job Progress Format\nCurrentTime StageId_StageAttemptId: "
              + "SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]");
        }
        printStatus(progressMap, lastProgressMap);
        lastProgressMap = progressMap;
      }
      break;
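The printStatus method, which prints the task information, is inherited from the parent class SparkJobMonitor. It computes the total number of tasks and the numbers of running, completed, and failed tasks, and prints them in a formatted line.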
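To make the progress format concrete, here is a minimal sketch of this kind of aggregation and formatting. It is not the code of printStatus itself: StageProgress is a hypothetical stand-in for SparkStageProgress, ProgressFormatter is an invented name, and the layout only follows the "SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount" pattern printed by the monitor, so the real output may differ in detail.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ProgressFormatter {
    /** Hypothetical stand-in for SparkStageProgress: only the four task counters are modeled. */
    static final class StageProgress {
        final int total;
        final int succeeded;
        final int running;
        final int failed;

        StageProgress(int total, int succeeded, int running, int failed) {
            this.total = total;
            this.succeeded = succeeded;
            this.running = running;
            this.failed = failed;
        }
    }

    /** Sums the counters over all stages: {total, succeeded, running, failed}. */
    static int[] totals(Map<String, StageProgress> progressMap) {
        int[] sums = new int[4];
        for (StageProgress p : progressMap.values()) {
            sums[0] += p.total;
            sums[1] += p.succeeded;
            sums[2] += p.running;
            sums[3] += p.failed;
        }
        return sums;
    }

    /** Renders one stage as "key: Succeeded(+Running-Failed)/Total". */
    static String formatStage(String stageKey, StageProgress p) {
        StringBuilder sb = new StringBuilder(stageKey).append(": ").append(p.succeeded);
        if (p.running > 0 || p.failed > 0) {
            sb.append('(');
            if (p.running > 0) sb.append('+').append(p.running);
            if (p.failed > 0) sb.append('-').append(p.failed);
            sb.append(')');
        }
        return sb.append('/').append(p.total).toString();
    }

    public static void main(String[] args) {
        // Keys follow the StageId_StageAttemptId convention from the monitor's output.
        Map<String, StageProgress> progressMap = new LinkedHashMap<>();
        progressMap.put("0_0", new StageProgress(10, 3, 2, 1));
        progressMap.put("1_0", new StageProgress(4, 4, 0, 0));
        for (Map.Entry<String, StageProgress> e : progressMap.entrySet()) {
            System.out.println(formatStage(e.getKey(), e.getValue()));
        }
        int[] t = totals(progressMap);
        System.out.println("total=" + t[0] + " succeeded=" + t[1]
            + " running=" + t[2] + " failed=" + t[3]);
    }
}
```

With the two sample stages, the first is rendered as 0_0: 3(+2-1)/10 and the second, which has nothing running or failed, simply as 1_0: 4/4.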
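The remaining cases are similar. If the state is SUCCEEDED, the final progress and a success message are printed, and the return code is left unchanged (it defaults to 0, meaning success). If the state is FAILED, an error message is printed and the return code is set to 3.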
    case SUCCEEDED:
      Map<String, SparkStageProgress> progressMap = sparkJobStatus.getSparkStageProgress();
      printStatus(progressMap, lastProgressMap);
      lastProgressMap = progressMap;
      double duration = (System.currentTimeMillis() - startTime) / 1000.0;
      console.printInfo("Status: Finished successfully in "
          + String.format("%.2f seconds", duration));
      running = false;
      done = true;
      break;
    case FAILED:
      console.printError("Status: Failed");
      running = false;
      done = true;
      rc = 3;
      break;
    }
If the job has not finished yet, the monitor sleeps for checkInterval and then fetches the job state again:
    if (!done) {
      Thread.sleep(checkInterval);
    }
  } catch (Exception e) {
    String msg = " with exception '" + Utilities.getNameMessage(e) + "'";
    msg = "Failed to monitor Job[ " + sparkJobStatus.getJobId() + "]" + msg;
    // Has to use full name to make sure it does not conflict with
    // org.apache.commons.lang.StringUtils
    LOG.error(msg, e);
    console.printError(msg, "\n" + org.apache.hadoop.util.StringUtils.stringifyException(e));
    rc = 1;
    done = true;
  } finally {
    if (done) {
      break;
    }
  }
}
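Putting the pieces together, the control flow of startMonitor can be sketched as a small self-contained polling loop. This is only an illustration under stated assumptions, not the Hive implementation: MonitorLoopSketch and its monitor method are hypothetical names, the job state comes from a plain Supplier instead of sparkJobStatus, the timeout is measured in milliseconds rather than seconds, and the handling of CANCELLED is an assumption because the excerpts above do not show that case.

```java
import java.util.Arrays;
import java.util.function.Supplier;

final class MonitorLoopSketch {
    enum State { SENT, QUEUED, STARTED, CANCELLED, FAILED, SUCCEEDED }

    /**
     * Polls the state source until the job ends or the submit timeout expires.
     * Return codes mirror the excerpts: 0 success, 1 monitoring error,
     * 2 submit timeout, 3 job failure.
     */
    static int monitor(Supplier<State> stateSource, long timeoutMillis, long checkIntervalMillis) {
        long startTime = System.currentTimeMillis();
        int rc = 0;
        boolean done = false;
        while (!done) {
            try {
                State state = stateSource.get();
                switch (state) {
                case SENT:
                case QUEUED:
                    // Still waiting for submission: give up once the timeout has passed.
                    if (System.currentTimeMillis() - startTime > timeoutMillis) {
                        rc = 2;
                        done = true;
                    }
                    break;
                case STARTED:
                    // The real monitor prints stage progress here.
                    break;
                case SUCCEEDED:
                    done = true;
                    break;
                case FAILED:
                    rc = 3;
                    done = true;
                    break;
                default:
                    // CANCELLED: assumption only, this case is not shown in the excerpts.
                    rc = 3;
                    done = true;
                    break;
                }
                if (!done) {
                    Thread.sleep(checkIntervalMillis);
                }
            } catch (Exception e) {
                // Any failure while monitoring ends the loop with rc = 1.
                rc = 1;
                done = true;
            }
        }
        return rc;
    }

    public static void main(String[] args) {
        Supplier<State> happyPath = Arrays.asList(
            State.SENT, State.QUEUED, State.STARTED, State.SUCCEEDED).iterator()::next;
        System.out.println("rc=" + monitor(happyPath, 60_000L, 1L));
    }
}
```

In this sketch a job that never leaves SENT ends with rc = 2 once the timeout has elapsed, and an exception thrown by the state source ends the loop with rc = 1, matching the catch block shown above.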
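JobHandle can be thought of as a handle to a job, used to monitor and control a running remote job. Let us first look at two special structures defined in the JobHandle interface, starting with State: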
static enum State {
  SENT,
  QUEUED,
  STARTED,
  CANCELLED,
  FAILED,
  SUCCEEDED;
}
The second is Listener, a callback interface for job events:
static interface Listener<T extends Serializable> {
  void onJobQueued(JobHandle<T> job);
  void onJobStarted(JobHandle<T> job);
  void onJobCancelled(JobHandle<T> job);
  void onJobFailed(JobHandle<T> job, Throwable cause);
  void onJobSucceeded(JobHandle<T> job, T result);
  /**
   * Called when a monitored Spark job is started on the remote context. This callback
   * does not indicate a state change in the client job's status.
   */
  void onSparkJobStarted(JobHandle<T> job, int sparkJobId);
}
JobHandleImpl is the implementation of JobHandle. Its constructor sets the initial state to SENT, and most of its methods either delegate to the wrapped promise or return a field:
JobHandleImpl(SparkClientImpl client, Promise<T> promise, String jobId) {
  this.client = client;
  this.jobId = jobId;
  this.promise = promise;
  this.listeners = Lists.newLinkedList();
  this.metrics = new MetricsCollection();
  this.sparkJobIds = new CopyOnWriteArrayList<Integer>();
  this.state = State.SENT;
  this.sparkCounters = null;
}
/** Requests a running job to be cancelled. */
@Override
public boolean cancel(boolean mayInterrupt) {
  if (changeState(State.CANCELLED)) {
    client.cancel(jobId);
    promise.cancel(mayInterrupt);
    return true;
  }
  return false;
}
@Override
public T get() throws ExecutionException, InterruptedException {
  return promise.get();
}
@Override
public T get(long timeout, TimeUnit unit)
    throws ExecutionException, InterruptedException, TimeoutException {
  return promise.get(timeout, unit);
}
@Override
public boolean isCancelled() {
  return promise.isCancelled();
}
@Override
public boolean isDone() {
  return promise.isDone();
}
@Override
public MetricsCollection getMetrics() {
  return metrics;
}
@Override
public List<Integer> getSparkJobIds() {
  return sparkJobIds;
}
@Override
public SparkCounters getSparkCounters() {
  return sparkCounters;
}
@Override
public State getState() {
  return state;
}
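Next comes a rather important method, changeState, through which state is modified:
A change is applied only if the new state's ordinal is greater than the current state's ordinal and the current state's ordinal is less than that of CANCELLED, that is, the job has not yet reached a final state. For example, if the current state is SENT and newState is QUEUED, the change is applied; if the current state is QUEUED and newState is SENT, it is rejected. After the state has changed, fireStateChange invokes the callback that corresponds to the new state on every registered listener.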
boolean changeState(State newState) {
  synchronized (listeners) {
    if (newState.ordinal() > state.ordinal() && state.ordinal() < State.CANCELLED.ordinal()) {
      state = newState;
      for (Listener l : listeners) {
        fireStateChange(newState, l);
      }
      return true;
    }
    return false;
  }
}
private void fireStateChange(State s, Listener l) {
  switch (s) {
  case SENT:
    break;
  case QUEUED:
    l.onJobQueued(this);
    LOG.debug("liban: onJobQueued in fireStateChange.");
    break;
  case STARTED:
    l.onJobStarted(this);
    LOG.debug("liban: onJobStarted in fireStateChange.");
    break;
  case CANCELLED:
    l.onJobCancelled(this);
    break;
  case FAILED:
    l.onJobFailed(this, promise.cause());
    break;
  case SUCCEEDED:
    try {
      l.onJobSucceeded(this, promise.get());
    } catch (Exception e) {
      // Shouldn't really happen.
      throw new IllegalStateException(e);
    }
    break;
  default:
    throw new IllegalStateException();
  }
}
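The ordinal rule in changeState is easy to try out in isolation. The following is a minimal sketch that reuses the same condition and the same enum order as JobHandle.State; StateTransitionSketch is a hypothetical class, and recording event names in a list stands in for the real listener callbacks fired by fireStateChange.

```java
import java.util.ArrayList;
import java.util.List;

final class StateTransitionSketch {
    // Same declaration order as JobHandle.State, so the ordinals match.
    enum State { SENT, QUEUED, STARTED, CANCELLED, FAILED, SUCCEEDED }

    private State state = State.SENT;
    // Stand-in for listener callbacks: one entry per accepted transition.
    private final List<String> events = new ArrayList<>();

    /** Applies the transition only if it moves forward from a non-final state. */
    synchronized boolean changeState(State newState) {
        if (newState.ordinal() > state.ordinal()
            && state.ordinal() < State.CANCELLED.ordinal()) {
            state = newState;
            events.add("on" + newState);
            return true;
        }
        return false;
    }

    synchronized State getState() {
        return state;
    }

    synchronized List<String> getEvents() {
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        StateTransitionSketch handle = new StateTransitionSketch();
        System.out.println(handle.changeState(State.QUEUED));    // forward move
        System.out.println(handle.changeState(State.SENT));      // backward move
        System.out.println(handle.changeState(State.FAILED));    // forward move
        System.out.println(handle.changeState(State.SUCCEEDED)); // already final
        System.out.println(handle.getState() + " " + handle.getEvents());
    }
}
```

One consequence of the rule is that CANCELLED, FAILED, and SUCCEEDED are all final: once the state is FAILED, a later changeState(SUCCEEDED) is rejected even though SUCCEEDED has the larger ordinal, because the current state is no longer below CANCELLED.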