Today, after a MapReduce job finishes, we record per-task runtime information (status, cpu_time, etc.) in a backend DB, and our monitoring system raises alerts automatically based on those records. This information is mainly extracted from the history files produced when the job completes. Since we have upgraded to YARN, and MRv2 handles history files somewhat differently from MRv1, this post gives a brief analysis of both.
In MRv1, the most important class involved is org.apache.hadoop.mapred.JobHistory. While a job is running, jobId_conf.xml and the jobHistoryFile are first written to local storage on the JobTracker node, by default under hadoop.log.dir (typically /var/log/hadoop-0.20-mapreduce). After the job finishes, both files are moved into the DONE directory (hadoop.log.dir/history/done by default). Let's first look at the call flow.
The key method is moveToDone(JobID), which hands the move to a thread from a thread pool so it runs asynchronously. Before the move, a subdirectory is created under the DONE directory, named according to the rule:
&lt;JobTracker host&gt;_&lt;JT start time in milliseconds&gt;_/YYYY/MM/DD/&lt;serial string derived from the jobID&gt;/
The jobHistoryFile itself is named according to the rule:
jt-hostname_job-id_username_jobname
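To make the two naming rules concrete, here is a minimal sketch. All host, time, user, and job values are made up for illustration, and the serial string is passed in as an opaque placeholder because its exact derivation from the jobID is not shown here:

```java
public class Mrv1HistoryNaming {

    // Build the DONE subdirectory path per the rule above.
    // serialPart stands in for the string MRv1 derives from the jobID.
    static String doneSubdir(String jtHost, long jtStartMs, String yyyy,
                             String mm, String dd, String serialPart) {
        return jtHost + "_" + jtStartMs + "_/" + yyyy + "/" + mm + "/" + dd
                + "/" + serialPart + "/";
    }

    // Build the jobHistoryFile name per the rule above.
    static String historyFileName(String jtHost, String jobId, String user,
                                  String jobName) {
        return jtHost + "_" + jobId + "_" + user + "_" + jobName;
    }

    public static void main(String[] args) {
        // All values below are hypothetical
        System.out.println(doneSubdir("jt01", 1387441200000L,
                "2013", "12", "19", "000123"));
        System.out.println(historyFileName("jt01", "job_201312190000_0042",
                "hadoop", "wordcount"));
    }
}
```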
We mainly extract information from the jobHistoryFile: task type, completion status, the node a task ran on, CPU time, input record count, spilled record count, and so on. The jobHistoryFile also contains all counter information.
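As a sketch of what ends up in our DB, a per-task record might look like the class below. The field set and names are ours, not anything Hadoop defines; the comments note which Hadoop counters the numeric fields typically come from:

```java
public class TaskRecord {
    final String taskId;
    final String taskType;      // MAP or REDUCE
    final String status;        // e.g. SUCCEEDED, FAILED, KILLED
    final String host;          // node the task attempt ran on
    final long cpuTimeMs;       // from the CPU_MILLISECONDS counter
    final long inputRecords;    // MAP_INPUT_RECORDS / REDUCE_INPUT_RECORDS
    final long spilledRecords;  // from the SPILLED_RECORDS counter

    TaskRecord(String taskId, String taskType, String status, String host,
               long cpuTimeMs, long inputRecords, long spilledRecords) {
        this.taskId = taskId;
        this.taskType = taskType;
        this.status = status;
        this.host = host;
        this.cpuTimeMs = cpuTimeMs;
        this.inputRecords = inputRecords;
        this.spilledRecords = spilledRecords;
    }

    // A simple alerting predicate of the kind the monitoring system applies
    boolean failed() {
        return !"SUCCEEDED".equals(status);
    }

    public static void main(String[] args) {
        TaskRecord r = new TaskRecord("task_201312190000_0042_m_000003",
                "MAP", "FAILED", "dn07.example.com", 5300L, 120000L, 0L);
        System.out.println(r.taskId + " failed=" + r.failed());
    }
}
```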
After the upgrade to YARN, MRv2 handles this differently from MRv1. In MRv2, JobHistory runs as a standalone service, implemented by org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer, and the logs are written to HDFS. On startup, JobHistoryServer launches three services: the most important, jobHistoryService, loads and manages the job logs; clientService starts the webapp and the history server API; and, if log aggregation is enabled, aggLogDelService periodically deletes aggregated logs.
jobHistoryService is implemented by org.apache.hadoop.mapreduce.v2.hs.JobHistory; its startup flow is as follows:
```java
JobHistory.start() {
  scheduledExecutor = new ScheduledThreadPoolExecutor(2,
      new ThreadFactoryBuilder().setNameFormat("Log Scanner/Cleaner #%d")
          .build());
  // Run MoveIntermediateToDoneRunnable periodically, starting after
  // moveThreadInterval. In other words, when a job finishes its logs are
  // not moved to the done directory immediately; the move may be delayed
  // by up to moveThreadInterval.
  scheduledExecutor.scheduleAtFixedRate(new MoveIntermediateToDoneRunnable(),
      moveThreadInterval, moveThreadInterval, TimeUnit.MILLISECONDS);

  // Start the historyCleaner
  boolean startCleanerService = conf.getBoolean(
      JHAdminConfig.MR_HISTORY_CLEANER_ENABLE, true);
  if (startCleanerService) {
    long runInterval = conf.getLong(
        JHAdminConfig.MR_HISTORY_CLEANER_INTERVAL_MS,
        JHAdminConfig.DEFAULT_MR_HISTORY_CLEANER_INTERVAL_MS);
    // Run the HistoryCleaner periodically
    scheduledExecutor.scheduleAtFixedRate(new HistoryCleaner(),
        30 * 1000l, runInterval, TimeUnit.MILLISECONDS);
  }
}
```
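The delayed first run comes straight from the initialDelay parameter of ScheduledExecutorService.scheduleAtFixedRate. A minimal, self-contained demo of that behavior, using a millisecond-scale delay in place of moveThreadInterval:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class FixedRateDemo {

    // Schedule a no-op task at a fixed rate and measure how long the
    // first execution takes to arrive: never earlier than delayMs, which
    // is why a just-finished job can wait up to moveThreadInterval.
    static long millisUntilFirstRun(long delayMs) throws InterruptedException {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
        CountDownLatch firstRun = new CountDownLatch(1);
        long start = System.nanoTime();
        executor.scheduleAtFixedRate(firstRun::countDown, delayMs, delayMs,
                TimeUnit.MILLISECONDS);
        firstRun.await();
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        executor.shutdownNow();
        return elapsedMs;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("first run after ~" + millisUntilFirstRun(200) + " ms");
    }
}
```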
While a job is running, its history files are written on HDFS under ${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate/$user, where $user is the user who ran the job. When the job finishes, the files are not moved into the DONE directory immediately as in MRv1 (the DONE directory is ${yarn.app.mapreduce.am.staging-dir}/history/done); instead, a scheduled thread periodically executes the move, so the files may linger in the intermediate directory for a while. The main flow of the MoveIntermediateToDoneRunnable thread is analyzed below.
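A quick sketch of how the two directory paths resolve. The staging-dir value below is the stock Hadoop default for yarn.app.mapreduce.am.staging-dir; your cluster's configuration may differ:

```java
public class Mrv2HistoryPaths {
    // ${yarn.app.mapreduce.am.staging-dir} resolved to a concrete value
    // (the stock default; assumed here for illustration)
    static final String STAGING_DIR = "/tmp/hadoop-yarn/staging";

    // Per-user intermediate directory written while the job runs
    static String intermediateDir(String user) {
        return STAGING_DIR + "/history/done_intermediate/" + user;
    }

    // Final DONE directory the mover thread targets
    static String doneDir() {
        return STAGING_DIR + "/history/done";
    }

    public static void main(String[] args) {
        System.out.println(intermediateDir("hadoop"));
        System.out.println(doneDir());
    }
}
```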
```java
MoveIntermediateToDoneRunnable.run() {
  hsManager.scanIntermediateDirectory();
}

HistoryFileManager.scanIntermediateDirectory() {
  // List the subdirectories under done_intermediate
  List<FileStatus> userDirList = JobHistoryUtils.localGlobber(
      intermediateDoneDirFc, intermediateDoneDirPath, "");
  // Iterate over the per-user subdirectories
  for (FileStatus userDir : userDirList) {
    String name = userDir.getPath().getName();
    UserLogDir dir = userDirModificationTimeMap.get(name);
    if (dir == null) {
      dir = new UserLogDir();
      UserLogDir old = userDirModificationTimeMap.putIfAbsent(name, dir);
      if (old != null) {
        dir = old;
      }
    }
    // Scan the user subdirectory
    dir.scanIfNeeded(userDir);
  }
}

HistoryFileManager$UserLogDir.scanIfNeeded(FileStatus fs) {
  long newModTime = fs.getModificationTime();
  // If the directory's modification time has changed, scan its files
  if (modTime != newModTime) {
    Path p = fs.getPath();
    try {
      scanIntermediateDirectory(p);
      modTime = newModTime;
    } catch (IOException e) {
      LOG.error("Error while trying to scan the directory " + p, e);
    }
  }
}

HistoryFileManager.scanIntermediateDirectory(final Path absPath) {
  List<FileStatus> fileStatusList = scanDirectoryForHistoryFiles(absPath,
      intermediateDoneDirFc);
  for (FileStatus fs : fileStatusList) {
    // Split the .jhist file name on '-' and fill in the JobIndexInfo
    // fields (user, jobName, etc.)
    JobIndexInfo jobIndexInfo = FileNameIndexUtils.getIndexInfo(fs.getPath()
        .getName());
    // JobID_conf.xml
    String confFileName = JobHistoryUtils
        .getIntermediateConfFileName(jobIndexInfo.getJobId());
    // JobID.summary (in MRv1 the summary information was written to jt.log)
    String summaryFileName = JobHistoryUtils
        .getIntermediateSummaryFileName(jobIndexInfo.getJobId());
    // Instantiate a HistoryFileInfo with state IN_INTERMEDIATE
    HistoryFileInfo fileInfo = new HistoryFileInfo(fs.getPath(), new Path(fs
        .getPath().getParent(), confFileName), new Path(fs.getPath()
        .getParent(), summaryFileName), jobIndexInfo, false);
    final HistoryFileInfo old = jobListCache.addIfAbsent(fileInfo);
    if (old == null || old.didMoveFail()) {
      final HistoryFileInfo found = (old == null) ? fileInfo : old;
      long cutoff = System.currentTimeMillis() - maxHistoryAge;
      // Delete files that have exceeded the maximum age
      if (found.getJobIndexInfo().getFinishTime() <= cutoff) {
        try {
          found.delete();
        } catch (IOException e) {
          LOG.warn("Error cleaning up a HistoryFile that is out of date.", e);
        }
      } else {
        // Invoke moveToDone() asynchronously
        moveToDoneExecutor.execute(new Runnable() {
          @Override
          public void run() {
            try {
              found.moveToDone();
            } catch (IOException e) {
              LOG.info("Failed to process fileInfo for job: "
                  + found.getJobId(), e);
            }
          }
        });
      }
    }
  }
}

HistoryFileManager$HistoryFileInfo.moveToDone() {
  if (!isMovePending()) {
    return;
  }
  try {
    long completeTime = jobIndexInfo.getFinishTime();
    if (completeTime == 0) {
      completeTime = System.currentTimeMillis();
    }
    JobId jobId = jobIndexInfo.getJobId();
    List<Path> paths = new ArrayList<Path>(2);
    if (historyFile == null) {
      LOG.info("No file for job-history with " + jobId + " found in cache!");
    } else {
      paths.add(historyFile);
    }
    if (confFile == null) {
      LOG.info("No file for jobConf with " + jobId + " found in cache!");
    } else {
      paths.add(confFile);
    }
    if (summaryFile == null) {
      LOG.info("No summary file for job: " + jobId);
    } else {
      // Log and delete the summary file; it is not moved to the done dir
      String jobSummaryString = getJobSummary(intermediateDoneDirFc,
          summaryFile);
      SUMMARY_LOG.info(jobSummaryString);
      LOG.info("Deleting JobSummary file: [" + summaryFile + "]");
      intermediateDoneDirFc.delete(summaryFile, false);
      summaryFile = null;
    }
    // Assemble the target directory under done
    // (/user/history/done/YYYY/MM/DD/<serial string derived from the jobID>)
    Path targetDir = canonicalHistoryLogPath(jobId, completeTime);
    addDirectoryToSerialNumberIndex(targetDir);
    // Create the directory
    makeDoneSubdir(targetDir);
    if (historyFile != null) {
      Path toPath = doneDirFc.makeQualified(new Path(targetDir, historyFile
          .getName()));
      if (!toPath.equals(historyFile)) {
        moveToDoneNow(historyFile, toPath);
        historyFile = toPath;
      }
    }
    if (confFile != null) {
      Path toPath = doneDirFc.makeQualified(new Path(targetDir, confFile
          .getName()));
      if (!toPath.equals(confFile)) {
        moveToDoneNow(confFile, toPath);
        confFile = toPath;
      }
    }
    state = HistoryInfoState.IN_DONE;
  } catch (Throwable t) {
    LOG.error("Error while trying to move a job to done", t);
    this.state = HistoryInfoState.MOVE_FAILED;
  }
}
```
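The userDirModificationTimeMap handling in scanIntermediateDirectory is a standard lock-free caching pattern on a ConcurrentMap: create a candidate, call putIfAbsent, and keep whichever instance won. A stripped-down sketch of that pattern combined with the modification-time check (names and counters here are illustrative, not Hadoop's):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class UserDirCache {
    static final AtomicInteger scans = new AtomicInteger();

    static class CachedDir {
        volatile long modTime = -1;

        // Re-scan only when the directory's modification time changed,
        // mirroring UserLogDir.scanIfNeeded()
        void scanIfNeeded(long newModTime) {
            if (modTime != newModTime) {
                scans.incrementAndGet(); // stands in for the real scan
                modTime = newModTime;
            }
        }
    }

    static final ConcurrentMap<String, CachedDir> cache =
            new ConcurrentHashMap<>();

    static CachedDir getOrCreate(String user) {
        CachedDir dir = cache.get(user);
        if (dir == null) {
            dir = new CachedDir();
            // If another thread inserted first, use its instance
            CachedDir old = cache.putIfAbsent(user, dir);
            if (old != null) {
                dir = old;
            }
        }
        return dir;
    }

    public static void main(String[] args) {
        getOrCreate("alice").scanIfNeeded(100); // first sight: scans
        getOrCreate("alice").scanIfNeeded(100); // unchanged: skipped
        getOrCreate("alice").scanIfNeeded(200); // modified: scans again
        System.out.println("scans = " + scans.get()); // scans = 2
    }
}
```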
The contents of the .jhist file are encoded as JSON, which makes them easier to parse, and the information they provide is more comprehensive than in MRv1.
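Each event in a .jhist file sits on its own line as a JSON object with a "type" field and an "event" payload (the file also carries an Avro header and schema up front). In production you would parse it with a proper JSON/Avro library; purely as a sketch of the layout, here is a naive extraction of the event type from one line, on a made-up example:

```java
public class JhistLineSketch {
    // Naive extraction of the "type" field from one .jhist event line.
    // For illustration only -- real parsing should use Jackson or Avro.
    static String eventType(String jsonLine) {
        String key = "\"type\":\"";
        int start = jsonLine.indexOf(key);
        if (start < 0) {
            return null;
        }
        start += key.length();
        int end = jsonLine.indexOf('"', start);
        return jsonLine.substring(start, end);
    }

    public static void main(String[] args) {
        // A made-up event line with the shape seen in .jhist files
        String line = "{\"type\":\"TASK_FINISHED\","
                + "\"event\":{\"taskid\":\"task_1_0001_m_000000\"}}";
        System.out.println(eventType(line)); // TASK_FINISHED
    }
}
```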