I. Preface
1. Versions:
Hadoop source version: 2.7.1
Spark source version: 2.4.1
II. Analysis
1. LOCAL_DIRS in Spark's BlockManager
In DiskBlockManager, the member variable localDirs holds the list of local directories that the BlockManager writes blocks to on disk. From DiskBlockManager.scala:
/* Create one local directory for each path mentioned in spark.local.dir; then, inside this
* directory, create multiple subdirectories that we will hash files into, in order to avoid
* having really large inodes at the top level. */
private[spark] val localDirs: Array[File] = createLocalDirs(conf)
if (localDirs.isEmpty) {
logError("Failed to create any local dir.")
System.exit(ExecutorExitCode.DISK_STORE_FAILED_TO_CREATE_DIR)
}
Tracing the call chain of createLocalDirs:
DiskBlockManager.createLocalDirs -> Utils.getConfiguredLocalDirs -> Utils.getYarnLocalDirs(conf).split(",") -> conf.getenv("LOCAL_DIRS")
From Utils.scala:
/** Get the Yarn approved local directories. */
private def getYarnLocalDirs(conf: SparkConf): String = {
val localDirs = Option(conf.getenv("LOCAL_DIRS")).getOrElse("")
if (localDirs.isEmpty) {
throw new Exception("Yarn Local dirs can't be empty")
}
localDirs
}
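On the Spark side, all that remains after reading the variable is to split the comma-separated value into individual directories. A minimal sketch of that step; the example value is an assumption for illustration, not taken from a real cluster:

```java
public class SplitLocalDirs {
    // Mirrors what the Spark side does with the LOCAL_DIRS value:
    // split the comma-separated list into individual directory paths.
    static String[] split(String localDirs) {
        return localDirs.split(",");
    }

    public static void main(String[] args) {
        // Assumed example of a value YARN would place in LOCAL_DIRS
        String env = "/data1/nm-local-dir/usercache/alice/appcache/application_1,"
                   + "/data2/nm-local-dir/usercache/alice/appcache/application_1";
        for (String dir : split(env)) {
            System.out.println(dir);
        }
    }
}
```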
So when is the LOCAL_DIRS environment variable set? It is set beforehand on the Hadoop side.
2. Back in the Hadoop source: the ContainerLaunch.call function
public Integer call() {
  ...
  List<String> localDirs = dirsHandler.getLocalDirs();
  ...
  // /////////// Write out the container-script in the nmPrivate space.
  List<Path> appDirs = new ArrayList<Path>(localDirs.size());
  for (String localDir : localDirs) {
    Path usersdir = new Path(localDir, ContainerLocalizer.USERCACHE);
    Path userdir = new Path(usersdir, user);
    Path appsdir = new Path(userdir, ContainerLocalizer.APPCACHE);
    appDirs.add(new Path(appsdir, appIdStr));
  }
  ...
  // Sanitize the container's environment
  sanitizeEnv(environment, containerWorkDir, appDirs, containerLogDirs,
    localResources, nmPrivateClasspathJarDir);
  ...
}
For each NodeManager local dir, ContainerLaunch.call composes usersdir (the usercache directory), userdir (usercache/${user}) and appsdir (usercache/${user}/appcache), then appends the per-application directory usercache/${user}/appcache/${appIdStr} to appDirs; these levels correspond to the user- and application-scoped resource localization directories.
Following ContainerLaunch.call -> ContainerLaunch.sanitizeEnv, we can see that sanitizeEnv writes appDirs into the container's environment, which is how the Spark side can later read it back with getenv.
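The path composition in the loop above can be sketched as plain string concatenation. The directory names usercache and appcache are the values of ContainerLocalizer.USERCACHE and ContainerLocalizer.APPCACHE; the user name and application id below are illustrative assumptions:

```java
public class AppDirLayout {
    // Mirrors the path composition in ContainerLaunch.call:
    // <localDir>/usercache/<user>/appcache/<appId>
    static String appDir(String localDir, String user, String appIdStr) {
        String usersdir = localDir + "/usercache";   // ContainerLocalizer.USERCACHE
        String userdir  = usersdir + "/" + user;
        String appsdir  = userdir + "/appcache";     // ContainerLocalizer.APPCACHE
        return appsdir + "/" + appIdStr;
    }

    public static void main(String[] args) {
        // Illustrative values; the application id format is assumed
        System.out.println(appDir("/data1/nm-local-dir", "alice",
                                  "application_1550000000000_0001"));
    }
}
```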
public void sanitizeEnv(Map<String, String> environment, Path pwd,
    List<Path> appDirs, List<String> containerLogDirs,
    Map<Path, List<String>> resources,
    Path nmPrivateClasspathJarDir) throws IOException {
  ...
  environment.put(Environment.LOCAL_DIRS.name(),
      StringUtils.join(",", appDirs));
  ...
}
Back in ContainerLaunch.call, dirsHandler.getLocalDirs() supplies localDirs, and dirsHandler is a LocalDirsHandlerService:
private final LocalDirsHandlerService dirsHandler;
How does LocalDirsHandlerService obtain these directories? First, getLocalDirs() simply delegates to its member variable localDirs:
/**
* @return the good/valid local directories based on disks' health
*/
public List getLocalDirs() {
return localDirs.getGoodDirs();
}
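getGoodDirs() returns only the directories that currently pass the NodeManager's disk health checks. A simplified, hypothetical stand-in for that filtering (not the actual DirectoryCollection implementation, which also applies the utilization and free-space thresholds seen below):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GoodDirsSketch {
    // Hypothetical stand-in for DirectoryCollection.getGoodDirs():
    // keep only directories that exist and are writable.
    static List<String> goodDirs(List<String> configured) {
        List<String> good = new ArrayList<>();
        for (String d : configured) {
            File f = new File(d);
            if (f.isDirectory() && f.canWrite()) {
                good.add(d);
            }
        }
        return good;
    }

    public static void main(String[] args) {
        System.out.println(goodDirs(Arrays.asList(
            System.getProperty("java.io.tmpdir"), "/no-such-dir-xyz")));
    }
}
```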
Next, how is the member variable localDirs initialized? The flow is:
LocalDirsHandlerService.serviceInit -> new MonitoringTimerTask(conf) -> LocalDirsHandlerService.localDirs -> Configuration.getTrimmedStrings -> Configuration.get -> Configuration.getProps -> Properties.getProperty -> yarn-default.xml
The LocalDirsHandlerService.serviceInit(...) code:
protected void serviceInit(Configuration config) throws Exception {
// Clone the configuration as we may do modifications to dirs-list
Configuration conf = new Configuration(config);
diskHealthCheckInterval = conf.getLong(
YarnConfiguration.NM_DISK_HEALTH_CHECK_INTERVAL_MS,
YarnConfiguration.DEFAULT_NM_DISK_HEALTH_CHECK_INTERVAL_MS);
monitoringTimerTask = new MonitoringTimerTask(conf);
isDiskHealthCheckerEnabled = conf.getBoolean(
YarnConfiguration.NM_DISK_HEALTH_CHECK_ENABLE, true);
minNeededHealthyDisksFactor = conf.getFloat(
YarnConfiguration.NM_MIN_HEALTHY_DISKS_FRACTION,
YarnConfiguration.DEFAULT_NM_MIN_HEALTHY_DISKS_FRACTION);
lastDisksCheckTime = System.currentTimeMillis();
...
}
The MonitoringTimerTask(...) constructor assigns LocalDirsHandlerService.localDirs; in effect it reads the yarn.nodemanager.local-dirs setting, whose default value comes from yarn-default.xml.
public MonitoringTimerTask(Configuration conf) throws YarnRuntimeException {
float maxUsableSpacePercentagePerDisk =
conf.getFloat(
YarnConfiguration.NM_MAX_PER_DISK_UTILIZATION_PERCENTAGE,
YarnConfiguration.DEFAULT_NM_MAX_PER_DISK_UTILIZATION_PERCENTAGE);
long minFreeSpacePerDiskMB =
conf.getLong(YarnConfiguration.NM_MIN_PER_DISK_FREE_SPACE_MB,
YarnConfiguration.DEFAULT_NM_MIN_PER_DISK_FREE_SPACE_MB);
localDirs =
new DirectoryCollection(
validatePaths(conf
.getTrimmedStrings(YarnConfiguration.NM_LOCAL_DIRS)),
maxUsableSpacePercentagePerDisk, minFreeSpacePerDiskMB);
logDirs =
new DirectoryCollection(
validatePaths(conf.getTrimmedStrings(YarnConfiguration.NM_LOG_DIRS)),
maxUsableSpacePercentagePerDisk, minFreeSpacePerDiskMB);
localDirsAllocator = new LocalDirAllocator(
YarnConfiguration.NM_LOCAL_DIRS);
logDirsAllocator = new LocalDirAllocator(YarnConfiguration.NM_LOG_DIRS);
}
The getTrimmedStrings(...) function:
/**
 * Get the comma delimited values of the <code>name</code> property as
 * an array of <code>String</code>s, trimmed of the leading and trailing whitespace.
 * If no such property is specified then default value is returned.
 *
 * @param name property name.
 * @param defaultValue The default value
 * @return property value as an array of trimmed <code>String</code>s,
 *         or default value.
 */
public String[] getTrimmedStrings(String name, String... defaultValue) {
String valueString = get(name);
if (null == valueString) {
return defaultValue;
} else {
return StringUtils.getTrimmedStrings(valueString);
}
}
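For the non-null branch, StringUtils.getTrimmedStrings splits the comma-separated value and strips surrounding whitespace from each entry. An equivalent of that common case (the real helper additionally returns an empty array for null or empty input):

```java
public class TrimmedStringsSketch {
    // Equivalent of org.apache.hadoop.util.StringUtils.getTrimmedStrings
    // for a non-empty value: split on commas, trimming whitespace.
    static String[] getTrimmedStrings(String valueString) {
        return valueString.trim().split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        String[] dirs = getTrimmedStrings(" /data1/nm-local-dir , /data2/nm-local-dir ");
        for (String d : dirs) {
            System.out.println("[" + d + "]");
        }
    }
}
```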
LocalDirsHandlerService.serviceInit -> new MonitoringTimerTask(conf) -> LocalDirsHandlerService.localDirs ->
Configuration.getTrimmedStrings -> Configuration.get ->Configuration.getProps-> Properties.getProperty -> yarn-default.xml
The Configuration.get function:
public String get(String name) {
String[] names = handleDeprecation(deprecationContext.get(), name);
String result = null;
for(String n : names) {
result = substituteVars(getProps().getProperty(n));
}
return result;
}
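Note the substituteVars call: it expands ${var} references in the property value, which is exactly what turns the default ${hadoop.tmp.dir}/nm-local-dir into a concrete path. A simplified single-pass stand-in (the real Configuration.substituteVars iterates to handle nested references, and also consults system properties):

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubstituteVarsSketch {
    // Simplified stand-in for Configuration.substituteVars:
    // expand ${var} references against the given properties,
    // leaving unknown variables untouched.
    static String substitute(String raw, Properties props) {
        Matcher m = Pattern.compile("\\$\\{([^}]+)\\}").matcher(raw);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String val = props.getProperty(m.group(1), m.group(0));
            m.appendReplacement(sb, Matcher.quoteReplacement(val));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("hadoop.tmp.dir", "/tmp/hadoop-alice"); // assumed value
        System.out.println(substitute("${hadoop.tmp.dir}/nm-local-dir", props));
    }
}
```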
The corresponding entry in yarn-default.xml:
<property>
  <description>List of directories to store localized files in. An
    application's localized file directory will be found in:
    ${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}.
    Individual containers' work directories, called container_${contid}, will
    be subdirectories of this.
  </description>
  <name>yarn.nodemanager.local-dirs</name>
  <value>${hadoop.tmp.dir}/nm-local-dir</value>
</property>
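In practice this default is usually overridden in yarn-site.xml, typically with one directory per physical disk; the paths below are illustrative assumptions:

```xml
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data1/nm-local-dir,/data2/nm-local-dir</value>
</property>
```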
III. Summary
So under YARN, the LOCAL_DIRS used by Spark's BlockManager is derived from the ContainerLocalizer.APPCACHE directories, and the block manager directories take the form:
${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}/blockmgr-xxx-xxx-xxx-xxx-xxx
By contrast, ${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}/filecache holds application-level resource localization. See the companion post, "An analysis of ContainerLocalization [YARN]": https://mp.csdn.net/postedit/86539985
IV. References
Hadoop source files consulted:
LocalDirsHandlerService.java
ResourceLocalizationService.java
ContainerLaunch.java
YarnConfiguration.java
DirectoryCollection.java
Configuration.java
Spark source files consulted:
Utils.scala
DiskBlockManager.scala