Let me describe a problem I ran into a while ago. The project runs Spark on YARN, with Oozie orchestrating and scheduling the applications. Oozie supports a fork/join mechanism: after a fork, multiple branches can run other actions in parallel, which for us means launching several Spark applications at once. The symptom was that even though plenty of memory was free, every one of these Spark applications sat in the ACCEPTED state and never got scheduled to run.
To analyze the problem, we first need to look at how Oozie launches jobs.
Oozie starts each action job through a map-only MapReduce job called the oozie-launcher. Every job Oozie runs gets such a launcher, which acts as the loader for the real workload. The launcher has to request an ApplicationMaster (AM) from the YARN cluster, and the actual Spark job then has to request its own AM on top of that.
However, the launchers for the actions spawned by an Oozie fork all start at the same time (still bounded by the concurrency setting, but they all reach RUNNING before any Spark job does). These launchers alone therefore tie up a significant amount of AM memory. Under YARN's FairScheduler, the fraction of a queue that AMs may occupy defaults to 0.5, so when many actions are forked at once, the launcher AMs alone approach 50% of the queue and no actual Spark application can be scheduled.
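To make the numbers concrete with a made-up example: suppose the queue has 20 GB of memory and maxAMShare is the default 0.5, so at most 10 GB can be held by ApplicationMasters. If a fork starts five actions and each oozie-launcher AM container is 2 GB (see the sizing logic below), the launchers alone consume 5 × 2 GB = 10 GB of AM share. Every Spark AM request is then blocked by the AM-share check, and the applications sit in ACCEPTED even though the other 10 GB of the queue is idle.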
There are a few ways to attack this: (1) shrink the launcher's resource footprint; (2) raise the AM share; (3) put the launchers in a different queue from the Spark job actions, so launcher AMs and job AMs never compete and the stall cannot happen.
The launcher's resource usage is determined in the Oozie source, src/main/java/org/apache/oozie/action/hadoop/JavaActionExecutor.java:
// memory.mb
int launcherMapMemoryMB = launcherConf.getInt(HADOOP_MAP_MEMORY_MB, 1536);
int amMemoryMB = launcherConf.getInt(YARN_AM_RESOURCE_MB, 1536);
// YARN_MEMORY_MB_MIN to provide buffer.
// suppose launcher map aggressively use high memory, need some
// headroom for AM
int memoryMB = Math.max(launcherMapMemoryMB, amMemoryMB) + YARN_MEMORY_MB_MIN;
// limit to 4096 in case of 32 bit
if (launcherMapMemoryMB < 4096 && amMemoryMB < 4096 && memoryMB > 4096) {
    memoryMB = 4096;
}
launcherConf.setInt(YARN_AM_RESOURCE_MB, memoryMB);

// We already made mapred.child.java.opts and
// mapreduce.map.java.opts equal, so just start with one of them
String launcherMapOpts = launcherConf.get(HADOOP_MAP_JAVA_OPTS, "");
String amChildOpts = launcherConf.get(YARN_AM_COMMAND_OPTS);
StringBuilder optsStr = new StringBuilder();
int heapSizeForMap = extractHeapSizeMB(launcherMapOpts);
int heapSizeForAm = extractHeapSizeMB(amChildOpts);
int heapSize = Math.max(heapSizeForMap, heapSizeForAm) + YARN_MEMORY_MB_MIN;
// limit to 3584 in case of 32 bit
if (heapSizeForMap < 4096 && heapSizeForAm < 4096 && heapSize > 3584) {
    heapSize = 3584;
}
if (amChildOpts != null) {
    optsStr.append(amChildOpts);
}
optsStr.append(" ").append(launcherMapOpts.trim());
if (heapSize > 0) {
    // append calculated total heap size to the end
    optsStr.append(" ").append("-Xmx").append(heapSize).append("m");
}
launcherConf.set(YARN_AM_COMMAND_OPTS, optsStr.toString().trim());
The above applies not just to Spark actions: Hive, Sqoop, MR and other action nodes behave the same way, since their executors all inherit from JavaActionExecutor.
Oozie assumes the launcher's map task might do aggregation and consume a lot of memory, so it takes the maximum of oozie.launcher.mapreduce.map.memory.mb and oozie.launcher.yarn.app.mapreduce.am.resource.mb and adds YARN_MEMORY_MB_MIN (512 MB) as headroom. Both settings default to 1536 MB, so max(1536, 1536) + 512 = 2048 MB: each oozie-launcher occupies a 1-core / 2 GB container, and those 2 GB come straight out of the queue's AM share.
We can add the configuration under the corresponding spark action in workflow.xml, or set it globally in oozie-site.xml, lowering both values to 512 so that each launcher only occupies a 1 GB container.
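A minimal sketch of the per-action override in workflow.xml (the action name, class and jar path are made-up placeholders; the two oozie.launcher.* properties are what matters):

<action name="spark-job">
    <spark xmlns="uri:oozie:spark-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- shrink the launcher map container request -->
            <property>
                <name>oozie.launcher.mapreduce.map.memory.mb</name>
                <value>512</value>
            </property>
            <!-- shrink the launcher AM request; Oozie takes max(map, am) + 512 -->
            <property>
                <name>oozie.launcher.yarn.app.mapreduce.am.resource.mb</name>
                <value>512</value>
            </property>
        </configuration>
        <master>yarn-cluster</master>
        <name>my-spark-app</name>
        <class>com.example.MyApp</class>
        <jar>${nameNode}/apps/my-app.jar</jar>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>

With both values at 512, the launcher gets max(512, 512) + 512 = 1024 MB, which is where the "1 GB per launcher" above comes from.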
When YARN runs with the FairScheduler, we can edit fair-scheduler.xml and add a maxAMShare setting to the queue in question. The documentation describes it as follows:
maxAMShare: limit the fraction of the queue's fair share that can be used to run application masters. This property can only be used for leaf queues. For example, if set to 1.0f, then AMs in the leaf queue can take up to 100% of both the memory and CPU fair share. The value of -1.0f will disable this feature and the amShare will not be checked. The default value is 0.5f.
For example, to set maxAMShare of the queue_zjh queue to 0.9:
<allocations>
  <queue name="queue_zjh">
    <minResources>10000 mb,0vcores</minResources>
    <maxResources>90000 mb,0vcores</maxResources>
    <maxRunningApps>50</maxRunningApps>
    <maxAMShare>0.9</maxAMShare>
    <weight>2.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
    <queue name="sample_sub_queue">
      <aclSubmitApps>charlie</aclSubmitApps>
      <minResources>5000 mb,0vcores</minResources>
    </queue>
  </queue>
</allocations>
A global default can also be set via queueMaxAMShareDefault, documented as:
queueMaxAMShareDefault: which sets the default AM resource limit for queue; overriden by maxAMShare element in each queue
For example, the following sets the default for all queues to 0.8:
<allocations>
  <queue name="queue_zjh">
    <minResources>10000 mb,0vcores</minResources>
    <maxResources>90000 mb,0vcores</maxResources>
    <maxRunningApps>50</maxRunningApps>
    <maxAMShare>0.9</maxAMShare>
    <weight>2.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
  <queueMaxAMShareDefault>0.8</queueMaxAMShareDefault>
</allocations>
The third approach is to split the queues: set oozie.launcher.mapred.job.queue.name to the queue the launcher should use, and set mapreduce.job.queuename (mapred.job.queue.name) to the queue the action itself should use, as sketched below.
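A sketch of what that would look like inside the action's <configuration> block (the queue names launcher_queue and spark_queue are placeholders):

<configuration>
    <!-- queue that hosts the oozie-launcher (and its AM) -->
    <property>
        <name>oozie.launcher.mapred.job.queue.name</name>
        <value>launcher_queue</value>
    </property>
    <!-- queue that hosts the job the action actually submits -->
    <property>
        <name>mapreduce.job.queuename</name>
        <value>spark_queue</value>
    </property>
</configuration>

For a Spark action, the job's queue can also be passed through spark-opts, e.g. --queue spark_queue.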
That said, I never actually tested this last approach…