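YARN Configuration
Excerpted from Chapter 9 of Hadoop: The Definitive Guide, 3rd edition, with some of my own commentary added in places. The translation is rough, so corrections are welcome.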
YARN is the next-generation architecture for running MR (and is described in "YARN (MR2)"). It has a different set of daemons and configuration options than classic MR (also called MR1), and in this section we look at these differences and discuss how to run MR on YARN.
YARN is the next generation of MapReduce. Its daemons and configuration differ from those of MR1, so here we go through the differences between the two generations and how to run MR on YARN.
Under YARN, you no longer run a jobtracker or tasktrackers. Instead, there is a single resource manager running on the same machine as the HDFS namenode (for small clusters) or on a dedicated machine, and node managers running on each worker node in the cluster.
Under YARN you no longer run a jobtracker and tasktrackers. They are replaced by a resource manager, which runs either on the HDFS namenode machine or on a node of its own, and by node managers running on every worker node in the cluster.
YARN also has a job history server daemon that provides users with details of past job runs, and a web app proxy server for providing a secure way for users to access the UI provided by YARN applications. In the case of MR, the web UI served by the proxy provides information about the current job you are running, similar to the one described in "The MR Web UI" on page 164. By default, the web app proxy server runs in the same process as the resource manager, but it may be configured to run as a standalone daemon.
YARN also runs a daemon called the job history server, which records the details of past jobs, and a proxy server through which users reach application UIs. The UI behind the proxy shows the job you are currently running, much like the MR web UI described on page 164. By default the proxy server runs inside the resource manager process, but it can be configured to run as a separate daemon.
YARN has its own set of configuration files, listed in Table 9-8; these are used in addition to those in Table 9-1.
Table 9-8. YARN configuration files
Filename | Format | Description
yarn-env.sh | Bash script | Environment variables that are used in the scripts to run YARN
yarn-site.xml | Hadoop configuration XML | Configuration settings for YARN daemons: the resource manager, the job history server, the web app proxy server, and the node managers
Important YARN Daemon Properties
When running MR on YARN, the mapred-site.xml file is still used for general MR properties, although the jobtracker- and tasktracker-related properties are not used. None of the properties in Table 9-4 are applicable to YARN, except for mapred.child.java.opts (and the related properties mapreduce.map.java.opts and mapreduce.reduce.java.opts, which apply only to map or reduce tasks, respectively). The JVM options specified in this way are used to launch the YARN child process that runs map or reduce tasks.
When MR programs run on YARN, mapred-site.xml is still where the general MR parameters are configured, but the jobtracker and tasktracker parameters are no longer used. None of the parameters listed in Table 9-4 apply any more, with the exception of mapred.child.java.opts and the map-only and reduce-only variants mapreduce.map.java.opts and mapreduce.reduce.java.opts.
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx400m</value>
</property>
</configuration>
<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
<property>
<name>yarn.resourcemanager.address</name>
<value>resourcemanager:8032</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/disk1/nm-local-dir,/disk2/nm-local-dir</value>
<final>true</final>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>
</configuration>
The YARN resource manager address is controlled via yarn.resourcemanager.address, which takes the form of a host:port pair (such as master:8032). In a client configuration, this property is used to connect to the resource manager (using RPC), and in addition, the mapreduce.framework.name property must be set to "yarn" for the client to use YARN rather than the local job runner.
The resource manager address is configured with yarn.resourcemanager.address in host:port form, for example master:8032. In a client configuration this property is used to connect to the resource manager over RPC (remote procedure call), and mapreduce.framework.name must also be set to "yarn". This may not be obvious, so here is my reading: the resource manager runs on the master node, and every other node is a client relative to it, so mapreduce.framework.name should be set to "yarn" on the master and on all the other nodes alike.
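A client-side mapred-site.xml that selects YARN could look like the sketch below. The property name and value come from the paragraph above; the file is otherwise assumed to be empty.

```xml
<?xml version="1.0"?>
<!-- mapred-site.xml (client side): submit MR jobs to YARN instead of the local job runner -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```

The client still needs yarn.resourcemanager.address in its yarn-site.xml in order to find the resource manager.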
Although YARN does not honor mapred.local.dir, it has an equivalent property called yarn.nodemanager.local-dirs, which allows you to specify the local disks to store intermediate data on. It is specified by a comma-separated list of local directory paths, which are used in a round-robin fashion.
YARN ignores mapred.local.dir, but it has a replacement property, yarn.nodemanager.local-dirs, which sets the local paths for the intermediate data produced during a run. The paths are separated by commas and used in round-robin fashion.
For example:
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/disk1/nm-local-dir,/disk2/nm-local-dir</value>
<final>true</final>
</property>
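The previous paragraph is easy enough to follow. There are two properties for local directories:
mapred.local.dir in mapred-site.xml, which YARN does not honor.
yarn.nodemanager.local-dirs in yarn-site.xml, which is the one to set under YARN.
The directories are separated by commas and used in round-robin fashion.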
YARN doesn't have tasktrackers to serve map outputs to reduce tasks, so for this function it relies on shuffle handlers, which are long-running auxiliary services running in node managers. Because YARN is a general-purpose service, the MR shuffle handlers need to be enabled explicitly in yarn-site.xml by setting the yarn.nodemanager.aux-services property to "mapreduce.shuffle".
This paragraph is harder to follow. Roughly: YARN has no tasktrackers to serve map output to reduce tasks, so that job falls entirely to the shuffle handlers, auxiliary services that run long-term inside the node managers. Because YARN is a service designed for general-purpose computing frameworks, the MR shuffle handlers have to be configured explicitly; the implication is that other frameworks, such as Storm or Spark, may have no such shuffle step. The setting lives in yarn-site.xml: the property is yarn.nodemanager.aux-services and the book gives the value "mapreduce.shuffle", but in my own experience the value actually has to be "mapreduce_shuffle".
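The setting discussed above can be sketched as follows, assuming a Hadoop 2.2 or later release, where the service is named mapreduce_shuffle and its handler class is declared alongside it. On the earlier releases the book describes, the service name is mapreduce.shuffle and the class property name changes to match.

```xml
<!-- yarn-site.xml fragment: enable the MR shuffle handler (Hadoop 2.2+ naming assumed) -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
```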
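Table 9-9. Important YARN daemon properties
Property name | Type | Default value | Description
yarn.resourcemanager.address | host:port | 0.0.0.0:8032 | The hostname and port that the resource manager's RPC server runs on.
yarn.nodemanager.local-dirs | Comma-separated directory names | /tmp | The local directories in which intermediate data is stored.
yarn.nodemanager.aux-services | Service names | (none) | The auxiliary services run by the node manager. Set it to mapreduce_shuffle to enable the MR shuffle handler.
yarn.nodemanager.resource.memory-mb | int | 8192 | The total memory, in MB, that the node manager may allocate to containers.
yarn.nodemanager.vmem-pmem-ratio | float | 2.1 | The ratio of virtual to physical memory for containers. A container's virtual memory usage may exceed its physical allocation by up to this ratio.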
Memory
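YARN treats memory in a more fine-grained manner than the slot-based model used in the classic implementation of MR. Rather than specifying a fixed maximum number of map and reduce slots that may run on a tasktracker node at once, YARN allows applications to request an arbitrary amount of memory (within limits) for a task. In the YARN model, node managers allocate memory from a pool, so the number of tasks that are running on a particular node depends on the sum of their memory requirements, and not simply on a fixed number of slots.
This paragraph describes how task allocation differs between MR1 and YARN. YARN allocates resources in a finer-grained way, which is more sensible than the MR1 scheme of fixing the number of map and reduce slots up front. YARN lets an application request any amount of memory, within limits, for a task. Node managers allocate memory from a pool, so the number of tasks running on a given node depends on their total memory requirement rather than simply on a slot count. How should this be read? From what I have observed: once tasks are assigned to a node, YARN works out how much memory they need. The memory a single task needs is roughly fixed (the map and reduce memory sizes are configured in mapred-site.xml), so the number of tasks follows from that; the more memory the assigned tasks need in total, the more YarnChild processes there may be, and vice versa.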
The slot-based model can lead to cluster underutilization, since the proportion of map slots to reduce slots is fixed as a cluster-wide configuration. However, the number of map versus reduce slots that are in demand changes over time: at the beginning of a job only map slots are needed, whereas at the end of the job only reduce slots are needed. On larger clusters with many concurrent jobs, the variation in demand for a particular type of slot may be less pronounced, but there is still wastage. YARN avoids this problem by not distinguishing between the two types of slots.
The slot-based allocation in MR1 leaves cluster resources underused. The ratio of map slots to reduce slots is fixed across the whole cluster, yet the ratio actually needed changes constantly while jobs run: at the start of a job only map slots are needed, and at the end usually only reduce slots. On a large cluster with many concurrent jobs the swing in demand for each slot type is less marked, but the waste remains. YARN no longer distinguishes between the two slot types, which avoids the problem.
The considerations for how much memory to dedicate to a node manager for running containers are similar to those discussed in "Memory" on page 307. Each Hadoop daemon uses 1,000 MB, so for a datanode and a node manager, the total is 2,000 MB. Set aside enough for other processes that are running on the machine, and the remainder can be dedicated to the node manager's containers by setting the configuration property yarn.nodemanager.resource.memory-mb to the total allocation in MB. (The default is 8,192 MB.)
Deciding how much memory a node manager can give to containers follows the same reasoning as the "Memory" discussion on page 307. The datanode and the node manager each use 1,000 MB, 2 GB in total. After setting aside enough for the other processes on the machine, the rest can go to the node manager's containers, which is done by setting yarn.nodemanager.resource.memory-mb in yarn-site.xml; that value is the total memory available to containers. (A practical note: in my experience roughly 20% of memory is reserved for the Hadoop daemons and the operating system, and the remaining 80% is given to containers.)
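The calculation above can be sketched for a hypothetical worker node. The 16,384 MB of RAM and the 2,048 MB reserve for the operating system and other processes are assumptions made for illustration; only the 2,000 MB for the two daemons comes from the text.

```xml
<!-- yarn-site.xml fragment for a hypothetical worker with 16,384 MB of RAM.
     16384 (total) - 2000 (datanode + node manager) - 2048 (assumed reserve for the OS
     and other processes) = 12336 MB left for containers. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>12336</value>
</property>
```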
The next step is to determine how to set memory options for individual jobs. There are two controls: mapred.child.java.opts, which allows you to set the JVM heap size of the map or reduce task; and mapreduce.map.memory.mb (or mapreduce.reduce.memory.mb), which is used to specify how much memory you need for map (or reduce) task containers. The latter setting is used by the application master when negotiating for resources in the cluster, and also by the node manager, which runs and monitors the task containers.
The next step is to decide how memory is allocated to individual jobs. There are two settings:
mapred.child.java.opts sets the JVM heap size for a map or reduce task.
mapreduce.map.memory.mb (or mapreduce.reduce.memory.mb) declares how much memory a map (or reduce) container needs. The application master uses it when negotiating cluster resources, and the node manager uses it when monitoring task containers.
For example, suppose that mapred.child.java.opts is set to -Xmx800m and mapreduce.map.memory.mb is left at its default value of 1,024 MB. When a map task is run, the node manager will allocate a 1,024 MB container (decreasing the size of its pool by that amount for the duration of the task) and will launch the task JVM configured with an 800 MB maximum heap size. Note that the JVM process will have a larger memory footprint than the heap size, and the overhead will depend on such things as the native libraries that are in use, the size of the permanent generation space, and so on. The important thing is that the physical memory used by the JVM process, including any processes that it spawns, such as Streaming or Pipes processes, does not exceed its allocation (1,024 MB). If a container uses more memory than it has been allocated, then it may be terminated by the node manager and marked as failed.
To take an example: with mapred.child.java.opts set to -Xmx800m and mapreduce.map.memory.mb left at its default of 1,024 MB, starting a map task makes the node manager allocate a 1,024 MB container (its pool shrinks by that amount while the task runs) and launch a JVM with an 800 MB maximum heap. The JVM process occupies more memory than its heap; the extra depends on the native libraries in use, the permanent generation space, and so on. What matters is that the physical memory used by the JVM process, together with any processes it spawns such as Streaming or Pipes processes, stays within the 1,024 MB allocation. A container that uses more than that may be terminated and marked as failed.
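The example above corresponds to a mapred-site.xml fragment along these lines. Both values are the ones used in the text; mapreduce.map.memory.mb is written out explicitly here only to make the relationship visible, since 1024 is already its default.

```xml
<!-- mapred-site.xml fragment: an 800 MB heap inside a 1,024 MB map container -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx800m</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
</property>
```

The 224 MB gap between the heap limit and the container size is the headroom for the non-heap memory of the JVM and for any processes it spawns.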
Schedulers may impose a minimum or maximum on memory allocations. For example, for the Capacity Scheduler, the default minimum is 1,024 MB (set by yarn.scheduler.capacity.minimum-allocation-mb), and the default maximum is 10,240 MB (set by yarn.scheduler.capacity.maximum-allocation-mb).
A scheduler may enforce a minimum or maximum memory allocation. With the Capacity Scheduler, for instance, the default minimum is 1,024 MB and the default maximum is 10,240 MB, set by yarn.scheduler.capacity.minimum-allocation-mb and yarn.scheduler.capacity.maximum-allocation-mb respectively. (From what I have observed the maximum is 8,192 MB; there may be a way to change it.)
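On Hadoop 2.x releases these limits are set in yarn-site.xml through the scheduler-independent properties yarn.scheduler.minimum-allocation-mb (default 1024) and yarn.scheduler.maximum-allocation-mb (default 8192), which would account for an observed maximum of 8,192 MB. The sketch below assumes such a release; the value 10240 is purely illustrative, chosen to match the figure quoted from the book.

```xml
<!-- yarn-site.xml fragment (Hadoop 2.x property names assumed): bounds on a single container request -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <!-- illustrative value; the 2.x default is 8192 -->
  <value>10240</value>
</property>
```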
There are also virtual memory constraints that a container must meet. If a container's virtual memory usage exceeds a given multiple of the allocated physical memory, the node manager may terminate the process. The multiple is expressed by the yarn.nodemanager.vmem-pmem-ratio property, which defaults to 2.1. In the example used earlier, the virtual memory threshold above which the task may be terminated is 2,150 MB, which is 2.1 × 1,024 MB.
Containers also have a virtual memory limit to meet. If a container's virtual memory exceeds a given multiple of its physical allocation, the node manager may terminate the process. The multiple is set by yarn.nodemanager.vmem-pmem-ratio, which defaults to 2.1. In the earlier example the virtual memory threshold is 2,150 MB, that is 2.1 × 1,024 MB, and a task that goes over it may be terminated.
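As a small illustration of the threshold calculation, the fragment below states the ratio explicitly in yarn-site.xml. The value is simply the default quoted in the text, so setting it changes nothing; it is shown only to tie the property to the arithmetic.

```xml
<!-- yarn-site.xml fragment: virtual memory limit as a multiple of the physical allocation.
     With the default ratio and a 1,024 MB container: 2.1 * 1024 = 2150.4 MB of virtual memory. -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
```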
When configuring memory parameters it's very useful to be able to monitor a task's actual memory usage during a job run, and this is possible via MR task counters. The counters PHYSICAL_MEMORY_BYTES, VIRTUAL_MEMORY_BYTES, and COMMITTED_HEAP_BYTES (described in Table 8-2) provide snapshot values of memory usage and are therefore suitable for observation during the course of a task attempt.
When tuning the memory parameters it helps to watch how much memory a task really uses while the job runs. The MR task counters PHYSICAL_MEMORY_BYTES, VIRTUAL_MEMORY_BYTES, and COMMITTED_HEAP_BYTES give snapshots of memory usage, which makes them well suited to observing a task as it runs.