Hadoop 3.1.1: GPU Scheduling and Management on YARN

Configuring GPU on YARN

YARN in Hadoop 3.1.1 supports GPU scheduling and management in two modes:

  • YARN discovers the node's GPU resources automatically and allocates them;
  • the user manually specifies which GPU devices may be used.

See the document Using GPU On YARN.md for the detailed configuration steps; a minimal sketch of both modes follows.
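As a rough sketch of the two modes (property names are taken from the Hadoop 3.1.1 GPU documentation; the manual device list below is an assumed example of index:minor_number pairs), the relevant config entries look like this.

resource-types.xml — register the GPU resource type:

  <property>
    <name>yarn.resource-types</name>
    <value>yarn.io/gpu</value>
  </property>

yarn-site.xml — enable the GPU plugin on each NodeManager:

  <property>
    <name>yarn.nodemanager.resource-plugins</name>
    <value>yarn.io/gpu</value>
  </property>
  <!-- "auto" = mode 1: YARN discovers GPUs itself (the default).
       A list such as 0:0,1:1 = mode 2: the user hand-picks devices
       as index:minor_number pairs. -->
  <property>
    <name>yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices</name>
    <value>auto</value>
  </property>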

Running GPU on YARN

Run the command:

yarn jar /home/nht/hadoop-3.1.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.1.1.jar \
  -jar /home/nht/hadoop-3.1.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.1.1.jar \
  -shell_script /root/smitest.sh \
  -container_memory 1024 \
  -container_resources memory-mb=1024,vcores=1,yarn.io/gpu=1 \
  -num_containers 2

Or run it directly with the default container resources:

yarn jar /home/nht/hadoop-3.1.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.1.1.jar \
  -jar /home/nht/hadoop-3.1.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.1.1.jar \
  -shell_script /root/smitest.sh 
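Either way, once the application finishes, the container logs can be pulled to confirm the script ran. A quick check, assuming log aggregation is enabled and using the application ID from the client log shown below:

# fetch the aggregated container logs for the finished application
yarn logs -applicationId application_1538048201778_0050

# the script also writes its output to /tmp/smitest on whichever
# node ran the container, so it can be inspected there directly
cat /tmp/smitest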

The content of the script smitest.sh:

#!/bin/bash

# dump the GPU status visible inside the container for later inspection
/usr/bin/nvidia-smi > /tmp/smitest

The content of the output file /tmp/smitest:

Fri Sep 28 14:43:09 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.66                 Driver Version: 384.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0 Off |                  N/A |
| 27%   33C    P8    12W / 180W |     10MiB /  8112MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:02:00.0 Off |                  N/A |
| 27%   33C    P8    12W / 180W |     10MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Client output from the run:

2018-09-28 14:43:04,259 INFO distributedshell.Client: Initializing Client
2018-09-28 14:43:04,279 INFO distributedshell.Client: Running Client
2018-09-28 14:43:04,378 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-09-28 14:43:04,454 INFO client.RMProxy: Connecting to ResourceManager at /10.47.85.211:8032
2018-09-28 14:43:04,714 INFO distributedshell.Client: Got Cluster metric info from ASM, numNodeManagers=2
2018-09-28 14:43:04,734 INFO distributedshell.Client: Got Cluster node info from ASM
2018-09-28 14:43:04,734 INFO distributedshell.Client: Got node report from ASM for, nodeId=yita171:46027, nodeAddress=yita171:8042, nodeRackName=/default-rack, nodeNumContainers=0
2018-09-28 14:43:04,734 INFO distributedshell.Client: Got node report from ASM for, nodeId=yita172:43130, nodeAddress=yita172:8042, nodeRackName=/default-rack, nodeNumContainers=0
2018-09-28 14:43:04,750 INFO distributedshell.Client: Queue info, queueName=default, queueCurrentCapacity=0.0, queueMaxCapacity=1.0, queueApplicationCount=0, queueChildQueueCount=0
2018-09-28 14:43:04,758 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=SUBMIT_APPLICATIONS
2018-09-28 14:43:04,758 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=ADMINISTER_QUEUE
2018-09-28 14:43:04,758 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=SUBMIT_APPLICATIONS
2018-09-28 14:43:04,758 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=ADMINISTER_QUEUE
2018-09-28 14:43:04,787 INFO distributedshell.Client: Max mem capability of resources in this cluster 8192
2018-09-28 14:43:04,787 INFO distributedshell.Client: Max virtual cores capability of resources in this cluster 4
2018-09-28 14:43:04,798 WARN distributedshell.Client: AM Memory not specified, use 100 mb as AM memory
2018-09-28 14:43:04,798 WARN distributedshell.Client: AM vcore not specified, use 1 mb as AM vcores
2018-09-28 14:43:04,798 WARN distributedshell.Client: AM Resource capability=
2018-09-28 14:43:04,799 INFO distributedshell.Client: Copy App Master jar from local filesystem and add to local environment
2018-09-28 14:43:05,467 INFO distributedshell.Client: Set the environment for the application master
2018-09-28 14:43:05,467 INFO distributedshell.Client: Setting up app master command
2018-09-28 14:43:05,468 INFO distributedshell.Client: Completed setting up app master command {{JAVA_HOME}}/bin/java -Xmx100m org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster --container_type GUARANTEED --container_memory 1024 --container_vcores 1 --container_resources yarn.io/gpu=1 --num_containers 2 --priority 0 1>/AppMaster.stdout 2>/AppMaster.stderr
2018-09-28 14:43:05,477 INFO distributedshell.Client: Submitting application to ASM
2018-09-28 14:43:05,510 INFO impl.YarnClientImpl: Submitted application application_1538048201778_0050
2018-09-28 14:43:06,513 INFO distributedshell.Client: Got application report from ASM for, appId=50, clientToAMToken=null, appDiagnostics=, appMasterHost=yita172/10.47.85.211, appQueue=default, appMasterRpcPort=-1, appStartTime=1538116985487, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://yita172:8088/proxy/application_1538048201778_0050/, appUser=root
2018-09-28 14:43:07,516 INFO distributedshell.Client: Got application report from ASM for, appId=50, clientToAMToken=null, appDiagnostics=, appMasterHost=yita172/10.47.85.211, appQueue=default, appMasterRpcPort=-1, appStartTime=1538116985487, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://yita172:8088/proxy/application_1538048201778_0050/, appUser=root
2018-09-28 14:43:08,518 INFO distributedshell.Client: Got application report from ASM for, appId=50, clientToAMToken=null, appDiagnostics=, appMasterHost=yita172/10.47.85.211, appQueue=default, appMasterRpcPort=-1, appStartTime=1538116985487, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://yita172:8088/proxy/application_1538048201778_0050/, appUser=root
2018-09-28 14:43:09,520 INFO distributedshell.Client: Got application report from ASM for, appId=50, clientToAMToken=null, appDiagnostics=, appMasterHost=yita172/10.47.85.211, appQueue=default, appMasterRpcPort=-1, appStartTime=1538116985487, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://yita172:8088/proxy/application_1538048201778_0050/, appUser=root
2018-09-28 14:43:10,523 INFO distributedshell.Client: Got application report from ASM for, appId=50, clientToAMToken=null, appDiagnostics=, appMasterHost=yita172/10.47.85.211, appQueue=default, appMasterRpcPort=-1, appStartTime=1538116985487, yarnAppState=FINISHED, distributedFinalState=SUCCEEDED, appTrackingUrl=http://yita172:8088/proxy/application_1538048201778_0050/, appUser=root
2018-09-28 14:43:10,523 INFO distributedshell.Client: Application has completed successfully. Breaking monitoring loop
2018-09-28 14:43:10,523 INFO distributedshell.Client: Application completed successfully

GPU on YARN Usage Notes

  • Each NodeManager can only use the GPU resources of its own node, obtained either automatically or as manually specified by the user;
  • The resources available to YARN containers are uniform: every container in a run gets the same resources, which cannot be set per container; the available amounts can be specified in yarn-site.xml or on the command line;
  • Multiple GPUs can work in parallel: if each container needs one GPU and a node has two GPUs, two containers can be started and will run concurrently (see the status check after this list).
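To see what resources a node advertises and how many are currently allocated, the node status can be queried (the node ID below is the one reported in the client log above; whether custom resources such as yarn.io/gpu are itemized in the CLI output depends on the version — the ResourceManager web UI at port 8088 also shows them):

# show a node's capacity and usage as reported to the ResourceManager
yarn node -status yita172:43130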

Environment Setup

Install the Hadoop 3.1.1 environment (see http://blog.51cto.com/taoismli/2163097?source=dra for reference) and the GPU driver (installed beforehand), then run the verification steps above.

Troubleshooting

Problem 1:

Error encountered while installing Hadoop 3.1.1:

ERROR: Cannot set priority of datanode process

The actual cause is a bad configuration file on a worker node; keep the configuration files consistent across all nodes (a sync sketch follows).
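One way to keep them consistent is to push the master's configuration directory to every worker. A minimal sketch, assuming the hostnames and install path used elsewhere in this document:

# copy the Hadoop configuration from the master to each worker node
for host in yita171 yita172; do
  rsync -av /home/nht/hadoop-3.1.1/etc/hadoop/ ${host}:/home/nht/hadoop-3.1.1/etc/hadoop/
done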

Problem 2:

Command executed:

yarn jar /home/nht/hadoop-3.1.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.1.1.jar \
  -jar /home/nht/hadoop-3.1.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.1.1.jar \
  -shell_command /usr/bin/nvidia-smi \
  -container_resources memory-mb=3072,vcores=1,yarn.io/gpu=2 \
  -num_containers 2

Error:

18/09/26 17:37:20 INFO distributedshell.Client: Initializing Client
18/09/26 17:37:20 FATAL distributedshell.Client: Error running Client
org.apache.commons.cli.UnrecognizedOptionException: Unrecognized option: -container_resources
    at org.apache.commons.cli.Parser.processOption(Parser.java:363)
    at org.apache.commons.cli.Parser.parse(Parser.java:199)
    at org.apache.commons.cli.Parser.parse(Parser.java:85)
    at org.apache.hadoop.yarn.applications.distributedshell.Client.init(Client.java:313)
    at org.apache.hadoop.yarn.applications.distributedshell.Client.main(Client.java:206)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Resolution:

The yarn command on the PATH belongs to an older, conflicting Hadoop version that does not recognize the -container_resources option; invoke the 3.1.1 yarn by its absolute path:

/home/nht/hadoop-3.1.1/bin/yarn jar /home/nht/hadoop-3.1.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.1.1.jar \
  -jar /home/nht/hadoop-3.1.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.1.1.jar \
  -shell_command /usr/bin/nvidia-smi \
  -container_memory 1024 \
  -container_resources memory-mb=1024,vcores=1,yarn.io/gpu=3 \
  -num_containers 1
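To confirm which yarn binary the shell actually resolves, and therefore whether this version conflict applies, a quick check is:

# which yarn executable is first on the PATH, and which Hadoop version is it from?
which yarn
yarn version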

Problem 3:

Error launching appattempt_1538019293330_0001_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.

Cause:

Clock skew between the cluster nodes (NameNode/DataNode time not synchronized).

Solution:

See the time-synchronization document 服务器同步时间.md; a minimal sketch follows.
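As a rough sketch (the NTP server address is an assumed placeholder; substitute your own), the clocks can be brought back in sync on every node with:

# one-shot clock sync against an NTP server (assumed address)
ntpdate ntp.aliyun.com

# or keep the clock in sync continuously via the NTP daemon
systemctl enable --now ntpd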

Problem 4:

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create file/user/root/DistributedShell/application_1538047235045_0001/AppMaster.jar. Name node is in safe mode.

Solution:

Safe mode

When the NameNode starts, it first enters safe mode. If the proportion of blocks missing from the DataNodes reaches 1 - dfs.safemode.threshold.pct, the system stays in safe mode, i.e. read-only.

dfs.safemode.threshold.pct (default 0.999f) means that at HDFS startup, the NameNode may leave safe mode only once the number of blocks reported by the DataNodes reaches 0.999 of the block count recorded in the metadata; until then the filesystem remains read-only. Values greater than 1 make safe mode permanent. (In Hadoop 3 this property is named dfs.namenode.safemode.threshold-pct; the old name is a deprecated alias.)

The following line, taken from the NameNode startup log, shows the reported-block ratio of 1.0000 reaching the 0.9990 threshold:

The ratio of reported blocks 1.0000 has reached the threshold 0.9990. Safe mode will be turned off automatically in 18 seconds.

There are two ways to leave safe mode:

  1. Lower dfs.safemode.threshold.pct from its default of 0.999: setting it to 0 in hdfs-site.xml effectively disables safe mode (see the sketch below).
  2. Force-leave with the command hdfs dfsadmin -safemode leave (the older form hadoop dfsadmin -safemode leave still works in Hadoop 3 but prints a deprecation warning).
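As a sketch of option 1 (using the current Hadoop property name; the deprecated alias dfs.safemode.threshold.pct also still resolves), the hdfs-site.xml entry would be:

  <!-- 0 disables the safe-mode block-report threshold entirely -->
  <property>
    <name>dfs.namenode.safemode.threshold-pct</name>
    <value>0</value>
  </property>

The current state can be checked at any time with hdfs dfsadmin -safemode get before forcing an exit via option 2.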
