YARN in hadoop-3.1.1 supports scheduling and management of GPUs, mainly in two modes.
For the detailed configuration steps, see the document Using GPU On YARN.md.
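For orientation, these are the key settings that make GPU a schedulable resource in Hadoop 3.1; this is a minimal sketch, and Using GPU On YARN.md covers the full configuration (including device isolation):

```xml
<!-- resource-types.xml: declare the GPU resource type -->
<property>
  <name>yarn.resource-types</name>
  <value>yarn.io/gpu</value>
</property>

<!-- yarn-site.xml (NodeManager): enable the GPU resource plugin -->
<property>
  <name>yarn.nodemanager.resource-plugins</name>
  <value>yarn.io/gpu</value>
</property>

<!-- capacity-scheduler.xml: DominantResourceCalculator is required so the
     scheduler accounts for resources beyond memory and vcores -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```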
Run the command:
yarn jar /home/nht/hadoop-3.1.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.1.1.jar \
-jar /home/nht/hadoop-3.1.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.1.1.jar \
-shell_script /root/smitest.sh \
-container_memory 1024 \
-container_resources memory-mb=1024,vcores=1,yarn.io/gpu=1 \
-num_containers 2
Or run it directly:
yarn jar /home/nht/hadoop-3.1.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.1.1.jar \
-jar /home/nht/hadoop-3.1.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.1.1.jar \
-shell_script /root/smitest.sh
Contents of the script smitest.sh:
#!/bin/bash
/usr/bin/nvidia-smi > /tmp/smitest
Contents of the output file /tmp/smitest:
Fri Sep 28 14:43:09 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.66 Driver Version: 384.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:01:00.0 Off | N/A |
| 27% 33C P8 12W / 180W | 10MiB / 8112MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 00000000:02:00.0 Off | N/A |
| 27% 33C P8 12W / 180W | 10MiB / 8114MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Client output from the run:
2018-09-28 14:43:04,259 INFO distributedshell.Client: Initializing Client
2018-09-28 14:43:04,279 INFO distributedshell.Client: Running Client
2018-09-28 14:43:04,378 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-09-28 14:43:04,454 INFO client.RMProxy: Connecting to ResourceManager at /10.47.85.211:8032
2018-09-28 14:43:04,714 INFO distributedshell.Client: Got Cluster metric info from ASM, numNodeManagers=2
2018-09-28 14:43:04,734 INFO distributedshell.Client: Got Cluster node info from ASM
2018-09-28 14:43:04,734 INFO distributedshell.Client: Got node report from ASM for, nodeId=yita171:46027, nodeAddress=yita171:8042, nodeRackName=/default-rack, nodeNumContainers=0
2018-09-28 14:43:04,734 INFO distributedshell.Client: Got node report from ASM for, nodeId=yita172:43130, nodeAddress=yita172:8042, nodeRackName=/default-rack, nodeNumContainers=0
2018-09-28 14:43:04,750 INFO distributedshell.Client: Queue info, queueName=default, queueCurrentCapacity=0.0, queueMaxCapacity=1.0, queueApplicationCount=0, queueChildQueueCount=0
2018-09-28 14:43:04,758 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=SUBMIT_APPLICATIONS
2018-09-28 14:43:04,758 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=ADMINISTER_QUEUE
2018-09-28 14:43:04,758 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=SUBMIT_APPLICATIONS
2018-09-28 14:43:04,758 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=ADMINISTER_QUEUE
2018-09-28 14:43:04,787 INFO distributedshell.Client: Max mem capability of resources in this cluster 8192
2018-09-28 14:43:04,787 INFO distributedshell.Client: Max virtual cores capability of resources in this cluster 4
2018-09-28 14:43:04,798 WARN distributedshell.Client: AM Memory not specified, use 100 mb as AM memory
2018-09-28 14:43:04,798 WARN distributedshell.Client: AM vcore not specified, use 1 mb as AM vcores
2018-09-28 14:43:04,798 WARN distributedshell.Client: AM Resource capability=
2018-09-28 14:43:04,799 INFO distributedshell.Client: Copy App Master jar from local filesystem and add to local environment
2018-09-28 14:43:05,467 INFO distributedshell.Client: Set the environment for the application master
2018-09-28 14:43:05,467 INFO distributedshell.Client: Setting up app master command
2018-09-28 14:43:05,468 INFO distributedshell.Client: Completed setting up app master command {{JAVA_HOME}}/bin/java -Xmx100m org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster --container_type GUARANTEED --container_memory 1024 --container_vcores 1 --container_resources yarn.io/gpu=1 --num_containers 2 --priority 0 1>/AppMaster.stdout 2>/AppMaster.stderr
2018-09-28 14:43:05,477 INFO distributedshell.Client: Submitting application to ASM
2018-09-28 14:43:05,510 INFO impl.YarnClientImpl: Submitted application application_1538048201778_0050
2018-09-28 14:43:06,513 INFO distributedshell.Client: Got application report from ASM for, appId=50, clientToAMToken=null, appDiagnostics=, appMasterHost=yita172/10.47.85.211, appQueue=default, appMasterRpcPort=-1, appStartTime=1538116985487, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://yita172:8088/proxy/application_1538048201778_0050/, appUser=root
2018-09-28 14:43:07,516 INFO distributedshell.Client: Got application report from ASM for, appId=50, clientToAMToken=null, appDiagnostics=, appMasterHost=yita172/10.47.85.211, appQueue=default, appMasterRpcPort=-1, appStartTime=1538116985487, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://yita172:8088/proxy/application_1538048201778_0050/, appUser=root
2018-09-28 14:43:08,518 INFO distributedshell.Client: Got application report from ASM for, appId=50, clientToAMToken=null, appDiagnostics=, appMasterHost=yita172/10.47.85.211, appQueue=default, appMasterRpcPort=-1, appStartTime=1538116985487, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://yita172:8088/proxy/application_1538048201778_0050/, appUser=root
2018-09-28 14:43:09,520 INFO distributedshell.Client: Got application report from ASM for, appId=50, clientToAMToken=null, appDiagnostics=, appMasterHost=yita172/10.47.85.211, appQueue=default, appMasterRpcPort=-1, appStartTime=1538116985487, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://yita172:8088/proxy/application_1538048201778_0050/, appUser=root
2018-09-28 14:43:10,523 INFO distributedshell.Client: Got application report from ASM for, appId=50, clientToAMToken=null, appDiagnostics=, appMasterHost=yita172/10.47.85.211, appQueue=default, appMasterRpcPort=-1, appStartTime=1538116985487, yarnAppState=FINISHED, distributedFinalState=SUCCEEDED, appTrackingUrl=http://yita172:8088/proxy/application_1538048201778_0050/, appUser=root
2018-09-28 14:43:10,523 INFO distributedshell.Client: Application has completed successfully. Breaking monitoring loop
2018-09-28 14:43:10,523 INFO distributedshell.Client: Application completed successfully
Environment setup: install hadoop-3.1.1 (following http://blog.51cto.com/taoismli/2163097?source=dra) and the GPU driver (installed beforehand);
then run the commands above to verify.
Problem 1:
Issue encountered while installing hadoop-3.1.1:
ERROR: Cannot set priority of datanode process
The actual cause was that the worker node's configuration files were wrong; keep the configuration files identical across all nodes.
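A minimal sketch of keeping configuration in sync, simulated here with two local directories standing in for master and worker; on a real cluster you would push over ssh instead (e.g. `rsync -av $CONF/ worker:$CONF/`):

```shell
# Simulate a master config directory and a worker copy that has drifted.
master=$(mktemp -d); worker=$(mktemp -d)
echo '<configuration/>' > "$master/yarn-site.xml"
echo 'stale contents'   > "$worker/yarn-site.xml"   # worker has drifted

# Push the master's copy to the worker and verify both trees now match.
cp -a "$master"/. "$worker"/
diff -r "$master" "$worker" && echo "configs in sync"
```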
Problem 2:
Running:
yarn jar /home/nht/hadoop-3.1.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.1.1.jar \
-jar /home/nht/hadoop-3.1.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.1.1.jar \
-shell_command /usr/bin/nvidia-smi \
-container_resources memory-mb=3072,vcores=1,yarn.io/gpu=2 \
-num_containers 2
fails with:
18/09/26 17:37:20 INFO distributedshell.Client: Initializing Client
18/09/26 17:37:20 FATAL distributedshell.Client: Error running Client
org.apache.commons.cli.UnrecognizedOptionException: Unrecognized option: -container_resources
at org.apache.commons.cli.Parser.processOption(Parser.java:363)
at org.apache.commons.cli.Parser.parse(Parser.java:199)
at org.apache.commons.cli.Parser.parse(Parser.java:85)
at org.apache.hadoop.yarn.applications.distributedshell.Client.init(Client.java:313)
at org.apache.hadoop.yarn.applications.distributedshell.Client.main(Client.java:206)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Solution:
This is a yarn version conflict: an older `yarn` earlier on the PATH does not recognize the new option. Invoke the 3.1.1 yarn by its absolute path:
/home/nht/hadoop-3.1.1/bin/yarn jar /home/nht/hadoop-3.1.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.1.1.jar \
-jar /home/nht/hadoop-3.1.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.1.1.jar \
-shell_command /usr/bin/nvidia-smi \
-container_memory 1024 \
-container_resources memory-mb=1024,vcores=1,yarn.io/gpu=3 \
-num_containers 1
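To see which binary a bare `yarn` would resolve to (and hence whether an older install shadows 3.1.1), `command -v` prints the first match on the PATH; the explicit version check is left commented out since it assumes the Hadoop install is present:

```shell
# Show which `yarn` the shell would actually run; PATH order decides.
command -v yarn || echo "yarn not on PATH"
# With the 3.1.1 install present, you could confirm its version explicitly:
#   /home/nht/hadoop-3.1.1/bin/yarn version | head -1
```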
Problem 3:
Error launching appattempt_1538019293330_0001_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
Cause:
The clocks on the namenode and datanode hosts are out of sync.
Solution:
See the time synchronization document 服务器同步时间.md.
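As a hedged sketch (the actual procedure is in 服务器同步时间.md): point every node at the same NTP source, for example via chrony. The server address below is a placeholder, not from this cluster's setup:

```
# /etc/chrony.conf on each worker node: use one common time source.
# (10.47.85.211 is a placeholder, e.g. the master acting as NTP server.)
server 10.47.85.211 iburst
```

After editing, restart the daemon (`systemctl restart chronyd`) and check the offset with `chronyc tracking`.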
Problem 4:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create file/user/root/DistributedShell/application_1538047235045_0001/AppMaster.jar. Name node is in safe mode.
Solution:
This is HDFS safe mode. When the NameNode starts, it first enters safe mode; if the fraction of blocks missing from DataNode reports exceeds 1 - dfs.safemode.threshold.pct, the filesystem stays in safe mode, i.e. read-only.
dfs.safemode.threshold.pct (default 0.999f) means that on startup HDFS may leave safe mode only after the DataNodes have reported at least 0.999 of the blocks recorded in the metadata; until then it remains read-only. Setting it to 1 keeps HDFS in safe mode permanently.
The line below, excerpted from a NameNode startup log, shows the reported-block ratio 1.0 reaching the 0.9990 threshold:
The ratio of reported blocks 1.0000 has reached the threshold 0.9990. Safe mode will be turned off automatically in 18 seconds.
There are two ways to leave safe mode: wait for the NameNode to turn it off automatically once enough blocks have been reported (as in the log line above), or force it manually:
hadoop dfsadmin -safemode leave
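If the threshold itself needs adjusting, it lives in hdfs-site.xml; note that in Hadoop 2+ the property is named dfs.namenode.safemode.threshold-pct, with dfs.safemode.threshold.pct being the deprecated older name. A sketch with the default value:

```xml
<!-- hdfs-site.xml: fraction of blocks that must be reported before the
     NameNode leaves safe mode automatically (default 0.999f) -->
<property>
  <name>dfs.namenode.safemode.threshold-pct</name>
  <value>0.999f</value>
</property>
```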