https://www.cnblogs.com/caiyishuai/p/14324250.html
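This walkthrough uses training_results_v0.6 rather than the reference implementations provided in the mlperf/training repository. Note that the reference implementations serve as a starting point for benchmark implementations, but they are not fully optimized and are not intended for "real" performance evaluation of software frameworks or hardware.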
git clone https://github.com/Caiyishuai/training_results_v0.6
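The repository contains one directory per submitting vendor (Google, Intel, NVIDIA, and so on), each holding the code and scripts used to produce that vendor's results. Here the benchmark is run on NVIDIA GPUs.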
[root@2 ~]# cd training_results_v0.6/
[root@2 training_results_v0.6]# ls
Alibaba  CONTRIBUTING.md  Fujitsu  Google  Intel  LICENSE  NVIDIA  README.md
[root@2 training_results_v0.6]# cd NVIDIA/; ls
benchmarks  LICENSE.md  README.md  results  systems
[root@2 NVIDIA]# cd benchmarks/; ls
gnmt  maskrcnn  minigo  resnet  ssd  transformer
[root@2 implementations]# pwd
/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations
[root@2 implementations]# ls
data  download_dataset2.sh  download_dataset3.sh  download_dataset.sh  pytorch  verify_dataset.sh  wget-log
[root@2 implementations]# bash download_dataset.sh
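Reading download_dataset.sh shows the exact download URLs. If your connection is slow, you can copy the URLs into another download tool, fetch the files there, and then modify download_dataset.sh accordingly.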
[root@2 implementations]# cat download_dataset.sh
#! /usr/bin/env bash

# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -e

export LANG=C.UTF-8
export LC_ALL=C.UTF-8

OUTPUT_DIR=${1:-"data"}
echo "Writing to ${OUTPUT_DIR}. To change this, set the OUTPUT_DIR environment variable."

OUTPUT_DIR_DATA="${OUTPUT_DIR}/data"

mkdir -p $OUTPUT_DIR_DATA

echo "Downloading Europarl v7. This may take a while..."
wget -nc -nv -O ${OUTPUT_DIR_DATA}/europarl-v7-de-en.tgz \
  http://www.statmt.org/europarl/v7/de-en.tgz

echo "Downloading Common Crawl corpus. This may take a while..."
wget -nc -nv -O ${OUTPUT_DIR_DATA}/common-crawl.tgz \
  http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz

echo "Downloading News Commentary v11. This may take a while..."
wget -nc -nv -O ${OUTPUT_DIR_DATA}/nc-v11.tgz \
  http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz

echo "Downloading dev/test sets"
wget -nc -nv -O ${OUTPUT_DIR_DATA}/dev.tgz \
  http://data.statmt.org/wmt16/translation-task/dev.tgz
wget -nc -nv -O ${OUTPUT_DIR_DATA}/test.tgz \
  http://data.statmt.org/wmt16/translation-task/test.tgz

………………
done

echo "All done."
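To hand the links to another download tool, the URLs can be pulled out of the script instead of being copied by hand. The following is a minimal sketch, assuming the URLs appear as plain http/https tokens on non-comment lines, as in the listing above. list_urls is a hypothetical helper name, not part of the repository.

```shell
# Hypothetical helper: print every http(s) URL found on the non-comment
# lines of a script, one URL per line, in order of appearance.
list_urls() {
  grep -v '^[[:space:]]*#' "$1" | grep -oE 'https?://[^[:space:]"]+'
}

# Example usage (commented out so the sketch itself has no side effects):
# list_urls download_dataset.sh > urls.txt
```

Skipping comment lines keeps the Apache license link in the file header out of the result.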
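If the archives have already been downloaded into this directory by some other means, replace the wget commands in download_dataset.sh with mv commands like these: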
# The archives were fetched manually into data/, so move them to the
# file names the rest of the script expects instead of calling wget.
echo "Downloading Europarl v7. This may take a while..."
mv -i data/de-en.tgz ${OUTPUT_DIR_DATA}/europarl-v7-de-en.tgz

echo "Downloading Common Crawl corpus. This may take a while..."
mv -i data/training-parallel-commoncrawl.tgz ${OUTPUT_DIR_DATA}/common-crawl.tgz

echo "Downloading News Commentary v11. This may take a while..."
mv -i data/training-parallel-nc-v11.tgz ${OUTPUT_DIR_DATA}/nc-v11.tgz

echo "Downloading dev/test sets"
mv -i data/dev.tgz ${OUTPUT_DIR_DATA}/dev.tgz
mv -i data/test.tgz ${OUTPUT_DIR_DATA}/test.tgz
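Run the verification script (verify_dataset.sh) to confirm that the dataset was downloaded correctly. The data directory takes about 13 GB: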
[root@2 implementations]# du -sh data/
13G     data/
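A single archive can also be spot-checked by comparing its MD5 digest against a known-good value. The sketch below assumes GNU md5sum is installed. check_md5 is a hypothetical helper, and the expected digests have to come from a trusted source, since none are given here.

```shell
# Hypothetical helper: succeed only if FILE's MD5 digest equals EXPECTED.
check_md5() {
  local file=$1
  local expected=$2
  local actual
  actual=$(md5sum "$file" | awk '{ print $1 }')
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (got $actual)" >&2
    return 1
  fi
}

# Example usage (the digest is a placeholder, not a real checksum):
# check_md5 data/data/dev.tgz "<expected md5 digest>"
```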
The scripts and code used to run the training job are in the pytorch directory.
[root@2 implementations]# cd pytorch/
[root@2 pytorch]# ll
total 124
-rw-r--r-- 1 root root  5047 Jan 22 15:45 bind_launch.py
-rwxr-xr-x 1 root root  1419 Jan 22 15:45 config_DGX1_multi.sh
-rwxr-xr-x 1 root root   718 Jan 25 10:50 config_DGX1.sh
-rwxr-xr-x 1 root root  1951 Jan 22 15:45 config_DGX2_multi_16x16x32.sh
-rwxr-xr-x 1 root root  1950 Jan 22 15:45 config_DGX2_multi.sh
-rwxr-xr-x 1 root root   718 Jan 22 15:45 config_DGX2.sh
-rw-r--r-- 1 root root  1372 Jan 22 15:45 Dockerfile
-rw-r--r-- 1 root root  1129 Jan 22 15:45 LICENSE
-rw-r--r-- 1 root root  6494 Jan 22 15:45 mlperf_log_utils.py
-rw-r--r-- 1 root root  4145 Jan 22 15:45 preprocess_data.py
-rw-r--r-- 1 root root 12665 Jan 22 15:45 README.md
-rw-r--r-- 1 root root    43 Jan 22 15:45 requirements.txt
-rwxr-xr-x 1 root root  2220 Jan 22 15:45 run_and_time.sh
-rwxr-xr-x 1 root root  7173 Jan 25 10:56 run.sub
drwxr-xr-x 3 root root    45 Jan 22 15:45 scripts
drwxr-xr-x 7 root root    90 Jan 22 15:45 seq2seq
-rw-r--r-- 1 root root  1082 Jan 22 15:45 setup.py
-rw-r--r-- 1 root root 25927 Jan 22 15:45 train.py
-rw-r--r-- 1 root root  8056 Jan 22 15:45 translate.py
Next, edit the config_*.sh script that matches your system.
Parameters to edit:
DGXNGPU=8
DGXSOCKETCORES=18
DGXNSOCKET=2
You can get the GPU information with the nvidia-smi command and the CPU information with the lscpu command, in particular these two fields:
Core(s) per socket: 18
Socket(s): 2
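The two lscpu fields can be read programmatically rather than by eye. Below is a minimal sketch. lscpu_field is a hypothetical helper, and the commented usage assumes a system where nvidia-smi -L prints one line per GPU (no MIG partitions).

```shell
# Hypothetical helper: print the value of one field from lscpu-style
# "name: value" text read on stdin.
lscpu_field() {
  awk -F: -v key="$1" '
    {
      name = $1
      gsub(/^[ \t]+|[ \t]+$/, "", name)    # tolerate indented field names
      if (name == key) {
        value = $2
        gsub(/^[ \t]+|[ \t]+$/, "", value) # strip the column padding
        print value
        exit
      }
    }'
}

# Example usage on a real machine (commented out; needs lscpu and nvidia-smi):
# DGXSOCKETCORES=$(lscpu | lscpu_field "Core(s) per socket")
# DGXNSOCKET=$(lscpu | lscpu_field "Socket(s)")
# DGXNGPU=$(nvidia-smi -L | wc -l)
```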
docker build -t mlperf-nvidia:rnn_translator .
The build takes quite a while.
Let's take a look at the Dockerfile:
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.05-py3
FROM ${FROM_IMAGE_NAME}

# Install dependencies for system configuration logger
RUN apt-get update && apt-get install -y --no-install-recommends \
        infiniband-diags \
        pciutils && \
    rm -rf /var/lib/apt/lists/*

# Install Python dependencies
WORKDIR /workspace/rnn_translator

COPY requirements.txt .
RUN pip install --no-cache-dir https://github.com/mlperf/training/archive/6289993e1e9f0f5c4534336df83ff199bd0cdb75.zip#subdirectory=compliance \
 && pip install --no-cache-dir -r requirements.txt

# Copy & build extensions
COPY seq2seq/csrc seq2seq/csrc
COPY setup.py .
RUN pip install .

# Copy GNMT code
COPY . .

# Configure environment variables
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
Placing ARG before FROM, as this Dockerfile does, is only supported from Docker 17.05.0-ce (2017-05-04) onward. Checking my Docker version:
[root@2 pytorch]# docker version
Client: Docker Engine - Community
 Version:           20.10.2
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        2291f61
 Built:             Mon Dec 28 16:17:48 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true
…………
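The version requirement can be checked in a script before starting a long build. This is a minimal sketch that assumes GNU sort with the -V (version sort) option. version_ge is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to
# version $2. The smaller of the two sorts first under sort -V.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example usage (commented out; needs a Docker installation):
# client=$(docker version --format '{{.Client.Version}}')
# version_ge "$client" 17.05.0 || echo "Docker $client is too old for ARG before FROM" >&2
```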
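This test uses config_DGX1.sh, so DGXSYSTEM is set to DGX1. PULL is set to 0 to indicate that the local image should be used instead of pulling the Docker image from a registry. A new directory, logs, was created to store the benchmark log files, and the data directory path is supplied when launching the benchmark run, as follows: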
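[root@2 pytorch]# DATADIR=/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/data \
    LOGDIR=/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/logs \
    PULL=0 DGXSYSTEM=DGX1 ./run.sub
The variables are given on the same command line as ./run.sub so that they are placed in the script's environment. Plain assignments entered as separate commands, without export, are not visible to run.sub.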
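Before launching, it can help to confirm that the data directory exists and to create the logs directory in the same step. The sketch below is one way to do that. prepare_run_dirs is a hypothetical helper and is not part of run.sub.

```shell
# Hypothetical helper: fail if the data directory is missing, otherwise
# create the log directory (including parents) if it does not exist yet.
prepare_run_dirs() {
  local datadir=$1
  local logdir=$2
  if [ ! -d "$datadir" ]; then
    echo "ERR: data directory not found: $datadir" >&2
    return 1
  fi
  mkdir -p "$logdir"
}

# Example usage (commented out; the launch needs the Docker image and GPUs):
# BASE=/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations
# prepare_run_dirs "$BASE/data" "$BASE/logs" &&
#   DATADIR="$BASE/data" LOGDIR="$BASE/logs" PULL=0 DGXSYSTEM=DGX1 ./run.sub
```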
If the following error appears, the machine has no usable GPU:
[root@2 pytorch]# ./run.sub
mlperf-nvidia:rnn_translator
nvidia-docker | 2021/01/25 13:48:39 Error: Could not load UVM kernel module. Is nvidia-modprobe installed?
ERR: Base container launch failed.
Check the GPU information with nvidia-smi.
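The same check can be scripted so that a launch fails early with a clear message. In this minimal sketch, gpu_available is a hypothetical helper. Its optional argument exists only so the probe command can be swapped out for testing, and it defaults to nvidia-smi.

```shell
# Hypothetical helper: succeed if the probe command exists and its -L
# (list GPUs) invocation exits successfully.
gpu_available() {
  local probe=${1:-nvidia-smi}
  command -v "$probe" >/dev/null 2>&1 && "$probe" -L >/dev/null 2>&1
}

# Example usage:
# gpu_available || { echo "ERR: no usable NVIDIA GPU or driver" >&2; exit 1; }
```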
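If everything goes well, the benchmark is executed 10 times and the log files are stored in the specified directory. Because 8 GPUs are specified in the config file, all 8 GPUs will be seen training the GNMT model.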
You can run watch -d -n 1 nvidia-smi to monitor GPU usage periodically.
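To confirm from a script that all GPUs are busy, the utilization column can be queried on its own and counted. The sketch below assumes the output format of nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits, which is one integer per GPU per line. count_busy_gpus is a hypothetical helper.

```shell
# Hypothetical helper: count the GPUs whose utilization is above 0 %.
# Reads one utilization value per line on stdin.
count_busy_gpus() {
  awk '$1 + 0 > 0 { busy++ } END { print busy + 0 }'
}

# Example usage on a GPU machine (commented out; needs nvidia-smi):
# nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | count_busy_gpus
```

With DGXNGPU=8 and training under way, the count would be expected to reach 8.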
PULL=0 can also be set directly inside run.sub:
# Pull latest image
PULL=0
if [[ "${PULL}" != "0" ]]; then
  DOCKERPULL="docker pull $CONT"
  pids=();
  for hostn in ${hosts[@]}; do
    timeout -k 600s 600s \
      $(eval echo $SRUN) $DOCKERPULL &
    pids+=($!);
  done
  wait "${pids[@]}"
  success=$? ; if [ $success -ne 0 ]; then echo "ERR: Image pull failed."; exit $success ; fi
fi
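Hard-coding PULL=0 overrides whatever value the caller passed in. An alternative sketch, which is not the repository's code, defaults to the local image but still honors the environment. It uses the same ${VAR:-default} expansion that download_dataset.sh applies to its output directory. should_pull is a hypothetical helper.

```shell
# Hypothetical helper: succeed if the image should be pulled.
# PULL defaults to 0 (use the local image) when it is unset or empty.
should_pull() {
  [ "${PULL:-0}" != "0" ]
}

# Example usage:
# if should_pull; then docker pull "$CONT"; fi
```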