MLPerf

https://www.cnblogs.com/caiyishuai/p/14324250.html

Clone the MLPerf training results repository to the local machine

We use training_results_v0.6 here, not the reference implementations provided in the mlperf/training repository. Note that the reference implementations effectively serve as starting points for benchmark implementations; they are not fully optimized and are not intended for "real" performance evaluation of software frameworks or hardware.

git clone https://github.com/Caiyishuai/training_results_v0.6

This repository contains a directory for each vendor submission (Google, Intel, NVIDIA, and so on) with the code and scripts used to generate the results. Here the benchmarks are run on NVIDIA GPUs.


[root@2 ~]# cd training_results_v0.6/

[root@2 training_results_v0.6]# ls
Alibaba  CONTRIBUTING.md  Fujitsu  Google  Intel  LICENSE  NVIDIA  README.md

[root@2 training_results_v0.6]# cd NVIDIA/; ls
benchmarks  LICENSE.md  README.md  results  systems

[root@2 NVIDIA]# cd benchmarks/; ls
gnmt  maskrcnn  minigo  resnet  ssd  transformer


Download and verify the dataset

[root@2 implementations]# pwd
/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations

[root@2 implementations]# ls
data  download_dataset2.sh  download_dataset3.sh  download_dataset.sh  pytorch  verify_dataset.sh  wget-log
[root@2 implementations]# bash download_dataset.sh

Inspect download_dataset.sh to see the exact download links. If the network is slow, the links can be copied into another download tool and fetched separately, and download_dataset.sh can then be modified accordingly.
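For instance, the Europarl archive can be fetched separately with any downloader and placed under data/ (the URL is taken from the script shown below; wget -c is just one option):

wget -c -O data/de-en.tgz http://www.statmt.org/europarl/v7/de-en.tgz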


[root@2 implementations]# cat download_dataset.sh
#! /usr/bin/env bash

# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -e

export LANG=C.UTF-8
export LC_ALL=C.UTF-8

OUTPUT_DIR=${1:-"data"}
echo "Writing to ${OUTPUT_DIR}. To change this, set the OUTPUT_DIR environment variable."

OUTPUT_DIR_DATA="${OUTPUT_DIR}/data"

mkdir -p $OUTPUT_DIR_DATA

echo "Downloading Europarl v7. This may take a while..."
wget -nc -nv -O ${OUTPUT_DIR_DATA}/europarl-v7-de-en.tgz \
  http://www.statmt.org/europarl/v7/de-en.tgz

echo "Downloading Common Crawl corpus. This may take a while..."
wget -nc -nv -O ${OUTPUT_DIR_DATA}/common-crawl.tgz \
  http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz

echo "Downloading News Commentary v11. This may take a while..."
wget -nc -nv -O ${OUTPUT_DIR_DATA}/nc-v11.tgz \
  http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz

echo "Downloading dev/test sets"
wget -nc -nv -O  ${OUTPUT_DIR_DATA}/dev.tgz \
  http://data.statmt.org/wmt16/translation-task/dev.tgz
wget -nc -nv -O  ${OUTPUT_DIR_DATA}/test.tgz \
  http://data.statmt.org/wmt16/translation-task/test.tgz

………………

done

echo "All done."


If the files have already been downloaded into this directory by other means, the wget commands above can be changed as follows:


echo "Downloading Europarl v7. This may take a while..."
mv -i data/de-en.tgz  ${OUTPUT_DIR_DATA}/europarl-v7-de-en.tgz \
  

echo "Downloading Common Crawl corpus. This may take a while..."
mv -i data/training-parallel-commoncrawl.tgz  ${OUTPUT_DIR_DATA}/common-crawl.tgz \
  
echo "Downloading News Commentary v11. This may take a while..."
mv -i data/training-parallel-nc-v11.tgz  ${OUTPUT_DIR_DATA}/nc-v11.tgz \
  

echo "Downloading dev/test sets"
mv -i data/dev.tgz  ${OUTPUT_DIR_DATA}/dev.tgz \
  
mv -i data/test.tgz  ${OUTPUT_DIR_DATA}/test.tgz \


Run the verification script to confirm that the dataset was downloaded correctly.
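For example, from the implementations directory (output omitted):

[root@2 implementations]# bash verify_dataset.sh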

[root@2 implementations]# du -sh data/
13G     data/

Edit the configuration files and prepare for training

The scripts and code used to launch the training job are located in the pytorch directory.


[root@2 implementations]# cd pytorch/
[root@2 pytorch]# ll
total 124
-rw-r--r-- 1 root root  5047 Jan 22 15:45 bind_launch.py
-rwxr-xr-x 1 root root  1419 Jan 22 15:45 config_DGX1_multi.sh
-rwxr-xr-x 1 root root   718 Jan 25 10:50 config_DGX1.sh
-rwxr-xr-x 1 root root  1951 Jan 22 15:45 config_DGX2_multi_16x16x32.sh
-rwxr-xr-x 1 root root  1950 Jan 22 15:45 config_DGX2_multi.sh
-rwxr-xr-x 1 root root   718 Jan 22 15:45 config_DGX2.sh
-rw-r--r-- 1 root root  1372 Jan 22 15:45 Dockerfile
-rw-r--r-- 1 root root  1129 Jan 22 15:45 LICENSE
-rw-r--r-- 1 root root  6494 Jan 22 15:45 mlperf_log_utils.py
-rw-r--r-- 1 root root  4145 Jan 22 15:45 preprocess_data.py
-rw-r--r-- 1 root root 12665 Jan 22 15:45 README.md
-rw-r--r-- 1 root root    43 Jan 22 15:45 requirements.txt
-rwxr-xr-x 1 root root  2220 Jan 22 15:45 run_and_time.sh
-rwxr-xr-x 1 root root  7173 Jan 25 10:56 run.sub
drwxr-xr-x 3 root root    45 Jan 22 15:45 scripts
drwxr-xr-x 7 root root    90 Jan 22 15:45 seq2seq
-rw-r--r-- 1 root root  1082 Jan 22 15:45 setup.py
-rw-r--r-- 1 root root 25927 Jan 22 15:45 train.py
-rw-r--r-- 1 root root  8056 Jan 22 15:45 translate.py


config_<system>.sh needs to be edited to reflect your system configuration. If the system has 8 or 16 GPUs, the existing config_DGX1.sh or config_DGX2.sh configuration file can be used to launch the training job.

Parameters to edit:
DGXNGPU = 8
DGXSOCKETCORES = 18
DGXNSOCKET = 2

GPU information can be obtained with the nvidia-smi command and CPU information with the lscpu command, in particular:

Core(s) per socket: 18
Socket(s): 2
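These values map directly onto the three parameters above; a minimal sketch for reading them off (assuming the standard English output of nvidia-smi and lscpu):

nvidia-smi --list-gpus | wc -l        # number of GPUs        -> DGXNGPU
lscpu | grep "Core(s) per socket"     # cores per socket      -> DGXSOCKETCORES
lscpu | grep "Socket(s)"              # number of CPU sockets -> DGXNSOCKET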

Build the Docker image

docker build -t mlperf-nvidia:rnn_translator .

The build takes quite a while.
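Once the build finishes, the image should appear in the local image list (the tag comes from the build command above):

docker images | grep mlperf-nvidia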


Let's take a look at the Dockerfile:


# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.05-py3
FROM ${FROM_IMAGE_NAME}

# Install dependencies for system configuration logger
RUN apt-get update && apt-get install -y --no-install-recommends \
        infiniband-diags \
        pciutils && \
    rm -rf /var/lib/apt/lists/*

# Install Python dependencies
WORKDIR /workspace/rnn_translator

COPY requirements.txt .
RUN pip install --no-cache-dir https://github.com/mlperf/training/archive/6289993e1e9f0f5c4534336df83ff199bd0cdb75.zip#subdirectory=compliance \
 && pip install --no-cache-dir -r requirements.txt

# Copy & build extensions
COPY seq2seq/csrc seq2seq/csrc
COPY setup.py .
RUN pip install .

# Copy GNMT code
COPY . .

# Configure environment variables
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8


Docker note: using ARG before FROM in a Dockerfile is only supported since Docker 17.05.0-ce (2017-05-04). Check the local Docker version:


[root@2 pytorch]# docker version
Client: Docker Engine - Community
 Version:           20.10.2
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        2291f61
 Built:             Mon Dec 28 16:17:48 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true
…………
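Since the local Docker is 20.10.2, ARG before FROM works fine. If the Docker version were older than 17.05, one workaround sketch would be to hard-code the base image in the Dockerfile (hypothetical edit, not needed here):

# Hypothetical fallback for Docker < 17.05: drop the ARG line and
# hard-code the base image in the FROM instruction.
sed -i '/^ARG FROM_IMAGE_NAME/d; s|^FROM .*|FROM nvcr.io/nvidia/pytorch:19.05-py3|' Dockerfile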


For this test, config_DGX1.sh is used, so DGXSYSTEM is set to DGX1. PULL is also set to 0 to indicate that the local image should be used instead of pulling the Docker image from a registry. A new directory "logs" is created to store the benchmark log files, and the data directory path is supplied when launching the benchmark run, as follows:


[root@2 pytorch]# export DATADIR=/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/data

[root@2 pytorch]# export LOGDIR=/data/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/logs

[root@2 pytorch]# export PULL=0 DGXSYSTEM=DGX1

[root@2 pytorch]# ./run.sub


If the following error is reported, the machine has no usable GPU (the NVIDIA kernel modules could not be loaded):

[root@2 pytorch]# ./run.sub
mlperf-nvidia:rnn_translator
nvidia-docker | 2021/01/25 13:48:39 Error: Could not load UVM kernel module. Is nvidia-modprobe installed?
ERR: Base container launch failed.

Check the GPU information with nvidia-smi.
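A minimal sanity check (the UVM module named in the error above is part of the NVIDIA driver stack):

nvidia-smi            # should list the installed GPUs and the driver version
lsmod | grep nvidia   # should show the nvidia / nvidia_uvm kernel modules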

If everything goes well, the benchmark will be executed 10 times and the log files will be stored in the specified directory. Since 8 GPUs were specified in the configuration file, all 8 GPUs will be used to train the GNMT model.
GPU utilization can be monitored periodically with watch -d -n 1 nvidia-smi.
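Once the runs finish, the per-run log files should appear under the LOGDIR exported earlier (exact file names will vary):

ls -lh $LOGDIR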

PULL=0 can also be set inside run.sub:


# Pull latest image
PULL=0
if [[ "${PULL}" != "0" ]]; then
  DOCKERPULL="docker pull $CONT"
  pids=();
  for hostn in ${hosts[@]}; do
    timeout -k 600s 600s \
      $(eval echo $SRUN) $DOCKERPULL &
    pids+=($!);
  done
  wait "${pids[@]}"
  success=$? ; if [ $success -ne 0 ]; then echo "ERR: Image pull failed."; exit $success ; fi
fi

