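This post only covers the problems I ran into and how I solved them; treat it as a reference.
For how to run and test the code, see this blog post:
Swin-Transformer classification source code (verified working)
The server I use has several CUDA versions installed, so, following the Swin-Transformer requirements, I pointed the environment at CUDA 10.1.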
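# Open ~/.bashrc in vim
vim ~/.bashrc
# Add these lines; the CUDA 10.1 paths are prepended so they take precedence over any other CUDA version already on the path
export PATH="/usr/local/cuda-10.1/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH"
export LIBRARY_PATH="/usr/local/cuda-10.1/lib64:$LIBRARY_PATH"
# Apply the changes to the current shell
source ~/.bashrc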
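To confirm that the shell really picks up CUDA 10.1 after sourcing ~/.bashrc, one option is to look at the first nvcc on PATH and read its release number. The following is a minimal sketch, not part of Swin-Transformer; the helper names parse_nvcc_release and active_nvcc_release are made up for this example, and it assumes nvcc prints a line containing "release X.Y", as recent toolkits do.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(version_text):
    """Return the release string (e.g. "10.1") found in `nvcc --version` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", version_text)
    return match.group(1) if match else None


def active_nvcc_release():
    """Run the first nvcc found on PATH and return its release, or None if nvcc is absent."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run(
        [nvcc, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,  # text mode; works on Python 3.7
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    # The path shows which CUDA installation wins on PATH.
    print("nvcc on PATH:", shutil.which("nvcc"))
    print("CUDA release:", active_nvcc_release())
```

If the printed path is not under /usr/local/cuda-10.1/bin, another CUDA installation is still ahead of it on PATH.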
Error message
=> merge config from configs/swin_base_patch4_window7_224.yaml
Traceback (most recent call last):
  File "main.py", line 302, in <module>
    assert amp is not None, "amp not installed!"
AssertionError: amp not installed!
Traceback (most recent call last):
  File "/home/anaconda3/envs/swin1/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/anaconda3/envs/swin1/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/anaconda3/envs/swin1/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/anaconda3/envs/swin1/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/anaconda3/envs/swin1/bin/python', '-u', 'main.py', '--local_rank=0', '--cfg', 'configs/swin_base_patch4_window7_224.yaml', '--data-path', 'imagenet', '--batch-size', '64']' returned non-zero exit status 1.
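At first this looks like amp is simply missing, but the error is actually caused by apex not having installed successfully.
So I reinstalled apex.
I found two causes for the failed installation.
Cause 1 was already resolved above with the CUDA 10.1 setup.
Fix for cause 2: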
git clone https://github.com/NVIDIA/apex.git
cd apex
# Add this command to build apex with the C++ and CUDA extensions
python setup.py install --cuda_ext --cpp_ext
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
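After reinstalling, it helps to check that apex is importable before launching training again, since main.py asserts that amp is available. Here is a minimal sketch; can_import is a hypothetical helper, and treating amp_C as the name of the compiled CUDA extension module is an assumption about the apex build rather than something this post verified.

```python
import importlib


def can_import(module_name):
    """Return True if the module imports cleanly, False if it raises ImportError."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # "apex.amp" is what main.py needs.
    # "amp_C" is assumed to be the extension built by --cuda_ext.
    for name in ("apex", "apex.amp", "amp_C"):
        print(name, "OK" if can_import(name) else "MISSING")
```

If apex and apex.amp import but the extension module does not, the build most likely fell back to a Python-only install.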
Error caused by requesting more GPUs than the server has
Check the server's GPU configuration:
watch -n 10 nvidia-smi
My server has 2 GPUs.
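The GPU count can also be read programmatically, so that --nproc_per_node never exceeds it. A rough sketch follows, assuming `nvidia-smi -L` prints one line per device starting with "GPU <index>:"; count_gpus and clamp_nproc are illustrative names, not part of any library.

```python
import re
import shutil
import subprocess


def count_gpus(listing):
    """Count lines of the form 'GPU <index>: ...' in `nvidia-smi -L` output."""
    return sum(
        1 for line in listing.splitlines() if re.match(r"GPU \d+:", line.strip())
    )


def clamp_nproc(requested, available):
    """Return a process count that does not exceed the number of GPUs."""
    if available < 1:
        raise ValueError("no GPU available")
    return min(requested, available)


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        result = subprocess.run(
            ["nvidia-smi", "-L"], stdout=subprocess.PIPE, universal_newlines=True
        )
        gpus = count_gpus(result.stdout)
        print("GPUs found:", gpus)
        if gpus:
            # 8 stands in for whatever value was originally requested.
            print("--nproc_per_node", clamp_nproc(8, gpus))
```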
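So I reran the command with --nproc_per_node 2:
python -m torch.distributed.launch --nproc_per_node 2 --master_port 29501 main.py --cfg configs/swin_tiny_patch4_window7_224.yaml --data-path imagenet --batch-size 64
The command originally used --master_port 12345, but port 12345 was already in use. Setting a different free port fixes it, e.g. --master_port 29501 (the value must be a valid TCP port, so at most 65535).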
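Rather than guessing a port, you can test whether one is free or ask the operating system for an unused one. This is a minimal sketch using only the standard socket module; is_port_free and find_free_port are hypothetical helper names, and the check assumes the launcher binds on 127.0.0.1. Another process could still grab the port between the check and the launch, so treat the result as a hint.

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    if not 1 <= port <= 65535:
        return False  # outside the valid TCP port range
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Covers "address already in use" and "permission denied".
            return False
    return True


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    print("12345 free:", is_port_free(12345))
    print("suggested --master_port:", find_free_port())
```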