Swin Transformer image classification test

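This post only covers the problems I ran into and how I solved them. It is for reference only.

For how to run the code, see this blog post:

Swin-Transformer classification source code (runs successfully)

Multiple CUDA versions on the server

The server I use has several CUDA versions installed. Following the Swin-Transformer requirements, I pinned CUDA 10.1.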

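# Open .bashrc with vim
vim ~/.bashrc

# Add the configuration. The CUDA 10.1 directories go first so they take
# precedence over any other CUDA version already on these paths.
export PATH="/usr/local/cuda-10.1/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH"
export LIBRARY_PATH="/usr/local/cuda-10.1/lib64:$LIBRARY_PATH"

# Apply the changes
source ~/.bashrc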
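
The order of the entries in PATH decides which nvcc gets picked when several CUDA versions are installed. The following is a minimal sketch of that lookup rule, not the shell's actual implementation: the helper names `first_nvcc` and `make_fake_cuda` and the temporary directory layout are made up for illustration, and the real shell also checks that the file is executable.

```python
import os
import tempfile


def first_nvcc(path_value):
    """Scan a PATH-style string left to right and return the first nvcc found."""
    for directory in path_value.split(os.pathsep):
        candidate = os.path.join(directory, "nvcc")
        if directory and os.path.isfile(candidate):
            return candidate
    return None


def make_fake_cuda(root, version):
    """Create a hypothetical cuda-<version>/bin/nvcc file for this demo only."""
    bin_dir = os.path.join(root, "cuda-" + version, "bin")
    os.makedirs(bin_dir)
    open(os.path.join(bin_dir, "nvcc"), "w").close()
    return bin_dir


with tempfile.TemporaryDirectory() as root:
    old = make_fake_cuda(root, "10.0")  # stands in for the server default
    new = make_fake_cuda(root, "10.1")  # stands in for the version we want
    # 10.1 listed after 10.0: the 10.0 nvcc is found first
    picked_when_appended = first_nvcc(os.pathsep.join([old, new]))
    # 10.1 listed before 10.0: the 10.1 nvcc is found first
    picked_when_prepended = first_nvcc(os.pathsep.join([new, old]))

print(picked_when_appended)
print(picked_when_prepended)
```

On the real server, running `which nvcc` and `nvcc --version` after `source ~/.bashrc` shows which toolkit is actually active.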

apex issues

Error message

=> merge config from configs/swin_base_patch4_window7_224.yaml
Traceback (most recent call last):
  File "main.py", line 302, in <module>
    assert amp is not None, "amp not installed!"
AssertionError: amp not installed!
Traceback (most recent call last):
  File "/home/anaconda3/envs/swin1/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/anaconda3/envs/swin1/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/anaconda3/envs/swin1/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/anaconda3/envs/swin1/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/anaconda3/envs/swin1/bin/python', '-u', 'main.py', '--local_rank=0', '--cfg', 'configs/swin_base_patch4_window7_224.yaml', '--data-path', 'imagenet', '--batch-size', '64']' returned non-zero exit status 1.

At first this looked like a missing amp module, but the error is actually caused by apex not having installed successfully.
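
The link between the assertion and apex can be sketched as follows. This is an assumed reconstruction of the usual optional-import pattern, not a copy of main.py: the helper names `amp_available` and `require_amp` are hypothetical, and only the assertion message is taken from the traceback.

```python
# Optional import: amp only exists if apex was built and installed correctly.
try:
    from apex import amp
except ImportError:
    amp = None  # apex is missing or its installation is broken


def amp_available():
    """Report whether the apex amp module could be imported."""
    return amp is not None


def require_amp():
    """Fail with the same message as the traceback when amp is unavailable."""
    assert amp is not None, "amp not installed!"


print("apex amp importable:", amp_available())
```

If this prints `False` in the environment used for training, the problem lies in the apex installation rather than in the training script.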

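So I reinstalled apex.

The installation turned out to fail for two reasons:

  • CUDA mismatch, because the server defaults to CUDA 10.0
  • setup.py could not be found

Reason 1 was already resolved above.

Fix for reason 2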

git clone https://github.com/NVIDIA/apex.git
cd apex

# Add this command (run it from inside the apex directory, where setup.py lives)
python setup.py install --cuda_ext --cpp_ext

pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
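
After the build finishes, a quick check along these lines can show whether the package and its compiled extensions are visible to Python. This is a rough sketch: the extension module names `amp_C` (CUDA extension) and `apex_C` (C++ extension) are what I understand apex's setup.py to build, so treat them as assumptions, and `installed` is a hypothetical helper.

```python
import importlib.util


def installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# "amp_C" and "apex_C" are assumed names of the compiled extensions.
status = {name: installed(name) for name in ("apex", "amp_C", "apex_C")}
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'missing'}")
```

Run it with the same interpreter that launches training (here the `swin1` conda environment); a package installed into a different environment will show up as missing.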

torch.cuda.set_device(config.LOCAL_RANK)

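This error occurs when the number of GPUs requested exceeds the number the server actually has.

Check the server's GPU configuration

watch -n 10 nvidia-smi

My server has 2 GPUs.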
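
The same constraint can be checked from Python before launching. The sketch below assumes one process per GPU, with each process's local rank used as its device index; `check_local_rank` is a hypothetical helper, and the torch import is guarded so the snippet still runs on a machine without PyTorch.

```python
def check_local_rank(local_rank, device_count):
    """Raise a readable error if a rank has no matching GPU index."""
    if device_count <= 0:
        raise RuntimeError("no CUDA devices are visible")
    if not 0 <= local_rank < device_count:
        raise ValueError(
            f"local rank {local_rank} needs GPU index {local_rank}, "
            f"but only {device_count} GPU(s) are visible"
        )
    return local_rank


try:
    import torch
    visible_gpus = torch.cuda.device_count()
except ImportError:
    visible_gpus = 0  # PyTorch is not available in this environment

print("visible GPUs:", visible_gpus)
```

With 2 GPUs, ranks 0 and 1 pass the check, while a rank of 2 or higher fails, which is why `--nproc_per_node` must not exceed the GPU count.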

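So I re-ran the command with 2 processes:

python -m torch.distributed.launch --nproc_per_node 2 --master_port 29501 main.py --cfg configs/swin_tiny_patch4_window7_224.yaml --data-path imagenet --batch-size 64

RuntimeError: Address already in use

The command originally used --master_port 12345, and port 12345 was already occupied. Setting a different, unused port fixes this. The port must be a valid TCP port (at most 65535), for example --master_port 29501.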
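
Instead of guessing a port, one option is to ask the operating system for a free one. The snippet below is a small sketch using only the standard library; `find_free_port` and `port_in_use` are hypothetical helper names, and the port could in principle be taken by another process between this check and the launch.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Bind to port 0 so the OS assigns an unused port, then return it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


def port_in_use(port, host="127.0.0.1"):
    """Return True if binding to the port fails, i.e. it is already taken."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return True
        return False


port = find_free_port()
print("free port:", port)
```

The printed value can then be passed as `--master_port` in the launch command.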
