Daily Bug Log and Solutions

1. No module named _tkinter

Solution:

sudo apt-get install python3-tk
For Python 2.7 and 3.6:
sudo apt-get install python3.6-tk # python2.7-tk

Notes

In Python 3, Tkinter was renamed to tkinter.

Links

https://stackoverflow.com/questions/6084416/tkinter-module-not-found-on-ubuntu
https://blog.csdn.net/qq_33144323/article/details/80556954
https://www.jianshu.com/p/e041692438cc
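Before reaching for apt, it is easy to check from Python whether the missing piece (the C extension _tkinter named in the error) is already present. A small sketch:

```python
import importlib.util

# The error above complains about the _tkinter C extension,
# which the python3-tk package provides on Debian/Ubuntu
if importlib.util.find_spec("_tkinter") is None:
    print("no Tk bindings found -- run: sudo apt-get install python3-tk")
else:
    print("_tkinter is importable")
```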

2. Assorted errors

2.1 Import errors after installing OpenCV via pip

Install OpenCV with: pip install opencv-python
In Python 3, the following errors appear one after another:

ImportError: libSM.so.6: cannot open shared object file: No such file or directory
ImportError: libXrender.so.1: cannot open shared object file: No such file or directory
ImportError: libXext.so.6: cannot open shared object file: No such file or directory

Installing the corresponding packages fixes each one:

apt-get install libsm6
apt-get install libxrender1
apt-get install libxext-dev
# apt-get install libxext6

In Python 2, the error is:

ImportError: libgthread-2.0.so.0: cannot open shared object file: No such file or directory

Fix:

apt-get install libglib2.0-0

2.2 More ImportErrors and missing shared libraries:

----- libGL.so.1: cannot open shared object file (link)
sudo apt-get update
sudo apt-get install libgl1-mesa-glx

----- error while loading shared libraries: libjpeg.so.8: cannot open shared object file
----- error while loading shared libraries: libpng12.so.0: cannot open shared object file
----- error while loading shared libraries: libtiff.so.5: cannot open shared object file
----- error while loading shared libraries: libjasper.so.1: cannot open shared object file
apt-get install libjpeg8 libpng12.0 libtiff5 libjasper1

2.3 OSError

----- OSError: libgtk-x11-2.0.so.0: cannot open shared object file
apt-get update
apt-get install libgtk2.0
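All the errors in this section are the same failure: a shared library the loader cannot resolve. A sketch for checking, from Python, which of the libraries named above are resolvable on the current machine (library names taken from the error messages, without the lib prefix; the output naturally depends on the system):

```python
import ctypes.util

# Names from the ImportErrors in sections 2.1-2.3, "lib" prefix stripped
for name in ("SM", "Xrender", "Xext", "GL", "gthread-2.0", "gtk-x11-2.0"):
    found = ctypes.util.find_library(name)
    print(f"lib{name}: {found if found else 'MISSING -> install the matching package'}")
```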

3. Installing packages

3.1 pyflann

pip install pyflann
# --------------------- 1. 若使用时报错
sudo 2to3 -w [pyflann安装路径]
# --------------------- 2. 如何查看pyflann安装路径
1.试试 whereis pyflann
2.终极 pip uninstall pyflann(列出路径后,输入n拒绝卸载)
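Besides the pip-uninstall trick, an installed package's path can be read straight off its __file__ attribute. A sketch using the stdlib json package as a stand-in, since pyflann may not be installed here:

```python
import json  # stand-in for pyflann

# For a package, __file__ points at its __init__.py inside the install dir
print(json.__file__)
# the same idea for pyflann:
#   import pyflann; print(pyflann.__file__)
```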

3.2 Installing screen | gpustat | htop

apt-get install screen # may need apt-get update first
apt-get install htop
pip install gpustat

4. PyTorch training code suddenly out of memory

Two kinds of memory errors:

  1. RuntimeError: cuda runtime error (2) : out of memory ...
  2. RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB ...

For the first: upgrade the PyTorch version, avoid accumulating intermediate variables, set pin_memory to False.
For the second: reduce the batch size, pick a smaller model, etc.
Note: "suddenly" in the heading means code that used to train fine started throwing the first error for no known reason.
A baffling symptom along the way: the same code ran on GPUs 0 and 1 of the server but not on GPUs 2 and 3, yet ran fine on GPU 2 alone, and it also ran on other servers…
In hindsight: after much head-scratching, the fix turned out to be simple, and the clue came straight from the bug's traceback: change the pin_memory argument from True to False. That also prompted a look at what pin_memory (page-locked memory) actually is 1,2.
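The fix above is a one-argument change on the DataLoader. A minimal sketch with a toy dataset (in the real training script the dataset and batch size are whatever was already in use):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(8, 3), torch.zeros(8))
# pin_memory=False is the change that resolved the first kind of OOM error
loader = DataLoader(dataset, batch_size=4, shuffle=True, pin_memory=False)
for x, y in loader:
    print(x.shape)  # torch.Size([4, 3])
```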

5. PyTorch usage notes

x = torch.tensor([1, 2, 3]); y = torch.tensor([4, 5, 6]).cuda(0)
(1) tensor to numpy: x.numpy() or y.cpu().numpy()
(2) numpy to tensor: torch.from_numpy()
(2.1) a = torch.from_numpy(b): changing b changes a; here a.dtype is torch.float64
(2.2) a = torch.FloatTensor(b): changing b does not affect a; a.dtype is torch.float32
(3) tensor to list: list(x.numpy())
(4) Shared memory: z = x[0] or z = x[torch.tensor(0)], i.e. changing x[0] changes z
(5) Not shared: z = x[[0]] or z = x[torch.tensor([0])], i.e. changing x[0] does not change z
(6) An index tensor must be of Long type; the above works because 0 is an integer and therefore Long by default. It can also be written explicitly: z = x[torch.LongTensor([0])]
(7) torch.max(x) returns the max value tensor(3); torch.max(x, dim=0) returns the max value and its index, (tensor(3), tensor(2))
(8) Parameters to be updated, their learning rates, etc. can be edited by hand in OPTIMIZER.param_groups; be sure to wrap the parameters in a list (or other iterable), e.g. OPTIMIZER.param_groups[0]['params'] = [HEAD.weight]
(9) In PyTorch, train_loader = torch.utils.data.DataLoader() needs no reset; in MXNet, train_loader = mx.io.PrefetchingIter(train_loader) must call reset() at the start of every epoch
(10) A tensor can be assigned directly into a numpy ndarray, but not the other way round
tensor: a.shape == a.size(), both give the shape
numpy: a.shape is the shape, a.size is the number of elements
(11) Adding a batch dimension to a single image
image = cv2.imread(img_path) # image is an np.ndarray
Option 1
image = torch.tensor(image)
img = image.view(1, *image.size())
Option 2
img = image[np.newaxis, :, :, :]
Option 3
image = torch.tensor(image)
img = image.unsqueeze(dim=0)
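Points (2.1), (4), (5), and (7) above can be verified in a few lines (CPU-only sketch):

```python
import numpy as np
import torch

# (2.1) torch.from_numpy shares memory with the source array
b = np.array([1.0, 2.0, 3.0])           # float64 by default
a = torch.from_numpy(b)                 # a.dtype is torch.float64
b[0] = 9.0
print(a[0].item())                      # 9.0 -- the change is visible in a

# (4)/(5) integer indexing shares memory, list indexing copies
x = torch.tensor([1, 2, 3])
z_view = x[0]                           # shares memory
z_copy = x[[0]]                         # does not
x[0] = 7
print(z_view.item(), z_copy[0].item())  # 7 1

# (7) max value vs. max value plus index
print(torch.max(x))                     # tensor(7)
print(torch.max(x, dim=0))              # values=tensor(7), indices=tensor(0)
```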

Hidden pitfalls of load_state_dict when resuming PyTorch training

Official tutorial: https://pytorch.org/tutorials/beginner/saving_loading_models.html#

  1. In recent PyTorch versions, load_state_dict() does not accept a map_location argument, but torch.load() does.
  2. [Background] Saving & Loading a General Checkpoint (assuming the parameters were on the GPU when saved)
    After checkpoint = torch.load(checkpoint_path), the parameters inside checkpoint['backbone'] and checkpoint['optimizer'] are on the GPU. But after load_state_dict, e.g. backbone.load_state_dict(checkpoint['backbone']) and optimizer.load_state_dict(checkpoint['optimizer']), the parameters are all on the CPU; a subsequent backbone.cuda() moves them back to the GPU.
    [Problem] Here is the pitfall: optimizer['param_groups'] follows backbone and head onto the GPU automatically, but optimizer['state'] stays on the CPU, so optimizer.step() fails at buf.mul_(momentum).add_(1 - dampening, d_p), because buf is on the CPU while the gradient d_p is on the GPU.
    [Fix] After load_state_dict, move the tensors in optimizer.state to the GPU by hand:
    for k, v in optimizer.state.items():    # key is a Parameter; value is a dict like {'momentum_buffer': tensor(...)}
        if 'momentum_buffer' not in v:
            continue
        optimizer.state[k]['momentum_buffer'] = optimizer.state[k]['momentum_buffer'].cuda()
    
    [Also] For the lr_scheduler, save its _step_count and last_epoch. When resuming, re-create the lr_scheduler on top of the restored optimizer (important! if you saved the whole lr_scheduler before the interruption, the restored scheduler is unusable, because it still points at the original optimizer; it must be redefined against the newly restored optimizer), then copy the saved counters back into it. Example:
    checkpoint = torch.load(checkpoint_path)
    ...
    optimizer.load_state_dict(checkpoint['optimizer'])
    lr_scheduler_saved = checkpoint['lr_scheduler']
    lr_scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones)  # same milestones as originally defined
    lr_scheduler._step_count = lr_scheduler_saved._step_count
    lr_scheduler.last_epoch = lr_scheduler_saved.last_epoch
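The whole round trip, including the momentum-buffer fix above, can be sketched end to end. A toy model stands in for backbone; the 'backbone'/'optimizer' checkpoint keys follow the text, and the path is a throwaway:

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
model(torch.randn(3, 4)).sum().backward()
optimizer.step()  # creates momentum_buffer entries in optimizer.state

ckpt = {"backbone": model.state_dict(), "optimizer": optimizer.state_dict()}
torch.save(ckpt, "/tmp/checkpoint.pth")

# --- resuming ---
ckpt = torch.load("/tmp/checkpoint.pth", map_location="cpu")
model.load_state_dict(ckpt["backbone"])
optimizer.load_state_dict(ckpt["optimizer"])   # everything lands on CPU
if torch.cuda.is_available():
    model.cuda()                               # parameters -> GPU
    for state in optimizer.state.values():     # the hidden pitfall:
        if "momentum_buffer" in state:         # optimizer state must move too
            state["momentum_buffer"] = state["momentum_buffer"].cuda()
```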
    

6. NVIDIA driver version too old

NVIDIA driver on your system is too old (found version 9020). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.

  1. As the message suggests, download the matching driver from the official site and run it:

    sh NVIDIA-Linux-x86_64-430.40.run

  2. Before step 1, the old driver should really be uninstalled first (I assumed the new one would overwrite it automatically; it did not, so I had to uninstall and reinstall by hand):

    Summary
    sudo nvidia-uninstall
    sudo apt-get purge nvidia-*
    sudo apt-get remove --purge nvidia-*
    sudo add-apt-repository ppa:graphics-drivers/ppa
    sh NVIDIA-Linux-x86_64-430.40.run --uninstall
    References: 1 2 3

  3. Other settings

    sudo /sbin/telinit 3 # To get out of X
    sudo /sbin/telinit 5 # to return to X afterwards
    ref: 1

  4. About the version numbers

    nvidia-smi # shows the driver version, plus a CUDA version, presumably the CUDA version the driver expects
    cat /usr/local/cuda/version.txt # the CUDA toolkit version actually installed

7. TypeError: can't concat str to bytes

label = np.asarray([123.0, 432.0], dtype=np.float32)
s = label.tostring() + ''

Data-processing code written for Python 2.7 raises the error above under Python 3, because the two versions represent the byte string differently:

In Python 3: b'\x00\xc3\tG\x00\x9a\x0bG', note the leading b
In Python 2: '\x00\x00\xf6B\x00\x00\xd8C'

Fix:

label.tostring().decode('utf8', 'ignore')
# if decoded Chinese characters come out garbled, try gbk instead: decode("gbk")
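A runnable form of the fix (note that NumPy's tostring() is a deprecated alias of tobytes(); the empty-string concatenation here stands in for whatever string the original code appended):

```python
import numpy as np

label = np.asarray([123.0, 432.0], dtype=np.float32)
raw = label.tobytes()                  # bytes in Python 3
s = raw.decode("utf8", "ignore") + ""  # decode first, then concat with str
print(type(raw), type(s))              # <class 'bytes'> <class 'str'>
```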

8. Shell scripts

Since the machine is a remote server, I casually wrote test.sh on Windows (PyCharm); running it on Linux then fails, because the two systems use different line-ending formats. The DOS format has to be converted to Unix:

  1. Open test.sh in vim and, in command mode, run :set ff=unix (:set ff? shows the current file format)
  2. apt-get install dos2unix, then dos2unix test.sh

The catch: the conversion may not be reliable, and the result is all sorts of parsing errors.
[Fix] Simply write the script on Linux in vim or another editor.
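When neither vim nor dos2unix is at hand, the conversion is a simple byte replacement in Python (a sketch; the demo writes a throwaway file rather than touching a real test.sh):

```python
import os
import tempfile

def dos2unix(path):
    """Rewrite a file in place, converting CRLF (DOS) to LF (Unix)."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(data.replace(b"\r\n", b"\n"))

# demo on a throwaway script with Windows line endings
fd, path = tempfile.mkstemp(suffix=".sh")
os.close(fd)
with open(path, "wb") as f:
    f.write(b"#!/bin/bash\r\necho hello\r\n")
dos2unix(path)
with open(path, "rb") as f:
    print(f.read())  # b'#!/bin/bash\necho hello\n'
os.remove(path)
```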

Unresolved bugs

1. mxnet import error

import mxnet as mx
fails with Segmentation fault: 11

Same server, same docker configuration, same anaconda and mxnet versions as before: everything used to work, but now a fresh install segfaults. I gave up in frustration, and the next day the environment had fixed itself… (A possible explanation: the packages installed for the import errors in 2.1 may have resolved it, i.e. it may have been opencv-related.)

1.1 Other mxnet notes

os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'
os.environ['MXNET_CPU_WORKER_NTHREADS'] = '20' # set to a larger number to use more threads
