sudo apt-get install python3-tk
For Python 2.7 and 3.6:
sudo apt-get install python3.6-tk # python2.7-tk for Python 2.7
In Python 3, Tkinter is renamed to tkinter.
https://stackoverflow.com/questions/6084416/tkinter-module-not-found-on-ubuntu
https://blog.csdn.net/qq_33144323/article/details/80556954
https://www.jianshu.com/p/e041692438cc
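A quick check that the bindings are installed, handling the Python 2/3 rename (a minimal sketch; needs a display to open a window):
import sys
if sys.version_info[0] >= 3:
    import tkinter as tk  # Python 3 module name
else:
    import Tkinter as tk  # Python 2 module name
root = tk.Tk()  # raises if the Tk bindings are missing
root.destroy()
print('tkinter OK')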
Install OpenCV for Python: pip install opencv-python
Under Python 3, the following errors then appear one after another:
ImportError: libSM.so.6: cannot open shared object file: No such file or directory
ImportError: libXrender.so.1: cannot open shared object file: No such file or directory
ImportError: libXext.so.6: cannot open shared object file: No such file or directory
Installing the corresponding package fixes each of them:
apt-get install libsm6
apt-get install libxrender1
apt-get install libxext-dev
# apt-get install libxext6
Under Python 2, the error is:
ImportError: libgthread-2.0.so.0: cannot open shared object file: No such file or directory
Corresponding fix:
apt-get install libglib2.0-0
----- libGL.so.1: cannot open shared object file
sudo apt-get update
sudo apt-get install libgl1-mesa-glx
----- error while loading shared libraries: libjpeg.so.8: cannot open shared object file
----- error while loading shared libraries: libpng12.so.0: cannot open shared object file
----- error while loading shared libraries: libtiff.so.5: cannot open shared object file
----- error while loading shared libraries: libjasper.so.1: cannot open shared object file
apt-get install libjpeg8 libpng12-0 libtiff5 libjasper1
----- OSError: libgtk-x11-2.0.so.0: cannot open shared object file
apt-get update
apt-get install libgtk2.0-0
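Once those packages are installed, a quick import confirms the native libraries now resolve (the image path is hypothetical):
import cv2
print(cv2.__version__)  # the import itself fails with the ImportError above if a .so is still missing
img = cv2.imread('test.jpg')  # hypothetical file; cv2.imread returns None when the file does not exist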
pip install pyflann
# --------------------- 1. If it errors when used (under Python 3), convert the package in place:
sudo 2to3 -w [pyflann install path]
# --------------------- 2. How to find the pyflann install path
1. Try whereis pyflann
2. The surefire way: pip uninstall pyflann (it prints the path; enter n to abort the uninstall)
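3. Another option worth trying: pip show pyflann prints a Location field pointing at the install directory.
pip show pyflann # Location: <site-packages dir>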
apt-get install screen # you may need to run apt-get update first
apt-get install htop
pip install gpustat
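To keep gpustat refreshing, watch is the safe fallback; recent gpustat versions also have their own interval mode:
watch -n 1 gpustat
# or: gpustat -i 1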
Two kinds of out-of-memory errors:
RuntimeError: cuda runtime error (2) : out of memory ...
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB ...
For the first: upgrade PyTorch, avoid accumulating intermediate variables, set pin_memory to False, etc.
For the second: reduce the batch size, use a smaller model, etc.
Note: the "suddenly" in the subheading means that code which used to train normally starts, for unknown reasons, raising the first error.
A strange phenomenon I ran into: the same code ran on GPUs 0 and 1 of the server but not on GPUs 2 and 3, yet ran fine on GPU 2 alone, and also ran on other servers...
In retrospect: after a long struggle the fix turned out to be simple, and the clue came from the bug's traceback: change pin_memory from True to False. Along the way I also read up on what pin_memory (page-locked memory) is.
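A minimal sketch of that fix (train_set and the other loader arguments are placeholders for my setup):
from torch.utils.data import DataLoader
train_loader = DataLoader(train_set,         # hypothetical Dataset
                          batch_size=64,
                          shuffle=True,
                          num_workers=4,
                          pin_memory=False)  # was True; flipping it cleared the "cuda runtime error (2)"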
x = torch.tensor([1, 2, 3]); y = torch.tensor([4, 5, 6]).cuda(0)
(1) tensor to numpy: x.numpy() or y.cpu().numpy()
(2) numpy to tensor: torch.from_numpy()
(2.1) a = torch.from_numpy(b): changing b changes a (shared memory); a.dtype is torch.float64 (for a default float64 array b)
(2.2) a = torch.FloatTensor(b): changing b does not affect a (a copy); a.dtype is torch.float32
(3) tensor to list: list(x.numpy())
(4) Shared memory: z = x[0] or z = x[torch.tensor(0)], i.e., changing x[0] changes z (see the sketch after this list)
(5) Not shared: z = x[[0]] or z = x[torch.tensor([0])], i.e., changing x[0] does not change z
(6) A tensor used as an index must be of Long type; the above works because 0 is an integer, which defaults to Long. It can also be made explicit: z = x[torch.LongTensor([0])]
(7) torch.max(x) returns the max value tensor(3); torch.max(x, dim=0) returns the max and its index: (tensor(3), tensor(2))
(8) Parameters to update, their learning rates, etc. can be added or modified by hand in OPTIMIZER.param_groups; be sure to wrap the parameter in a list (or other iterable) so it is iterable, e.g. OPTIMIZER.param_groups[0]['params'] = [HEAD.weight]
(9) For PyTorch, train_loader = torch.utils.data.DataLoader() needs no reset; for MXNet, train_loader = mx.io.PrefetchingIter(train_loader) must be reset() at the start of every epoch
(10) A tensor can be assigned directly into a numpy ndarray, but not the other way around
tensor: a.shape = a.size(), both give the shape
numpy: a.shape is the shape, a.size is the number of elements
(11) Add a batch dimension to a single image: image = cv2.imread(img_path) # image is an np.ndarray
Method 1
image = torch.tensor(image)
img = image.view(1, *image.size())
Method 2
img = image[np.newaxis, :, :, :]
Method 3
image = torch.tensor(image)
img = image.unsqueeze(dim=0)
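A small sketch verifying points (2.1), (2.2), (4) and (5) above:
import numpy as np
import torch

b = np.ones(3)  # numpy default dtype is float64
a1 = torch.from_numpy(b)  # shares memory with b; a1.dtype == torch.float64
a2 = torch.FloatTensor(b)  # copies b; a2.dtype == torch.float32
b[0] = 5.0
print(a1[0].item(), a2[0].item())  # 5.0 1.0 -> only the shared tensor changed

x = torch.tensor([1, 2, 3])
z_view = x[0]  # basic indexing: a view of x's storage
z_copy = x[[0]]  # advanced indexing: a copy
x[0] = 9
print(z_view.item(), z_copy.item())  # 9 1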
Official tutorial: https://pytorch.org/tutorials/beginner/saving_loading_models.html#
for k, v in optimizer.state.items():  # key is a Parameter; value is a dict, e.g. {'momentum_buffer': tensor(...)}
    if 'momentum_buffer' not in v:
        continue
    optimizer.state[k]['momentum_buffer'] = optimizer.state[k]['momentum_buffer'].cuda()
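momentum_buffer is SGD-specific; a more general variant (a sketch, assuming torch and the restored optimizer are in scope) moves every tensor in the optimizer state, whatever the optimizer:
for state in optimizer.state.values():
    for name, val in state.items():
        if torch.is_tensor(val):
            state[name] = val.cuda()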
[Other] For the lr_scheduler, save its _step_count and last_epoch attributes. When resuming training, redefine the lr_scheduler on top of the restored optimizer (important! If you saved the whole lr_scheduler before the interruption, the restored object is unusable, because it still points at the old optimizer; it must be re-created against the newly restored optimizer), then overwrite its attributes with the saved values. Example code:
checkpoint = torch.load(checkpoint_path)
...
optimizer.load_state_dict(checkpoint['optimizer'])
lr_scheduler_saved = checkpoint['lr_scheduler']
lr_scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60])  # milestones is required; [30, 60] is a placeholder for your own schedule
lr_scheduler._step_count = lr_scheduler_saved._step_count
lr_scheduler.last_epoch = lr_scheduler_saved.last_epoch
NVIDIA driver on your system is too old (found version 9020). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.
Following the hint, download the matching driver from the official site and run it:
sh NVIDIA-Linux-x86_64-430.40.run
Before that first step, the old driver should really be uninstalled first (I assumed the installer would overwrite it automatically, but hit bugs, so I had to uninstall manually and reinstall).
Summary:
sudo nvidia-uninstall
sudo apt-get purge nvidia-*
sudo apt-get remove --purge nvidia-*
sudo add-apt-repository ppa:graphics-drivers/ppa
sh NVIDIA-Linux-x86_64-430.40.run --uninstall
Other settings
sudo /sbin/telinit 3 # To get out of X
sudo /sbin/telinit 5 # to return to X afterwards
About the driver version
nvidia-smi # shows the driver version; the CUDA version it also prints is the one the driver expects/supports, not necessarily what is installed
cat /usr/local/cuda/version.txt # the CUDA toolkit version that is actually installed
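If nvcc is on the PATH, it reports the installed toolkit version too:
nvcc --version # the installed CUDA toolkit version (not the driver's supported version)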
label = np.asarray([123.0, 432.0],dtype=np.float32)
s = label.tostring() + ''
Data-processing code written under Python 2.7 raises the above error under Python 3. The cause is that the two represent the byte string differently:
Python 3: b'\x00\xc3\tG\x00\x9a\x0bG', with a leading 'b'
Python 2: '\x00\x00\xf6B\x00\x00\xd8C'
Fix:
label.tostring().decode('utf8', 'ignore')
# If the decoded Chinese text comes out garbled, try gbk instead: decode("gbk")
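Note that decoding raw float bytes as text is lossy; to recover the numbers themselves, np.frombuffer is the lossless route (a sketch):
import numpy as np
label = np.asarray([123.0, 432.0], dtype=np.float32)
s = label.tostring()  # returns bytes under Python 3; tobytes() is the modern spelling
restored = np.frombuffer(s, dtype=np.float32)
print(restored)  # [123. 432.]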
Since the machine is a remote server, I casually wrote test.sh on Windows (PyCharm); running it on Linux then fails, because the two systems use different line-ending formats, so the DOS format has to be converted to Unix format.
The pitfall is right here: the format conversion may not be reliable, and the result is all kinds of parsing errors.
[Fix] Write the script directly on Linux with an editor such as vim.
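If an existing file written on Windows must be salvaged, converting the line endings inside vim usually works:
vim test.sh
:set ff=unix # fileformat=unix; the file is rewritten with Unix line endings on save
:wq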
import mxnet as mx
Error: Segmentation fault: 11
Same server address, same Docker configuration, same Anaconda and MXNet versions; everything used to work, but now it segfaults right after installation. I gave up in frustration, and the next day the environment had fixed itself... (Likely reason it got resolved: I had installed the packages for the import errors in 2.1, i.e., it was probably OpenCV-related.)
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'
os.environ['MXNET_CPU_WORKER_NTHREADS'] = '20' # Set to a larger number to use more threads.
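Note that CUDA_VISIBLE_DEVICES only takes effect if it is set before the process first touches CUDA, so set it before any GPU work (a minimal sketch):
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'  # must be set before the first CUDA call
import mxnet as mx                          # safe: nothing has touched the GPU yet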