在GPU刨过的坑

这两天回学校这两天先是把自己的linux系统给强制删除了,然后导致重启无法正常引导进入win,最后win也救不活了,还不好重装系统,各种文件损坏,简单粗暴的翻车血泪史。捷径路上总是不满陷阱。
装好win和linux后,从mac回到我的小本本,依然最亲linux啊哈哈,当然mac也很好。
重新在服务器上配置我的环境,该装的装,实习期间慢慢地习惯脱离pycharm了,其实要装的东西不多,就装个tensorflow-gpu的环境,但是以前装的时候没做好笔记,所以又把之前的坑探索一遍,好在随着gpu普遍,问题解决起来不算麻烦,论做笔记的重要性。

一号坑:
tensorflow-gpu 的安装,之前忘了cuda和cudnn版本的匹配要求,不然老是找不到libcudnn.so.6文件,参考要求到NVIDIA官网上下载相应的cudnn。
cuda10.1和tensorflow-gpu1.13不兼容,可以装cuda10.0

cuda版本对应兼容的系统版本 https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
在GPU刨过的坑_第1张图片
其他系统到tensorflow官网上获取最新信息

二号坑:
根据gpu等配置装了tensorflow1.4,最后也成功装上了,但是出现一下bug,在cpu版本的tensorflow没有出现问题,而GPU版本上报错了,具体问题如下:
1.制作tfrecords的时候, img_test = tf.reshape(img, [224, 224, 3]) ###### 过滤条件 防止后面读取加载时候出现不能reshape,该语句出现了error:Tried to convert ‘tensor’ to a tensor and failed. Error: Argument must be a dense tensor: - got shape [224, 224, 3], but wanted [].

最后只能使用cpu版本的tensorflow制作tfrecords了。

2.在训练resnet的时候也出现问题了,即使已经加上

self.sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True,log_device_placement=True))

error:
Traceback (most recent call last):
File “resnet_train_gpu.py”, line 137, in
resnet_train.train()
File “resnet_train_gpu.py”, line 60, in train
self.logits, self.end_points = resnet_v2_50(self.x_input, 2)
File “/home/wyl/Documents/coding/pornographic/resnet/resnet_wei_gpu.py”, line 196, in resnet_v2_50
include_root_block=True,reuse=reuse,scope=scope)
File “/home/wyl/Documents/coding/pornographic/resnet/resnet_wei_gpu.py”, line 163, in resnet_v2
activation_fn=None,normalizer_fn=None,scope=‘logits’)
File “/home/wyl/anaconda3/envs/python3/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py”, line 181, in func_with_args
return func(*args, **current_args)
File “/home/wyl/anaconda3/envs/python3/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py”, line 1011, in convolution
input_rank)
ValueError: (‘Convolution not supported for input with rank’, 2)
最后得出结论是 tensorflow版本不一样,之前在cpu上用的是tensorflow1.13,GPU版本用tensorflow1.4,有些网络参数不同导致的,于是乎直接换了个tensorflow、models/research/slim/nets/resnet_v2.py 能跑的就可以了。

三号坑:
如果直接运行脚本的话会出现
ResourceExhaustedError,分析现在主机上有四块卡,有一块有任务,而tensorflow训练程序在没有特别说明的情况下,默认吃掉所有的显存,用掉所有的卡。
所以要指定GPU的使用如下:

CUDA_VISIBLE_DEVICES=1,2,3 python train_porn_gpu.py ../config/config_gpu.cfg 

在GPU刨过的坑_第2张图片
就可以了。
在GPU刨过的坑_第3张图片
GPU就是快,写bug时候心情都顺畅许多哈哈。

四号坑:
error:
W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key conduct_encoder/layer9-conv9/E_weights not found in checkpoint
找不到训练好的网络参数变量

通过tf_checkpoint.py检查保存的checkpoint文件的数据是否正常

import tensorflow as tf

checkpoint_file = '../checkpoints_gpu/resnet_gpu.ckpt'

reader = tf.train.NewCheckpointReader(checkpoint_file)
#print(reader.debug_string().decode("utf-8"))
var_to_shape_map = reader.get_variable_to_shape_map()
for key in var_to_shape_map:
    print("tensor_name: ", key)   # 打印变量名
    #print(reader.get_tensor(key)) # 打印变量值

正常,最后发现是saver.restore的图不对,之前使用的是saver = tf.train.Saver()默认的,应该使用模型保存的图,
即:

saver = tf.train.import_meta_graph("{}.meta".format(self.checkpoint_file))

五号坑:
error:
因为其他模块的要求python3.6的环境,所以将py3.7换成py3.6,然后就出现

The NVIDIA driver is too old PyTorch version that has been compiled with your version of the CUDA

之前没注意过pytorch也有版本要求,还以为是我的cuda driver 有问题呢,还重新安装了好几次。看tensorflow cuda能用起来,才想到是不是版本不适引起的。
python 3.6.9: cuda 10.0,torch 1.2.0,torchversion0.4.0 可用
最后测试下是否可用:

import torch
print(torch.cuda.is_available())

六号坑:
接着,五号坑

Traceback (most recent call last):
  File "demo.py", line 9, in 
    from tracker.multitracker import JDETracker
  File "/home/smart4s/Documents/workplaces/Towards-Realtime-MOT/tracker/multitracker.py", line 10, in 
    from utils.utils import *
  File "/home/smart4s/Documents/workplaces/Towards-Realtime-MOT/utils/utils.py", line 13, in 
    import maskrcnn_benchmark.layers.nms as nms
  File "/home/smart4s/Documents/workplaces/maskrcnn-benchmark/maskrcnn_benchmark/layers/__init__.py", line 10, in 
    from .nms import nms
  File "/home/smart4s/Documents/workplaces/maskrcnn-benchmark/maskrcnn_benchmark/layers/nms.py", line 3, in 
    from maskrcnn_benchmark import _C
ImportError: /home/smart4s/Documents/workplaces/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZNK2at11ATenOpTable11reportErrorEN3c1012TensorTypeIdE

有同学建议我换成
torch 1.3.0, torchvision 0.4.1,且换了cuda-10.1
然后我就换了,新的issue如下:
RuntimeError: Not compiled with GPU support (nms at

import torch
print(torch.cuda.is_available())
True


/home/smart4s/anaconda3/envs/py36/lib/python3.6/site-packages/sklearn/utils/linear_assignment_.py:21: DeprecationWarning: The linear_assignment_ module is deprecated in 0.21 and will be removed from 0.23. Use scipy.optimize.linear_sum_assignment instead.
DeprecationWarning)
Namespace(cfg=‘cfg/yolov3.cfg’, conf_thres=0.5, img_size=(1088, 608), input_video=‘demo_input/MOT16-13.mp4’, iou_thres=0.5, min_box_area=200, nms_thres=0.4, output_format=‘video’, output_root=‘output/’, track_buffer=30, weights=‘model/jde.1088x608.uncertainty.pt’)

2019-11-05 04:14:09 [INFO]: start tracking…
Lenth of the video: 750 frames
2019-11-05 04:14:16 [INFO]: Processing frame 0 (100000.00 fps)
Traceback (most recent call last):
File “demo.py”, line 67, in
track(opt)
File “demo.py”, line 38, in track
save_dir=frame_dir, show_image=False, frame_rate=frame_rate)
File “/home/smart4s/Documents/workplaces/Towards-Realtime-MOT/track.py”, line 54, in eval_seq
online_targets = tracker.update(blob, img0)
File “/home/smart4s/Documents/workplaces/Towards-Realtime-MOT/tracker/multitracker.py”, line 185, in update
dets = non_max_suppression(pred.unsqueeze(0), self.opt.conf_thres, self.opt.nms_thres)[0].cpu()
File “/home/smart4s/Documents/workplaces/Towards-Realtime-MOT/utils/utils.py”, line 457, in non_max_suppression
nms_indices = nms(pred[:, :4], pred[:, 4], nms_thres)
File “/home/smart4s/anaconda3/envs/py36/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/amp/amp.py”, line 22, in wrapper
return orig_fn(*args, **kwargs)
RuntimeError: Not compiled with GPU support (nms at /home/smart4s/Documents/workplaces/maskrcnn-benchmark/maskrcnn_benchmark/csrc/nms.h:22)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f8ace07a813 in /home/smart4s/anaconda3/envs/py36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: nms(at::Tensor const&, at::Tensor const&, float) + 0x129 (0x7f8ab3d24af9 in /home/smart4s/Documents/workplaces/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #2: + 0x1e178 (0x7f8ab3d35178 in /home/smart4s/Documents/workplaces/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #3: + 0x1a83d (0x7f8ab3d3183d in /home/smart4s/Documents/workplaces/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)

frame #32: __libc_start_main + 0xe7 (0x7f8ae1db3b97 in /lib/x86_64-linux-gnu/libc.so.6)

这次依然是maskrcnn-benchmark的问题,又是同学的建议,让我重新装了maskrcnn-benchmark,我之前只是uninstall,忘了换了cuda版本需要重新编译,所以这次把maskrcnn-benchmark uninstall后,rm -rf maskrcnn-benchmark 目录重新 git clone ,重新编译就ok了。

你可能感兴趣的:(学习笔记,tensorflow)