could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED Process finished with exit code -1073741819

Solution: comment out the run parameter #--gpu_memory_fraction=0.9 and the problem goes away! With that flag, TensorFlow tries to reserve 0.9 × 8 GiB ≈ 7.2 GiB up front, which matches the failed 7.20G CUDA_ERROR_OUT_OF_MEMORY allocation in the log below and leaves cuDNN no room to create its own handle.
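For context, here is a minimal sketch of the two GPU-memory modes behind that flag in TensorFlow 1.x. The session construction below is an illustrative assumption, not the exact code in train_ssd_network.py:

import tensorflow as tf

# What --gpu_memory_fraction=0.9 amounts to: pre-claim a fixed share of GPU memory.
# On an 8 GiB card this requests about 7.2 GiB up front, more than the 6.62 GiB
# reported free in the log, so the allocation fails and cuDNN cannot create its handle.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.9)

# The alternative (only one of the two would actually be used): let TensorFlow
# grow its GPU allocation on demand instead of pre-claiming a fixed fraction.
gpu_options = tf.GPUOptions(allow_growth=True)

config = tf.ConfigProto(gpu_options=gpu_options, log_device_placement=False)
sess = tf.Session(config=config)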

The error message was as follows:

INFO:tensorflow:global_step/sec: 0
2019-03-12 19:11:20.300266: E c:\users\user\source\repos\tensorflow\tensorflow\stream_executor\cuda\cuda_dnn.cc:332] could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-03-12 19:11:20.301051: E c:\users\user\source\repos\tensorflow\tensorflow\stream_executor\cuda\cuda_dnn.cc:332] could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED

Process finished with exit code -1073741819 (0xC0000005)

The full console output follows:

WARNING:tensorflow:From D:/work/SSD-Tensorflow-master/train_ssd_network.py:202: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step

# =========================================================================== #
# Training | Evaluation flags:
# =========================================================================== #
{'adadelta_rho': ,
 'adagrad_initial_accumulator_value': ,
 'adam_beta1': ,
 'adam_beta2': ,
 'batch_size': ,
 'checkpoint_exclude_scopes': ,
 'checkpoint_model_scope': ,
 'checkpoint_path': ,
 'clone_on_cpu': ,
 'dataset_dir': ,
 'dataset_name': ,
 'dataset_split_name': ,
 'end_learning_rate': ,
 'ftrl_initial_accumulator_value': ,
 'ftrl_l1': ,
 'ftrl_l2': ,
 'ftrl_learning_rate_power': ,
 'gpu_memory_fraction': ,
 'h': ,
 'help': ,
 'helpfull': ,
 'helpshort': ,
 'ignore_missing_vars': ,
 'label_smoothing': ,
 'labels_offset': ,
 'learning_rate': ,
 'learning_rate_decay_factor': ,
 'learning_rate_decay_type': ,
 'log_every_n_steps': ,
 'loss_alpha': ,
 'match_threshold': ,
 'max_number_of_steps': ,
 'model_name': ,
 'momentum': ,
 'moving_average_decay': ,
 'negative_ratio': ,
 'num_classes': ,
 'num_clones': ,
 'num_epochs_per_decay': ,
 'num_preprocessing_threads': ,
 'num_readers': ,
 'opt_epsilon': ,
 'optimizer': ,
 'preprocessing_name': ,
 'rmsprop_decay': ,
 'rmsprop_momentum': ,
 'save_interval_secs': ,
 'save_summaries_secs': ,
 'train_dir': ,
 'train_image_size': ,
 'trainable_scopes': ,
 'weight_decay': }

# =========================================================================== #
# SSD net parameters:
# =========================================================================== #
{'anchor_offset': 0.5,
 'anchor_ratios': [[2, 0.5],
                   [2, 0.5, 3, 0.3333333333333333],
                   [2, 0.5, 3, 0.3333333333333333],
                   [2, 0.5, 3, 0.3333333333333333],
                   [2, 0.5],
                   [2, 0.5]],
 'anchor_size_bounds': [0.15, 0.9],
 'anchor_sizes': [(2.0, 45.0),
                  (45.0, 99.0),
                  (99.0, 153.0),
                  (153.0, 207.0),
                  (207.0, 261.0),
                  (261.0, 315.0)],
 'anchor_steps': [8, 16, 32, 64, 100, 300],
 'feat_layers': ['block4', 'block7', 'block8', 'block9', 'block10', 'block11'],
 'feat_shapes': [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
 'img_shape': (300, 300),
 'no_annotation_label': 2,
 'normalizations': [20, -1, -1, -1, -1, -1],
 'num_classes': 2,
 'prior_scaling': [0.1, 0.1, 0.2, 0.2]}

# =========================================================================== #
# Training | Evaluation dataset files:
# =========================================================================== #
['.\\tfrecords\\voc_2007_train_000.tfrecord',
 '.\\tfrecords\\voc_2007_train_001.tfrecord',
 '.\\tfrecords\\voc_2007_train_002.tfrecord',
 '.\\tfrecords\\voc_2007_train_003.tfrecord']

INFO:tensorflow:Fine-tuning from ./checkpoints/vgg_16.ckpt. Ignoring missing vars: False
WARNING:tensorflow:From C:\Users\11327\AppData\Roaming\Python\Python36\site-packages\tensorflow\contrib\slim\python\slim\learning.py:737: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-03-12 19:35:08.779143: I c:\users\user\source\repos\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-03-12 19:35:08.976953: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1392] Found device 0 with properties: 
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.62GiB
2019-03-12 19:35:08.977310: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1471] Adding visible gpu devices: 0
2019-03-12 19:35:09.609884: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-12 19:35:09.610104: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:958]      0 
2019-03-12 19:35:09.610250: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0:   N 
2019-03-12 19:35:09.610476: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7372 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-03-12 19:35:09.611373: E c:\users\user\source\repos\tensorflow\tensorflow\stream_executor\cuda\cuda_driver.cc:903] failed to allocate 7.20G (7730940928 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
INFO:tensorflow:Restoring parameters from ./checkpoints/vgg_16.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path ./logs/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
2019-03-12 19:35:16.465249: E c:\users\user\source\repos\tensorflow\tensorflow\stream_executor\cuda\cuda_dnn.cc:332] could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-03-12 19:35:16.466070: E c:\users\user\source\repos\tensorflow\tensorflow\stream_executor\cuda\cuda_dnn.cc:332] could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED

Process finished with exit code -1073741819 (0xC0000005)

The parameters used (this is the run that failed):

--train_dir=./logs/
--dataset_dir=./tfrecords/
--dataset_name=pascalvoc_2007
--dataset_split_name=train
--model_name=ssd_300_vgg
--checkpoint_path=./checkpoints/vgg_16.ckpt
--checkpoint_model_scope=vgg_16
--checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box
--trainable_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box
--save_summaries_secs=60
--save_interval_secs=600
--weight_decay=0.0005
--optimizer=adam
--learning_rate=0.001
--learning_rate_decay_factor=0.94
--batch_size=16
--gpu_memory_fraction=0.9

The parameters, with each option explained (from the reference blog), were as follows:

python3 train_ssd_network.py \
    --train_dir=/media/comway/data/dial_SSD/SSD-Tensorflow-master/train_log/ \    # directory where trained checkpoints are saved
    --dataset_dir=/media/comway/data/dial_SSD/SSD-Tensorflow-master/dialvoc-train-tfrecords \    # directory holding the tfrecord data
    --dataset_name=pascalvoc_2007 \    # dataset name prefix; I think this is what selects 2007 vs 2012
    --dataset_split_name=train \    # load the training split rather than the test split
    --model_name=ssd_300_vgg \    # name of the model to load
    --checkpoint_path=/media/comway/data/dial_SSD/SSD-Tensorflow-master/checkpoints/ssd_300_vgg.ckpt \    # path of the checkpoint to load
    --checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --trainable_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --save_summaries_secs=60 \    # save summaries every 60 s
    --save_interval_secs=600 \    # save a checkpoint every 600 s
    --weight_decay=0.0005 \    # weight-decay coefficient for regularization
    --optimizer=adam \    # optimizer to use
    --learning_rate=0.001 \    # learning rate
    --learning_rate_decay_factor=0.94 \    # learning-rate decay factor
    --batch_size=16 \
    --gpu_memory_fraction=0.9    # fraction of GPU memory to claim
Reference blog: https://blog.csdn.net/comway_Li/article/details/85239484 — I followed the last solution in that post; the first solution had already been tried, and it still produced the error.

What I did: 1. I had already used these parameters before, yet for some unknown reason the error kept appearing. After that I shut down and rebooted the computer many times, restarted PyCharm, changed parameters, and re-extracted and replaced the pre-trained model file.

2. I also reverted my changes. A senior labmate suggested that going back to the most original setup might make the error disappear, so I reverted, and sure enough it did: the program could finally keep training.

3. I killed all NVIDIA processes, rebooted the computer, and closed protection software such as PC Manager. (As a beginner I have no idea whether this actually matters, but that is what I did.)

4. The main thing is to change every occurrence of the class count 21 in the training code to your own number of classes; do not miss any individual num_classes value. A sketch of a quick consistency check follows this list.
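A minimal consistency check, assuming the repository root is on PYTHONPATH and the standard SSD-Tensorflow layout (the module path nets.ssd_vgg_300 and the SSDNet.default_params attribute are assumptions; adjust to your copy). For a dataset with N object classes, num_classes should be N + 1 to account for the background class, and no_annotation_label should be changed along with it, as in the "SSD net parameters" printout above:

from nets import ssd_vgg_300   # assumed module path in SSD-Tensorflow-master

params = ssd_vgg_300.SSDNet.default_params   # assumed attribute name
print('num_classes         =', params.num_classes)
print('no_annotation_label =', params.no_annotation_label)
# These two values should be changed together, and every num_classes flag
# (e.g. in train_ssd_network.py and eval_ssd_network.py) should match them.
assert params.num_classes == params.no_annotation_label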

Final parameters used to train my own dataset (cleaned-up version = the version actually run):
--train_dir=./logs/
--dataset_dir=./tfrecords/
--dataset_name=pascalvoc_2007
--dataset_split_name=train
--model_name=ssd_300_vgg
--checkpoint_path=./checkpoints/ssd_300_vgg.ckpt
#--checkpoint_model_scope=vgg_16
--checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box
--trainable_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box
--save_summaries_secs=60
--save_interval_secs=600
--weight_decay=0.0005
--optimizer=adam
--learning_rate=0.001
--learning_rate_decay_factor=0.94
--batch_size=16
#--gpu_memory_fraction=0.9
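Before launching a full training run, it can be worth confirming that the cuDNN handle can now be created at all. A minimal sketch for TensorFlow 1.x (the tensor shapes here are arbitrary assumptions); if the convolution runs, the CUDNN_STATUS_ALLOC_FAILED condition is gone:

import numpy as np
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # grow GPU memory on demand instead of pre-claiming a fixed fraction

# A single small convolution is enough to force cuDNN handle creation.
x = tf.constant(np.random.rand(1, 300, 300, 3).astype(np.float32))
w = tf.constant(np.random.rand(3, 3, 3, 16).astype(np.float32))
y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')

with tf.Session(config=config) as sess:
    print('conv OK, output shape:', sess.run(y).shape)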

 
