Solution: comment out the run parameter --gpu_memory_fraction=0.9 and the problem is solved! (With a fraction of 0.9 the allocator tries to reserve about 7.2 GB up front on an 8 GB card that only has 6.62 GB free, which is likely why cuDNN then fails to allocate its handle.)
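If you still want to limit GPU memory rather than drop the flag completely, the usual TF 1.x alternative is to let the session grow its allocation on demand instead of reserving 90% up front. A minimal sketch, assuming a standard tf.Session configuration (this is not the exact code in train_ssd_network.py):

import tensorflow as tf

# Let TensorFlow allocate GPU memory as needed instead of grabbing a fixed
# fraction up front, which can leave too little room for the cuDNN handle.
gpu_options = tf.GPUOptions(allow_growth=True)
# Alternatively, reserve a smaller fixed fraction, e.g. 0.7 instead of 0.9:
# gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    pass  # build and run the training graph here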
The error looked like this (exit code -1073741819 is 0xC0000005, a Windows access violation):
INFO:tensorflow:global_step/sec: 0
2019-03-12 19:11:20.300266: E c:\users\user\source\repos\tensorflow\tensorflow\stream_executor\cuda\cuda_dnn.cc:332] could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-03-12 19:11:20.301051: E c:\users\user\source\repos\tensorflow\tensorflow\stream_executor\cuda\cuda_dnn.cc:332] could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
Process finished with exit code -1073741819 (0xC0000005)
The full console output follows:
WARNING:tensorflow:From D:/work/SSD-Tensorflow-master/train_ssd_network.py:202: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
# =========================================================================== #
# Training | Evaluation flags:
# =========================================================================== #
{'adadelta_rho':
'adagrad_initial_accumulator_value':
'adam_beta1':
'adam_beta2':
'batch_size':
'checkpoint_exclude_scopes':
'checkpoint_model_scope':
'checkpoint_path':
'clone_on_cpu':
'dataset_dir':
'dataset_name':
'dataset_split_name':
'end_learning_rate':
'ftrl_initial_accumulator_value':
'ftrl_l1':
'ftrl_l2':
'ftrl_learning_rate_power':
'gpu_memory_fraction':
'h':
'help':
'helpfull':
'helpshort':
'ignore_missing_vars':
'label_smoothing':
'labels_offset':
'learning_rate':
'learning_rate_decay_factor':
'learning_rate_decay_type':
'log_every_n_steps':
'loss_alpha':
'match_threshold':
'max_number_of_steps':
'model_name':
'momentum':
'moving_average_decay':
'negative_ratio':
'num_classes':
'num_clones':
'num_epochs_per_decay':
'num_preprocessing_threads':
'num_readers':
'opt_epsilon':
'optimizer':
'preprocessing_name':
'rmsprop_decay':
'rmsprop_momentum':
'save_interval_secs':
'save_summaries_secs':
'train_dir':
'train_image_size':
'trainable_scopes':
'weight_decay':
# =========================================================================== #
# SSD net parameters:
# =========================================================================== #
{'anchor_offset': 0.5,
'anchor_ratios': [[2, 0.5],
[2, 0.5, 3, 0.3333333333333333],
[2, 0.5, 3, 0.3333333333333333],
[2, 0.5, 3, 0.3333333333333333],
[2, 0.5],
[2, 0.5]],
'anchor_size_bounds': [0.15, 0.9],
'anchor_sizes': [(2.0, 45.0),
(45.0, 99.0),
(99.0, 153.0),
(153.0, 207.0),
(207.0, 261.0),
(261.0, 315.0)],
'anchor_steps': [8, 16, 32, 64, 100, 300],
'feat_layers': ['block4', 'block7', 'block8', 'block9', 'block10', 'block11'],
'feat_shapes': [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
'img_shape': (300, 300),
'no_annotation_label': 2,
'normalizations': [20, -1, -1, -1, -1, -1],
'num_classes': 2,
'prior_scaling': [0.1, 0.1, 0.2, 0.2]}
# =========================================================================== #
# Training | Evaluation dataset files:
# =========================================================================== #
['.\\tfrecords\\voc_2007_train_000.tfrecord',
'.\\tfrecords\\voc_2007_train_001.tfrecord',
'.\\tfrecords\\voc_2007_train_002.tfrecord',
'.\\tfrecords\\voc_2007_train_003.tfrecord']
INFO:tensorflow:Fine-tuning from ./checkpoints/vgg_16.ckpt. Ignoring missing vars: False
WARNING:tensorflow:From C:\Users\11327\AppData\Roaming\Python\Python36\site-packages\tensorflow\contrib\slim\python\slim\learning.py:737: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-03-12 19:35:08.779143: I c:\users\user\source\repos\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-03-12 19:35:08.976953: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1392] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.62GiB
2019-03-12 19:35:08.977310: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1471] Adding visible gpu devices: 0
2019-03-12 19:35:09.609884: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-12 19:35:09.610104: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:958] 0
2019-03-12 19:35:09.610250: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0: N
2019-03-12 19:35:09.610476: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7372 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-03-12 19:35:09.611373: E c:\users\user\source\repos\tensorflow\tensorflow\stream_executor\cuda\cuda_driver.cc:903] failed to allocate 7.20G (7730940928 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
INFO:tensorflow:Restoring parameters from ./checkpoints/vgg_16.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path ./logs/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
2019-03-12 19:35:16.465249: E c:\users\user\source\repos\tensorflow\tensorflow\stream_executor\cuda\cuda_dnn.cc:332] could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-03-12 19:35:16.466070: E c:\users\user\source\repos\tensorflow\tensorflow\stream_executor\cuda\cuda_dnn.cc:332] could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
Process finished with exit code -1073741819 (0xC0000005)
The parameters used were:
--train_dir=./logs/
--dataset_dir=./tfrecords/
--dataset_name=pascalvoc_2007
--dataset_split_name=train
--model_name=ssd_300_vgg
--checkpoint_path=./checkpoints/vgg_16.ckpt
--checkpoint_model_scope=vgg_16
--checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box
--trainable_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box
--save_summaries_secs=60
--save_interval_secs=600
--weight_decay=0.0005
--optimizer=adam
--learning_rate=0.001
--learning_rate_decay_factor=0.94
--batch_size=16
--gpu_memory_fraction=0.9
The parameters used, with explanations:
python3 train_ssd_network.py \
    --train_dir=/media/comway/data/dial_SSD/SSD-Tensorflow-master/train_log/ \          # directory where the trained model is saved
    --dataset_dir=/media/comway/data/dial_SSD/SSD-Tensorflow-master/dialvoc-train-tfrecords \    # directory holding the data
    --dataset_name=pascalvoc_2007 \          # dataset name prefix (I think this is what selects 2007 vs. 2012)
    --dataset_split_name=train \             # load the training split rather than the test split
    --model_name=ssd_300_vgg \               # name of the model to load
    --checkpoint_path=/media/comway/data/dial_SSD/SSD-Tensorflow-master/checkpoints/ssd_300_vgg.ckpt \    # path of the checkpoint to load
    --checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --trainable_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --save_summaries_secs=60 \               # save summaries every 60 s
    --save_interval_secs=600 \               # save a checkpoint every 600 s
    --weight_decay=0.0005 \                  # weight-decay coefficient for regularization
    --optimizer=adam \                       # optimizer to use
    --learning_rate=0.001 \                  # learning rate
    --learning_rate_decay_factor=0.94 \      # learning-rate decay factor
    --batch_size=16 \
    --gpu_memory_fraction=0.9                # fraction of GPU memory to occupy
Reference blog: https://blog.csdn.net/comway_Li/article/details/85239484 — I followed the last solution in that post; I had already tried the first solution anyway, and it still gave this error.
What I did: 1. I had already used these parameters before, but for some reason it kept erroring out. After that I shut down and rebooted the machine many times, restarted PyCharm, changed parameters, and re-extracted the pretrained model file.
2. I also rolled my changes back. A senior labmate said that going back to the very first error state might make the error disappear, so I reverted, and sure enough it did. The program could finally keep training.
3. I killed all the NVIDIA processes, rebooted the machine, and closed PC Manager and other protection software. (As a beginner I have no idea whether this actually matters; I just did it.)
4. Most importantly, change every occurrence of the class count 21 in the code to your own number of classes, and make sure not to miss any individual num_classes value (see the sketch after this list).
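For reference, a sketch of the kind of edit meant in step 4, using num_classes = 2 (one object class plus background, matching the SSD net parameters printed above). The file and variable names below are taken from the balancap/SSD-Tensorflow repo and are assumptions about where the value 21 appears in your copy; the elided fields stay as they are:

# nets/ssd_vgg_300.py -- SSDNet.default_params (originally num_classes=21)
default_params = SSDParams(
    img_shape=(300, 300),
    num_classes=2,             # 1 object class + 1 background class
    no_annotation_label=2,
    # ... remaining fields unchanged ...
)

# train_ssd_network.py -- the corresponding flag default should match
tf.app.flags.DEFINE_integer(
    'num_classes', 2, 'Number of classes to use in the dataset.')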
Final parameters used to train my own dataset (cleaned-up version = the version actually run):
--train_dir=./logs/
--dataset_dir=./tfrecords/
--dataset_name=pascalvoc_2007
--dataset_split_name=train
--model_name=ssd_300_vgg
--checkpoint_path=./checkpoints/ssd_300_vgg.ckpt
#--checkpoint_model_scope=vgg_16
--checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box
--trainable_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box
--save_summaries_secs=60
--save_interval_secs=600
--weight_decay=0.0005
--optimizer=adam
--learning_rate=0.001
--learning_rate_decay_factor=0.94
--batch_size=16
#--gpu_memory_fraction=0.9
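For completeness, the same final flags wrapped into a single invocation, with the two commented-out flags omitted. This is a sketch assuming you launch train_ssd_network.py from the repository root in a Windows cmd shell (where ^ continues a line); in PyCharm the flags go into the run configuration's parameters field instead:

python train_ssd_network.py ^
    --train_dir=./logs/ ^
    --dataset_dir=./tfrecords/ ^
    --dataset_name=pascalvoc_2007 ^
    --dataset_split_name=train ^
    --model_name=ssd_300_vgg ^
    --checkpoint_path=./checkpoints/ssd_300_vgg.ckpt ^
    --checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box ^
    --trainable_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box ^
    --save_summaries_secs=60 ^
    --save_interval_secs=600 ^
    --weight_decay=0.0005 ^
    --optimizer=adam ^
    --learning_rate=0.001 ^
    --learning_rate_decay_factor=0.94 ^
    --batch_size=16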