conda virtual environment: python==3.6.10
conda install tensorflow-gpu keras
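[Note] Some tutorials say it is better to install with pip, because conda bundles a lot of extra packages, and removing one can pull many others out with it and leave the whole environment unusable. But when I installed with pip I could not train on the GPU, and I have not solved that yet. conda installs cudatoolkit and cudnn automatically and picks matching versions. (Muttering under my breath: it is a virtual environment anyway, so if it breaks I can just delete it and build a new one.)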
Everything was installed automatically. Training then failed with:
Traceback (most recent call last):
  File "train_g_unet.py", line 121, in <module>
    train(args)
  File "train_g_unet.py", line 62, in train
    parallel_model = multi_gpu_model(model, gpus=len(gpu_num))
  File "/media/s2/cyq/anaconda3/envs/keras/lib/python3.6/site-packages/keras/utils/multi_gpu_utils.py", line 150, in multi_gpu_model
    available_devices = _get_available_devices()
  File "/media/s2/cyq/anaconda3/envs/keras/lib/python3.6/site-packages/keras/utils/multi_gpu_utils.py", line 16, in _get_available_devices
    return K.tensorflow_backend._get_available_gpus() + ['/cpu:0']
  File "/media/s2/cyq/anaconda3/envs/keras/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 506, in _get_available_gpus
    _LOCAL_DEVICES = tf.config.experimental_list_devices()
AttributeError: module 'tensorflow_core._api.v2.config' has no attribute 'experimental_list_devices'
Fix: in lib/python3.6/site-packages/keras/backend/tensorflow_backend.py, modify line 506.
# Original code:
_LOCAL_DEVICES = tf.config.experimental_list_devices()
# After the change:
devices = tf.config.list_logical_devices()
_LOCAL_DEVICES = [x.name for x in devices]
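The same fallback can be written as a small helper, which makes the logic easier to see: use experimental_list_devices() if the installed TensorFlow still has it, otherwise build the name list from list_logical_devices(). This is only a sketch. The helper name list_device_names and the stand-in config objects are my own, used so the example runs without TensorFlow; in real use you would pass tf.config.

```python
from types import SimpleNamespace


def list_device_names(config):
    """Return device name strings from a tf.config-like object.

    Prefers the old experimental_list_devices() when it exists and
    falls back to list_logical_devices() otherwise.
    """
    if hasattr(config, "experimental_list_devices"):
        return list(config.experimental_list_devices())
    return [device.name for device in config.list_logical_devices()]


# Stand-in objects (hypothetical) so the sketch runs without TensorFlow.
old_config = SimpleNamespace(
    experimental_list_devices=lambda: ["/device:CPU:0", "/device:GPU:0"]
)
new_config = SimpleNamespace(
    list_logical_devices=lambda: [
        SimpleNamespace(name="/device:CPU:0"),
        SimpleNamespace(name="/device:GPU:0"),
    ]
)

print(list_device_names(old_config))
print(list_device_names(new_config))
```

Whether this can be wired into Keras without touching site-packages (for example by setting the backend's _LOCAL_DEVICES before calling multi_gpu_model) is untested here, so treat that as an assumption.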
The next run got past that point and failed with a different error:
Traceback (most recent call last):
  File "train_g_unet.py", line 121, in <module>
    train(args)
  File "train_g_unet.py", line 104, in train
    workers=2)
  File "/media/s2/cyq/anaconda3/envs/keras/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/media/s2/cyq/anaconda3/envs/keras/lib/python3.6/site-packages/keras/engine/training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "/media/s2/cyq/anaconda3/envs/keras/lib/python3.6/site-packages/keras/engine/training_generator.py", line 100, in fit_generator
    callbacks.set_model(callback_model)
  File "/media/s2/cyq/anaconda3/envs/keras/lib/python3.6/site-packages/keras/callbacks/callbacks.py", line 68, in set_model
    callback.set_model(model)
  File "/media/s2/cyq/anaconda3/envs/keras/lib/python3.6/site-packages/keras/callbacks/tensorboard_v2.py", line 116, in set_model
    super(TensorBoard, self).set_model(model)
  File "/media/s2/cyq/anaconda3/envs/keras/lib/python3.6/site-packages/tensorflow_core/python/keras/callbacks.py", line 1532, in set_model
    self.log_dir, self.model._get_distribution_strategy()) # pylint: disable=protected-access
AttributeError: 'Model' object has no attribute '_get_distribution_strategy'
Fix: in lib/python3.6/site-packages/tensorflow_core/python/keras/callbacks.py, modify lines 1532 and 1732.
# Around line 1529: distributed_file_utils.write_dirpath()
# In case this callback is used via native Keras, _get_distribution_strategy does not exist.
if hasattr(self.model, '_get_distribution_strategy'):
    # TensorBoard callback involves writing a summary file in a
    # possibly distributed settings.
    self._log_write_dir = distributed_file_utils.write_dirpath(
        self.log_dir, self.model._get_distribution_strategy())  # pylint: disable=protected-access
else:
    self._log_write_dir = self.log_dir
# Line 1732: distributed_file_utils.remove_temp_dirpath()
# In case this callback is used via native Keras, _get_distribution_strategy does not exist.
if hasattr(self.model, '_get_distribution_strategy'):
    # Safely remove the unneeded temp files.
    distributed_file_utils.remove_temp_dirpath(
        self.log_dir, self.model._get_distribution_strategy())  # pylint: disable=protected-access
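The idea behind both patches is the same hasattr guard: only call the private method when the model object actually has it, and fall back to the plain log directory otherwise. Here is a minimal sketch of that pattern on its own. The two model classes and fake_write_dirpath are hypothetical stand-ins and are not part of Keras or TensorFlow.

```python
def resolve_log_write_dir(model, log_dir, write_dirpath):
    """Pick the directory a TensorBoard-style callback should write to.

    write_dirpath stands in for distributed_file_utils.write_dirpath.
    """
    if hasattr(model, "_get_distribution_strategy"):
        return write_dirpath(log_dir, model._get_distribution_strategy())
    # Native Keras models have no distribution strategy: use log_dir as-is.
    return log_dir


class NativeKerasModel:
    """Stand-in for a keras Model, which lacks _get_distribution_strategy."""


class TfKerasModel:
    """Stand-in for a tf.keras model that exposes the private method."""

    def _get_distribution_strategy(self):
        return "strategy"


def fake_write_dirpath(log_dir, strategy):
    # Hypothetical replacement; the real function decides the path itself.
    return log_dir + "/" + strategy


print(resolve_log_write_dir(NativeKerasModel(), "logs", fake_write_dirpath))
print(resolve_log_write_dir(TfKerasModel(), "logs", fake_write_dirpath))
```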
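After these changes, multi-GPU training ran out of memory (OOM) every time, even with a batch size that had worked before. I could not fix it and gave up.
In the end the only option was to uninstall tensorflow, tensorflow-gpu, and keras: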
conda uninstall tensorflow tensorflow-gpu keras
Then downgrade to tensorflow 1.14.0:
conda install tensorflow-gpu==1.14.0
Problem solved.
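To confirm which versions the environment ended up with after the downgrade, a quick check along these lines can help. It is a sketch, not part of the original steps: installed_version is a helper name I made up, and it simply reports None for anything that cannot be imported.

```python
import importlib


def installed_version(module_name):
    """Return module.__version__, or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


if __name__ == "__main__":
    # Print whatever is currently installed in the active environment.
    for name in ("tensorflow", "keras"):
        print(name, installed_version(name))
```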