Problems encountered when training with multiple GPUs

Problem description

This training run uses two GPUs. The code is as follows:
self.model = multi_gpu_model(self.model, gpus=2)
self.model.compile(loss=self.loss,
                   optimizer=optimizers.RMSprop(lr=1e-4),
                   metrics=['acc'])
At the end of the first epoch, saving the best model failed with the following error:
Epoch 00001: val_acc improved from -inf to 0.71111, saving model to ./model/model-ep001-loss0.644-val_loss0.649.h5
Traceback (most recent call last):
...
...
File "/usr/local/lib/python3.6/copy.py", line 220, in _deepcopy_tuple
y = [deepcopy(a, memo) for a in x]
File "/usr/local/lib/python3.6/copy.py", line 220, in <listcomp>
y = [deepcopy(a, memo) for a in x]
File "/usr/local/lib/python3.6/copy.py", line 169, in deepcopy
rv = reductor(4)
TypeError: can't pickle module objects
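
The last frames of the traceback suggest that copy.deepcopy reached a Python module object somewhere inside the structure being saved. The following is a minimal stand-alone sketch of that failure mode, using only the standard library. The try_deepcopy helper and the config dictionaries are hypothetical illustrations, not the structure Keras actually builds, and the exact wording of the error message varies between Python versions.

```python
import copy
import math


def try_deepcopy(obj):
    """Return None if obj can be deep-copied, otherwise the exception raised."""
    try:
        copy.deepcopy(obj)
        return None
    except TypeError as exc:
        return exc


# Hypothetical config that only holds plain data: deepcopy succeeds.
plain_config = {"arguments": (1, 2)}

# Hypothetical config that holds a reference to a module inside a tuple,
# mirroring the _deepcopy_tuple frames in the traceback: deepcopy fails.
module_config = {"arguments": (math,)}

print(try_deepcopy(plain_config))    # None
print(type(try_deepcopy(module_config)).__name__)    # TypeError
```

If a layer in the model keeps a module reference in the data that gets copied at save time, the same kind of TypeError would be expected.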

Solution ①

The model-saving code was previously:
# Save only the best model, judged by val_acc
checkpoint = callbacks.ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max',period=1)
tensorboard = callbacks.TensorBoard(log_dir=logDir)
callback_lists=[tensorboard,checkpoint]
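Cause: the network uses a Lambda layer, which conflicts with the save_weights_only setting of ModelCheckpoint. save_weights_only defaults to False, so the callback tries to save the full model. Setting it to True avoids the error: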
checkpoint = callbacks.ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, save_weights_only=True, mode='max',period=1)

Solution ②

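Save the original model, that is, the model that was passed into multi_gpu_model. For example, given multi_model = multi_gpu_model(ori_model, gpus=2), the original model is ori_model. Save with ori_model rather than multi_model, for instance ori_model.save('first.h5').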

Solution ③

If ModelCheckpoint is used as a callback, the callback needs to be redefined so that it saves the original model.
The original checkpoint is:
checkpoint = callbacks.ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max',period=1)
Change it to:
multi_model = multi_gpu_model(ori_model,gpus=2)
class ParallelModelCheckpoint(callbacks.ModelCheckpoint):
    def __init__(self,model,filepath, monitor='val_acc', verbose=1,
                 save_best_only=True, save_weights_only=False,
                 mode='max', period=1):
        self.single_model = model
        super(ParallelModelCheckpoint, self).__init__(
            filepath, monitor, verbose, save_best_only,
            save_weights_only, mode, period)

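    def set_model(self, model):
        # Ignore the model passed in during training (the multi-GPU wrapper)
        # and bind the original single model, so that it is the one saved.
        super(ParallelModelCheckpoint, self).set_model(self.single_model)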
checkpoint = ParallelModelCheckpoint(ori_model, filepath)
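
Why overriding set_model is enough can be seen in a small stand-alone sketch. The classes and the fit function below are hypothetical stand-ins written for illustration, not Keras code, and plain strings stand in for the models. The sketch assumes only that the training loop hands the model being trained to every callback through set_model.

```python
class Checkpoint:
    """Hypothetical stand-in for a checkpoint callback."""

    def __init__(self):
        self.model = None

    def set_model(self, model):
        self.model = model

    def on_epoch_end(self, epoch):
        # A real callback would write self.model to disk here.
        return "epoch %d: saving %s" % (epoch, self.model)


class PinnedCheckpoint(Checkpoint):
    """Always saves the model given at construction time."""

    def __init__(self, single_model):
        super().__init__()
        self.single_model = single_model

    def set_model(self, model):
        # Discard the model supplied by the training loop.
        super().set_model(self.single_model)


def fit(training_model, callbacks):
    """Toy training loop: one epoch, no actual training."""
    for cb in callbacks:
        cb.set_model(training_model)
    return [cb.on_epoch_end(1) for cb in callbacks]


messages = fit("multi_model", [Checkpoint(), PinnedCheckpoint("ori_model")])
print(messages)
```

The plain callback ends up holding the multi-GPU wrapper, while the pinned one keeps the original model regardless of which model is being trained.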

Resulting problem

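After setting save_weights_only to True as in Solution ①, only the weights are saved, so the model can no longer be loaded with models.load_model(model_name). Instead, build the model structure first and then load the weights:
# _build_model() is the code that builds the model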
self.model = self._build_model()
self.model = multi_gpu_model(self.model, gpus=self.gpu)
self.model.load_weights(weights_name)
