Saving the model works fine, but loading it fails with a "Key Variable_xxx not found in checkpoint" error. Start by analyzing the cause: the model.ckpt files are normally all present, so the problem arises at restore time. The first step is to print the variables actually stored in the ckpt file. One prerequisite: pass a name argument when defining your variables, otherwise everything prints as the meaningless "Variable_xxx:0" and you cannot tell the variables apart.
import os
from tensorflow.python import pywrap_tensorflow

current_path = os.getcwd()
model_dir = os.path.join(current_path, 'model')
checkpoint_path = os.path.join(model_dir, 'embedding.ckpt-0')  # name of the saved ckpt file; yours may differ
# Read data from the checkpoint file
reader = pywrap_tensorflow.NewCheckpointReader(checkpoint_path)
var_to_shape_map = reader.get_variable_to_shape_map()
# Print the tensor names (and optionally their values)
for key in var_to_shape_map:
    print("tensor_name: ", key)
    # print(reader.get_tensor(key))  # prints the variable values; not needed for this problem and only clutters the output
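As a side note, TensorFlow also exposes tf.train.list_variables, which gives the same name/shape listing in one call; a minimal alternative sketch, reusing the checkpoint_path defined above:
import tensorflow as tf

# list_variables returns (name, shape) pairs for every tensor stored in the checkpoint
for name, shape in tf.train.list_variables(checkpoint_path):
    print("tensor_name: ", name, shape)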
My output:
tensor_name: w_1_1/Adam_1
tensor_name: w_2/Adam_1
tensor_name: b_2
tensor_name: w_1_1
tensor_name: w_out/Adam_1
tensor_name: b_1_1/Adam_1
tensor_name: w_out
tensor_name: w_1
tensor_name: b_out
tensor_name: b_2/Adam
tensor_name: b_1
tensor_name: b_out/Adam_1
tensor_name: b_1_1/Adam
tensor_name: w_1_1/Adam
tensor_name: b_1_1
tensor_name: w_2/Adam
tensor_name: w_2
tensor_name: w_out/Adam
tensor_name: beta1_power
tensor_name: b_out/Adam
tensor_name: b_2/Adam_1
tensor_name: beta2_power
Now the cause is obvious. My network only defines variables like b_1, b_2, w_1, w_2, but because tf.train.AdamOptimizer() is used to apply the gradients, Adam creates extra slot variables with names like w_out/Adam. If you build the Saver without arguments, it saves every global variable, optimizer slots included, and at restore time the matching between the graph and the checkpoint fails with the "not found" error. The fix: when declaring saver = tf.train.Saver(), pass it the variables that should be saved. First, have the network return a dict of those variables:
def ann_net(w_alpha=0.01, b_alpha=0.1):
    # hidden layer 1
    w_1 = tf.Variable(w_alpha * tf.random_normal(shape=(input_size, hidden1_size)), name='w_1')
    b_1 = tf.Variable(b_alpha * tf.random_normal(shape=[hidden1_size]), name='b_1')
    hidden1_output = tf.nn.tanh(tf.add(tf.matmul(X, w_1), b_1))
    hidden1_output = tf.nn.dropout(hidden1_output, keep_prob)
    # hidden layer 2
    shp1 = hidden1_output.get_shape()
    w_2 = tf.Variable(w_alpha * tf.random_normal(shape=(shp1[1].value, hidden2_size)), name='w_2')
    b_2 = tf.Variable(b_alpha * tf.random_normal(shape=[hidden2_size]), name='b_2')
    hidden2_output = tf.nn.tanh(tf.add(tf.matmul(hidden1_output, w_2), b_2))
    hidden2_output = tf.nn.dropout(hidden2_output, keep_prob)
    # output layer
    shp2 = hidden2_output.get_shape()
    w_output = tf.Variable(w_alpha * tf.random_normal(shape=(shp2[1].value, embeding_size)), name='w_out')
    b_output = tf.Variable(b_alpha * tf.random_normal(shape=[embeding_size]), name='b_out')
    output = tf.add(tf.matmul(hidden2_output, w_output), b_output)
    # dict of the variables we actually want to checkpoint (keys become the names in the checkpoint)
    variables_dict = {'b_2': b_2, 'w_out': w_output, 'w_1': w_1, 'b_out': b_output, 'b_1': b_1, 'w_2': w_2}
    return output, variables_dict
In the train() function, initialize the Saver with the variables dict returned by ann_net():
output, var_dict = ann_net()  # var_dict is the variables_dict returned above
with tf.device('/cpu:0'):
    saver = tf.train.Saver(var_dict)
    with tf.Session(config=tf.ConfigProto(device_count={'GPU': 0})) as sess:  # disable GPU, run on CPU
        sess.run(tf.global_variables_initializer())
        step = 0
        ckpt = tf.train.get_checkpoint_state('model/')
        if ckpt and ckpt.model_checkpoint_path:
            saver.restore(sess, ckpt.model_checkpoint_path)
            step = int(ckpt.model_checkpoint_path.rsplit('-', 1)[1])
            print("Model restored.")
        # training code
        # ... ...
        saver.save(sess, 'model/embedding.model', global_step=step)
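At inference time the same trick applies; a minimal sketch, reusing ann_net() and assuming the placeholders it needs (X, keep_prob) are already defined:
output, var_dict = ann_net()
saver = tf.train.Saver(var_dict)  # restore only the network variables, not the Adam slot variables
with tf.Session() as sess:
    ckpt = tf.train.get_checkpoint_state('model/')
    saver.restore(sess, ckpt.model_checkpoint_path)
    # run the network, e.g. sess.run(output, feed_dict={X: batch, keep_prob: 1.0})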
If you downloaded a model from the web, e.g. VGG-16, and only want to load its first few layers into variables you defined yourself, the method is the same: pass a variable list or dict to tf.train.Saver(), as shown in the sketch below.
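For illustration only (the checkpoint names and the local variables below are hypothetical placeholders, not guaranteed to match any particular VGG-16 checkpoint; list the real names first with the reader code above):
# variables in your own graph that should receive the pretrained weights (hypothetical shapes for conv1_1)
my_conv1_w = tf.Variable(tf.random_normal([3, 3, 3, 64]), name='my_conv1_w')
my_conv1_b = tf.Variable(tf.zeros([64]), name='my_conv1_b')
# keys: variable names inside the downloaded checkpoint; values: variables in your own graph
restore_map = {
    'vgg_16/conv1/conv1_1/weights': my_conv1_w,  # hypothetical checkpoint names, verify against your file
    'vgg_16/conv1/conv1_1/biases': my_conv1_b,
}
saver = tf.train.Saver(restore_map)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # initialize the layers that are NOT restored
    saver.restore(sess, 'vgg_16.ckpt')           # path to the downloaded checkpoint file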
The same reasoning applies to an LSTM, except that TensorFlow follows its own naming rules when saving: the bidirectional RNN wrapper uses a default variable_scope named "bidirectional_rnn", so unless you changed it the variables automatically carry that prefix, and the names in the saved model look something like this:
tensor_name: train/train_1/fc_b/Adam
tensor_name: train_1/fc_b
tensor_name: train/fc_b
tensor_name: train/train/bidirectional_rnn/fw/basic_lstm_cell/kernel/Adam
tensor_name: train/bidirectional_rnn/fw/basic_lstm_cell/kernel
tensor_name: train/train/bidirectional_rnn/bw/basic_lstm_cell/bias/Adam
tensor_name: train/beta2_power
tensor_name: train/train/fc_w/Adam
tensor_name: train_1/beta1_power
tensor_name: train/train/bidirectional_rnn/bw/basic_lstm_cell/bias/Adam_1
tensor_name: train/train/bidirectional_rnn/fw/basic_lstm_cell/bias/Adam_1
tensor_name: train/beta1_power
tensor_name: train/train_1/fc_w/Adam_1
tensor_name: train/train/bidirectional_rnn/bw/basic_lstm_cell/kernel/Adam_1
tensor_name: train_1/beta2_power
tensor_name: train/train/fc_w/Adam_1
tensor_name: train/bidirectional_rnn/bw/basic_lstm_cell/kernel
tensor_name: train/train/fc_b/Adam
tensor_name: train/bidirectional_rnn/bw/basic_lstm_cell/bias
tensor_name: train/fc_w
tensor_name: train_1/fc_w
tensor_name: train/bidirectional_rnn/fw/basic_lstm_cell/bias
tensor_name: train/train/fc_b/Adam_1
tensor_name: train/train/bidirectional_rnn/fw/basic_lstm_cell/kernel/Adam_1
tensor_name: train/train/bidirectional_rnn/bw/basic_lstm_cell/kernel/Adam
tensor_name: train/train/bidirectional_rnn/fw/basic_lstm_cell/bias/Adam
tensor_name: train/train_1/fc_b/Adam_1
tensor_name: train/train_1/fc_w/Adam
The leading "train" is a variable_scope I added myself, so when restoring you can do this:
include = ['train/fc_b', 'train/fc_w',
'train/bidirectional_rnn/bw/basic_lstm_cell/bias',
'train/bidirectional_rnn/bw/basic_lstm_cell/kernel',
'train/bidirectional_rnn/fw/basic_lstm_cell/bias',
'train/bidirectional_rnn/fw/basic_lstm_cell/kernel']
variables_to_restore = tf.contrib.slim.get_variables_to_restore(include=include)
saver = tf.train.Saver(variables_to_restore)
with tf.Session(config=tf.ConfigProto(device_count={'GPU': 0})) as sess:  # disable GPU, run on CPU
    sess.run(tf.global_variables_initializer())
    # ... ...