I hadn't touched TensorFlow in quite a while. Recently I was building a word-segmentation tool for an experiment (that's not the point here); I had always run it on a single GPU, and on a whim decided to try running it across multiple GPUs.
I grabbed a random post online, https://blog.csdn.net/winycg/article/details/79759294, adapted it a bit, and ended up with the following code:
Parameter definitions:
import tensorflow as tf
from tensorflow.contrib import rnn

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
bi-LSTM definition:
def bi_lstm(X_inputs):
    embedding = tf.get_variable("embedding", [vocab_size, embedding_size],
                                dtype=tf.float32, trainable=False)
    # X_inputs.shape = [batch_size, timestep_size] -> inputs.shape = [batch_size, timestep_size, embedding_size]
    inputs = tf.nn.embedding_lookup(embedding, X_inputs)
    cell_fw = rnn.MultiRNNCell(
        [rnn.DropoutWrapper(cell=tf.nn.rnn_cell.LSTMCell(hidden_size, forget_bias=1.0,
                                                         state_is_tuple=True, name='fw_lstm_cell'),
                            input_keep_prob=1.0, output_keep_prob=keep_prob)
         for _ in range(layer_num)], state_is_tuple=True)
    cell_bw = rnn.MultiRNNCell(
        [rnn.DropoutWrapper(cell=tf.nn.rnn_cell.LSTMCell(hidden_size, forget_bias=1.0,
                                                         state_is_tuple=True, name='bw_lstm_cell'),
                            input_keep_prob=1.0, output_keep_prob=keep_prob)
         for _ in range(layer_num)], state_is_tuple=True)
    # 4. initial states
    initial_state_fw = cell_fw.zero_state(batch_size, tf.float32)
    initial_state_bw = cell_bw.zero_state(batch_size, tf.float32)
    # 5. bi-lstm computation
    with tf.variable_scope('bidirection_rnn'):
        # compute the outputs and states of the two directions separately
        # forward direction
        outputs_fw = list()
        state_fw = initial_state_fw
        with tf.variable_scope('fw'):
            for timestep in range(timestep_size):
                if timestep > 0:
                    tf.get_variable_scope().reuse_variables()
                (output_fw, state_fw) = cell_fw(inputs[:, timestep, :], state_fw)
                outputs_fw.append(output_fw)
        # backward direction: feed the inputs reversed along time
        outputs_bw = list()
        state_bw = initial_state_bw
        with tf.variable_scope('bw'):
            inputs = tf.reverse(inputs, [1])
            for timestep in range(timestep_size):
                if timestep > 0:
                    tf.get_variable_scope().reuse_variables()
                (output_bw, state_bw) = cell_bw(inputs[:, timestep, :], state_bw)
                outputs_bw.append(output_bw)
        # reverse outputs_bw along the timestep dimension so it lines up with outputs_fw
        outputs_bw = tf.reverse(outputs_bw, [0])
        # concatenate the two outputs into [timestep_size, batch_size, hidden_size * 2]
        output = tf.concat([outputs_fw, outputs_bw], 2)
        # output must be aligned with y_input.shape = [batch_size, timestep_size]
        output = tf.transpose(output, perm=[1, 0, 2])
        output = tf.reshape(output, [-1, hidden_size * 2])
    softmax_w = weight_variable([hidden_size * 2, class_num])
    softmax_b = bias_variable([class_num])
    logits = tf.matmul(output, softmax_w) + softmax_b
    return logits
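To keep the transpose/reshape at the end straight, here is a small shape walkthrough (the hyper-parameter values are made up for illustration and are not from the original):

# assuming batch_size=16, timestep_size=32, hidden_size=128, class_num=5
# inputs:                  [16, 32, embedding_size]
# outputs_fw / outputs_bw: lists of 32 tensors, each [16, 128]
# after tf.concat:         [32, 16, 256]  (timestep, batch, hidden*2)
# after tf.transpose:      [16, 32, 256]  (batch, timestep, hidden*2)
# after tf.reshape:        [512, 256]     (batch*timestep, hidden*2)
# logits:                  [512, 5]       one class distribution per token position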
Gradient averaging:
def average_gradients(tower_grads):
    average_grads = []
    # zip(*tower_grads) regroups the (gradient, variable) pairs by variable
    for grad_and_vars in zip(*tower_grads):
        grads = []
        for g, _ in grad_and_vars:
            expend_g = tf.expand_dims(g, 0)
            grads.append(expend_g)
        grad = tf.concat(grads, 0)
        grad = tf.reduce_mean(grad, 0)
        # every tower shares the variable, so take it from the first tower
        v = grad_and_vars[0][1]
        grad_and_var = (grad, v)
        average_grads.append(grad_and_var)
    return average_grads
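To make the zip(*tower_grads) regrouping concrete, a minimal runnable sketch with two towers and two made-up variables:

v1 = tf.Variable([0.0])
v2 = tf.Variable([0.0])
tower_grads = [
    [(tf.constant([1.0]), v1), (tf.constant([10.0]), v2)],  # gradients from tower/GPU 0
    [(tf.constant([3.0]), v1), (tf.constant([20.0]), v2)],  # gradients from tower/GPU 1
]
# zip(*tower_grads) yields one group per variable:
#   ((g0_v1, v1), (g1_v1, v1)) and ((g0_v2, v2), (g1_v2, v2))
avg = average_gradients(tower_grads)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([g for g, _ in avg]))  # [array([2.], ...), array([15.], ...)]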
Training module:
def train(data_engine):
    # global_step, lr, keep_prob and the hyper-parameters are module-level
    # (a sketch of those definitions follows below)
    with tf.device("/cpu:0"):
        tower_grads = []
        X_inputs = tf.placeholder(tf.int32, [None, timestep_size], name='X_input')
        y_inputs = tf.placeholder(tf.int32, [None, timestep_size], name='y_input')
        elr = tf.train.exponential_decay(lr, global_step, decay_steps, decay_rate, staircase=True)
        optimizer = tf.train.AdamOptimizer(learning_rate=elr)
        with tf.variable_scope(tf.get_variable_scope()):
            for i in range(gpu_nums):
                with tf.device("/gpu:%d" % i):
                    with tf.name_scope("tower_%d" % i):
                        # each tower takes its own slice of the combined batch
                        _x = X_inputs[i * batch_size:(i + 1) * batch_size]
                        _y = y_inputs[i * batch_size:(i + 1) * batch_size]
                        logits = bi_lstm(_x)
                        loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
                            labels=tf.reshape(_y, [-1]), logits=logits))
                        # after the first tower has created the variables, switch the scope
                        # to reuse mode so the other towers (and the test graph) share them
                        tf.get_variable_scope().reuse_variables()
                        grads = optimizer.compute_gradients(loss)
                        tower_grads.append(grads)
                        if i == 0:
                            logits_test = bi_lstm(_x)
                            test_v = tf.cast(tf.argmax(tf.reshape(logits_test, [-1, timestep_size, class_num]), 2), tf.int32)
        grads = average_gradients(tower_grads)
        train_op = optimizer.apply_gradients(grads, global_step=global_step)
    # training loop
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for iteration in range(whole_epoch):
            x1, y1 = data_engine.train_next_batch(batch_size * gpu_nums)
            # t_loss is the loss of the last tower
            _, t_loss = sess.run([train_op, loss],
                                 feed_dict={X_inputs: x1, y_inputs: y1, keep_prob: 0.5, lr: 0.01})
            if iteration % print_step == 0:
                print('iteration: ', iteration)
                x2, y2 = data_engine.validate_next_batch(batch_size)
                y_pre = sess.run(test_v, feed_dict={X_inputs: x2, y_inputs: y2, keep_prob: 1.0})
                print('loss: ', t_loss)
                nozero_evaluate(y2, y_pre)
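The train code above also references keep_prob, lr, global_step and a handful of hyper-parameters that it never defines; in the original they presumably live at module level. A hedged sketch of what those definitions could look like (every value here is an illustrative placeholder, not the author's):

# assumed module-level definitions; all values are made up for illustration
lr = tf.placeholder(tf.float32, [], name='lr')                 # base learning rate, fed per step
keep_prob = tf.placeholder(tf.float32, [], name='keep_prob')   # dropout keep probability
global_step = tf.Variable(0, trainable=False, name='global_step')
gpu_nums, batch_size, timestep_size = 2, 16, 32
vocab_size, embedding_size, hidden_size = 5000, 64, 128
layer_num, class_num = 2, 5
decay_steps, decay_rate = 1000, 0.9
whole_epoch, print_step = 100000, 100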
But at runtime, the gradient-averaging step threw an error:
Traceback (most recent call last):
File "run.py", line 13, in
train(data_engine)
File "/4T/home/leijp/cut2/target/net2.py", line 132, in train
grads = average_gradients(tower_grads)
File "/4T/home/leijp/cut2/target/net2.py", line 39, in average_gradients
expend_g=tf.expand_dims(g,0)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 137, in expand_dims
return gen_array_ops.expand_dims(input, axis, name)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 2088, in expand_dims
"ExpandDims", input=input, dim=axis, name=name)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 528, in _apply_op_helper
(input_name, err))
ValueError: Tried to convert 'input' to a tensor and failed. Error: None values not supported.
Printing out the gradient pairs:
(None, <tf.Variable ...>),
(None, <tf.Variable ...>),
(None, <tf.Variable ...>),
(None, <tf.Variable ...>),
So the gradients computed on GPU:1 were somehow None. I had no idea why, so I started searching online; luckily there was a post about a similar error: https://stackoverflow.com/questions/37593275/multi-gpu-tower-valueerror-none-values-not-supported?answertab=active#tab-top
An answer there suggested it was a variable-scope problem, so I changed the parameter-definition code:
def weight_variable(shape):
    # initial = tf.truncated_normal(shape, stddev=0.1)
    # return tf.Variable(initial)
    return tf.get_variable(name="weights", shape=shape,
                           initializer=tf.truncated_normal_initializer(mean=0, stddev=0.1))

def bias_variable(shape):
    # initial = tf.constant(0.1, shape=shape)
    # return tf.Variable(initial)
    return tf.get_variable(name="bias", shape=shape, initializer=tf.constant_initializer(0.1))
Ran it again, and it worked.
Because I had always trained on a single GPU before, variable reuse rarely came up, so I never paid much attention to scoping and had muddled through for a long time. After studying it properly, I found that defining variables with tf.get_variable() is generally safer than with tf.Variable(): adding a single tf.get_variable_scope().reuse_variables() before the reusing code makes previously defined variables get reused, so the two GPUs share one set of weights. (With tf.Variable(), each tower silently creates its own copies; since optimizer.compute_gradients(loss) differentiates with respect to every trainable variable by default and tower 1's loss does not depend on tower 0's copies, those entries come back as None, which is exactly the error above.)
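A minimal sketch of what this buys, using the fixed weight_variable above (the scope name 'layer' is made up):

with tf.variable_scope('layer'):
    w0 = weight_variable([2, 2])   # creates the variable 'layer/weights'
with tf.variable_scope('layer'):
    tf.get_variable_scope().reuse_variables()
    w1 = weight_variable([2, 2])   # looks up and returns the very same variable
print(w0 is w1)                    # True: one shared set of weights
# the old tf.Variable() version would create a brand-new variable on every call,
# so tower 1's loss would not depend on tower 0's weights at all, and
# compute_gradients would return None for them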
Afterwards, when I printed the gradients again, the output looked like this (taking "weights" as an example):
# gradients: two distinct tensors, one per tower
GPU:0 -> a gradient tensor created under name scope tower_0
GPU:1 -> a different gradient tensor created under name scope tower_1
# weights: a single shared variable
GPU:0 -> the variable "weights"
GPU:1 -> the same variable "weights"
The gradients are two different gradients, while the weights are a single shared copy, which is exactly the data-parallel, multi-GPU design. But why are the weights reused while each GPU still gets its own gradients?
It turns out the variables are defined under tf.variable_scope(), while the gradient computation runs under tf.name_scope(). Under tf.variable_scope(), the same scope_name gives variables identical names, both those from tf.get_variable() and those from tf.Variable(); but without tf.get_variable_scope().reuse_variables() they still cannot be reused. tf.name_scope(), by contrast, only affects the naming of tf.Variable() variables (tf.get_variable() ignores it), and since each tower uses a different scope_name, the tf.Variable() objects and ops defined inside it, including the gradient ops, live in different scopes, so naturally each GPU produces its own copy of the gradients.
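The difference is easy to verify in a few lines (a standalone sketch; the scope names are arbitrary):

with tf.variable_scope('vs'):
    a = tf.get_variable('a', [1])   # 'vs/a:0'  -- variable_scope prefixes both kinds
    b = tf.Variable(0.0, name='b')  # 'vs/b:0'
with tf.name_scope('ns'):
    c = tf.get_variable('c', [1])   # 'c:0'     -- tf.get_variable ignores name_scope
    d = tf.Variable(0.0, name='d')  # 'ns/d:0'  -- tf.Variable does not
print(a.name, b.name, c.name, d.name)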