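A Siamese LSTM model makes it easy to measure the similarity of two phrases or two sentences. Its input is pairs of phrases or sentences labeled as similar or dissimilar, its output is the similarity of the two inputs, and the corresponding hidden layer can be viewed as a semantic representation of the text.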
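Download the code
Download the code from GitHub: https://github.com/dhwajraj/deep-siamese-text-similarity
Install the gensim toolkit
Upgrade and modify the source code (the code was written for a TensorFlow version earlier than 1.0)
Make the following changes:
Download the tf_upgrade.py script and run the following command to convert the whole directory: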
python tf_upgrade.py --intree deep-siamese-text-similarity-master/ --outtree new_deep_siamese
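To give a feel for the kind of rewrite such a conversion performs, here is a toy sketch that renames a few pre-1.0 calls to their 1.0 names. It is not the tf_upgrade.py implementation: the RENAMES table is a small illustrative subset chosen for this example, rename_calls is a hypothetical helper, and the real script handles many more cases (such as argument reordering) that this sketch does not attempt.

```python
import re

# Illustrative subset of pre-1.0 -> 1.0 renames (an assumption for this sketch,
# not the full list handled by the real upgrade script).
RENAMES = {
    "tf.mul": "tf.multiply",
    "tf.sub": "tf.subtract",
    "tf.neg": "tf.negative",
}


def rename_calls(source):
    """Rewrite `old(` to `new(` for every entry in RENAMES."""
    for old, new in RENAMES.items():
        # Match the old name only as a whole call, i.e. not preceded by a word
        # character or dot and immediately followed by "(".
        pattern = r"(?<![\w.])" + re.escape(old) + r"\("
        source = re.sub(pattern, new + "(", source)
    return source


before = "tmp = tf.mul(y, tf.square(d))"
after = rename_calls(before)
print(after)  # tmp = tf.multiply(y, tf.square(d))
```

Names that are already in the new form, such as tf.multiply or tf.subtract, are left untouched because the pattern requires the opening parenthesis directly after the old name.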
The converted code still raises errors when run, so modify siamese_network.py as follows:
import tensorflow as tf  # TensorFlow 1.x (relies on tf.contrib and tf.placeholder)


class SiameseLSTM(object):
    """
    A LSTM based deep Siamese network for text similarity.
    Uses a character embedding layer, followed by a biLSTM and Energy Loss layer.
    """

    def BiRNN(self, x, dropout, scope, embedding_size, sequence_length):
        n_input = embedding_size
        n_steps = sequence_length
        n_hidden = n_steps
        n_layers = 3
        # Prepare data shape to match `static_bidirectional_rnn` function requirements
        # Current data input shape: (batch_size, n_steps, n_input) (?, seq_len, embedding_size)
        # Required shape: 'n_steps' tensors list of shape (batch_size, n_input)
        # Permuting batch_size and n_steps
        x = tf.transpose(x, [1, 0, 2])
        # Reshape to (n_steps*batch_size, n_input)
        x = tf.reshape(x, [-1, n_input])
        # Split to get a list of 'n_steps' tensors of shape (batch_size, n_input)
        x = tf.split(axis=0, num_or_size_splits=n_steps, value=x)
        print(x)
        # Define lstm cells with tensorflow
        # Forward direction cell: n_layers stacked LSTM cells with output dropout
        with tf.name_scope("fw" + scope), tf.variable_scope("fw" + scope):
            print(tf.get_variable_scope().name)

            def lstm_fw_cell():
                fw_cell = tf.contrib.rnn.BasicLSTMCell(n_hidden, forget_bias=1.0, state_is_tuple=True)
                return tf.contrib.rnn.DropoutWrapper(fw_cell, output_keep_prob=dropout)

            lstm_fw_cell_m = tf.contrib.rnn.MultiRNNCell(
                [lstm_fw_cell() for _ in range(n_layers)], state_is_tuple=True)
        # Backward direction cell: same structure as the forward cell
        with tf.name_scope("bw" + scope), tf.variable_scope("bw" + scope):
            print(tf.get_variable_scope().name)

            def lstm_bw_cell():
                bw_cell = tf.contrib.rnn.BasicLSTMCell(n_hidden, forget_bias=1.0, state_is_tuple=True)
                return tf.contrib.rnn.DropoutWrapper(bw_cell, output_keep_prob=dropout)

            lstm_bw_cell_m = tf.contrib.rnn.MultiRNNCell(
                [lstm_bw_cell() for _ in range(n_layers)], state_is_tuple=True)
        # Get lstm cell output
        # try:
        with tf.name_scope("bw" + scope), tf.variable_scope("bw" + scope):
            outputs, _, _ = tf.contrib.rnn.static_bidirectional_rnn(
                lstm_fw_cell_m, lstm_bw_cell_m, x, dtype=tf.float32)
        # except Exception: # Old TensorFlow version only returns outputs not states
        #     outputs = tf.nn.bidirectional_rnn(lstm_fw_cell_m, lstm_bw_cell_m, x,
        #                                       dtype=tf.float32)
        # Use the output at the last time step as the sequence representation
        return outputs[-1]

    def contrastive_loss(self, y, d, batch_size):
        # Similar pairs (y = 1) are penalized by their squared distance
        tmp = y * tf.square(d)
        # tmp = tf.mul(y, tf.square(d))
        # Dissimilar pairs (y = 0) are penalized when closer than the margin of 1
        tmp2 = (1 - y) * tf.square(tf.maximum((1 - d), 0))
        return tf.reduce_sum(tmp + tmp2) / batch_size / 2

    def __init__(
            self, sequence_length, vocab_size, embedding_size, hidden_units, l2_reg_lambda, batch_size):
        # Placeholders for input, output and dropout
        self.input_x1 = tf.placeholder(tf.int32, [None, sequence_length], name="input_x1")
        self.input_x2 = tf.placeholder(tf.int32, [None, sequence_length], name="input_x2")
        self.input_y = tf.placeholder(tf.float32, [None], name="input_y")
        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
        # Keeping track of l2 regularization loss (optional)
        l2_loss = tf.constant(0.0, name="l2_loss")
        # Embedding layer, shared by both inputs
        with tf.name_scope("embedding"):
            self.W = tf.Variable(
                tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
                trainable=True, name="W")
            self.embedded_chars1 = tf.nn.embedding_lookup(self.W, self.input_x1)
            # self.embedded_chars_expanded1 = tf.expand_dims(self.embedded_chars1, -1)
            self.embedded_chars2 = tf.nn.embedding_lookup(self.W, self.input_x2)
            # self.embedded_chars_expanded2 = tf.expand_dims(self.embedded_chars2, -1)
        # Encode both inputs with a biLSTM, then compute the Euclidean distance
        # between the two encodings, normalized by the sum of their norms
        with tf.name_scope("output"):
            self.out1 = self.BiRNN(self.embedded_chars1, self.dropout_keep_prob, "side1", embedding_size, sequence_length)
            self.out2 = self.BiRNN(self.embedded_chars2, self.dropout_keep_prob, "side2", embedding_size, sequence_length)
            self.distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(self.out1, self.out2)), 1, keep_dims=True))
            self.distance = tf.div(
                self.distance,
                tf.add(tf.sqrt(tf.reduce_sum(tf.square(self.out1), 1, keep_dims=True)),
                       tf.sqrt(tf.reduce_sum(tf.square(self.out2), 1, keep_dims=True))))
            self.distance = tf.reshape(self.distance, [-1], name="distance")
        with tf.name_scope("loss"):
            self.loss = self.contrastive_loss(self.input_y, self.distance, batch_size)
        with tf.name_scope("accuracy"):
            # Note: this compares the float distance to the 0/1 label for exact equality
            correct_predictions = tf.equal(self.distance, self.input_y)
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
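
The distance and loss computed by the network can be checked by hand on small vectors. The following is a minimal NumPy sketch of the same two formulas, not part of the repository: the function names and toy inputs are chosen for this example. It assumes, as the loss code suggests, that y = 1 marks a similar pair and y = 0 a dissimilar pair, with the margin fixed at 1.

```python
import numpy as np


def normalized_distance(out1, out2):
    """Row-wise Euclidean distance divided by the sum of the two row norms.

    By the triangle inequality the result lies in [0, 1].
    """
    diff = np.sqrt(np.sum(np.square(out1 - out2), axis=1))
    norms = np.sqrt(np.sum(np.square(out1), axis=1)) + np.sqrt(np.sum(np.square(out2), axis=1))
    return diff / norms


def contrastive_loss(y, d):
    """Contrastive loss with margin 1, averaged over the batch and halved."""
    similar_term = y * np.square(d)
    dissimilar_term = (1 - y) * np.square(np.maximum(1 - d, 0))
    return np.sum(similar_term + dissimilar_term) / len(y) / 2


# Two toy pairs: identical vectors (label 1) and opposite vectors (label 0).
out1 = np.array([[1.0, 0.0], [1.0, 0.0]])
out2 = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 0.0])

d = normalized_distance(out1, out2)
loss = contrastive_loss(y, d)
print(d, loss)
```

Identical vectors get distance 0 and opposite vectors get distance 1, so with these labels the loss is 0. Swapping the labels makes both pairs maximally wrong and the loss rises to 0.5.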
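Run the code
The downloaded code already includes the test data person_match.train and person_match.train2, a dataset of similar person names, which can be used directly to test and validate the algorithm.
Training
python train.py
Training prints a lot of output, which can be redirected to a file. Training takes a long time: half an hour covers roughly 50000 steps.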
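One way to capture the training output in a file is to launch the script from a small Python wrapper. This is only a sketch: run_and_log is a hypothetical helper, and the log file name train.log in the usage comment is an assumption, not something the repository defines.

```python
import subprocess
import sys


def run_and_log(command, log_path):
    """Run `command`, writing stdout and stderr to `log_path`; return the exit code."""
    with open(log_path, "w") as log_file:
        completed = subprocess.run(command, stdout=log_file, stderr=subprocess.STDOUT)
    return completed.returncode


# Example usage (train.log is a hypothetical file name):
# exit_code = run_and_log([sys.executable, "train.py"], "train.log")
```

Merging stderr into stdout keeps warnings and progress lines in the order they were printed.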
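After training finishes, there is a new directory, runs, and a new file, validation.txt0.
./runs holds the trained models and takes up roughly 450 MB. validation.txt0 is the validation set split off during training; its format is: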
Testing
python eval.py --model runs/1494830689/checkpoints/model-9000 --vocab_filepath ./runs/1494830689/checkpoints/vocab --eval_filepath validation.txt0
#Output
#Accuracy: 0.62197
The command prints a lot of output. With model-9000 the accuracy is 0.62197; switching to model-81000 raises it to 0.737626, and model-150000 reaches 0.745707.
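
To compare several checkpoints in one go, the eval.py command line can be generated per checkpoint. The sketch below only builds the commands, reusing the flags and the run directory from the command shown earlier; eval_command is a hypothetical helper, and the run directory name will differ on another machine.

```python
import sys

RUN_DIR = "runs/1494830689"  # run directory from the eval command in this post


def eval_command(step, run_dir=RUN_DIR, eval_file="validation.txt0"):
    """Build the eval.py command line for the checkpoint saved at `step`."""
    return [
        sys.executable, "eval.py",
        "--model", "{}/checkpoints/model-{}".format(run_dir, step),
        "--vocab_filepath", "./{}/checkpoints/vocab".format(run_dir),
        "--eval_filepath", eval_file,
    ]


commands = [eval_command(step) for step in (9000, 81000, 150000)]
for command in commands:
    print(" ".join(command))
    # To actually run it: subprocess.run(command, check=True)
```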