TensorFlow RNN Text Sentiment Analysis
- I previously wrote a text sentiment analysis using a fully connected neural network: http://blog.csdn.net/weiwei9363/article/details/78357670
- This time, we build an RNN in TensorFlow for text sentiment analysis
- Full code and a detailed walkthrough (Solution): https://github.com/jiemojiemo/deep-learning/tree/master/sentiment-rnn
- Training data: https://github.com/jiemojiemo/deep-learning/tree/master/sentiment-network
Step 1 Data Preprocessing
import numpy as np

with open('reviews.txt', 'r') as f:
    reviews = f.read()
with open('labels.txt', 'r') as f:
    labels = f.read()
reviews[:2000]
'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life such as teachers . my years in the teaching profession lead me to believe that bromwell high s satire is much closer to reality than is teachers . the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn t \nstory of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turned into an insane violent mob by the crazy chantings of it s singers . unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting . even those from the era should be turned off . the cryptic dialogue would make shakespeare seem easy to a third grader . on a technical level it s better than you might think with some good cinematography by future great vilmos zsigmond . future stars sally kirkland and frederic forrest can be seen briefly . \nhomelessness or houselessness as george carlin stated has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school work or vote for the matter . most people think of the homeless as just a lost cause while worrying about things such as racism the war on iraq pressuring kids to succeed technology the elections inflation or worrying if they ll be next to end up on the streets . br br but what if y'
from string import punctuation
all_text = ''.join([c for c in reviews if c not in punctuation])
reviews = all_text.split('\n')
all_text = ' '.join(reviews)
words = all_text.split()
all_text[:2000]
'bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my years in the teaching profession lead me to believe that bromwell high s satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled at high a classic line inspector i m here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isn t story of a man who has unnatural feelings for a pig starts out with a opening scene that is a terrific example of absurd comedy a formal orchestra audience is turned into an insane violent mob by the crazy chantings of it s singers unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting even those from the era should be turned off the cryptic dialogue would make shakespeare seem easy to a third grader on a technical level it s better than you might think with some good cinematography by future great vilmos zsigmond future stars sally kirkland and frederic forrest can be seen briefly homelessness or houselessness as george carlin stated has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school work or vote for the matter most people think of the homeless as just a lost cause while worrying about things such as racism the war on iraq pressuring kids to succeed technology the elections inflation or worrying if they ll be next to end up on the streets br br but what if you were given a bet to live on the st'
words[:100]
['bromwell',
'high',
'is',
'a',
'cartoon',
'comedy',
'it',
'ran',
'at',
'the',
'same',
'time',
'as',
'some',
'other',
'programs',
'about',
.....
.....
'at',
'high']
Step 2 Convert the Text to Integers
- A neural network cannot work with raw strings, so the text has to be converted to numbers
- Concretely, we assign every word an integer index
- At the same time, to prepare for training, we convert each review in the training data from a string into a list of integers
from collections import Counter

def get_vocab_to_int(words):
    # Rank words by frequency; the most common word gets index 1.
    # Indexing starts at 1 so that 0 stays free for padding later.
    counts = Counter(words)
    vocab = sorted(counts, key=counts.get, reverse=True)
    vocab_to_int = {word: i for i, word in enumerate(vocab, 1)}
    return vocab_to_int

def get_reviews_ints(vocab_to_int, reviews):
    # Convert each review from a string into a list of word indices.
    reviews_ints = []
    for each in reviews:
        reviews_ints.append([vocab_to_int[word] for word in each.split()])
    return reviews_ints
vocab_to_int = get_vocab_to_int(words)
reviews_ints = get_reviews_ints(vocab_to_int, reviews)
get_reviews_ints(vocab_to_int, ['i love this moive'])
[[10, 115, 11, 59320]]
len(vocab_to_int)
74072
Step 3 Encode the Output Labels
- The labels fall into two classes, 'negative' and 'positive'; we map 'negative' to 0 and 'positive' to 1
labels = np.array([0 if label=='negative' else 1 for label in labels.split('\n')])
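As a quick optional check (not part of the original post), we can confirm the labels look sensible; the IMDB data should be roughly balanced between the two classes:

print(labels.shape)         # one entry per line of labels.txt
print(np.bincount(labels))  # counts of 0 (negative) and 1 (positive)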
Step 4 Clean Up Bad Data
- For some reason (most likely a trailing newline in the input file, which makes split('\n') yield an empty final string), reviews_ints contains a review of length 0; it carries no information, so we remove it
- Also, the longest review has 2514 words, which is far too long for our network, so part of it will have to be cut off
review_lens = Counter([len(x) for x in reviews_ints])
print('Zero-length reviews:{}'.format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))
Zero-length reviews:1
Maximum review length: 2514
non_zeros_idx = [ ii for ii, review in enumerate(reviews_ints) if len(review) != 0]
len(non_zeros_idx)
25000
reviews_ints = [ reviews_ints[ii] for ii in non_zeros_idx]
labels = np.array( [ labels[ii] for ii in non_zeros_idx] )
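To confirm the cleanup worked, we can recount (an optional check, not in the original):

review_lens = Counter([len(x) for x in reviews_ints])
print('Zero-length reviews: {}'.format(review_lens[0]))  # should now be 0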
Step 5 Truncate the Long, Pad the Short
- As noted above, some reviews are too long and must be truncated, while others are too short and must be padded
- We fix the input sequence length at 200: reviews longer than 200 words are truncated, and reviews shorter than 200 are left-padded with zeros
- For example, the review encoded above as [10, 115, 11, 59320] needs 196 zeros prepended, becoming [0, 0, ..., 0, 10, 115, 11, 59320]
seq_len = 200
features = np.zeros((len(reviews_ints), seq_len), dtype=int)
for i, review in enumerate(reviews_ints):
    # Right-align each review: short ones get zeros on the left,
    # long ones are truncated to their first seq_len words.
    features[i, -len(review):] = np.array(review)[:seq_len]
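A quick illustrative check (not in the original) that the padding behaves as described:

print(features.shape)      # (25000, 200)
print(features[0, :10])    # expect leading zeros if review 0 is shorter than 200 words
print(features[0, -10:])   # word indices packed at the right edge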
Step 6 Build the Training, Validation, and Test Sets
split_frac = 0.8
split_index = int(len(features)*split_frac)

train_x, val_x = features[:split_index], features[split_index:]
train_y, val_y = labels[:split_index], labels[split_index:]

test_index = int(len(val_x)*0.5)
val_x, test_x = val_x[:test_index], val_x[test_index:]
val_y, test_y = val_y[:test_index], val_y[test_index:]

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape),
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))
Feature Shapes:
Train set: (20000, 200)
Validation set: (2500, 200)
Test set: (2500, 200)
Step 7 Build the Network
Set the basic hyperparameters
lstm_size = 256
lstm_layers = 1
batch_size = 512
learning_rate = 0.001
Define the inputs and outputs
import tensorflow as tf

# Word indices start at 1, so the embedding matrix needs len(vocab_to_int) + 1
# rows (row 0 is reserved for the padding token).
n_words = len(vocab_to_int) + 1

graph = tf.Graph()
with graph.as_default():
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
Add the embedding layer
embed_size = 300

with graph.as_default():
    # embed has shape [batch_size, seq_len, embed_size]
    embedding = tf.Variable(tf.truncated_normal((n_words, embed_size), stddev=0.01))
    embed = tf.nn.embedding_lookup(embedding, inputs_)
Build the LSTM layer
with graph.as_default():
    def lstm_cell():
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
        return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)

    # Build a fresh cell per layer; reusing one cell object ([drop] * lstm_layers)
    # breaks in TF >= 1.1 because the layers would share variables.
    cell = tf.contrib.rnn.MultiRNNCell([lstm_cell() for _ in range(lstm_layers)])
    initial_state = cell.zero_state(batch_size, tf.float32)
RNN forward pass
with graph.as_default():
    # outputs has shape [batch_size, seq_len, lstm_size]; only the last step is used below.
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)
outputs
Define the output
with graph.as_default():
    # Feed the last LSTM output through a single sigmoid unit to get a score in (0, 1).
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)
    # Pass the learning_rate defined above instead of silently using Adam's default.
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
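Side note: mean squared error on a sigmoid output works, but the more conventional loss for binary classification is sigmoid cross-entropy. A minimal sketch of that variant (not what this post trains with), replacing the prediction and cost lines inside the same graph scope:

    # Illustrative alternative: produce raw logits and let TensorFlow apply the
    # sigmoid inside the numerically stable cross-entropy loss.
    logits = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=None)
    cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.cast(labels_, tf.float32), logits=logits))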
Validation accuracy
with graph.as_default():
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
Get batches
def get_batches(x, y, batch_size=100):
    # Drop the tail so every batch has exactly batch_size examples.
    n_batches = len(x) // batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
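A quick illustrative peek at what the generator yields (not in the original):

x, y = next(get_batches(train_x, train_y, batch_size))
print(x.shape, y.shape)  # expect (512, 200) and (512,)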
Training
epochs = 10

with graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=graph) as sess:
    tf.global_variables_initializer().run()
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)
        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob: 0.5,
                    initial_state: state}
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)

            if iteration % 5 == 0:
                print('Epoch: {}/{}'.format(e, epochs),
                      'Iteration: {}'.format(iteration),
                      'Train loss: {}'.format(loss))

            if iteration % 25 == 0:
                val_acc = []
                val_state = sess.run(cell.zero_state(batch_size, tf.float32))
                for x, y in get_batches(val_x, val_y, batch_size):
                    feed = {inputs_: x,
                            labels_: y[:, None],
                            keep_prob: 1,
                            initial_state: val_state}
                    batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)
                print('Val acc: {:.3f}'.format(np.mean(val_acc)))
            iteration += 1
    saver.save(sess, "checkpoints/sentiment.ckpt")
Epoch: 0/10 Iteration: 5 Train loss: 0.24799075722694397
Epoch: 0/10 Iteration: 10 Train loss: 0.24164661765098572
Epoch: 0/10 Iteration: 15 Train loss: 0.23779860138893127
Epoch: 0/10 Iteration: 20 Train loss: 0.23155733942985535
Epoch: 0/10 Iteration: 25 Train loss: 0.19295498728752136
Val acc: 0.694
Epoch: 0/10 Iteration: 30 Train loss: 0.16817498207092285
Epoch: 0/10 Iteration: 35 Train loss: 0.14103104174137115
Epoch: 1/10 Iteration: 40 Train loss: 0.4157596230506897
Epoch: 1/10 Iteration: 45 Train loss: 0.25596609711647034
Epoch: 1/10 Iteration: 50 Train loss: 0.14873309433460236
Val acc: 0.759
Epoch: 1/10 Iteration: 55 Train loss: 0.2219633162021637
Epoch: 1/10 Iteration: 60 Train loss: 0.22595466673374176
Epoch: 1/10 Iteration: 65 Train loss: 0.22170156240463257
Epoch: 1/10 Iteration: 70 Train loss: 0.21362364292144775
Epoch: 1/10 Iteration: 75 Train loss: 0.21025851368904114
Val acc: 0.637
Epoch: 2/10 Iteration: 80 Train loss: 0.197884202003479
Epoch: 2/10 Iteration: 85 Train loss: 0.18369686603546143
Epoch: 2/10 Iteration: 90 Train loss: 0.15401005744934082
Epoch: 2/10 Iteration: 95 Train loss: 0.08480044454336166
Epoch: 2/10 Iteration: 100 Train loss: 0.21809038519859314
Val acc: 0.555
Epoch: 2/10 Iteration: 105 Train loss: 0.2156117707490921
Epoch: 2/10 Iteration: 110 Train loss: 0.2078854888677597
Epoch: 2/10 Iteration: 115 Train loss: 0.17866834998130798
Epoch: 3/10 Iteration: 120 Train loss: 0.2278885841369629
Epoch: 3/10 Iteration: 125 Train loss: 0.23644667863845825
Val acc: 0.574
Epoch: 3/10 Iteration: 130 Train loss: 0.15737152099609375
Epoch: 3/10 Iteration: 135 Train loss: 0.2996417284011841
Epoch: 3/10 Iteration: 140 Train loss: 0.3013457655906677
Epoch: 3/10 Iteration: 145 Train loss: 0.29811352491378784
Epoch: 3/10 Iteration: 150 Train loss: 0.29609352350234985
Val acc: 0.539
Epoch: 3/10 Iteration: 155 Train loss: 0.29265934228897095
Epoch: 4/10 Iteration: 160 Train loss: 0.3259274959564209
Epoch: 4/10 Iteration: 165 Train loss: 0.1977640688419342
Epoch: 4/10 Iteration: 170 Train loss: 0.10309533774852753
Epoch: 4/10 Iteration: 175 Train loss: 0.20305077731609344
Val acc: 0.722
Epoch: 4/10 Iteration: 180 Train loss: 0.21348100900650024
Epoch: 4/10 Iteration: 185 Train loss: 0.1976686418056488
Epoch: 4/10 Iteration: 190 Train loss: 0.17928491532802582
Epoch: 4/10 Iteration: 195 Train loss: 0.17746716737747192
Epoch: 5/10 Iteration: 200 Train loss: 0.12238124758005142
Val acc: 0.814
Epoch: 5/10 Iteration: 205 Train loss: 0.07527816295623779
Epoch: 5/10 Iteration: 210 Train loss: 0.05444170534610748
Epoch: 5/10 Iteration: 215 Train loss: 0.028456348925828934
Epoch: 5/10 Iteration: 220 Train loss: 0.02309001237154007
Epoch: 5/10 Iteration: 225 Train loss: 0.02358683943748474
Val acc: 0.544
Epoch: 5/10 Iteration: 230 Train loss: 0.0281759575009346
Epoch: 6/10 Iteration: 235 Train loss: 0.36734506487846375
Epoch: 6/10 Iteration: 240 Train loss: 0.27041739225387573
Epoch: 6/10 Iteration: 245 Train loss: 0.06518629193305969
Epoch: 6/10 Iteration: 250 Train loss: 0.27379676699638367
Val acc: 0.683
Epoch: 6/10 Iteration: 255 Train loss: 0.17366482317447662
Epoch: 6/10 Iteration: 260 Train loss: 0.11729621887207031
Epoch: 6/10 Iteration: 265 Train loss: 0.156696617603302
Epoch: 6/10 Iteration: 270 Train loss: 0.15894444286823273
Epoch: 7/10 Iteration: 275 Train loss: 0.14083260297775269
Val acc: 0.653
Epoch: 7/10 Iteration: 280 Train loss: 0.131819948554039
Epoch: 7/10 Iteration: 285 Train loss: 0.1406235545873642
Epoch: 7/10 Iteration: 290 Train loss: 0.12142431735992432
Epoch: 7/10 Iteration: 295 Train loss: 0.10793609172105789
Epoch: 7/10 Iteration: 300 Train loss: 0.1138591319322586
Val acc: 0.778
Epoch: 7/10 Iteration: 305 Train loss: 0.10069040209054947
Epoch: 7/10 Iteration: 310 Train loss: 0.08547944575548172
Epoch: 8/10 Iteration: 315 Train loss: 0.0743105486035347
Epoch: 8/10 Iteration: 320 Train loss: 0.08303466439247131
Epoch: 8/10 Iteration: 325 Train loss: 0.07770203053951263
Val acc: 0.749
Epoch: 8/10 Iteration: 330 Train loss: 0.05231660231947899
Epoch: 8/10 Iteration: 335 Train loss: 0.05823827162384987
Epoch: 8/10 Iteration: 340 Train loss: 0.06528615206480026
Epoch: 8/10 Iteration: 345 Train loss: 0.06311675161123276
Epoch: 8/10 Iteration: 350 Train loss: 0.07824704796075821
Val acc: 0.809
Epoch: 9/10 Iteration: 355 Train loss: 0.04236128553748131
Epoch: 9/10 Iteration: 360 Train loss: 0.03875266760587692
Epoch: 9/10 Iteration: 365 Train loss: 0.045075297355651855
Epoch: 9/10 Iteration: 370 Train loss: 0.05201151967048645
Epoch: 9/10 Iteration: 375 Train loss: 0.051657453179359436
Val acc: 0.805
Epoch: 9/10 Iteration: 380 Train loss: 0.040323011577129364
Epoch: 9/10 Iteration: 385 Train loss: 0.03481965512037277
Epoch: 9/10 Iteration: 390 Train loss: 0.061715394258499146
Testing
test_acc = []
with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(cell.zero_state(batch_size, tf.float32))
    for ii, (x, y) in enumerate(get_batches(test_x, test_y, batch_size), 1):
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_prob: 1,
                initial_state: test_state}
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
        test_acc.append(batch_acc)
    print("Test accuracy: {:.3f}".format(np.mean(test_acc)))
INFO:tensorflow:Restoring parameters from checkpoints/sentiment.ckpt
Test accuracy: 0.785
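Finally, a minimal inference sketch (not in the original post) showing how the trained graph could score a brand-new review. It assumes vocab_to_int, seq_len, batch_size, and the graph tensors defined above are still in scope; unknown words are mapped to 0, a simplification the original code does not make:

def predict_sentiment(review_text, sess):
    # Encode with the training vocabulary; unseen words fall back to 0 (the padding index).
    ints = [vocab_to_int.get(word, 0) for word in review_text.split()]
    # Left-pad / truncate to the fixed training sequence length.
    x = np.zeros((batch_size, seq_len), dtype=int)
    x[0, -len(ints):] = np.array(ints)[:seq_len]
    feed = {inputs_: x,
            keep_prob: 1,
            initial_state: sess.run(cell.zero_state(batch_size, tf.float32))}
    # Row 0 holds the review; the remaining all-zero rows exist only to match
    # the fixed batch size baked into initial_state.
    return sess.run(predictions, feed_dict=feed)[0, 0]

with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    print(predict_sentiment('i love this movie', sess))  # a score near 1 means positive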