TensorFlow RNN (Recurrent Neural Network) Example 2: Text Sentiment Analysis

Text Sentiment Analysis with a TensorFlow RNN

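  • I previously wrote a text sentiment analysis using a fully connected neural network: http://blog.csdn.net/weiwei9363/article/details/78357670
  • This time, we use TensorFlow to build an RNN that performs sentiment analysis on text
  • Complete code and a detailed walkthrough (Solution): https://github.com/jiemojiemo/deep-learning/tree/master/sentiment-rnn
  • Training data: https://github.com/jiemojiemo/deep-learning/tree/master/sentiment-network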

Step 1 Data processing

import numpy as np
# Read the data
with open('reviews.txt', 'r') as f:
    reviews = f.read()
with open('labels.txt', 'r') as f:
    labels = f.read()
# Reviews are separated by '\n', one review per line
reviews[:2000]
'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   \nstory of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it  s singers . unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting . even those from the era should be turned off . the cryptic dialogue would make shakespeare seem easy to a third grader . on a technical level it  s better than you might think with some good cinematography by future great vilmos zsigmond . future stars sally kirkland and frederic forrest can be seen briefly .  \nhomelessness  or houselessness as george carlin stated  has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school  work  or vote for the matter . most people think of the homeless as just a lost cause while worrying about things such as racism  the war on iraq  pressuring kids to succeed  technology  the elections  inflation  or worrying if they  ll be next to end up on the streets .  
br    br   but what if y'
from string import punctuation

# Remove punctuation
all_text = ''.join([c for c in reviews if c not in punctuation])
# Reviews are separated by '\n', one review per line
reviews = all_text.split('\n')

all_text = ' '.join(reviews)
# Collect all words
words = all_text.split()
all_text[:2000]
'bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t    story of a man who has unnatural feelings for a pig  starts out with a opening scene that is a terrific example of absurd comedy  a formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it  s singers  unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting  even those from the era should be turned off  the cryptic dialogue would make shakespeare seem easy to a third grader  on a technical level it  s better than you might think with some good cinematography by future great vilmos zsigmond  future stars sally kirkland and frederic forrest can be seen briefly    homelessness  or houselessness as george carlin stated  has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school  work  or vote for the matter  most people think of the homeless as just a lost cause while worrying about things such as racism  the war on iraq  pressuring kids to succeed  technology  the elections  inflation  or worrying if they  ll be next to end up on the streets   br    br   but what if you were given a bet to live on the 
st'
words[:100]
['bromwell',
 'high',
 'is',
 'a',
 'cartoon',
 'comedy',
 'it',
 'ran',
 'at',
 'the',
 'same',
 'time',
 'as',
 'some',
 'other',
 'programs',
 'about',
 .....
 .....
 'at',
 'high']
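
To make the cleaning step concrete, here is a small sketch on a made-up two-review string (the toy text is an assumption, not taken from reviews.txt). It applies the same three operations as above: drop punctuation, split on newlines, then split into words.

```python
from string import punctuation

# Toy stand-in for the contents of reviews.txt (hypothetical text).
toy_reviews = "a great movie . loved it !\nit isn't good , sadly .\n"

# Drop every punctuation character, as in the step above.
toy_text = ''.join(c for c in toy_reviews if c not in punctuation)

# One review per line; in this toy string the final newline leaves an empty last entry.
toy_list = toy_text.split('\n')

# Flatten everything into a single word list.
toy_words = ' '.join(toy_list).split()

print(toy_list)
print(toy_words)
```

Note that removing the apostrophe turns "isn't" into "isnt", so contractions become single tokens of their own.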

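Step 2 Converting text to numbers

  • A neural network cannot work on strings directly, so the text has to be converted to numbers
  • Concretely, each word is assigned an integer index
  • For the training that follows, every review in the training data is also converted from a string to a list of these integers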
from collections import Counter
def get_vocab_to_int(words):

    # Count how often each word occurs
    counts = Counter(words)

    # Sort words by frequency, most frequent first
    vocab = sorted(counts, key=counts.get, reverse=True)

    # Build the word-to-integer mapping: each word gets an integer index that represents it in the network
    # For example, 'apple' is just a number inside the network, say 500.
    # Indices start at 1; 0 is reserved for a special purpose (padding, see Step 5)
    vocab_to_int = { word : i for i, word in enumerate(vocab, 1)}

    return vocab_to_int
def get_reviews_ints(vocab_to_int, reviews):
    # Convert each review to integers by looking up every word in vocab_to_int
    # For example, "i love this movie" might become [5, 36, 45, 12354]
    reviews_ints = []
    for each in reviews:
        reviews_ints.append( [ vocab_to_int[word] for word in each.split()] )

    return reviews_ints
vocab_to_int = get_vocab_to_int(words)

reviews_ints = get_reviews_ints(vocab_to_int, reviews)
# Example: encode "i love this moive" (the misspelling 'moive' happens to be in the vocabulary)
get_reviews_ints(vocab_to_int, ['i love this moive'])
[[10, 115, 11, 59320]]
# There are 74072 distinct words in total
len(vocab_to_int)
74072
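
Note that `get_reviews_ints` indexes `vocab_to_int` directly, so a word that never appeared in the training text raises a `KeyError`. Below is a minimal sketch of one possible workaround, assuming unknown words can simply be skipped. The helper name `encode_known_words` and the toy vocabulary are hypothetical and not part of the original code.

```python
def encode_known_words(vocab_to_int, reviews):
    """Like get_reviews_ints, but skips words that are missing from the vocabulary."""
    return [[vocab_to_int[word] for word in review.split() if word in vocab_to_int]
            for review in reviews]

# Toy vocabulary (hypothetical indices, starting at 1 as above).
toy_vocab = {'i': 1, 'love': 2, 'this': 3}

encoded = encode_known_words(toy_vocab, ['i love this film', 'unseen words only'])
print(encoded)  # [[1, 2, 3], []]
```

Skipping can leave a review empty, so text encoded this way would need the same zero-length check that Step 4 applies to the training data.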

Step 3 Encoding the output labels

  • The labels come in two classes, 'negative' and 'positive'; we map 'negative' to 0 and 'positive' to 1
labels = np.array([0 if label=='negative' else 1 for label in labels.split('\n')])

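Step 4 Cleaning up bad data

  • reviews_ints contains one review of length 0 (probably an artifact of a final newline in reviews.txt); it carries no information, so we remove it
  • The longest review has 2514 words, which is far too long for our network, so long reviews will have to be truncated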
review_lens = Counter([len(x) for x in reviews_ints])
print('Zero-length reviews:{}'.format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))
Zero-length reviews:1
Maximum review length: 2514
# Indices of the reviews whose length is not 0
non_zeros_idx = [ ii for ii, review in enumerate(reviews_ints) if len(review) != 0]
len(non_zeros_idx)
25000
# Drop the zero-length review from reviews_ints, and the matching entry from labels
reviews_ints = [ reviews_ints[ii] for ii in non_zeros_idx]
labels = np.array( [ labels[ii] for ii in non_zeros_idx] )
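
A plausible cause of the single empty review is a trailing newline at the end of reviews.txt, since `split('\n')` then yields an empty final string (this is an assumption about the file, not something verified here). The same split on labels.txt would turn that empty string into label 1, which is why reviews and labels have to be filtered with the same indices. A toy sketch with made-up data:

```python
# Toy data (hypothetical): two real reviews plus the empty one left by a trailing newline.
toy_reviews_ints = [[4, 2, 7], [5, 1], []]
toy_labels_raw = 'positive\nnegative\n'

# Same rule as Step 3: anything that is not 'negative' becomes 1, including ''.
toy_labels = [0 if label == 'negative' else 1 for label in toy_labels_raw.split('\n')]
print(toy_labels)  # [1, 0, 1]  (the last 1 comes from the empty string)

# Filter reviews and labels with the same indices so they stay aligned.
keep = [i for i, review in enumerate(toy_reviews_ints) if len(review) != 0]
toy_reviews_ints = [toy_reviews_ints[i] for i in keep]
toy_labels = [toy_labels[i] for i in keep]
print(toy_reviews_ints, toy_labels)  # [[4, 2, 7], [5, 1]] [1, 0]
```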

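Step 5 Truncating and padding

  • As noted above, some reviews are too long and must be truncated, while others are too short and must be padded
  • We fix the input sequence length at 200 words: reviews longer than 200 are cut to their first 200 words, and shorter ones are padded with 0 on the left
  • For example, 'i love this moive' (the Step 2 example, misspelling included) is [10, 115, 11, 59320], so 196 zeros are added on the left, giving [0, 0, …, 0, 10, 115, 11, 59320]
# Sequence length, in words
seq_len = 200
# Shape: number of reviews * seq_len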
features = np.zeros((len(reviews_ints), seq_len), dtype=int)
for i,review in enumerate(reviews_ints):
    features[i, -len(review):] = np.array(review)[:seq_len]
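
The slicing in the loop above handles both cases at once. Here is a plain-Python sketch of the same rule with a toy `seq_len` of 5; the helper name `pad_or_truncate` is hypothetical.

```python
def pad_or_truncate(review, seq_len):
    """Keep the first seq_len tokens; left-pad shorter reviews with 0."""
    clipped = review[:seq_len]
    return [0] * (seq_len - len(clipped)) + clipped

print(pad_or_truncate([10, 115, 11], 5))          # [0, 0, 10, 115, 11]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5))  # [1, 2, 3, 4, 5]
```

Padding on the left keeps the real words at the end of the sequence, which matters later because the network reads only the LSTM output at the last time step.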

Step 6 Building the training, validation, and test sets

# Fraction of the data used for training
split_frac = 0.8
# Split off the training set
split_index = int(len(features)*split_frac)
train_x, val_x = features[:split_index], features[split_index:]
train_y, val_y = labels[:split_index], labels[split_index:]

# The remainder is divided evenly into a validation set and a test set
test_index = int(len(val_x)*0.5)
val_x, test_x = val_x[:test_index], val_x[test_index:]
val_y, test_y = val_y[:test_index], val_y[test_index:]

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))
            Feature Shapes:
Train set:      (20000, 200) 
Validation set:     (2500, 200) 
Test set:       (2500, 200)
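
The split arithmetic can be checked independently of the data. A small sketch (the function name `split_sizes` is hypothetical) that reproduces the three sizes printed above for 25000 reviews:

```python
def split_sizes(n_samples, split_frac=0.8):
    """Return (train, validation, test) sizes using the same integer arithmetic as above."""
    n_train = int(n_samples * split_frac)
    n_rest = n_samples - n_train
    n_val = int(n_rest * 0.5)
    n_test = n_rest - n_val
    return n_train, n_val, n_test

print(split_sizes(25000))  # (20000, 2500, 2500)
```

Because the split is purely positional, it implicitly assumes the file is not sorted by label; shuffling reviews and labels together before splitting is a common safeguard when that is in doubt.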

Step 7 Building the network

Setting the basic parameters

# Number of units in each LSTM layer
lstm_size = 256
# Number of LSTM layers
lstm_layers = 1
batch_size = 512
learning_rate = 0.001

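Defining the inputs and outputs

import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.contrib)

# Vocabulary size, plus 1 because index 0 is reserved for padding
n_words = len(vocab_to_int) + 1

# Create the graph object
graph = tf.Graph()
# Add nodes to the graph
with graph.as_default():
    # Input: a batch of reviews encoded as integers
    # Shape [None, None]: the first None is the batch size (could be set to batch_size),
    # the second None is the review length (could be set to seq_len)
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
    # Input labels
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    # Dropout keep probability, e.g. 0.8 means each unit is kept with probability 0.8
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')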

Adding the embedding layer

embed_size = 300

with graph.as_default():
    embedding = tf.Variable(tf.truncated_normal((n_words, embed_size), stddev=0.01))
    embed = tf.nn.embedding_lookup(embedding, inputs_)
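
Conceptually, `tf.nn.embedding_lookup` picks rows from the embedding matrix by index. The plain-Python sketch below uses a toy table (all values made up) to show why the table needs one row more than the number of words when word indices start at 1 and 0 is the padding id.

```python
# Toy vocabulary of 3 words with ids 1..3; id 0 is reserved for padding.
toy_vocab_size = 3

# One row per id, so toy_vocab_size + 1 rows in total (values are arbitrary).
toy_embedding = [[0.0, 0.0],   # id 0: padding
                 [0.1, 0.2],   # id 1
                 [0.3, 0.4],   # id 2
                 [0.5, 0.6]]   # id 3

def lookup(table, ids):
    """Replace every id in a batch of sequences by its row in the table."""
    return [[table[i] for i in seq] for seq in ids]

batch = [[0, 0, 1, 3]]  # one left-padded review
embedded = lookup(toy_embedding, batch)
print(embedded)  # [[[0.0, 0.0], [0.0, 0.0], [0.1, 0.2], [0.5, 0.6]]]
```

With only `toy_vocab_size` rows, looking up id 3 would fall outside the table.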

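Building the LSTM layer

with graph.as_default():

    def build_cell():
        # One LSTM layer with lstm_size units
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
        # Wrap it with dropout (the second positional argument is input_keep_prob)
        return tf.contrib.rnn.DropoutWrapper(lstm, keep_prob)

    # Stack lstm_layers layers; each layer needs its own cell object
    cell = tf.contrib.rnn.MultiRNNCell([build_cell() for _ in range(lstm_layers)])

    # Every input sequence needs an initial state;
    # each batch holds batch_size sequences, hence batch_size initial states
    initial_state = cell.zero_state(batch_size, tf.float32)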

RNN forward pass

with graph.as_default():
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)
# outputs has shape (512, ?, 256)
# 512 is batch_size
# ? is seq_len
# 256 is the number of LSTM units
outputs

Defining the output

with graph.as_default():
    # Only the final LSTM output matters, so outputs[:, -1] takes the output at the last time step of each review
    # outputs[:, -1] has shape batch_size * lstm_size
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)

    # Use the learning_rate defined above (0.001, which is also Adam's default)
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

Measuring accuracy

with graph.as_default():
    correct_pred = tf.equal( tf.cast(tf.round(predictions), tf.int32), labels_ )
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
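
For intuition about these two quantities, here is a plain-Python sketch on four made-up predictions (no TensorFlow involved). It computes the mean squared error and the rounded-prediction accuracy in the same spirit as the graph does. It also shows that a network predicting 0.5 for everything has an MSE of 0.25, which is close to the first losses in the training log (about 0.248).

```python
toy_predictions = [0.9, 0.2, 0.6, 0.4]  # sigmoid outputs (made up)
toy_labels = [1, 0, 0, 1]

# Mean squared error between labels and predictions.
mse = sum((y - p) ** 2 for y, p in zip(toy_labels, toy_predictions)) / len(toy_labels)

# A prediction counts as correct when it rounds to the label.
correct = [int(round(p)) == y for y, p in zip(toy_labels, toy_predictions)]
acc = sum(correct) / len(correct)

print(round(mse, 4), acc)  # 0.1925 0.5

# Predicting 0.5 everywhere gives 0.5 ** 2 = 0.25 for binary labels.
baseline = sum((y - 0.5) ** 2 for y in toy_labels) / len(toy_labels)
print(baseline)  # 0.25
```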

Getting batches

def get_batches(x, y, batch_size=100):
    n_batches = len(x) // batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]

    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
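
Because `get_batches` keeps only full batches, the tail of each data set is dropped. A quick sketch that counts batches for the set sizes used in this post (the function is copied from above; `range` objects stand in for the real arrays):

```python
def get_batches(x, y, batch_size=100):
    n_batches = len(x) // batch_size
    x, y = x[:n_batches * batch_size], y[:n_batches * batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii + batch_size], y[ii:ii + batch_size]

# Stand-ins with the same lengths as train_x and val_x.
train_batches = list(get_batches(range(20000), range(20000), 512))
val_batches = list(get_batches(range(2500), range(2500), 512))

print(len(train_batches), 20000 - len(train_batches) * 512)  # 39 32
print(len(val_batches), 2500 - len(val_batches) * 512)       # 4 452
```

With 39 batches per epoch, 10 epochs give 390 iterations, which is the last iteration number in the training log. Full batches are required here because `initial_state` is built for exactly `batch_size` sequences.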

Training

epochs = 10

# Saver for persisting the trained model
with graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=graph) as sess:
    tf.global_variables_initializer().run()
    iteration = 1

    for e in range(epochs):
        state = sess.run(initial_state)

        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            feed = {inputs_ : x,
                    labels_ : y[:,None],
                    keep_prob : 0.5,
                    initial_state : state}

            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)

            if iteration % 5 == 0:
                print('Epoch: {}/{}'.format(e, epochs),
                      'Iteration: {}'.format(iteration),
                      'Train loss: {}'.format(loss))

            if iteration % 25 == 0:
                val_acc = []
                val_state = sess.run(cell.zero_state(batch_size, tf.float32))

                for x, y in get_batches(val_x, val_y, batch_size):
                    feed = {inputs_ : x,
                            labels_ : y[:,None], 
                            keep_prob : 1,
                            initial_state : val_state}

                    batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)

                print('Val acc: {:.3f}'.format(np.mean(val_acc)))

            iteration += 1

    saver.save(sess, "checkpoints/sentiment.ckpt")
Epoch: 0/10 Iteration: 5 Train loss: 0.24799075722694397
Epoch: 0/10 Iteration: 10 Train loss: 0.24164661765098572
Epoch: 0/10 Iteration: 15 Train loss: 0.23779860138893127
Epoch: 0/10 Iteration: 20 Train loss: 0.23155733942985535
Epoch: 0/10 Iteration: 25 Train loss: 0.19295498728752136
Val acc: 0.694
Epoch: 0/10 Iteration: 30 Train loss: 0.16817498207092285
Epoch: 0/10 Iteration: 35 Train loss: 0.14103104174137115
Epoch: 1/10 Iteration: 40 Train loss: 0.4157596230506897
Epoch: 1/10 Iteration: 45 Train loss: 0.25596609711647034
Epoch: 1/10 Iteration: 50 Train loss: 0.14873309433460236
Val acc: 0.759
Epoch: 1/10 Iteration: 55 Train loss: 0.2219633162021637
Epoch: 1/10 Iteration: 60 Train loss: 0.22595466673374176
Epoch: 1/10 Iteration: 65 Train loss: 0.22170156240463257
Epoch: 1/10 Iteration: 70 Train loss: 0.21362364292144775
Epoch: 1/10 Iteration: 75 Train loss: 0.21025851368904114
Val acc: 0.637
Epoch: 2/10 Iteration: 80 Train loss: 0.197884202003479
Epoch: 2/10 Iteration: 85 Train loss: 0.18369686603546143
Epoch: 2/10 Iteration: 90 Train loss: 0.15401005744934082
Epoch: 2/10 Iteration: 95 Train loss: 0.08480044454336166
Epoch: 2/10 Iteration: 100 Train loss: 0.21809038519859314
Val acc: 0.555
Epoch: 2/10 Iteration: 105 Train loss: 0.2156117707490921
Epoch: 2/10 Iteration: 110 Train loss: 0.2078854888677597
Epoch: 2/10 Iteration: 115 Train loss: 0.17866834998130798
Epoch: 3/10 Iteration: 120 Train loss: 0.2278885841369629
Epoch: 3/10 Iteration: 125 Train loss: 0.23644667863845825
Val acc: 0.574
Epoch: 3/10 Iteration: 130 Train loss: 0.15737152099609375
Epoch: 3/10 Iteration: 135 Train loss: 0.2996417284011841
Epoch: 3/10 Iteration: 140 Train loss: 0.3013457655906677
Epoch: 3/10 Iteration: 145 Train loss: 0.29811352491378784
Epoch: 3/10 Iteration: 150 Train loss: 0.29609352350234985
Val acc: 0.539
Epoch: 3/10 Iteration: 155 Train loss: 0.29265934228897095
Epoch: 4/10 Iteration: 160 Train loss: 0.3259274959564209
Epoch: 4/10 Iteration: 165 Train loss: 0.1977640688419342
Epoch: 4/10 Iteration: 170 Train loss: 0.10309533774852753
Epoch: 4/10 Iteration: 175 Train loss: 0.20305077731609344
Val acc: 0.722
Epoch: 4/10 Iteration: 180 Train loss: 0.21348100900650024
Epoch: 4/10 Iteration: 185 Train loss: 0.1976686418056488
Epoch: 4/10 Iteration: 190 Train loss: 0.17928491532802582
Epoch: 4/10 Iteration: 195 Train loss: 0.17746716737747192
Epoch: 5/10 Iteration: 200 Train loss: 0.12238124758005142
Val acc: 0.814
Epoch: 5/10 Iteration: 205 Train loss: 0.07527816295623779
Epoch: 5/10 Iteration: 210 Train loss: 0.05444170534610748
Epoch: 5/10 Iteration: 215 Train loss: 0.028456348925828934
Epoch: 5/10 Iteration: 220 Train loss: 0.02309001237154007
Epoch: 5/10 Iteration: 225 Train loss: 0.02358683943748474
Val acc: 0.544
Epoch: 5/10 Iteration: 230 Train loss: 0.0281759575009346
Epoch: 6/10 Iteration: 235 Train loss: 0.36734506487846375
Epoch: 6/10 Iteration: 240 Train loss: 0.27041739225387573
Epoch: 6/10 Iteration: 245 Train loss: 0.06518629193305969
Epoch: 6/10 Iteration: 250 Train loss: 0.27379676699638367
Val acc: 0.683
Epoch: 6/10 Iteration: 255 Train loss: 0.17366482317447662
Epoch: 6/10 Iteration: 260 Train loss: 0.11729621887207031
Epoch: 6/10 Iteration: 265 Train loss: 0.156696617603302
Epoch: 6/10 Iteration: 270 Train loss: 0.15894444286823273
Epoch: 7/10 Iteration: 275 Train loss: 0.14083260297775269
Val acc: 0.653
Epoch: 7/10 Iteration: 280 Train loss: 0.131819948554039
Epoch: 7/10 Iteration: 285 Train loss: 0.1406235545873642
Epoch: 7/10 Iteration: 290 Train loss: 0.12142431735992432
Epoch: 7/10 Iteration: 295 Train loss: 0.10793609172105789
Epoch: 7/10 Iteration: 300 Train loss: 0.1138591319322586
Val acc: 0.778
Epoch: 7/10 Iteration: 305 Train loss: 0.10069040209054947
Epoch: 7/10 Iteration: 310 Train loss: 0.08547944575548172
Epoch: 8/10 Iteration: 315 Train loss: 0.0743105486035347
Epoch: 8/10 Iteration: 320 Train loss: 0.08303466439247131
Epoch: 8/10 Iteration: 325 Train loss: 0.07770203053951263
Val acc: 0.749
Epoch: 8/10 Iteration: 330 Train loss: 0.05231660231947899
Epoch: 8/10 Iteration: 335 Train loss: 0.05823827162384987
Epoch: 8/10 Iteration: 340 Train loss: 0.06528615206480026
Epoch: 8/10 Iteration: 345 Train loss: 0.06311675161123276
Epoch: 8/10 Iteration: 350 Train loss: 0.07824704796075821
Val acc: 0.809
Epoch: 9/10 Iteration: 355 Train loss: 0.04236128553748131
Epoch: 9/10 Iteration: 360 Train loss: 0.03875266760587692
Epoch: 9/10 Iteration: 365 Train loss: 0.045075297355651855
Epoch: 9/10 Iteration: 370 Train loss: 0.05201151967048645
Epoch: 9/10 Iteration: 375 Train loss: 0.051657453179359436
Val acc: 0.805
Epoch: 9/10 Iteration: 380 Train loss: 0.040323011577129364
Epoch: 9/10 Iteration: 385 Train loss: 0.03481965512037277
Epoch: 9/10 Iteration: 390 Train loss: 0.061715394258499146

Testing

test_acc = []
with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(cell.zero_state(batch_size, tf.float32))
    for ii, (x, y) in enumerate(get_batches(test_x, test_y, batch_size), 1):
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_prob: 1,
                initial_state: test_state}
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
        test_acc.append(batch_acc)
    print("Test accuracy: {:.3f}".format(np.mean(test_acc)))
INFO:tensorflow:Restoring parameters from checkpoints/sentiment.ckpt
Test accuracy: 0.785
