【Keras-LSTM】IMDb

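Table of contents

  • 1 Introduction to LSTM
  • 2 Downloading the dataset
  • 3 Data preprocessing
  • 4 Building the model
  • 5 Training the model
  • 6 Evaluating model accuracy
  • 7 Predicted probabilities and predicted classes
  • 8 Viewing test-set text and its predictions
  • 9 Testing a new review
  • 10 Saving the model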


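An MLP or a CNN can only classify based on the current input, so problems involving time series call for models such as RNN and LSTM. This post uses an LSTM to analyze and predict sentiment on the IMDb dataset. For prediction with an MLP, see 【Keras-MLP】IMDb; for classification with an RNN, see 【Keras-RNN】IMDb.

LSTM (long short-term memory) is also a recurrent neural network, designed specifically to address the long-term dependency problem of RNNs: a traditional RNN struggles to learn connections between pieces of information that are far apart. Put simply, an RNN has short-term memory but no long-term memory.
Without further ado, let's get started.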

1 Introduction to LSTM

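  References / reposted from

  • A Google Brain scientist explains LSTM: a story about "forgetting" and "remembering"
  • Understanding LSTM step by step
  • A fully illustrated guide to RNNs, RNN variants, Seq2Seq, and the attention mechanism
  • Understanding LSTM Networks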

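  LSTM stands for Long Short-Term Memory. It is used for sequential data and is widely applied in natural language processing, speech recognition, and related fields. Compared with the original recurrent neural network (RNN), LSTM mitigates the RNN's vanishing gradient problem and can handle long sequences, which has made it one of the most widely used RNN variants.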

  RNN
[Figure 1]

  In other words:
[Figure 2]

  LSTM replaces the RNN's neuron (A in the figure) with the following structure
[Figure 3]

  The single input becomes four inputs. The three gates are driven by control signals that, after a sigmoid, lie between 0 and 1.

  The overall structure
[Figure 4]

  Legend for the components
[Figure 5]

  The formulas are as follows

  Forget Gate:
[Figure 6]

  Input Gate:
[Figure 7]

  Update Memory:
[Figure 8]

  Output Gate:
[Figure 9]
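
For reference, the four steps named above are commonly written as follows. This is the standard formulation used in "Understanding LSTM Networks" (listed in the references); the symbols are that article's and may differ from the notation in the original figures. Here $\sigma$ is the sigmoid, $\odot$ is element-wise multiplication, and $[h_{t-1}, x_t]$ is the previous hidden state concatenated with the current input.

```latex
\begin{aligned}
\text{Forget gate:}   \quad & f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
\text{Input gate:}    \quad & i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
                            & \tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \\
\text{Update memory:} \quad & C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
\text{Output gate:}   \quad & o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
                            & h_t = o_t \odot \tanh\left(C_t\right)
\end{aligned}
```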


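Q: For the second step (the input gate) and the third step (the output gate), why use tanh rather than some other function?
A: Other functions, such as relu, can be used and may even work better. The main reason for tanh is that its output lies between -1 and 1, which gives the LSTM a wider range of values to memorize.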
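
To make the ranges in the answer above concrete, here is a minimal sketch of a single LSTM step for scalar inputs in plain Python. The function name `lstm_step`, the parameter layout, and the weight values are all made up for illustration; a real layer uses weight matrices learned during training.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step. p maps a gate name to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))           # forget gate, always in (0, 1)
    i = sigmoid(pre("i"))           # input gate, always in (0, 1)
    c_tilde = math.tanh(pre("c"))   # candidate memory, in (-1, 1)
    o = sigmoid(pre("o"))           # output gate, always in (0, 1)
    c = f * c_prev + i * c_tilde    # update memory
    h = o * math.tanh(c)            # new hidden state
    return h, c, (f, i, o, c_tilde)


# Hypothetical weights, chosen only so the numbers are easy to follow
params = {
    "f": (0.5, 0.1, 0.0),
    "i": (0.6, -0.2, 0.1),
    "c": (-1.5, 0.3, 0.0),
    "o": (0.4, 0.2, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, -2.0, 0.5]:
    h, c, (f, i, o, c_tilde) = lstm_step(x, h, c, params)
    print(f"x={x:+.1f}  f={f:.3f}  i={i:.3f}  o={o:.3f}  "
          f"c_tilde={c_tilde:+.3f}  c={c:+.3f}  h={h:+.3f}")
```

Because the candidate passes through tanh, it can be negative, so the memory cell can be pushed down as well as up. A sigmoid in that position would only ever add non-negative values.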

2 Downloading the dataset

import urllib.request
import os
import tarfile

# Download the dataset
url="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
filepath="data/aclImdb_v1.tar.gz"
# Create the target directory first; urlretrieve fails if it does not exist
os.makedirs("data", exist_ok=True)
if not os.path.isfile(filepath):
    result=urllib.request.urlretrieve(url,filepath)
    print('downloaded:',result)
# Extract the archive
if not os.path.exists("data/aclImdb"):
    with tarfile.open(filepath, 'r:gz') as tfile:
        tfile.extractall('data/')

3 Data preprocessing

Same as in 【Keras-MLP】IMDb

from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
import re
re_tag = re.compile(r'<[^>]+>')

def rm_tags(text):
    return re_tag.sub('', text)

import os
def read_files(filetype):
    path = "data/aclImdb/"
    file_list=[]

    # Positive reviews first, then negative ones
    positive_path=path + filetype+"/pos/"
    for f in os.listdir(positive_path):
        file_list+=[positive_path+f]
    n_pos = len(file_list)

    negative_path=path + filetype+"/neg/"
    for f in os.listdir(negative_path):
        file_list+=[negative_path+f]

    print('read',filetype, 'files:',len(file_list))

    # Labels follow the file order: 1 = positive, 0 = negative
    # (12500 of each in the standard dataset)
    all_labels = [1] * n_pos + [0] * (len(file_list) - n_pos)

    all_texts  = []

    # Read each review and strip its HTML tags
    for fi in file_list:
        with open(fi,encoding='utf8') as file_input:
            all_texts += [rm_tags(" ".join(file_input.readlines()))]

    return all_labels,all_texts

Run the preprocessing

# Read the files
y_train,train_text=read_files("train")
y_test,test_text=read_files("test")

# Build the dictionary that maps words to numbers
token = Tokenizer(num_words=3800)
token.fit_on_texts(train_text)

# Map the words of each review to numbers
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq  = token.texts_to_sequences(test_text)

# Pad or truncate every review to exactly 380 numbers
x_train = sequence.pad_sequences(x_train_seq, maxlen=380)
x_test  = sequence.pad_sequences(x_test_seq,  maxlen=380)

output

read train files: 25000
read test files: 25000
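
To make the last preprocessing step concrete, here is a minimal pure-Python sketch of what `sequence.pad_sequences` does with its default settings, which both pad and truncate at the front of the sequence. The helper name `pad_pre` is hypothetical and only for illustration; it is not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults for one sequence (maxlen must be > 0):
    keep the last `maxlen` items, then left-pad with `value`."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


print(pad_pre([5, 8, 2], maxlen=5))           # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=5))  # [2, 3, 4, 5, 6]
```

So a review shorter than 380 tokens gets zeros in front, and a longer one keeps only its last 380 tokens.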

4 Building the model

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation,Flatten
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM

model = Sequential()

model.add(Embedding(output_dim=32,
                    input_dim=3800, 
                    input_length=380))
model.add(Dropout(0.2))

# Add the LSTM layer
model.add(LSTM(32))

model.add(Dense(units=256,
                activation='relu' ))
model.add(Dropout(0.2))

model.add(Dense(units=1,
                activation='sigmoid' ))

model.summary()

output

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 380, 32)           121600    
_________________________________________________________________
dropout_1 (Dropout)          (None, 380, 32)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_1 (Dense)              (None, 256)               8448      
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 257       
=================================================================
Total params: 138,625
Trainable params: 138,625
Non-trainable params: 0
_________________________________________________________________

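Parameter counts

Embedding: 3800*32 = 121600

For how the number of LSTM parameters is computed, see

  • https://www.cnblogs.com/wdmx/p/9284037.html
  • https://cloud.tencent.com/developer/news/388498

LSTM: 4*32*(32+32+1) = 8320

First Dense layer: 32*256+256 = 8448

Output Dense layer: 256*1 + 1 = 257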
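
The same arithmetic can be checked with a few lines of Python. This is a small sketch with hypothetical helper names. The LSTM formula reflects the four weight sets (forget gate, input gate, candidate memory, output gate), each with input weights, recurrent weights, and one bias per unit.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per word index
    return vocab_size * dim


def lstm_params(input_dim, units):
    # 4 weight sets, each: input weights + recurrent weights + bias, per unit
    return 4 * units * (input_dim + units + 1)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units


counts = [
    embedding_params(3800, 32),  # embedding_1
    lstm_params(32, 32),         # lstm_1
    dense_params(32, 256),       # dense_1
    dense_params(256, 1),        # dense_2
]
print(counts, sum(counts))  # [121600, 8320, 8448, 257] 138625
```

The total matches the "Total params: 138,625" line in the model summary.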

5 Training the model

model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

train_history =model.fit(x_train, y_train,batch_size=100, 
                         epochs=10,verbose=2,
                         validation_split=0.2)

For an explanation of the parameters, see
【Keras-MLP】MNIST or the Keras Chinese documentation

output

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 - 74s - loss: 0.4944 - acc: 0.7517 - val_loss: 0.4402 - val_acc: 0.7676
Epoch 2/10
 - 70s - loss: 0.2842 - acc: 0.8844 - val_loss: 0.3079 - val_acc: 0.8626
Epoch 3/10
 - 69s - loss: 0.2373 - acc: 0.9067 - val_loss: 0.5062 - val_acc: 0.7886
Epoch 4/10
 - 69s - loss: 0.2094 - acc: 0.9186 - val_loss: 0.3603 - val_acc: 0.8424
Epoch 5/10
 - 71s - loss: 0.1953 - acc: 0.9253 - val_loss: 0.4827 - val_acc: 0.7898
Epoch 6/10
 - 70s - loss: 0.1873 - acc: 0.9267 - val_loss: 0.3809 - val_acc: 0.8552
Epoch 7/10
 - 71s - loss: 0.1794 - acc: 0.9326 - val_loss: 0.4820 - val_acc: 0.8036
Epoch 8/10
 - 72s - loss: 0.1582 - acc: 0.9418 - val_loss: 0.6176 - val_acc: 0.7926
Epoch 9/10
 - 70s - loss: 0.1442 - acc: 0.9479 - val_loss: 0.4762 - val_acc: 0.8340
Epoch 10/10
 - 71s - loss: 0.1375 - acc: 0.9489 - val_loss: 0.5651 - val_acc: 0.8022
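
The log hints at overfitting: training loss falls every epoch, while validation loss is lowest early on and then fluctuates upward. Here is a small sketch that reads this off the numbers, with the values copied by hand from the log above (in a live session they would come from `train_history.history`):

```python
# Values transcribed from the training log above
train_loss = [0.4944, 0.2842, 0.2373, 0.2094, 0.1953,
              0.1873, 0.1794, 0.1582, 0.1442, 0.1375]
val_loss = [0.4402, 0.3079, 0.5062, 0.3603, 0.4827,
            0.3809, 0.4820, 0.6176, 0.4762, 0.5651]
val_acc = [0.7676, 0.8626, 0.7886, 0.8424, 0.7898,
           0.8552, 0.8036, 0.7926, 0.8340, 0.8022]

# Epoch (1-based) with the lowest validation loss
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Does the training loss decrease at every epoch?
train_always_improves = all(a > b for a, b in zip(train_loss, train_loss[1:]))

print("best epoch by val_loss:", best_epoch)                # 2
print("val_acc at that epoch:", val_acc[best_epoch - 1])    # 0.8626
print("train loss always improves:", train_always_improves) # True
```

One common response is to stop training around the best validation epoch, for example with Keras's `EarlyStopping` callback. Whether that helps here is something to verify experimentally.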

Visualizing the results

%pylab inline
import matplotlib.pyplot as plt
def show_train_history(train_history,train,validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

Call it to see how accuracy changes

show_train_history(train_history,'acc','val_acc')

[Figure 10]

Call it to see how the loss changes

show_train_history(train_history,'loss','val_loss')

[Figure 11]

6 Evaluating model accuracy

scores = model.evaluate(x_test, y_test, verbose=1)
scores[1]

output

25000/25000 [==============================] - 42s 2ms/step
0.85292

Note: scores[0] is the loss

7 Predicted probabilities and predicted classes

Inspect the output probabilities

probability=model.predict(x_test)
probability[:10]

output

array([[0.99887246],
       [0.9926156 ],
       [0.9984492 ],
       [0.8551011 ],
       [0.99635917],
       [0.9980083 ],
       [0.9960477 ],
       [0.9684047 ],
       [0.87130046],
       [0.26892343]], dtype=float32)

Inspect the predicted classes: a probability greater than 0.5 maps to 1, anything else maps to 0

predict=model.predict_classes(x_test)
predict[:10]

output

array([[1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0]], dtype=int32)
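
`predict_classes` has been removed from newer TensorFlow/Keras releases, so it may not be available in your installation. For a single sigmoid output it amounts to thresholding the probabilities at 0.5. Here is a minimal sketch in plain Python using the ten probabilities printed above:

```python
# The first ten probabilities from the output above
probs = [0.99887246, 0.9926156, 0.9984492, 0.8551011, 0.99635917,
         0.9980083, 0.9960477, 0.9684047, 0.87130046, 0.26892343]

# Class 1 (positive) if the probability exceeds 0.5, else class 0 (negative)
classes = [int(p > 0.5) for p in probs]
print(classes)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
```

With the NumPy array returned by `model.predict(x_test)`, the equivalent one-liner is `(model.predict(x_test) > 0.5).astype("int32")`.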

8 Viewing test-set text and its predictions

SentimentDict={1:'positive',0:'negative'}
def display_test_Sentiment(i):
    print(test_text[i])
    print('label:',SentimentDict[y_test[i]],
          'prediction:',SentimentDict[predict[i][0]])

Call it

display_test_Sentiment(2)

output

BLACK WATER is a thriller that manages to completely transcend it’s limitations (it’s an indie flick) by continually subverting expectations to emerge as an intense experience.In the tradition of all good animal centered thrillers ie Jaws, The Edge, the original Cat People, the directors know that restraint and what isn’t shown are the best ways to pack a punch. The performances are real and gripping, the crocdodile is extremely well done, indeed if the Black Water website is to be believed that’s because they used real crocs and the swamp location is fabulous.If you are after a B-grade gore fest croc romp forget Black Water but if you want a clever, suspenseful ride that will have you fearing the water and wondering what the hell would I do if i was up that tree then it’s a must see.
label: positive prediction: positive

9 Testing a new review

http://www.imdb.com/title/tt2771200

def predict_review(input_text):
    # Convert the review to a list of numbers
    input_seq = token.texts_to_sequences([input_text])
    # Pad or truncate the list so that every input has length 380
    pad_input_seq  = sequence.pad_sequences(input_seq , maxlen=380)
    # Predict the class
    predict_result=model.predict_classes(pad_input_seq)
    # Print the result
    print(SentimentDict[predict_result[0][0]])

Call it

predict_review('''
It’s hard to believe that the same talented director who made the influential cult action classic The Road Warrior had anything to do with this disaster.
Road Warrior was raw, gritty, violent and uncompromising, and this movie is the exact opposite. It’s like Road Warrior for kids who need constant action in their movies.
This is the movie. The good guys get into a fight with the bad guys, outrun them, they break down in their vehicle and fix it. Rinse and repeat. The second half of the movie is the first half again just done faster.
The Road Warrior may have been a simple premise but it made you feel something, even with it’s opening narration before any action was even shown. And the supporting characters were given just enough time for each of them to be likable or relatable.
In this movie there is absolutely nothing and no one to care about. We’re supposed to care about the characters because… well we should. George Miller just wants us to, and in one of the most cringe worthy moments Charlize Theron’s character breaks down while dramatic music plays to try desperately to make us care.
Tom Hardy is pathetic as Max. One of the dullest leading men I’ve seen in a long time. There’s not one single moment throughout the entire movie where he comes anywhere near reaching the same level of charisma Mel Gibson did in the role. Gibson made more of an impression just eating a tin of dog food. I’m still confused as to what accent Hardy was even trying to do.
I was amazed that Max has now become a cartoon character as well. Gibson’s Max was a semi-realistic tough guy who hurt, bled, and nearly died several times. Now he survives car crashes and tornadoes with ease?
In the previous movies, fuel and guns and bullets were rare. Not anymore. It doesn’t even seem Post-Apocalyptic. There’s no sense of desperation anymore and everything is too glossy looking. And the main villain’s super model looking wives with their perfect skin are about as convincing as apocalyptic survivors as Hardy’s Australian accent is. They’re so boring and one-dimensional, George Miller could have combined them all into one character and you wouldn’t miss anyone.
Some of the green screen is very obvious and fake looking, and the CGI sandstorm is laughably bad. It wouldn’t look out of place in a Pixar movie.
There’s no tension, no real struggle, or any real dirt and grit that Road Warrior had. Everything George Miller got right with that masterpiece he gets completely wrong here.
''')

output

negative

10 Saving the model

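import os

# Create the output directory first; open() fails if it does not exist
os.makedirs("SaveModel", exist_ok=True)

# Save the architecture as JSON
model_json = model.to_json()
with open("SaveModel/Imdb_RNN_model.json", "w") as json_file:
    json_file.write(model_json)

# Save the weights
model.save_weights("SaveModel/Imdb_RNN_model.h5")
print("Saved model to disk")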

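Notice

Notice: the code comes from the book 《TensorFlow+Keras深度学习人工智能实践应用》 by 林大贵 (Lin Dagui). Please credit the source when quoting or reposting. If the book interests you, consider buying a copy!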
