tflearn: computing text sentiment scores with a recurrent neural network (LSTM)

Adapted from the official tutorial.

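For Chinese text, first use hanziconv to convert Traditional Chinese to Simplified Chinese and lowercase any English, then use jieba to segment each sentence into a word sequence, and finally map the words to a sequence of word IDs using a vocabulary.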
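The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration, not the code used for the Taobao experiment: the helper names `default_tokenize`, `build_vocab` and `to_ids`, and the choice of reserving ID 0 for padding and ID 1 for unknown words, are assumptions made for this example.

```python
from collections import Counter

PAD_ID = 0  # assumption: 0 is reserved for padding, matching pad_sequences(value=0.)
UNK_ID = 1  # assumption: 1 stands for words that are not in the vocabulary


def default_tokenize(text):
    """Traditional -> Simplified, lowercase, then jieba word segmentation."""
    # Imported lazily so the vocabulary helpers below work without these packages.
    from hanziconv import HanziConv
    import jieba
    text = HanziConv.toSimplified(text).lower()
    return [w for w in jieba.cut(text) if w.strip()]


def build_vocab(token_lists, vocab_len):
    """Give the most frequent words the IDs 2 .. vocab_len - 1."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    most_common = counts.most_common(vocab_len - 2)
    return {w: i + 2 for i, (w, _) in enumerate(most_common)}


def to_ids(tokens, vocab):
    """Map a word sequence to a word-ID sequence."""
    return [vocab.get(w, UNK_ID) for w in tokens]
```

With a vocabulary built from the training reviews (`vocab_len` equal to `VOCAB_LEN` in the script below), `to_ids(default_tokenize(review), vocab)` yields an ID sequence that can then be passed to `pad_sequences`.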

I used this model to classify Taobao reviews as positive or negative, and it reached 88.7% accuracy.

# -*- coding: utf-8 -*-

"""
tflearn tutorial: classifying text with an LSTM recurrent neural network
https://github.com/tflearn/tflearn/blob/master/examples/nlp/lstm.py
"""

import tflearn
from tflearn.data_utils import to_categorical, pad_sequences
from tflearn.datasets import imdb

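# Number of words in the vocabulary
VOCAB_LEN = 10000
# Maximum number of words per input
SEQUENCE_LEN = 100
# Dimension of each word vector
WORD_FEATURE_DIM = 128
# Dimension of the document feature vector
DOC_FEATURE_DIM = 128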


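# Load the IMDB dataset; download URL: http://www.iro.umontreal.ca/~lisa/deep/data/imdb.pkl
# Limit the vocabulary to VOCAB_LEN words
# valid_portion=0.1 holds out 10% of the training data; that held-out
# (validation) split is what this script calls `test`
train, test, _ = imdb.load_data(path='imdb.pkl', n_words=VOCAB_LEN,
                                valid_portion=0.1)
# X holds the word-ID sequences, Y holds the labels
train_x, train_y = train
test_x, test_y = test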

# Data preprocessing
# Pad with 0 or truncate every input to length SEQUENCE_LEN
train_x = pad_sequences(train_x, maxlen=SEQUENCE_LEN, value=0.)
test_x = pad_sequences(test_x, maxlen=SEQUENCE_LEN, value=0.)
# Convert the labels to 2-dimensional one-hot vectors
train_y = to_categorical(train_y, 2)
test_y = to_categorical(test_y, 2)

# Build the neural network
# Input layer: SEQUENCE_LEN word IDs per sample
net = tflearn.input_data([None, SEQUENCE_LEN])
# Embedding layer: input dimension VOCAB_LEN, output dimension WORD_FEATURE_DIM
net = tflearn.embedding(net, input_dim=VOCAB_LEN, output_dim=WORD_FEATURE_DIM)
# LSTM layer (recurrent neural network) with DOC_FEATURE_DIM units,
# with dropout at a keep probability of 0.8 to reduce overfitting
net = tflearn.lstm(net, DOC_FEATURE_DIM, dropout=0.8)
# Fully connected layer with a softmax classifier over 2 classes;
# outputs the probability that a review is positive or negative
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy')
model = tflearn.DNN(net, tensorboard_verbose=0)

# Train for 10 epochs
model.fit(train_x, train_y, validation_set=(test_x, test_y), show_metric=True,
          n_epoch=10, batch_size=32)
# | Adam | epoch: 010 | loss: 0.04662 - acc: 0.9880 | val_loss: 0.91889 - val_acc: 0.7992 -- iter: 22500/22500

# Prediction
x = [
    [17, 25, 10, 406, 26, 14, 56, 61, 62, 323, 4],
    [6691, 1, 10, 333, 10, 17, 27, 4, 34, 181, 6, 1418, 256, 4],
]
x = pad_sequences(x, maxlen=SEQUENCE_LEN, value=0.)
# Compute the class probabilities for each review
print(model.predict(x))
# [[0.9944102  0.00558983]
#  [0.00333996 0.99666   ]]
# Compute the class labels for each review
print(model.predict_label(x))
# Returns an array of class indices, sorted by descending probability
# [[0 1]
#  [1 0]]
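To make the relationship between the two outputs above concrete, here is a small pure-Python sketch that derives the ranked labels from the printed probabilities and reads off a single sentiment score per review. The helper names `rank_labels` and `positive_score` are made up for this illustration, and treating column 1 as the positive class is an assumption; check it against how your own labels were encoded.

```python
def rank_labels(probs):
    """Return, for each row, the class indices sorted by descending probability."""
    return [sorted(range(len(row)), key=lambda i: -row[i]) for row in probs]


def positive_score(probs, positive_index=1):
    """Use the probability of one class as the sentiment score.

    Assumption: column `positive_index` is the positive class.
    """
    return [row[positive_index] for row in probs]


# Probabilities copied from the model.predict(x) output above
probs = [[0.9944102, 0.00558983],
         [0.00333996, 0.99666]]
print(rank_labels(probs))     # [[0, 1], [1, 0]]
print(positive_score(probs))  # [0.00558983, 0.99666]
```

The first result matches the `model.predict_label(x)` output shown above; the second gives one number per review that can serve as its sentiment value.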
