- Problem statement
Predict whether a user's movie review is positive or negative.
- Workflow
1. Tokenizing the text data: convert each word into an integer index
%%time
from tensorflow.keras.preprocessing.text import Tokenizer

# Keep only the 10,000 most frequent words; rarer words are dropped.
num_words = 10000
tokenizer = Tokenizer(num_words=num_words)
# data_text: the full corpus of review strings (train + test combined).
tokenizer.fit_on_texts(data_text)
# Convert each review into a list of integer word indices.
x_train_tokens = tokenizer.texts_to_sequences(x_train_text)
x_test_tokens = tokenizer.texts_to_sequences(x_test_text)
The %%time cell magic reports how long the whole cell took to run.
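As a quick sanity check, one review can be printed next to its integer encoding (a minimal sketch using the variables defined above):

# One review as raw text and as its integer-index encoding.
print(x_train_text[0])
print(x_train_tokens[0])
# Words outside the 10,000-word vocabulary are simply dropped, so the
# token list can be shorter than the review's word count.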
2. Padding and truncating
Convert texts of varying length into fixed-length inputs, and build an index-to-word dictionary.
from tensorflow.keras.preprocessing.sequence import pad_sequences

# max_tokens (the fixed sequence length) must be defined beforehand; a common
# choice (an assumption, not shown here) is mean + 2 * std of the token counts.
pad = "pre"  # pad and truncate at the start of each sequence
x_train_pad = pad_sequences(x_train_tokens, maxlen=max_tokens,
                            padding=pad, truncating=pad)
x_test_pad = pad_sequences(x_test_tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)

# Build the inverse mapping from integer index back to word.
idx = tokenizer.word_index
inverse_map = dict(zip(idx.values(), idx.keys()))
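With inverse_map in hand, a small helper (hypothetical, named tokens_to_string here) can turn a token sequence back into readable text, which is a handy check that padding and truncation behave as expected:

def tokens_to_string(tokens):
    # Index 0 is the padding value and maps to no word, so skip it.
    words = [inverse_map[token] for token in tokens if token != 0]
    return " ".join(words)

print(tokens_to_string(x_train_tokens[0]))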
3. Building the RNN model
The first layer is an Embedding, followed by three stacked GRU layers and a final fully connected (Dense) output layer.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
# Map each word index to a dense 8-dimensional vector.
embedding_size = 8
model.add(Embedding(input_dim=num_words,
                    output_dim=embedding_size,
                    input_length=max_tokens,
                    name="layer_embedding"))
# The first two GRUs return the full sequence so that the next GRU sees one
# vector per timestep; the last GRU returns only its final state.
model.add(GRU(units=16, return_sequences=True))
model.add(GRU(units=8, return_sequences=True))
model.add(GRU(units=4))
# A single sigmoid unit outputs the probability that the review is positive.
model.add(Dense(1, activation="sigmoid"))

optimizer = Adam(learning_rate=1e-3)  # lr= is deprecated in newer Keras
model.compile(loss="binary_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

model.fit(x_train_pad, y_train,
          validation_split=0.05,
          epochs=1,
          batch_size=64)
# GRU arguments:
# units: positive integer, the dimensionality of the output space.
# return_sequences: Boolean. Whether to return the full output sequence
#   or only the last output.
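The effect of return_sequences shows up directly in the layer output shapes, which model.summary() prints (the shapes below are inferred from the architecture above, batch dimension omitted):

model.summary()
# layer_embedding               -> (max_tokens, 8)
# GRU, return_sequences=True    -> (max_tokens, 16)
# GRU, return_sequences=True    -> (max_tokens, 8)
# GRU, last output only         -> (4,)
# Dense                         -> (1,)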
4. Predicting on new data
text1 = "This movie is fantastic! I really like it because it is so good!"
text2 = "Good movie!"
text3 = "Maybe I like this movie."
text4 = "Meh ..."
text5 = "If I were a drunk teenager then this movie might be good."
text6 = "Bad movie!"
text7 = "Not a good movie!"
text8 = "This movie really sucks! Can I get my money back please?"
texts = [text1, text2, text3, text4, text5, text6, text7, text8]
tokens = tokenizer.texts_to_sequences(texts)
tokens_pad = pad_sequences(tokens, maxlen=max_tokens,
padding=pad,
truncating=pad)
model.predict(tokens_pad)
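model.predict returns one sigmoid probability per text; thresholding at 0.5 (the usual convention for a sigmoid output, assumed here) turns the probabilities into labels:

pred = model.predict(tokens_pad)
# Probability > 0.5 means the review is classified as positive.
for text, p in zip(texts, pred[:, 0]):
    label = "positive" if p > 0.5 else "negative"
    print(f"{p:.3f}  {label}  {text}")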
5. Inspecting the weights of the embedding layer
layer_embedding = model.get_layer("layer_embedding")
# get_weights() returns a list; the first entry is the
# (num_words, embedding_size) embedding matrix.
weights_embedding = layer_embedding.get_weights()[0]
token_good = tokenizer.word_index["good"]
token_great = tokenizer.word_index["great"]
print(weights_embedding[token_good])
print(weights_embedding[token_great])
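If the embedding has learned anything useful, vectors for semantically close words such as "good" and "great" should point in similar directions. A cosine-similarity check (a sketch using NumPy, not part of the original) makes that concrete:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(weights_embedding[token_good],
                        weights_embedding[token_great]))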