Implementing IMDB review classification with TensorFlow 2

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
'''
tf.keras.datasets.imdb.load_data(num_words=None,maxlen=None,index_from=3)
num_words: max number of words to include. Words are ranked by how often they occur and only the most frequent words are kept.
maxlen: sequences longer than this will be filtered out.
index_from: index actual words with this index and higher (default is 3).

Returns (x_train,y_train),(x_test,y_test)
x_train and x_test are arrays of sentences; each sentence is a list of integers, i.e. the ids of its words in word2id.
Note that these ids never exceed num_words, so if num_words is small, many different words in the reviews collapse onto
the same reserved ids. By default 0 is reserved for padding, 1 for the start token and 2 for out-of-vocabulary words.
The embedding layer later maps every word id of every sentence to an embedding_dim-dimensional vector, so a larger
num_words is usually better; otherwise many distinct words end up sharing the <unk>/<pad> embeddings.
Raises ValueError: in case maxlen is so low that no input sequence could be kept.


tf.keras.datasets.imdb.get_word_index() returns the word index dictionary (word2id) #a large word2id dict mapping every word to its id

tf.keras.preprocessing.sequence.pad_sequences(sequences,maxlen=None,dtype='int32',padding="pre",value=0.0)
sequences: list of lists, where each element is a sequence.
maxlen: int, maximum length of all sequences (if not provided it is computed automatically). It is better to provide maxlen,
because the longest sequence in the data may not be the length you actually want.
padding: "pre" or "post", usually set to "post".
value: padding value, usually word2id["<pad>"].

Returns: Numpy array with shape (len(sequences),maxlen)
'''
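# A minimal sketch of pad_sequences behaviour on toy data (illustrative values only, not from the dataset):
# "post" padding appends the pad value until every sequence reaches maxlen.
_demo=tf.keras.preprocessing.sequence.pad_sequences([[5,8,2],[7]],maxlen=4,padding="post",value=0)
# _demo -> [[5 8 2 0]
#           [7 0 0 0]]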
vocab_size=30000
max_seq_length=300
(x_train,y_train),(x_test,y_test)=tf.keras.datasets.imdb.load_data(num_words=vocab_size,maxlen=max_seq_length,index_from=3)
word2id=tf.keras.datasets.imdb.get_word_index()
assert x_train.shape==y_train.shape and x_test.shape==y_test.shape
#the full IMDB dataset has 25000 training and 25000 test reviews; because maxlen is set above,
#reviews longer than max_seq_length are filtered out, so each split may contain somewhat fewer samples
#x_train,x_test are arrays of lists, where each element is a sequence (the exact shape cannot be stated because sentence lengths differ)
#y_train,y_test are arrays of integers, where each element is 0 or 1 (negative or positive)
word2id[""]=0
word2id[""]=1
word2id[""]=2
word2id[""]=3
id2word={v:k for k,v in word2id.items()}

def review_sentence(wordid_list):
    #map a list of word ids back to a readable review string
    return " ".join([id2word[id_] for id_ in wordid_list])

padded_x_train=tf.keras.preprocessing.sequence.pad_sequences(sequences=x_train,maxlen=max_seq_length,padding="post",value=word2id["<pad>"])
padded_x_test=tf.keras.preprocessing.sequence.pad_sequences(sequences=x_test,maxlen=max_seq_length,padding="post",value=word2id["<pad>"])

'''
tf.keras.layers.Embedding(input_dim,output_dim,input_length)
input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1 (because indexing starts at 0).
output_dim: int >= 0. Dimension of the dense embedding.
input_length: length of input sequences, usually max_seq_length.
Note that the largest integer in the input must be smaller than input_dim (here input_dim is num_words, the size of the kept vocabulary).
Every word id in the input has to be indexable so that its embedding can be looked up.

tf.keras.layers.GlobalAveragePooling1D(data_format="channels_last")
channels_last corresponds to inputs with shape (batch_size,max_seq_length,features)
Output shape: 2D tensor with shape (batch_size,features)
GAP:
    It allows the input to have any size along the pooled axis, not just one fixed shape.
    It works by taking the average of every feature map, so the output dimension equals the number of features.
    For example, given a 15*15*8 tensor of feature maps, we take the average of each 15*15 slice (one mean value per feature map),
    giving an 8-dimensional vector. Even if the slice size changes, say the input is 32*32*8, we average each 32*32 slice
    and still get an 8-dimensional vector out of the global average pooling layer.

tf.keras.losses.binary_crossentropy(y_true,y_pred,from_logits=False)
Use this cross-entropy loss when there are only two label classes (assumed to be 0 and 1). y_pred and y_true should have the same shape [batch_size,1].
from_logits=False by default, which means y_pred contains probabilities (i.e. values in [0,1]).
tf.keras.losses.categorical_crossentropy(y_true,y_pred,from_logits=False)
Use this cross-entropy loss when there are two or more label classes. Labels are expected to be provided in one-hot representation.
If you want to provide labels as integers, use SparseCategoricalCrossentropy instead. Both y_pred and y_true have shape [batch_size,num_classes].

The above covers Embedding, GlobalAveragePooling1D, and how the loss functions are defined.
'''
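# A minimal, illustrative check of the shapes and losses described above (toy values, not from the dataset):
# GlobalAveragePooling1D averages over the time axis, and binary_crossentropy compares probabilities with 0/1 labels.
_x=tf.random.normal((4,max_seq_length,20))                  #(batch_size,max_seq_length,features)
print(tf.keras.layers.GlobalAveragePooling1D()(_x).shape)   #(4, 20)
print(tf.keras.losses.binary_crossentropy([[1.0],[0.0]],[[0.9],[0.2]]))  #one loss value per sample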
embedding_dim=20
batch_size=32
epochs=30
model=tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size,output_dim=embedding_dim,input_length=max_seq_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(units=128,activation='relu'),
    tf.keras.layers.Dense(units=1,activation='sigmoid')
    ])
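# Printing the summary makes the shape flow explicit:
# (batch,300) -> Embedding -> (batch,300,20) -> GAP -> (batch,20) -> Dense -> (batch,128) -> Dense -> (batch,1)
model.summary()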

#a single sigmoid output with 0/1 labels calls for binary cross-entropy, not categorical
model.compile(loss='binary_crossentropy',metrics=['accuracy'],optimizer='adam')

history=model.fit(x=padded_x_train,y=y_train,batch_size=batch_size,epochs=epochs,verbose=2,validation_split=0.2,shuffle=True)

def plot_result(history,label,epochs,min_value,max_value):
    data={}
    data[label]=history[label]#label is "accuracy" or "loss"
    data['val_'+label]=history['val_'+label]
    pd.DataFrame(data).plot(figsize=(8,5))
    plt.grid(True)
    plt.axis([0,epochs,min_value,max_value])
    plt.show()

plot_result(history.history,"accuracy",epochs,0,1)

plot_result(history.history,"loss",epochs,0,1)
#history.history is a dict recording how loss, accuracy, val_loss and val_accuracy evolve over the whole training run.
model.evaluate(padded_x_test,y_test,batch_size=batch_size)
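# Illustrative follow-up: inspect predicted probabilities for a few test reviews
# and threshold them at 0.5 to get class labels.
probs=model.predict(padded_x_test[:5])
print((probs>0.5).astype(int).ravel(),y_test[:5])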
