I've been working on an NLP competition on Kaggle recently. Although I haven't been at it for long, I've learned a lot from reading kernels and discussions, so I plan to write a few blog posts to record things for my own later review.
Text samples vary in length, so they need some processing before they can be fed to a model; in short, we have to make their lengths uniform. keras.preprocessing.text provides a Tokenizer class that converts English sentences into sequences of integer indices, and pad_sequences then pads those sequences into vectors of equal length. Why do this? It makes the subsequent word embedding step straightforward: each index in a sequence can be mapped to its corresponding word vector.
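A minimal sketch of how these two steps fit together (the toy sentences below are made up purely to show the API):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# two made-up sentences, only to illustrate the API
texts = ["how do I learn machine learning", "how do I learn keras"]

tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(texts)                # build the word -> index vocabulary
seqs = tokenizer.texts_to_sequences(texts)   # [[1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 7]]
padded = pad_sequences(seqs, maxlen=8)       # pad with 0 on the left up to length 8
print(tokenizer.word_index)                  # {'how': 1, 'do': 2, 'i': 3, 'learn': 4, ...}
print(padded)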
I'll use the dataset from the Quora text-classification competition hosted on Kaggle as the running example:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

def load_and_prec(max_features=95000, max_len=70):
    train_df = pd.read_csv("../input/train.csv")
    test_df = pd.read_csv("../input/test.csv")
    print("Train shape : ", train_df.shape)
    print("Test shape : ", test_df.shape)

    ## split to train and val
    train_df, val_df = train_test_split(train_df, test_size=0.08, random_state=2018)

    ## fill up missing values and take the raw question text
    train_X = train_df["question_text"].fillna("_na_").values
    val_X = val_df["question_text"].fillna("_na_").values
    test_X = test_df["question_text"].fillna("_na_").values

    ## Tokenize the sentences
    tokenizer = Tokenizer(num_words=max_features)
    tokenizer.fit_on_texts(list(train_X))
    train_X = tokenizer.texts_to_sequences(train_X)
    val_X = tokenizer.texts_to_sequences(val_X)
    test_X = tokenizer.texts_to_sequences(test_X)

    ## Pad the sentences
    train_X = pad_sequences(train_X, maxlen=max_len)
    val_X = pad_sequences(val_X, maxlen=max_len)
    test_X = pad_sequences(test_X, maxlen=max_len)

    ## Get the target values
    train_y = train_df['target'].values
    val_y = val_df['target'].values

    return train_X, val_X, test_X, train_y, val_y, tokenizer.word_index
Among the returned values, train_X holds the sentences encoded as index sequences and word_index is the vocabulary dictionary. Sentences shorter than max_len are padded with 0 in train_X, while word_index maps each word to its index. word_index is needed when loading the embedding matrix, so that the word vectors can be arranged in index order; then, when train_X is fed to the model, the Embedding layer can look each index up in the embedding matrix.
Taking GloVe as an example, the function below loads the GloVe vectors: pass in word_index, and the word vectors are arranged into a matrix in that index order and returned.
import numpy as np

def load_glove(word_index, max_features=95000):
    EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'

    def get_coefs(word, *arr):
        return word, np.asarray(arr, dtype='float32')

    # word -> vector dictionary read from the GloVe file
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))

    # statistics of all vectors, used to initialise out-of-vocabulary words randomly
    all_embs = np.stack(embeddings_index.values())
    emb_mean, emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # row i of the matrix holds the vector of the word whose index in word_index is i
    nb_words = min(max_features, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= max_features:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

    return embedding_matrix
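Putting the two functions together, a minimal usage sketch (assuming the Kaggle input paths used above) looks like this:

train_X, val_X, test_X, train_y, val_y, word_index = load_and_prec(max_features=95000, max_len=70)
embedding_matrix = load_glove(word_index, max_features=95000)
print(embedding_matrix.shape)   # (95000, 300) when the vocabulary is larger than max_features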
A text sequence is one-dimensional, but after word embedding each word becomes a word vector, and each padded sentence has length max_len. So after the Embedding layer every sentence is a max_len * embed_size matrix, which means a CNN can be used for training.
from keras.models import Model
from keras.layers import Input, Embedding, Reshape, Conv2D, MaxPool2D, Concatenate, Flatten, Dropout, Dense

def model_cnn(embedding_matrix, maxlen=70, max_features=95000, embed_size=300):
    # maxlen must match the padding length used in load_and_prec;
    # max_features and embed_size must match embedding_matrix.shape
    filter_sizes = [1, 2, 3, 5]
    num_filters = 36

    inp = Input(shape=(maxlen,))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
    x = Reshape((maxlen, embed_size, 1))(x)

    maxpool_pool = []
    for i in range(len(filter_sizes)):
        conv = Conv2D(num_filters, kernel_size=(filter_sizes[i], embed_size),
                      kernel_initializer='he_normal', activation='elu')(x)
        maxpool_pool.append(MaxPool2D(pool_size=(maxlen - filter_sizes[i] + 1, 1))(conv))

    z = Concatenate(axis=1)(maxpool_pool)
    z = Flatten()(z)
    z = Dropout(0.1)(z)

    outp = Dense(1, activation="sigmoid")(z)

    model = Model(inputs=inp, outputs=outp)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
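This is the classic TextCNN layout: every Conv2D kernel spans the full embedding dimension, so the convolution only slides along the word axis and a filter of height k reads k consecutive words; max-pooling over all positions then keeps the strongest response of each filter, and the four branches (window sizes 1, 2, 3 and 5) are concatenated before the sigmoid output.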
model = model_cnn(embedding_matrix)

epochs = 2
for e in range(epochs):
    model.fit(train_X, train_y, batch_size=512, epochs=1, validation_data=(val_X, val_y))
    pred_val_y = model.predict([val_X], batch_size=1024, verbose=0)
    pred_test_y = model.predict([test_X], batch_size=1024, verbose=0)
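Note that pred_test_y contains sigmoid probabilities rather than 0/1 labels; since this competition is scored with F1, kernels typically search for a decision threshold on pred_val_y and then apply it to the test predictions, for example something like (pred_test_y > best_thresh).astype(int), where best_thresh is a threshold tuned on the validation set.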