[Competition Share] Kaggle Toxic Comment [Multiple binary classifications in Keras, quality comment corpora, using pre-trained word embeddings]

Summary

I have recently been following a Kaggle competition, Toxic Comment:

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

The goal of the competition is to determine whether a text comment is toxic.

Toxic comments are further broken down into six categories:
['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

This post mainly shares the new tricks I learned along the way.


Keras: you can actually train multiple binary classifiers at once

The Bi-LSTM baseline [0.051] actually trains six binary classifiers at the same time; I had no idea this was even possible!

The code is as follows:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from keras.models import Model
from keras.layers import Dense, Embedding, Input
from keras.layers import LSTM, Bidirectional, GlobalMaxPool1D, Dropout
from keras.preprocessing import text, sequence
from keras.callbacks import EarlyStopping, ModelCheckpoint

max_features = 20000
maxlen = 100

train = pd.read_csv('../data/train/train.csv')
test = pd.read_csv('../data/test/test.csv')
subm = pd.read_csv('../data/sample_submission.csv/sample_submission.csv')
train = train.sample(frac=1)  # shuffle the training rows

list_sentences_train = train["comment_text"].fillna("CVxTz").values  # fill missing comments with a placeholder token
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_test = test["comment_text"].fillna("CVxTz").values


tokenizer = text.Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)

X_t = sequence.pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = sequence.pad_sequences(list_tokenized_test, maxlen=maxlen)

def get_model():
    embed_size = 128
    inp = Input(shape=(maxlen, ))
    x = Embedding(max_features, embed_size)(inp)
    x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1))(x)
    x = GlobalMaxPool1D()(x)
    x = Dropout(0.1)(x)
    x = Dense(50, activation="relu")(x)
    x = Dropout(0.1)(x)
    x = Dense(6, activation="sigmoid")(x)  # one independent sigmoid output per label
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model


model = get_model()
batch_size = 32
epochs = 3


file_path="weights_base.best.hdf5"
# checkpoint = ModelCheckpoint(file_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
checkpoint = ModelCheckpoint(file_path, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

# early = EarlyStopping(monitor="val_loss", mode="min", patience=20)
early = EarlyStopping(monitor="val_acc", mode="max", patience=20)


callbacks_list = [checkpoint, early] #early
model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=0.1, callbacks=callbacks_list)
# model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=0.1)

model.load_weights(file_path)
y_test = model.predict(X_te)


# reuse the sample submission loaded at the top as the template for predictions
subm[list_classes] = y_test

subm.to_csv("baseline.csv", index=False)
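
The detail that makes this a multi-label model is the output layer: Dense(6, activation="sigmoid") produces six independent probabilities, and binary_crossentropy scores each one separately, so a single shared network effectively trains six binary classifiers, one per label, at the same time.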

High-quality comment corpora of various kinds

  1. Comment corpora

    • YouTube Comments (excellent for supplementing the threat and identity_hate columns)
    • Reddit Comments (roughly a terabyte of data, divided by year)
  2. Toxic word dictionaries (a usage sketch follows this list)

    • http://www.bannedwordlist.com/
    • https://www.cs.cmu.edu/~biglou/resources/bad-words.txt
    • https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/
    • https://www.frontgatemedia.com/a-list-of-723-bad-words-to-blacklist-and-how-to-use-facebooks-moderation-tool/
    • https://kaggle2.blob.core.windows.net/forum-message-attachments/4810/badwords.txt
    • https://gist.github.com/ryanlewis/a37739d710ccdb4b406d
  3. Pre-trained word embeddings

    • Google’s word2vec embedding: [Word2Vec] [DownloadLink]
    • Glove word vectors: [Glove]
    • Facebook’s fastText embeddings: [FastText]
    • [DeepMoji]: To understand how language is used to express emotions
  4. Wikipedia

    • Wikipedia database reports: https://en.wikipedia.org/wiki/Wikipedia:Database_reports
    • Wikimedia logs: https://meta.wikimedia.org/w/index.php?title=Special%3ALog
  5. Other

    • https://github.com/conversationai/perspectiveapi
    • https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction
    • Google NLP Model: https://cloud.google.com/natural-language/
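
One straightforward way to use the toxic word dictionaries from item 2 is as an extra hand-crafted feature. Below is a minimal sketch of that idea (the file name badwords.txt and the bad_word_count feature are my own illustration, not part of the original kernel): count how many dictionary words appear in each comment.

import pandas as pd

# load a downloaded bad-word list, one word per line (path is hypothetical)
with open('badwords.txt', encoding='utf-8') as f:
    bad_words = set(line.strip().lower() for line in f if line.strip())

train = pd.read_csv('../data/train/train.csv')

# simple hand-crafted feature: number of bad-word tokens in each comment
train['bad_word_count'] = train['comment_text'].fillna('').apply(
    lambda text: sum(token in bad_words for token in text.lower().split()))

print(train[['comment_text', 'bad_word_count']].head())

Such counts can be fed to a separate classifier or concatenated with the neural model's features; on their own they are of course much cruder than the learned model.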

Using pre-trained word embeddings

https://github.com/MoyanZitto/keras-cn/blob/master/docs/legacy/blog/word_embedding.md

Usage is as follows:

GloVe

import os
import numpy as np

GLOVE_DIR = r'D:\glove.6B'  # raw string so the backslash is not treated as an escape
EMBEDDING_DIM = 100         # glove.6B.100d.txt contains 100-dimensional vectors
embeddings_index = {}
f = open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Found %s word vectors.' % len(embeddings_index))

word_index = tokenizer.word_index  # vocabulary built by the Keras tokenizer in the baseline above

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    # words not found in the embedding index are left as all-zeros
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

Google Word2Vec

from gensim.models.keyedvectors import KeyedVectors
w2v_bin = r'D:\GoogleNews-vectors-negative300.bin'  # raw string to avoid backslash escapes
model = KeyedVectors.load_word2vec_format(w2v_bin, binary=True)

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = model[word] if word in model else None
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

FastText

def get_embeddings_FastText():
    from gensim.models.keyedvectors import KeyedVectors
    w2v_bin = '../pre-trained/FastText_wiki.en/wiki.en.vec'
    model = KeyedVectors.load_word2vec_format(w2v_bin, binary=False)

    embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = model[word] if word in model else None
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

    return embedding_matrix

Finally, plug the matrix into the Keras Embedding layer:

Embedding(len(word_index) + 1,
          EMBEDDING_DIM,
          weights=[embedding_matrix],
          input_length=MAX_SEQUENCE_LENGTH,
          trainable=False)
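
To tie this back to the baseline, here is a minimal sketch of get_model() with the Embedding layer initialised from the pre-trained matrix and frozen (the name get_model_pretrained is mine; it assumes embedding_matrix was built as above and reuses maxlen from the baseline as the sequence length):

from keras.models import Model
from keras.layers import Dense, Embedding, Input
from keras.layers import LSTM, Bidirectional, GlobalMaxPool1D, Dropout

def get_model_pretrained(embedding_matrix):
    inp = Input(shape=(maxlen, ))
    # initialise the Embedding layer with the pre-trained vectors and freeze it
    x = Embedding(embedding_matrix.shape[0],
                  embedding_matrix.shape[1],
                  weights=[embedding_matrix],
                  input_length=maxlen,
                  trainable=False)(inp)
    x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1))(x)
    x = GlobalMaxPool1D()(x)
    x = Dropout(0.1)(x)
    x = Dense(50, activation="relu")(x)
    x = Dropout(0.1)(x)
    x = Dense(6, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

Setting trainable=False keeps the word vectors fixed; switching it to True lets them be fine-tuned on the toxic-comment data.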

categorical_crossentropy vs binary_crossentropy

Quoting the first-place competitor's explanation:

In this case, it should be binary_crossentropy and not categorical_crossentropy. categorical_crossentropy assumes that all the probabilities of classes sum to 1 (a multi-class scenario where every sample has exactly 1 class). In this competition, we have a multi-label scenario, because a sample can have any number of classes (or none at all), so binary_crossentropy independently optimises each class.
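
To see the difference numerically, here is a tiny illustrative calculation (my own example, not from the discussion): with six sigmoid outputs the predicted probabilities do not need to sum to 1, and binary_crossentropy simply averages six independent binary log losses.

import numpy as np

# predicted probabilities from the six sigmoid units for one comment
p = np.array([0.9, 0.2, 0.8, 0.05, 0.7, 0.1])
# ground truth: the comment is toxic, obscene and insulting
y = np.array([1, 0, 1, 0, 1, 0])

# binary_crossentropy = mean of six independent binary log losses;
# note p sums to 2.75, which is fine because the labels are independent
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(round(bce, 3))  # ~0.177

# categorical_crossentropy would instead assume a softmax output summing to 1,
# i.e. exactly one true class per sample, which is wrong for this multi-label task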
