Kaggle Competition: Quora Insincere Questions Classification - Summary and Reflections

In this Quora text classification competition, competing solo among roughly 4,000 teams, I only reached the top 20% of the LB. Partly this was because it was my first NLP competition and I was a complete beginner, and partly because I slacked off at times along the way, so I want to write down some technical and more general takeaways as a reminder to myself.

The task is to use a labeled text training set to predict whether a question asked on Quora is sincere or insincere. Competition link: https://www.kaggle.com/c/quora-insincere-questions-classification

Technical takeaways:

During the competition, by studying kernels from stronger competitors and consulting papers and other references whenever I only half understood something, I built up a rough picture of NLP text classification and got a first grasp of tokenization, language models, and word vectors: n-gram language models, pre-trained word embeddings, the commonly used LSTM and GRU networks, the various text preprocessing methods, and model components such as attention layers.

Practical issues:

NLP competitions are quite different from ordinary data mining competitions. In an ordinary data mining competition the most important thing is to mine good features, with choosing a suitable model coming second; NLP puts much more weight on the model itself, which is why deep learning models are so widely used in NLP. During this competition I was also working on another data mining competition at the same time, and splitting my attention made me less focused. Once I got stuck at a certain point I started to slack off, which is another thing I need to fix.

Reflections and gains:

The most important gain is the feeling that I have finally gotten started with NLP, together with an appreciation of how much cutting-edge papers shape NLP modeling. I need to read more papers, because good NLP models are essentially derived from the existing literature (and of course many top competitors validate their models in competitions before publishing), which is very different from features mined directly from the data. I also borrowed heavily from other people's solutions this time; going forward I need to stand on the shoulders of giants and add some ideas of my own. After this competition I realized that NLP is an enormously deep field with far too much still to learn, so I need to keep pushing and stay hungry.

The competition code and notes are attached below:

Source code: https://github.com/yyhhlancelot/Kaggle_Quora_Insincere_Question_Classification

First, load the required packages:

import os
import time
import math
import gc

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm

import tensorflow as tf
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, CuDNNLSTM, CuDNNGRU, Embedding, Dropout, Activation, Conv1D, Conv2D
from keras.layers import Bidirectional, MaxPooling1D, MaxPool2D, GlobalMaxPool1D, GlobalMaxPooling1D, GlobalAveragePooling1D
from keras.layers import Reshape, Flatten, Concatenate, concatenate, add, SpatialDropout1D, BatchNormalization, PReLU
from keras.optimizers import Adam
from keras.models import Model
from keras import backend as K
from keras.engine.topology import Layer
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.callbacks import *

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn import metrics
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler

Preprocessing:

Cleaning punctuation and symbols:

def clean_text(x):
    puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£', 
 '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', 
 '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', 
 '▒', ':', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', 
 '∙', ')', '↓', '、', '│', '(', '»', ',', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]
    x = str(x)
    for punct in "/-'":
        x = x.replace(punct, ' ')
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')
    for punct in puncts:
        x = x.replace(punct, f' {punct} ')
    return x

Cleaning numbers with regular expressions. Runs of digits are bucketed by length, longest pattern first, so for example "2018" becomes "####" and "150000" becomes "#####"; the tokenizer then only sees a handful of number placeholders instead of every distinct number:

import re
def clean_numbers(x):
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x

Cleaning misspellings and contractions:

def _get_mispell(mispell_dict):
    mispell_re = re.compile('(%s)' % '|'.join(mispell_dict.keys()))
    return mispell_dict, mispell_re


mispell_dict = {'colour':'color','centre':'center','didnt':'did not','doesnt':'does not',
                'isnt':'is not','shouldnt':'should not','favourite':'favorite','travelling':'traveling',
                'counselling':'counseling','theatre':'theater','cancelled':'canceled','labour':'labor',
                'organisation':'organization','wwii':'world war 2','citicise':'criticize','instagram': 'social medium',
                'whatsapp': 'social medium','snapchat': 'social medium',"ain't": "is not", 
                "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", 
                "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", 
                "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would",
                "he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", 
                "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", 
                "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", 
                "i'd": "i would", "i'd've": "i would have", "i'll": "i will","i'll've": "i will have",
                "i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", 
                "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have",
                "it's": "it is","let's": "let us", "ma'am": "madam", "mayn't": "may not", 
                "might've": "might have","mightn't": "might not","mightn't've": "might not have", 
                "must've": "must have", "mustn't": "must not", "mustn't've": "must not have",
                "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", 
                "oughtn't": "ought not","oughtn't've": "ought not have", "shan't": "shall not", 
                "sha'n't": "shall not","shan't've": "shall not have","she'd": "she would", 
                "she'd've": "she would have","she'll": "she will","she'll've": "she will have", 
                "she's": "she is","should've": "should have","shouldn't": "should not","shouldn't've": "should not have", 
                "so've": "so have","so's": "so as","this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", 
                "there'd": "there would", "there'd've": "there would have", "there's": "there is", 
                "here's": "here is","they'd": "they would", "they'd've": "they would have",
                "they'll": "they will", "they'll've": "they will have", 
                "they're": "they are", "they've": "they have", "to've": "to have", 
                "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", 
                "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
                "we've": "we have", "weren't": "were not", "what'll": "what will", 
                "what'll've": "what will have", "what're": "what are",  "what's": "what is",
                "what've": "what have", "when's": "when is", "when've": "when have", 
                "where'd": "where did", "where's": "where is", "where've": "where have", 
                "who'll": "who will", "who'll've": "who will have", "who's": "who is", 
                "who've": "who have", "why's": "why is", "why've": "why have", 
                "will've": "will have", "won't": "will not", "won't've": "will not have", 
                "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", 
                "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have",
                "y'all're": "you all are","y'all've": "you all have","you'd": "you would", 
                "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
                "you're": "you are", "you've": "you have", 'colour': 'color', 'centre': 'center', 
                'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 
                'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 
                'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 
                'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 
                'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 
                'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do',
                'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 
                'mastrubation': 'masturbation', 'mastrubate': 'masturbate', 
                "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum', 
                'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 
                'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', 
                "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 
                'demonitization': 'demonetization', 'demonetisation': 'demonetization'
                }

mispellings, mispellings_re = _get_mispell(mispell_dict)

def replace_typical_misspell(text):
    def replace(match):
        return mispellings[match.group(0)]

    return mispellings_re.sub(replace, text)
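
For example, applied directly to a raw string the replacements expand contractions and Americanize spellings (note that the patterns are compiled without word boundaries, so a key can also match inside a longer word):

replace_typical_misspell("what's your favourite colour")  # -> "what is your favorite color"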

Loading the data and running the preprocessing:

train_df = pd.read_csv("../input/train.csv")

test_df = pd.read_csv("../input/test.csv")

print("Train shape : ",train_df.shape)
print("Test shape : ",test_df.shape)
embed_size = 300 # embedding dimension
max_features = 95000 # vocabulary size
max_len = 70 # maximum sequence length
# lower
train_df['question_text'] = train_df['question_text'].apply(lambda x : x.lower())
test_df['question_text'] = test_df['question_text'].apply(lambda x : x.lower())

# clean the text
train_df["question_text"] = train_df["question_text"].apply(lambda x : clean_text(x))
test_df["question_text"] = test_df["question_text"].apply(lambda x : clean_text(x))

# clean numbers
train_df["question_text"] = train_df["question_text"].apply(lambda x: clean_numbers(x))
test_df["question_text"] = test_df["question_text"].apply(lambda x : clean_numbers(x))

# clean spellings
train_df['question_text'] = train_df['question_text'].apply(lambda x: replace_typical_misspell(x))
test_df['question_text'] = test_df['question_text'].apply(lambda x: replace_typical_misspell(x))

# fill up the missing values
train_X = train_df['question_text'].fillna("_##_").values
test_X = test_df['question_text'].fillna("_##_").values

# tokenize the sentences
tokenizer = Tokenizer(num_words = max_features)
tokenizer.fit_on_texts(list(train_X))
train_X = tokenizer.texts_to_sequences(train_X)
test_X = tokenizer.texts_to_sequences(test_X)

# pad the sentences
train_X = pad_sequences(train_X, maxlen = max_len)
test_X = pad_sequences(test_X, maxlen = max_len)

# the target values
train_y = train_df['target'].values
np.random.seed(666)
trn_idx = np.random.permutation(len(train_X))

train_X = train_X[trn_idx]
train_y = train_y[trn_idx]

Loading pre-trained word embeddings:

def load_glove(word_index):
#     EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
    EMBEDDING_FILE = 'J:/Code/kaggle/Quora_Insincere_Question_Classfication/glove.840B.300d/glove.840B.300d.txt'
    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, 'r', encoding = 'UTF-8'))

    all_embs = np.stack(embeddings_index.values())
    emb_mean,emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(max_features, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= max_features: continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: embedding_matrix[i] = embedding_vector
            
    return embedding_matrix 

def load_fasttext(word_index):    
#     EMBEDDING_FILE = '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'
    EMBEDDING_FILE = 'J:/Code/kaggle/Quora_Insincere_Question_Classfication/wiki-news-300d-1M/wiki-news-300d-1M.vec'
    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, 'r', encoding = 'UTF-8') if len(o)>100)

    all_embs = np.stack(embeddings_index.values())
    emb_mean,emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(max_features, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= max_features: continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: embedding_matrix[i] = embedding_vector

    return embedding_matrix

def load_para(word_index):
#     EMBEDDING_FILE = '../input/embeddings/paragram_300_sl999/paragram_300_sl999.txt'
    EMBEDDING_FILE = 'J:/Code/kaggle/Quora_Insincere_Question_Classfication/paragram_300_sl999/paragram_300_sl999.txt'
    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding="utf8", errors='ignore') if len(o)>100)

    all_embs = np.stack(embeddings_index.values())
    emb_mean,emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(max_features, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= max_features: continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: embedding_matrix[i] = embedding_vector
    
    return embedding_matrix
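
Before deciding which pre-trained vectors to use or blend, it is worth checking how much of the tokenizer vocabulary each embedding actually covers. The helper below is only a hypothetical sketch added for inspection (it is not part of the original code); embeddings_index is assumed to be a {word: vector} dictionary built the same way as inside the loaders above:

def vocab_coverage(word_counts, embeddings_index):
    # fraction of distinct words, and of token occurrences, covered by the embedding
    covered_words, covered_tokens, total_tokens = 0, 0, 0
    for word, count in word_counts.items():
        total_tokens += count
        if word in embeddings_index:
            covered_words += 1
            covered_tokens += count
    return covered_words / len(word_counts), covered_tokens / total_tokens

# e.g. vocab_coverage(tokenizer.word_counts, embeddings_index)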

Attention layer:

class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        """
        Keras Layer that implements an Attention mechanism for temporal data.
        Supports Masking.
        Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756]
        # Input shape
            3D tensor with shape: `(samples, steps, features)`.
        # Output shape
            2D tensor with shape: `(samples, features)`.
        :param kwargs:
        Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
        The dimensions are inferred based on the output shape of the RNN.
        Example:
            model.add(LSTM(64, return_sequences=True))
            model.add(Attention(step_dim))  # step_dim = number of timesteps in the RNN output
        """
        self.supports_masking = True
        #self.init = initializations.get('glorot_uniform')
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight((input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight((input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        # eij = K.dot(x, self.W) TF backend doesn't support it

        # features_dim = self.W.shape[0]
        # step_dim = x._keras_shape[1]

        features_dim = self.features_dim
        step_dim = self.step_dim

        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim))

        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        a = K.exp(eij)

        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())

        # in some cases especially in the early stages of training the sum may be almost zero
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        # print(weighted_input.shape)
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        #return input_shape[0], input_shape[-1]
        return input_shape[0],  self.features_dim

Capsule network:

def squash(x, axis=-1):
    # s_squared_norm is really small
    # s_squared_norm = K.sum(K.square(x), axis, keepdims=True) + K.epsilon()
    # scale = K.sqrt(s_squared_norm)/ (0.5 + s_squared_norm)
    # return scale * x
    s_squared_norm = K.sum(K.square(x), axis, keepdims=True)
    scale = K.sqrt(s_squared_norm + K.epsilon())
    return x / scale

# A Capsule Implement with Pure Keras
class Capsule(Layer):
    def __init__(self, num_capsule, dim_capsule, routings=3, kernel_size=(9, 1), share_weights=True,
                 activation='default', **kwargs):
        super(Capsule, self).__init__(**kwargs)
        self.num_capsule = num_capsule
        self.dim_capsule = dim_capsule
        self.routings = routings
        self.kernel_size = kernel_size
        self.share_weights = share_weights
        if activation == 'default':
            self.activation = squash
        else:
            self.activation = Activation(activation)

    def build(self, input_shape):
        super(Capsule, self).build(input_shape)
        input_dim_capsule = input_shape[-1]
        if self.share_weights:
            self.W = self.add_weight(name='capsule_kernel',
                                     shape=(1, input_dim_capsule,
                                            self.num_capsule * self.dim_capsule),
                                     # shape=self.kernel_size,
                                     initializer='glorot_uniform',
                                     trainable=True)
        else:
            input_num_capsule = input_shape[-2]
            self.W = self.add_weight(name='capsule_kernel',
                                     shape=(input_num_capsule,
                                            input_dim_capsule,
                                            self.num_capsule * self.dim_capsule),
                                     initializer='glorot_uniform',
                                     trainable=True)

    def call(self, u_vecs):
        if self.share_weights:
            u_hat_vecs = K.conv1d(u_vecs, self.W)
        else:
            u_hat_vecs = K.local_conv1d(u_vecs, self.W, [1], [1])

        batch_size = K.shape(u_vecs)[0]
        input_num_capsule = K.shape(u_vecs)[1]
        u_hat_vecs = K.reshape(u_hat_vecs, (batch_size, input_num_capsule,
                                            self.num_capsule, self.dim_capsule))
        u_hat_vecs = K.permute_dimensions(u_hat_vecs, (0, 2, 1, 3))
        # final u_hat_vecs.shape = [None, num_capsule, input_num_capsule, dim_capsule]

        b = K.zeros_like(u_hat_vecs[:, :, :, 0])  # shape = [None, num_capsule, input_num_capsule]
        for i in range(self.routings):
            b = K.permute_dimensions(b, (0, 2, 1))  # shape = [None, input_num_capsule, num_capsule]
            c = K.softmax(b)
            c = K.permute_dimensions(c, (0, 2, 1))
            b = K.permute_dimensions(b, (0, 2, 1))
            outputs = self.activation(tf.keras.backend.batch_dot(c, u_hat_vecs, [2, 2]))
            if i < self.routings - 1:
                b = tf.keras.backend.batch_dot(outputs, u_hat_vecs, [2, 3])

        return outputs

    def compute_output_shape(self, input_shape):
        return (None, self.num_capsule, self.dim_capsule)
    
def capsule():
    K.clear_session()       
    inp = Input(shape=(max_len,))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix], trainable=False)(inp)
    x = SpatialDropout1D(rate=0.2)(x)
    x = Bidirectional(CuDNNGRU(100, return_sequences=True, 
                                kernel_initializer=initializers.glorot_normal(seed=12300), recurrent_initializer=initializers.orthogonal(gain=1.0, seed=10000)))(x)

    x = Capsule(num_capsule=10, dim_capsule=10, routings=4, share_weights=True)(x)
    x = Flatten()(x)

    x = Dense(100, activation="relu", kernel_initializer=initializers.glorot_normal(seed=12300))(x)
    x = Dropout(0.12)(x)
    x = BatchNormalization()(x)

    x = Dense(1, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer=Adam(),)
    return model

def f1_smart(y_true, y_pred):
    args = np.argsort(y_pred)
    tp = y_true.sum()
    fs = (tp - np.cumsum(y_true[args[:-1]])) / np.arange(y_true.shape[0] + tp - 1, tp, -1)
    res_idx = np.argmax(fs)
    return 2 * fs[res_idx], (y_pred[args[res_idx]] + y_pred[args[res_idx + 1]]) / 2
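
Note: the models below are compiled with a custom f1 metric that is not defined in this excerpt (it comes from the full source). A minimal batch-wise stand-in, assuming the usual precision/recall formulation on rounded predictions, could look like this:

def f1(y_true, y_pred):
    # rough batch-wise F1; only an assumed stand-in for the metric used in the full source
    y_pred_bin = K.round(K.clip(y_pred, 0, 1))
    tp = K.sum(y_true * y_pred_bin)
    fp = K.sum((1 - y_true) * y_pred_bin)
    fn = K.sum(y_true * (1 - y_pred_bin))
    precision = tp / (tp + fp + K.epsilon())
    recall = tp / (tp + fn + K.epsilon())
    return 2 * precision * recall / (precision + recall + K.epsilon())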

Modeling:

After comparing models from various strong competitors, I picked out a few that were particularly effective.

First, a bidirectional LSTM/GRU model with attention:

def model_lstm_atten(embedding_matrix):
    inp = Input(shape = (max_len,))
    x = Embedding(max_features, embed_size, weights = [embedding_matrix], trainable = False)(inp)
    x = SpatialDropout1D(0.1)(x)
    x = Bidirectional(CuDNNLSTM(40, return_sequences = True))(x)
    y = Bidirectional(CuDNNGRU(40, return_sequences = True))(x)
    atten_1 = Attention(max_len)(x)
    atten_2 = Attention(max_len)(y)
    avg_pool = GlobalAveragePooling1D()(y)
    max_pool = GlobalMaxPooling1D()(y)
    
    conc = concatenate([atten_1, atten_2, avg_pool, max_pool])
    conc = Dense(16, activation = "relu")(conc)
    conc = Dropout(0.1)(conc)
    outp = Dense(1, activation = "sigmoid")(conc)
    
    model = Model(inputs = inp, outputs = outp)
    model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = [f1])
    
    return model

A bidirectional LSTM/GRU model with both attention and a capsule layer:

def model_atten_capsule(embedding_matrix): 
    '''0.7'''
    inp_x = Input(shape = (max_len, ))
    inp_features = Input(shape = (6, ))
    x = Embedding(max_features,embed_size, weights = [embedding_matrix], trainable = False)(inp_x)
    x = SpatialDropout1D(0.1)(x)
    lstm = Bidirectional(CuDNNLSTM(60, return_sequences = True, kernel_initializer = initializers.glorot_normal(seed = 12300),
                                   recurrent_initializer=initializers.orthogonal(gain=1.0, seed=10000)))(x)
    gru = Bidirectional(CuDNNGRU(60, return_sequences = True, kernel_initializer = initializers.glorot_normal(seed = 12300),
                                 recurrent_initializer=initializers.orthogonal(gain=1.0, seed=10000)))(lstm)
#     x = Bidirectional(CuDNNLSTM(64, return_sequences = True))(x)
    content3 = Capsule(num_capsule = 10, dim_capsule = 10, routings = 4, share_weights = True)(gru)
    content3 = Dropout(0.1)(content3)
#     content3 = Reshape(-1, )(content3)
    content3 = Flatten()(content3)
    content3 = Dense(1, activation = "relu", kernel_initializer=initializers.glorot_normal(seed=12300))(content3)
    ### modified content3
    atten_lstm = Attention(max_len)(lstm)
    atten_gru = Attention(max_len)(gru)
    avg_pool = GlobalAveragePooling1D()(gru)
    max_pool = GlobalMaxPooling1D()(gru)
    
   
    conc = concatenate([atten_lstm, atten_gru, content3, avg_pool, max_pool, inp_features]) #
    
    ### modified the dense head
    conc = Dense(16, activation = "relu", kernel_initializer=initializers.glorot_normal(seed=12300))(conc)
    x = BatchNormalization()(conc)
    x = Dropout(0.1)(x)
    outp = Dense(1, activation = "sigmoid")(x)  # sigmoid output to match binary_crossentropy and the threshold search
    
    model = Model(inputs = [inp_x, inp_features], outputs = outp)
    model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = [f1])
    
    return model
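
This model expects a second input of 6 hand-crafted statistical features per question (the train_features / val_features arrays that appear commented out in the training loop below); building them is not shown in this excerpt. A hypothetical example of such features, computed on the raw question text before lowercasing and scaled with StandardScaler, might be:

def build_features(df):
    # hypothetical set of 6 per-question statistics; not the original feature set
    feats = pd.DataFrame()
    q = df['question_text'].astype(str)
    feats['num_words'] = q.str.split().str.len()
    feats['num_unique_words'] = q.apply(lambda s: len(set(s.split())))
    feats['num_chars'] = q.str.len()
    feats['num_caps'] = q.apply(lambda s: sum(c.isupper() for c in s))
    feats['caps_ratio'] = feats['num_caps'] / (feats['num_chars'] + 1)
    feats['mean_word_len'] = q.apply(lambda s: np.mean([len(w) for w in s.split()] or [0]))
    return StandardScaler().fit_transform(feats.values)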

CNN model (a Kim-style text CNN: the embedded sequence is reshaped into a 4-D tensor so that Conv2D filters of heights 1, 2, 3, and 5 each span the full embedding dimension):

def model_cnn(embedding_matrix):
    filter_sizes = [1, 2, 3, 5]
    num_filters = 36
    
    inp = Input(shape = (max_len,))
    x = Embedding(max_features, embed_size, weights = [embedding_matrix])(inp)
    x = Reshape((max_len, embed_size, 1))(x)
    
    maxpool_pool = []
    
    for i in range(len(filter_sizes)):
        conv = Conv2D(num_filters, kernel_size = (filter_sizes[i], embed_size), kernel_initializer = 'he_normal', activation = 'elu')(x)
        maxpool_pool.append(MaxPool2D(pool_size = (max_len - filter_sizes[i] + 1, 1))(conv))
    
    z = Concatenate(axis = 1)(maxpool_pool)
    z = Flatten()(z)
    z = Dropout(0.1)(z)
    outp = Dense(1, activation = "sigmoid")(z)
    
    model = Model(inputs = inp, outputs = outp)
    model.summary()
    model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
    
    return model

I also found an implementation of DPCNN (Deep Pyramid CNN, proposed by Johnson and Zhang for text categorization) online and made some targeted modifications to it:

def model_dpcnn(embedding_matrix):
    filter_nr = 64 
    filter_size = 3
    max_pool_size = 3
    max_pool_strides = 2
    dense_nr = 256
    spatial_dropout = 0.1
    dense_dropout = 0.2
    train_embed = False
    conv_kern_reg = regularizers.l2(0.00001)
    conv_bias_reg = regularizers.l2(0.00001)
    
    inp = Input(shape = (max_len, ))
    emb_comment = Embedding(max_features, embed_size, weights = [embedding_matrix], trainable = False)(inp)
#     emb_comment = SpatialDropout1D(0.1)(emb_comment)
    
    #block1
    block1 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(emb_comment)
    block1 = BatchNormalization()(block1)
    block1 = PReLU()(block1)
    block1 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block1)
    block1 = BatchNormalization()(block1)
    block1 = PReLU()(block1)
    #we pass embedded comment through conv1d with filter size 1 because it needs to have the same shape as block output
    #if you choose filter_nr = embed_size (300 in this case) you don't have to do this part and can add emb_comment directly to block1_output
    resize_emb = Conv1D(filter_nr, kernel_size = 1, padding = 'same', activation = 'linear', 
                       kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(emb_comment)
    resize_emb = PReLU()(resize_emb)
    
    block1_output = add([block1, resize_emb])
    block1_output = MaxPooling1D(pool_size = max_pool_size, strides = max_pool_strides)(block1_output)
    
    
    #block2
    block2 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block1_output)
    block2 = BatchNormalization()(block2)
    block2 = PReLU()(block2)
    block2 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block2)
    block2 = BatchNormalization()(block2)
    block2 = PReLU()(block2)
    
    block2_output = add([block2, block1_output])
    block2_output = MaxPooling1D(pool_size = max_pool_size, strides = max_pool_strides)(block2_output)
    
    #block3
    block3 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block2_output)
    block3 = BatchNormalization()(block3)
    block3 = PReLU()(block3)
    block3 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block3)
    block3 = BatchNormalization()(block3)
    block3 = PReLU()(block3)
    
    block3_output = add([block3, block2_output])
    block3_output = MaxPooling1D(pool_size = max_pool_size, strides = max_pool_strides)(block3_output)
    
    #block4
    block4 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block3_output)
    block4 = BatchNormalization()(block4)
    block4 = PReLU()(block4)
    block4 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block4)
    block4 = BatchNormalization()(block4)
    block4 = PReLU()(block4)
    
    block4_output = add([block4, block3_output])
    block4_output = MaxPooling1D(pool_size = max_pool_size, strides = max_pool_strides)(block4_output)
    
    #block5
    block5 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block4_output)
    block5 = BatchNormalization()(block5)
    block5 = PReLU()(block5)
    block5 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block5)
    block5 = BatchNormalization()(block5)
    block5 = PReLU()(block5)
    
    block5_output = add([block5, block4_output])
    block5_output = MaxPooling1D(pool_size = max_pool_size, strides = max_pool_strides)(block5_output)
    
#     #block6
#     block6 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
#                    kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block5_output)
#     block6 = BatchNormalization()(block6)
#     block6 = PReLU()(block6)
#     block6 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
#                    kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block6)
#     block6 = BatchNormalization()(block6)
#     block6 = PReLU()(block6)
    
#     block6_output = add([block6, block5_output])
#     block6_output = MaxPooling1D(pool_size = max_pool_size, strides = max_pool_strides)(block6_output)
    
    #block7
    block7 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block5_output)
    block7 = BatchNormalization()(block7)
    block7 = PReLU()(block7)
    block7 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block7)
    block7 = BatchNormalization()(block7)
    block7 = PReLU()(block7)
    
    block7_output = add([block7, block5_output])
    outp = GlobalMaxPooling1D()(block7_output)
#     output = block7_output
    
    outp = Dense(dense_nr, activation = 'linear')(outp)
    outp = BatchNormalization()(outp)
    outp = Dropout(0.1)(outp)
    outp = Dense(1, activation = 'sigmoid')(outp)
    
    model = Model(inputs = inp, outputs = outp)
    model.summary()
    model.compile(loss = 'binary_crossentropy',
                 optimizer = 'adam',
                 metrics = ['accuracy'])
    return model

There were also a few other models along the same lines, combining attention, capsule layers, and bidirectional LSTM or GRU but with different architectures; they are not listed here.

Embedding matrix construction and training:

embedding_matrix_1 = load_glove(tokenizer.word_index)
# embedding_matrix_2 = load_fasttext(tokenizer.word_index)
embedding_matrix_3 = load_para(tokenizer.word_index)

embedding_matrix = np.mean([embedding_matrix_1, embedding_matrix_3], axis = 0)
# embedding_matrix = np.mean([embedding_matrix_1, embedding_matrix_2], axis = 0)
# np.shape(embedding_matrix)
del embedding_matrix_1, embedding_matrix_3
gc.collect()

Searching for the best threshold:

def threshold_search(y_true, y_prob):
    best_thresh = 0
    best_score = 0
    for thresh in np.arange(0.1, 0.701, 0.01):
        thresh = np.round(thresh, 2)
        score = metrics.f1_score(y_true, (y_prob >= thresh).astype(int))
        print("F1 score at threshold {} is {}".format(thresh, score))
        if score > best_score:
            best_score = score
            best_thresh = thresh
    return best_thresh

def train_pred(model, dev_X, dev_y, val_X, val_y, test_X, dev_features = None, val_features = None, epochs = None, callback = None):
    if dev_features is None:
        model.fit(dev_X, dev_y, batch_size = 512, epochs = epochs, validation_data = (val_X, val_y), callbacks = callback, verbose = 0)
        pred_test_y_temp = model.predict(test_X, batch_size = 1024)
#     pred_test_y_temp = model.predict(np.concatenate((test_X, test_features), axis = 1), batch_size = 1024)
    else:
        model.fit([dev_X, dev_features], dev_y, batch_size = 512, epochs = epochs, validation_data = ([val_X, val_features], val_y), callbacks = callback, verbose = 0)
        pred_test_y_temp = model.predict([test_X, test_features], batch_size = 1024)
    return pred_test_y_temp
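
The training loop below also passes a clr callback, a cyclic learning rate schedule defined elsewhere in the full source. A minimal triangular CLR sketch (an assumed stand-in, not the original implementation) could be:

class CyclicLR(Callback):
    # simple triangular cyclic learning rate; assumed stand-in for the clr callback in the full source
    def __init__(self, base_lr=0.001, max_lr=0.003, step_size=300.):
        super(CyclicLR, self).__init__()
        self.base_lr = base_lr
        self.max_lr = max_lr
        self.step_size = step_size
        self.iterations = 0

    def on_batch_end(self, batch, logs=None):
        self.iterations += 1
        cycle = math.floor(1 + self.iterations / (2 * self.step_size))
        x = abs(self.iterations / self.step_size - 2 * cycle + 1)
        lr = self.base_lr + (self.max_lr - self.base_lr) * max(0., 1 - x)
        K.set_value(self.model.optimizer.lr, lr)

clr = CyclicLR(base_lr=0.001, max_lr=0.003, step_size=300.)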

Here I used 4-fold cross-validation and trained with one of the models above:

## ADDITIONAL TRAINING: lstm_atten
num_splits = 4
skf = StratifiedKFold(n_splits = num_splits, shuffle = True, random_state = 2333)
output = []  # collects [test predictions, threshold, val score, description] for each configuration
pred_test_y = 0
thresh_use = 0
val_score = 0
for dev_index, val_index in skf.split(train_X, train_y):
    dev_X, val_X = train_X[dev_index, :], train_X[val_index,:]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
#     dev_features, val_features = train_features[dev_index, :], train_features[val_index, :]
    model = model_lstm_atten(embedding_matrix)
    pred_test_y_temp = train_pred(model, dev_X, dev_y, val_X, val_y, test_X, dev_features = None, val_features = None, epochs = 2, callback = [clr,])
    pred_val_y = model.predict(val_X, batch_size = 1024)
    best_thresh = threshold_search(val_y, pred_val_y)
    val_score_temp = metrics.f1_score(val_y, (pred_val_y > best_thresh).astype(int))
    print("val temp best f1 score is {0} and best thresh is {1}".format(val_score_temp, best_thresh))
    
    thresh_use += best_thresh
    pred_test_y += pred_test_y_temp
    val_score += val_score_temp
    keras.backend.clear_session()
pred_test_y /= num_splits
thresh_use /= num_splits
val_score /= num_splits
output.append([pred_test_y, thresh_use, val_score, 'lstm atten glove+para'])

Submission:

sub = pd.read_csv('../input/sample_submission.csv')
sub.prediction = (pred_test_y > thresh_use).astype(int)
sub.to_csv("submission.csv", index=False)

Since submissions for this competition had to be made online through a Kaggle Kernel, meaning the run time could not exceed two hours, the time budget had to be used carefully, and the strategy may need to be adjusted accordingly, for example the number of cross-validation folds or the number of epochs. The above is essentially my entire pipeline for this competition. I hope to do better next time.
