Kaggle Competition: Quora Insincere Questions Classification - Summary and Reflections

In this Quora text classification competition, competing solo among roughly 4,000 teams, I only reached the top 20% of the LB. Partly this was because it was my first NLP competition and I was a complete beginner, and partly because I slacked off at times along the way, so I want to write down some technical and more general takeaways as a reminder to myself.

The task is to use a labeled text training set to predict whether a question asked on Quora is sincere or insincere. Competition link: https://www.kaggle.com/c/quora-insincere-questions-classification

Technical takeaways:

During the competition, by studying kernels from stronger competitors and consulting papers and other references whenever I only half understood something, I built up a rough picture of NLP text classification and got a first grasp of tokenization, language models, and word vectors: n-gram language models, pre-trained word embeddings, the commonly used LSTM and GRU networks, the various text preprocessing methods, and model components such as attention layers.

Practical issues:

NLP competitions are quite different from ordinary data mining competitions. In an ordinary data mining competition the most important thing is to mine good features, with choosing a suitable model coming second; NLP puts much more weight on the model itself, which is why deep learning models are so widely used in NLP. During this competition I was also working on another data mining competition at the same time, and splitting my attention made me less focused. Once I got stuck at a certain point I started to slack off, which is another thing I need to fix.

Reflections and gains:

The most important gain is the feeling that I have finally gotten started with NLP, together with an appreciation of how much cutting-edge papers shape NLP modeling. I need to read more papers, because good NLP models are essentially derived from the existing literature (and of course many top competitors validate their models in competitions before publishing), which is very different from features mined directly from the data. I also borrowed heavily from other people's solutions this time; going forward I need to stand on the shoulders of giants and add some ideas of my own. After this competition I realized that NLP is an enormously deep field with far too much still to learn, so I need to keep pushing and stay hungry.

The competition code and notes are attached below:

Source code: https://github.com/yyhhlancelot/Kaggle_Quora_Insincere_Question_Classification

First, load the required packages:

import os
import time
import math
import gc

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm

import tensorflow as tf
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, CuDNNLSTM, CuDNNGRU, Embedding, Dropout, Activation, Conv1D, Conv2D
from keras.layers import Bidirectional, MaxPooling1D, MaxPool2D, GlobalMaxPool1D, GlobalMaxPooling1D, GlobalAveragePooling1D
from keras.layers import Reshape, Flatten, Concatenate, concatenate, add, SpatialDropout1D, BatchNormalization, PReLU
from keras.optimizers import Adam
from keras.models import Model
from keras import backend as K
from keras.engine.topology import Layer
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.callbacks import *

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn import metrics
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler

Preprocessing:

Cleaning punctuation and symbols:

def clean_text(x):
    puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£', 
 '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', 
 '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', 
 '▒', ':', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', 
 '∙', ')', '↓', '、', '│', '(', '»', ',', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]
    x = str(x)
    for punct in "/-'":
        x = x.replace(punct, ' ')
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')
    for punct in puncts:
        x = x.replace(punct, f' {punct} ')
    return x

Cleaning numbers with regular expressions. Runs of digits are bucketed by length, longest pattern first, so for example "2018" becomes "####" and "150000" becomes "#####"; the tokenizer then only sees a handful of number placeholders instead of every distinct number:

import re
def clean_numbers(x):
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x

Cleaning misspellings and contractions:

def _get_mispell(mispell_dict):
    mispell_re = re.compile('(%s)' % '|'.join(mispell_dict.keys()))
    return mispell_dict, mispell_re


mispell_dict = {'colour':'color','centre':'center','didnt':'did not','doesnt':'does not',
                'isnt':'is not','shouldnt':'should not','favourite':'favorite','travelling':'traveling',
                'counselling':'counseling','theatre':'theater','cancelled':'canceled','labour':'labor',
                'organisation':'organization','wwii':'world war 2','citicise':'criticize','instagram': 'social medium',
                'whatsapp': 'social medium','snapchat': 'social medium',"ain't": "is not", 
                "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", 
                "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", 
                "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would",
                "he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", 
                "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", 
                "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", 
                "i'd": "i would", "i'd've": "i would have", "i'll": "i will","i'll've": "i will have",
                "i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", 
                "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have",
                "it's": "it is","let's": "let us", "ma'am": "madam", "mayn't": "may not", 
                "might've": "might have","mightn't": "might not","mightn't've": "might not have", 
                "must've": "must have", "mustn't": "must not", "mustn't've": "must not have",
                "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", 
                "oughtn't": "ought not","oughtn't've": "ought not have", "shan't": "shall not", 
                "sha'n't": "shall not","shan't've": "shall not have","she'd": "she would", 
                "she'd've": "she would have","she'll": "she will","she'll've": "she will have", 
                "she's": "she is","should've": "should have","shouldn't": "should not","shouldn't've": "should not have", 
                "so've": "so have","so's": "so as","this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", 
                "there'd": "there would", "there'd've": "there would have", "there's": "there is", 
                "here's": "here is","they'd": "they would", "they'd've": "they would have",
                "they'll": "they will", "they'll've": "they will have", 
                "they're": "they are", "they've": "they have", "to've": "to have", 
                "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", 
                "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
                "we've": "we have", "weren't": "were not", "what'll": "what will", 
                "what'll've": "what will have", "what're": "what are",  "what's": "what is",
                "what've": "what have", "when's": "when is", "when've": "when have", 
                "where'd": "where did", "where's": "where is", "where've": "where have", 
                "who'll": "who will", "who'll've": "who will have", "who's": "who is", 
                "who've": "who have", "why's": "why is", "why've": "why have", 
                "will've": "will have", "won't": "will not", "won't've": "will not have", 
                "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", 
                "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have",
                "y'all're": "you all are","y'all've": "you all have","you'd": "you would", 
                "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
                "you're": "you are", "you've": "you have", 'colour': 'color', 'centre': 'center', 
                'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 
                'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 
                'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 
                'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 
                'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 
                'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do',
                'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 
                'mastrubation': 'masturbation', 'mastrubate': 'masturbate', 
                "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum', 
                'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 
                'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', 
                "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 
                'demonitization': 'demonetization', 'demonetisation': 'demonetization'
                }

mispellings, mispellings_re = _get_mispell(mispell_dict)

def replace_typical_misspell(text):
    def replace(match):
        return mispellings[match.group(0)]

    return mispellings_re.sub(replace, text)
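
For example, applied directly to a raw string the replacements expand contractions and Americanize spellings (note that the patterns are compiled without word boundaries, so a key can also match inside a longer word):

replace_typical_misspell("what's your favourite colour")  # -> "what is your favorite color"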

Loading the data and running the preprocessing:

train_df = pd.read_csv("../input/train.csv")

test_df = pd.read_csv("../input/test.csv")

print("Train shape : ",train_df.shape)
print("Test shape : ",test_df.shape)
embed_size = 300 # embedding dimension
max_features = 95000 # vocabulary size
max_len = 70 # maximum sequence length
# lower
train_df['question_text'] = train_df['question_text'].apply(lambda x : x.lower())
test_df['question_text'] = test_df['question_text'].apply(lambda x : x.lower())

# clean the text
train_df["question_text"] = train_df["question_text"].apply(lambda x : clean_text(x))
test_df["question_text"] = test_df["question_text"].apply(lambda x : clean_text(x))

# clean numbers
train_df["question_text"] = train_df["question_text"].apply(lambda x: clean_numbers(x))
test_df["question_text"] = test_df["question_text"].apply(lambda x : clean_numbers(x))

# clean spellings
train_df['question_text'] = train_df['question_text'].apply(lambda x: replace_typical_misspell(x))
test_df['question_text'] = test_df['question_text'].apply(lambda x: replace_typical_misspell(x))

# fill up the missing values
train_X = train_df['question_text'].fillna("_##_").values
test_X = test_df['question_text'].fillna("_##_").values

# tokenize the sentences
tokenizer = Tokenizer(num_words = max_features)
tokenizer.fit_on_texts(list(train_X))
train_X = tokenizer.texts_to_sequences(train_X)
test_X = tokenizer.texts_to_sequences(test_X)

# pad the sentences
train_X = pad_sequences(train_X, maxlen = max_len)
test_X = pad_sequences(test_X, maxlen = max_len)

# the target values
train_y = train_df['target'].values
np.random.seed(666)
trn_idx = np.random.permutation(len(train_X))

train_X = train_X[trn_idx]
train_y = train_y[trn_idx]

Loading pre-trained word embeddings:

def load_glove(word_index):
#     EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
    EMBEDDING_FILE = 'J:/Code/kaggle/Quora_Insincere_Question_Classfication/glove.840B.300d/glove.840B.300d.txt'
    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, 'r', encoding = 'UTF-8'))

    all_embs = np.stack(embeddings_index.values())
    emb_mean,emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(max_features, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= max_features: continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: embedding_matrix[i] = embedding_vector
            
    return embedding_matrix 

def load_fasttext(word_index):    
#     EMBEDDING_FILE = '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'
    EMBEDDING_FILE = 'J:/Code/kaggle/Quora_Insincere_Question_Classfication/wiki-news-300d-1M/wiki-news-300d-1M.vec'
    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, 'r', encoding = 'UTF-8') if len(o)>100)

    all_embs = np.stack(embeddings_index.values())
    emb_mean,emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(max_features, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= max_features: continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: embedding_matrix[i] = embedding_vector

    return embedding_matrix

def load_para(word_index):
#     EMBEDDING_FILE = '../input/embeddings/paragram_300_sl999/paragram_300_sl999.txt'
    EMBEDDING_FILE = 'J:/Code/kaggle/Quora_Insincere_Question_Classfication/paragram_300_sl999/paragram_300_sl999.txt'
    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding="utf8", errors='ignore') if len(o)>100)

    all_embs = np.stack(embeddings_index.values())
    emb_mean,emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(max_features, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= max_features: continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: embedding_matrix[i] = embedding_vector
    
    return embedding_matrix
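
Before deciding which pre-trained vectors to use or blend, it is worth checking how much of the tokenizer vocabulary each embedding actually covers. The helper below is only a hypothetical sketch added for inspection (it is not part of the original code); embeddings_index is assumed to be a {word: vector} dictionary built the same way as inside the loaders above:

def vocab_coverage(word_counts, embeddings_index):
    # fraction of distinct words, and of token occurrences, covered by the embedding
    covered_words, covered_tokens, total_tokens = 0, 0, 0
    for word, count in word_counts.items():
        total_tokens += count
        if word in embeddings_index:
            covered_words += 1
            covered_tokens += count
    return covered_words / len(word_counts), covered_tokens / total_tokens

# e.g. vocab_coverage(tokenizer.word_counts, embeddings_index)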

Attention layer:

class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        """
        Keras Layer that implements an Attention mechanism for temporal data.
        Supports Masking.
        Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756]
        # Input shape
            3D tensor with shape: `(samples, steps, features)`.
        # Output shape
            2D tensor with shape: `(samples, features)`.
        :param kwargs:
        Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
        The dimensions are inferred based on the output shape of the RNN.
        Example:
            model.add(LSTM(64, return_sequences=True))
            model.add(Attention(step_dim))  # step_dim = number of timesteps in the RNN output
        """
        self.supports_masking = True
        #self.init = initializations.get('glorot_uniform')
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight((input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight((input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        # eij = K.dot(x, self.W) TF backend doesn't support it

        # features_dim = self.W.shape[0]
        # step_dim = x._keras_shape[1]

        features_dim = self.features_dim
        step_dim = self.step_dim

        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim))

        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        a = K.exp(eij)

        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())

        # in some cases especially in the early stages of training the sum may be almost zero
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        # print(weighted_input.shape)
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        #return input_shape[0], input_shape[-1]
        return input_shape[0],  self.features_dim

Capsule network:

def squash(x, axis=-1):
    # s_squared_norm is really small
    # s_squared_norm = K.sum(K.square(x), axis, keepdims=True) + K.epsilon()
    # scale = K.sqrt(s_squared_norm)/ (0.5 + s_squared_norm)
    # return scale * x
    s_squared_norm = K.sum(K.square(x), axis, keepdims=True)
    scale = K.sqrt(s_squared_norm + K.epsilon())
    return x / scale

# A Capsule Implement with Pure Keras
class Capsule(Layer):
    def __init__(self, num_capsule, dim_capsule, routings=3, kernel_size=(9, 1), share_weights=True,
                 activation='default', **kwargs):
        super(Capsule, self).__init__(**kwargs)
        self.num_capsule = num_capsule
        self.dim_capsule = dim_capsule
        self.routings = routings
        self.kernel_size = kernel_size
        self.share_weights = share_weights
        if activation == 'default':
            self.activation = squash
        else:
            self.activation = Activation(activation)

    def build(self, input_shape):
        super(Capsule, self).build(input_shape)
        input_dim_capsule = input_shape[-1]
        if self.share_weights:
            self.W = self.add_weight(name='capsule_kernel',
                                     shape=(1, input_dim_capsule,
                                            self.num_capsule * self.dim_capsule),
                                     # shape=self.kernel_size,
                                     initializer='glorot_uniform',
                                     trainable=True)
        else:
            input_num_capsule = input_shape[-2]
            self.W = self.add_weight(name='capsule_kernel',
                                     shape=(input_num_capsule,
                                            input_dim_capsule,
                                            self.num_capsule * self.dim_capsule),
                                     initializer='glorot_uniform',
                                     trainable=True)

    def call(self, u_vecs):
        if self.share_weights:
            u_hat_vecs = K.conv1d(u_vecs, self.W)
        else:
            u_hat_vecs = K.local_conv1d(u_vecs, self.W, [1], [1])

        batch_size = K.shape(u_vecs)[0]
        input_num_capsule = K.shape(u_vecs)[1]
        u_hat_vecs = K.reshape(u_hat_vecs, (batch_size, input_num_capsule,
                                            self.num_capsule, self.dim_capsule))
        u_hat_vecs = K.permute_dimensions(u_hat_vecs, (0, 2, 1, 3))
        # final u_hat_vecs.shape = [None, num_capsule, input_num_capsule, dim_capsule]

        b = K.zeros_like(u_hat_vecs[:, :, :, 0])  # shape = [None, num_capsule, input_num_capsule]
        for i in range(self.routings):
            b = K.permute_dimensions(b, (0, 2, 1))  # shape = [None, input_num_capsule, num_capsule]
            c = K.softmax(b)
            c = K.permute_dimensions(c, (0, 2, 1))
            b = K.permute_dimensions(b, (0, 2, 1))
            outputs = self.activation(tf.keras.backend.batch_dot(c, u_hat_vecs, [2, 2]))
            if i < self.routings - 1:
                b = tf.keras.backend.batch_dot(outputs, u_hat_vecs, [2, 3])

        return outputs

    def compute_output_shape(self, input_shape):
        return (None, self.num_capsule, self.dim_capsule)
    
def capsule():
    K.clear_session()       
    inp = Input(shape=(max_len,))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix], trainable=False)(inp)
    x = SpatialDropout1D(rate=0.2)(x)
    x = Bidirectional(CuDNNGRU(100, return_sequences=True, 
                                kernel_initializer=initializers.glorot_normal(seed=12300), recurrent_initializer=initializers.orthogonal(gain=1.0, seed=10000)))(x)

    x = Capsule(num_capsule=10, dim_capsule=10, routings=4, share_weights=True)(x)
    x = Flatten()(x)

    x = Dense(100, activation="relu", kernel_initializer=initializers.glorot_normal(seed=12300))(x)
    x = Dropout(0.12)(x)
    x = BatchNormalization()(x)

    x = Dense(1, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer=Adam(),)
    return model

def f1_smart(y_true, y_pred):
    args = np.argsort(y_pred)
    tp = y_true.sum()
    fs = (tp - np.cumsum(y_true[args[:-1]])) / np.arange(y_true.shape[0] + tp - 1, tp, -1)
    res_idx = np.argmax(fs)
    return 2 * fs[res_idx], (y_pred[args[res_idx]] + y_pred[args[res_idx + 1]]) / 2
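
Note: the models below are compiled with a custom f1 metric that is not defined in this excerpt (it comes from the full source). A minimal batch-wise stand-in, assuming the usual precision/recall formulation on rounded predictions, could look like this:

def f1(y_true, y_pred):
    # rough batch-wise F1; only an assumed stand-in for the metric used in the full source
    y_pred_bin = K.round(K.clip(y_pred, 0, 1))
    tp = K.sum(y_true * y_pred_bin)
    fp = K.sum((1 - y_true) * y_pred_bin)
    fn = K.sum(y_true * (1 - y_pred_bin))
    precision = tp / (tp + fp + K.epsilon())
    recall = tp / (tp + fn + K.epsilon())
    return 2 * precision * recall / (precision + recall + K.epsilon())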

Modeling:

After comparing models from various strong competitors, I picked out a few that were particularly effective.

First, a bidirectional LSTM/GRU model with attention:

def model_lstm_atten(embedding_matrix):
    inp = Input(shape = (max_len,))
    x = Embedding(max_features, embed_size, weights = [embedding_matrix], trainable = False)(inp)
    x = SpatialDropout1D(0.1)(x)
    x = Bidirectional(CuDNNLSTM(40, return_sequences = True))(x)
    y = Bidirectional(CuDNNGRU(40, return_sequences = True))(x)
    atten_1 = Attention(max_len)(x)
    atten_2 = Attention(max_len)(y)
    avg_pool = GlobalAveragePooling1D()(y)
    max_pool = GlobalMaxPooling1D()(y)
    
    conc = concatenate([atten_1, atten_2, avg_pool, max_pool])
    conc = Dense(16, activation = "relu")(conc)
    conc = Dropout(0.1)(conc)
    outp = Dense(1, activation = "sigmoid")(conc)
    
    model = Model(inputs = inp, outputs = outp)
    model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = [f1])
    
    return model

A bidirectional LSTM/GRU model with both attention and a capsule layer:

def model_atten_capsule(embedding_matrix): 
    '''0.7'''
    inp_x = Input(shape = (max_len, ))
    inp_features = Input(shape = (6, ))
    x = Embedding(max_features,embed_size, weights = [embedding_matrix], trainable = False)(inp_x)
    x = SpatialDropout1D(0.1)(x)
    lstm = Bidirectional(CuDNNLSTM(60, return_sequences = True, kernel_initializer = initializers.glorot_normal(seed = 12300),
                                   recurrent_initializer=initializers.orthogonal(gain=1.0, seed=10000)))(x)
    gru = Bidirectional(CuDNNGRU(60, return_sequences = True, kernel_initializer = initializers.glorot_normal(seed = 12300),
                                 recurrent_initializer=initializers.orthogonal(gain=1.0, seed=10000)))(lstm)
#     x = Bidirectional(CuDNNLSTM(64, return_sequences = True))(x)
    content3 = Capsule(num_capsule = 10, dim_capsule = 10, routings = 4, share_weights = True)(gru)
    content3 = Dropout(0.1)(content3)
#     content3 = Reshape(-1, )(content3)
    content3 = Flatten()(content3)
    content3 = Dense(1, activation = "relu", kernel_initializer=initializers.glorot_normal(seed=12300))(content3)
    ### modified content3
    atten_lstm = Attention(max_len)(lstm)
    atten_gru = Attention(max_len)(gru)
    avg_pool = GlobalAveragePooling1D()(gru)
    max_pool = GlobalMaxPooling1D()(gru)
    
   
    conc = concatenate([atten_lstm, atten_gru, content3, avg_pool, max_pool, inp_features]) #
    
    ### modified the dense head
    conc = Dense(16, activation = "relu", kernel_initializer=initializers.glorot_normal(seed=12300))(conc)
    x = BatchNormalization()(conc)
    x = Dropout(0.1)(x)
    outp = Dense(1, activation = "sigmoid")(x)  # sigmoid output to match binary_crossentropy and the threshold search
    
    model = Model(inputs = [inp_x, inp_features], outputs = outp)
    model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = [f1])
    
    return model
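
This model expects a second input of 6 hand-crafted statistical features per question (the train_features / val_features arrays that appear commented out in the training loop below); building them is not shown in this excerpt. A hypothetical example of such features, computed on the raw question text before lowercasing and scaled with StandardScaler, might be:

def build_features(df):
    # hypothetical set of 6 per-question statistics; not the original feature set
    feats = pd.DataFrame()
    q = df['question_text'].astype(str)
    feats['num_words'] = q.str.split().str.len()
    feats['num_unique_words'] = q.apply(lambda s: len(set(s.split())))
    feats['num_chars'] = q.str.len()
    feats['num_caps'] = q.apply(lambda s: sum(c.isupper() for c in s))
    feats['caps_ratio'] = feats['num_caps'] / (feats['num_chars'] + 1)
    feats['mean_word_len'] = q.apply(lambda s: np.mean([len(w) for w in s.split()] or [0]))
    return StandardScaler().fit_transform(feats.values)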

CNN model (a Kim-style text CNN: the embedded sequence is reshaped into a 4-D tensor so that Conv2D filters of heights 1, 2, 3, and 5 each span the full embedding dimension):

def model_cnn(embedding_matrix):
    filter_sizes = [1, 2, 3, 5]
    num_filters = 36
    
    inp = Input(shape = (max_len,))
    x = Embedding(max_features, embed_size, weights = [embedding_matrix])(inp)
    x = Reshape((max_len, embed_size, 1))(x)
    
    maxpool_pool = []
    
    for i in range(len(filter_sizes)):
        conv = Conv2D(num_filters, kernel_size = (filter_sizes[i], embed_size), kernel_initializer = 'he_normal', activation = 'elu')(x)
        maxpool_pool.append(MaxPool2D(pool_size = (max_len - filter_sizes[i] + 1, 1))(conv))
    
    z = Concatenate(axis = 1)(maxpool_pool)
    z = Flatten()(z)
    z = Dropout(0.1)(z)
    outp = Dense(1, activation = "sigmoid")(z)
    
    model = Model(inputs = inp, outputs = outp)
    model.summary()
    model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
    
    return model

I also found an implementation of DPCNN (Deep Pyramid CNN, proposed by Johnson and Zhang for text categorization) online and made some targeted modifications to it:

def model_dpcnn(embedding_matrix):
    filter_nr = 64 
    filter_size = 3
    max_pool_size = 3
    max_pool_strides = 2
    dense_nr = 256
    spatial_dropout = 0.1
    dense_dropout = 0.2
    train_embed = False
    conv_kern_reg = regularizers.l2(0.00001)
    conv_bias_reg = regularizers.l2(0.00001)
    
    inp = Input(shape = (max_len, ))
    emb_comment = Embedding(max_features, embed_size, weights = [embedding_matrix], trainable = False)(inp)
#     emb_comment = SpatialDropout1D(0.1)(emb_comment)
    
    #block1
    block1 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(emb_comment)
    block1 = BatchNormalization()(block1)
    block1 = PReLU()(block1)
    block1 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block1)
    block1 = BatchNormalization()(block1)
    block1 = PReLU()(block1)
    #we pass embedded comment through conv1d with filter size 1 because it needs to have the same shape as block output
    #if you choose filter_nr = embed_size (300 in this case) you don't have to do this part and can add emb_comment directly to block1_output
    resize_emb = Conv1D(filter_nr, kernel_size = 1, padding = 'same', activation = 'linear', 
                       kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(emb_comment)
    resize_emb = PReLU()(resize_emb)
    
    block1_output = add([block1, resize_emb])
    block1_output = MaxPooling1D(pool_size = max_pool_size, strides = max_pool_strides)(block1_output)
    
    
    #block2
    block2 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block1_output)
    block2 = BatchNormalization()(block2)
    block2 = PReLU()(block2)
    block2 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block2)
    block2 = BatchNormalization()(block2)
    block2 = PReLU()(block2)
    
    block2_output = add([block2, block1_output])
    block2_output = MaxPooling1D(pool_size = max_pool_size, strides = max_pool_strides)(block2_output)
    
    #block3
    block3 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block2_output)
    block3 = BatchNormalization()(block3)
    block3 = PReLU()(block3)
    block3 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block3)
    block3 = BatchNormalization()(block3)
    block3 = PReLU()(block3)
    
    block3_output = add([block3, block2_output])
    block3_output = MaxPooling1D(pool_size = max_pool_size, strides = max_pool_strides)(block3_output)
    
    #block4
    block4 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block3_output)
    block4 = BatchNormalization()(block4)
    block4 = PReLU()(block4)
    block4 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block4)
    block4 = BatchNormalization()(block4)
    block4 = PReLU()(block4)
    
    block4_output = add([block4, block3_output])
    block4_output = MaxPooling1D(pool_size = max_pool_size, strides = max_pool_strides)(block4_output)
    
    #block5
    block5 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block4_output)
    block5 = BatchNormalization()(block5)
    block5 = PReLU()(block5)
    block5 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block5)
    block5 = BatchNormalization()(block5)
    block5 = PReLU()(block5)
    
    block5_output = add([block5, block4_output])
    block5_output = MaxPooling1D(pool_size = max_pool_size, strides = max_pool_strides)(block5_output)
    
#     #block6
#     block6 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
#                    kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block5_output)
#     block6 = BatchNormalization()(block6)
#     block6 = PReLU()(block6)
#     block6 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
#                    kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block6)
#     block6 = BatchNormalization()(block6)
#     block6 = PReLU()(block6)
    
#     block6_output = add([block6, block5_output])
#     block6_output = MaxPooling1D(pool_size = max_pool_size, strides = max_pool_strides)(block6_output)
    
    #block7
    block7 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block5_output)
    block7 = BatchNormalization()(block7)
    block7 = PReLU()(block7)
    block7 = Conv1D(filter_nr, kernel_size = filter_size, padding = 'same', activation = 'linear',
                   kernel_regularizer = conv_kern_reg, bias_regularizer = conv_bias_reg)(block7)
    block7 = BatchNormalization()(block7)
    block7 = PReLU()(block7)
    
    block7_output = add([block7, block5_output])
    outp = GlobalMaxPooling1D()(block7_output)
#     output = block7_output
    
    outp = Dense(dense_nr, activation = 'linear')(outp)
    outp = BatchNormalization()(outp)
    outp = Dropout(0.1)(outp)
    outp = Dense(1, activation = 'sigmoid')(outp)
    
    model = Model(inputs = inp, outputs = outp)
    model.summary()
    model.compile(loss = 'binary_crossentropy',
                 optimizer = 'adam',
                 metrics = ['accuracy'])
    return model

There were also a few other models along the same lines, combining attention, capsule layers, and bidirectional LSTM or GRU but with different architectures; they are not listed here.

Embedding matrix construction and training:

embedding_matrix_1 = load_glove(tokenizer.word_index)
# embedding_matrix_2 = load_fasttext(tokenizer.word_index)
embedding_matrix_3 = load_para(tokenizer.word_index)

embedding_matrix = np.mean([embedding_matrix_1, embedding_matrix_3], axis = 0)
# embedding_matrix = np.mean([embedding_matrix_1, embedding_matrix_2], axis = 0)
# np.shape(embedding_matrix)
del embedding_matrix_1, embedding_matrix_3
gc.collect()

Searching for the best threshold:

def threshold_search(y_true, y_prob):
    best_thresh = 0
    best_score = 0
    for thresh in np.arange(0.1, 0.701, 0.01):
        thresh = np.round(thresh, 2)
        score = metrics.f1_score(y_true, (y_prob >= thresh).astype(int))
        print("F1 score at threshold {} is {}".format(thresh, score))
        if score > best_score:
            best_score = score
            best_thresh = thresh
    return best_thresh

def train_pred(model, dev_X, dev_y, val_X, val_y, test_X, dev_features = None, val_features = None, epochs = None, callback = None):
    if dev_features is None:
        model.fit(dev_X, dev_y, batch_size = 512, epochs = epochs, validation_data = (val_X, val_y), callbacks = callback, verbose = 0)
        pred_test_y_temp = model.predict(test_X, batch_size = 1024)
#     pred_test_y_temp = model.predict(np.concatenate((test_X, test_features), axis = 1), batch_size = 1024)
    else:
        model.fit([dev_X, dev_features], dev_y, batch_size = 512, epochs = epochs, validation_data = ([val_X, val_features], val_y), callbacks = callback, verbose = 0)
        pred_test_y_temp = model.predict([test_X, test_features], batch_size = 1024)
    return pred_test_y_temp
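
The training loop below also passes a clr callback, a cyclic learning rate schedule defined elsewhere in the full source. A minimal triangular CLR sketch (an assumed stand-in, not the original implementation) could be:

class CyclicLR(Callback):
    # simple triangular cyclic learning rate; assumed stand-in for the clr callback in the full source
    def __init__(self, base_lr=0.001, max_lr=0.003, step_size=300.):
        super(CyclicLR, self).__init__()
        self.base_lr = base_lr
        self.max_lr = max_lr
        self.step_size = step_size
        self.iterations = 0

    def on_batch_end(self, batch, logs=None):
        self.iterations += 1
        cycle = math.floor(1 + self.iterations / (2 * self.step_size))
        x = abs(self.iterations / self.step_size - 2 * cycle + 1)
        lr = self.base_lr + (self.max_lr - self.base_lr) * max(0., 1 - x)
        K.set_value(self.model.optimizer.lr, lr)

clr = CyclicLR(base_lr=0.001, max_lr=0.003, step_size=300.)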

Here I used 4-fold cross-validation and trained with one of the models above:

## ADDITIONAL TRAINING: lstm_atten
num_splits = 4
skf = StratifiedKFold(n_splits = num_splits, shuffle = True, random_state = 2333)
output = []  # collects [test predictions, threshold, val score, description] for each configuration
pred_test_y = 0
thresh_use = 0
val_score = 0
for dev_index, val_index in skf.split(train_X, train_y):
    dev_X, val_X = train_X[dev_index, :], train_X[val_index,:]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
#     dev_features, val_features = train_features[dev_index, :], train_features[val_index, :]
    model = model_lstm_atten(embedding_matrix)
    pred_test_y_temp = train_pred(model, dev_X, dev_y, val_X, val_y, test_X, dev_features = None, val_features = None, epochs = 2, callback = [clr,])
    pred_val_y = model.predict(val_X, batch_size = 1024)
    best_thresh = threshold_search(val_y, pred_val_y)
    val_score_temp = metrics.f1_score(val_y, (pred_val_y > best_thresh).astype(int))
    print("val temp best f1 score is {0} and best thresh is {1}".format(val_score_temp, best_thresh))
    
    thresh_use += best_thresh
    pred_test_y += pred_test_y_temp
    val_score += val_score_temp
    keras.backend.clear_session()
pred_test_y /= num_splits
thresh_use /= num_splits
val_score /= num_splits
output.append([pred_test_y, thresh_use, val_score, 'lstm atten glove+para'])

Submission:

sub = pd.read_csv('../input/sample_submission.csv')
sub.prediction = (pred_test_y > thresh_use).astype(int)
sub.to_csv("submission.csv", index=False)

Since submissions for this competition had to be made online through a Kaggle Kernel, meaning the run time could not exceed two hours, the time budget had to be used carefully, and the strategy may need to be adjusted accordingly, for example the number of cross-validation folds or the number of epochs. The above is essentially my entire pipeline for this competition. I hope to do better next time.
