【Automobile Industry User Opinion Topic and Sentiment Recognition】A Text Classification Solution

Table of Contents

  • Solution Overview
    • Basic Information
    • Approach
      • 1. Merge labels of rows with duplicate ids into multi-label targets, which also simplifies cross-validation
      • 2. Tokenization: word level + char level
      • 3. Data augmentation
      • 4. Bidirectional RNN (GRU/LSTM) encoder + attention/capsule
      • 5. Output
      • 6. The F1 training metric kept producing NaN; a lot of time went into debugging it (the loss also occasionally became NaN; never fully resolved, and not caused by the samples)
      • 7. Model ensembling
  • Code Walkthrough
    • Preparation
      • Import the required packages
      • Read the data
      • Merge subject and sentiment into 30 classes
    • Analyze the content column
      • Merge duplicate rows to obtain multi-label targets
      • Build the deduplicated training set
      • Load the pretrained word vectors
      • Word-level and character-level tokenization
      • Remove low-frequency words
      • Build the token index
      • Look up word vectors in the embedding
      • Look up character vectors in the embedding
      • Pad the sentences
    • Model
      • Build the label matrix y
      • Define the evaluation metric (F1 score)
      • Data augmentation and sampling
      • Attention model
      • Capsule network
  • References

Solution Overview

Basic Information

Competition: Automobile Industry User Opinion Topic and Sentiment Recognition (汽车行业用户观点主题及情感识别). Team: 专业陪跑二十年

Team members:

  • chantcalf

Framework: Keras + TensorFlow

Preliminary round: Rank 16; Final round A/B: Rank 12/22

Text classification: subject and sentiment are merged into a single 30-class multi-label target.

Approach


1. Merge labels of rows with duplicate ids into multi-label targets, which also simplifies cross-validation

The data was not cleaned; many samples are ambiguous, so we simply left them all to the model.

2. Tokenization: word level + char level

The dataset is small, so using characters alone works slightly better than using words alone.

Embeddings: 300d pretrained and frozen + 100d trainable.

Word-level and char-level models can be trained separately and then ensembled, or combined directly inside one model.

We used https://github.com/Embedding/Chinese-Word-Vectors ;

the vectors trained jointly on words and characters work very well when concatenated directly inside the model.

Different pretrained embeddings could also be used for ensembling (not tried).
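
As a reference for the frozen-plus-trainable setup above, here is a minimal Keras sketch. The 300d/100d sizes follow the text, while names such as nb_words and word_embedding_matrix refer to objects that are only built later in the code walkthrough, so treat this as an illustrative assumption rather than the exact competition code.

from keras.layers import Input, Embedding, concatenate

seq_in = Input(shape=(128,))                    # a padded sequence of word indices
fixed = Embedding(nb_words + 1, 300,
                  weights=[word_embedding_matrix],
                  trainable=False)(seq_in)      # 300d pretrained vectors, kept frozen
learned = Embedding(nb_words + 1, 100)(seq_in)  # 100d embedding trained from scratch
x = concatenate([fixed, learned])               # 400d representation for every token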

3. Data augmentation

In this task global context matters little and the problem clearly has optimal substructure: if a contiguous substring carries a label, the whole sentence carries that label.

Concatenating two randomly chosen samples and taking the element-wise maximum of their labels therefore yields a valid new sample; in each batch we concatenate one sample drawn from the current batch with one sampled from the whole training set.

Increasing the proportion of augmented samples slightly improves the online score.

Similarly, the sentence could be scanned or split into fixed-length windows inside the model with shared weights, taking the max over the window predictions; this amounts to an implicit form of data augmentation (not tried, for fear it would be too slow to run).
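
The batch generator that implements this appears later in the code walkthrough; as a stand-alone illustration, here is a minimal sketch of the mixing step, with purely illustrative names.

import numpy as np

def mix_samples(tokens_a, tokens_b, y_a, y_b):
    """Concatenate two token-id sequences and take the element-wise max of their labels."""
    tokens = tokens_a + tokens_b       # the real generator re-samples if the result is too long
    y = np.maximum(y_a, y_b)           # a class is positive if either parent sample has it
    return tokens, y

# toy example with 30-dimensional multi-hot labels
ya, yb = np.zeros(30), np.zeros(30)
ya[4] = 1; yb[11] = 1
tokens, y = mix_samples([5, 9, 13], [2, 7], ya, yb)
print(tokens, np.where(y == 1)[0])     # [5, 9, 13, 2, 7] [ 4 11]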

4. Bidirectional RNN (GRU/LSTM) encoder + attention/capsule

The attention model and the capsule network perform about equally well, but the attention model trains faster. The capsules can either be grouped so that every few capsules are responsible for one subject, or all capsules can be flattened before the prediction layer; the results are similar.

5. Output

The model outputs the 30 classes directly. Decoding with an RNN output layer or a CRF did not help (presumably because the labels are largely independent of each other).

6. The F1 training metric kept producing NaN; a lot of time went into debugging it (the loss also occasionally became NaN; never fully resolved, and not caused by the samples)

The F1-optimal threshold is generally close to 0.5; alternatively, one can simply monitor the loss and tune the threshold for submission, as sketched below.

About 10% of the training samples carry more than one label, so submitting roughly 1.1 times as many labels as samples works well.
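
A minimal sketch of such a threshold scan on held-out predictions; val_pred and val_y below are hypothetical validation outputs and multi-hot labels, not variables from the original code.

import numpy as np
from sklearn.metrics import f1_score as sk_f1

def best_threshold(val_pred, val_y, grid=np.arange(0.30, 0.71, 0.02)):
    """Scan a small grid of thresholds and keep the one with the best micro-averaged F1."""
    scores = [(t, sk_f1(val_y, (val_pred > t).astype(int), average='micro')) for t in grid]
    return max(scores, key=lambda s: s[1])

# thr, score = best_threshold(val_pred, val_y)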

7. Model ensembling

Slightly changing the architecture or the data-augmentation ratio is enough to count as a new model.

Each model is trained with 5-fold cross-validation, averaging five runs per fold, which works very well (a sketch follows below).

The individual models are then combined by a simple or a weighted average.

We tried stacking for the first time: all model predictions were used as features for a fully connected NN on top, with essentially the same 5-fold cross-validation; online it scored slightly below the plain average.
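
A minimal sketch of the per-model 5-fold averaging; build_model() is a hypothetical factory, and for brevity a single input array X is shown, whereas the real model takes both word and char inputs.

import numpy as np
from sklearn.model_selection import KFold

def cv_predict(build_model, X, Y, test_X, n_folds=5, epochs=10):
    """Train one model configuration with K-fold CV and average its test predictions."""
    test_pred = np.zeros((len(test_X), Y.shape[1]))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=2018)
    for tr_idx, va_idx in kf.split(X):
        model = build_model()
        model.fit(X[tr_idx], Y[tr_idx],
                  validation_data=(X[va_idx], Y[va_idx]),
                  epochs=epochs, verbose=0)
        test_pred += model.predict(test_X) / n_folds
    return test_pred

# several configurations can then be blended with a plain average:
# final_pred = np.mean([cv_predict(f, X, Y, test_X) for f in model_factories], axis=0)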

Code Walkthrough

The code is explained module by module below.


Preparation

Import the required packages

import numpy as np
import pandas as pd

import jieba  # Chinese word segmentation, used by cut_words below

from tqdm import tqdm  # progress bars

import keras
from keras import backend as K
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Activation

import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
import seaborn as sns
sns.set_style("white")
%matplotlib inline

Read the data

train = pd.read_csv('data/input/train.csv')
test = pd.read_csv('data/input/test_public.csv')

Map the subject and sentiment values to integer category ids: subjects to 0-9 and sentiment values to 0-2

subject = list(train['subject'].unique())
def get_subject(x):
    for i in range(len(subject)):
        if x==subject[i]:
            return i
    return -1
train['Y1'] = train['subject'].apply(get_subject)
# the same mapping for sentiment -> Y2 (the sentiment column is assumed to be named 'sentiment_value')
sentiment = list(train['sentiment_value'].unique())
def get_sentiment(x):
    for i in range(len(sentiment)):
        if x==sentiment[i]:
            return i
    return -1
train['Y2'] = train['sentiment_value'].apply(get_sentiment)

Merge subject and sentiment into 30 classes

train['Y3'] = train['Y1']*3+train['Y2']  # class id = subject index * 3 + sentiment index, giving 10*3 = 30 classes

Analyze the content column

print (train.shape,train['content_id'].nunique())
# (9947, 8) 8290
gp = train[['content_id']].groupby(['content_id']).size().rename('counts').reset_index()
gp = gp.sort_values(by='counts',ascending=False)
print (gp.head())
#             content_id  counts
# 3402  PKeUvILOs6M3pYBl       7
# 6621  nJbR1iAFjQPma928       6
# 2242  HBcOg49avix7bfeF       6
# 7741  vqaIrjURK0c7me1i       6
# 5001  bOcxUzdVkeWJDE9I       5
gp[gp['counts']>1].shape
# (1254,2)

The training set has 9,947 rows but only 8,290 distinct content_ids, so the remaining rows are duplicated contents. The code above groups by content_id, counts the rows in each group and sorts by that count: the most duplicated content appears 7 times, 1,254 contents appear at least twice, and the rest appear only once.

Merge duplicate rows to obtain multi-label targets

# merge duplicate rows to build one multi-hot label vector per content
def get_ys(x):
    x = list(x)
    ans = np.zeros(30)
    for i in x:
        ans[i]=1
    return ans
gpy = train.groupby(['content_id'])['Y3'].apply(get_ys).rename('Y4').reset_index()
gpy.shape
# (8290, 2)

This builds, for each unique content, a 30-dimensional multi-hot vector indicating which (subject, sentiment) classes it carries.

Build the deduplicated training set

train1 = train.groupby(['content_id','content']).size().rename('counts').reset_index()
train0 = pd.merge(gpy,train1[['content_id','content']],on=['content_id'],how='left')

This merges the unique (content_id, content) pairs from train with their corresponding multi-label information, producing one row per unique content.

Load the pretrained word vectors

The word vectors pretrained on the Baidu Baike corpus are used here as auxiliary input. The embedding file needs to be downloaded first; the sgns.baidubaike.bigram-char file is about 1.8 GB, so loading the embeddings below takes a while.

# use pretrained word and character vectors
# https://github.com/Embedding/Chinese-Word-Vectors
embeddings_index = {}
EMBEDDING_DIM = 300
embfile = 'data/embedding/sgns.baidubaike.bigram-char'
with open(embfile, encoding='utf-8') as f:
    for i, line in enumerate(f):
        values = line.split()
        words = values[:-EMBEDDING_DIM]
        word = ''.join(words)
        try:
            coefs = np.asarray(values[-EMBEDDING_DIM:], dtype='float32')
            embeddings_index[word] = coefs
        except:
            pass
print('Found %s word vectors.' % len(embeddings_index))
# Found 635793 word vectors.

The code above splits each line of the embedding file into the token and its vector: word is the token and coefs is its 300-dimensional dense vector. In total 635,793 word vectors are loaded.

Word-level and character-level tokenization

Here the content is tokenized both at the character level and at the word level.

# word-level tokenization: strip punctuation, then segment with jieba (full mode)
rls = ['?','!','“','”',':','…','(',')',
      '—','《','》','、','‘','’','"','\n','.',
       ';','#','【','】','\'',':','(','」','∠','+',',',
       '!','|',
      ]
def cut_words(x):
    x = str(x).strip()
    for c in rls:
        x = x.replace(c,' ')
    x = ' '.join(x.split())
    s = ' '.join(jieba.cut(x,cut_all=True))
    s = ' '.join(s.split())
    return s
# character-level tokenization: split the string into individual characters
def cut_chars(x):
    x = str(x).replace(' ','')
    y = [i for i in x]
    y = ' '.join(y)
    return y
train['chars'] = train['content'].apply(cut_chars)
test['chars'] = test['content'].apply(cut_chars)
train['chars_len'] = train['chars'].apply(lambda x:len(x.split()))
test['chars_len'] = test['chars'].apply(lambda x:len(x.split()))
print (train['chars_len'].describe())
print (test['chars_len'].describe())

The statistics above describe the character-level length of each sentence: the average is about 44, and the train and test distributions are similar.

train['words'] = train['content'].apply(cut_words)

The line above runs the word-level tokenization on the training set.

Remove low-frequency words

# remove low-frequency words: word_count1 counts term frequency, word_count2 counts document frequency
word_count1 = {}
word_count2 = {}
for i in tqdm(range(len(train))):
    td = {}
    s = train.loc[i,'words'].split()
    for c in s:
        if c not in word_count1:
            word_count1[c]=1
        else:
            word_count1[c]+=1
        if c not in td:
            td[c] = 1
    for c in td:
        if c not in word_count2:
            word_count2[c]=1
        else:
            word_count2[c]+=1
def remove_low_words(x):
    s = x.split()
    t = []
    for c in s:
        if c in word_count1 and c in word_count2 and word_count1[c]>1 and word_count2[c]>1:
            t.append(c)
    return ' '.join(t)
train['words1'] = train['words'].apply(remove_low_words)
train['words_len'] = train['words1'].apply(lambda x:len(x.split()))

This drops words whose term frequency or document frequency is only 1; afterwards the average sentence length falls to about 26 words.
The same operations are applied to the test set.

Build the token index

MAX_NB_WORDS = 10000
MAX_SEQUENCE_LENGTH = 128
MAX_SEQUENCE_LENGTH1 = 200
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(train['words1'])
word_index = tokenizer.word_index
print (len(word_index))
nb_words = min(MAX_NB_WORDS,len(word_index))

Keras' Tokenizer assigns an integer index to every word produced by the tokenization. Each content string is then represented as the sequence of its word indices:

train_words = tokenizer.texts_to_sequences(train['words1'])
test_words = tokenizer.texts_to_sequences(test['words1'])

Look up word vectors in the embedding

word_embedding_matrix = np.zeros((nb_words + 1, EMBEDDING_DIM))
cc = 0
cc1 = 0
for word, i in word_index.items():
    #print (word,tokenizer.word_counts[word])
    if i > MAX_NB_WORDS:
        continue
    if word in embeddings_index:
        word_embedding_matrix[i] = embeddings_index[word]
        cc +=1
    else:
        cc1+=1
print (cc,cc1)
# 8089 556

word_embedding_matrix stores the embedding vector of every word in our vocabulary:
8,089 words are found in the pretrained embedding, while 556 are not.

Look up character vectors in the embedding

tokenizer1 = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer1.fit_on_texts(train['chars'])
word_index1 = tokenizer1.word_index
print (len(word_index1))
nb_words1 = min(MAX_NB_WORDS,len(word_index1))
train_chars = tokenizer1.texts_to_sequences(train['chars'])
test_chars = tokenizer1.texts_to_sequences(test['chars'])
word_embedding_matrix1 = np.zeros((nb_words1 + 1, EMBEDDING_DIM))
cc = 0
cc1 = 0
for word, i in word_index1.items():
    if i > MAX_NB_WORDS:
        continue
    if word in embeddings_index:
        word_embedding_matrix1[i] = embeddings_index[word]
        cc +=1
    else:
        cc1+=1
print (cc,cc1)
# 2680 18

The same indexing is applied at the character level, and each sentence is also represented as a sequence of character indices; 2,680 characters are found in the embedding and 18 are not.

Pad the sentences

def get_pad_char_seq(x):
    return pad_sequences(x,maxlen=MAX_SEQUENCE_LENGTH1)
def get_pad_seq(x):
    return pad_sequences(x,maxlen=MAX_SEQUENCE_LENGTH)
X = pad_sequences(train_words,maxlen=MAX_SEQUENCE_LENGTH)
test_X = pad_sequences(test_words,maxlen=MAX_SEQUENCE_LENGTH)   # word and char sequences use different max lengths
test_X1 = get_pad_char_seq(test_chars)

Keras' pad_sequences(sequences, maxlen=None) pads (or truncates) variable-length sequences to a common length so they can be fed to the model.

Model

Build the label matrix y

Y holds the label information: there are 30 classes, and each content can belong to several of them.

Y = np.array(list(train['Y4']))

Define the evaluation metric (F1 score)

Define the metric monitored during training: the F1 score.

# f1_score kept producing NaN; it turned out K.sum can yield a plain number here... so 0 is forced as the output in that case
def f1_score(y_true, y_pred):

    # Count positive samples.
    c1 = K.sum(K.round(K.clip(y_pred, 0, 1))*K.round(K.clip(y_true, 0, 1)))
    c2 = K.sum(K.round(K.clip(y_pred, 0, 1)))
    c3 = K.sum(K.round(K.clip(y_true, 0, 1)))

    # If there are no true samples, fix the F1 score at 0.
    if c1==0 or c3 ==0 or c2==0:
        return 0

    # How many selected items are relevant?
    precision = c1 / (c2+0.000001)

    # How many relevant items are selected?
    recall = c1 / (c3+0.000001)

    # Calculate f1_score
    f1 = 2 * (precision * recall) / (precision + recall+0.000001)
    return f1

def c1(y_true, y_pred):

    # number of true positives
    c1 = K.sum(K.round(K.clip(y_pred, 0, 1)*K.clip(y_true, 0, 1)))
    return c1

def c2(y_true, y_pred):

    # number of predicted positives
    c2 = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return c2

def c3(y_true, y_pred):

    # number of actual (ground-truth) positives
    c3 = K.sum(K.round(K.clip(y_true, 0, 1)))
    return c3

Data augmentation and sampling

np.random.seed(1992)
class DataGenerator(keras.utils.Sequence):
    def __init__(self, data,data1,datay,
                 batch_size=256, shuffle=True,aug=0):
        self.batch_size = batch_size
        self.data = data
        self.data1 = data1
        self.datay = datay
        self.aug = aug
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        return int(np.floor(len(self.data) / self.batch_size))

    def __getitem__(self, index):
        indexes = np.array(range(index*self.batch_size,(index+1)*self.batch_size))
        indexes = indexes%len(self.data)
        indexes = self.indexes[indexes]
        X, y = self.__data_generation(indexes)
        return X, y

    def on_epoch_end(self):
        self.indexes = np.array(range(len(self.data)))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, indexes):
        X = []
        X1 = []
        y = []
        for i in range(self.batch_size):
            X.append(self.data[indexes[i]])
            X1.append(self.data1[indexes[i]])
            y.append(self.datay[indexes[i]])
            
        if self.aug>0:
            # augmentation: append self.aug extra samples, each built by concatenating one sample
            # from the current batch with one sampled from the whole training set
            for i in range(self.aug):
                while True:
                    a = np.random.randint(self.batch_size)
                    b = np.random.randint(len(self.data))
                    a = indexes[a]
                    #b = indexes[b]
                    xx = self.data[a]+self.data[b]
                    xx1 = self.data1[a]+self.data1[b]
                    if len(xx)<MAX_SEQUENCE_LENGTH and len(xx1)<MAX_SEQUENCE_LENGTH1:
                        yy = self.datay[a]+self.datay[b]
                        yy = np.minimum(yy,1)   # element-wise maximum of the two multi-hot label vectors
                        X.append(xx)
                        X1.append(xx1)
                        y.append(yy)
                        break
        X = get_pad_seq(X)   
        X1 = get_pad_char_seq(X1) 
        y = np.array(y)
        return [X,X1],y
    
params = {'batch_size': 64,
          'shuffle': True}

# example usage, assuming fold-split arrays X_train (word sequences), X_train1 (char sequences), y_train:
#training_generator = DataGenerator(X_train, X_train1, y_train, **params)

Attention model

# attention layer
from keras import backend as K
from keras.layers import Layer
from keras import initializers, regularizers, constraints
 
def dot_product(x, kernel):
    """
    Wrapper for dot product operation, in order to be compatible with both
    Theano and Tensorflow
    Args:
        x (): input
        kernel (): weights
    Returns:
    """
    if K.backend() == 'tensorflow':
        return K.squeeze(K.dot(x, K.expand_dims(kernel)), axis=-1)
    else:
        return K.dot(x, kernel)

class AttentionWithContext(Layer):
    """
    Attention operation, with a context/query vector, for temporal data.
    Supports Masking.
    Follows the work of Yang et al. [https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf]
    "Hierarchical Attention Networks for Document Classification"
    by using a context vector to assist the attention
    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.
    How to use:
    Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
    The dimensions are inferred based on the output shape of the RNN.
    Note: The layer has been tested with Keras 2.0.6
    Example:
        model.add(LSTM(64, return_sequences=True))
        model.add(AttentionWithContext())
        # next add a Dense layer (for classification/regression) or whatever...
    """
 
    def __init__(self,
                 W_regularizer=None, u_regularizer=None, b_regularizer=None,
                 W_constraint=None, u_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
 
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')
 
        self.W_regularizer = regularizers.get(W_regularizer)
        self.u_regularizer = regularizers.get(u_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)
 
        self.W_constraint = constraints.get(W_constraint)
        self.u_constraint = constraints.get(u_constraint)
        self.b_constraint = constraints.get(b_constraint)
 
        self.bias = bias
        super(AttentionWithContext, self).__init__(**kwargs)
 
    def build(self, input_shape):
        assert len(input_shape) == 3
 
        self.W = self.add_weight((input_shape[-1], input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        if self.bias:
            self.b = self.add_weight((input_shape[-1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
 
        self.u = self.add_weight((input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_u'.format(self.name),
                                 regularizer=self.u_regularizer,
                                 constraint=self.u_constraint)
 
        super(AttentionWithContext, self).build(input_shape)
 
    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None
 
    def call(self, x, mask=None):
        uit = dot_product(x, self.W)
 
        if self.bias:
            uit += self.b
 
        uit = K.tanh(uit)
        ait = dot_product(uit, self.u)
 
        a = K.exp(ait)
 
        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())
 
        # in some cases especially in the early stages of training the sum may be almost zero
        # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
        # a /= K.cast(K.sum(a, axis=1, keepdims=True), K.floatx())
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
 
        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)
 
    def compute_output_shape(self, input_shape):
        return input_shape[0], input_shape[-1]

Capsule network

# capsule network
def squash(x, axis=-1):
    # s_squared_norm is really small
    # s_squared_norm = K.sum(K.square(x), axis, keepdims=True) + K.epsilon()
    # scale = K.sqrt(s_squared_norm)/ (0.5 + s_squared_norm)
    # return scale * x
    s_squared_norm = K.sum(K.square(x), axis, keepdims=True)
    scale = K.sqrt(s_squared_norm + K.epsilon())
    return x / scale

# A Capsule Implement with Pure Keras
class Capsule(Layer):
    def __init__(self, num_capsule, dim_capsule, routings=3, kernel_size=(9, 1), share_weights=True,
                 activation='default', **kwargs):
        super(Capsule, self).__init__(**kwargs)
        self.num_capsule = num_capsule
        self.dim_capsule = dim_capsule
        self.routings = routings
        self.kernel_size = kernel_size
        self.share_weights = share_weights
        if activation == 'default':
            self.activation = squash
        else:
            self.activation = Activation(activation)

    def build(self, input_shape):
        super(Capsule, self).build(input_shape)
        input_dim_capsule = input_shape[-1]
        if self.share_weights:
            self.W = self.add_weight(name='capsule_kernel',
                                     shape=(1, input_dim_capsule,
                                            self.num_capsule * self.dim_capsule),
                                     # shape=self.kernel_size,
                                     initializer='glorot_uniform',
                                     trainable=True)
        else:
            input_num_capsule = input_shape[-2]
            self.W = self.add_weight(name='capsule_kernel',
                                     shape=(input_num_capsule,
                                            input_dim_capsule,
                                            self.num_capsule * self.dim_capsule),
                                     initializer='glorot_uniform',
                                     trainable=True)

    def call(self, u_vecs):
        if self.share_weights:
            u_hat_vecs = K.conv1d(u_vecs, self.W)
        else:
            u_hat_vecs = K.local_conv1d(u_vecs, self.W, [1], [1])

        batch_size = K.shape(u_vecs)[0]
        input_num_capsule = K.shape(u_vecs)[1]
        u_hat_vecs = K.reshape(u_hat_vecs, (batch_size, input_num_capsule,
                                            self.num_capsule, self.dim_capsule))
        u_hat_vecs = K.permute_dimensions(u_hat_vecs, (0, 2, 1, 3))
        # final u_hat_vecs.shape = [None, num_capsule, input_num_capsule, dim_capsule]

        b = K.zeros_like(u_hat_vecs[:, :, :, 0])  # shape = [None, num_capsule, input_num_capsule]
        for i in range(self.routings):
            b = K.permute_dimensions(b, (0, 2, 1))  # shape = [None, input_num_capsule, num_capsule]
            c = K.softmax(b)
            c = K.permute_dimensions(c, (0, 2, 1))
            b = K.permute_dimensions(b, (0, 2, 1))
            outputs = self.activation(K.batch_dot(c, u_hat_vecs, [2, 2]))
            if i < self.routings - 1:
                b = K.batch_dot(outputs, u_hat_vecs, [2, 3])

        return outputs

    def compute_output_shape(self, input_shape):
        return (None, self.num_capsule, self.dim_capsule)
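
For completeness, a minimal sketch of how these building blocks could be wired into the bidirectional-GRU + attention model described in the approach section. The layer sizes and dropout rates are illustrative assumptions, only the frozen pretrained embeddings are shown (the extra 100d trainable embedding is omitted for brevity), and the exact competition configuration may differ.

from keras.models import Model
from keras.layers import Input, Embedding, SpatialDropout1D, Bidirectional, GRU, Dense, concatenate

def build_model():
    # word branch: frozen 300d pretrained embedding -> BiGRU -> attention
    word_in = Input(shape=(MAX_SEQUENCE_LENGTH,))
    w = Embedding(nb_words + 1, EMBEDDING_DIM,
                  weights=[word_embedding_matrix], trainable=False)(word_in)
    w = SpatialDropout1D(0.2)(w)
    w = Bidirectional(GRU(128, return_sequences=True))(w)
    w = AttentionWithContext()(w)

    # char branch: frozen 300d pretrained embedding -> BiGRU -> attention
    char_in = Input(shape=(MAX_SEQUENCE_LENGTH1,))
    c = Embedding(nb_words1 + 1, EMBEDDING_DIM,
                  weights=[word_embedding_matrix1], trainable=False)(char_in)
    c = SpatialDropout1D(0.2)(c)
    c = Bidirectional(GRU(128, return_sequences=True))(c)
    c = AttentionWithContext()(c)

    # concatenate both branches and predict the 30 classes with independent sigmoids
    out = Dense(30, activation='sigmoid')(concatenate([w, c]))
    model = Model([word_in, char_in], out)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[f1_score])
    return model

# model = build_model()
# model.fit_generator(DataGenerator(train_words, train_chars, Y, **params), epochs=10)

To try the capsule variant instead, AttentionWithContext() can be replaced by Capsule(num_capsule=10, dim_capsule=16) followed by a Flatten() layer (Flatten would also need to be imported).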

References

  1. 专业陪跑二十年's Zhihu write-up
  2. 专业陪跑二十年's GitHub repository
