Text Classification with Keras RNN in Practice: Applying Tencent and Baidu Chinese Word Vectors

Chinese Word Vectors

Deep learning has made great strides in NLP, and processing text with deep learning is impossible without first turning the text into vectors.

English grammar lets a sentence be split into words with nothing more than whitespace, which makes obtaining word vectors and the whole NLP preprocessing pipeline much simpler. Industry and academia also provide excellent resources, such as Google's word2vec algorithm and Stanford's GloVe algorithm; these open-source English word vector models have greatly extended the depth and breadth of English natural language processing.

Chinese is different. From classical Chinese onwards, its rich semantics have made word segmentation and sentence breaking the first hurdle any reader has to clear, and even modern vernacular Chinese with punctuation offers no simple rule-based segmentation method. Copying the English NLP pipeline therefore makes Chinese harder to analyse and has left Chinese NLP research lagging behind.

The currently popular way to obtain Chinese word vectors is to train them with the gensim toolkit on the full corpus formed by the training and test sets. The quality of the vectors depends on the size of that corpus: the larger it is, the better the resulting vectors work for deep learning, and the better the "distance" between two vectors reflects their semantic similarity. To obtain high-quality Chinese word vectors you need massive Chinese corpora, huge storage, strong compute and clever algorithm design; none of these can be missing.
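As a quick illustration of this gensim approach (not part of this article's experiment), the minimal sketch below trains 200-dimensional vectors on a tiny, already-segmented toy corpus and measures the "distance" between two words; the sentences and parameters are placeholders.

# Minimal gensim Word2Vec sketch; the toy corpus and parameters are illustrative only.
from gensim.models import Word2Vec

# sentences: an iterable of token lists, e.g. produced beforehand by jieba segmentation
sentences = [
    ["减税", "政策", "利好", "小微", "企业"],
    ["银行", "发布", "季度", "财报", "政策"],
]

# size=200 mirrors the Tencent dimension; gensim >= 4.0 renames this argument to vector_size
model = Word2Vec(sentences, size=200, window=5, min_count=1, workers=4)

# cosine similarity of two word vectors as a proxy for semantic similarity
print(model.wv.similarity("减税", "政策"))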

Tencent AI Lab recently open-sourced its pretrained Chinese word vectors (download: https://ai.tencent.com/ailab/nlp/embedding.html), covering more than 8 million Chinese entries, including popular terms such as "一颗赛艇" and "因吹斯听". According to the release notes, "the training corpus comes from Tencent News and Tian Tian Kuai Bao news articles, plus web pages and novels crawled by ourselves", and "the vectors are trained with our in-house Directional Skip-Gram (DSG) algorithm". The data is roughly 16.7 GB (corrected) of plain txt with 200-dimensional vectors. After downloading it to a local server, the data is loaded into MySQL, and the vectors for each segmented text are then fetched with SQL queries.
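The article does not show the import step itself. Below is a hedged sketch of how the txt file might be streamed into MySQL row by row; the file name, table name and (id, word, vec) column layout are assumptions, chosen so that the vec column matches the results[0][2] access used in the query code of section 3.

# Sketch: load the Tencent embedding txt into MySQL (table/column names are hypothetical).
import pymysql

db = pymysql.connect(host="***", user="***", password="***", db="***", charset="utf8mb4")
cursor = db.cursor()

with open("Tencent_AILab_ChineseEmbedding.txt", encoding="utf-8") as f:
    next(f)  # the first line holds "<vocab size> <dimension>", skip it
    for idx, line in enumerate(f):
        parts = line.rstrip().split(" ")
        word, vec = parts[0], " ".join(parts[1:])  # store the 200 floats as one space-separated string
        cursor.execute("INSERT INTO tencent_embedding (word, vec) VALUES (%s, %s)", (word, vec))
        if idx % 10000 == 0:
            db.commit()          # commit in batches to keep the transaction small
db.commit()
db.close()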

The Baidu AI open platform gives ordinary developers ready-made Chinese NLP capabilities: commercial-grade basic NLP resources trained on data crawled from the whole web, free to try, and supporting many NLP tasks such as Chinese word vectors, text classification (currently 26 content categories including entertainment, sports and technology) and dependency parsing. Word vectors can be fetched via POST requests or the SDK; the data lives on Baidu's servers, the trial quota is limited to QPS = 5, and the vector dimension is 1024, which is plenty for NLP tasks on small datasets.

In this article, a One-hot model, an Embedding model, the Tencent open-source Chinese word vectors and the Baidu open platform Chinese word vectors are each applied, using the Keras deep learning API, to a simple financial text classification task, in order to check how well the two public sets of Chinese word vectors work in practice.

 

Text Preprocessing

This article uses a small financial text dataset that has already been preprocessed (cleaning dirty data is a big job in itself). It has 4 class labels with 1000 samples per class, and training is done with a Keras LSTM model in a Jupyter Notebook.

Read the data:

import pandas as pd
df_train = pd.read_excel('./data/taxdata.xls', header=0)

1. One-hot + RNN

Encode the labels and build the word index with the Tokenizer:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

# Split data and labels
data_list = df_train['data']
class_list = df_train['class']

# Hold out a validation set (the 3200/800 split in the logs corresponds to test_size=0.2)
X_train, X_test, y_train, y_test = train_test_split(data_list, class_list, test_size=0.2)

# Encode the labels
y_labels = list(y_train.value_counts().index)
le = LabelEncoder()
le.fit(y_labels)
num_labels = len(y_labels)
y_train = to_categorical(y_train.map(lambda x: le.transform([x])[0]), num_labels)
y_test = to_categorical(y_test.map(lambda x: le.transform([x])[0]), num_labels)

# Build the word-to-id dictionary
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=" ")
tokenizer.fit_on_texts(data_list)
vocab = tokenizer.word_index
x_train_word_ids = tokenizer.texts_to_sequences(X_train)
x_test_word_ids = tokenizer.texts_to_sequences(X_test)

In principle each word would be One-hot encoded, but the One-hot vectors would have 24271 dimensions. To reduce the computational load, a sequence encoding that keeps the One-hot idea (each word treated as independent) is used instead: every word gets its own integer id, each position in a sample is set to the id of the word occurring there, and padding positions are set to 0. An extra axis is then added for the LSTM:

from keras.preprocessing.sequence import pad_sequences
import numpy as np

# Sequence encoding: pad/truncate every sample to 200 ids, then add a feature axis for the LSTM
x_train_padded_seqs = pad_sequences(x_train_word_ids, maxlen=200)
x_test_padded_seqs = pad_sequences(x_test_word_ids, maxlen=200)
x_train_padded_seqs = np.expand_dims(x_train_padded_seqs, axis=2)
x_test_padded_seqs = np.expand_dims(x_test_padded_seqs, axis=2)

Build the RNN model:

# Sequence ids + RNN model
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
# input_shape matches the padded length (200) and the extra feature axis added above
model.add(LSTM(256, dropout=0.5, recurrent_dropout=0.1, input_shape=(200, 1)))
model.add(Dense(256, activation='relu'))
model.add(Dense(4, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train_padded_seqs, y_train,
          batch_size=32,
          epochs=12,
          validation_data=(x_test_padded_seqs, y_test))

Training results:

Train on 3200 samples, validate on 800 samples
Epoch 1/12
3200/3200 [==============================] - 30s 9ms/step - loss: 1.4094 - acc: 0.2594 - val_loss: 1.4007 - val_acc: 0.2437
Epoch 2/12
3200/3200 [==============================] - 30s 9ms/step - loss: 1.3944 - acc: 0.2559 - val_loss: 1.3860 - val_acc: 0.2550
Epoch 3/12
3200/3200 [==============================] - 30s 9ms/step - loss: 1.3905 - acc: 0.2672 - val_loss: 1.3946 - val_acc: 0.2637
Epoch 4/12
3200/3200 [==============================] - 29s 9ms/step - loss: 1.3937 - acc: 0.2625 - val_loss: 1.4082 - val_acc: 0.2662
Epoch 5/12
3200/3200 [==============================] - 30s 9ms/step - loss: 1.3904 - acc: 0.2591 - val_loss: 1.3824 - val_acc: 0.2575
Epoch 6/12
3200/3200 [==============================] - 27s 9ms/step - loss: 1.3907 - acc: 0.2497 - val_loss: 1.3866 - val_acc: 0.2687
Epoch 7/12
3200/3200 [==============================] - 27s 8ms/step - loss: 1.3875 - acc: 0.2753 - val_loss: 1.3793 - val_acc: 0.2687
Epoch 8/12
3200/3200 [==============================] - 27s 8ms/step - loss: 1.3876 - acc: 0.2581 - val_loss: 1.3811 - val_acc: 0.2662
Epoch 9/12
3200/3200 [==============================] - 28s 9ms/step - loss: 1.3872 - acc: 0.2638 - val_loss: 1.3919 - val_acc: 0.2700
Epoch 10/12
3200/3200 [==============================] - 28s 9ms/step - loss: 1.3871 - acc: 0.2697 - val_loss: 1.3900 - val_acc: 0.2475
Epoch 11/12
3200/3200 [==============================] - 28s 9ms/step - loss: 1.3946 - acc: 0.2603 - val_loss: 1.3881 - val_acc: 0.2725
Epoch 12/12
3200/3200 [==============================] - 29s 9ms/step - loss: 1.3888 - acc: 0.2544 - val_loss: 1.3786 - val_acc: 0.2650

As the logs show, the One-hot (sequence id) + RNN model barely learns: training and validation accuracy both stay very low, around 0.25-0.27, i.e. roughly chance level for four classes.

 

2. Embedding + RNN

Each sample is segmented with the popular jieba package, and meaningless words such as "的" and "是" are removed with a stopword list; a minimal segmentation sketch is given below. The Keras Tokenizer then builds the word index, and an Embedding layer trained on the local dataset learns word vectors of dimension 300.
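The segmentation step itself is not shown in the article; a minimal sketch might look like this, assuming a plain-text stopword file (the path is an assumption):

import jieba

with open("./data/stopwords.txt", encoding="utf-8") as f:      # stopword file path is an assumption
    stopwords_set = set(line.strip() for line in f)

def segment(text):
    # cut a raw sentence into words and drop stopwords such as 的 / 是
    return [w for w in jieba.cut(text) if w.strip() and w not in stopwords_set]

# the Keras Tokenizer splits on whitespace, so each sample is stored as space-joined tokens
data_list = df_train['data'].map(lambda t: " ".join(segment(t)))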

MAX_SEQUENCE_LENGTH = 100  # take 100 words per document
EMBEDDING_DIM = 300        # embedding dimension

tokenizer = Tokenizer(num_words=15000)
tokenizer.fit_on_texts(data_list)                    # fit on the full dataset
sequences = tokenizer.texts_to_sequences(data_list)
word_index_dict = tokenizer.word_index

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)  # clip every document to the same length
labels = to_categorical(np.asarray(class_list))              # one-hot label representation

Shuffle the documents and split the data:

# Shuffle the documents
VALIDATION_SPLIT = 0.2
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

# Split into training and validation sets
x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

Define the embedding layer. With trainable=True, Keras learns the word embeddings from the local dataset during training, and these embeddings are applied to both the training and validation data:

# trainable=True: the embedding weights are updated as model parameters during training
num_words = len(word_index_dict) + 1   # +1 because Tokenizer indices start at 1 and 0 is reserved for padding
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)

Add the RNN; an LSTM layer is used here:

from keras.models import Model
from keras.layers import Embedding, Input, Dense, LSTM

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = LSTM(256, dropout=0.5)(embedded_sequences)
x = Dense(256, activation='relu')(x)
preds = Dense(num_labels, activation='softmax')(x)

model = Model(sequence_input, preds)
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=12, batch_size=128)

Model training results:

Train on 3200 samples, validate on 800 samples
Epoch 1/12
3200/3200 [==============================] - 15s 5ms/step - loss: 1.3157 - acc: 0.4903 - val_loss: 0.9386 - val_acc: 0.6562
Epoch 2/12
3200/3200 [==============================] - 15s 5ms/step - loss: 0.7593 - acc: 0.7184 - val_loss: 0.7342 - val_acc: 0.7163
Epoch 3/12
3200/3200 [==============================] - 16s 5ms/step - loss: 0.6163 - acc: 0.7737 - val_loss: 0.6189 - val_acc: 0.7688
Epoch 4/12
3200/3200 [==============================] - 14s 5ms/step - loss: 0.4820 - acc: 0.8147 - val_loss: 0.6228 - val_acc: 0.7512
Epoch 5/12
3200/3200 [==============================] - 15s 5ms/step - loss: 0.4272 - acc: 0.8491 - val_loss: 0.6892 - val_acc: 0.7550
Epoch 6/12
3200/3200 [==============================] - 14s 4ms/step - loss: 0.3695 - acc: 0.8606 - val_loss: 0.6637 - val_acc: 0.7475
Epoch 7/12
3200/3200 [==============================] - 14s 5ms/step - loss: 0.3239 - acc: 0.8791 - val_loss: 0.7030 - val_acc: 0.7500
Epoch 8/12
3200/3200 [==============================] - 13s 4ms/step - loss: 0.2788 - acc: 0.9056 - val_loss: 0.7326 - val_acc: 0.7538
Epoch 9/12
3200/3200 [==============================] - 14s 4ms/step - loss: 0.2457 - acc: 0.9138 - val_loss: 0.7843 - val_acc: 0.7362
Epoch 10/12
3200/3200 [==============================] - 13s 4ms/step - loss: 0.2107 - acc: 0.9241 - val_loss: 0.8854 - val_acc: 0.7350
Epoch 11/12
3200/3200 [==============================] - 13s 4ms/step - loss: 0.1975 - acc: 0.9331 - val_loss: 0.8781 - val_acc: 0.7350
Epoch 12/12
3200/3200 [==============================] - 13s 4ms/step - loss: 0.1770 - acc: 0.9384 - val_loss: 0.8481 - val_acc: 0.7538

With Embedding + RNN the validation accuracy improves markedly, peaking at 0.7688, while training accuracy climbs above 90%, a clear sign of overfitting. With some tuning the validation accuracy could probably be pushed higher; one simple option is sketched below.
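A low-effort way to tune this is to stop training once the validation loss stops improving. A minimal sketch with a Keras callback (the patience and epochs values are illustrative, not from the original experiment):

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=3)   # stop after 3 epochs without val_loss improvement

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=30, batch_size=128,
          callbacks=[early_stop])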

 

3. Tencent Chinese Word Vectors + RNN

The full Tencent AI Lab word embedding data is uploaded to MySQL, and the vectors for each segmented sample are fetched with SQL queries; the vector dimension is 200.

# Connect to MySQL
import pymysql

db = pymysql.connect(host="***", user="***", password="***", db="***", port=***)
cursor = db.cursor()

# Query a word vector
def querywordvec(word):
    vec_int = []
    query_sql = "select * from XXX where word=%s"    # parameterised query avoids quoting/injection issues
    try:
        cursor.execute(query_sql, (word,))
        results = cursor.fetchall()
        vec = results[0][2]                           # vec column holds the 200 floats as one space-separated string (assumed schema)
        vec_int = [float(v) for v in vec.split()]
    except Exception:
        print("Error: unable to fetch data")
    return vec_int

For example, querying "开发":

print(querywordvec("开发"))
[0.205318, 0.02924, 0.025059, -0.031507, -0.035252, 0.147428, 0.064118, 0.402488, 0.424321, 0.437024, 0.012467, -0.098729, -0.158572, -0.088177, -0.043449, 0.089409, -0.099055, -0.283804, 0.112545, 0.025541, -0.01726, -0.150909, -0.083299, 0.037459, 0.29605, 0.01388, -0.287553, 0.117286, 0.13666, 0.493275, 0.302443, 0.082535, -0.009056, 0.24045, -0.007371, 0.119541, 0.432921, 0.025741, -0.29922, 0.21198, 0.021523, 0.220857, 0.44779, 0.291499, -0.184952, -0.006434, -0.115189, -0.266904, 0.003495, -0.159119, -0.384113, -0.387713, 0.170096, 0.198, 0.07035, 0.177311, -0.019644, -0.188508, 0.031889, -0.392723, 0.227364, 0.616728, -0.059071, -0.364697, -0.077505, -0.260351, -0.268732, 0.238778, 0.427052, 0.321993, -0.037369, -0.159352, 0.400518, 0.229699, -0.3446, 0.046306, -0.066257, 0.377816, -0.055773, -0.325963, -0.102563, 0.205084, 0.118749, 0.403796, 0.085079, -0.134903, -0.035444, 0.126386, -0.142862, -0.293126, 0.142639, 0.108202, -0.022327, 0.011597, 0.426736, -0.090832, 0.168828, -0.330628, -0.333454, 0.214868, -0.452769, -0.024319, 0.072544, 0.127925, -0.389489, -0.088286, -0.190296, -0.085807, 0.077976, -0.020705, -0.219265, -0.360965, 0.207212, -0.285199, -0.211401, -0.120366, 0.086652, 0.090502, -0.144671, -0.392925, 0.285612, -0.427401, -0.148718, -0.061124, 0.139129, 0.199844, -0.39937, -0.123523, -0.21283, -0.129926, 0.249201, -0.196021, 0.166995, -0.452386, -0.371797, -0.234237, -0.063534, 0.056776, -0.159089, 0.260067, -0.412732, -0.195597, -0.431985, -0.471183, -0.170963, 0.180416, 0.197121, 0.296519, 0.081022, -0.383157, -0.227555, -0.285242, 0.040487, -0.224609, -0.121581, 0.186237, 0.010203, 0.502054, -0.2188, -0.088945, 0.219765, -0.045787, 0.119763, 0.30921, 0.231384, -0.163442, -0.1442, 0.06971, -0.325053, -0.247143, 0.112627, 0.034369, -0.096266, 0.194837, -0.301401, -0.099836, -0.075422, 0.367559, -0.319538, 0.470193, -0.165735, -0.350219, 0.295977, 0.009617, 0.201713, 0.33146, 0.03736, 0.224218, -0.09293, 0.10523, -0.018303, 0.191042, 0.260462, 0.025095, 0.122858, 0.635381, 0.26528, 0.309128, -0.30828, -0.015132]

Build the embedding matrix. A total of k = 1002 words were not found, e.g. "租代建", "按平销", "65%", "税盘", so the full dataset yields 13281 - 1002 = 12279 Chinese word vectors.

EMBEDDING_DIM = 200                    # the Tencent vectors are 200-dimensional
num_words = len(word_index_dict) + 1   # index 0 is reserved for padding
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
k = 0                                  # counts the words not found in MySQL
for word, i in word_index_dict.items():
    vec = querywordvec(word)
    if len(vec) != 0:
        embedding_matrix[i] = vec      # row i matches the Tokenizer id fed into the Embedding layer
    else:
        k += 1
db.close()

Set up the embedding layer:

# trainable=False: the pretrained vectors are frozen and not updated during training
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

As before, the LSTM architecture from section 2 (Embedding + RNN) is trained again, now with the frozen embedding layer and epochs raised to 20.
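The reused code is, roughly, the section 2 model with the new embedding layer plugged in (nothing else changes):

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)    # embedding_layer holds the frozen Tencent vectors
x = LSTM(256, dropout=0.5)(embedded_sequences)
x = Dense(256, activation='relu')(x)
preds = Dense(num_labels, activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=20, batch_size=128)

Training results: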

Train on 3200 samples, validate on 800 samples
Epoch 1/20
3200/3200 [==============================] - 23s 7ms/step - loss: 1.3878 - acc: 0.3259 - val_loss: 1.3240 - val_acc: 0.3287
Epoch 2/20
3200/3200 [==============================] - 21s 7ms/step - loss: 1.2471 - acc: 0.4431 - val_loss: 1.2147 - val_acc: 0.4225
Epoch 3/20
3200/3200 [==============================] - 21s 6ms/step - loss: 1.1815 - acc: 0.4797 - val_loss: 1.1546 - val_acc: 0.4888
Epoch 4/20
3200/3200 [==============================] - 21s 7ms/step - loss: 1.1166 - acc: 0.5075 - val_loss: 1.0379 - val_acc: 0.5587
Epoch 5/20
3200/3200 [==============================] - 22s 7ms/step - loss: 1.0926 - acc: 0.5291 - val_loss: 0.9903 - val_acc: 0.5962
Epoch 6/20
3200/3200 [==============================] - 21s 6ms/step - loss: 1.0647 - acc: 0.5384 - val_loss: 0.9586 - val_acc: 0.5850
Epoch 7/20
3200/3200 [==============================] - 12s 4ms/step - loss: 1.0155 - acc: 0.5725 - val_loss: 1.1208 - val_acc: 0.4913
Epoch 8/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.9712 - acc: 0.5816 - val_loss: 0.8816 - val_acc: 0.6275
Epoch 9/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.9227 - acc: 0.6134 - val_loss: 0.8344 - val_acc: 0.6338
Epoch 10/20
3200/3200 [==============================] - 11s 4ms/step - loss: 0.9085 - acc: 0.6163 - val_loss: 0.8979 - val_acc: 0.6312
Epoch 11/20
3200/3200 [==============================] - 12s 4ms/step - loss: 0.8666 - acc: 0.6331 - val_loss: 0.9032 - val_acc: 0.6138
Epoch 12/20
3200/3200 [==============================] - 12s 4ms/step - loss: 0.8023 - acc: 0.6597 - val_loss: 0.7601 - val_acc: 0.6800
Epoch 13/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.7834 - acc: 0.6778 - val_loss: 0.8206 - val_acc: 0.6488
Epoch 14/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.7422 - acc: 0.6966 - val_loss: 0.8015 - val_acc: 0.6637
Epoch 15/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.7179 - acc: 0.7041 - val_loss: 0.7536 - val_acc: 0.6637
Epoch 16/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.6965 - acc: 0.7084 - val_loss: 0.7869 - val_acc: 0.6937
Epoch 17/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.6855 - acc: 0.7216 - val_loss: 0.7411 - val_acc: 0.6800
Epoch 18/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.6342 - acc: 0.7341 - val_loss: 0.7292 - val_acc: 0.7125
Epoch 19/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.6188 - acc: 0.7550 - val_loss: 0.7849 - val_acc: 0.6987
Epoch 20/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.5950 - acc: 0.7641 - val_loss: 0.7121 - val_acc: 0.7188

With the Tencent word vectors, training and validation accuracy stay roughly in step, and validation accuracy peaks at 0.7188.

 

4. Baidu Chinese Word Vectors + RNN

Using the Baidu open platform's basic NLP resources (docs: https://cloud.baidu.com/doc/NLP/index.html) requires a Baidu Cloud account; following the registration flow yields the APP_ID, API_KEY and SECRET_KEY configuration.

# Fetch Chinese word vectors with the Baidu Python SDK
from aip import AipNlp

APP_ID = '***'
API_KEY = '***'
SECRET_KEY = '***'
client = AipNlp(APP_ID, API_KEY, SECRET_KEY)

Fetch the Chinese word vectors; the vector dimension is 1024:

import re
import time

EMBEDDING_DIM = 1024                   # the Baidu vectors are 1024-dimensional
num_words = len(word_index_dict) + 1   # index 0 is reserved for padding
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
k = 0                                  # counts the words that return no vector
pattern = re.compile('[0-9]+')         # used to skip purely numeric tokens
for word, i in word_index_dict.items():
    embedding_vector = None            # reset so a failed lookup never reuses the previous word's vector
    match = pattern.findall(word)
    if len(word) <= 2 and not match:   # the API only accepts short (two-character) words
        wordinfo = client.wordEmbedding(word)
        time.sleep(0.6)                # stay under the QPS=5 trial limit
        try:
            embedding_vector = wordinfo['vec']
        except Exception:
            k += 1
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

The Baidu API only returns vectors for two-character words: 2566 two-character words returned no vector, 4658 words are longer than two characters, and 590 tokens are numeric, so the full dataset yields 13281 - 2566 - 590 - 4658 = 5467 Chinese word vectors.

The model is trained by reusing the section 2 Embedding + RNN LSTM model with epochs set to 20. Results:

Train on 3200 samples, validate on 800 samples
Epoch 1/20
3200/3200 [==============================] - 18s 5ms/step - loss: 1.1565 - acc: 0.5000 - val_loss: 0.8308 - val_acc: 0.6700
Epoch 2/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.8307 - acc: 0.6653 - val_loss: 0.7311 - val_acc: 0.7163
Epoch 3/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.7342 - acc: 0.7109 - val_loss: 0.8210 - val_acc: 0.6613
Epoch 4/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.6248 - acc: 0.7591 - val_loss: 0.7152 - val_acc: 0.7175
Epoch 5/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.5598 - acc: 0.7812 - val_loss: 0.6750 - val_acc: 0.7350
Epoch 6/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.4954 - acc: 0.8069 - val_loss: 0.6367 - val_acc: 0.7625
Epoch 7/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.4472 - acc: 0.8341 - val_loss: 0.6642 - val_acc: 0.7625
Epoch 8/20
3200/3200 [==============================] - 15s 5ms/step - loss: 0.4024 - acc: 0.8453 - val_loss: 0.6992 - val_acc: 0.7488
Epoch 9/20
3200/3200 [==============================] - 15s 5ms/step - loss: 0.3542 - acc: 0.8656 - val_loss: 0.7054 - val_acc: 0.7650
Epoch 10/20
3200/3200 [==============================] - 15s 5ms/step - loss: 0.3318 - acc: 0.8694 - val_loss: 0.8433 - val_acc: 0.7300
Epoch 11/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.3042 - acc: 0.8809 - val_loss: 0.8931 - val_acc: 0.7113
Epoch 12/20
3200/3200 [==============================] - 15s 5ms/step - loss: 0.2804 - acc: 0.8919 - val_loss: 0.7937 - val_acc: 0.7525
Epoch 13/20
3200/3200 [==============================] - 15s 5ms/step - loss: 0.2454 - acc: 0.9106 - val_loss: 0.8601 - val_acc: 0.7475
Epoch 14/20
3200/3200 [==============================] - 15s 5ms/step - loss: 0.2255 - acc: 0.9131 - val_loss: 0.8541 - val_acc: 0.7450
Epoch 15/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.1984 - acc: 0.9266 - val_loss: 0.9004 - val_acc: 0.7550
Epoch 16/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.1934 - acc: 0.9300 - val_loss: 0.8683 - val_acc: 0.7612
Epoch 17/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.1783 - acc: 0.9291 - val_loss: 0.9406 - val_acc: 0.7525
Epoch 18/20
3200/3200 [==============================] - 15s 5ms/step - loss: 0.1735 - acc: 0.9353 - val_loss: 0.9408 - val_acc: 0.7600
Epoch 19/20
3200/3200 [==============================] - 15s 5ms/step - loss: 0.1453 - acc: 0.9519 - val_loss: 1.0023 - val_acc: 0.7612
Epoch 20/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.1463 - acc: 0.9425 - val_loss: 1.0696 - val_acc: 0.7575

Training accuracy reaches 0.9519, while validation accuracy peaks at 0.7650; overfitting appears here as well.

 

5. Classical Classifiers

Besides deep learning, we can also try classical models such as naive Bayes, random forests and support vector machines, and see how their accuracy compares.

from sklearn.model_selection import train_test_split

# Chinese text preprocessing
def TextProcessing(data_list, class_list):
    # Split into training and validation sets
    train_data_list, test_data_list, train_class_list, test_class_list = train_test_split(data_list, class_list, test_size=0.25)

    # Count word frequencies in the training set
    all_words_dict = {}
    for word_list in train_data_list:
        for word in word_list:
            if word in all_words_dict.keys():
                all_words_dict[word] += 1
            else:
                all_words_dict[word] = 1

    # Sort by frequency, descending
    all_words_tuple_list = sorted(all_words_dict.items(), key=lambda f: f[1], reverse=True)
    all_words_list, all_words_nums = zip(*all_words_tuple_list)  # unzip
    all_words_list = list(all_words_list)                        # convert to a list

    return all_words_list, train_data_list, test_data_list, train_class_list, test_class_list

# The samples must already be segmented with jieba; take the top N words as features
def Words_dict(all_words_list, deleteN, stopwords_set=set()):
    feature_words = []  # feature word list
    n = 1
    for t in range(deleteN, len(all_words_list), 1):
        if n > 1000:  # cap feature_words at 1000 dimensions
            break
        # keep a word as a feature if it is not a number, not a stopword, and 2 to 4 characters long
        if not all_words_list[t].isdigit() and all_words_list[t] not in stopwords_set and 1 < len(all_words_list[t]) < 5:
            feature_words.append(all_words_list[t])
        n += 1
    return feature_words

# Build One-hot feature vectors
def TextFeatures(train_data_list, test_data_list, feature_words):
    def text_features(text, feature_words):  # 1 if the feature word appears in the text, else 0
        text_words = set(text)
        features = [1 if word in text_words else 0 for word in feature_words]
        return features

    train_feature_list = [text_features(text, feature_words) for text in train_data_list]
    test_feature_list = [text_features(text, feature_words) for text in test_data_list]
    return train_feature_list, test_feature_list

Define the classifiers:

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Multinomial naive Bayes classifier
def TextClassifier_MultinomialNB(train_feature_list, test_feature_list, train_class_list, test_class_list):
    classifier = MultinomialNB().fit(train_feature_list, train_class_list)
    test_accuracy = classifier.score(test_feature_list, test_class_list)
    return test_accuracy

# Support vector machine classifier (parameters left at their defaults are omitted)
def TextClassifier_SVC(train_feature_list, test_feature_list, train_class_list, test_class_list):
    clf = SVC(C=150, kernel='rbf', gamma='auto', tol=0.001, cache_size=300, shrinking=False)
    clf.fit(train_feature_list, train_class_list)
    test_accuracy = clf.score(test_feature_list, test_class_list)
    return test_accuracy

# Random forest classifier
def TextClassifier_RF(train_feature_list, test_feature_list, train_class_list, test_class_list):
    forest = RandomForestClassifier(n_estimators=500, random_state=5, warm_start=False,
                                    min_impurity_decrease=0.0,
                                    min_samples_split=15)
    forest.fit(train_feature_list, train_class_list)
    test_accuracy = forest.score(test_feature_list, test_class_list)
    return test_accuracy

Let's see how each classifier performs:

if __name__ == '__main__':
    # Text preprocessing
    all_words_list, train_data_list, test_data_list, train_class_list, test_class_list = TextProcessing(data_list, class_list)
    feature_words = Words_dict(all_words_list, 0, stopwords_set)
    train_feature_list, test_feature_list = TextFeatures(train_data_list, test_data_list, feature_words)

    # Compare the accuracy of each classifier
    test_accuracy_MultinomialNB = TextClassifier_MultinomialNB(train_feature_list, test_feature_list, train_class_list, test_class_list)
    print("MultinomialNB Accuracy:", test_accuracy_MultinomialNB)

    test_accuracy_SVC = TextClassifier_SVC(train_feature_list, test_feature_list, train_class_list, test_class_list)
    print("SVC Accuracy:", test_accuracy_SVC)

    test_accuracy_RF = TextClassifier_RF(train_feature_list, test_feature_list, train_class_list, test_class_list)
    print("RF Accuracy:", test_accuracy_RF)

Results:

MultinomialNB Accuracy: 0.71
SVC Accuracy: 0.7542857142857143
RF Accuracy: 0.7557142857142857

Clearly, on a small dataset, classical classifiers can match the deep learning results.

 

Summary

The experiments above show that classical classifiers perform well on small datasets. Among the deep learning runs, the models with pretrained Chinese word vectors are markedly more accurate than the One-hot model. Curiously, the Baidu vectors cover less than half as many vocabulary words as the Tencent vectors, yet train somewhat better; this deserves further experiments.

Finally, the Chinese Word Vectors project has also been released (link: https://github.com/Embedding/Chinese-Word-Vectors); interested readers can run further experiments with it.

Happy learning!

 

Follow the WeChat official account: a_white_deer

Technical questions and discussion welcome: [email protected]

 
