Deep learning can be used for text representation, mapping text into a low-dimensional space. Typical examples include FastText, Word2Vec, and BERT. Here we introduce FastText.
FastText is a typical deep-learning method for word-vector representation. It is very simple: an Embedding layer maps each word into a dense space, the embeddings of all the words in a sentence are averaged, and classification is done on top of that average.
FastText is therefore a three-layer neural network: an input layer, a hidden layer, and an output layer.
The FastText network structure, implemented in Keras:
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import GlobalAveragePooling1D
from keras.layers import Dense

VOCAB_SIZE = 2000
EMBEDDING_DIM = 100
MAX_WORDS = 500
CLASS_NUM = 5

def build_fastText():
    model = Sequential()
    # The Embedding layer maps each word ID to an EMBEDDING_DIM-dimensional vector
    model.add(Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAX_WORDS))
    # GlobalAveragePooling1D averages the embeddings of all words in the document
    model.add(GlobalAveragePooling1D())
    # Softmax output layer (the real fastText uses hierarchical softmax here) yields the class distribution
    model.add(Dense(CLASS_NUM, activation='softmax'))
    # Define the loss, optimizer, and classification metric
    model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy'])
    return model

model = build_fastText()
print(model.summary())
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 500, 100)          200000
_________________________________________________________________
global_average_pooling1d_1 ( (None, 100)               0
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 505
=================================================================
Total params: 200,505
Trainable params: 200,505
Non-trainable params: 0
_________________________________________________________________
None
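The model above is only built, never trained. As a minimal sketch of how it would be fit (with randomly generated word-ID sequences standing in for a real tokenized corpus, so the data here is purely illustrative), the inputs are integer sequences padded to MAX_WORDS and the labels are one-hot vectors:

import numpy as np
from keras.utils import to_categorical

# Illustrative stand-in data, not the competition corpus:
# 1000 documents, each a sequence of MAX_WORDS word IDs in [1, VOCAB_SIZE)
X = np.random.randint(1, VOCAB_SIZE, size=(1000, MAX_WORDS))
y = to_categorical(np.random.randint(0, CLASS_NUM, size=1000), num_classes=CLASS_NUM)

# Train with the categorical cross-entropy loss set in compile() above
model.fit(X, y, batch_size=32, epochs=5, validation_split=0.1)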
On text classification tasks, FastText outperforms TF-IDF; see the paper Bag of Tricks for Efficient Text Classification.
FastText trains quickly on a CPU, and the best way to use it in practice is the official open-source implementation (installable with pip install fasttext).
import pandas as pd
import fasttext
from sklearn.metrics import f1_score

# Convert the labels to the format fastText expects: '__label__<class>'
train_df = pd.read_csv('train_set.csv', sep='\t', nrows=15000)
train_df['label_ft'] = '__label__' + train_df['label'].astype(str)
# Use the first 10,000 rows for training; keep the last 5,000 for validation
train_df[['text', 'label_ft']].iloc[:-5000].to_csv('train.csv', index=None, header=None, sep='\t')

model = fasttext.train_supervised('train.csv', lr=1.0, wordNgrams=2,
                                  verbose=2, minCount=1, epoch=25, loss='hs')

# predict() returns ((label,), (probability,)); strip the '__label__' prefix
val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[-5000:]['text']]
print(f1_score(train_df['label'].values[-5000:].astype(str), val_pred, average='macro'))
# 0.8252748964565065
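To turn this into a leaderboard submission, one would run the same prediction loop over the test set. A minimal sketch, assuming a tab-separated test_a.csv with a text column and a single-column submission file (both file names and the format are assumptions, not given in this section):

test_df = pd.read_csv('test_a.csv', sep='\t')  # assumed test-set file name
test_pred = [model.predict(x)[0][0].split('__')[-1] for x in test_df['text']]
pd.DataFrame({'label': test_pred}).to_csv('submit.csv', index=None)  # assumed submission format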
For the full list of fastText's training parameters, see the blog post fasttext(3)-- API 文档 & 参数说明.
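For reference, here is the same train_supervised call as above with each parameter annotated (these are all documented fastText options, nothing new introduced):

model = fasttext.train_supervised(
    'train.csv',
    lr=1.0,        # learning rate
    wordNgrams=2,  # use word bigrams on top of unigrams
    verbose=2,     # print training progress
    minCount=1,    # no minimum-frequency cutoff: keep every word
    epoch=25,      # number of passes over the training data
    loss='hs',     # hierarchical softmax, a fast approximation of the full softmax
)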
Here we use 10-fold cross-validation: each fold trains on 9/10 of the data and validates on the remaining 1/10. Note that each fold's split must preserve the label distribution of the full dataset.
The dataset-splitting code is as follows:
import pandas as pd
from sklearn.model_selection import StratifiedKFold

train_df = pd.read_csv('./train_set.csv', sep='\t')
skf = StratifiedKFold(n_splits=10)  # stratified sampling keeps the label distribution per fold
for n_fold, (tr_idx, val_idx) in enumerate(skf.split(train_df['text'], train_df['label'])):
    tr_x, tr_y = train_df['text'].iloc[tr_idx], train_df['label'].iloc[tr_idx]
    val_x, val_y = train_df['text'].iloc[val_idx], train_df['label'].iloc[val_idx]
    tr_y = '__label__' + tr_y.astype(str)
    traindata = pd.DataFrame(list(zip(tr_x.values, tr_y.values)))
    traindata.to_csv(f'./k_fold_all/train_split{n_fold}.csv', index=None, header=['text', 'label_ft'], sep='\t')
    testdata = pd.DataFrame(list(zip(val_x.values, val_y.values)))
    testdata.to_csv(f'./k_fold_all/test_split{n_fold}.csv', index=None, header=['text', 'label'], sep='\t')
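To check the stratification claim, a small sanity check (a sketch reusing train_df and skf from above) compares each fold's validation-label distribution with the full dataset's:

# Each fold's label distribution should match the overall distribution almost exactly
overall = train_df['label'].value_counts(normalize=True).sort_index()
for n_fold, (tr_idx, val_idx) in enumerate(skf.split(train_df['text'], train_df['label'])):
    fold_dist = train_df['label'].iloc[val_idx].value_counts(normalize=True).sort_index()
    print(n_fold, (overall - fold_dist).abs().max())  # close to 0 for every fold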
Since disk space on Alibaba's DSW is limited, the splits cannot all be written out at once; each fold's files are deleted right after use, as in the following code:
import os
import fasttext
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

train_df = pd.read_csv('./train_set.csv', sep='\t')
skf = StratifiedKFold(n_splits=10)  # stratified sampling
val_f1 = []
for n_fold, (tr_idx, val_idx) in enumerate(skf.split(train_df['text'], train_df['label'])):
    print('start')
    tr_x, tr_y = train_df['text'].iloc[tr_idx], train_df['label'].iloc[tr_idx]
    val_x, val_y = train_df['text'].iloc[val_idx], train_df['label'].iloc[val_idx]
    tr_y = '__label__' + tr_y.astype(str)
    # Write this fold's training file, train on it, then delete it to free disk space
    traindata = pd.DataFrame(list(zip(tr_x.values, tr_y.values)))
    traindata.to_csv(f'./k_fold_all/train_split{n_fold}.csv', index=None, header=['text', 'label_ft'], sep='\t')
    model = fasttext.train_supervised(f'./k_fold_all/train_split{n_fold}.csv', lr=1.0, wordNgrams=2,
                                      verbose=2, minCount=1, epoch=25, loss='hs')
    os.remove(f'./k_fold_all/train_split{n_fold}.csv')
    # Same for the validation file: write, score, delete
    testdata = pd.DataFrame(list(zip(val_x.values, val_y.values)))
    testdata.to_csv(f'./k_fold_all/test_split{n_fold}.csv', index=None, header=['text', 'label'], sep='\t')
    val_df = pd.read_csv(f'./k_fold_all/test_split{n_fold}.csv', sep='\t')
    val_pred = [model.predict(x)[0][0].split('__')[-1] for x in val_df['text']]
    val_f1.append(f1_score(val_df['label'].values.astype(str), val_pred, average='macro'))
    os.remove(f'./k_fold_all/test_split{n_fold}.csv')
    print(f'the f1_score of {n_fold} training is:', val_f1[n_fold])
print('The average f1_score is', sum(val_f1) / len(val_f1))
The results are as follows:
start
Read 163M words
Number of words: 6828
Number of labels: 14
Progress: 100.0% words/sec/thread: 2244679 lr: 0.000000 avg.loss: 0.088328 ETA: 0h 0m 0s
the f1_score of 0 training is: 0.9120997907905529
start
Read 163M words
Number of words: 6817
Number of labels: 14
Progress: 100.0% words/sec/thread: 2301944 lr: 0.000000 avg.loss: 0.094412 ETA: 0h 0m 0s
the f1_score of 1 training is: 0.9167706177720959
start
Read 163M words
Number of words: 6840
Number of labels: 14
Progress: 100.0% words/sec/thread: 2509204 lr: 0.000000 avg.loss: 0.091381 ETA: 0h 0m 0s
the f1_score of 2 training is: 0.9143350076699093
start
Read 163M words
Number of words: 6819
Number of labels: 14
Progress: 100.0% words/sec/thread: 2256073 lr: 0.000000 avg.loss: 0.093455 ETA: 0h 0m 0s
the f1_score of 3 training is: 0.9125226732362589
start
Read 163M words
Number of words: 6803
Number of labels: 14
Progress: 100.0% words/sec/thread: 2322227 lr: 0.000000 avg.loss: 0.092105 ETA: 0h 0m 0s
the f1_score of 4 training is: 0.9166235998880595
start
Read 163M words
Number of words: 6820
Number of labels: 14
Progress: 100.0% words/sec/thread: 2236955 lr: 0.000000 avg.loss: 0.087768 ETA: 0h 0m 0s
the f1_score of 5 training is: 0.9157415100282014
start
Read 163M words
Number of words: 6829
Number of labels: 14
Progress: 100.0% words/sec/thread: 2935532 lr: 0.000000 avg.loss: 0.091475 ETA: 0h 0m 0s
the f1_score of 6 training is: 0.916523049067497
start
Read 163M words
Number of words: 6811
Number of labels: 14
Progress: 100.0% words/sec/thread: 2321420 lr: 0.000000 avg.loss: 0.095948 ETA: 0h 0m 0s
the f1_score of 7 training is: 0.9097686032531304
start
Read 163M words
Number of words: 6820
Number of labels: 14
Progress: 100.0% words/sec/thread: 2558949 lr: 0.000000 avg.loss: 0.091643 ETA: 0h 0m 0s
the f1_score of 8 training is: 0.9136261632461267
start
Read 163M words
Number of words: 6819
Number of labels: 14
Progress: 100.0% words/sec/thread: 2461429 lr: 0.000000 avg.loss: 0.093095 ETA: 0h 0m 0s
the f1_score of 9 training is: 0.9167282203907307
Training on 15,000 rows with the officially suggested parameters scores 87.69 on the validation set and 88.27 on the leaderboard.
Using all the data, the validation score is 91.21 and the leaderboard score is 91.29.
The offline and online scores are close, so test set A and the training set are likely identically distributed.
A quick hyperparameter sweep on the 15,000-row subset:
import pandas as pd
import fasttext
from sklearn.metrics import f1_score

train_df = pd.read_csv('train_set.csv', sep='\t', nrows=15000)
train_df['label_ft'] = '__label__' + train_df['label'].astype(str)
train_df[['text', 'label_ft']].iloc[:-5000].to_csv('train.csv', index=None, header=None, sep='\t')

# Grid search over learning rate, loss function, and number of epochs
for lr in [1.0, 0.1, 0.01]:
    for loss in ['hs', 'ns', 'softmax']:
        for epoch in [5, 10, 15, 20]:
            model = fasttext.train_supervised('train.csv', lr=lr, wordNgrams=2,
                                              verbose=2, minCount=1, epoch=epoch, loss=loss)
            val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[-5000:]['text']]
            print(f'loss:{loss},lr:{lr},epoch:{epoch}')
            print(f1_score(train_df['label'].values[-5000:].astype(str), val_pred, average='macro'))
The sweep results, tabulated (validation macro-F1 on the last 5,000 rows):

loss      lr      epoch   macro-F1
hs        1.0     5       0.7620961931962223
hs        1.0     10      0.8035277121396049
hs        1.0     15      0.8258956670449468
hs        1.0     20      0.8247007076560852
ns        1.0     5       0.7507353242633579
ns        1.0     10      0.8278838952561169
ns        1.0     15      0.8623175031934739
ns        1.0     20      0.8747259381025999
softmax   1.0     5       0.8300330441795662
softmax   1.0     10      0.8710713759559711
softmax   1.0     15      0.8787511628050294
softmax   1.0     20      0.879324856171109
hs        0.1     5       0.13440858698369465
hs        0.1     10      0.3029053780828899
hs        0.1     15      0.43312510966014184
hs        0.1     20      0.5731313956699715
ns        0.1     5       0.10784658055298713
ns        0.1     10      0.25715870734183666
ns        0.1     15      0.3762214074337213
ns        0.1     20      0.46314232252732657
softmax   0.1     5       0.1308005358612768
softmax   0.1     10      0.32929993852967476
softmax   0.1     15      0.49923308082195517
softmax   0.1     20      0.6453246580387494
hs        0.01    5       0.024066902663277188
hs        0.01    10      0.024066902663277188
hs        0.01    15      0.024066902663277188
hs        0.01    20      0.024066902663277188
ns        0.01    5       0.0005691519635742744
ns        0.01    10      0.022930058524417155
ns        0.01    15      0.022930058524417155
ns        0.01    20      0.022930058524417155
softmax   0.01    5       0.022949324243224427
softmax   0.01    10      0.022930058524417155
softmax   0.01    15      0.022930058524417155
softmax   0.01    20      0.022930058524417155

The best setting in this sweep is loss='softmax' with lr=1.0 and epoch=20 (macro-F1 0.8793); with lr=0.01 the model barely learns at all, regardless of loss or epochs.
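As an alternative to a manual grid like this, fastText's Python API (version 0.9.1 and later) can tune hyperparameters automatically against a held-out file. A minimal sketch, assuming a valid.csv written in the same __label__ format as train.csv (the file name and time budget here are illustrative):

import fasttext
# fastText searches lr, epoch, wordNgrams, etc. to maximize the score on the validation file
model = fasttext.train_supervised(
    'train.csv',
    autotuneValidationFile='valid.csv',  # assumed held-out file in the same format
    autotuneDuration=300,                # search budget in seconds (illustrative)
)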