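This is the fourth check-in of my NLP journey. Today's material introduces the FastText model, so here is a brief overview of it.
FastText is a neural-network model for text classification. It first embeds every word in the text sequence, then combines the word vectors into a single sentence vector (in practice by averaging them), and finally feeds that vector into the network for training.
As shown in the figure above, FastText passes the sentence vector through the network's hidden layer and produces the classification output.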
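To make the description above concrete, here is a minimal pure-Python sketch of the forward pass: look up an embedding per token, average them into a sentence vector, then apply a linear layer and a softmax. The token ids, dimensions, and random weights are made up for illustration. Real FastText also adds n-gram features, and with loss="hs" it uses a hierarchical softmax rather than the plain softmax shown here.

```python
import math
import random

def sentence_vector(tokens, embeddings):
    """Average the embeddings of all tokens into one sentence vector."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for tok in tokens:
        for i, value in enumerate(embeddings[tok]):
            total[i] += value
    return [t / len(tokens) for t in total]

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

def classify(tokens, embeddings, weights):
    """weights holds one row of length dim per class (a single linear layer)."""
    sent = sentence_vector(tokens, embeddings)
    scores = [sum(w * s for w, s in zip(row, sent)) for row in weights]
    return softmax(scores)

random.seed(0)
vocab = ["3750", "648", "900"]  # hypothetical token ids
dim, num_classes = 4, 3         # illustrative sizes only
embeddings = {w: [random.uniform(-1, 1) for _ in range(dim)] for w in vocab}
weights = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_classes)]

probs = classify(["3750", "648", "900"], embeddings, weights)
print(probs)  # one probability per class
```

During training, both the embeddings and the output weights would be updated by gradient descent; the sketch only covers prediction.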
The code for training and evaluating the model is as follows:
import fasttext
import pandas as pd
from sklearn.metrics import f1_score

# Convert the data to the format FastText expects: every label gets a "__label__" prefix
train_df = pd.read_csv('./data/train_set.csv', sep='\t', nrows=15000)
train_df['label_ft'] = '__label__' + train_df['label'].astype(str)
# Write all but the last 5,000 rows as the training file; the last 5,000 are held out for validation
train_df[['text', 'label_ft']].iloc[:-5000].to_csv('train.csv', index=False, header=False, sep='\t')

model = fasttext.train_supervised('train.csv', lr=1.0, wordNgrams=2,
                                  verbose=2, minCount=1, epoch=25, loss="hs")
# Predict the held-out rows and strip the "__label__" prefix from the top prediction
val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[-5000:]['text']]
print(f1_score(train_df['label'].values[-5000:].astype(str), val_pred, average='macro'))
# Output: 0.8254489506623065
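The macro F1 score at this point is about 0.825. Starting from this baseline, I tune the parameters and observe how the score changes.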
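For readers unfamiliar with the metric, the macro-averaged F1 that f1_score(..., average='macro') reports can be sketched by hand: compute F1 separately for each class, then take the unweighted mean. The labels below are toy data, not taken from the dataset, and the function is an illustration rather than scikit-learn's implementation.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example with three classes
y_true = ["0", "0", "1", "1", "2", "2"]
y_pred = ["0", "1", "1", "1", "2", "0"]
print(macro_f1(y_true, y_pred))
```

Because every class counts equally, a rare class that is predicted badly pulls the macro score down as much as a frequent one would.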
First, the minCount parameter: increase it step by step and observe the result.
list_minCount = [1, 3, 5, 7, 9]
for min_count in list_minCount:
    # Retrain with only minCount changed, then score on the same 5,000 held-out rows
    model = fasttext.train_supervised('train.csv', lr=1.0, wordNgrams=2,
                                      verbose=2, minCount=min_count, epoch=25, loss="hs")
    val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[-5000:]['text']]
    print(f1_score(train_df['label'].values[-5000:].astype(str), val_pred, average='macro'))
# Output: 0.8254489506623065,0.8238417806176509,0.8269662953096721,0.8249239914825777
The F1 score moves up and down within a narrow range, so a larger minCount is not necessarily better.
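To see why minCount can help or hurt, here is a small sketch of what the parameter controls: words that occur fewer than minCount times are dropped from the vocabulary. The build_vocab helper and the toy texts are hypothetical and only illustrate the idea; they are not FastText's own code.

```python
from collections import Counter

def build_vocab(texts, min_count):
    """Keep only the tokens that occur at least min_count times."""
    counts = Counter(tok for text in texts for tok in text.split())
    return {tok for tok, count in counts.items() if count >= min_count}

# Toy corpus of space-separated token ids
texts = ["12 7 7 33", "7 12 90", "7 33 41"]
for min_count in [1, 2, 3]:
    print(min_count, sorted(build_vocab(texts, min_count)))
```

Raising the threshold removes rare, possibly noisy tokens, but it can also discard rare tokens that are informative for a class, which is consistent with the score not moving in one direction.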
Next, vary the lr (learning rate) parameter and observe the result:
lr_list = [1.0, 1.5, 2.0, 2.5, 3.0]
for lr in lr_list:
    # Pass the loop variable as the learning rate (it was hard-coded to 1.5 before)
    model = fasttext.train_supervised('train.csv', lr=lr, wordNgrams=2,
                                      verbose=2, minCount=1, epoch=25, loss="hs")
    val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[-5000:]['text']]
    print(f1_score(train_df['label'].values[-5000:].astype(str), val_pred, average='macro'))
# Output: 0.8254489506623065,0.8253173427147923,0.8315348831695133,0.832263221325885,0.8203914829799589
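The score varies with the learning rate: it peaks at about 0.832 around lr=2.5 and falls to about 0.820 at lr=3.0, so here too a larger value is not automatically better.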
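Both experiments above follow the same pattern, so the loop can be factored into a small helper that tries each value and reports the best one. The sweep function and its train_and_score argument are hypothetical names introduced here. To keep the sketch runnable without retraining, it replays the lr scores printed above, assuming they are listed in the same order as lr_list.

```python
def sweep(values, train_and_score):
    """Score every candidate value and return the best one with all results."""
    results = {value: train_and_score(value) for value in values}
    best = max(results, key=results.get)
    return best, results

# Scores reported above for the lr sweep (order assumed to match lr_list)
reported = {
    1.0: 0.8254489506623065,
    1.5: 0.8253173427147923,
    2.0: 0.8315348831695133,
    2.5: 0.832263221325885,
    3.0: 0.8203914829799589,
}

best_lr, results = sweep(list(reported), reported.get)
print(best_lr, results[best_lr])  # 2.5 0.832263221325885
```

In a real run, train_and_score would wrap the fasttext.train_supervised call and the f1_score evaluation. Since each setting is trained once and scored on a single validation split, differences this small may partly be noise.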