Text classification with fastText in Python

References

https://github.com/facebookresearch/fastText/tree/master/python#train_supervised-parameters (official GitHub repository)
https://tianchi.aliyun.com/course/316?spm=5176.21206777.J_3641663050.16.44ea17c9FVba3m (Tianchi tutorial)

Parameters

input             # training file path (required)
lr                # learning rate [0.1]
dim               # size of word vectors [100]
ws                # size of the context window [5]
epoch             # number of epochs [5]
minCount          # minimal number of word occurrences [1]
minCountLabel     # minimal number of label occurrences [1]
minn              # min length of char ngram [0]
maxn              # max length of char ngram [0]
neg               # number of negatives sampled [5]
wordNgrams        # max length of word ngram [1]
loss              # loss function {ns, hs, softmax, ova} [softmax]
bucket            # number of buckets [2000000]
thread            # number of threads [number of cpus]
lrUpdateRate      # change the rate of updates for the learning rate [100]
t                 # sampling threshold [0.0001]
label             # label prefix ['__label__']
verbose           # verbose [2]
pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []
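
To make minn and maxn more concrete, here is a minimal sketch of how character n-grams can be cut from a word, following the scheme described in the fastText paper, where the word is wrapped in '<' and '>' before slicing. The function name char_ngrams is hypothetical and this is not the library's own code; it ignores details such as hashing the n-grams into the bucket slots, and it assumes that 0 means "no character n-grams".

```python
def char_ngrams(word, minn, maxn):
    """Return character n-grams of length minn..maxn for a word wrapped in '<' and '>'."""
    if minn <= 0 or maxn <= 0:
        return []  # assumption: 0 (the default listed above) disables character n-grams
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n over the wrapped word
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", 3, 3))
```

With minn=3 and maxn=3 the word "where" is split into trigrams that include the boundary markers, so prefixes and suffixes are kept distinct from n-grams in the middle of a word.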

Code

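import fasttext

# Train a supervised classifier; each line of the file is "__label__<label> <text>"
model = fasttext.train_supervised('data.train.txt')

print(model.words)   # vocabulary built from the training file
print(model.labels)  # labels found in the training file

# predict returns a (labels, probabilities) pair; by default only the top label
print(model.predict("Which baking dish is best to bake a banana bread ?"))
# .bin is the usual extension for a full model; .ftz is conventionally used for quantized models
model.save_model("model_filename.bin")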

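from sklearn.metrics import f1_score

# Write the training split as tab-separated "text<TAB>__label__<label>" lines.
# df_all is a pandas DataFrame with 'msg' and 'label' columns; the last 5000 rows are held out.
df_all['label_ft'] = '__label__' + df_all['label'].astype(str)
df_all[['msg', 'label_ft']].iloc[:-5000].to_csv('train.csv', index=False, header=False, sep='\t')
# Train (loss must be one of ns, hs, softmax, ova; "hs" is hierarchical softmax)
model = fasttext.train_supervised('train.csv', lr=0.05, wordNgrams=5, verbose=2, minCount=1, epoch=20,
                                  loss="hs")
# Predict on the held-out rows and strip the '__label__' prefix
val_pred = [model.predict(x)[0][0].split('__')[-1] for x in df_all['msg'].values[-5000:]]
# Score; the true labels are cast to str because the predictions are strings
print(f1_score(df_all['label'].values[-5000:].astype(str), val_pred, average='macro'))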
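
The average='macro' option used in the scoring step computes F1 separately for each class and then takes the unweighted mean, so a rare class counts as much as a frequent one. Below is a minimal pure-Python sketch of that calculation. The function name macro_f1 and the toy labels are hypothetical; the sketch assumes a class with no true positives, false positives, or false negatives scores 0, and it averages over every label that appears in either list.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); treated as 0 when the denominator is 0
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, kept as strings like the predictions returned by the model
y_true = ["0", "0", "1", "1", "2"]
y_pred = ["0", "1", "1", "1", "0"]
print(macro_f1(y_true, y_pred))
```

In this toy example class "2" is never predicted, so its F1 of 0 pulls the macro average down even though most rows are classified correctly.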

Notes

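  • input: labels in the training file must start with the '__label__' prefix; if your data uses a different prefix, set the label parameter to match it.
  • loss: four loss functions are available: ns (negative sampling), hs (hierarchical softmax), softmax (the default) and ova (one-vs-all). For the softmax formula see https://blog.csdn.net/gbz3300255/article/details/108470972 ; for ova see https://wenku.baidu.com/view/b6f675a2ef3a87c24028915f804d2b160a4e8673.html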
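
As a concrete illustration of the label prefix note, the sketch below formats one training line and strips the prefix from a predicted label. The helper names to_fasttext_line and strip_label are hypothetical. Collapsing whitespace is an assumption made here because each example has to sit on a single line of the training file, and predict likewise processes one line at a time.

```python
LABEL_PREFIX = "__label__"  # default value of the label parameter


def to_fasttext_line(text, label, prefix=LABEL_PREFIX):
    """Format one example as '<prefix><label> <text>' on a single line."""
    clean = " ".join(str(text).split())  # collapse newlines, tabs and repeated spaces
    return f"{prefix}{label} {clean}"


def strip_label(predicted, prefix=LABEL_PREFIX):
    """Turn a predicted label such as '__label__3' back into '3'."""
    return predicted[len(prefix):] if predicted.startswith(prefix) else predicted


print(to_fasttext_line("Which baking dish\nis best ?", "baking"))
print(strip_label("__label__baking"))
```

Removing the prefix by its length, rather than splitting on '__', keeps labels that themselves contain double underscores intact.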
