fastText Text Classification

How fastText text classification works

1. Tokenize the N documents to obtain a vocabulary.
2. Extend the vocabulary with word-level / character-level n-grams (with some hashing tricks to keep the vocabulary from exploding).
3. For a single document, obtain the index vector of its tokens and n-grams.
4. Look up embeddings for that index vector, giving the document's sequence of word embedding vectors.
5. Average the embedding vectors from the previous step over the words.
6. Feed the averaged vector into a hierarchical softmax (a variant of multi-class softmax); a minimal sketch of the whole pipeline follows this list.
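The sketch below walks through these steps with NumPy. It is illustrative only: the vocabulary, dimensions and bucket count are toy values, Python's built-in hash stands in for fastText's n-gram hashing, and a plain softmax replaces the hierarchical softmax of step 6.

import numpy as np

# Toy vocabulary built from the corpus (step 1); n-grams are hashed into extra buckets (step 2).
VOCAB = {"which": 0, "baking": 1, "dish": 2, "is": 3, "best": 4}
BUCKETS = 1000          # hash buckets for n-grams, keeps the table from exploding
EMBED_DIM = 10
NUM_LABELS = 3

rng = np.random.default_rng(0)
E = rng.normal(size=(len(VOCAB) + BUCKETS, EMBED_DIM))   # embedding table (step 4)
W = rng.normal(size=(EMBED_DIM, NUM_LABELS))             # output layer (real fastText uses
                                                         # hierarchical softmax here)

def ngram_ids(tokens, n=2):
    # steps 2-3: hash word n-grams into the extra buckets of the vocabulary
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return [len(VOCAB) + hash(g) % BUCKETS for g in grams]

def predict(text):
    tokens = text.lower().split()                         # step 3: tokenize
    ids = [VOCAB[t] for t in tokens if t in VOCAB] + ngram_ids(tokens)
    doc_vec = E[ids].mean(axis=0)                         # steps 4-5: embed and average
    logits = doc_vec @ W                                  # step 6: linear layer + softmax
    return np.exp(logits) / np.exp(logits).sum()

print(predict("Which baking dish is best"))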

A fastText text classification modeling and tuning example

Install fastText: https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
Download the dataset (cooking.stackexchange.tar.gz): https://download.csdn.net/download/cymy001/11222071
Dataset description: the dataset consists of example questions from the cooking section of the Stack Exchange site, together with their category labels; we will build a classifier on it that automatically recognizes the category of a cooking question. After extracting cooking.stackexchange.tar.gz, each line of the text file contains a list of labels (every label starts with the __label__ prefix) followed by the corresponding document. The model is trained to predict the labels of a given document.
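A line in this format looks roughly like the following (an illustrative example of the layout, not necessarily a verbatim line from the archive):

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe ?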

step1: baseline
#inspect the full dataset
prodeMBP:fastText_learn pro$ wc cooking.stackexchange.txt
   15404  169582 1401900 cooking.stackexchange.txt
#split into a training set and a validation set; the validation set is used to estimate how well the learned classifier handles new data
prodeMBP:fastText_learn pro$ head -n 12404 cooking.stackexchange.txt > cooking.train
prodeMBP:fastText_learn pro$ tail -n 3000 cooking.stackexchange.txt > cooking.valid 
#train the model (-input takes the training data, -output gives the path and filename for saving the model)
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext supervised -input cooking.train -output model_cooking
Read 0M words
Number of words:  14543
Number of labels: 735
Progress: 0.1%  words/sec/thread: 50053  lr: 0.099944  loss: 15.778889
Progress: 0.1%  words/sec/thread: 77722  lr: 0.099885  loss: 16.707058
Progress: 0.2%  words/sec/thread: 94088  lr: 0.099843  loss: 17.041201
...
Progress: 99.1%  words/sec/thread: 81627  lr: 0.000851  loss: 9.994794
Progress: 100.0%  words/sec/thread: 81663  lr: 0.000000  loss: 9.992051  eta: 0h0m 
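As an aside, more recent fastText releases also ship official Python bindings (pip install fasttext); assuming those bindings rather than the 0.1.0 CLI used here, the same baseline training would look roughly like this:

import fasttext

# train a baseline supervised classifier on the same file, all hyperparameters at their defaults
model = fasttext.train_supervised(input="cooking.train")
model.save_model("model_cooking.bin")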
#test the trained model file model_cooking.bin on a couple of examples
#(1) predict 1 label
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext predict model_cooking.bin -
Which baking dish is best to bake a banana bread ?
__label__baking
Why not put knives in the dishwasher?
__label__food-safety
#(2) predict 5 labels
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext predict model_cooking.bin - 5
Why not put knives in the dishwasher?
__label__food-safety __label__baking __label__bread __label__equipment __label__substitutions

On Stack Exchange, this question is tagged with three labels: equipment, cleaning and knives. One of the five labels predicted by the model is correct, giving a precision of 0.20. Of the three true labels, only equipment is predicted by the model, giving a recall of 0.33.
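Those two numbers follow directly from the definitions of precision and recall at k, as this quick check shows:

predicted = {"food-safety", "baking", "bread", "equipment", "substitutions"}   # the model's 5 labels
true = {"equipment", "cleaning", "knives"}                                     # the gold labels
hits = len(predicted & true)            # only "equipment" matches -> 1
print(hits / len(predicted))            # precision@5 = 1/5 = 0.20
print(round(hits / len(true), 2))       # recall@5    = 1/3 ≈ 0.33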

#evaluate model_cooking.bin on the whole validation set to get an overall picture of its prediction quality; P@i/R@i denote precision/recall when predicting i labels
#(1) predict 1 label
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext test model_cooking.bin cooking.valid 
N	3000
P@1	0.149
R@1	0.0646
Number of examples: 3000
#(2) predict 5 labels
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext test model_cooking.bin cooking.valid 5
N	3000
P@5	0.067
R@5	0.145
Number of examples: 3000
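With the Python bindings mentioned earlier, the same held-out evaluation can be done roughly like this (model.test returns the number of examples, precision@k and recall@k):

import fasttext

model = fasttext.load_model("model_cooking.bin")
print(model.test("cooking.valid"))        # (N, P@1, R@1)
print(model.test("cooking.valid", k=5))   # (N, P@5, R@5)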
step2: data preprocessing
#preprocess the text: split punctuation off the words and convert uppercase letters to lowercase
prodeMBP:fastText_learn pro$ cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed_1.txt
prodeMBP:fastText_learn pro$ head -n 12404 cooking.preprocessed_1.txt > cooking_1.train
prodeMBP:fastText_learn pro$ tail -n 3000 cooking.preprocessed_1.txt > cooking_1.valid
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext supervised -input cooking_1.train -output model_cooking_1
Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 0.0%  words/sec/thread: 85227  lr: 0.099987  loss: 15.532344
Progress: 0.1%  words/sec/thread: 125621  lr: 0.099921  loss: 16.70705
···
Progress: 99.4%  words/sec/thread: 88113  lr: 0.000576  loss: 9.895709
Progress: 100.0%  words/sec/thread: 96064  lr: 0.000000  loss: 9.894776  eta: 0h0m 

prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext test model_cooking_1.bin cooking_1.valid 
N	3000
P@1	0.177
R@1	0.0767
Number of examples: 3000
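For reference, the sed/tr pipeline used above can also be expressed in Python, which is convenient if the rest of your pipeline lives there; a rough equivalent:

import re

def preprocess(line):
    # pad the same punctuation characters the sed command targets with spaces,
    # then lowercase everything (the tr step)
    return re.sub(r"([.!?,'/()])", r" \1 ", line).lower()

with open("cooking.stackexchange.txt") as src, open("cooking.preprocessed_1.txt", "w") as dst:
    for line in src:
        dst.write(preprocess(line))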
step3: hyperparameter tuning
#add the -epoch parameter, the number of times each example is seen over the whole training run
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext supervised -input cooking_1.train -output model_cooking_2 -epoch 25
Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 0.0%  words/sec/thread: 41040  lr: 0.100000  loss: 15.532344
Progress: 0.0%  words/sec/thread: 101927  lr: 0.099987  loss: 16.70705
···
Progress: 100.0%  words/sec/thread: 88445  lr: 0.000026  loss: 7.17666
Progress: 100.0%  words/sec/thread: 88444  lr: 0.000000  loss: 7.176500  eta: 0h0m 

prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext test model_cooking_2.bin cooking_1.valid
N	3000
P@1	0.519
R@1	0.225
Number of examples: 3000
#add the -lr parameter to change the learning rate, i.e. how much the model changes after processing each example
#a learning rate of 0 means the model does not change at all and therefore learns nothing; good learning rates lie in the 0.1 - 1.0 range
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext supervised -input cooking_1.train -output model_cooking_3 -lr 1.0
Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 0.6%  words/sec/thread: 83356  lr: 0.994433  loss: 15.532344
Progress: 0.6%  words/sec/thread: 84618  lr: 0.994055  loss: 14.675364
Progress: 0.7%  words/sec/thread: 81756  lr: 0.993252  loss: 13.116207
···
Progress: 99.6%  words/sec/thread: 88352  lr: 0.004163  loss: 6.479478
Progress: 100.0%  words/sec/thread: 88351  lr: 0.000000  loss: 6.479478  eta: 0h0m 

prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext test model_cooking_3.bin cooking_1.valid
N	3000
P@1	0.568
R@1	0.245
Number of examples: 3000
#use the -lr and -epoch parameters together
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext supervised -input cooking_1.train -output model_cooking_4 -lr 1.0 -epoch 25
Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 0.0%  words/sec/thread: 75736  lr: 0.999973  loss: 15.532344
Progress: 0.0%  words/sec/thread: 76247  lr: 0.999843  loss: 16.707058
Progress: 0.0%  words/sec/thread: 90871  lr: 0.999761  loss: 17.011614
···
Progress: 99.9%  words/sec/thread: 88333  lr: 0.001316  loss: 4.354121
Progress: 100.0%  words/sec/thread: 88332  lr: 0.000000  loss: 4.353338  eta: 0h0m

prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext test model_cooking_4.bin cooking_1.valid
N	3000
P@1	0.591
R@1	0.255
Number of examples: 3000
#add word 2-grams (-wordNgrams 2)
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext supervised -input cooking_1.train -output model_cooking_5 -lr 1.0 -epoch 25 -wordNgrams 2
Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 0.0%  words/sec/thread: 81584  lr: 0.999974  loss: 15.532344
Progress: 0.0%  words/sec/thread: 158995  lr: 0.999657  loss: 16.70705
Progress: 0.0%  words/sec/thread: 96518  lr: 0.999574  loss: 17.011614
Progress: 0.1%  words/sec/thread: 87310  lr: 0.999470  loss: 17.178629
···
Progress: 99.8%  words/sec/thread: 87492  lr: 0.001646  loss: 3.171684
Progress: 100.0%  words/sec/thread: 87481  lr: 0.000000  loss: 3.171049  eta: 0h0m 

prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext test model_cooking_5.bin cooking_1.valid
N	3000
P@1	0.613
R@1	0.265
Number of examples: 3000

The key steps that raised precision from 14.9% to 61.3% were:
preprocessing the data;
changing the number of epochs (option -epoch, typical range [5 - 50]);
changing the learning rate (option -lr, typical range [0.1 - 1.0]);
using word n-grams (option -wordNgrams, typical range [1 - 5]).
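Putting these together, the final tuned configuration would look roughly like this with the Python bindings (same caveat as before: this assumes the pip-installable bindings rather than the 0.1.0 CLI used throughout the transcript):

import fasttext

model = fasttext.train_supervised(
    input="cooking_1.train",  # the preprocessed training file
    lr=1.0,                   # learning rate
    epoch=25,                 # number of passes over the data
    wordNgrams=2,             # use word bigrams in addition to unigrams
)
print(model.test("cooking_1.valid"))   # (N, P@1, R@1)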
