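TagSpace: word and tag embeddings
Use case: learn a mapping from short texts to relevant hashtags, for example as described in this article. This is a classic classification application.
Model: the mapping from a set of words to a set of tags is learned by learning embeddings of both. For example, the input "restaurant has great food <\tab> #restaurant <\tab> #yum" is translated into the graph below. (Nodes in the graph are the entities whose embeddings are learned, and edges are the relations between those entities.)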
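To make the model description more concrete, here is a minimal sketch of a margin ranking (hinge) loss with dot-product similarity, matching the `loss: hinge`, `similarity: dot` and `margin: 0.05` settings that appear in the training log. The function names and the toy vectors are assumptions made for illustration; this is not StarSpace's actual implementation, which also handles negative sampling and normalization internally.

```python
def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def hinge_rank_loss(text_vec, pos_label_vec, neg_label_vecs, margin=0.05):
    """Sum of max(0, margin - sim(text, pos) + sim(text, neg)) over negatives.

    text_vec is the embedding of the word set (hypothetical input here),
    pos_label_vec the embedding of the correct tag, and neg_label_vecs the
    embeddings of sampled incorrect tags.
    """
    pos_score = dot(text_vec, pos_label_vec)
    return sum(
        max(0.0, margin - pos_score + dot(text_vec, neg))
        for neg in neg_label_vecs
    )


# Toy 2-dimensional example: the first negative is well separated and costs
# nothing, the second scores as high as the positive and is penalized.
text = [1.0, 0.0]
positive = [1.0, 0.0]
negatives = [[0.0, 1.0], [1.0, 0.0]]
loss = hinge_rank_loss(text, positive, negatives)
print(loss)
```

The loss is zero only when the correct tag outscores every sampled negative by at least the margin, which is what pushes words and their tags close together in the shared embedding space.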
training:
The AG’s news topic classification dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.
News data: 4 major classes, 120,000 training articles.
World
Sports
Business
Sci/Tech
The file classes.txt contains a list of classes corresponding to each label.
__label__2 , garca winds up best in tough going , given what sergio garca has achieved in his career already it is difficult to believe he is only 24 years old . he had a 67 yesterday , four under , to share the volvo masters lead with his fellow spaniard
__label__3 , us shares take a tumble on oil prices , new york , nov 23 ( afp ) - wall street shares slid on tuesday as oil prices surged higher and investors sensed weaknesses in the technology sector .
__label__4 , product review blackberry 7100t smartphone ( newsfactor ) , newsfactor - research in motion ' s ( nasdaq rimm ) quad-band \blackberry 7100t with \pda capabilities is a gsm/gprs ( 850/900/1800/1900 mhz ) cellular handset that can make and receive phone calls in more than 100 countries around the world .
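The samples above follow the fastText-style layout, where tokens that start with the label prefix are labels and all other whitespace-separated tokens are words. A minimal sketch of how such a line could be split, assuming that simple rule (the helper name `parse_line` is hypothetical and not part of StarSpace):

```python
def parse_line(line, label_prefix="__label__"):
    """Split one training line into (words, labels).

    Assumes whitespace-separated tokens; any token starting with
    label_prefix is treated as a label, everything else as a word.
    """
    tokens = line.split()
    labels = [t for t in tokens if t.startswith(label_prefix)]
    words = [t for t in tokens if not t.startswith(label_prefix)]
    return words, labels


sample = "__label__3 , us shares take a tumble on oil prices"
words, labels = parse_line(sample)
print(labels)  # ['__label__3']
print(words)   # [',', 'us', 'shares', 'take', 'a', 'tumble', 'on', 'oil', 'prices']
```

Note that under this rule the comma after the label is kept as an ordinary word token.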
./classification_ag_news.sh
Downloading dataset ag_news
Compiling StarSpace
make: *** No targets specified and no makefile found. Stop.
Start to train on ag_news data:
Arguments:
lr: 0.01
dim: 10
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: dot
maxNegSamples: 10
negSearchLimit: 5
batchSize: 5
thread: 20
minCount: 1
minCountLabel: 1
label: __label__
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 0
trainMode: 0
fileFormat: fastText
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
useWeight: 0
weightSep: :
Start to initialize starspace model.
Build dict from input file : /tmp/starspace/data/ag_news.train
Read 5M words
Number of words in dictionary: 95811
Number of labels in dictionary: 4
Loading data from file : /tmp/starspace/data/ag_news.train
Total number of examples loaded : 120000
Initialized model weights. Model size :
matrix : 95815 10
Training epoch 0: 0.01 0.002
Epoch: 100.0% lr: 0.008017 loss: 0.036824 eta: <1min tot: 0h0m0s (20.0%)
---+++ Epoch 0 Train error : 0.03529871 +++--- ☃
Training epoch 1: 0.008 0.002
Epoch: 100.0% lr: 0.006033 loss: 0.018947 eta: <1min tot: 0h0m1s (40.0%)
---+++ Epoch 1 Train error : 0.01904551 +++--- ☃
Training epoch 2: 0.006 0.002
Epoch: 100.0% lr: 0.004000 loss: 0.015143 eta: <1min tot: 0h0m1s (60.0%)
---+++ Epoch 2 Train error : 0.01569214 +++--- ☃
Training epoch 3: 0.004 0.002
Epoch: 100.0% lr: 0.002000 loss: 0.014580 eta: <1min tot: 0h0m2s (80.0%)
---+++ Epoch 3 Train error : 0.01361201 +++--- ☃
Training epoch 4: 0.002 0.002
Epoch: 100.0% lr: -0.000000 loss: 0.011692 eta: <1min tot: 0h0m2s (100.0%)
---+++ Epoch 4 Train error : 0.01211345 +++--- ☃
Saving model to file : /tmp/starspace/models/ag_news
Saving model in tsv format : /tmp/starspace/models/ag_news.tsv
Start to evaluate trained model:
Arguments:
lr: 0.01
dim: 10
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: dot
maxNegSamples: 10
negSearchLimit: 50
batchSize: 5
thread: 10
minCount: 1
minCountLabel: 1
label: __label__
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 1
trainMode: 0
fileFormat: fastText
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
useWeight: 0
weightSep: :
Start to load a trained starspace model.
STARSPACE-2018-2
Initialized model weights. Model size :
matrix : 95815 10
Model loaded.
Loading data from file : /tmp/starspace/data/ag_news.test
Total number of examples loaded : 7600
------Loaded model args:
Arguments:
lr: 0.01
dim: 10
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: dot
maxNegSamples: 10
negSearchLimit: 5
batchSize: 5
thread: 10
minCount: 1
minCountLabel: 1
label: __label__
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 1
trainMode: 0
fileFormat: fastText
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
useWeight: 0
weightSep: :
Predictions use 4 known labels.
Evaluation Metrics :
hit@1: 0.917105 hit@10: 1 hit@20: 1 hit@50: 1 mean ranks : 1.10263 Total examples : 7600
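The evaluation metrics in the log can be read as follows: hit@k is the fraction of test examples whose true label appears among the top k ranked labels, and the mean rank is the average position of the true label. With only 4 labels, hit@10 and above are necessarily 1, and a hit@1 of 0.917105 over 7,600 examples corresponds to about 6,970 correctly classified articles. The sketch below computes these metrics from ranked predictions; the function name and the toy data are assumptions for illustration, not StarSpace's own evaluation code.

```python
def evaluate(ranked_lists, gold_labels, ks=(1, 10)):
    """Compute hit@k for each k in ks, plus the mean rank of the gold label.

    ranked_lists[i] is the list of all labels ordered from best to worst
    for example i; gold_labels[i] is the correct label for that example.
    """
    hits = {k: 0 for k in ks}
    total_rank = 0
    for ranked, gold in zip(ranked_lists, gold_labels):
        rank = ranked.index(gold) + 1  # 1-based position of the true label
        total_rank += rank
        for k in ks:
            if rank <= k:
                hits[k] += 1
    n = len(gold_labels)
    hit_at = {f"hit@{k}": hits[k] / n for k in ks}
    return hit_at, total_rank / n


# Toy example with two labels and four test examples.
ranked = [["a", "b"], ["b", "a"], ["a", "b"], ["b", "a"]]
gold = ["a", "a", "b", "b"]
metrics, mean_rank = evaluate(ranked, gold)
print(metrics, mean_rank)
```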
The core script is:
../starspace train \
-trainFile "${DATADIR}"/ag_news.train \
-model "${MODELDIR}"/ag_news \
-initRandSd 0.01 \
-adagrad false \
-ngrams 1 \
-lr 0.01 \
-epoch 5 \
-thread 20 \
-dim 10 \
-negSearchLimit 5 \
-trainMode 0 \
-label "__label__" \
-similarity "dot" \
-verbose true
Training produces the model:
Saving model to file : /tmp/starspace/models/ag_news
Saving model in tsv format : /tmp/starspace/models/ag_news.tsv
models$ wc -l ag_news.tsv
95815 ag_news.tsv
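This file is simply the word vectors and the label vectors: 95,815 rows, one for each of the 95,811 words plus the 4 labels, each with 10 dimensions.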
race-day -0.00570984 0.00158739 0.0122618 0.0040543 0.0113636 0.0171366 -0.000411966 -0.00239621 0.00648128 0.00166139
807km -0.0186726 0.00619443 -0.000953957 -0.00184454 -0.00456583 0.00638993 0.00550364 -0.00182587 -0.00843166 0.0182373
__label__2 0.277595 0.276433 -0.137682 0.20364 -0.129544 -0.21292 0.284562 -0.127947 0.18984 0.0745169
__label__4 0.00365386 -0.15706 -0.0372935 -0.0641772 0.0239136 0.0957492 -0.175419 0.372076 -0.166987 0.168438
__label__3 -0.0821545 -0.0460339 0.0548889 -0.274909 0.325552 -0.0362561 -0.0883801 -0.110232 -0.0197069 -0.107682
__label__1 -0.22841 -0.0855479 0.102414 0.166055 -0.27244 0.14407 -0.0423679 -0.149521 -0.00979879 -0.135866
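Since the tsv file holds one vector per word and per label, it can be used directly for prediction. The sketch below loads such rows and ranks labels by the dot product between the summed word vectors and each label vector. This is an approximation under stated assumptions: the function names are hypothetical, the rows are assumed to be whitespace-separated, and a plain sum is used for the text embedding, whereas StarSpace itself may normalize the sum, so the scores need not match its output exactly.

```python
def load_embeddings(lines):
    """Parse rows of the form 'token v1 v2 ... vd' into a dict."""
    emb = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed rows
        emb[parts[0]] = [float(x) for x in parts[1:]]
    return emb


def rank_labels(words, emb, label_prefix="__label__"):
    """Return labels sorted from highest to lowest dot-product score."""
    labels = {k: v for k, v in emb.items() if k.startswith(label_prefix)}
    dim = len(next(iter(labels.values())))
    doc = [0.0] * dim
    for w in words:
        vec = emb.get(w)
        if vec is not None and not w.startswith(label_prefix):
            for i, x in enumerate(vec):
                doc[i] += x  # plain sum of known word vectors (assumption)
    scores = {
        lab: sum(a * b for a, b in zip(doc, vec))
        for lab, vec in labels.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-dimensional rows standing in for the real file; for the real model
# one would pass open("/tmp/starspace/models/ag_news.tsv") instead.
toy_rows = [
    "goal 1.0 0.0",
    "stocks 0.0 1.0",
    "__label__2 0.9 0.1",
    "__label__3 0.1 0.9",
]
emb = load_embeddings(toy_rows)
print(rank_labels(["goal", "goal", "stocks"], emb))  # ['__label__2', '__label__3']
```

Words missing from the dictionary are simply ignored in this sketch.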
You can also consult the documentation on training with other input formats.
Input file format:
restaurant has great food #yum #restaurant
Command:
$./starspace train -trainFile input.txt -model tagspace -label '#'
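Example script:
We apply the model to a text classification problem on the AG's news topic classification dataset. In this problem the labels are news article categories, and we use the hit@1 metric to measure classification accuracy. This example script downloads the data and runs the StarSpace model from the examples directory: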
$bash examples/classification_ag_news.sh
Please go on to read the paper below. Here I only distill some of its core ideas.
https://www.aclweb.org/anthology/D14-1194
We briefly analyze this paper, which was published in 2014.