StarSpace Series, Part 1: TagSpace

Problem Type

TagSpace: embeddings of words and hashtags

Use case: learn a mapping from short text to relevant hashtags, as described, for example, in this paper. This is a typical classification application.

Model: learn the mapping from a set of words to a set of tags by learning embeddings of both. For example, the input "restaurant has great food <\tab> #restaurant <\tab> #yum" is translated into the graph below. (The nodes in the graph are the entities whose embeddings are learned; the edges are the relations between entities.)

[Figure 1: graph of word and hashtag entities]
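How such a model scores tags can be sketched as follows: the document embedding is the sum of its word embeddings, and each tag is scored against it with a dot product. This is an illustration only, not StarSpace code; the tiny vocabulary and random weights are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 10
vocab = {"restaurant": 0, "has": 1, "great": 2, "food": 3}
tags = {"#restaurant": 0, "#yum": 1, "#traffic": 2}

# entity embeddings: words and tags live in the same d-dimensional space
word_emb = rng.normal(scale=0.01, size=(len(vocab), dim))
tag_emb = rng.normal(scale=0.01, size=(len(tags), dim))

def score_tags(tokens):
    """Document embedding = sum of word embeddings; score each tag by dot product."""
    doc = sum(word_emb[vocab[t]] for t in tokens if t in vocab)
    return tag_emb @ doc  # one score per tag

scores = score_tags("restaurant has great food".split())
best = max(tags, key=lambda t: scores[tags[t]])  # highest-scoring tag
```

In the real model these embeddings are trained so that the correct tags score higher than sampled negatives.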

Training Data

The AG’s news topic classification dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.
News data: 4 major classes, 120,000 training articles.
World
Sports
Business
Sci/Tech

Sample Data

The file classes.txt contains a list of classes corresponding to each label.

__label__2 , garca winds up best in tough going , given what sergio garca has achieved in his career already it is difficult to believe he is only 24 years old . he had a 67 yesterday , four under , to share the volvo masters lead with his fellow spaniard
__label__3 , us shares take a tumble on oil prices , new york , nov 23 ( afp ) - wall street shares slid on tuesday as oil prices surged higher and investors sensed weaknesses in the technology sector .
__label__4 , product review blackberry 7100t smartphone ( newsfactor ) , newsfactor - research in motion ' s ( nasdaq rimm ) quad-band \blackberry 7100t with \pda capabilities is a gsm/gprs ( 850/900/1800/1900 mhz ) cellular handset that can make and receive phone calls in more than 100 countries around the world .
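The fastText format above (a `__label__` prefix, separated from the text by a comma) can be parsed with a small helper. This is a hypothetical illustration, not part of StarSpace:

```python
def parse_fasttext_line(line, prefix="__label__"):
    """Split a fastText-format line into its labels and its text."""
    tokens = line.strip().split()
    labels = [t for t in tokens if t.startswith(prefix)]
    words = [t for t in tokens if not t.startswith(prefix) and t != ","]
    return labels, " ".join(words)

labels, text = parse_fasttext_line(
    "__label__2 , garca winds up best in tough going")
# labels == ["__label__2"]; text == "garca winds up best in tough going"
```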

Training

./classification_ag_news.sh
Downloading dataset ag_news
Compiling StarSpace
make: *** No targets specified and no makefile found.  Stop.
Start to train on ag_news data:
Arguments:
lr: 0.01
dim: 10
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: dot
maxNegSamples: 10
negSearchLimit: 5
batchSize: 5
thread: 20
minCount: 1
minCountLabel: 1
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 0
trainMode: 0
fileFormat: fastText
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
useWeight: 0
weightSep: :
Start to initialize starspace model.
Build dict from input file : /tmp/starspace/data/ag_news.train
Read 5M words
Number of words in dictionary:  95811
Number of labels in dictionary: 4
Loading data from file : /tmp/starspace/data/ag_news.train
Total number of examples loaded : 120000
Initialized model weights. Model size :
matrix : 95815 10
Training epoch 0: 0.01 0.002
Epoch: 100.0%  lr: 0.008017  loss: 0.036824  eta: <1min   tot: 0h0m0s  (20.0%)
 ---+++                Epoch    0 Train error : 0.03529871 +++--- ☃
Training epoch 1: 0.008 0.002
Epoch: 100.0%  lr: 0.006033  loss: 0.018947  eta: <1min   tot: 0h0m1s  (40.0%)
 ---+++                Epoch    1 Train error : 0.01904551 +++--- ☃
Training epoch 2: 0.006 0.002
Epoch: 100.0%  lr: 0.004000  loss: 0.015143  eta: <1min   tot: 0h0m1s  (60.0%)
 ---+++                Epoch    2 Train error : 0.01569214 +++--- ☃
Training epoch 3: 0.004 0.002
Epoch: 100.0%  lr: 0.002000  loss: 0.014580  eta: <1min   tot: 0h0m2s  (80.0%)
 ---+++                Epoch    3 Train error : 0.01361201 +++--- ☃
Training epoch 4: 0.002 0.002
Epoch: 100.0%  lr: -0.000000  loss: 0.011692  eta: <1min   tot: 0h0m2s  (100.0%)
 ---+++                Epoch    4 Train error : 0.01211345 +++--- ☃
Saving model to file : /tmp/starspace/models/ag_news
Saving model in tsv format : /tmp/starspace/models/ag_news.tsv
Start to evaluate trained model:
Arguments:
lr: 0.01
dim: 10
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: dot
maxNegSamples: 10
negSearchLimit: 50
batchSize: 5
thread: 10
minCount: 1
minCountLabel: 1
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 1
trainMode: 0
fileFormat: fastText
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
useWeight: 0
weightSep: :
Start to load a trained starspace model.
STARSPACE-2018-2
Initialized model weights. Model size :
matrix : 95815 10
Model loaded.
Loading data from file : /tmp/starspace/data/ag_news.test
Total number of examples loaded : 7600
------Loaded model args:
Arguments:
lr: 0.01
dim: 10
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: dot
maxNegSamples: 10
negSearchLimit: 5
batchSize: 5
thread: 10
minCount: 1
minCountLabel: 1
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 1
trainMode: 0
fileFormat: fastText
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
useWeight: 0
weightSep: :
Predictions use 4 known labels.
Evaluation Metrics :
hit@1: 0.917105 hit@10: 1 hit@20: 1 hit@50: 1 mean ranks : 1.10263 Total examples : 7600
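The metrics reported above (hit@k and mean rank) can be computed from the 1-based rank of the true label in each example's prediction list. A small sketch with made-up ranks:

```python
def evaluate(ranks, ks=(1, 10, 20, 50)):
    """hit@k = fraction of examples whose true label ranks in the top k."""
    n = len(ranks)
    hits = {k: sum(r <= k for r in ranks) / n for k in ks}
    mean_rank = sum(ranks) / n
    return hits, mean_rank

# four made-up test examples; three rank the true label first
hits, mean_rank = evaluate([1, 1, 2, 1])
# hits[1] == 0.75, mean_rank == 1.25
```

With only 4 labels, hit@10 and above are trivially 1, as seen in the log.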

The core training command in the script is:

../starspace train \
  -trainFile "${DATADIR}"/ag_news.train \
  -model "${MODELDIR}"/ag_news \
  -initRandSd 0.01 \
  -adagrad false \
  -ngrams 1 \
  -lr 0.01 \
  -epoch 5 \
  -thread 20 \
  -dim 10 \
  -negSearchLimit 5 \
  -trainMode 0 \
  -label "__label__" \
  -similarity "dot" \
  -verbose true
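The evaluation step in the same script is, roughly, the following `starspace test` call. The exact flag set here is an assumption reconstructed from the log output above, not copied from the script:

```shell
../starspace test \
  -model "${MODELDIR}"/ag_news \
  -testFile "${DATADIR}"/ag_news.test \
  -ngrams 1 \
  -dim 10 \
  -label "__label__" \
  -thread 10 \
  -similarity "dot" \
  -trainMode 0 \
  -verbose true
```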
  

Analysis

Training produces two model files:
Saving model to file : /tmp/starspace/models/ag_news
Saving model in tsv format : /tmp/starspace/models/ag_news.tsv
models$ wc -l ag_news.tsv
95815 ag_news.tsv

This TSV is simply the learned word vectors and label vectors, one entity per line.

race-day	-0.00570984	0.00158739	0.0122618	0.0040543	0.0113636	0.0171366	-0.000411966	-0.00239621	0.00648128	0.00166139
807km	-0.0186726	0.00619443	-0.000953957	-0.00184454	-0.00456583	0.00638993	0.00550364	-0.00182587	-0.00843166	0.0182373
__label__2	0.277595	0.276433	-0.137682	0.20364	-0.129544	-0.21292	0.284562	-0.127947	0.18984	0.0745169
__label__4	0.00365386	-0.15706	-0.0372935	-0.0641772	0.0239136	0.0957492	-0.175419	0.372076	-0.166987	0.168438
__label__3	-0.0821545	-0.0460339	0.0548889	-0.274909	0.325552	-0.0362561	-0.0883801	-0.110232	-0.0197069	-0.107682
__label__1	-0.22841	-0.0855479	0.102414	0.166055	-0.27244	0.14407	-0.0423679	-0.149521	-0.00979879	-0.135866
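Since words and labels share one space, the TSV dump can be used directly for prediction: sum the vectors of the words in a sentence and pick the `__label__` row with the highest dot product. A sketch (the two-dimensional toy TSV below is made up; the real file has 10-dimensional rows like those above):

```python
import io
import numpy as np

# stand-in for open("/tmp/starspace/models/ag_news.tsv")
tsv = io.StringIO(
    "food\t0.1\t0.2\n"
    "__label__2\t0.3\t0.1\n"
    "__label__1\t-0.3\t-0.1\n"
)

vectors = {}
for row in tsv:
    name, *vals = row.rstrip("\n").split("\t")
    vectors[name] = np.array([float(v) for v in vals])

def predict(sentence):
    """Sum word vectors, then return the best-scoring label by dot product."""
    doc = sum(vectors[w] for w in sentence.split() if w in vectors)
    labels = {k: v for k, v in vectors.items() if k.startswith("__label__")}
    return max(labels, key=lambda k: float(labels[k] @ doc))

pred = predict("great food")
# pred == "__label__2"
```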

Training with Other Input Formats

Documents in other formats can also be used for training.
Input file format:
restaurant has great food #yum #restaurant
Command:
$ ./starspace train -trainFile input.txt -model tagspace -label '#'

Example script:
We apply the model to the text classification problem on the AG news topic classification dataset. Here the labels are news article categories, and we use the hit@1 metric to measure classification accuracy. This example script downloads the data and runs the StarSpace model from the examples directory:

$bash examples/classification_ag_news.sh

Summary

  1. TagSpace is well suited to tagging text, and the tag set can be larger than a classification taxonomy. For plain classification you can use the original fastText directly, though this works too.
  2. And if you want to embed classification labels or topic tags together with word vectors in a single feature space, this is exactly the model for it.

If You Want to Understand the Model in Depth

Continue with the references below; here I only abstract the core ideas.

https://www.aclweb.org/anthology/D14-1194
A brief look at this paper, which was published in 2014.

  1. Title
    TAGSPACE: Semantic Embeddings from Hashtags
  2. Idea
    Use a CNN to compute the document vector, then optimize a ranking objective over the scores f(w, t+) and f(w, t-). This yields vector representations of tags t and documents in the same feature space, so the model can retrieve hashtags for a document. For the WARP loss, see the last reference.
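The ranking objective above can be sketched as a margin (hinge) loss, the same family as the `loss: hinge, margin: 0.05` seen in the training log: push the score of a positive tag above that of a sampled negative tag by at least a margin. The scores below are made-up numbers:

```python
def hinge_loss(pos_score, neg_score, margin=0.05):
    """Margin ranking loss: zero once f(w, t+) exceeds f(w, t-) by the margin."""
    return max(0.0, margin - pos_score + neg_score)

# positive tag already ahead by more than the margin -> no gradient
loss_ok = hinge_loss(0.9, 0.1)    # 0.0
# negative tag too close -> positive loss, embeddings get updated
loss_bad = hinge_loss(0.10, 0.08)  # ~0.03
```

WARP additionally weights each violation by how hard it was to find a violating negative; the reference at the end covers that.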

All the essentials are captured in one figure:
[Figure 2: the TAGSPACE model]

References

  1. http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html
  2. Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
  3. https://www.aclweb.org/anthology/D14-1194
  4. https://medium.com/@gabrieltseng/intro-to-warp-loss-automatic-differentiation-and-pytorch-b6aa5083187a
