3.4 Case Study
Suppose, for example, that we want to estimate an LDA model for a document collection stored in the file models/casestudy/trndocs.dat, and then use that model to do inference for new data stored in the file models/casestudy/newdocs.dat.
We want to estimate 100 topics with alpha = 0.5 and beta = 0.1. We will perform 1000 Gibbs sampling iterations, save a model snapshot every 100 iterations, and, each time a model is saved, print out the 20 most likely words of each topic. Assuming that we are now at the root directory of GibbsLDA++, we run the following command to estimate the LDA model from scratch:
$ src/lda -est -alpha 0.5 -beta 0.1 -ntopics 100 -niters 1000 -savestep 100 -twords 20 -dfile models/casestudy/trndocs.dat
Now take a look at the models/casestudy directory to see the outputs.
Outputs of Gibbs sampling estimation of GibbsLDA++ include the following files:
<model_name>.others
<model_name>.phi
<model_name>.theta
<model_name>.tassign
<model_name>.twords
in which:
<model_name>: is the name of an LDA model corresponding to the time step at which it was saved on the hard disk. For example, the model saved at the 400th Gibbs sampling iteration is named model-00400. Similarly, the model saved at the 1200th iteration is model-01200. The model saved at the last Gibbs sampling iteration is named model-final.
<model_name>.others: This file contains some parameters of the LDA model, such as:
alpha=?
beta=?
ntopics=? # i.e., number of topics
ndocs=? # i.e., number of documents
nwords=? # i.e., the vocabulary size
liter=? # i.e., the Gibbs sampling iteration at which the model was saved
<model_name>.phi: This file contains the word-topic distributions, i.e., p(word_w | topic_t). Each line is a topic, and each column is a word in the vocabulary (see the parsing sketch after this list of files).
<model_name>.theta: This file contains the topic-document distributions, i.e., p(topic_t | document_m). Each line is a document and each column is a topic.
<model_name>.tassign: This file contains the topic assignments for words in the training data. Each line is a document and consists of a list of <word_ij>:<topic of word_ij> pairs.
<model_name>.twords: This file contains the twords most likely words of each topic. twords is specified in the command line (see Sections 3.1.1 and 3.1.2).
GibbsLDA++ also saves a file called wordmap.txt that contains the mapping between words and their integer word IDs. This is because GibbsLDA++ works internally with integer IDs of words/terms instead of text strings.
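The following is a minimal Python sketch (not part of GibbsLDA++; numpy is an extra dependency assumed here) showing one way to read these outputs back for further analysis. It relies on the formats described above, plus two assumptions about details not spelled out in this section: that the .theta and .phi files hold whitespace-separated rows of floats, and that wordmap.txt starts with the vocabulary size followed by one "word id" pair per line. The directory and model name are simply the ones used in this case study.

# read_model.py -- a hedged sketch, not shipped with GibbsLDA++
import numpy as np

model_dir = "models/casestudy"   # directory used in this case study
model_name = "model-final"       # any saved model, e.g. model-00400

def load_matrix(path):
    # One row of whitespace-separated floats per line (.theta / .phi layout).
    with open(path) as f:
        return np.array([[float(x) for x in line.split()]
                         for line in f if line.strip()])

# theta: one row per document, one column per topic -> p(topic_t | document_m)
# phi:   one row per topic,    one column per word  -> p(word_w  | topic_t)
theta = load_matrix(f"{model_dir}/{model_name}.theta")
phi = load_matrix(f"{model_dir}/{model_name}.phi")

# .others: one "key=value" line per parameter (alpha, beta, ntopics, ...)
others = {}
with open(f"{model_dir}/{model_name}.others") as f:
    for line in f:
        if "=" in line:
            key, value = line.split("=", 1)
            others[key.strip()] = value.strip()

# wordmap.txt: assumed to start with the vocabulary size, followed by
# one "word id" pair per line.
id2word = {}
with open(f"{model_dir}/wordmap.txt") as f:
    f.readline()                      # skip the vocabulary size line
    for line in f:
        parts = line.split()
        if len(parts) == 2:
            id2word[int(parts[1])] = parts[0]

# Example use: print the 10 most probable words of topic 0 from phi.
top_ids = np.argsort(phi[0])[::-1][:10]
print([id2word.get(int(i), "?") for i in top_ids])

The last two lines recover, from the .phi matrix and wordmap.txt, the same kind of information that <model_name>.twords already stores, which is a convenient way to check that the files were read correctly.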
Now, suppose we want to continue running another 800 Gibbs sampling iterations from the previously estimated model model-01000, with savestep = 100 and twords = 30. We run the following command:
$ src/lda -estc -dir models/casestudy/ -model model-01000 -niters 800 -savestep 100 -twords 30
Now take a look at the casestudy directory to see the outputs.
Now, if we want to do inference (30 Gibbs sampling iterations) for the new data newdocs.dat using one of the previously estimated LDA models, for example model-01800, we run the following command:
$ src/lda -inf -dir models/casestudy/ -model model-01800 -niters 30 -twords 20 -dfile newdocs.dat
Now, taking a look at the casestudy directory, we can see the outputs of the inference (a small sketch of how to read them back follows this list):
newdocs.dat.others
newdocs.dat.phi
newdocs.dat.tassign
newdocs.dat.theta
newdocs.dat.twords
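As a quick check on these inference outputs, the short Python sketch below (again an assumption about downstream use, not something shipped with GibbsLDA++) reads newdocs.dat.theta, whose m-th row is taken to hold p(topic_t | document_m) for the m-th new document, and prints each new document's most probable topic.

# inspect_inference.py -- a hedged sketch, not shipped with GibbsLDA++
with open("models/casestudy/newdocs.dat.theta") as f:
    for m, line in enumerate(f):
        if not line.strip():
            continue
        probs = [float(x) for x in line.split()]
        best = max(range(len(probs)), key=probs.__getitem__)
        print(f"new document {m}: most probable topic {best} (p = {probs[best]:.4f})")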