GibbsLDA++

3.4 Case Study

         Suppose we want to estimate an LDA model for a document collection stored in the file models/casestudy/trndocs.dat, and then use that model to do inference for new data stored in the file models/casestudy/newdocs.dat.

         We want to estimate 100 topics with alpha = 0.5 and beta = 0.1. We want to perform 1,000 Gibbs sampling iterations, save a model every 100 iterations, and at each save print out the 20 most likely words for each topic. Supposing we are in the root directory of GibbsLDA++, we run the following command to estimate an LDA model from scratch:

         $ src/lda -est -alpha 0.5 -beta 0.1 -ntopics 100 -niters 1000 -savestep 100 -twords 20 -dfile models/casestudy/trndocs.dat

 

         Now take a look at the models/casestudy directory; we can see the outputs as described below.

 

Outputs of Gibbs sampling estimation of GibbsLDA++ include the following files:

 

<model_name>.others

<model_name>.phi

<model_name>.theta

<model_name>.tassign

<model_name>.twords

 

in which:

 

<model_name>: the name of an LDA model corresponding to the time step at which it was saved on the hard disk. For example, the model saved at the 400th Gibbs sampling iteration is named model-00400; similarly, the model saved at the 1200th iteration is model-01200. The model saved at the last Gibbs sampling iteration is named model-final.
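
The naming convention above can be sketched in a few lines (the helper function is hypothetical, for illustration only; it is not part of GibbsLDA++):

```python
# Hypothetical helper illustrating the checkpoint naming convention described above.
def model_name(iteration, final=False):
    """Return the name GibbsLDA++ gives a model saved at a given iteration."""
    if final:
        return "model-final"
    # Iteration numbers are zero-padded to five digits, e.g. 400 -> model-00400.
    return "model-%05d" % iteration

print(model_name(400))    # model-00400
print(model_name(1200))   # model-01200
```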

 

<model_name>.others: This file contains some parameters of the LDA model, such as:

 

alpha=?

beta=?

ntopics=? # i.e., number of topics

ndocs=? # i.e., number of documents

nwords=? # i.e., the vocabulary size

liter=? # i.e., the Gibbs sampling iteration at which the model was saved
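
Since the .others file is a simple list of key=value lines, it is easy to read programmatically. A minimal sketch, assuming exactly the format shown above (the sample values are made up):

```python
# Sketch: parse a <model_name>.others file, assuming the key=value
# format documented above. Sample content is hypothetical.
sample = """alpha=0.5
beta=0.1
ntopics=100
ndocs=1000
nwords=25000
liter=1000
"""

params = {}
for line in sample.splitlines():
    key, _, value = line.partition("=")
    params[key.strip()] = value.strip()

print(params["ntopics"])  # 100
```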

 

<model_name>.phi: This file contains the word-topic distributions, i.e., p(word_w | topic_t). Each line is a topic; each column is a word in the vocabulary.

 

<model_name>.theta: This file contains the topic-document distributions, i.e., p(topic_t | document_m). Each line is a document and each column is a topic.
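
Both .phi and .theta can be read as plain numeric matrices. A minimal sketch, assuming whitespace-separated floats with one row per line (the sample matrix is made up, with 2 topics and a 3-word vocabulary):

```python
# Sketch: read a .phi (or .theta) file as a matrix of floats, assuming
# whitespace-separated values, one row per line. Sample data is hypothetical.
sample_phi = """0.7 0.2 0.1
0.1 0.3 0.6
"""

phi = [[float(x) for x in line.split()] for line in sample_phi.splitlines()]

# In .phi each row is a distribution over the vocabulary for one topic,
# so each row should sum to approximately 1.
for row in phi:
    assert abs(sum(row) - 1.0) < 1e-6

print(len(phi), "topics x", len(phi[0]), "words")  # 2 topics x 3 words
```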

 

<model_name>.tassign: This file contains the topic assignments for words in the training data. Each line is a document consisting of a list of <word_ij>:<topic of word_ij> pairs.
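
One line of a .tassign file can be split into (word ID, topic ID) pairs as follows. This is a sketch based only on the format described above; the sample line is made up:

```python
# Sketch: parse one line of a .tassign file, assuming each token is
# <wordID>:<topicID> as documented above. The sample line is hypothetical.
line = "0:12 5:12 3:7 0:12"

assignments = [tuple(map(int, tok.split(":"))) for tok in line.split()]
print(assignments)  # [(0, 12), (5, 12), (3, 7), (0, 12)]
```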


<model_name>.twords: This file contains the twords most likely words of each topic. twords is specified in the command line (see Sections 3.1.1 and 3.1.2).

 

GibbsLDA++ also saves a file called wordmap.txt that contains the map between words and their integer IDs. This is because GibbsLDA++ works internally with integer IDs of words/terms instead of text strings.
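
The word map can be loaded back into a dictionary to translate IDs in .tassign or .phi columns into words. A minimal sketch, assuming wordmap.txt starts with the vocabulary size followed by one "word id" pair per line (this layout, and the sample data, are assumptions, not taken from the text above):

```python
# Sketch: load a wordmap.txt into dicts. Assumed format: first line is the
# vocabulary size, then one "word id" pair per line. Sample data is made up.
sample = """3
topic 0
model 1
word 2
"""

lines = sample.splitlines()
nwords = int(lines[0])
word2id = dict((w, int(i)) for w, i in (line.split() for line in lines[1:]))
id2word = {i: w for w, i in word2id.items()}

assert len(word2id) == nwords
print(id2word[1])  # model
```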

 

         Now, suppose we want to continue running another 800 Gibbs sampling iterations from the previously estimated model model-01000, with savestep = 100 and twords = 30. We run the following command:

 

         $ src/lda -estc -dir models/casestudy/ -model model-01000 -niters 800 -savestep 100 -twords 30

 

         Now look at the casestudy directory to see the outputs.

 

         Now, if we want to do inference (30 Gibbs sampling iterations) for the new data newdocs.dat using a previously estimated LDA model, say model-01800, we run the following command:

 

         $ src/lda -inf -dir models/casestudy/ -model model-01800 -niters 30 -twords 20 -dfile newdocs.dat

         Now, look at the casestudy directory; we can see the outputs of the inference:

 

newdocs.dat.others

newdocs.dat.phi

newdocs.dat.tassign

newdocs.dat.theta

newdocs.dat.twords
