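The previous few posts covered document topic models, but my primary task is still classification. The main reason for bringing in topic models is text representation: a topic model can substantially reduce the dimensionality of document vectors. Topic models can do much more than that, but for a classification task I think this is about all that is needed.
LDA, discussed earlier, is a fairly mature document topic model. This time I used the open-source JGibbLDA implementation for the LDA work. It is simple to use; usage instructions are on its official site, or you can search for them.
Following the parameter and format conventions on the official site, you can train on a corpus and produce the results. The following files are generated: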
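- model-final.twords: topic-word, the most likely words for each topic
- model-final.others: LDA parameters of the model, such as alpha, beta, and the number of topics
- model-final.phi: the topic-word distributions, a (number of topics) × (vocabulary size) matrix
- model-final.tassign: the topic assigned to each word of each training document
- model-final.theta: the document-topic distributions, one line per document; this is the file we need
- wordmap.txt: the mapping between words and their integer IDs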
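What we need is model-final.theta, which serves as the input document vectors for the neural network classifier.
Now for the experiment:
Corpus: 20_newsgroups, news posts in 20 categories, split 1:1 into a training set and a test set
Environment: JDK 1.8, Windows 7
LDA tool: JGibbLDA (open source)
Classifier: a simple three-layer BP (backpropagation) neural network of size 100*300*20, built with JOONE
First, the corpus is preprocessed to remove stop words and irrelevant tokens (dates, years, e-mail addresses, and so on). No stemming was applied in this experiment. I first tried Lucene's stemmer, but its output was poor: it stems "does" to "doe" and "integrate" to "integr", which defeats the purpose. I then tried Stanford CoreNLP, whose results were good, but because its processing is context-based it was too slow to be practical. In the end I skipped stemming altogether.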
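The preprocessing step described above might look roughly like the following. This is only a minimal sketch: the class name `Preprocessor`, the tiny stop-word list, and the rule "keep only purely alphabetic tokens" are assumptions for illustration, not the code used in the experiment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class Preprocessor {
    // Tiny illustrative stop-word list (assumption); a real run would load a full list.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "a", "an", "of", "and", "to", "in", "is", "on", "at"));

    /** Lower-cases the text and drops e-mail addresses, non-alphabetic tokens and stop words. */
    static List<String> clean(String text) {
        List<String> kept = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (raw.contains("@")) {
                continue; // treat anything containing '@' as an e-mail address
            }
            // Strip punctuation from both ends of the token.
            String token = raw.replaceAll("^[^a-z0-9]+|[^a-z0-9]+$", "");
            if (!token.matches("[a-z]+")) {
                continue; // dates, years and other tokens containing digits or symbols
            }
            if (STOP_WORDS.contains(token)) {
                continue;
            }
            kept.add(token);
        }
        return kept;
    }
}
```

Calling `Preprocessor.clean(...)` on each post and joining the tokens with spaces gives one line of the bag-of-words training file.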
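Because LDA does not work well on short texts, I filtered the corpus and kept only articles whose text length is greater than 5000. This threshold is my own choice and has no particular justification. After filtering, the training set shrank to 126 documents and the test set to 121 (previously each had 9,500). PS: this experiment is only meant to test LDA's text representation performance, so testing on a small subset is enough for that purpose.
The processed training file trainScale looks like this (only three lines are listed here; the full resources are at the link below):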
126
archive atheism resources alt atheism archive resources modified december version atheist resources addresses atheist organizations usa freedom religion foundation darwin fish bumper stickers assorted atheist paraphernalia freedom religion foundation write ffrf box madison wi telephone evolution designs evolution designs sell darwin fish fish symbol christians stick cars feet word darwin written inside deluxe moulded plastic fish postpaid write evolution designs laurel canyon north hollywood san francisco bay area darwin fish lynn gold mailing net lynn directly price fish american atheist press aap publish atheist books critiques bible lists biblical contradictions book bible handbook ball foote american atheist press isbn edition bible contradictions absurdities atrocities immoralities ball foote bible contradicts aap based king james version bible write american atheist press box austin tx cameron road austin tx telephone fax prometheus books sell books including haught holy horrors write east amherst street buffalo york telephone alternate address newer older prometheus books glenn drive buffalo ny african americans humanism organization promoting black secular humanism uncovering history black freethought publish quarterly newsletter aah examiner write norm allen jr african americans humanism box buffalo ny united kingdom rationalist press association national secular society islington high street holloway road london ew london nl british humanist association south place ethical society lamb conduit passage conway hall london wc rh red lion square london wc rl fax national secular society publish freethinker monthly magazine founded germany ibka internationaler bund der konfessionslosen und atheisten postfach berlin germany ibka publish journal miz materialien und informationen zur zeit politisches journal der konfessionslosesn und atheisten hrsg ibka miz vertrieb postfach berlin germany atheist books write ibdk internationaler ucherdienst der konfessionslosen 
postfach hannover germany telephone books fiction thomas disch santa claus compromise short story ultimate proof santa exists characters events fictitious similarity living dead gods uh walter miller jr canticle leibowitz gem atomic doomsday novel monks spent lives copying blueprints saint leibowitz filling sheets paper ink leaving white lines letters edgar pangborn davy atomic doomsday novel set clerical church example forbids produce describe substance atoms philip dick philip dick dick wrote philosophical thought provoking short stories novels stories bizarre times approachable wrote sf wrote truth religion technology believed met sort god remained sceptical novels relevance galactic pot healer fallible alien deity summons group earth craftsmen women remote planet raise giant cathedral beneath oceans deity demand faith earthers pot healer joe fernwright unable comply polished ironic amusing novel maze death noteworthy description technology based religion valis schizophrenic hero searches hidden mysteries gnostic christianity reality fired brain pink laser beam unknown divine origin accompanied dogmatic dismissively atheist friend assorted odd characters divine invasion god invades earth making young woman pregnant returns star system terminally ill assisted dead man brain wired hour listening music margaret atwood handmaid tale story based premise congress mysteriously assassinated fundamentalists charge nation set book diary woman life live christian theocracy women property revoked bank accounts closed sinful luxuries outlawed radio readings bible crimes punished retroactively doctors performed legal abortions hunted hanged atwood writing style difficult tale grows chilling authors bible dull rambling work criticized worth reading ll fuss exists versions true version books fiction peter de rosa vicars christ bantam press de rosa christian catholic enlighting history papal immoralities adulteries fallacies german translation gottes erste diener die dunkle 
seite des papsttums droemer knaur michael martin atheism philosophical justification temple university press philadelphia usa detailed scholarly justification atheism outstanding appendix defining terminology usage tendentious area argues negative atheism belief existence god positive atheism belief existence god includes refutations challenging arguments god attention paid refuting contempory theists platinga swinburne isbn hardcover paperback case christianity temple university press comprehensive critique christianity considers contemporary defences christianity ultimately demonstrates unsupportable incoherent isbn james turner god creed johns hopkins university press baltimore md usa subtitled origins unbelief america examines unbelief agnostic atheistic mainstream alternative view focusses period considering france britain emphasis american england developments religious history secularization atheism god creed intellectual history fate single idea belief god exists isbn hardcover paper george seldes editor thoughts ballantine books york usa dictionary quotations kind concentrating statements writings explicitly implicitly person philosophy view includes obscure suppressed opinions popular observations traces expressed twisted idea centuries number quotations derived cardiff men religion noyes views religion isbn paper richard swinburne existence god revised edition clarendon paperbacks oxford book second volume trilogy began coherence theism concluded faith reason work swinburne attempts construct series inductive arguments existence god arguments tendentious rely imputation late century western christian values aesthetics god supposedly simple conceived decisively rejected mackie miracle theism revised edition existence god swinburne includes appendix incoherent attempt rebut mackie mackie miracle theism oxford posthumous volume comprehensive review principal arguments existence god ranges classical philosophical positions descartes anselm berkeley hume al 
moral arguments newman kant sidgwick restatements classical theses plantinga swinburne addresses positions push concept god realm rational kierkegaard kung philips replacements god lelie axiarchism book delight read formalistic written martin works refreshingly direct compared hand waving swinburne james haught holy horrors illustrated history religious murder madness prometheus books religious persecution ancient times christians library congress catalog card number norm allen jr african american humanism anthology listing african americans humanism gordon stein anthology atheism rationalism prometheus books anthology covering wide range subjects including devil evil morality history freethought comprehensive bibliography edmund cohen mind bible believer prometheus books study christian fundamentalists net resources small mail based archive server mantis uk carries archives alt atheism moderated articles assorted files send mail archive uk send atheism mail reply mathew ?
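Each line represents one document, and the words on the line are that document's words. This is a bag-of-words model, so word order has no effect on the result.
The 126 on the first line means there are 126 documents.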
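The length filter and the file layout just described (document count on the first line, then one document per line) can be sketched as follows. The class `CorpusWriter` is hypothetical, and measuring "length" in characters is an assumption, since the post does not state the unit.

```java
import java.util.ArrayList;
import java.util.List;

final class CorpusWriter {
    /**
     * Keeps documents longer than minLength characters (assumed unit) and renders
     * them in the layout described above: the count first, then one document per line.
     */
    static String toLdaInput(List<String> docs, int minLength) {
        List<String> kept = new ArrayList<>();
        for (String doc : docs) {
            if (doc.length() > minLength) {
                // A document must stay on a single line, so collapse all whitespace.
                kept.add(doc.replaceAll("\\s+", " ").trim());
            }
        }
        StringBuilder sb = new StringBuilder();
        sb.append(kept.size()).append('\n');
        for (String doc : kept) {
            sb.append(doc).append('\n');
        }
        return sb.toString();
    }
}
```

The returned string can then be written to a file such as trainScale, for example with `java.nio.file.Files.write`.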
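Then we run LDA on this training file. The main code is:
// Requires: import jgibblda.Estimator; import jgibblda.LDACmdOption;
public void lda() {
    LDACmdOption ldaOption = new LDACmdOption();
    ldaOption.est = true;                           // estimate a new model from scratch
    ldaOption.K = 100;                              // 100 topics
    ldaOption.beta = 0.1;                           // beta hyperparameter
    ldaOption.alpha = 10.0 / ldaOption.K;           // alpha hyperparameter
    ldaOption.niters = 500;                         // number of Gibbs sampling iterations
    ldaOption.savestep = 200;                       // save the model every 200 iterations
    ldaOption.modelName = "model-train";            // model name
    ldaOption.dir = "D:\\J2ee_workspace\\LDATest";  // directory containing the training file
    ldaOption.dfile = "trainScale";                 // training file
    Estimator estimator = new Estimator();
    estimator.init(ldaOption);
    estimator.estimate();                           // start parameter estimation
}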
The parameters are explained in the code comments. Part of the resulting model-final.theta is shown below:
1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.0012087912087912088;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.3045054945054945;0.002307692307692308;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.0012087912087912088;0.0012087912087912088;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.0012087912087912088;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.004505494505494505;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.5671428571428572;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.017692307692307695;0.0078021978021978015;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.002307692307692308;1.0989010989010989E-4;1.0989010989010989E-4;0.0012087912087912088;0.027582417582417584;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.02208791208791209;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.0012087912087912088;1.0989010989010989E-4;0.01989010989010989;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.007802197802197
8015;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;
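Note that I modified part of the JGibbLDA code so that its output matches the format my neural network classifier expects. The first 20 columns above encode the class: the position of the 1 marks the class, so the row above belongs to class 1. The columns after the first 20 are the document's topic distribution; I used 100 topics, so there are 100 numbers.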
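Reading one row of this modified format back into a class index and a topic vector could look like the sketch below. The class `LabeledVector` and its `parse` method are hypothetical names; the only things taken from the post are the `;` separator and the layout of label columns followed by topic probabilities.

```java
final class LabeledVector {
    final int label;        // zero-based position of the 1 in the label columns
    final double[] topics;  // topic distribution that follows the label columns

    LabeledVector(int label, double[] topics) {
        this.label = label;
        this.topics = topics;
    }

    /** Parses "l1;l2;...;lC;t1;t2;...;tK;" where exactly one label column is 1. */
    static LabeledVector parse(String line, int numClasses) {
        // String.split drops trailing empty strings, so the final ';' is harmless.
        String[] parts = line.trim().split(";");
        if (parts.length <= numClasses) {
            throw new IllegalArgumentException("line has no topic columns");
        }
        int label = -1;
        for (int i = 0; i < numClasses; i++) {
            if (Double.parseDouble(parts[i]) == 1.0) {
                label = i;
            }
        }
        if (label < 0) {
            throw new IllegalArgumentException("no label column is set to 1");
        }
        double[] topics = new double[parts.length - numClasses];
        for (int i = 0; i < topics.length; i++) {
            topics[i] = Double.parseDouble(parts[numClasses + i]);
        }
        return new LabeledVector(label, topics);
    }
}
```

For the rows shown in this post, `numClasses` would be 20 and the resulting `topics` array would have 100 entries.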
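With the LDA model trained on the training texts, we can generate vectors for the test documents. There are several ways to do this. The simplest is to add the test documents to the training documents and run LDA again from scratch. That is obviously time-consuming, so I do not recommend it, although it may be worth trying when there are many test documents and few training documents. The usual approach is the second one: for new documents, generate their vectors on top of the model learned from the training documents. Typically this means running Gibbs sampling only on the new documents while keeping the model's topic-word distributions fixed. JGibbLDA makes this easy:
// Requires: import jgibblda.Inferencer; import jgibblda.LDACmdOption; import jgibblda.Model;
public void generateWithLDAModel() {
    LDACmdOption ldaOption = new LDACmdOption();
    ldaOption.inf = true;                                // inference mode
    ldaOption.estc = false;
    ldaOption.dir = "D:\\J2ee_workspace\\LDATest";
    ldaOption.modelName = "model-final";                 // model trained on the training documents; its files must be in the root directory
    ldaOption.dfile = "testScale";                       // test document file
    Inferencer inferencer = new Inferencer();
    inferencer.init(ldaOption);
    Model newModel = inferencer.inference();
    newModel.saveModelTheta("./vector/test/testScale");  // where the new document vectors are written
}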
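To make the idea of "sampling only the new document against a fixed model" concrete, here is a plain-Java sketch that does not use JGibbLDA at all. It assumes the topic-word matrix `phi` (K topics × V words) is given and held constant; `FoldInSampler` and its parameters are hypothetical, and JGibbLDA's own Inferencer may differ in its details.

```java
import java.util.Random;

final class FoldInSampler {
    /**
     * Estimates the topic distribution of one unseen document.
     * words: word IDs of the document; phi[k][w]: fixed probability of word w under topic k.
     */
    static double[] inferTheta(int[] words, double[][] phi, double alpha, int iterations, long seed) {
        int numTopics = phi.length;
        Random rng = new Random(seed);
        int[] z = new int[words.length];         // topic assigned to each token
        int[] topicCount = new int[numTopics];   // tokens per topic in this document

        // Random initial assignment.
        for (int i = 0; i < words.length; i++) {
            z[i] = rng.nextInt(numTopics);
            topicCount[z[i]]++;
        }

        double[] p = new double[numTopics];
        for (int it = 0; it < iterations; it++) {
            for (int i = 0; i < words.length; i++) {
                topicCount[z[i]]--; // take the token out before resampling it
                double total = 0.0;
                for (int k = 0; k < numTopics; k++) {
                    p[k] = phi[k][words[i]] * (topicCount[k] + alpha);
                    total += p[k];
                }
                // Draw a topic with probability proportional to p[k].
                double u = rng.nextDouble() * total;
                int k = 0;
                double acc = p[0];
                while (u >= acc && k < numTopics - 1) {
                    k++;
                    acc += p[k];
                }
                z[i] = k;
                topicCount[k]++;
            }
        }

        // Smoothed estimate of the document-topic distribution.
        double[] theta = new double[numTopics];
        for (int k = 0; k < numTopics; k++) {
            theta[k] = (topicCount[k] + alpha) / (words.length + numTopics * alpha);
        }
        return theta;
    }
}
```

Because only the per-document counts change, the cost depends on the length of the new document rather than on the size of the training corpus, which is why this is much cheaper than retraining.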
The resulting test document vector file looks like this (only a few lines are listed):
1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.001765650080256822;1.6051364365971107E-4;0.001765650080256822;0.004975922953451044;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.4158908507223114;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.07078651685393259;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.0033707865168539327;1.6051364365971107E-4;0.09486356340288925;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.001765650080256822;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.38218298555377206;1.6051364365971107E-4;1.6051364365971
107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.001765650080256822;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.006581059390048154;
The format has the same meaning as the training document vectors above.
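The many identical tiny values in these vectors come from the smoothing in the theta estimate, theta_k = (n_k + alpha) / (N + K * alpha), where n_k is the number of tokens of the document assigned to topic k and N is the document length. The sketch below reproduces a few of the printed numbers using alpha = 10/K = 0.1 and K = 100 from the training code. The document lengths 900 and 613 are inferred from the printed values and are not stated anywhere in the post; `ThetaSmoothing` is a hypothetical helper.

```java
final class ThetaSmoothing {
    /** theta_k = (n_k + alpha) / (N + K * alpha) */
    static double theta(int topicCount, int docLength, int numTopics, double alpha) {
        return (topicCount + alpha) / (docLength + numTopics * alpha);
    }

    public static void main(String[] args) {
        // Training row: a topic with no tokens, and a topic with exactly one token.
        System.out.println(theta(0, 900, 100, 0.1)); // about 1.0989e-4
        System.out.println(theta(1, 900, 100, 0.1)); // about 1.2088e-3
        // Test row: a topic with no tokens.
        System.out.println(theta(0, 613, 100, 0.1)); // about 1.6051e-4
    }
}
```

Read this way, the largest entry of the training row, 0.5671428571428572, corresponds to 516 of the 900 tokens being assigned to a single topic.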
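With these files in hand, they can be fed to the JOONE neural network classifier (the simple three-layer 100*300*20 BP network):
The classification result is as follows:
Of the 121 test cases, 100 were classified correctly, an accuracy of roughly 81% by my count (100/121 works out to about 82.6%). I find this result acceptable. It may well be worse than a simple tf-idf + SVM model, but the point of this experiment was to find out whether LDA's dimensionality reduction is viable for classification. For document vectors with only 100 dimensions, an accuracy a little above 80% is something I can just about accept.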
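For completeness, the accuracy figure can be computed from the network's outputs by taking the most active of the 20 output units as the predicted class. This is a generic sketch independent of JOONE; `Evaluation`, `outputs` and `labels` are hypothetical names.

```java
final class Evaluation {
    /** Index of the largest value, i.e. the predicted class. */
    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Fraction of rows whose strongest output unit matches the true class index. */
    static double accuracy(double[][] outputs, int[] labels) {
        if (outputs.length != labels.length || labels.length == 0) {
            throw new IllegalArgumentException("outputs and labels must be non-empty and the same size");
        }
        int correct = 0;
        for (int i = 0; i < labels.length; i++) {
            if (argMax(outputs[i]) == labels[i]) {
                correct++;
            }
        }
        return (double) correct / labels.length;
    }
}
```

Here each row of `outputs` would hold the 20 output activations for one test document, and `labels` the class index taken from the first 20 columns of its vector.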