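MALLET is used mainly for text classification, and its design reflects that focus.
I need its maximum entropy and naive Bayes algorithms, so I had to study it.
Home page: http://mallet.cs.umass.edu/index.php
Reference articles: http://mallet.cs.umass.edu/classifier-devel.php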
http://mallet.cs.umass.edu/import-devel.php
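There is not much material online, so I had to work through the official guides and the API docs myself, and then read the source code.
My goal is to import MALLET into my own Java project (I use Eclipse) and then use its algorithms, naive Bayes and maximum entropy, flexibly for text classification.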
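Importing into the project:
Download link: http://mallet.cs.umass.edu/download.php (the latest version at the time of writing was 2.0.7).
From the extracted archive, copy the src folder and the jar files under lib into your project, then add the jars to the project's build path.
In my final project layout, src holds my own classes,
malletSrc holds the MALLET source code,
and the mallet folder holds the corresponding jar files.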
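Below are my study notes:
How each class is used can only be worked out from the API docs, the source code, and your own tests.
Here are some test examples.
To create an Instance you need to supply the four fields described below. REF: http://mallet.cs.umass.edu/import-devel.php
There seem to be many subclasses; I only looked at the few that I needed.
The comment in the source code: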
An instance contains four generic fields of predefined name:
"data", "target", "name", and "source". "Data" holds the data represented
by the instance, "target" is often a label associated with the instance,
"name" is a short identifying name for the instance (such as a filename),
and "source" is human-readable source information (such as the original text).
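About Data:
Data needs an Alphabet and a FeatureVector working together: the Alphabet stores the name of each attribute, and the FeatureVector stores an object's value for each attribute.
Test code 1: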
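    // Requires cc.mallet.types.Alphabet and cc.mallet.types.FeatureVector.
    public static void main(String[] args) {
        // Attribute names: 长 = length, 宽 = width, 高 = height.
        String[] attributeStr = new String[]{"长", "宽", "高"};
        Alphabet dict = new Alphabet(attributeStr);
        // One value per attribute, in alphabet order.
        double[] values = new double[]{1, 2, 3};
        FeatureVector vector = new FeatureVector(dict, values);
        System.out.println(vector.toString());
    }
Output: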
长(0)=1.0
宽(1)=2.0
高(2)=3.0
We can also specify which attribute each value belongs to by index, starting from 0: 长 (length) is 0, 宽 (width) is 1, and 高 (height) is 2. A test:
    public static void main(String[] args) {
        String[] attributeStr = new String[]{"长", "宽", "高"};
        Alphabet dict = new Alphabet(attributeStr);
        double[] values = new double[]{1, 2, 3};
        // indices[i] is the attribute index that values[i] belongs to.
        int[] indices = new int[]{2, 0, 1};
        FeatureVector vector = new FeatureVector(dict, indices, values);
        System.out.println(vector.toString());
    }
Output:
长(0)=2.0
宽(1)=3.0
高(2)=1.0
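One thing to note: if the indices contain duplicates, for example if the values 2 and 3 are both assigned to 长, the values are summed rather than overwritten, giving 5. This is the same effect as counting words.
    String[] attributeStr = new String[]{"长", "宽", "高"};
    Alphabet dict = new Alphabet(attributeStr);
    double[] values = new double[]{1, 2, 3};
    // Index 0 appears twice, so its values (2 and 3) are added together.
    int[] indices = new int[]{2, 0, 0};
    FeatureVector vector = new FeatureVector(dict, indices, values);
    System.out.println(vector.toString());
Output: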
长(0)=5.0
高(2)=1.0
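The accumulation rule can be mimicked in plain Java without MALLET. The sketch below is only an illustration of the observed behavior, not MALLET's actual implementation; the class name `DuplicateIndexSketch` and the helper `accumulate` are made up for this note.

```java
// Plain-Java illustration only; this is not MALLET code.
class DuplicateIndexSketch {

    /** Sums values that share an index into a dense array of the given size. */
    static double[] accumulate(int size, int[] indices, double[] values) {
        double[] dense = new double[size];
        for (int i = 0; i < indices.length; i++) {
            // += rather than =, so repeated indices add up instead of overwriting.
            dense[indices[i]] += values[i];
        }
        return dense;
    }

    public static void main(String[] args) {
        String[] names = {"长", "宽", "高"};
        double[] dense = accumulate(names.length, new int[]{2, 0, 0}, new double[]{1, 2, 3});
        // Print only the non-zero entries, in index order.
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0) {
                System.out.println(names[i] + "(" + i + ")=" + dense[i]);
            }
        }
    }
}
```

Run on the same indices and values as the test above, it prints the same two lines. If every occurrence of a word contributes a value of 1 at that word's index, the summed entries are simply word counts.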
OK, that settles Data: a FeatureVector is the data I need.
Source: I just set it to null.
Label:
/** You should never call this directly. New Label objects are
created on-demand by calling LabelAlphabet.lookupIndex(obj). */
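The comment above is from the source code: a Label has to be created through a LabelAlphabet, so I looked at LabelAlphabet next and ran the following test.
    // Requires cc.mallet.types.Label and cc.mallet.types.LabelAlphabet.
    public static void main(String[] args) {
        LabelAlphabet labels = new LabelAlphabet();
        // lookupLabel returns the Label for this entry; 桌子 = table.
        Label label = labels.lookupLabel("桌子");
        System.out.println(label.toString());
    }
The output is 桌子 (table), so Label is settled as well.
Name: this serves as the instance's ID, so I simply use an integer sequence number.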
With all four fields settled, we can create Instances and add them to an InstanceList. After that, classification can follow http://mallet.cs.umass.edu/classifier-devel.php
The classification test code is as follows:
    import cc.mallet.classify.Classifier;
    import cc.mallet.classify.ClassifierTrainer;
    import cc.mallet.classify.MaxEntTrainer;
    import cc.mallet.types.Alphabet;
    import cc.mallet.types.FeatureVector;
    import cc.mallet.types.Instance;
    import cc.mallet.types.InstanceList;
    import cc.mallet.types.LabelAlphabet;
    import cc.mallet.types.Labeling;

    public class test {
        String label;   // class of the sample
        double length;  // length
        double width;   // width
        double high;    // height

        public test(String label, double length, double width, double high) {
            this.label = label;
            this.length = length;
            this.width = width;
            this.high = high;
        }

        public static void main(String[] args) {
            LabelAlphabet labels = new LabelAlphabet();
            // Attribute names: 长 = length, 宽 = width, 高 = height.
            String[] attributeName = new String[]{"长", "宽", "高"};
            Alphabet dic = new Alphabet(attributeName);
            labels.lookupIndex("桌子");  // table
            labels.lookupIndex("椅子");  // chair
            InstanceList list = new InstanceList(dic, labels);
            int id = 0;

            // Build the training set: 100 identical tables and 100 identical chairs.
            for (int i = 0; i < 100; ++i) {
                test temp = new test("桌子", 4, 2, 3);
                test temp2 = new test("椅子", 0, 0, 0);

                double[] tableValues = new double[]{temp.length, temp.width, temp.high};
                FeatureVector vec = new FeatureVector(dic, tableValues);
                Instance ins = new Instance(vec, labels.lookupLabel(temp.label), ++id, null);
                list.add(ins);

                double[] chairValues = new double[]{temp2.length, temp2.width, temp2.high};
                vec = new FeatureVector(dic, chairValues);
                ins = new Instance(vec, labels.lookupLabel(temp2.label), ++id, null);
                list.add(ins);
            }

            // Create an unlabeled test sample (its label string is never used).
            test testTemp = new test("未知", 0, 0, 2);
            double[] testValues = new double[]{testTemp.length, testTemp.width, testTemp.high};
            FeatureVector vec = new FeatureVector(dic, testValues);
            Instance testIns = new Instance(vec, null, ++id, null);

            // Train a maximum entropy classifier and classify the test sample.
            ClassifierTrainer trainer = new MaxEntTrainer();
            Classifier classifier = trainer.train(list);
            Labeling labeling = classifier.classify(testIns).getLabeling();
            System.out.println(labeling.getBestLabel().toString());
        }
    }
Output:
The classification result is 椅子 (chair).
As for the exception printed to the console during training, its own message notes: (This is not necessarily cause for alarm. Sometimes this happens close to the maximum, where the function may be very flat.)
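Since naive Bayes was the other algorithm I wanted, here is a rough sketch of the same experiment with the trainer swapped for NaiveBayesTrainer, also printing the score of every label instead of only the best one. It assumes the MALLET 2.0.7 jars are on the classpath. The class name `NaiveBayesSketch` and the chair values (1, 1, 4) are made up for this sketch; I changed them from (0, 0, 0) on the assumption that an all-zero vector gives a count-based model nothing to learn from. I have not verified the scores it prints, so treat it as a starting point rather than a tested recipe.

```java
import cc.mallet.classify.Classifier;
import cc.mallet.classify.NaiveBayesTrainer;
import cc.mallet.types.Alphabet;
import cc.mallet.types.FeatureVector;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;
import cc.mallet.types.LabelAlphabet;
import cc.mallet.types.Labeling;

// Sketch only: same setup as the MaxEnt test above, with a different trainer.
class NaiveBayesSketch {

    /** Trains on toy table/chair data and returns the labeling for testValues. */
    static Labeling classify(double[] testValues) {
        Alphabet dic = new Alphabet(new String[]{"长", "宽", "高"});
        LabelAlphabet labels = new LabelAlphabet();
        InstanceList list = new InstanceList(dic, labels);
        int id = 0;
        for (int i = 0; i < 100; ++i) {
            // Hypothetical training values, chosen only for illustration.
            list.add(new Instance(new FeatureVector(dic, new double[]{4, 2, 3}),
                    labels.lookupLabel("桌子"), ++id, null));
            list.add(new Instance(new FeatureVector(dic, new double[]{1, 1, 4}),
                    labels.lookupLabel("椅子"), ++id, null));
        }
        Classifier classifier = new NaiveBayesTrainer().train(list);
        Instance testIns = new Instance(new FeatureVector(dic, testValues), null, ++id, null);
        return classifier.classify(testIns).getLabeling();
    }

    public static void main(String[] args) {
        Labeling labeling = classify(new double[]{0, 0, 2});
        // Walk the labels from best to worst and print each score.
        for (int rank = 0; rank < labeling.numLocations(); rank++) {
            System.out.println(labeling.getLabelAtRank(rank) + " " + labeling.getValueAtRank(rank));
        }
    }
}
```

Only the trainer line differs from the maximum entropy version, which is the main convenience of building the InstanceList by hand: any ClassifierTrainer can be trained on the same list.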
That is enough for now. This basically covers what I need, so I will leave the notes here.