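Anyone who has taken a data mining or machine learning course has probably heard of Weka, an open-source package that implements many clustering and classification algorithms. If you have never used it, this article is a good introduction and covers most of what you need to know: http://blog.csdn.net/yangliuy/article/details/7589306
Often, though, we don't want to use the application directly. We want to take its algorithms and integrate them into our own system, which is what people usually mean by secondary development. Because Weka is open source, this is quite convenient. Here is a brief walkthrough.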
First, install Weka locally and find the jar file in the installation directory.
Then create a new project in Eclipse and add the Weka jar to it.
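Once the jar has been added, a quick way to confirm that the project can actually see Weka is to try loading one of its classes by name. The sketch below is only an illustration: `ClasspathCheck` and `isOnClasspath` are names made up for this post, and `weka.core.Instances` is used as the probe simply because the programs in this post import it.

```java
/** Hypothetical helper (not part of Weka): checks whether a class can be loaded. */
class ClasspathCheck {

    /** Returns true if the named class is visible on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // weka.core.Instances is the data set class used by the programs in this post
        boolean found = isOnClasspath("weka.core.Instances");
        System.out.println(found
                ? "Weka is on the classpath."
                : "Weka was not found; check that the jar is on the build path.");
    }
}
```

If this reports that Weka was not found, the jar is most likely missing from the project's build path.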
After that you are free to use any of the algorithms Weka provides. Here I use the EM algorithm to cluster users:
    import java.io.BufferedReader;
    import java.io.FileReader;

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.EM;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    /**
     * This class shows how to perform a "classes-to-clusters"
     * evaluation like in the Explorer using EM. The class needs as
     * first parameter an ARFF file to work on. The last attribute is
     * interpreted as the class attribute.
     * <p/>
     * This code is based on the method "startClusterer" of the
     * "weka.gui.explorer.ClustererPanel" class and the
     * "evaluateClusterer" method of the "weka.clusterers.ClusterEvaluation"
     * class.
     *
     * @author FracPete (fracpete at waikato dot ac dot nz)
     */
    public class ClassesToClusters {
        public static void main(String[] args) throws Exception {
            // use the first argument as the ARFF file, falling back to the path used in this post
            String arffFile = args.length > 0 ? args[0] : "C:/Users/felven/Desktop/1.arff";

            // load data (the reader is closed automatically)
            Instances data;
            try (BufferedReader reader = new BufferedReader(new FileReader(arffFile))) {
                data = new Instances(reader);
            }
            data.setClassIndex(data.numAttributes() - 1);

            // generate data for clusterer (w/o class)
            Remove filter = new Remove();
            filter.setAttributeIndices("" + (data.classIndex() + 1));
            filter.setInputFormat(data);
            Instances dataClusterer = Filter.useFilter(data, filter);

            // train clusterer
            EM clusterer = new EM();
            // set further options for EM, if necessary...
            String[] options = new String[4];
            // max. iterations
            options[0] = "-I";
            options[1] = "100";
            // number of clusters
            options[2] = "-N";
            options[3] = "3";
            clusterer.setOptions(options);
            clusterer.buildClusterer(dataClusterer);

            // evaluate clusterer
            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(clusterer);
            eval.evaluateClusterer(data);

            // print results
            System.out.println(eval.clusterResultsToString());
        }
    }
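The option array in the program above is laid out as flag/value pairs: each flag is followed by its value as a separate string. A minimal sketch of a helper that builds such an array is shown here. `EmOptions` and `build` are hypothetical names for this post, not part of Weka, and only the two flags already used above (`-I` for the maximum number of iterations, `-N` for the number of clusters) are assumed.

```java
import java.util.Arrays;

/** Hypothetical helper for this post; not part of Weka. */
class EmOptions {

    /** Builds {"-I", maxIterations, "-N", numClusters} as flag/value pairs. */
    static String[] build(int maxIterations, int numClusters) {
        return new String[] {
            "-I", Integer.toString(maxIterations), // max. iterations
            "-N", Integer.toString(numClusters)    // number of clusters
        };
    }

    public static void main(String[] args) {
        // Same settings as the program above: 100 iterations, 3 clusters
        System.out.println(Arrays.toString(build(100, 3))); // prints [-I, 100, -N, 3]
    }
}
```

The returned array can be passed to `clusterer.setOptions(...)` in place of the hand-filled one, which avoids index mistakes when more options are added.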
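Of course, we can also do something slightly more involved, such as writing the results to a file. To keep things tidy, the program is split into separate methods, one per step.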
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.EM;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class UserBehaviourCluster {
        private Instances instances;
        private final String sourceFile;
        private final String targetFile;
        private final int maxCluster;
        private int[] assignment;

        public UserBehaviourCluster(String sourceFile, String targetFile, int maxCluster) {
            this.sourceFile = sourceFile;
            this.targetFile = targetFile;
            this.maxCluster = maxCluster;
        }

        /** Loads the source file into a Weka data set. */
        public void loadData() throws Exception {
            DataSource dataSource = new DataSource(sourceFile);
            instances = dataSource.getDataSet();
        }

        /** Runs EM and stores the cluster index of every instance in assignment. */
        public void cluster() throws Exception {
            // set the weight of the first attribute to 0
            instances.attribute(0).setWeight(0);

            EM clusterer = new EM();
            // set further options for EM, if necessary...
            String[] options = new String[4];
            // max. iterations
            options[0] = "-I";
            options[1] = "100";
            // number of clusters
            options[2] = "-N";
            options[3] = Integer.toString(maxCluster);
            clusterer.setOptions(options);
            clusterer.buildClusterer(instances);

            // evaluate clusterer
            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(clusterer);
            eval.evaluateClusterer(instances);
            // print results
            // System.out.println(eval.clusterResultsToString());

            assignment = new int[instances.numInstances()];
            for (int j = 0; j < instances.numInstances(); j++) {
                assignment[j] = clusterer.clusterInstance(instances.instance(j));
                System.out.println(assignment[j]);
            }
        }

        /** Copies the source file to the target file, appending a cluster column. */
        public void writeResult() throws Exception {
            // both streams are closed automatically, even if an exception is thrown
            try (BufferedReader br = new BufferedReader(new FileReader(sourceFile));
                 BufferedWriter bw = new BufferedWriter(new FileWriter(targetFile))) {
                String line = br.readLine(); // header row
                if (line == null) {
                    return; // empty source file, nothing to write
                }
                bw.write(line + ",cluster\n");
                int j = 0;
                // stop at the last assignment so trailing blank lines cannot overrun the array
                while ((line = br.readLine()) != null && j < assignment.length) {
                    bw.write(line + "," + assignment[j++] + "\n");
                }
            }
        }

        public void process() throws Exception {
            loadData();
            cluster();
            writeResult();
        }

        public static void main(String[] args) throws Exception {
            UserBehaviourCluster cluster = new UserBehaviourCluster(
                    "C:/Users/felven/Desktop/test3.csv",
                    "C:/Users/felven/Desktop/test4.csv",
                    2);
            cluster.process();
        }
    }
The input is a CSV file whose first line is a header row, with one record on each of the remaining lines.
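The output is a copy of the input with one extra column, cluster, which holds the cluster index assigned to each record.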
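To make the two formats concrete, here is a small standalone sketch of the transformation that `writeResult` performs, run on in-memory lines instead of files. The column names, values, and cluster indices are invented purely for illustration, and `CsvClusterColumn` and `appendClusterColumn` are hypothetical names, not part of the program above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of the input and output CSV formats. */
class CsvClusterColumn {

    /**
     * Appends ",cluster" to the header line and ",<index>" to each data line.
     * lines.get(0) is the header; assignment[i] belongs to lines.get(i + 1).
     */
    static List<String> appendClusterColumn(List<String> lines, int[] assignment) {
        List<String> out = new ArrayList<>();
        if (lines.isEmpty()) {
            return out; // nothing to do for an empty file
        }
        if (lines.size() - 1 != assignment.length) {
            throw new IllegalArgumentException("expected one assignment per data row");
        }
        out.add(lines.get(0) + ",cluster");
        for (int i = 0; i < assignment.length; i++) {
            out.add(lines.get(i + 1) + "," + assignment[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up input: a header row plus three records
        List<String> input = Arrays.asList(
                "userId,visits,minutes",
                "u1,3,12.5",
                "u2,40,310.0",
                "u3,5,20.0");
        int[] assignment = {0, 1, 0}; // made-up cluster indices
        for (String line : appendClusterColumn(input, assignment)) {
            System.out.println(line);
        }
    }
}
```

With this made-up input, the header becomes `userId,visits,minutes,cluster` and the first record becomes `u1,3,12.5,0`.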
Not bad, right? There is more material here: http://weka.wikispaces.com/Use+Weka+in+your+Java+code
Finally, thanks to my team lead for the help with Java. My own Java is really weak.