Today we start analyzing the KMeansDriver source code. Since I have already covered the principle of the KMeans algorithm before (in fact I have already implemented it on hadoop), I won't go over the theory here; besides, everyone knows roughly how this algorithm works (for anyone doing data mining it is probably the most familiar algorithm of all). Today's material can actually be done without the hadoop cluster for now, so there is no need to start vmware yet. Er, enough rambling; let's begin the analysis.
First, copy the code below into the java project (the project set up back in the canopy article; all the later analysis lives in that same project):
```java
package mahout.fansy.test.kmeans;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.ManhattanDistanceMeasure;

public class KmeansTest {
  public static void main(String[] args) throws ClassNotFoundException, IOException, InterruptedException {
    Configuration conf = new Configuration();
    conf.set("mapred.job.tracker", "hadoop:9001");
    Path input = new Path("hdfs://hadoop:9000/user/hadoop/input/test_canopy");
    Path clustersIn = new Path("hdfs://hadoop:9000/user/hadoop/output/test_canopy/clusters-0-final/part-r-00000");
    Path output = new Path("hdfs://hadoop:9000/user/hadoop/output/test_kmeans");
    ManhattanDistanceMeasure measure = new ManhattanDistanceMeasure();
    double convergenceDelta = 2.0;
    int maxIterations = 4;
    boolean runClustering = false;
    double clusterClassificationThreshold = 0.5;
    boolean runSequential = false;
    KMeansDriver.run(conf, input, clustersIn, output, measure, convergenceDelta,
        maxIterations, runClustering, clusterClassificationThreshold, runSequential);
  }
}
```
Then set a breakpoint on the KMeansDriver.run() line. The program above merely sets up the parameters and hands everything to KMeansDriver's run method for the real work. Click debug, and today's analysis journey begins:
After clicking debug, press F5 to step into the run() method; alternatively, hold CTRL and left-click run to see which method is being called. We enter:
```java
public static void run(Configuration conf, Path input, Path clustersIn, Path output,
    DistanceMeasure measure, double convergenceDelta, int maxIterations,
    boolean runClustering, double clusterClassificationThreshold, boolean runSequential)
    throws IOException, InterruptedException, ClassNotFoundException {
  // iterate until the clusters converge
  String delta = Double.toString(convergenceDelta);
  if (log.isInfoEnabled()) {
    log.info("Input: {} Clusters In: {} Out: {} Distance: {}",
        new Object[] {input, clustersIn, output, measure.getClass().getName()});
    log.info("convergence: {} max Iterations: {} num Reduce Tasks: {} Input Vectors: {}",
        new Object[] {convergenceDelta, maxIterations, VectorWritable.class.getName()});
  }
  Path clustersOut = buildClusters(conf, input, clustersIn, output, measure, maxIterations, delta, runSequential);
  if (runClustering) {
    log.info("Clustering data");
    clusterData(conf, input, clustersOut, output, measure, clusterClassificationThreshold, runSequential);
  }
}
```
The logging can be ignored for now, which takes us straight to
```java
Path clustersOut = buildClusters(conf, input, clustersIn, output, measure, maxIterations, delta, runSequential);
```
As for the if (runClustering) branch that follows: it decides whether, at the very end, the original data points are classified against the final centers (the algorithm itself only produces the center vectors). We can set that aside for now and look at buildClusters. Alright, CTRL + left click, bingo, and we are in:
```java
public static Path buildClusters(Configuration conf, Path input, Path clustersIn, Path output,
    DistanceMeasure measure, int maxIterations, String delta, boolean runSequential)
    throws IOException, InterruptedException, ClassNotFoundException {
  double convergenceDelta = Double.parseDouble(delta); // parse the input parameter
  List<Cluster> clusters = new ArrayList<Cluster>();   // initialize the list of center vectors
  KMeansUtil.configureWithClusterInfo(conf, clustersIn, clusters); // *************** key analysis point 1
  if (clusters.isEmpty()) {
    throw new IllegalStateException("No input clusters found in " + clustersIn + ". Check your -c argument.");
  }
  Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
  ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
  ClusterClassifier prior = new ClusterClassifier(clusters, policy);
  prior.writeToSeqFiles(priorClustersPath);
  if (runSequential) {
    new ClusterIterator().iterateSeq(conf, input, priorClustersPath, output, maxIterations);
  } else {
    new ClusterIterator().iterateMR(conf, input, priorClustersPath, output, maxIterations);
  }
  return output;
}
```
The first two lines just parse and initialize parameters, and then we hit a KMeansUtil.configureWithClusterInfo() method. What is this for? It takes clustersIn and clusters as arguments; clustersIn is the location of the initial center-vector file, while clusters is the center list we just initialized (still empty inside). So could it be reading the vectors out of the centers file and assigning them to the clusters variable? CTRL + (you know what), and we land in the following code:
```java
public static void configureWithClusterInfo(Configuration conf, Path clusterPath, Collection<Cluster> clusters) {
  for (Writable value : new SequenceFileDirValueIterable<Writable>(clusterPath, PathType.LIST,
      PathFilters.partFilter(), conf)) {
    Class<? extends Writable> valueClass = value.getClass();
    if (valueClass.equals(ClusterWritable.class)) {
      ClusterWritable clusterWritable = (ClusterWritable) value;
      value = clusterWritable.getValue();
      valueClass = value.getClass();
    }
    log.debug("Read 1 Cluster from {}", clusterPath);
    if (valueClass.equals(Kluster.class)) {
      // get the cluster info
      clusters.add((Kluster) value);
    } else if (valueClass.equals(Canopy.class)) {
      // get the cluster info
      Canopy canopy = (Canopy) value;
      clusters.add(new Kluster(canopy.getCenter(), canopy.getId(), canopy.getMeasure()));
    } else {
      throw new IllegalStateException("Bad value class: " + valueClass);
    }
  }
}
```
It starts with a for loop; look at
```java
new SequenceFileDirValueIterable<Writable>(clusterPath, PathType.LIST, PathFilters.partFilter(), conf)
```
This should be a reader for sequence files that exposes the values as an Iterable, which is why the foreach loop works here. Then we see the ClusterWritable class; if I remember correctly, the values in the final output of the canopy algorithm were also of this type, so canopy's final output file should be usable as the clustersIn path (it certainly looks that way). Further down, valueClass can equal Kluster or Canopy; what does that mean? Could it be that if the final result of the canopy algorithm is used as clustersIn the class is Canopy, while if the final result of a previous kmeans run is used it is Kluster? (This can be verified with some logging; call it verification 1, to be checked later.) As for clusters.add(), that unquestionably adds center vectors to the clusters variable. So overall the method does exactly what we guessed.
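For verification 1, a minimal probe along these lines should settle it. This is my own test class, not part of Mahout; it reuses the same iterable as the method above, and the import paths are what I expect for this Mahout version:

```java
package mahout.fansy.test.kmeans;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.clustering.iterator.ClusterWritable;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable;

public class ClustersInProbe {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // same clustersIn as in KmeansTest: the final output of the canopy run
    Path clustersIn = new Path("hdfs://hadoop:9000/user/hadoop/output/test_canopy/clusters-0-final/part-r-00000");
    for (Writable value : new SequenceFileDirValueIterable<Writable>(clustersIn, PathType.LIST,
        PathFilters.partFilter(), conf)) {
      if (value instanceof ClusterWritable) {
        // if the guess is right, a canopy-produced file should wrap Canopy here,
        // and a kmeans-produced one should wrap Kluster
        System.out.println("ClusterWritable wrapping "
            + ((ClusterWritable) value).getValue().getClass().getSimpleName());
      } else {
        System.out.println(value.getClass().getSimpleName());
      }
    }
  }
}
```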
Continuing down from *************** key analysis point 1:
```java
Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
```
This should be related to output. Then look at prior: that word should mean something like 'earlier' or 'initial' (the English of whoever writes the mahout source is pretty good; er, their native language probably is english?), so this should be the output produced by the first run of the algorithm.
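A quick way to check what this path resolves to (a hypothetical probe of my own, dropped in at the breakpoint or in a scratch main(); if I read the constants right, Cluster.INITIAL_CLUSTERS_DIR is "clusters-0"):

```java
// expect hdfs://hadoop:9000/user/hadoop/output/test_kmeans/clusters-0,
// assuming Cluster.INITIAL_CLUSTERS_DIR is "clusters-0" in this version
Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
System.out.println(priorClustersPath);
```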
Next:
```java
ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
ClusterClassifier prior = new ClusterClassifier(clusters, policy);
prior.writeToSeqFiles(priorClustersPath);
```
The last of these three lines apparently writes the center points out to a sequence file again, so priorClustersPath is the initial centers file; oh, my, the guess above was wrong. policy should be the condition for terminating the iteration, tied to the convergence threshold, and prior should be some kind of classifier thing, going by the name Classifier (if my English is any good, that word means 'classifier', right?).
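To see what writeToSeqFiles actually put there, a hedged read-back sketch; I am assuming that readFromSeqFiles(conf, path) and getModels() exist on ClusterClassifier in this version, which is worth double-checking:

```java
// read the freshly written prior back and count the centers (assumed API, to verify)
ClusterClassifier readBack = new ClusterClassifier();
readBack.readFromSeqFiles(conf, priorClustersPath);
System.out.println("centers written to prior: " + readBack.getModels().size());
```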
Then comes the choice of whether to run on hadoop or sequentially on a single machine, which is controlled by the runSequential flag.
Now the key part, the key part arrives:
```java
new ClusterIterator().iterateMR(conf, input, priorClustersPath, output, maxIterations);
```
See, here comes the loop:
```java
public void iterateMR(Configuration conf, Path inPath, Path priorPath, Path outPath, int numIterations)
    throws IOException, InterruptedException, ClassNotFoundException {
  ClusteringPolicy policy = ClusterClassifier.readPolicy(priorPath);
  Path clustersOut = null;
  int iteration = 1;
  while (iteration <= numIterations) {
    conf.set(PRIOR_PATH_KEY, priorPath.toString());
    String jobName = "Cluster Iterator running iteration " + iteration + " over priorPath: " + priorPath;
    System.out.println(jobName);
    Job job = new Job(conf, jobName);
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(ClusterWritable.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(ClusterWritable.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setMapperClass(CIMapper.class);
    job.setReducerClass(CIReducer.class);
    FileInputFormat.addInputPath(job, inPath);
    clustersOut = new Path(outPath, Cluster.CLUSTERS_DIR + iteration); // a fresh clusters-<i> dir per iteration
    priorPath = clustersOut; // this iteration's output becomes the next iteration's prior
    FileOutputFormat.setOutputPath(job, clustersOut);
    job.setJarByClass(ClusterIterator.class);
    if (!job.waitForCompletion(true)) {
      throw new InterruptedException("Cluster Iteration " + iteration + " failed processing " + priorPath);
    }
    ClusterClassifier.writePolicy(policy, clustersOut);
    FileSystem fs = FileSystem.get(outPath.toUri(), conf);
    iteration++;
    if (isConverged(clustersOut, conf, fs)) {
      break;
    }
  }
  Path finalClustersIn = new Path(outPath,
      Cluster.CLUSTERS_DIR + (iteration - 1) + Cluster.FINAL_ITERATION_SUFFIX);
  FileSystem.get(clustersOut.toUri(), conf).rename(clustersOut, finalClustersIn);
}
```
This keeps looping until the termination condition is met: either numIterations passes have run, or isConverged reports that the clusters have converged. The Mapper here is CIMapper and the Reducer is CIReducer.
Also, look at the last line: a rename. My earlier kmeans implementation used the iteration count i to keep creating new paths for the centers files; does this one use a rename instead? Looking back at the loop, it actually does both: each pass writes to a fresh clusters-<i> directory, and the rename at the end only tags the last one with the -final suffix. This still needs further analysis.
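A quick way to confirm that layout after a run is to list the output directory. A minimal sketch (my own check code, not Mahout's); with maxIterations = 4 and no early convergence I would expect clusters-0 (the prior), clusters-1 through clusters-3, and clusters-4-final:

```java
package mahout.fansy.test.kmeans;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputLayoutProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path output = new Path("hdfs://hadoop:9000/user/hadoop/output/test_kmeans");
    FileSystem fs = FileSystem.get(output.toUri(), conf);
    for (FileStatus status : fs.listStatus(output)) {
      // expect clusters-0, clusters-1, ... and one clusters-<i>-final at the end
      System.out.println(status.getPath().getName());
    }
  }
}
```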
So what next? Two options. First, fabricate a sequence file in the same format, call the algorithm directly, and inspect the final result, for an overall picture. Second, modify the code so that it can run standalone (standalone here does not mean simply flipping runSequential to true), which makes it easiest to trace the data flow. That is the approach I intend to take.
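As a preview of that second approach, here is a minimal standalone sketch of one kmeans pass over in-memory vectors. It mirrors what CIMapper/CIReducer do per iteration (assign each point to the nearest center under Manhattan distance, then recompute each center as the mean of its points); it is my own illustration for tracing the data flow, not Mahout's actual code:

```java
package mahout.fansy.test.kmeans;

import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.common.distance.ManhattanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class LocalKMeansStep {

  // one pass: classify every point to its nearest center, then average per cluster
  public static Vector[] step(List<Vector> points, Vector[] centers) {
    ManhattanDistanceMeasure measure = new ManhattanDistanceMeasure();
    int dim = centers[0].size();
    Vector[] sums = new Vector[centers.length];
    int[] counts = new int[centers.length];
    for (int i = 0; i < centers.length; i++) {
      sums[i] = new DenseVector(dim);
    }
    for (Vector point : points) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int i = 0; i < centers.length; i++) {
        double d = measure.distance(centers[i], point); // the "map" side: classify
        if (d < bestDist) {
          bestDist = d;
          best = i;
        }
      }
      sums[best] = sums[best].plus(point);
      counts[best]++;
    }
    Vector[] next = new Vector[centers.length];
    for (int i = 0; i < centers.length; i++) { // the "reduce" side: recompute centers
      next[i] = counts[i] == 0 ? centers[i] : sums[i].divide(counts[i]);
    }
    return next;
  }

  public static void main(String[] args) {
    List<Vector> points = new ArrayList<Vector>();
    points.add(new DenseVector(new double[] {1, 1}));
    points.add(new DenseVector(new double[] {2, 1}));
    points.add(new DenseVector(new double[] {9, 9}));
    points.add(new DenseVector(new double[] {8, 9}));
    Vector[] centers = {new DenseVector(new double[] {0, 0}), new DenseVector(new double[] {10, 10})};
    Vector[] next = step(points, centers);
    for (Vector center : next) {
      System.out.println(center); // expect roughly {1.5, 1.0} and {8.5, 9.0}
    }
  }
}
```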
Share, enjoy, grow.
When reposting, please cite the source: http://blog.csdn.net/fansy1990