My earlier analysis of the centroid file was basically correct, but the first post never explained how the centroid vector file is produced, so the second post showed how to build one by hand. In fact, Mahout ships with a method that generates the centroid file automatically, which I missed before. Let me fill that in now. First, write the following debug code:
```java
package mahout.fansy.test.kmeans;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.clustering.kmeans.KMeansDriver;

public class KmeansTest {

    public static void main(String[] args) throws Exception {
        test2();
    }

    // invoke KMeansDriver's run method indirectly through ToolRunner
    public static void test2() throws Exception {
        String[] arg = {"-fs", "fansyPC:9000", "-jt", "fansyPC:9001",
            "--input", "hdfs://fansyPC:9000/user/fansy/output/kmeans-in-transform/part-r-00000",
            "--output", "hdfs://fansyPC:9000/user/fansy/output/kmeans-output",
            "-dm", "org.apache.mahout.common.distance.ManhattanDistanceMeasure",
            "-c", "hdfs://fansyPC:9000/user/fansy/output/kmeans-center",
            "-k", "2",
            "-x", "4",
            "--tempDir", "hdfs://fansyPC:9000/user/fansy/output/kmeans-tmp"};
        ToolRunner.run(new Configuration(), new KMeansDriver(), arg);
    }
}
```
Then step through it in debug mode (since execution has not yet entered the MapReduce stage, it can be debugged directly). At line 95 of KMeansDriver you reach:
```java
clusters = RandomSeedGenerator.buildRandom(getConf(), input, clusters,
    Integer.parseInt(getOption(DefaultOptionCreator.NUM_CLUSTERS_OPTION)), measure);
```
This is the method that automatically selects the centroids from the input file (the input must be a sequence file; see http://blog.csdn.net/fansy1990/article/details/9635575 for converting plain text into the required format). Stepping into the method:
```java
public static Path buildRandom(Configuration conf, Path input, Path output, int k,
                               DistanceMeasure measure) throws IOException {
  // delete the output directory
  FileSystem fs = FileSystem.get(output.toUri(), conf);
  HadoopUtil.delete(conf, output);
  Path outFile = new Path(output, "part-randomSeed");
  boolean newFile = fs.createNewFile(outFile);
  if (newFile) {
    Path inputPathPattern;
    if (fs.getFileStatus(input).isDir()) {
      inputPathPattern = new Path(input, "*");
    } else {
      inputPathPattern = input;
    }
    FileStatus[] inputFiles = fs.globStatus(inputPathPattern, PathFilters.logsCRCFilter());
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, outFile,
        Text.class, ClusterWritable.class);
    Random random = RandomUtils.getRandom();
    List<Text> chosenTexts = Lists.newArrayListWithCapacity(k);
    List<ClusterWritable> chosenClusters = Lists.newArrayListWithCapacity(k);
    int nextClusterId = 0;
    for (FileStatus fileStatus : inputFiles) {
      if (fileStatus.isDir()) {
        continue;
      }
      for (Pair<Writable,VectorWritable> record
          : new SequenceFileIterable<Writable,VectorWritable>(fileStatus.getPath(), true, conf)) {
        Writable key = record.getFirst();
        VectorWritable value = record.getSecond();
        Kluster newCluster = new Kluster(value.get(), nextClusterId++, measure);
        newCluster.observe(value.get(), 1);
        Text newText = new Text(key.toString());
        int currentSize = chosenTexts.size();
        if (currentSize < k) {
          chosenTexts.add(newText);
          ClusterWritable clusterWritable = new ClusterWritable();
          clusterWritable.setValue(newCluster);
          chosenClusters.add(clusterWritable);
        } else if (random.nextInt(currentSize + 1) != 0) { // with chance 1/(currentSize+1) pick new element
          int indexToRemove = random.nextInt(currentSize); // evict one chosen randomly
          chosenTexts.remove(indexToRemove);
          chosenClusters.remove(indexToRemove);
          chosenTexts.add(newText);
          ClusterWritable clusterWritable = new ClusterWritable();
          clusterWritable.setValue(newCluster);
          chosenClusters.add(clusterWritable);
        }
      }
    }
    // ... (the remainder of the method, which appends the chosen
    // texts/clusters to the writer and returns outFile, is cut off here)
```
The code first deletes the existing centroid file (if there is one), then creates a new one, and finally selects k records from the input and writes them into it. As for how the k centroids are chosen: the first k input vectors are kept directly, and every later vector replaces a randomly chosen survivor with a certain probability, so the file always holds k candidates sampled from across the whole input (a reservoir-style sampling scheme).
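To see the selection mechanism in isolation, here is a minimal, hypothetical sketch of the same reservoir-style sampling over plain strings. The class and names are my own, not Mahout's; only the replacement test mirrors Mahout's `random.nextInt(currentSize + 1) != 0` check from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReservoirSample {
    // Keep the first k items; afterwards each new item evicts a randomly
    // chosen survivor unless nextInt(size + 1) == 0, mirroring the
    // selection loop in RandomSeedGenerator.buildRandom.
    public static List<String> sample(Iterable<String> records, int k, Random random) {
        List<String> chosen = new ArrayList<>(k);
        for (String record : records) {
            int currentSize = chosen.size();
            if (currentSize < k) {
                chosen.add(record);
            } else if (random.nextInt(currentSize + 1) != 0) {
                chosen.remove(random.nextInt(currentSize)); // evict one at random
                chosen.add(record);
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<String> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            data.add("v" + i);
        }
        List<String> seeds = sample(data, 2, new Random(42));
        System.out.println(seeds); // always exactly 2 of the 100 records
    }
}
```

Note that once the reservoir is full, `currentSize` stays at k, so a new record enters with probability k/(k+1) — exactly what the Mahout code above does, just stripped of the Kluster/ClusterWritable bookkeeping.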
Then keep debugging onward. When execution reaches ClusterClassifier's writePolicy method:
```java
public static void writePolicy(ClusteringPolicy policy, Path path) throws IOException {
  Path policyPath = new Path(path, POLICY_FILE_NAME);
  Configuration config = new Configuration();
  FileSystem fs = FileSystem.get(policyPath.toUri(), config);
  SequenceFile.Writer writer = new SequenceFile.Writer(fs, config, policyPath,
      Text.class, ClusteringPolicyWritable.class);
  writer.append(new Text(), new ClusteringPolicyWritable(policy));
  writer.close();
}
```
A brand-new Configuration is created here — won't that be a problem? By my earlier reasoning an extra line such as config.set("mapred.job.tracker", "fansyPC:9001") should be required, and without it the cluster should be unreachable; yet here the path is found and the file really is written, so this is worth looking into. (Most likely it works because policyPath is fully qualified, so FileSystem.get(policyPath.toUri(), config) can resolve the NameNode from the hdfs:// URI itself; mapred.job.tracker only matters when submitting jobs, not for a direct HDFS write.)
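My reading (an assumption, not something stated in the Mahout source) is that FileSystem.get(uri, conf) selects the filesystem by the URI's scheme and authority, so a fully qualified path carries its own NameNode address and the fresh Configuration never needs the cluster settings. A quick pure-JDK check of what such a URI actually contains (the policy file name below is illustrative):

```java
import java.net.URI;

public class PolicyPathUri {
    public static void main(String[] args) {
        // a fully qualified policy path like the one writePolicy receives
        URI uri = URI.create("hdfs://fansyPC:9000/user/fansy/output/kmeans-output/clusters-0/_policy");
        // scheme + authority are what FileSystem.get(uri, conf) keys on
        System.out.println(uri.getScheme());    // hdfs
        System.out.println(uri.getAuthority()); // fansyPC:9000
        System.out.println(uri.getPath());      // /user/fansy/output/kmeans-output/clusters-0/_policy
    }
}
```

Had writePolicy been given a relative path, the scheme and authority would be absent, and the new Configuration's default fs.default.name (local filesystem) would have been used instead — which would indeed have failed the way I expected.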
Share, enjoy, grow.
Please credit the source when reposting: http://blog.csdn.net/fansy1990