A Hands-On Summary of the Canopy Algorithm

Working through the Canopy algorithm hands-on gave me a feel for the standard MapReduce coding pattern: setting up the Job, input and output paths, input/output formats, the map and reduce classes, and the Configuration, as well as serializing and deserializing data with SequenceFile.
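As a side note on SequenceFile handling, the sketch below (not from the original post; the file name is hypothetical) shows how one of the generated SequenceFiles can be read back. It uses reflection to instantiate the key/value classes, so the same loop works no matter which concrete Writable types a particular output contains.

    Configuration conf = new Configuration();
    // hypothetical part file under the converted-input directory
    Path seqPath = new Path("/canopy/output/data/part-m-00000");
    FileSystem fs = FileSystem.get(seqPath.toUri(), conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, seqPath, conf);
    try {
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
        }
    } finally {
        reader.close();
    }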

My understanding owes a lot to the article "mahout 源码解析之聚类--Canopy算法" (a Mahout source-code analysis of the Canopy clustering algorithm).

Below is a rough outline of the Canopy workflow:

1. Use InputDriver to convert the raw text input file into a SequenceFile.

2. Call CanopyDriver.buildClusters to run the clustering and generate the canopy centers:

Path clustersOut = CanopyDriver.buildClusters(new Configuration(), directoryContainingConvertedInput, ouput, measure, t1, t2, t1, t2, 0, false);

It supports both a sequential (single-machine) mode and a MapReduce mode, selected by the last boolean argument. You set t1 and t2, and also a clusterFilter threshold (the 0 above): any canopy containing fewer samples than this threshold is discarded.

3. The number of samples in each cluster can be computed and displayed. I have not implemented that part yet and will work out later how to inspect the results in local files (a rough sketch follows below). There is also a ClusteringPolicy, the strategy that decides how points are assigned to clusters.
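One possible sketch of that counting step (not implemented in the post; the part file name is hypothetical): read the classification output back and tally points per cluster id, relying on the fact that the classification job in the code below writes <IntWritable, WeightedPropertyVectorWritable> pairs to output2.

    Configuration conf = new Configuration();
    Path clusteredPoints = new Path("output2/part-m-00000"); // hypothetical part file
    FileSystem fs = FileSystem.get(clusteredPoints.toUri(), conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, clusteredPoints, conf);
    Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
    IntWritable clusterId = new IntWritable();
    WeightedPropertyVectorWritable point = new WeightedPropertyVectorWritable();
    while (reader.next(clusterId, point)) {
        Integer seen = counts.get(clusterId.get());
        counts.put(clusterId.get(), seen == null ? 1 : seen + 1);
    }
    reader.close();
    System.out.println("samples per cluster: " + counts);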

Here is the outermost driver code:

private static void run(Path input, Path output, DistanceMeasure measure, double t1, double t2)
        throws Exception {
    // Step 1: convert the raw text input into a SequenceFile of vectors.
    Path directoryContainingConvertedInput = new Path(output, "data");

    System.out.println("InputDriver begin!!!!!!!!!!");
    InputDriver.runJob(input, directoryContainingConvertedInput,
            "org.apache.mahout.math.RandomAccessSparseVector");
    System.out.println("InputDriver done!!!!!!!!!!");

    // Step 2: generate the canopy centers. t3/t4 reuse t1/t2, clusterFilter = 0
    // keeps every canopy, and runSequential = false selects the MapReduce path.
    Path ouput = new Path("/ouput");
    System.out.println(ouput.toString());
    Path clustersOut = CanopyDriver.buildClusters(new Configuration(),
            directoryContainingConvertedInput, ouput, measure, t1, t2, t1, t2, 0, false);
    System.out.println("pathout____:" + clustersOut.toString());
    // clustersOut is /ouput/clusters-0-final
    System.out.println("clusterDATA!!!!!");

    // Write the clustering policy that the classification step expects to find
    // under the clusters directory. Equivalent one-liner:
    // ClusterClassifier.writePolicy(new CanopyClusteringPolicy(), clustersOut);
    Path policyPath = new Path(clustersOut, "_policy");
    Configuration config = new Configuration();
    FileSystem fs = FileSystem.get(policyPath.toUri(), config);
    System.out.println("____fs___" + fs.toString());
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, config, policyPath,
            Text.class, ClusteringPolicyWritable.class);
    System.out.println("______writer___" + writer.toString());
    writer.append(new Text(), new ClusteringPolicyWritable(new CanopyClusteringPolicy()));
    writer.close();
    System.out.println("_____writer_close___");

    // Step 3: classify the input points against the canopy centers.
    // This hand-builds the same map-only job that the following call would run:
    // ClusterClassificationDriver.run(new Configuration(), directoryContainingConvertedInput,
    //         ouput, new Path(ouput, "clusteredPoints"), 0.0D, true, false);
    System.out.println("ClusterClassificationDriver:beginING!!!");
    config.setFloat("pdf_threshold", new Double(0.0D).floatValue());
    config.setBoolean("emit_most_likely", true);
    // config.set("clusters_in", ouput.toUri().toString());
    System.out.println(ouput.toUri().toString() + "___________");
    Job job = new Job(config, "Cluster Classification Driver running over input: " + input);
    job.setJarByClass(ClusterClassificationDriver.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setMapperClass(ClusterClassificationMapper.class);
    job.setNumReduceTasks(0); // map-only job
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(WeightedPropertyVectorWritable.class);
    FileInputFormat.addInputPath(job, directoryContainingConvertedInput);
    Path output2 = new Path("output2");
    System.out.println("______output2");
    FileOutputFormat.setOutputPath(job, output2);
    if (!job.waitForCompletion(true)) {
        throw new InterruptedException("Cluster Classification Driver Job failed processing " + input);
    }
    System.out.println("ClusterClassificationDriver:DONE!!!");

    // The all-in-one driver call for comparison:
    // CanopyDriver.run(conf, input, output, measure, t1, t2, runClustering,
    //         clusterClassificationThreshold, runSequential)
    // runClustering = false, so only the canopy-generation phase runs again here.
    CanopyDriver.run(new Configuration(), directoryContainingConvertedInput, ouput,
            measure, t1, t2, false, 0.0D, false);
    System.out.println("CanopyDriver done!!!!!!!!!!");

    // Dump the final clusters (and the classified points in output2) to stdout.
    ClusterDumper clusterDumper = new ClusterDumper(new Path(ouput, "clusters-0-final"), output2);
    System.out.println("ClusterDumper done!!!!!!!!!!");
    clusterDumper.printClusters(null);
    System.out.println("ClusterDumper printClusters done!!!!!!!!!!");
}
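For reference, a hypothetical call to this run method might look like the following (the paths, the distance measure, and the t1/t2 values are placeholders, not taken from the original post); EuclideanDistanceMeasure is one of Mahout's built-in DistanceMeasure implementations, and t1 should be larger than t2.

    run(new Path("/canopy/input"),
        new Path("/canopy/output"),
        new EuclideanDistanceMeasure(),
        3.0,   // t1: loose distance threshold
        1.5);  // t2: tight distance threshold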




