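The book Mahout in Action contains a simple k-means example, but it only gives the source code and never says which jars have to be added to the project for it to run.
Early on, the book states that all of its code is based on Mahout 0.4, yet I found that this k-means example is actually written against Mahout 0.3: it uses several functions that do not exist in 0.4.
I don't know whether this is because I was using the precompiled jars, but I also downloaded the Mahout 0.4 source and checked, and they are not there either. In the code below I mark which calls are missing from 0.4.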
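// Imports for Mahout 0.3 and Hadoop 0.20.2 (the listing in the book omits them).
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.clustering.kmeans.Cluster;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// The class name is arbitrary; the listing itself only shows the members.
public class KMeansExample {

    public static final double[][] points = {
        {1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3},
        {8, 8}, {9, 8}, {8, 9}, {9, 9}};

    // Write the input vectors to a SequenceFile of (LongWritable, VectorWritable) records.
    public static void writePointsToFile(List<Vector> points, String fileName,
                                         FileSystem fs, Configuration conf) throws IOException {
        Path path = new Path(fileName);
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
            LongWritable.class, VectorWritable.class);
        long recNum = 0;
        VectorWritable vec = new VectorWritable();
        for (Vector point : points) {
            vec.set(point);
            writer.append(new LongWritable(recNum++), vec);
        }
        writer.close();
    }

    // Convert the raw double arrays into named Mahout vectors.
    public static List<Vector> getPoints(double[][] raw) {
        List<Vector> points = new ArrayList<Vector>();
        for (int i = 0; i < raw.length; i++) {
            double[] fr = raw[i];
            // Mahout 0.4 has no constructor with these parameters
            Vector vec = new RandomAccessSparseVector("vector: " + String.valueOf(i), fr.length);
            vec.assign(fr);
            points.add(vec);
        }
        return points;
    }

    public static void main(String args[]) throws Exception {
        int k = 2;
        List<Vector> vectors = getPoints(points);

        // Create the local input directories if they are missing.
        File testData = new File("testdata");
        if (!testData.exists()) {
            testData.mkdir();
        }
        testData = new File("testdata/points");
        if (!testData.exists()) {
            testData.mkdir();
        }

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        writePointsToFile(vectors, "testdata/points/file1", fs, conf);

        // Seed the k initial clusters with the first k points.
        Path path = new Path("testdata/clusters/part-00000");
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
            Text.class, Cluster.class);
        for (int i = 0; i < k; i++) {
            Vector vec = vectors.get(i);
            Cluster cluster = new Cluster(vec, i);   // Mahout 0.4 does not have this constructor either
            cluster.addPoint(cluster.getCenter());   // Mahout 0.4 does not have this method either
            writer.append(new Text(cluster.getIdentifier()), cluster);
        }
        writer.close();

        // In Mahout 0.4 this became KMeansDriver.run(...); runJob no longer exists
        KMeansDriver.runJob("testdata/points", "testdata/clusters", "output",
            EuclideanDistanceMeasure.class.getName(), 0.001, 10, 1);

        // Read back the point-to-cluster assignments and print them.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs,
            new Path("output/points/part-00000"), conf);
        Text key = new Text();
        Text value = new Text();
        while (reader.next(key, value)) {
            System.out.println(key.toString() + " belongs to cluster " + value.toString());
        }
        reader.close();
    }
}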
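Of course, the code can be changed to work with Mahout 0.4. If you run it with Run As -> Java Application, the result files are generated in the local directory; if you choose Run As -> Run on Hadoop, they are generated in HDFS.
Following the compile errors, find hadoop-core-0.20.2.jar, mahout-core-0.3.jar and mahout-math-0.3.jar and add them to the project; at this point the code should no longer show errors. Then, at run time, add further jars one at a time as the error messages ask for them. During my testing I added, in order, mahout-collections-0.3.jar, slf4j-api-1.5.8.jar, slf4j-jcl-1.5.8.jar, commons-logging-1.1.1.jar, commons-cli-2.0.jar and commons-httpclient-3.1.jar. All of these can be found in the Mahout installation directory or in its lib subdirectory. The final output is: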
Vector:0 belongs to cluster 0
Vector:1 belongs to cluster 0
Vector:2 belongs to cluster 0
Vector:3 belongs to cluster 0
Vector:4 belongs to cluster 0
Vector:5 belongs to cluster 1
Vector:6 belongs to cluster 1
Vector:7 belongs to cluster 1
Vector:8 belongs to cluster 1
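To check this grouping without any Mahout or Hadoop jars, the same nine points can be run through a plain-Java k-means loop. This is only a minimal sketch of the textbook algorithm (Lloyd's iteration), not Mahout's implementation: the class name KMeansSketch and its methods are made up for illustration, and I am assuming that the 0.001 and 10 passed to runJob are the convergence delta and the maximum number of iterations. As in the listing, the first k points seed the centers.

```java
/** Plain-Java k-means (Lloyd's algorithm) on the nine sample points; illustrative only. */
final class KMeansSketch {

    static final double[][] POINTS = {
        {1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3}, {8, 8}, {9, 8}, {8, 9}, {9, 9}
    };

    /** Squared Euclidean distance; it ranks centers the same way as the true distance. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the center closest to p (ties go to the lower index). */
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(p, centers[c]) < squaredDistance(p, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Returns the cluster index of every point; the first k points seed the centers. */
    static int[] cluster(double[][] points, int k, int maxIterations, double convergenceDelta) {
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points[c].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: accumulate each point into its nearest center.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move each non-empty center to the mean of its points.
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // keep the old center of an empty cluster
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (Math.sqrt(squaredDistance(updated, centers[c])) > convergenceDelta) {
                    converged = false;
                }
                centers[c] = updated;
            }
            if (converged) {
                break;
            }
        }
        // Final pass: label every point with its nearest final center.
        int[] assignment = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            assignment[i] = nearest(points[i], centers);
        }
        return assignment;
    }

    public static void main(String[] args) {
        int[] assignment = cluster(POINTS, 2, 10, 0.001);
        for (int i = 0; i < assignment.length; i++) {
            System.out.println("point " + i + " belongs to cluster " + assignment[i]);
        }
    }
}
```

With k = 2 the sketch settles on centers (1.8, 1.8) and (8.5, 8.5), so points 0 to 4 land in cluster 0 and points 5 to 8 in cluster 1, the same split as the Mahout run. Mahout's own convergence test and handling of the seed clusters may differ in detail.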