A simple k-means example in Mahout

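Mahout in Action includes a simple k-means example, but the book gives only the source code and does not say which jars have to be added to the project for it to run.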

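Early on, the book states that all of its code targets Mahout 0.4. This k-means example, however, turns out to be written against Mahout 0.3: it calls several methods that do not exist in 0.4.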

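I wondered whether that was only because I was using the precompiled jars, but I downloaded the Mahout 0.4 source and checked, and the methods are not there either. In the code below I mark the calls that 0.4 lacks.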

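// Imports assumed for Hadoop 0.20.2 and Mahout 0.3; verify them against your jars.
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.clustering.kmeans.Cluster;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// The class name is arbitrary; any name works.
public class SimpleKMeansClustering {

    // Nine 2-D sample points: five near (2, 2) and four near (8.5, 8.5).
    public static final double[][] points = { {1, 1}, {2, 1}, {1, 2},
            {2, 2}, {3, 3}, {8, 8}, {9, 8}, {8, 9}, {9, 9}};

    // Writes the vectors to a SequenceFile keyed by record number.
    public static void writePointsToFile(List<Vector> points,
            String fileName, FileSystem fs, Configuration conf)
            throws IOException {
        Path path = new Path(fileName);
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
                path, LongWritable.class, VectorWritable.class);
        long recNum = 0;
        VectorWritable vec = new VectorWritable();
        for (Vector point : points) {
            vec.set(point);
            writer.append(new LongWritable(recNum++), vec);
        }
        writer.close();
    }

    // Converts the raw double arrays into named Mahout vectors.
    public static List<Vector> getPoints(double[][] raw) {
        List<Vector> points = new ArrayList<Vector>();
        for (int i = 0; i < raw.length; i++) {
            double[] fr = raw[i];
            // Mahout 0.4 has no constructor that takes these arguments.
            Vector vec = new RandomAccessSparseVector("vector: "
                    + String.valueOf(i), fr.length);
            vec.assign(fr);
            points.add(vec);
        }
        return points;
    }

    public static void main(String[] args) throws Exception {
        int k = 2;
        List<Vector> vectors = getPoints(points);
        File testData = new File("testdata");
        if (!testData.exists()) {
            testData.mkdir();
        }
        testData = new File("testdata/points");
        if (!testData.exists()) {
            testData.mkdir();
        }
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        writePointsToFile(vectors, "testdata/points/file1", fs, conf);

        // Seed the k initial clusters with the first k points.
        Path path = new Path("testdata/clusters/part-00000");
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
                path, Text.class, Cluster.class);
        for (int i = 0; i < k; i++) {
            Vector vec = vectors.get(i);
            // Mahout 0.4 does not have this constructor either.
            Cluster cluster = new Cluster(vec, i);
            // Mahout 0.4 does not have this method either.
            cluster.addPoint(cluster.getCenter());
            writer.append(new Text(cluster.getIdentifier()), cluster);
        }
        writer.close();

        // In Mahout 0.4 this became KMeansDriver.run(...); runJob no longer exists.
        KMeansDriver.runJob("testdata/points", "testdata/clusters",
                "output", EuclideanDistanceMeasure.class.getName(), 0.001,
                10, 1);

        // Read back the point-to-cluster assignments and print them.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs,
                new Path("output/points/part-00000"), conf);
        Text key = new Text();
        Text value = new Text();
        while (reader.next(key, value)) {
            System.out.println(key.toString() + " belongs to cluster "
                    + value.toString());
        }
        reader.close();
    }
}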
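
To make clear what the Mahout job computes on these nine points, here is a minimal plain-Java sketch of k-means (Lloyd's algorithm). It is not Mahout's implementation and uses neither Mahout nor Hadoop. The class name KMeansSketch and its method names are made up for illustration. It borrows three things from the code above: the first k points seed the centers, the distance is Euclidean, and the convergence threshold and iteration cap are 0.001 and 10.

```java
/** Plain-Java sketch of k-means (Lloyd's algorithm); hypothetical helper, not Mahout code. */
class KMeansSketch {
    static final double[][] POINTS = { {1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3},
                                       {8, 8}, {9, 8}, {8, 9}, {9, 9} };

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Index of the center closest to p (ties go to the lower index). */
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (distance(p, centers[c]) < distance(p, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Returns the cluster index of every point; the first k points seed the centers. */
    static int[] cluster(double[][] points, int k, int maxIterations, double delta) {
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points[c].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: accumulate each point into its nearest center.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move each center to the mean of its points.
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its old center
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (distance(updated, centers[c]) > delta) {
                    converged = false;
                }
                centers[c] = updated;
            }
            if (converged) {
                break;
            }
        }
        // Final pass: label every point with its nearest center.
        int[] assignment = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            assignment[i] = nearest(points[i], centers);
        }
        return assignment;
    }

    public static void main(String[] args) {
        int[] assignment = cluster(POINTS, 2, 10, 0.001);
        for (int i = 0; i < assignment.length; i++) {
            System.out.println("point " + i + " belongs to cluster " + assignment[i]);
        }
    }
}
```

With the two seeds (1, 1) and (2, 1), the centers settle at (1.8, 1.8) and (8.5, 8.5) after the second update, so points 0 to 4 land in cluster 0 and points 5 to 8 in cluster 1.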

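The Mahout 0.3 code above can of course be adapted to Mahout 0.4. If you launch it with Run As -> Java Application, the result files are written under the local working directory; with Run As -> Run on Hadoop, they are written to HDFS.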

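Following the compiler errors, locate hadoop-core-0.20.2.jar, mahout-core-0.3.jar, and mahout-math-0.3.jar and add them to the project; the code should then compile. At run time, add further jars as the error messages call for them. During my testing I added, in order, mahout-collections-0.3.jar, slf4j-api-1.5.8.jar, slf4j-jcl-1.5.8.jar, commons-logging-1.1.1.jar, commons-cli-2.0.jar, and commons-httpclient-3.1.jar. All of these can be found in the Mahout installation directory or in its lib subdirectory. The final output is: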

Vector:0 belongs to cluster 0
Vector:1 belongs to cluster 0
Vector:2 belongs to cluster 0
Vector:3 belongs to cluster 0
Vector:4 belongs to cluster 0
Vector:5 belongs to cluster 1
Vector:6 belongs to cluster 1
Vector:7 belongs to cluster 1
Vector:8 belongs to cluster 1
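
Adding jars one failed run at a time is tedious, so a small probe can report up front which classes the classpath is still missing. The following is only a sketch: ClasspathCheck is a made-up helper, and the pairing of class names to jar files in main is my assumption about where each class lives, not something I checked against the jars.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that reports which classes can be loaded from the classpath. */
class ClasspathCheck {

    /** True if the named class can be located on the current classpath. */
    static boolean isAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class-to-jar pairing, for illustration only.
        Map<String, String> probes = new LinkedHashMap<String, String>();
        probes.put("org.apache.hadoop.conf.Configuration", "hadoop-core-0.20.2.jar");
        probes.put("org.apache.mahout.clustering.kmeans.KMeansDriver", "mahout-core-0.3.jar");
        probes.put("org.apache.mahout.math.Vector", "mahout-math-0.3.jar");
        probes.put("org.slf4j.Logger", "slf4j-api-1.5.8.jar");
        probes.put("org.apache.commons.logging.Log", "commons-logging-1.1.1.jar");
        probes.put("org.apache.commons.httpclient.HttpClient", "commons-httpclient-3.1.jar");

        for (Map.Entry<String, String> probe : probes.entrySet()) {
            String status = isAvailable(probe.getKey()) ? "found" : "MISSING";
            System.out.println(status + "  " + probe.getKey()
                    + "  (expected in " + probe.getValue() + ")");
        }
    }
}
```

Run it with the same build path as the k-means project; every line marked MISSING points to a jar that still has to be added.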

 
