1. Download Mahout
http://www.apache.org/dist//mahout/0.4/
2. Extract the archive
tar zxvf mahout-distribution-0.4.tar.gz
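Before going further, make sure the environment variables used throughout the rest of this walkthrough are set. A minimal sketch, assuming Hadoop is already installed; the paths below are placeholders to adjust to your own machine:
# Hypothetical install locations; change them to match your system
export HADOOP_HOME=/usr/local/hadoop
export MAHOUT_HOME=/usr/local/mahout-distribution-0.4
export PATH=$PATH:$HADOOP_HOME/bin:$MAHOUT_HOME/bin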
3. List the available algorithms
./bin/mahout -h
This displays all of the algorithms the current Mahout release supports.
Clustering
Clustering of synthetic control data
Pre-Prep
1) Download the input data, synthetic_control.data (the synthetic control chart time-series data set).
The file contains 600 rows of 60 columns each; every row is one control chart time series:
_time     _time+x   _time+2x  ..   _time+60x
28.7812   34.4632   31.3381   ..   31.2834
24.8923   25.741    27.5532   ..   32.8217
..
35.5351   41.7067   39.1705   48.3964   ..   38.6103
24.2104   41.7679   45.2228   43.7762   ..   48.8175
..
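As a quick sanity check (assuming the file was saved as synthetic_control.data in the current directory), you can verify the shape of the data before uploading it:
# Each line should contain 60 values
head -1 synthetic_control.data | wc -w
# Count the number of series (rows) in the file
wc -l synthetic_control.data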
2) Start Hadoop
$HADOOP_HOME/bin/start-all.sh
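To check that the Hadoop daemons actually came up (assuming a standard single-node setup with a JDK on the path), jps should list processes such as NameNode, DataNode, JobTracker and TaskTracker:
jps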
3) Upload the data to HDFS
$HADOOP_HOME/bin/hadoop fs -mkdir testdata
$HADOOP_HOME/bin/hadoop fs -put <PATH TO synthetic_control.data> testdata
This creates the test directory testdata and puts the data into it (the directory must be named exactly testdata, since the example jobs read their input from that path).
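You can confirm the upload before running any job:
$HADOOP_HOME/bin/hadoop fs -ls testdata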
Perform Clustering
a. For canopy:
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
b. For kmeans:
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
c. For fuzzykmeans:
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
d. For dirichlet:
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job
e. For meanshift:
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job
You can also run the job directly under Hadoop:
hadoop jar mahout-examples-0.4-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
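All of these example jobs write their results to the same HDFS directory named output (see the next section), so before trying a second algorithm, remove the results of the previous run. A sketch, using the older rmr form of the Hadoop shell from this era:
# Delete the previous results so the next job can write a fresh output directory
$HADOOP_HOME/bin/hadoop fs -rmr output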
Read / Analyze Output
1) List all of the output files
hadoop fs -lsr output
2) Copy the output to the local filesystem
hadoop fs -get output $MAHOUT_HOME/examples
Change into the output directory:
$ cd $MAHOUT_HOME/examples/output
$ ls
If you see something like the following, the algorithm ran successfully and your installation works:
clusteredPoints clusters-1 clusters-2 clusters-4 clusters-6 clusters-8 data
clusters-0 clusters-10 clusters-3 clusters-5 clusters-7 clusters-9
3) The cluster results for each iteration i are in output/clusters-i
4) The combined results, i.e. all points with their cluster assignments, are in output/clusteredPoints
The results are stored as Hadoop SequenceFiles. To inspect them directly on HDFS, use:
./bin/mahout vectordump --seqFile output/data/part-m-00000
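For a more human-readable view of the cluster centers and the points assigned to them, Mahout also provides a clusterdump utility. A sketch, assuming the k-means run produced clusters-10 as its final iteration (as in the listing above); the option names are from the 0.4-era command line and may differ in other versions:
# Dump the final clusters and their member points to a local text file
./bin/mahout clusterdump --seqFileDir output/clusters-10 --pointsDir output/clusteredPoints --output clusteranalyze.txt
cat clusteranalyze.txt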