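Mahout 0.11 Installation and Testing
1.1 Mahout
Set up a Hadoop environment first. For simply testing Mahout, a single-node installation is sufficient. The cluster used here is laid out as follows: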
hostname | ip | install location | hadoop cluster roles
invin01 | 192.168.199.61 | | NameNode, DFSZKFailoverController, jdk
invin02 | 192.168.199.62 | | NameNode, DFSZKFailoverController, jdk
invin03 | 192.168.199.63 | | ResourceManager, jdk
invin04 | 192.168.199.64 | | QuorumPeerMain, jdk, NodeManager, JournalNode
invin05 | 192.168.199.65 | | QuorumPeerMain, jdk, NodeManager, JournalNode
invin06 | 192.168.199.66 | | QuorumPeerMain, jdk, NodeManager, JournalNode
1. Install Mahout on invin01
http://mahout.apache.org/ //official site
http://mahout.apache.org/general/downloads.html //download page
http://www.apache.org/dyn/closer.cgi/mahout/ //download the latest release
// Download the Mahout 0.11 package on invin01
su - hduser
cd ~
wget http://ftp.jaist.ac.jp/pub/apache/mahout/0.11.0/apache-mahout-distribution-0.11.0.tar.gz
tar -zxvf apache-mahout-distribution-0.11.0.tar.gz
// Set the environment variables
vi .bashrc
# for hadoop
source .bashrc //apply the environment variables
mahout //verify that the installation works
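The entries added to .bashrc are not listed above. As a minimal sketch, assuming the archive was unpacked in the home directory of hduser as in the previous step, lines like the following would make the mahout command available. MAHOUT_HOME and MAHOUT_LOCAL are variables read by the bin/mahout launcher script; the path itself is an assumption and should be adjusted to the actual install location.

```shell
# Hypothetical ~/.bashrc entries; the path assumes the tarball was unpacked in $HOME.
export MAHOUT_HOME="$HOME/apache-mahout-distribution-0.11.0"
export PATH="$MAHOUT_HOME/bin:$PATH"
# MAHOUT_LOCAL is deliberately left unset so that bin/mahout runs jobs on Hadoop
# when it finds a Hadoop installation; set it to any non-empty value to force local mode.
```

After sourcing the file, running mahout with no arguments should print the list of available programs.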
2. Testing the Mahout k-means algorithm
Start Hadoop: start ZooKeeper on invin04, invin05, and invin06, then run sbin/start-dfs.sh on invin01 and sbin/start-yarn.sh on invin03.
(Start Hadoop in whatever way matches your own cluster.)
// Download the test data
cd ~
wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
cd ~/apache-mahout-distribution-0.11.0
mkdir testdata
cp ~/synthetic_control.data ~/apache-mahout-distribution-0.11.0/testdata
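The commands above place the data in a local testdata directory. When Mahout runs against the Hadoop cluster rather than in local mode, the example job reads its input from a testdata directory in HDFS, so the file probably has to be uploaded as well. A minimal sketch, assuming the hadoop command is on the PATH and the file sits at the local path used above (the DATA_FILE variable is only a name chosen for this example):

```shell
# Assumed local location of the downloaded data set.
DATA_FILE="$HOME/apache-mahout-distribution-0.11.0/testdata/synthetic_control.data"

if command -v hadoop >/dev/null 2>&1 && [ -f "$DATA_FILE" ]; then
    # Relative HDFS paths resolve against the current user's HDFS home directory.
    hadoop fs -mkdir -p testdata
    hadoop fs -put -f "$DATA_FILE" testdata/
    hadoop fs -ls testdata
else
    echo "hadoop command or $DATA_FILE not found; skipping HDFS upload" >&2
fi
```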
// Run the Mahout k-means job
cd ~/apache-mahout-distribution-0.11.0
mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
// View the results
cd ~/apache-mahout-distribution-0.11.0/output
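Where the results end up depends on how the job ran. In local mode they are in the output directory shown above; on the cluster the job writes to an output directory in HDFS instead. A small sketch that checks both places, assuming the same install path as before (OUTPUT_DIR is a name chosen for this example):

```shell
# Assumed local output location when Mahout runs in local mode.
OUTPUT_DIR="$HOME/apache-mahout-distribution-0.11.0/output"

if command -v hadoop >/dev/null 2>&1; then
    # Cluster mode: list what the job wrote under the relative HDFS path "output".
    hadoop fs -ls output
elif [ -d "$OUTPUT_DIR" ]; then
    # Local mode: list the local output directory.
    ls -l "$OUTPUT_DIR"
else
    echo "no output found yet; run the k-means job first" >&2
fi
```

Mahout also ships a clusterdump utility that can turn the final cluster directory into readable text.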
3. Mahout classification test: 20newsgroups
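Algorithm overview
Naive Bayes is a very simple classification algorithm. The idea behind it is this: for a given item to be classified, compute the probability of each class given that item, and assign the item to the class with the highest probability.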
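The decision rule described above can be written out in its standard textbook form. Here x = (x_1, ..., x_n) stands for the features of the item (for text, typically its words) and c ranges over the classes; these symbols are introduced only for this illustration, and Mahout's trainer may apply additional weighting on top of this basic rule.

```latex
\hat{c}
  = \arg\max_{c} P(c \mid x)
  = \arg\max_{c} \frac{P(c)\, P(x \mid c)}{P(x)}
  = \arg\max_{c} P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The last step drops P(x), which is the same for every class, and applies the "naive" assumption that the features are conditionally independent given the class.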
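The 20 newsgroups data set is a collection of about 20,000 newsgroup documents, spread evenly across 20 different newsgroups. It has become a popular data set for experiments with machine learning techniques on text, such as text classification and text clustering. Here we use Mahout's Bayes classifier to build a model that assigns a new document to one of the 20 newsgroups.
Example run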
cd ~/apache-mahout-distribution-0.11.0/examples/bin/
./classify-20newsgroups.sh
When prompted, choose option 1 or 2.
Summary
-------------------------------------------------------
Correctly Classified Instances : 6727 90.6237%
Incorrectly Classified Instances : 696 9.3763%
Total Classified Instances : 7423
=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j k l m n o p q r s t <--Classified as
305 0 0 0 0 0 0 0 0 0 0 1 0 0 1 2 0 1 8 1 | 319 a = alt.atheism
1 329 3 16 9 12 3 2 0 0 0 1 4 1 6 0 1 0 1 0 | 389 b = comp.graphics
2 33 232 80 12 17 4 0 0 1 0 2 2 0 0 0 0 0 0 1 | 386 c = comp.os.ms-windows.misc
0 5 2 346 16 2 12 2 0 0 1 1 7 0 0 0 0 0 0 0 | 394 d = comp.sys.ibm.pc.hardware
1 0 3 4 349 2 3 0 0 0 1 0 5 0 0 0 0 0 0 1 | 369 e = comp.sys.mac.hardware
0 19 1 4 3 330 0 0 0 1 0 2 1 1 2 0 0 0 0 0 | 364 f = comp.windows.x
0 2 0 22 12 1 314 9 3 2 1 4 7 1 4 0 0 1 0 2 | 385 g = misc.forsale
0 0 0 2 2 0 7 388 5 0 0 0 4 1 0 0 0 4 0 2 | 415 h = rec.autos
0 0 0 1 1 0 4 7 382 0 0 0 1 0 0 0 0 1 0 0 | 397 i = rec.motorcycles
0 0 0 0 0 0 1 2 2 403 1 1 1 0 0 0 0 0 1 0 | 412 j = rec.sport.baseball
0 0 0 2 0 0 1 0 1 2 368 0 0 0 0 1 0 0 0 1 | 376 k = rec.sport.hockey
0 4 1 0 1 1 2 0 0 0 0 356 2 1 0 0 1 2 0 1 | 372 l = sci.crypt
0 9 0 8 10 0 12 3 0 0 0 4 350 1 0 0 0 0 1 0 | 398 m = sci.electronics
1 2 0 1 2 1 0 1 2 0 1 1 4 374 6 0 0 1 1 4 | 402 n = sci.med
0 3 0 0 1 0 1 1 0 0 0 1 0 3 362 0 0 0 1 1 | 374 o = sci.space
2 0 0 0 1 1 1 0 0 0 0 0 1 6 0 354 0 0 4 0 | 370 p = soc.religion.christian
1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 359 0 0 3 | 366 q = talk.politics.mideast
0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 372 0 7 | 383 r = talk.politics.guns
35 0 0 1 0 1 0 2 1 1 0 0 0 1 2 9 2 9 172 8 | 244 s = talk.religion.misc
2 0 0 0 0 0 0 0 0 0 1 2 1 1 2 1 4 11 1 282 | 308 t = talk.politics.misc
=======================================================
Statistics
-------------------------------------------------------
Kappa 0.8771
Accuracy 90.6237%
Reliability 86.0341%
Reliability (standard deviation) 0.2193
Weighted precision 0.9103
Weighted recall 0.9062
Weighted F1 score 0.9049
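To show how the summary figures relate to the raw counts, the following sketch recomputes the accuracy and error rate from the Summary block, plus the recall of a single class from its confusion-matrix row. The numbers are copied from the output above; the variable names are chosen for this example only.

```shell
# Counts copied from the Summary block above.
correct=6727
incorrect=696
total=$((correct + incorrect))

# Accuracy and error rate as percentages of all classified instances.
accuracy=$(awk -v c="$correct" -v t="$total" 'BEGIN { printf "%.4f", 100 * c / t }')
error_rate=$(awk -v w="$incorrect" -v t="$total" 'BEGIN { printf "%.4f", 100 * w / t }')

# Recall for alt.atheism: diagonal entry of row a divided by the row total.
recall_a=$(awk 'BEGIN { printf "%.4f", 100 * 305 / 319 }')

echo "total=$total accuracy=${accuracy}% error=${error_rate}% recall(alt.atheism)=${recall_a}%"
```

The same row calculation applied to talk.religion.misc (172 correct out of 244) gives a noticeably lower recall, which matches the 35 documents of that class that the matrix shows as classified into alt.atheism.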