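This post mainly follows https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation. When I followed the instructions there, however, I ran into a few errors, so below I briefly walk through the problems and how I solved them.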
(Mahout version: 0.7)

1. Data:
The data can be downloaded from http://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data. The training data is 19.1M and the test data is 3.4M. The first three lines of the training data are:
0,tcp,ftp_data,SF,491,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,150,25,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal,20
0,udp,other,SF,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,13,1,0.00,0.00,0.00,0.00,0.08,0.15,0.00,255,1,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00,normal,15
0,tcp,private,S0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,123,6,1.00,1.00,0.00,0.00,0.05,0.07,0.00,255,26,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00,neptune,19

2. Running the example:

(1) Upload the data:

$hadoop_home/bin/hadoop fs -put kddTrain.txt input/kddTrain.txt
$hadoop_home/bin/hadoop fs -put kddTest.txt input/kddTest.txt
(2) Generate the description file for the raw data:
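fansy@fansyPC:~/hadoop-1.0.2$ bin/hadoop jar ../mahout-0.7-pure/core/target/mahout-core-0.7-job.jar org.apache.mahout.classifier.df.tools.Describe -p input/kddTrain.txt -f out/forest/info/kdd1.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L N

This command is not identical to the one in the referenced article (the differing parts were highlighted in red in the original version of this post). The attribute descriptions for the data can be found at http://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/kddcup.names. The downloaded data, however, has one extra column at the end that I could not find in that description file, so I appended one more Number (N) after the Label (L).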
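As a sanity check on the descriptor, the short Python sketch below expands the shorthand passed to -d and compares the result with the first training row shown earlier. It assumes, as I understand the Describe tool, that a number is a repeat count for the type code that follows it (N = numerical, C = categorical, L = label). The helper name expand_descriptor is made up for this illustration and is not part of Mahout.

```python
def expand_descriptor(spec):
    """Expand descriptor shorthand into one type code per column.

    Assumption: a number is a repeat count for the type code that follows it.
    """
    columns = []
    repeat = 1
    for token in spec.split():
        if token.isdigit():
            repeat = int(token)               # applies to the next type code
        else:
            columns.extend([token] * repeat)  # N, C or L
            repeat = 1
    return columns


DESCRIPTOR = "N 3 C 2 N C 4 N C 8 N 2 C 19 N L N"

# First training row from the post, split over several literals for readability.
FIRST_ROW = (
    "0,tcp,ftp_data,SF,491,"
    "0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"
    "2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,"
    "150,25,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,"
    "normal,20"
)

columns = expand_descriptor(DESCRIPTOR)
fields = FIRST_ROW.split(",")

print("columns described:", len(columns))
print("fields in first row:", len(fields))
print("label column (0-based):", columns.index("L"), "->", fields[columns.index("L")])
print("categorical values:", [f for f, c in zip(fields, columns) if c == "C"])
```

If the two counts differ, the descriptor and the data are out of step, which is the situation the extra trailing N is meant to fix.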
Describe prints the following messages:

12/12/26 20:02:27 INFO tools.Describe: Generating the descriptor...
12/12/26 20:02:28 INFO tools.Describe: generating the dataset...
12/12/26 20:02:31 INFO tools.Describe: storing the dataset description

(3) Build the forest:

fansy@fansyPC:~/hadoop-1.0.2$ bin/hadoop jar ../mahout-0.7-pure/examples/target/mahout-examples-0.7-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -ood -d input/kddTrain.txt -ds out/forest/info/kdd1.info -sl 5 -p -t 100 --output out/forest1

Running the command exactly as given in the referenced article produces the following error:

12/12/26 20:04:34 ERROR mapreduce.BuildForest: Exception
org.apache.commons.cli2.OptionException: Unexpected out/forest1 while processing Options
    at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
    at org.apache.mahout.classifier.df.mapreduce.BuildForest.run(BuildForest.java:139)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.classifier.df.mapreduce.BuildForest.main(BuildForest.java:253)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Based on this error, I removed the --output (-o) option (dropping -ood and keeping --output works as well), and the job then ran successfully. The forest was written to od/forest.seq. Judging by that path, -ood appears to be parsed as -o od, so the command already carried an output path and the additional --output out/forest1 was rejected.

BuildForest accepts the following options:
Usage:
 [--data <path> --dataset <dataset> --selection <m> --no-complete --minsplit <minsplit> --minprop <minprop> --seed <seed> --partial --nbtrees <nbtrees> --output <path> --help]
Options
  --data (-d) path             Data path
  --dataset (-ds) dataset      Dataset path
  --selection (-sl) m          Optional, Number of variables to select randomly at each tree-node. For classification problem, the default is square root of the number of explanatory variables. For regression problem, the default is 1/3 of the number of explanatory variables.
  --no-complete (-nc)          Optional, The tree is not complemented
  --minsplit (-ms) minsplit    Optional, The tree-node is not divided, if the branching data size is smaller than this value. The default is 2.
  --minprop (-mp) minprop      Optional, The tree-node is not divided, if the proportion of the variance of branching data is smaller than this value. In the case of a regression problem, this value is used. The default is 1/1000(0.001).
  --seed (-sd) seed            Optional, seed value used to initialise the Random number generator
  --partial (-p)               Optional, use the Partial Data implementation
  --nbtrees (-t) nbtrees       Number of trees to grow
  --output (-o) path           Output path, will contain the Decision Forest
  --help (-h)                  Print out help

Once the forest has been built, the following messages are printed:

12/12/26 20:06:21 INFO mapreduce.BuildForest: Build Time: 0h 1m 32s 618
12/12/26 20:06:21 INFO mapreduce.BuildForest: Forest num Nodes: 47353
12/12/26 20:06:21 INFO mapreduce.BuildForest: Forest mean num Nodes: 473
12/12/26 20:06:21 INFO mapreduce.BuildForest: Forest mean max Depth: 12
12/12/26 20:06:21 INFO mapreduce.BuildForest: Storing the forest in: od/forest.seq

(4) Test the forest. Command:
fansy@fansyPC:~/hadoop-1.0.2$ bin/hadoop jar ../mahout-0.7-pure/examples/target/mahout-examples-0.7-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i input/kddTest.txt -ds out/forest/info/kdd1.info -m od/forest.seq -a -mr -o predictions

The test output is as follows:

Summary
-------------------------------------------------------
Correctly Classified Instances     : 16285    72.2365%
Incorrectly Classified Instances   : 6259     27.7635%
Total Classified Instances         : 22544

TestForest usage:

Usage:
 [--input <input> --dataset <dataset> --model <path> --output <output> --analyze --mapreduce --help]
Options
  --input (-i) input        Path to job input directory.
  --dataset (-ds) dataset   Dataset path
  --model (-m) path         Path to the Decision Forest
  --output (-o) output      The directory pathname for output.
  --analyze (-a)
  --mapreduce (-mr)
  --help (-h)               Print out help
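The percentages in the test summary follow directly from the instance counts. As a quick cross-check, here is a small Python sketch that recomputes them from the three numbers reported by TestForest. The variable names and the percent helper are my own and are not part of Mahout.

```python
# Counts copied from the TestForest summary in this post.
correct = 16285
incorrect = 6259
total = 22544


def percent(part, whole):
    """Return part/whole as a percentage rounded to four decimals."""
    return round(100.0 * part / whole, 4)


accuracy = percent(correct, total)
error_rate = percent(incorrect, total)

print("counts add up:", correct + incorrect == total)
print("accuracy (%):", accuracy)
print("error rate (%):", error_rate)
```

The recomputed values agree with the 72.2365% and 27.7635% printed by the tool, and the two counts sum to the total of 22544 classified instances.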