Parsing the Data After Mahout Bayes Classification

Environment: Mahout 0.7, Hadoop 1.0.4

To run this example, see: http://blog.csdn.net/fansy1990/article/details/11681565.

First, the raw input data:

0.2,0.3,0.4:1
0.32,0.43,0.45:1
0.23,0.33,0.54:1
2.4,2.5,2.6:2
2.3,2.2,2.1:2
5.4,7.2,7.2:3
5.6,7,6:3
5.8,7.1,6.3:3
6,6,5.4:3
11,12,13:4
The first three columns of each line are the sample's attributes; the last column is the sample's label, i.e. its class.
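To make this format concrete, here is a minimal, hypothetical sketch (my own illustration, not the transform code inside BayesRunner) of how one such line could be split into a Mahout feature vector plus a label, using the same separators that the job below is given (-scv "," for the attributes, -scl ":" for the label):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Hypothetical parsing sketch, assuming "," separates attributes and ":" separates the label.
public class ParseLineSketch {
    public static void main(String[] args) {
        String line = "0.32,0.43,0.45:1";
        String[] vectorAndLabel = line.split(":");      // attributes vs. label (-scl ":")
        String[] attrs = vectorAndLabel[0].split(",");  // individual attributes (-scv ",")
        Vector features = new RandomAccessSparseVector(attrs.length);
        for (int i = 0; i < attrs.length; i++) {
            features.set(i, Double.parseDouble(attrs[i]));
        }
        String label = vectorAndLabel[1];               // the class, here "1"
        System.out.println(features + " -> " + label);
    }
}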

Here is the console output from the run:

mahout@ubuntu:~/hadoop-1.0.4/bin$ ./hadoop jar ../lib/mahout.jar mahout.fansy.bayes.BayesRunner -i /bayes/input/bayes.txt -o /bayes/output -scv , -scl : --tempDir /bayes/temp
Warning: $HADOOP_HOME is deprecated.

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/mahout/hadoop-1.0.4/lib/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/mahout/hadoop-1.0.4/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
14/01/19 22:23:59 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/bayes/input/bayes.txt], --output=[/bayes/output], --splitCharacterLabel=[:], --splitCharacterVector=[,], --startPhase=[0], --tempDir=[/bayes/temp]}
***********************************转换数据开始
14/01/19 22:24:00 WARN fs.FileSystem: "ubuntu:9000" is a deprecated filesystem name. Use "hdfs://ubuntu:9000/" instead.
14/01/19 22:24:00 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/bayes/input/bayes.txt], --output=[/bayes/output/transform], --splitCharacterLabel=[:], --splitCharacterVector=[,], --startPhase=[0], --tempDir=[temp]}
14/01/19 22:24:02 INFO input.FileInputFormat: Total input paths to process : 1
14/01/19 22:24:03 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/01/19 22:24:03 WARN snappy.LoadSnappy: Snappy native library not loaded
14/01/19 22:24:07 INFO mapred.JobClient: Running job: job_201401030100_0025
14/01/19 22:24:09 INFO mapred.JobClient:  map 0% reduce 0%
14/01/19 22:24:45 INFO mapred.JobClient:  map 100% reduce 0%
14/01/19 22:24:51 INFO mapred.JobClient: Job complete: job_201401030100_0025
14/01/19 22:24:51 INFO mapred.JobClient: Counters: 19
14/01/19 22:24:51 INFO mapred.JobClient:   Job Counters 
14/01/19 22:24:51 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=30542
14/01/19 22:24:51 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/19 22:24:51 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/19 22:24:51 INFO mapred.JobClient:     Launched map tasks=1
14/01/19 22:24:51 INFO mapred.JobClient:     Data-local map tasks=1
14/01/19 22:24:51 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/01/19 22:24:51 INFO mapred.JobClient:   File Output Format Counters 
14/01/19 22:24:51 INFO mapred.JobClient:     Bytes Written=520
14/01/19 22:24:51 INFO mapred.JobClient:   FileSystemCounters
14/01/19 22:24:51 INFO mapred.JobClient:     HDFS_BYTES_READ=240
14/01/19 22:24:51 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=21938
14/01/19 22:24:51 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=520
14/01/19 22:24:51 INFO mapred.JobClient:   File Input Format Counters 
14/01/19 22:24:51 INFO mapred.JobClient:     Bytes Read=135
14/01/19 22:24:51 INFO mapred.JobClient:   Map-Reduce Framework
14/01/19 22:24:51 INFO mapred.JobClient:     Map input records=10
14/01/19 22:24:51 INFO mapred.JobClient:     Physical memory (bytes) snapshot=66392064
14/01/19 22:24:51 INFO mapred.JobClient:     Spilled Records=0
14/01/19 22:24:51 INFO mapred.JobClient:     CPU time spent (ms)=1810
14/01/19 22:24:51 INFO mapred.JobClient:     Total committed heap usage (bytes)=15728640
14/01/19 22:24:51 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=974467072
14/01/19 22:24:51 INFO mapred.JobClient:     Map output records=10
14/01/19 22:24:51 INFO mapred.JobClient:     SPLIT_RAW_BYTES=105
***********************************写入indexLabel任务开始
labels number is : 4
***********************************BayesJob1开始执行
14/01/19 22:24:52 WARN fs.FileSystem: "ubuntu:9000" is a deprecated filesystem name. Use "hdfs://ubuntu:9000/" instead.
14/01/19 22:24:52 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/bayes/output/transform], --labelIndex=[/bayes/output/labelIndex.bin], --output=[/bayes/output/job1], --startPhase=[0], --tempDir=[temp]}
14/01/19 22:24:52 INFO input.FileInputFormat: Total input paths to process : 1
14/01/19 22:24:53 INFO mapred.JobClient: Running job: job_201401030100_0026
14/01/19 22:24:54 INFO mapred.JobClient:  map 0% reduce 0%
14/01/19 22:26:11 INFO mapred.JobClient:  map 100% reduce 0%
14/01/19 22:26:41 INFO mapred.JobClient:  map 100% reduce 100%
14/01/19 22:26:47 INFO mapred.JobClient: Job complete: job_201401030100_0026
14/01/19 22:26:47 INFO mapred.JobClient: Counters: 29
14/01/19 22:26:47 INFO mapred.JobClient:   Job Counters 
14/01/19 22:26:47 INFO mapred.JobClient:     Launched reduce tasks=1
14/01/19 22:26:47 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=70509
14/01/19 22:26:47 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/19 22:26:47 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/19 22:26:47 INFO mapred.JobClient:     Launched map tasks=1
14/01/19 22:26:47 INFO mapred.JobClient:     Data-local map tasks=1
14/01/19 22:26:47 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=29595
14/01/19 22:26:47 INFO mapred.JobClient:   File Output Format Counters 
14/01/19 22:26:47 INFO mapred.JobClient:     Bytes Written=277
14/01/19 22:26:47 INFO mapred.JobClient:   FileSystemCounters
14/01/19 22:26:47 INFO mapred.JobClient:     FILE_BYTES_READ=162
14/01/19 22:26:47 INFO mapred.JobClient:     HDFS_BYTES_READ=780
14/01/19 22:26:47 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=45819
14/01/19 22:26:47 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=277
14/01/19 22:26:47 INFO mapred.JobClient:   File Input Format Counters 
14/01/19 22:26:47 INFO mapred.JobClient:     Bytes Read=520
14/01/19 22:26:47 INFO mapred.JobClient:   Map-Reduce Framework
14/01/19 22:26:47 INFO mapred.JobClient:     Map output materialized bytes=162
14/01/19 22:26:47 INFO mapred.JobClient:     Map input records=10
14/01/19 22:26:47 INFO mapred.JobClient:     Reduce shuffle bytes=0
14/01/19 22:26:47 INFO mapred.JobClient:     Spilled Records=8
14/01/19 22:26:47 INFO mapred.JobClient:     Map output bytes=370
14/01/19 22:26:47 INFO mapred.JobClient:     Total committed heap usage (bytes)=131207168
14/01/19 22:26:47 INFO mapred.JobClient:     CPU time spent (ms)=40250
14/01/19 22:26:47 INFO mapred.JobClient:     Combine input records=10
14/01/19 22:26:47 INFO mapred.JobClient:     SPLIT_RAW_BYTES=119
14/01/19 22:26:47 INFO mapred.JobClient:     Reduce input records=4
14/01/19 22:26:47 INFO mapred.JobClient:     Reduce input groups=4
14/01/19 22:26:47 INFO mapred.JobClient:     Combine output records=4
14/01/19 22:26:47 INFO mapred.JobClient:     Physical memory (bytes) snapshot=245063680
14/01/19 22:26:47 INFO mapred.JobClient:     Reduce output records=4
14/01/19 22:26:47 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1956704256
14/01/19 22:26:47 INFO mapred.JobClient:     Map output records=10
***********************************BayesJob2开始执行
14/01/19 22:26:47 WARN fs.FileSystem: "ubuntu:9000" is a deprecated filesystem name. Use "hdfs://ubuntu:9000/" instead.
14/01/19 22:26:47 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/bayes/output/job1], --labelNumber=[4], --output=[/bayes/output/job2], --startPhase=[0], --tempDir=[temp]}
14/01/19 22:26:47 INFO input.FileInputFormat: Total input paths to process : 1
14/01/19 22:26:48 INFO mapred.JobClient: Running job: job_201401030100_0027
14/01/19 22:26:49 INFO mapred.JobClient:  map 0% reduce 0%
14/01/19 22:27:04 INFO mapred.JobClient:  map 100% reduce 0%
14/01/19 22:27:13 INFO mapred.JobClient:  map 100% reduce 33%
14/01/19 22:27:19 INFO mapred.JobClient:  map 100% reduce 100%
14/01/19 22:27:24 INFO mapred.JobClient: Job complete: job_201401030100_0027
14/01/19 22:27:24 INFO mapred.JobClient: Counters: 29
14/01/19 22:27:24 INFO mapred.JobClient:   Job Counters 
14/01/19 22:27:24 INFO mapred.JobClient:     Launched reduce tasks=1
14/01/19 22:27:24 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=16104
14/01/19 22:27:24 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/19 22:27:24 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/19 22:27:24 INFO mapred.JobClient:     Launched map tasks=1
14/01/19 22:27:24 INFO mapred.JobClient:     Data-local map tasks=1
14/01/19 22:27:24 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14798
14/01/19 22:27:24 INFO mapred.JobClient:   File Output Format Counters 
14/01/19 22:27:24 INFO mapred.JobClient:     Bytes Written=187
14/01/19 22:27:24 INFO mapred.JobClient:   FileSystemCounters
14/01/19 22:27:24 INFO mapred.JobClient:     FILE_BYTES_READ=91
14/01/19 22:27:24 INFO mapred.JobClient:     HDFS_BYTES_READ=391
14/01/19 22:27:24 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=44843
14/01/19 22:27:24 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=187
14/01/19 22:27:24 INFO mapred.JobClient:   File Input Format Counters 
14/01/19 22:27:24 INFO mapred.JobClient:     Bytes Read=277
14/01/19 22:27:24 INFO mapred.JobClient:   Map-Reduce Framework
14/01/19 22:27:24 INFO mapred.JobClient:     Map output materialized bytes=91
14/01/19 22:27:24 INFO mapred.JobClient:     Map input records=4
14/01/19 22:27:24 INFO mapred.JobClient:     Reduce shuffle bytes=91
14/01/19 22:27:24 INFO mapred.JobClient:     Spilled Records=4
14/01/19 22:27:24 INFO mapred.JobClient:     Map output bytes=81
14/01/19 22:27:24 INFO mapred.JobClient:     Total committed heap usage (bytes)=176033792
14/01/19 22:27:24 INFO mapred.JobClient:     CPU time spent (ms)=3770
14/01/19 22:27:24 INFO mapred.JobClient:     Combine input records=2
14/01/19 22:27:24 INFO mapred.JobClient:     SPLIT_RAW_BYTES=114
14/01/19 22:27:24 INFO mapred.JobClient:     Reduce input records=2
14/01/19 22:27:24 INFO mapred.JobClient:     Reduce input groups=2
14/01/19 22:27:24 INFO mapred.JobClient:     Combine output records=2
14/01/19 22:27:24 INFO mapred.JobClient:     Physical memory (bytes) snapshot=248565760
14/01/19 22:27:24 INFO mapred.JobClient:     Reduce output records=2
14/01/19 22:27:24 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1957756928
14/01/19 22:27:24 INFO mapred.JobClient:     Map output records=2
***********************************写入bayesian model 任务开始
Write bayesian model to '/bayes/output/model/naiveBayesModel.bin'
***********************************分类任务开始
14/01/19 22:27:24 WARN fs.FileSystem: "ubuntu:9000" is a deprecated filesystem name. Use "hdfs://ubuntu:9000/" instead.
14/01/19 22:27:24 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/bayes/output/transform], --labelNumber=[4], --model=[/bayes/output/model], --output=[/bayes/output/classified], --startPhase=[0], --tempDir=[temp]}
14/01/19 22:27:25 INFO input.FileInputFormat: Total input paths to process : 1
14/01/19 22:27:25 INFO mapred.JobClient: Running job: job_201401030100_0028
14/01/19 22:27:26 INFO mapred.JobClient:  map 0% reduce 0%
14/01/19 22:27:40 INFO mapred.JobClient:  map 100% reduce 0%
14/01/19 22:27:45 INFO mapred.JobClient: Job complete: job_201401030100_0028
14/01/19 22:27:45 INFO mapred.JobClient: Counters: 19
14/01/19 22:27:45 INFO mapred.JobClient:   Job Counters 
14/01/19 22:27:45 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=15591
14/01/19 22:27:45 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/19 22:27:45 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/19 22:27:45 INFO mapred.JobClient:     Launched map tasks=1
14/01/19 22:27:45 INFO mapred.JobClient:     Data-local map tasks=1
14/01/19 22:27:45 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/01/19 22:27:45 INFO mapred.JobClient:   File Output Format Counters 
14/01/19 22:27:45 INFO mapred.JobClient:     Bytes Written=530
14/01/19 22:27:45 INFO mapred.JobClient:   FileSystemCounters
14/01/19 22:27:45 INFO mapred.JobClient:     HDFS_BYTES_READ=847
14/01/19 22:27:45 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=22519
14/01/19 22:27:45 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=530
14/01/19 22:27:45 INFO mapred.JobClient:   File Input Format Counters 
14/01/19 22:27:45 INFO mapred.JobClient:     Bytes Read=520
14/01/19 22:27:45 INFO mapred.JobClient:   Map-Reduce Framework
14/01/19 22:27:45 INFO mapred.JobClient:     Map input records=10
14/01/19 22:27:45 INFO mapred.JobClient:     Physical memory (bytes) snapshot=70705152
14/01/19 22:27:45 INFO mapred.JobClient:     Spilled Records=0
14/01/19 22:27:45 INFO mapred.JobClient:     CPU time spent (ms)=380
14/01/19 22:27:45 INFO mapred.JobClient:     Total committed heap usage (bytes)=15728640
14/01/19 22:27:45 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=974467072
14/01/19 22:27:45 INFO mapred.JobClient:     Map output records=10
14/01/19 22:27:45 INFO mapred.JobClient:     SPLIT_RAW_BYTES=119
***********************************打印测试信息开始
14/01/19 22:27:46 INFO bayes.AnalyzeBayesModel: Standard NB Results: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :          7	        70%
Incorrectly Classified Instances        :          3	        30%
Total Classified Instances              :         10

=======================================================
Confusion Matrix
-------------------------------------------------------
a    	b    	c    	d    	<--Classified as
3    	0    	0    	0    	 |  3     	a     = 1
0    	1    	0    	1    	 |  2     	b     = 2
1    	1    	2    	0    	 |  4     	c     = 3
0    	0    	0    	1    	 |  1     	d     = 4


Reading the confusion matrix at the end: the first row says that all 3 records with true label a (= 1) were classified as a. The second row says that of the 2 records with true label 2, one was classified correctly as b (= 2) and one was misclassified as d (= 4). The third row says that of the 4 records with true label 3, one was classified as 1, one as 2, and two correctly as 3. The fourth row says the single record with true label 4 was classified correctly as 4.

Putting the confusion matrix together, 3 of the 10 records were misclassified, which agrees with the Summary above (70% correct).
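As a purely illustrative sanity check (not part of Mahout), summing the diagonal of the confusion matrix reproduces the Summary numbers:

// Check the Summary numbers against the confusion matrix printed above.
public class ConfusionCheck {
    public static void main(String[] args) {
        int[][] confusion = {
            {3, 0, 0, 0},   // true label 1
            {0, 1, 0, 1},   // true label 2
            {1, 1, 2, 0},   // true label 3
            {0, 0, 0, 1}    // true label 4
        };
        int correct = 0;
        int total = 0;
        for (int i = 0; i < confusion.length; i++) {
            for (int j = 0; j < confusion[i].length; j++) {
                total += confusion[i][j];
                if (i == j) {
                    correct += confusion[i][j];  // diagonal entries are the correctly classified ones
                }
            }
        }
        System.out.println(correct + " / " + total);  // prints "7 / 10", i.e. 70%
    }
}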

Next, read back the classified data; the parsed output is:

key:1	value:{0:-0.9648420640410421,1:-0.9887510598012988,2:-0.9825450990081924,3:-0.9751164124745173}  1
key:1	value:{0:-1.3075365110427406,1:-1.3183347464017319,2:-1.311911294294109,3:-1.3105998253880884}   1
key:1	value:{0:-1.1693011980970653,1:-1.2084735175349208,2:-1.2029164271675432,3:-1.1868650353368242}  1
key:2	value:{0:-8.268914090881214,1:-8.239592165010823,2:-8.24988457628885,3:-8.239013935827634}       4  <--2
key:2	value:{0:-7.335239784640972,1:-7.250841105209526,2:-7.275795216083463,3:-7.2793125913358425}     2   
key:3	value:{0:-21.62735541679608,1:-21.752523315628576,2:-21.647378264642075,3:-21.65117653755887}    1  <--3
key:3	value:{0:-20.51606634444014,1:-20.434188569226844,2:-20.35905447985551,3:-20.437779899276315}    3
key:3	value:{0:-21.16521415402051,1:-21.093355942427706,2:-21.02858358865721,3:-21.09072342236577}     3  
key:3	value:{0:-19.34824288117226,1:-19.11585382282511,2:-19.158537241429514,3:-19.19592701923623}     2  <--3
key:4	value:{0:-39.52871529566568,1:-39.55004239205195,2:-39.55547612441188,3:-39.46710853846247}      4
Note that everything after the closing } character was added by me by hand; it is not in the file. From this output you can see that each record gets a 4-dimensional score vector (the number of dimensions equals the number of labels), and this vector determines which label the record is assigned. How exactly is that decided? Look at the analyzeResults method in TestNaiveBayesDriver, shown below:

private static void analyzeResults(Map<Integer, String> labelMap,
                                     SequenceFileDirIterable<Text, VectorWritable> dirIterable,
                                     ResultAnalyzer analyzer) {
    for (Pair<Text, VectorWritable> pair : dirIterable) {
      int bestIdx = Integer.MIN_VALUE;
      double bestScore = Long.MIN_VALUE;
      // pick the index with the highest score in this record's score vector
      for (Vector.Element element : pair.getSecond().get()) {
        if (element.get() > bestScore) {
          bestScore = element.get();
          bestIdx = element.index();
        }
      }
      if (bestIdx != Integer.MIN_VALUE) {
        // map the winning index back to its label string and record the result
        ClassifierResult classifierResult = new ClassifierResult(labelMap.get(bestIdx), bestScore);
        analyzer.addInstance(pair.getFirst().toString(), classifierResult);
      }
    }
  }

In essence, it takes the index of the largest value in the vector, and that index determines the label the record is assigned to. For example, the first record:

value:{0:-0.9648420640410421,1:-0.9887510598012988,2:-0.9825450990081924,3:-0.9751164124745173}
The largest value is at index 0, so this record is assigned to class 1. (Strictly speaking, the index is mapped back to a label through labelMap, which is read from the labelIndex.bin written during training; with the labels 1-4 in this example, the class number happens to be the index plus 1.)

Now take the 4th record:

key:2	value:{0:-8.268914090881214,1:-8.239592165010823,2:-8.24988457628885,3:-8.239013935827634}       4  <--2
Here the largest value is at index 3 (-8.239013935827634, just above -8.239592165010823 at index 1), so the record is classified as class 4, while its true label is 2, so this record is misclassified.

Finally, here is some code that classifies a sample using the trained model:

package mahout.fansy.bayes.classify;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.naivebayes.AbstractNaiveBayesClassifier;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class BayesClassifier {

	/**
	 * @param args
	 * @throws IOException 
	 */
	public static void main(String[] args) throws IOException {
		Configuration conf=new Configuration();
		conf.set("fs.default.name", "ubuntu:9000");
		conf.set("mapred.job.tracker", "ubuntu:9001");
		
		Path model=new Path("/bayes/output/model");
	//	Vector vector=new DenseVector(3);
		Vector vector= new RandomAccessSparseVector(3);
		vector.set(0, 0.2);
		vector.set(1, 0.3);
		vector.set(2, 0.4);
		int result=new BayesClassifier().classify(conf, model, vector);
		System.out.println(result);
	}
	
	/**
	 * get bayes model
	 * @param conf
	 * @param modelPath
	 * @return
	 * @throws IOException
	 */
	public NaiveBayesModel getBayesModel(Configuration  conf,Path modelPath) throws IOException{
		NaiveBayesModel model = NaiveBayesModel.materialize(modelPath, conf); 
		return model;
	}
	/**
	 * get classifier by bayes model
	 * @param model
	 * @return
	 */
	public AbstractNaiveBayesClassifier getClassifier(NaiveBayesModel model){
		AbstractNaiveBayesClassifier classifier=new StandardNaiveBayesClassifier(model);
		return classifier;
	}
	/**
	 * classify the given vector 
	 * @param classifier
	 * @param vector
	 */
	public int classify(AbstractNaiveBayesClassifier classifier, Vector vector){
		Vector result = classifier.classifyFull(vector);
		System.out.println(result);
		// same argmax logic as analyzeResults: return the index of the highest score
		int bestIdx = Integer.MIN_VALUE;
		double bestScore = Long.MIN_VALUE;
		for (Vector.Element element : result) {
			if (element.get() > bestScore) {
				bestScore = element.get();
				bestIdx = element.index();
			}
		}
		return bestIdx;
	}
	
	/**
	 * classify the vector
	 * @param conf
	 * @param model
	 * @param vector
	 * @return
	 * @throws IOException
	 */
	public int classify(Configuration conf,Path model,Vector vector) throws IOException{
		return this.classify(this.getClassifier(this.getBayesModel(conf, model)),vector);
	}

}

Running this program prints:

{0:-0.9648420640410421,1:-0.9887510598012988,2:-0.9825450990081924,3:-0.9751164124745173}
0
which is consistent with the parsed classification output above.
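The classifier returns the winning index (0 here) rather than the label string itself. If the original label is needed, one option is to read the label index written during training; the following is a hedged sketch that assumes Mahout's BayesUtils.readLabelIndex is available and uses the labelIndex.bin path shown in the job log above:

import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.naivebayes.BayesUtils;

public class LabelLookup {

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		conf.set("fs.default.name", "ubuntu:9000");
		// path taken from the job log above (--labelIndex=[/bayes/output/labelIndex.bin]);
		// adjust it if your output directory differs
		Map<Integer, String> labelMap =
				BayesUtils.readLabelIndex(conf, new Path("/bayes/output/labelIndex.bin"));
		// index 0 is what BayesClassifier returned; expected to print the original label, e.g. "1"
		System.out.println(labelMap.get(0));
	}
}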


Share, grow, be happy.

If you repost this article, please cite the blog address: http://blog.csdn.net/fansy1990



