Mahout: TestNaiveBayesDriver Source Code Analysis

A sequential option determines whether the test runs locally; here we only cover the MapReduce path. The source code is as follows:

   private boolean runMapReduce(Map<String, List<String>> parsedArgs)
       throws IOException, InterruptedException, ClassNotFoundException {
     Path model = new Path(getOption("model"));
     HadoopUtil.cacheFiles(model, getConf());
     // the output key is the expected value, the output value are the scores for all the labels
     Job testJob = prepareJob(getInputPath(), getOutputPath(), SequenceFileInputFormat.class,
         BayesTestMapper.class, Text.class, VectorWritable.class, SequenceFileOutputFormat.class);
     boolean complementary = parsedArgs.containsKey("testComplementary");
     testJob.getConfiguration().set(COMPLEMENTARY, String.valueOf(complementary));
     boolean succeeded = testJob.waitForCompletion(true);
     return succeeded;
   }

First, the model is obtained from the trained model path and instantiated; this is simply reading back the vectors that were written out during training. testJob uses only a map phase, shown below:

   protected void map(Text key, VectorWritable value, Context context)
       throws IOException, InterruptedException {
     Vector result = classifier.classifyFull(value.get());
     // the key is the expected value
     context.write(new Text(key.toString().split("/")[1]), new VectorWritable(result));
   }
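The split("/")[1] pulls the class name out of keys of the form /<label>/<doc-id>, as produced by the upstream vectorization step. A minimal sketch of that extraction (the key string here is a hypothetical example, not real Mahout output):

```java
public class KeySplitDemo {
    public static void main(String[] args) {
        // hypothetical input key of the form "/<label>/<doc-id>"
        String key = "/sports/doc42";
        // split("/") yields ["", "sports", "doc42"]; index 1 is the label
        String label = key.split("/")[1];
        System.out.println(label); // prints "sports"
    }
}
```

Note that because the key starts with the delimiter, index 0 of the split result is an empty string, which is why the label sits at index 1.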

The output key is the category text; the output value is the input vector's score for each class. classifier.classifyFull() computes the input vector's score for every label:

   public Vector classifyFull(Vector instance) {
     Vector score = model.createScoringVector();
     for (int label = 0; label < model.numLabels(); label++) {
       score.set(label, getScoreForLabelInstance(label, instance));
     }
     return score;
   }

getScoreForLabelInstance, shown below, sums the feature scores for the given label.

   protected double getScoreForLabelInstance(int label, Vector instance) {
     double result = 0.0;
     Iterator<Element> elements = instance.iterateNonZero();
     while (elements.hasNext()) {
       Element e = elements.next();
       result += e.get() * getScoreForLabelFeature(label, e.index());
     }
     return result;
   }
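The loop above is a sparse dot product between the instance and the per-label feature weights. A self-contained sketch, with a plain array standing in for getScoreForLabelFeature (all names and numbers here are illustrative):

```java
public class ScoreSketch {
    // stand-in for getScoreForLabelFeature: per-feature log-weights for one label
    static double[] labelWeights = { -1.0, -2.0, -0.5 };

    static double score(double[] instance) {
        double result = 0.0;
        // visit only non-zero entries, mirroring iterateNonZero()
        for (int i = 0; i < instance.length; i++) {
            if (instance[i] != 0.0) {
                result += instance[i] * labelWeights[i];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // feature 1 is zero, so only features 0 and 2 contribute:
        // 2*(-1.0) + 4*(-0.5) = -4.0
        System.out.println(score(new double[] { 2.0, 0.0, 4.0 }));
    }
}
```

Skipping zero entries is what makes this cheap for the high-dimensional, mostly-zero term vectors that text classification produces.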

getScoreForLabelFeature has two implementations:
1. Standard Bayes: log[(Wi + α)/(ΣWi + α·N)], where Wi is the weight of feature i under this label, ΣWi is the label's total weight, α is the smoothing parameter alphaI, and N is the number of features.

   public double getScoreForLabelFeature(int label, int feature) {
     NaiveBayesModel model = getModel();
     return computeWeight(model.weight(label, feature), model.labelWeight(label),
         model.alphaI(), model.numFeatures());
   }

   public static double computeWeight(double featureLabelWeight, double labelWeight,
       double alphaI, double numFeatures) {
     double numerator = featureLabelWeight + alphaI;
     double denominator = labelWeight + alphaI * numFeatures;
     return Math.log(numerator / denominator);
   }
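As a sanity check on the standard formula, here is a self-contained copy of computeWeight evaluated on illustrative numbers (the values are made up, not from any real model):

```java
public class StandardWeightDemo {
    // same formula as the standard Bayes computeWeight above
    static double computeWeight(double featureLabelWeight, double labelWeight,
                                double alphaI, double numFeatures) {
        double numerator = featureLabelWeight + alphaI;     // Wi + alpha
        double denominator = labelWeight + alphaI * numFeatures; // sum(W) + alpha*N
        return Math.log(numerator / denominator);
    }

    public static void main(String[] args) {
        // illustrative values: Wi = 1, label total = 10, alpha = 1, N = 5
        // -> log((1+1)/(10+5)) = log(2/15) ≈ -2.015
        System.out.println(computeWeight(1.0, 10.0, 1.0, 5.0));
    }
}
```

The result is always negative, since the smoothed numerator can never exceed the smoothed label total; higher (less negative) values mean the feature is more indicative of the label.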

2. Complementary Bayes, which scores a label using the weight of every class other than this one.

   // complementary bayes
   public double getScoreForLabelFeature(int label, int feature) {
     NaiveBayesModel model = getModel();
     return computeWeight(model.featureWeight(feature), model.weight(label, feature),
         model.totalWeightSum(), model.labelWeight(label), model.alphaI(), model.numFeatures());
   }

   public static double computeWeight(double featureWeight, double featureLabelWeight,
       double totalWeight, double labelWeight, double alphaI, double numFeatures) {
     double numerator = featureWeight - featureLabelWeight + alphaI;
     double denominator = totalWeight - labelWeight + alphaI * numFeatures;
     return -Math.log(numerator / denominator);
   }
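Here subtracting the label's own contribution from the feature and corpus totals leaves only the other classes, and the leading minus sign flips the score so that a feature rare in the complement still ranks the label highly. A self-contained copy with illustrative numbers (all values are made up):

```java
public class ComplementaryWeightDemo {
    // same formula as the complementary computeWeight above
    static double computeWeight(double featureWeight, double featureLabelWeight,
                                double totalWeight, double labelWeight,
                                double alphaI, double numFeatures) {
        double numerator = featureWeight - featureLabelWeight + alphaI;   // feature weight outside this label
        double denominator = totalWeight - labelWeight + alphaI * numFeatures; // corpus weight outside this label
        return -Math.log(numerator / denominator);
    }

    public static void main(String[] args) {
        // illustrative values: feature total 3 (of which 1 in this label),
        // corpus total 30, label total 10, alpha = 1, N = 5
        // -> -log((3-1+1)/(30-10+5)) = -log(3/25)
        System.out.println(computeWeight(3.0, 1.0, 30.0, 10.0, 1.0, 5.0));
    }
}
```

Because of the negation, the complementary score is positive here, and it grows as the feature becomes rarer among the other classes.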

Finally comes the analysis step: for each key, take the index of the maximum value in the score vector as the predicted label, compare it against the actual label index, and accumulate the confusion matrix.
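That step can be sketched with plain arrays; this is a toy illustration of the argmax-and-tally logic, not Mahout's actual ConfusionMatrix API, and the labels and scores are made up:

```java
public class ConfusionSketch {
    // argmax over a score vector = predicted label index
    static int argmax(double[] scores) {
        int best = 0;
        for (int l = 1; l < scores.length; l++) {
            if (scores[l] > scores[best]) {
                best = l;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int numLabels = 2;
        int[][] confusion = new int[numLabels][numLabels]; // [actual][predicted]

        // two test instances: actual label plus a per-label score vector
        int[] actual = { 0, 1 };
        double[][] scores = { { -1.0, -3.0 }, { -2.0, -4.0 } };

        for (int i = 0; i < actual.length; i++) {
            confusion[actual[i]][argmax(scores[i])]++;
        }
        // first instance is classified correctly, the second is not
        System.out.println(confusion[0][0] + " " + confusion[1][0]); // prints "1 1"
    }
}
```

Diagonal cells count correct predictions, so accuracy is just the diagonal sum over the total instance count.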

http://hnote.org/big-data/mahout/mahout-testnaivebayesdriver-testnb

 
