Bayesian Classification (classify-20newsgroups)
I Theoretical Analysis
The theory follows the paper "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" (Rennie et al., ICML 2003).
The multinomial model of naive Bayes is given by the following formula:

$$p(d \mid \vec{\theta}_c) = \frac{\left(\sum_i f_i\right)!}{\prod_i f_i!} \prod_i \left(\theta_{ci}\right)^{f_i}$$

A document $d$ is represented by the words it contains: $f_i$ is the number of occurrences of word $i$ in $d$, and $p(d \mid \vec{\theta}_c)$ is the probability that, conditioned on class $c$, the current document is $d$.
$\vec{\theta}_c = \{\theta_{c1}, \ldots, \theta_{cn}\}$ denotes the parameter vector of class $c$; there are $m$ classes in total, and each class vector consists of probability parameters for the $n$ words of the vocabulary, $\theta_{ci}$ being the probability of word $i$ within class $c$.
To obtain the likelihood of document $d$, the prior probability of class $c$ is usually added as well; in practice, however, the priors are often all equal.
The key to the likelihood is estimating $\theta_{ci}$. Following Heckerman, D. (1995), "A tutorial on learning with Bayesian networks", the smoothed estimate for word $i$ in class $c$ is

$$\hat{\theta}_{ci} = \frac{N_{ci} + \alpha_i}{N_c + \alpha}$$

where $N_{ci}$ is the number of times word $i$ occurs in the training documents of class $c$, $N_c = \sum_i N_{ci}$, $\alpha_i$ is a smoothing parameter, and $\alpha = \sum_i \alpha_i$.
The decision rule, in log space, is then

$$l(d) = \operatorname*{arg\,max}_c \Big[ \log p(\vec{\theta}_c) + \sum_i f_i \log \hat{\theta}_{ci} \Big]$$

Mahout's Bayes classifier assumes by default that every class has the same prior probability, so the $c$ that maximizes the sum above is the target class; this is the logic of StandardNaiveBayesClassifier.java.
For ComplementaryNaiveBayesClassifier.java, the corresponding formulas are

$$\tilde{\theta}_{ci} = \frac{\tilde{N}_{ci} + \alpha_i}{\tilde{N}_c + \alpha}, \qquad l_{\mathrm{CNB}}(d) = \operatorname*{arg\,max}_c \Big[ \log p(\vec{\theta}_c) - \sum_i f_i \log \tilde{\theta}_{ci} \Big]$$

where $\tilde{N}_{ci}$ is the number of times word $i$ occurs in documents that do not belong to class $c$. The larger $\tilde{\theta}_{ci}$, the less likely the document belongs to $c$, hence the minus sign, the opposite of StandardNaiveBayesClassifier above.
In general $\tilde{N}_c$ is much larger than $N_c$, i.e. the complement sample is larger and more balanced, so the complement estimate yields a more accurate likelihood.
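A tiny worked example with made-up counts illustrates the last point. Suppose a vocabulary of $n = 2$ words with $\alpha_i = 1$ (so $\alpha = 2$), and suppose word $i$ has $N_{ci} = 2$ occurrences in class $c$ (class total $N_c = 10$) and $\tilde{N}_{ci} = 30$ occurrences outside it (complement total $\tilde{N}_c = 100$):

$$\hat{\theta}_{ci} = \frac{2 + 1}{10 + 2} = 0.25, \qquad \tilde{\theta}_{ci} = \frac{30 + 1}{100 + 2} \approx 0.304$$

The complement estimate is backed by ten times as much data, so it fluctuates far less with the makeup of a small training class.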
II Code Analysis
1 Create sequence files from the 20newsgroups corpus
./bin/mahout seqdirectory \
-i ${WORK_DIR}/20news-all \
-o ${WORK_DIR}/20news-seq -ow
The input is the files under each category's directory; the output is a sequence file whose key is the file name and whose value is the file content.
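For example, one record might look like key = /rec.sport.hockey/53899 (a hypothetical file path) with the full text of that posting as the value.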
2 Create vectors from the sequence files
./bin/mahout seq2sparse \
-i ${WORK_DIR}/20news-seq \
-o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
Options: -lnorm log-normalizes the output vectors; -nv emits NamedVectors; -wt tfidf selects the TF-IDF weighting model, see http://zh.wikipedia.org/zh/TF-IDF.
DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);
Tokenizes each document; the output is {doc1, [term1, ...], ...}.
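For example, assuming the default Lucene StandardAnalyzer, the text "The Rangers won the game" would be lowercased and stripped of stopwords, yielding a record like {doc1, [rangers, won, game]}.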
if (processIdf) {
  DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
                                                  outputDir,
                                                  tfDirName,
                                                  conf,
                                                  minSupport,
                                                  maxNGramSize,
                                                  minLLRValue,
                                                  -1.0f,
                                                  false,
                                                  reduceTasks,
                                                  chunkSize,
                                                  sequentialAccessOutput,
                                                  namedVectors);
}
This call first builds the corpus-wide dictionary, counting how many times each word is used; the output is {word1, num1, ...}.
It then counts word usage per document, producing {doc1, [Num-term1, ...], ...}, where term is the word's index in the dictionary.
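A minimal single-machine sketch of these two passes may help (purely illustrative, assuming the toy documents below; the real implementation is a chain of MapReduce jobs with minSupport filtering, n-gram generation, and dictionary chunking):

import java.util.*;

public class TermFrequencySketch {
  public static void main(String[] args) {
    List<List<String>> docs = Arrays.asList(
        Arrays.asList("atheism", "god", "god"),
        Arrays.asList("hockey", "god"));

    // Pass 1: build the dictionary; Mahout also records each word's global
    // count here ({word1, num1, ...}) for minSupport filtering.
    Map<String, Integer> dictionary = new LinkedHashMap<>();
    for (List<String> doc : docs) {
      for (String word : doc) {
        dictionary.putIfAbsent(word, dictionary.size());
      }
    }

    // Pass 2: per-document term-frequency vectors {doc1, [Num-term1, ...], ...},
    // keyed by the word's index in the dictionary.
    for (List<String> doc : docs) {
      Map<Integer, Integer> tf = new TreeMap<>();
      for (String word : doc) {
        tf.merge(dictionary.get(word), 1, Integer::sum);
      }
      System.out.println(tf); // e.g. {0=1, 1=2} for the first document
    }
  }
}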
docFrequenciesFeatures =
    TFIDFConverter.calculateDF(new Path(outputDir, tfDirName), outputDir, conf, chunkSize);
Computes each word's document frequency over the whole corpus; the output is {-1, docs-num, term1, num1, ...}, where the entry under key -1 holds the total number of documents.
TFIDFConverter.processTfIdf(
    new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
    outputDir, conf, docFrequenciesFeatures, minDf, maxDF, norm, logNormalize,
    sequentialAccessOutput, namedVectors, reduceTasks);
Computes the tf-idf value for each term of each document; the output is {doc1, [Num-term-idf1, ...], ...}.
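The weighting itself follows the classic tf-idf scheme (Mahout's exact variant goes through Lucene's similarity formula, but the standard form conveys the idea): for term $t$ in document $d$,

$$w_{t,d} = tf_{t,d} \cdot \log\frac{N}{df_t}$$

where $tf_{t,d}$ is the term count from the vectors above, $df_t$ the document frequency from calculateDF, and $N$ the total number of documents stored under key -1.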
3 Split the data into training and test sets
./bin/mahout split \
-i ${WORK_DIR}/20news-vectors/tfidf-vectors \
--trainingOutput ${WORK_DIR}/20news-train-vectors \
--testOutput ${WORK_DIR}/20news-test-vectors \
--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
--randomSelectionPct 40 randomly holds out 40% of the vectors as the test set; the rest goes to training.
4 Train the Bayes model
./bin/mahout trainnb \
-i ${WORK_DIR}/20news-train-vectors -el \
-o ${WORK_DIR}/model \
-li ${WORK_DIR}/labelindex \
-ow $c
long labelSize = createLabelIndex(labPath);
Builds the label index, i.e. the set of class labels in the training set. (In the example script, $c above expands to the -c flag when the complementary variant is selected.)
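A hypothetical sketch of what the label index holds, assuming the toy class names below (the real createLabelIndex derives the labels from the input keys when -el is given):

import java.util.*;

public class LabelIndexSketch {
  public static void main(String[] args) {
    // Illustrative only: a stable mapping from class name to integer index;
    // the jobs below use the index in place of the textual label.
    Map<String, Integer> labelIndex = new LinkedHashMap<>();
    for (String label : Arrays.asList("alt.atheism", "comp.graphics", "rec.sport.hockey")) {
      labelIndex.putIfAbsent(label, labelIndex.size());
    }
    System.out.println(labelIndex.size()); // labelSize = 3 in this toy example
  }
}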
Job indexInstances = prepareJob(getInputPath(),
                                getTempPath(SUMMED_OBSERVATIONS),
                                SequenceFileInputFormat.class,
                                IndexInstancesMapper.class,
                                IntWritable.class,
                                VectorWritable.class,
                                VectorSumReducer.class,
                                IntWritable.class,
                                VectorWritable.class,
                                SequenceFileOutputFormat.class);
indexInstances.setCombinerClass(VectorSumReducer.class);
boolean succeeded = indexInstances.waitForCompletion(true);
if (!succeeded) {
  return -1;
}
Sums the term vectors of all documents in each class, converting the class label into its index in the label index.
The corresponding output is {class1, [Num-term1, ...], ...}.
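A simplified single-machine rendition of what this job computes (an assumption for illustration, not Mahout's actual code; the keys and vectors are made up):

import java.util.*;

public class SumPerLabelSketch {
  public static void main(String[] args) {
    // key is "/label/docId", value is the document's tf-idf vector
    Map<String, double[]> docs = new LinkedHashMap<>();
    docs.put("/rec.sport.hockey/1", new double[] {0.0, 1.2, 0.7});
    docs.put("/rec.sport.hockey/2", new double[] {0.5, 0.0, 0.9});
    docs.put("/alt.atheism/1",      new double[] {1.1, 0.3, 0.0});

    Map<String, Integer> labelIndex = new LinkedHashMap<>(); // from createLabelIndex
    Map<Integer, double[]> summed = new LinkedHashMap<>();   // {class, [Num-term1, ...]}
    for (Map.Entry<String, double[]> e : docs.entrySet()) {
      String label = e.getKey().split("/")[1];
      int index = labelIndex.computeIfAbsent(label, k -> labelIndex.size());
      double[] sum = summed.computeIfAbsent(index, k -> new double[3]);
      for (int i = 0; i < sum.length; i++) {
        sum[i] += e.getValue()[i];
      }
    }
    // rec.sport.hockey (index 0) sums to approximately [0.5, 1.2, 1.6]
    System.out.println(Arrays.toString(summed.get(0)));
  }
}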
Job weightSummer = prepareJob(getTempPath(SUMMED_OBSERVATIONS),
                              getTempPath(WEIGHTS),
                              SequenceFileInputFormat.class,
                              WeightsMapper.class,
                              Text.class,
                              VectorWritable.class,
                              VectorSumReducer.class,
                              Text.class,
                              VectorWritable.class,
                              SequenceFileOutputFormat.class);
weightSummer.getConfiguration().set(WeightsMapper.NUM_LABELS, String.valueOf(labelSize));
weightSummer.setCombinerClass(VectorSumReducer.class);
Computes the weight of each label (class) and of each feature (term); the output is {WEIGHTS_PER_FEATURE, [feature1, num1, ...], WEIGHTS_PER_LABEL, [label1, num1, ...]}. In the notation of the theory section, the per-label weight of class $c$ is $N_c$ and the per-feature weight of term $i$ is $\sum_c N_{ci}$.
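Continuing the toy numbers, a sketch (again an assumption, not the real WeightsMapper) of the two aggregates this job produces:

import java.util.Arrays;

public class WeightSummerSketch {
  public static void main(String[] args) {
    // Per-class sums from the previous job, one row per label.
    double[][] summedPerLabel = { {0.5, 1.2, 1.6}, {1.1, 0.3, 0.0} };
    double[] weightsPerFeature = new double[3]; // sum over classes, per term
    double[] weightsPerLabel = new double[2];   // sum over terms, per class
    for (int c = 0; c < summedPerLabel.length; c++) {
      for (int i = 0; i < summedPerLabel[c].length; i++) {
        weightsPerFeature[i] += summedPerLabel[c][i];
        weightsPerLabel[c] += summedPerLabel[c][i];
      }
    }
    System.out.println(Arrays.toString(weightsPerFeature)); // approx. [1.6, 1.5, 1.6]
    System.out.println(Arrays.toString(weightsPerLabel));   // approx. [3.3, 1.4]
  }
}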
5 Test the Bayes model
./bin/mahout testnb \
-i ${WORK_DIR}/20news-test-vectors \
-m ${WORK_DIR}/model \
-l ${WORK_DIR}/labelindex \
-ow -o ${WORK_DIR}/20news-testing $c
public Vector classifyFull(Vector r, Vector instance) {
  for (int label = 0; label < model.numLabels(); label++) {
    r.setQuick(label, getScoreForLabelInstance(label, instance));
  }
  return r;
}

protected double getScoreForLabelInstance(int label, Vector instance) {
  double result = 0.0;
  for (Element e : instance.nonZeroes()) {
    result += e.get() * getScoreForLabelFeature(label, e.index());
  }
  return result;
}

@Override
public double getScoreForLabelFeature(int label, int feature) {
  NaiveBayesModel model = getModel();
  return computeWeight(model.weight(label, feature), model.labelWeight(label), model.alphaI(),
      model.numFeatures());
}

public static double computeWeight(double featureLabelWeight, double labelWeight, double alphaI,
    double numFeatures) {
  double numerator = featureLabelWeight + alphaI;
  double denominator = labelWeight + alphaI * numFeatures;
  return Math.log(numerator / denominator);
}
Following the formulas of the theory section, this computes each document's log-likelihood score for every class: computeWeight is exactly $\log \hat{\theta}_{ci} = \log \frac{N_{ci} + \alpha_I}{N_c + \alpha_I n}$ (uniform smoothing $\alpha_i = \alpha_I$, so $\alpha = \alpha_I n$), and getScoreForLabelInstance accumulates $\sum_i f_i \log \hat{\theta}_{ci}$, with the tf-idf value playing the role of $f_i$.
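A quick numeric check (toy numbers, purely illustrative) ties the code back to the formula:

// Suppose term i has total weight 2 in class c, the class's total weight is 10,
// alphaI = 1, and the dictionary holds 5 features:
double w = computeWeight(2.0, 10.0, 1.0, 5.0);
// = Math.log((2 + 1) / (10 + 1 * 5)) = Math.log(0.2) ≈ -1.609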