Bayesian Classification (classify-20newsgroups)
I Theoretical Analysis
The theory follows the paper "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" (Rennie et al., ICML 2003).
The multinomial model of naive Bayes is given by the following formula:

$$p(d \mid \vec{\theta}_c) = \frac{\left(\sum_i f_i\right)!}{\prod_i f_i!} \prod_i \left(\theta_{ci}\right)^{f_i}$$

A document $d$ is represented by the words it contains: $f_i$ is the number of occurrences of word $i$ in $d$, and $p(d \mid \vec{\theta}_c)$ is the probability that, conditioned on class $c$, the current document is $d$.
$\vec{\theta}_c = \{\theta_{c1}, \ldots, \theta_{cn}\}$ denotes the parameter vector of class $c$; there are $m$ classes in total, and each class vector consists of probability parameters for the $n$ words of the vocabulary, $\theta_{ci}$ being the probability of word $i$ within class $c$.
To obtain the likelihood of document $d$, the prior probability of class $c$ is usually added as well; in practice, however, the priors are often all equal.
The key to the likelihood is estimating $\theta_{ci}$. Following Heckerman, D. (1995), "A tutorial on learning with Bayesian networks", the smoothed estimate for word $i$ in class $c$ is

$$\hat{\theta}_{ci} = \frac{N_{ci} + \alpha_i}{N_c + \alpha}$$

where $N_{ci}$ is the number of times word $i$ occurs in the training documents of class $c$, $N_c = \sum_i N_{ci}$, $\alpha_i$ is a smoothing parameter, and $\alpha = \sum_i \alpha_i$.
The decision rule, in log space, is then

$$l(d) = \operatorname*{arg\,max}_c \Big[ \log p(\vec{\theta}_c) + \sum_i f_i \log \hat{\theta}_{ci} \Big]$$

Mahout's Bayes classifier assumes by default that every class has the same prior probability, so the $c$ that maximizes the sum above is the target class; this is the logic of StandardNaiveBayesClassifier.java.
For ComplementaryNaiveBayesClassifier.java, the corresponding formulas are

$$\tilde{\theta}_{ci} = \frac{\tilde{N}_{ci} + \alpha_i}{\tilde{N}_c + \alpha}, \qquad l_{\mathrm{CNB}}(d) = \operatorname*{arg\,max}_c \Big[ \log p(\vec{\theta}_c) - \sum_i f_i \log \tilde{\theta}_{ci} \Big]$$

where $\tilde{N}_{ci}$ is the number of times word $i$ occurs in documents that do not belong to class $c$. The larger $\tilde{\theta}_{ci}$, the less likely the document belongs to $c$, hence the minus sign, the opposite of StandardNaiveBayesClassifier above.
In general $\tilde{N}_c$ is much larger than $N_c$, i.e. the complement sample is larger and more balanced, so the complement estimate yields a more accurate likelihood.
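A tiny worked example with made-up counts illustrates the last point. Suppose a vocabulary of $n = 2$ words with $\alpha_i = 1$ (so $\alpha = 2$), and suppose word $i$ has $N_{ci} = 2$ occurrences in class $c$ (class total $N_c = 10$) and $\tilde{N}_{ci} = 30$ occurrences outside it (complement total $\tilde{N}_c = 100$):

$$\hat{\theta}_{ci} = \frac{2 + 1}{10 + 2} = 0.25, \qquad \tilde{\theta}_{ci} = \frac{30 + 1}{100 + 2} \approx 0.304$$

The complement estimate is backed by ten times as much data, so it fluctuates far less with the makeup of a small training class.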
II Code Analysis
1 Create sequence files from the 20newsgroups corpus
./bin/mahout seqdirectory \
-i ${WORK_DIR}/20news-all \
-o ${WORK_DIR}/20news-seq -ow
The input is the files under each category's directory; the output is a sequence file whose key is the file name and whose value is the file content.
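For example, one record might look like key = /rec.sport.hockey/53899 (a hypothetical file path) with the full text of that posting as the value.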
2 Create vectors from the sequence files
./bin/mahout seq2sparse \
-i ${WORK_DIR}/20news-seq \
-o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
Options: -lnorm log-normalizes the output vectors; -nv emits NamedVectors; -wt tfidf selects the TF-IDF weighting model, see http://zh.wikipedia.org/zh/TF-IDF.
DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);
Tokenizes each document; the output is {doc1, [term1, ...], ...}.
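For example, assuming the default Lucene StandardAnalyzer, the text "The Rangers won the game" would be lowercased and stripped of stopwords, yielding a record like {doc1, [rangers, won, game]}.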
if (processIdf) {
  DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
                                                  outputDir,
                                                  tfDirName,
                                                  conf,
                                                  minSupport,
                                                  maxNGramSize,
                                                  minLLRValue,
                                                  -1.0f,
                                                  false,
                                                  reduceTasks,
                                                  chunkSize,
                                                  sequentialAccessOutput,
                                                  namedVectors);
}
This call first builds the corpus-wide dictionary, counting how many times each word is used; the output is {word1, num1, ...}.
It then counts word usage per document, producing {doc1, [Num-term1, ...], ...}, where term is the word's index in the dictionary.
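A minimal single-machine sketch of these two passes may help (purely illustrative, assuming the toy documents below; the real implementation is a chain of MapReduce jobs with minSupport filtering, n-gram generation, and dictionary chunking):

import java.util.*;

public class TermFrequencySketch {
  public static void main(String[] args) {
    List<List<String>> docs = Arrays.asList(
        Arrays.asList("atheism", "god", "god"),
        Arrays.asList("hockey", "god"));

    // Pass 1: build the dictionary; Mahout also records each word's global
    // count here ({word1, num1, ...}) for minSupport filtering.
    Map<String, Integer> dictionary = new LinkedHashMap<>();
    for (List<String> doc : docs) {
      for (String word : doc) {
        dictionary.putIfAbsent(word, dictionary.size());
      }
    }

    // Pass 2: per-document term-frequency vectors {doc1, [Num-term1, ...], ...},
    // keyed by the word's index in the dictionary.
    for (List<String> doc : docs) {
      Map<Integer, Integer> tf = new TreeMap<>();
      for (String word : doc) {
        tf.merge(dictionary.get(word), 1, Integer::sum);
      }
      System.out.println(tf); // e.g. {0=1, 1=2} for the first document
    }
  }
}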
docFrequenciesFeatures =
    TFIDFConverter.calculateDF(new Path(outputDir, tfDirName), outputDir, conf, chunkSize);
Computes each word's document frequency over the whole corpus; the output is {-1, docs-num, term1, num1, ...}, where the entry under key -1 holds the total number of documents.
TFIDFConverter.processTfIdf(
    new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
    outputDir, conf, docFrequenciesFeatures, minDf, maxDF, norm, logNormalize,
    sequentialAccessOutput, namedVectors, reduceTasks);
Computes the tf-idf value for each term of each document; the output is {doc1, [Num-term-idf1, ...], ...}.
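The weighting itself follows the classic tf-idf scheme (Mahout's exact variant goes through Lucene's similarity formula, but the standard form conveys the idea): for term $t$ in document $d$,

$$w_{t,d} = tf_{t,d} \cdot \log\frac{N}{df_t}$$

where $tf_{t,d}$ is the term count from the vectors above, $df_t$ the document frequency from calculateDF, and $N$ the total number of documents stored under key -1.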
3 Split the data into training and test sets
./bin/mahout split \
-i ${WORK_DIR}/20news-vectors/tfidf-vectors \
--trainingOutput ${WORK_DIR}/20news-train-vectors \
--testOutput ${WORK_DIR}/20news-test-vectors \
--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
--randomSelectionPct 40 randomly holds out 40% of the vectors as the test set; the rest goes to training.
4 Train the Bayes model
./bin/mahout trainnb \
-i ${WORK_DIR}/20news-train-vectors -el \
-o ${WORK_DIR}/model \
-li ${WORK_DIR}/labelindex \
-ow $c
long labelSize = createLabelIndex(labPath);
Builds the label index, i.e. the set of class labels in the training set. (In the example script, $c above expands to the -c flag when the complementary variant is selected.)
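A hypothetical sketch of what the label index holds, assuming the toy class names below (the real createLabelIndex derives the labels from the input keys when -el is given):

import java.util.*;

public class LabelIndexSketch {
  public static void main(String[] args) {
    // Illustrative only: a stable mapping from class name to integer index;
    // the jobs below use the index in place of the textual label.
    Map<String, Integer> labelIndex = new LinkedHashMap<>();
    for (String label : Arrays.asList("alt.atheism", "comp.graphics", "rec.sport.hockey")) {
      labelIndex.putIfAbsent(label, labelIndex.size());
    }
    System.out.println(labelIndex.size()); // labelSize = 3 in this toy example
  }
}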
Job indexInstances = prepareJob(getInputPath(),
                                getTempPath(SUMMED_OBSERVATIONS),
                                SequenceFileInputFormat.class,
                                IndexInstancesMapper.class,
                                IntWritable.class,
                                VectorWritable.class,
                                VectorSumReducer.class,
                                IntWritable.class,
                                VectorWritable.class,
                                SequenceFileOutputFormat.class);
indexInstances.setCombinerClass(VectorSumReducer.class);
boolean succeeded = indexInstances.waitForCompletion(true);
if (!succeeded) {
  return -1;
}
Sums the term vectors of all documents in each class, converting the class label into its index in the label index.
The corresponding output is {class1, [Num-term1, ...], ...}.
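A simplified single-machine rendition of what this job computes (an assumption for illustration, not Mahout's actual code; the keys and vectors are made up):

import java.util.*;

public class SumPerLabelSketch {
  public static void main(String[] args) {
    // key is "/label/docId", value is the document's tf-idf vector
    Map<String, double[]> docs = new LinkedHashMap<>();
    docs.put("/rec.sport.hockey/1", new double[] {0.0, 1.2, 0.7});
    docs.put("/rec.sport.hockey/2", new double[] {0.5, 0.0, 0.9});
    docs.put("/alt.atheism/1",      new double[] {1.1, 0.3, 0.0});

    Map<String, Integer> labelIndex = new LinkedHashMap<>(); // from createLabelIndex
    Map<Integer, double[]> summed = new LinkedHashMap<>();   // {class, [Num-term1, ...]}
    for (Map.Entry<String, double[]> e : docs.entrySet()) {
      String label = e.getKey().split("/")[1];
      int index = labelIndex.computeIfAbsent(label, k -> labelIndex.size());
      double[] sum = summed.computeIfAbsent(index, k -> new double[3]);
      for (int i = 0; i < sum.length; i++) {
        sum[i] += e.getValue()[i];
      }
    }
    // rec.sport.hockey (index 0) sums to approximately [0.5, 1.2, 1.6]
    System.out.println(Arrays.toString(summed.get(0)));
  }
}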
Job weightSummer = prepareJob(getTempPath(SUMMED_OBSERVATIONS),
                              getTempPath(WEIGHTS),
                              SequenceFileInputFormat.class,
                              WeightsMapper.class,
                              Text.class,
                              VectorWritable.class,
                              VectorSumReducer.class,
                              Text.class,
                              VectorWritable.class,
                              SequenceFileOutputFormat.class);
weightSummer.getConfiguration().set(WeightsMapper.NUM_LABELS, String.valueOf(labelSize));
weightSummer.setCombinerClass(VectorSumReducer.class);
Computes the weight of each label (class) and of each feature (term); the output is {WEIGHTS_PER_FEATURE, [feature1, num1, ...], WEIGHTS_PER_LABEL, [label1, num1, ...]}. In the notation of the theory section, the per-label weight of class $c$ is $N_c$ and the per-feature weight of term $i$ is $\sum_c N_{ci}$.
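Continuing the toy numbers, a sketch (again an assumption, not the real WeightsMapper) of the two aggregates this job produces:

import java.util.Arrays;

public class WeightSummerSketch {
  public static void main(String[] args) {
    // Per-class sums from the previous job, one row per label.
    double[][] summedPerLabel = { {0.5, 1.2, 1.6}, {1.1, 0.3, 0.0} };
    double[] weightsPerFeature = new double[3]; // sum over classes, per term
    double[] weightsPerLabel = new double[2];   // sum over terms, per class
    for (int c = 0; c < summedPerLabel.length; c++) {
      for (int i = 0; i < summedPerLabel[c].length; i++) {
        weightsPerFeature[i] += summedPerLabel[c][i];
        weightsPerLabel[c] += summedPerLabel[c][i];
      }
    }
    System.out.println(Arrays.toString(weightsPerFeature)); // approx. [1.6, 1.5, 1.6]
    System.out.println(Arrays.toString(weightsPerLabel));   // approx. [3.3, 1.4]
  }
}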
5 Test the Bayes model
./bin/mahout testnb \
-i ${WORK_DIR}/20news-test-vectors \
-m ${WORK_DIR}/model \
-l ${WORK_DIR}/labelindex \
-ow -o ${WORK_DIR}/20news-testing $c
public Vector classifyFull(Vector r, Vector instance) {
  for (int label = 0; label < model.numLabels(); label++) {
    r.setQuick(label, getScoreForLabelInstance(label, instance));
  }
  return r;
}

protected double getScoreForLabelInstance(int label, Vector instance) {
  double result = 0.0;
  for (Element e : instance.nonZeroes()) {
    result += e.get() * getScoreForLabelFeature(label, e.index());
  }
  return result;
}

@Override
public double getScoreForLabelFeature(int label, int feature) {
  NaiveBayesModel model = getModel();
  return computeWeight(model.weight(label, feature), model.labelWeight(label), model.alphaI(),
      model.numFeatures());
}

public static double computeWeight(double featureLabelWeight, double labelWeight, double alphaI,
    double numFeatures) {
  double numerator = featureLabelWeight + alphaI;
  double denominator = labelWeight + alphaI * numFeatures;
  return Math.log(numerator / denominator);
}
Following the formulas of the theory section, this computes each document's log-likelihood score for every class: computeWeight is exactly $\log \hat{\theta}_{ci} = \log \frac{N_{ci} + \alpha_I}{N_c + \alpha_I n}$ (uniform smoothing $\alpha_i = \alpha_I$, so $\alpha = \alpha_I n$), and getScoreForLabelInstance accumulates $\sum_i f_i \log \hat{\theta}_{ci}$, with the tf-idf value playing the role of $f_i$.
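A quick numeric check (toy numbers, purely illustrative) ties the code back to the formula:

// Suppose term i has total weight 2 in class c, the class's total weight is 10,
// alphaI = 1, and the dictionary holds 5 features:
double w = computeWeight(2.0, 10.0, 1.0, 5.0);
// = Math.log((2 + 1) / (10 + 1 * 5)) = Math.log(0.2) ≈ -1.609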