Analysis of Parallel Bayesian Classification
1 The Bayes Trainer
The implementation is divided up into three parts:
The Trainer -- responsible for doing the counting of the words and the labels
The Model -- responsible for holding the training data in a useful way
The Classifier -- responsible for using the trainer's output to determine the category of previously unseen documents
The trainer is manifested in several classes:
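One of them creates the Hadoop Bayes job and outputs the model; it wraps four map/reduce classes.
The trainer's input is read with KeyValueTextInputFormat. The first token on each line is the class label and the remaining tokens are the features (words). In the following example, hockey and football are the class labels and the rest are features: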
hockey puck stick goalie forward defenseman referee ice checking slapshot helmet
football field football pigskin referee helmet turf tackle
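To make the line layout concrete, here is a minimal sketch of splitting one training line into its label and features. LabeledLine is a hypothetical helper written for this illustration and is not part of Mahout. It assumes whitespace-separated tokens, as in the example above; note that Hadoop's KeyValueTextInputFormat itself splits key from value at the first tab character by default, so real input files may need a tab after the label.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not a Mahout class): one training line as label + features. */
final class LabeledLine {
    final String label;
    final List<String> features;

    private LabeledLine(String label, List<String> features) {
        this.label = label;
        this.features = features;
    }

    /** Treats the first whitespace-separated token as the label, the rest as features. */
    static LabeledLine parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length == 0 || tokens[0].isEmpty()) {
            throw new IllegalArgumentException("empty training line");
        }
        List<String> features =
                new ArrayList<>(Arrays.asList(tokens).subList(1, tokens.length));
        return new LabeledLine(tokens[0], features);
    }

    public static void main(String[] args) {
        LabeledLine doc = parse("football field football pigskin referee helmet turf tackle");
        System.out.println(doc.label + " -> " + doc.features);
    }
}
```

Only the first token is taken as the label, so the word football can still appear as an ordinary feature later in the same line.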
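Package: org.apache.mahout.classifier.bayes
This class is responsible for training the Bayes classifier. Input format: each line is one document; the first token is the class label and the rest are features (words).
Depending on the command-line arguments, this class invokes one of two trainers: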
The trainCNaiveBayes function calls the CBayesDriver class; trainNaiveBayes calls the BayesDriver class.
The sections below analyze the CBayesDriver and BayesDriver classes in turn.
BayesDriver
Package: org.apache.mahout.classifier.bayes.mapreduce.bayes
public class BayesDriver extends Object implements BayesJob
BayesDriver implements the BayesJob interface. Its runJob function invokes four map/reduce job classes:
First: BayesFeatureDriver, which reads the features in each document, normalized by the length of each document.
Second: BayesTfIdfDriver, which calculates the TF-IDF for each word in each label.
Third: BayesWeightSummerDriver, which calculates the sums of weights for each label and for each feature.
Fourth: BayesThetaNormalizerDriver, which calculates the normalization factor Sigma_W_ij for each complement class.
Each of these four classes is analyzed below.
The first map/reduce class: BayesFeatureDriver
Package: org.apache.mahout.classifier.bayes.mapreduce.common
Output key type: StringTuple.class
Output value type: DoubleWritable.class
Input format: KeyValueTextInputFormat.class
Output format: BayesFeatureOutputFormat.class
Mapper: BayesFeatureMapper.class
Reducer: BayesFeatureReducer.class
Note: BayesFeatureDriver can be run standalone, with the following default input and output:
input = new Path("/home/drew/mahout/bayes/20news-input");
output = new Path("/home/drew/mahout/bayes/20-news-features");
p = new BayesParameters(1) (the gram size defaults to 1)
The job writes three output files:
$OUTPUT/trainer-termDocCount
$OUTPUT/trainer-wordFreq
$OUTPUT/trainer-featureCount
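The second map/reduce class, BayesTfIdfDriver, uses the output files of the first job to calculate the TF-IDF values. When the calculation finishes, it deletes these three intermediate files and generates trainer-tfIdf, which stores the TF-IDF values of the text features.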
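As a rough illustration of what a TF-IDF value is, the sketch below computes it in plain Java for the two example documents. This is not the Mahout implementation: TfIdfSketch and its methods are hypothetical names, and the weighting used here (term frequency divided by document length, multiplied by ln(N / df)) is the textbook form, assumed for illustration. The exact formula inside BayesTfIdfDriver may differ.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration (not Mahout code): textbook TF-IDF over tokenized documents. */
final class TfIdfSketch {

    /** Term frequency of each feature, normalized by the document length. */
    static Map<String, Double> termFrequencies(List<String> features) {
        Map<String, Double> tf = new HashMap<>();
        for (String f : features) {
            tf.merge(f, 1.0 / features.size(), Double::sum);
        }
        return tf;
    }

    /** Number of documents in which each feature occurs at least once. */
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String f : new HashSet<>(doc)) {
                df.merge(f, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Assumed weighting: tf * ln(numDocs / docFreq). */
    static double tfIdf(double tf, int docFreq, int numDocs) {
        return tf * Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        List<String> hockey = Arrays.asList("puck", "stick", "goalie", "forward", "defenseman",
                "referee", "ice", "checking", "slapshot", "helmet");
        List<String> football = Arrays.asList("field", "football", "pigskin", "referee",
                "helmet", "turf", "tackle");
        List<List<String>> docs = Arrays.asList(hockey, football);
        Map<String, Integer> df = documentFrequencies(docs);
        for (Map.Entry<String, Double> e : termFrequencies(hockey).entrySet()) {
            System.out.println(e.getKey() + "\t" + tfIdf(e.getValue(), df.get(e.getKey()), docs.size()));
        }
    }
}
```

Under this weighting, a word such as referee that occurs in every document gets weight zero, while a word unique to one class, such as puck, keeps a positive weight.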
The third job: BayesWeightSummerDriver
Output key: StringTuple.class
Output value: DoubleWritable.class
Input path: the trainer-tfIdf file generated by the second map/reduce job
Output: the trainer-weights file
Input file format: SequenceFileInputFormat.class
Output file format: BayesWeightSummerOutputFormat.class
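The summing step can be pictured with a small in-memory sketch. It assumes, based only on the description above, that the job adds up TF-IDF weights per label and per feature. WeightSumSketch is a hypothetical class, not Mahout code, and the grand total it keeps is an addition for illustration. In the real job this aggregation is spread across mappers and reducers rather than held in one map.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration (not Mahout code): sums weights per label and per feature. */
final class WeightSumSketch {
    final Map<String, Double> sumByLabel = new HashMap<>();
    final Map<String, Double> sumByFeature = new HashMap<>();
    double total = 0.0; // grand total, kept here only for illustration

    /** Adds one (label, feature, weight) entry, e.g. one TF-IDF value. */
    void add(String label, String feature, double weight) {
        sumByLabel.merge(label, weight, Double::sum);
        sumByFeature.merge(feature, weight, Double::sum);
        total += weight;
    }

    public static void main(String[] args) {
        WeightSumSketch sums = new WeightSumSketch();
        sums.add("hockey", "puck", 0.5);
        sums.add("hockey", "helmet", 0.25);
        sums.add("football", "helmet", 0.25);
        System.out.println("per label:   " + sums.sumByLabel);
        System.out.println("per feature: " + sums.sumByFeature);
        System.out.println("total:       " + sums.total);
    }
}
```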
The fourth job: BayesThetaNormalizerDriver
Output key: StringTuple.class
Output value: DoubleWritable.class
Input path:
FileInputFormat.addInputPath(conf, new Path(output, "trainer-tfIdf/trainer-tfIdf"));
That is, it needs the output of the second job, the trainer-tfIdf file.
Output path:
Path outPath = new Path(output, "trainer-thetaNormalizer");
This generates the trainer-thetaNormalizer file.
Output file format: SequenceFileOutputFormat.class
Once these four jobs have finished, the whole Bayes model is built. In total, three directories are generated and kept:
trainer-tfIdf
trainer-thetaNormalizer
trainer-weights
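With the model built, the next step is to test how well the classifier performs.
Class invoked: TestClassifier
Package: org.apache.mahout.classifier.bayes
Depending on the command-line arguments, it runs either sequentially or in parallel as a map/reduce job. The parallel run calls the BayesClassifierDriver class.
BayesClassifierDriver is analyzed below.
Package: org.apache.mahout.classifier.bayes.mapreduce.bayes
Input format: KeyValueTextInputFormat.class
Output format: SequenceFileOutputFormat.class
When the run finishes, the confusion matrix (ConfusionMatrix) is called to display the results.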
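For readers unfamiliar with confusion matrices, the following standalone sketch shows the idea: rows are the actual labels, columns are the predicted labels, and each cell counts how many test documents fell into that pair. ConfusionMatrixSketch is a hypothetical class written for this post and does not reproduce the API of Mahout's ConfusionMatrix.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration (not Mahout's ConfusionMatrix): counts actual vs. predicted labels. */
final class ConfusionMatrixSketch {
    private final List<String> labels;
    private final int[][] counts; // counts[actual][predicted]

    ConfusionMatrixSketch(List<String> labels) {
        this.labels = labels;
        this.counts = new int[labels.size()][labels.size()];
    }

    private int index(String label) {
        int i = labels.indexOf(label);
        if (i < 0) {
            throw new IllegalArgumentException("unknown label: " + label);
        }
        return i;
    }

    /** Records one classified test document. */
    void add(String actual, String predicted) {
        counts[index(actual)][index(predicted)]++;
    }

    int count(String actual, String predicted) {
        return counts[index(actual)][index(predicted)];
    }

    /** Fraction of documents on the diagonal, i.e. classified correctly. */
    double accuracy() {
        int correct = 0;
        int total = 0;
        for (int a = 0; a < counts.length; a++) {
            for (int p = 0; p < counts.length; p++) {
                total += counts[a][p];
                if (a == p) {
                    correct += counts[a][p];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        ConfusionMatrixSketch m = new ConfusionMatrixSketch(Arrays.asList("hockey", "football"));
        m.add("hockey", "hockey");
        m.add("hockey", "football");
        m.add("football", "football");
        System.out.println("accuracy = " + m.accuracy());
    }
}
```

Off-diagonal cells show which classes get confused with each other, which a single accuracy figure hides.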