Bayes' Formula and the Mahout Bayes Classifier

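Bayes' Formula and Bayes Classifiers
Bayes' theorem is useful because we often run into the following situation in practice:
P(A|B) is easy to obtain directly, while P(B|A) is hard to obtain directly even though it is the quantity we care about more. Bayes' theorem gives us a path from P(A|B) to P(B|A).
L(A|B) is the likelihood of A given that B has occurred.
Pr(A|B) is the conditional probability of A given that B has occurred; because it is derived from the value of B, it is also called the posterior probability of A.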
A comparison of the main Chinese tokenizers currently available for Lucene:
http://www.iteye.com/news/9637


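Take spam filtering as an example: H is the event that an email is spam, and X is a word that appears in the email.
P(H|X) is the probability that the email is spam given that the word X appears in it
P(X|H) is the probability that X appears given that the email is spam
P(X) is the probability that X appears in an email, over all emails
P(H) is the probability that an email is spam, over all emails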
P(H|X)=P(X|H)P(H)/P(X)
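To make the formula concrete, here is a small worked example in Python. All the probabilities are made-up numbers chosen for illustration, not measurements from real mail, and the helper name `posterior` is hypothetical. P(X) is obtained from the law of total probability, which requires one extra assumed quantity: the probability that X appears in a non-spam email.

```python
def posterior(p_x_given_h, p_h, p_x):
    """Bayes' formula: P(H|X) = P(X|H) * P(H) / P(X)."""
    return p_x_given_h * p_h / p_x


# Hypothetical numbers, for illustration only.
p_h = 0.2              # P(H): fraction of all emails that are spam
p_x_given_h = 0.5      # P(X|H): the word X appears in half of the spam emails
p_x_given_not_h = 0.05  # P(X|not H): X appears in 5% of non-spam emails

# Law of total probability: P(X) = P(X|H)P(H) + P(X|not H)P(not H)
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

p_h_given_x = posterior(p_x_given_h, p_h, p_x)
print(f"P(X)   = {p_x:.2f}")
print(f"P(H|X) = {p_h_given_x:.4f}")
```

With these numbers the word X raises the probability of spam from the prior of 0.2 to roughly 0.71, which is the kind of update a Bayes classifier performs for every word in a document.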


#Splitting the dataset with Pig
Load the tokenized file
processed = load '/opt/digitalout/part-r-00000' as (category:chararray, doc:chararray);
Randomly sample about 20% of the records as the test set
test = sample processed 0.2;
Take the remaining records as the training set
jnt = join processed by (category,doc) left outer, test by (category,doc);
filt_test = filter jnt by test::category is null;
train = foreach filt_test generate processed::category as category, processed::doc as doc;
-- First left-join the original dataset processed with the test set test
-- Then drop every record that has a matching test record
Write out the results
store test into '/opt/digitalout/test';
store train into '/opt/digitalout/train';
Count the records per category in the test set and the training set
test_ct = foreach (group test by category) generate group,COUNT(test.category);
dump test_ct;
train_ct = foreach (group train by category) generate group,COUNT(train.category);
dump train_ct;
#Serializing the files
#mahout seqdirectory  --input /opt/digitalout/train --output /opt/digitalout/train_byes --tempDir /opt/digitalout/temp
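
The split logic of the Pig script above (sample a test set, then keep everything that did not land in it) can be sketched in plain Python. This is only an illustration of the idea under stated assumptions: the function name `split_dataset`, the seed, and the toy rows are hypothetical, and the rows are assumed to be unique (category, doc) pairs, as the join on both fields implicitly requires.

```python
import random
from collections import Counter


def split_dataset(rows, test_fraction=0.2, seed=0):
    """Split rows into (train, test).

    Each row is kept for the test set with probability test_fraction,
    so the test set is only approximately 20% of the data.
    """
    rng = random.Random(seed)
    test = [row for row in rows if rng.random() < test_fraction]
    # Anti-join: keep only the rows that have no match in the test set.
    test_keys = set(test)
    train = [row for row in rows if row not in test_keys]
    return train, test


# Toy data standing in for the (category, doc) records.
rows = [("mp3", f"mp3 article {i}") for i in range(50)]
rows += [("camera", f"camera article {i}") for i in range(50)]

train, test = split_dataset(rows)

# Per-category counts, like test_ct and train_ct in the Pig script.
test_ct = Counter(category for category, _ in test)
train_ct = Counter(category for category, _ in train)
print("test :", dict(test_ct))
print("train:", dict(train_ct))
```

If the input contained duplicate (category, doc) rows, this sketch would drop every copy of a sampled row from the training set, so deduplicating first is the safer choice.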




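Training the Bayes model (independent of the Pig script above)
Put the files of each category into one folder; the folder name is the category name
My docs directory is /opt/digital. It contains directories such as mp3 and camera, and each of them holds individual articles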
./mahout seqdirectory -i /opt/digital -o /opt/digitalseq
Tokenize the sequence files and convert them into vector files, then split the vectors into a training set and a test set
./mahout seq2sparse -i /opt/digitalseq  -o /opt/digitalout2/vectors -lnorm -nv -wt tfidf
./mahout split -i /opt/digitalout2/vectors/tfidf-vectors --trainingOutput /opt/digitalout2/train --testOutput /opt/digitalout2/test --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
Train the Bayes model
./mahout trainnb  -i /opt/digitalout2/train -el  -o /opt/digitalout2/model -li /opt/digitalout2/labelindex -ow -c
Test the Bayes model
./mahout testnb  -i /opt/digitalout2/test -m /opt/digitalout2/model -l /opt/digitalout2/labelindex -ow -o /opt/digitalout2/testresult -c
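
To give a feel for what training and testing a Bayes text classifier involves, here is a toy multinomial naive Bayes in Python with add-one (Laplace) smoothing. It is a conceptual sketch only and not Mahout's implementation: it works on raw token counts rather than TF-IDF vectors, it is the plain variant rather than the complementary one that the -c flag selects, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Count documents per label and words per label.

    docs is a list of (label, tokens) pairs.
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify(model, tokens):
    """Return the label with the highest log posterior score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior, log P(H)
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen during training
            # Smoothed log likelihood, log P(X|H)
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical tokenized training documents.
docs = [
    ("camera", ["lens", "zoom", "photo"]),
    ("camera", ["photo", "sensor", "lens"]),
    ("mp3", ["audio", "playlist", "earphone"]),
    ("mp3", ["audio", "battery", "playlist"]),
]
model = train_nb(docs)
predicted = classify(model, ["lens", "photo"])
print(predicted)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and P(X) is dropped because it is the same for every label and does not change which label scores highest.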


Summary
-------------------------------------------------------
Correctly Classified Instances          :       3188   96.5768%
Incorrectly Classified Instances        :        113    3.4232%
Total Classified Instances              :       3301


=======================================================
Confusion Matrix
-------------------------------------------------------
a     b     c     d     e     <--Classified as
549   3     3     9     13    |  577   a     = MP3
0     503   0     14    2     |  519   b     = camera
5     1     560   13    30    |  609   c     = computer
0     1     0     611   0     |  612   d     = household
2     2     0     15    965   |  984   e     = mobile


=======================================================
Statistics
-------------------------------------------------------
Kappa                                       0.9496
Accuracy                                   96.5768%
Reliability                                80.3207%
Reliability (standard deviation)            0.3944
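
The summary figures can be re-derived from the confusion matrix itself. The sketch below copies the matrix from the output above and assumes that rows are the actual categories and columns are the predicted ones, which matches the row totals printed next to each label. It recomputes the accuracy and adds per-class recall and precision; it does not try to reproduce Mahout's Kappa or Reliability values, whose exact definitions are not given here.

```python
labels = ["MP3", "camera", "computer", "household", "mobile"]

# Rows: actual category. Columns: predicted category (assumed orientation).
matrix = [
    [549, 3, 3, 9, 13],
    [0, 503, 0, 14, 2],
    [5, 1, 560, 13, 30],
    [0, 1, 0, 611, 0],
    [2, 2, 0, 15, 965],
]

n = len(labels)
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(n))
accuracy = correct / total
print(f"Correct: {correct} of {total} ({accuracy:.4%})")

for i, label in enumerate(labels):
    actual = sum(matrix[i])                          # row total
    predicted = sum(matrix[r][i] for r in range(n))  # column total
    recall = matrix[i][i] / actual
    precision = matrix[i][i] / predicted
    print(f"{label:10s} recall={recall:.4f} precision={precision:.4f}")
```

The per-class view shows something the overall accuracy hides: most of the errors are documents from other categories being classified as household or mobile, so those two classes have high recall but lower precision.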


mahout --help
