This article records notes on the traditional machine learning Bayes algorithm, partly to deepen my own understanding and partly as reference and study material. If there are any errors in the article, feel free to point them out in the comments. Thank you!!!
Author: Sopher
Identification: Student
Description: He often debugs late into the night, because he loves this profession!
1. The Bayes algorithm is applied to solve two kinds of problems: forward probability and inverse probability (explained with black and white balls in a bag).
① Forward probability: knowing the numbers of black and white balls in the bag, what is the probability that a ball drawn from it is a given color?
② Inverse probability: without knowing the ratio of black to white balls in the bag, draw one (or several) balls at random; what can this observation tell us about the ratio of black to white balls in the bag?
2. Understanding: in a complex environment, infer the whole from local conditions; that is, convert a complex problem into a simple one to solve.
$$P(A|B)=\frac{P(B|A)\,P(A)}{P(B)}$$
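As a reminder of where the formula comes from: the joint probability $P(A,B)$ factors in two ways, and equating the two factorizations yields the Bayes formula:

$$P(A,B)=P(A|B)\,P(B)=P(B|A)\,P(A)\ \Rightarrow\ P(A|B)=\frac{P(B|A)\,P(A)}{P(B)}$$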
Problem: A school has $M$ students, of whom $60\%$ are science students and $40\%$ are literature students. All science students are required to study computer science, while $30\%$ of the literature students are. Question: among the students studying computer science, what proportion are literature students?
Solution:
① Science students studying computer science: $S\_computer = M \cdot P(S) \cdot P(computer|S)$
where $P(S)=60\%$ and $P(computer|S)=100\%$.
② Literature students studying computer science: $L\_computer = M \cdot P(L) \cdot P(computer|L)$
where $P(L)=40\%$ and $P(computer|L)=30\%$.
③ Total number studying computer science: $All\_computer = L\_computer + S\_computer$
④ Proportion of literature students among those studying computer science: $P(L|computer) = L\_computer / All\_computer$
⑤ Simplifying the formula in ④:
$$P(L|computer)=\frac{P(L)\,P(computer|L)}{P(S)\,P(computer|S)+P(L)\,P(computer|L)}$$
Further observations:
⑥ The proportion of literature students among those studying $computer$ is independent of the school's total enrollment $M$.
⑦ $P(computer) = P(S)\,P(computer|S)+P(L)\,P(computer|L)$, i.e., the overall proportion of students who study computer science.
⑧ $P(computer,L) = P(L)\,P(computer|L)$, i.e., the proportion of students who are literature students studying computer science.
Therefore, the formula in ⑤ can be rewritten as:
$$P(L|computer)=\frac{P(computer,L)}{P(computer)}$$
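Plugging the problem's numbers into ⑤ ($P(L)=0.4$, $P(computer|L)=0.3$, $P(S)=0.6$, $P(computer|S)=1.0$) gives the concrete answer, and note that $M$ never appears, as observed in ⑥:

$$P(L|computer)=\frac{0.4 \times 0.3}{0.6 \times 1.0 + 0.4 \times 0.3}=\frac{0.12}{0.72}\approx 16.7\%$$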
Problem: Sopher develops a Sopher lexicon for spelling correction. The lexicon contains the words $H=\{h_1,h_2,...,h_n\}$. A user, penny, types a misspelled word $D$ ($D$ is not in the Sopher lexicon). Using the words in the lexicon, we need to guess which word penny actually meant to type.
Analysis:
① Using the Bayes formula, the problem can be expressed in mathematical notation:
$$P(h|D)=\frac{P(h)\,P(D|h)}{P(D)}$$
where $P(h)$ is the prior probability of the word $h$ (how commonly it is used), and $P(D|h)$ is the probability of typing $D$ when $h$ was intended.
② Since $D$ is the word penny actually typed, $P(D)$ is the same constant for every guess $P(h|D)$, so it can be ignored when comparing candidates. The formula simplifies to:
$$P(h|D) \propto P(h) \cdot P(D|h)$$
③ Measuring $P(h|D)$ now, we find it depends on only two factors: the prior $P(h)$, i.e., how frequently the word $h$ is used, and the likelihood $P(D|h)$, i.e., how likely penny would mistype $h$ as $D$.
④ By comparing the size of $P(h|D)$ across different words $h$, the $h$ with the largest value is taken as the word penny meant to type (a minimal sketch of this decision rule is given below).
⑤ Method of model comparison
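As a minimal sketch of the decision rule in ④, assuming a hypothetical prior and likelihood (the `prior` values and `edit_likelihood` below are made-up stand-ins; the full corrector in section 2 estimates the prior from word counts in big.txt):

# Hypothetical prior P(h): assumed relative word frequencies
prior = {"the": 0.07, "than": 0.01, "tea": 0.002}

def edit_likelihood(D, h):
    # Toy likelihood P(D|h): prefer candidates whose length is close to D's
    return 1.0 / (1 + abs(len(D) - len(h)))

def correct_sketch(D, candidates):
    # Choose the h maximizing P(h) * P(D|h); P(D) is dropped as a common constant
    return max(candidates, key=lambda h: prior[h] * edit_likelihood(D, h))

print(correct_sketch("tha", ["the", "than", "tea"]))  # -> the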
Problem: The Sopher mail relay screens and forwards emails. Sopher receives an email $D$ consisting of $n$ words, i.e., $D = d_1,d_2,...,d_n$, and must judge whether it is spam. We write $h_+$ for a spam email (not forwarded) and $h_-$ for a normal email (forwarded).
Analysis:
① Again using the Bayes formula, the probability that the received email is spam is:
$$P(h_+|D)=\frac{P(h_+)\,P(D|h_+)}{P(D)}$$
where $P(h_+)$ is the prior probability that an email is spam (it can be estimated from the proportion of spam in a labeled mailbox), and $P(D|h_+)$ is the probability of observing the email's words given that it is spam.
② We cannot require an email to share all its words with known spam before judging it to be spam. Instead, if the email's words are sufficiently probable under the spam class, we call it spam. We therefore expand $P(d_1,d_2,...,d_n|h_+)$ by the chain rule:
$$P(d_1,d_2,...,d_n|h_+)=P(d_1|h_+) \cdot P(d_2|d_1,h_+) \cdot P(d_3|d_1,d_2,h_+) \cdots P(d_n|d_1,d_2,...,d_{n-1},h_+)$$
③ The conditionals in this expansion of $P(D|h_+)$ are hard to estimate, so we assume the words are mutually independent given the class. At this point the problem becomes Naive Bayes: features are assumed to be completely independent. We then only need to count the frequency of each word $d_i$ in spam $h_+$ (i.e., how often the words $d_1,d_2,...,d_n$ appear in spam), and the formula becomes:
$$P(D|h_+)=P(d_1|h_+) \cdot P(d_2|h_+) \cdot P(d_3|h_+) \cdots P(d_n|h_+)$$
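As a minimal sketch of this scoring rule: the per-class word frequencies and priors below are made-up stand-ins for counts from a labeled training mailbox. The comparison is done in log space so that the product of many small probabilities does not underflow:

import math

# Assumed (made-up) word frequencies P(d_i | class) and class priors
p_word_spam = {"free": 0.05, "winner": 0.03, "meeting": 0.001}
p_word_ham  = {"free": 0.005, "winner": 0.0005, "meeting": 0.02}
p_spam, p_ham = 0.3, 0.7
UNSEEN = 1e-6  # tiny probability for words never seen in a class (simple smoothing)

def log_score(words, p_word, p_class):
    # log P(class) + sum_i log P(d_i | class), per the Naive Bayes factorization above
    return math.log(p_class) + sum(math.log(p_word.get(w, UNSEEN)) for w in words)

def is_spam(words):
    return log_score(words, p_word_spam, p_spam) > log_score(words, p_word_ham, p_ham)

print(is_spam(["free", "winner"]))  # True: spam-typical words dominate
print(is_spam(["meeting"]))         # False: typical of normal mail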
1. Code description: a spelling checker built on the Bayes reasoning above; word priors $P(h)$ are estimated from word frequencies in big.txt, and candidates are generated within edit distance 1 or 2.
2. Code implementation
import re, collections
'''
function: spelling check machine
method: Bayes method
'''
def words(text):
    return re.findall('[a-z]+', text.lower())  # lowercase the text and extract words matching the regex [a-z]+
def train(features):
    model = collections.defaultdict(lambda: 1)  # default count 1: a word absent from the corpus still gets a small nonzero probability
    for feature in features:
        model[feature] += 1
    return model
numWords = train(words(open('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits(word):
    '''
    function: generate every candidate at edit distance 1 from the input word
    '''
    n = len(word)
    return set([word[0:i] + word[i+1:] for i in range(n)] +                          # delete one letter
               [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n - 1)] + # transpose adjacent letters
               [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet] +     # replace one letter with each of the 26 letters
               [word[0:i] + c + word[i:] for i in range(n + 1) for c in alphabet]     # insert each of the 26 letters
               )
def know_edits(word):
    '''
    function: candidates at edit distance 2 from the input word that actually occur in the corpus
    '''
    return set(e2 for e1 in edits(word) for e2 in edits(e1) if e2 in numWords)
def know(words):
    return set(w for w in words if w in numWords)  # keep only words that occur in the corpus
def correct(word):
    candidates = know([word]) or know(edits(word)) or know_edits(word) or [word]  # short-circuit: use the first non-empty candidate set
    return max(candidates, key=lambda w: numWords[w])  # pick the candidate with the highest corpus frequency
result1 = correct("tha")
print("Input word 'tha' corrected to:", result1)
result2 = correct("mrow")
print("Input word 'mrow' corrected to:", result2)
Test results:
Input word 'tha' corrected to: the
Input word 'mrow' corrected to: grow
3. Resources
① The big.txt file used above can be obtained from: https://github.com/dscape/spell
② Specifically: clone the project with Git or download it as a ZIP; the file is located at test->resources->big.txt.
"""
朴素贝叶斯算法进行文本分类
:return:None
"""
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
def naive_bayes():
    # Load the 20 Newsgroups data
    news = fetch_20newsgroups(subset="all")
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25)
    # print(x_train)
    # Extract TF-IDF features from the raw text
    tf = TfidfVectorizer()
    x_train = tf.fit_transform(x_train)  # fit the vocabulary on the training set only
    x_test = tf.transform(x_test)        # reuse the training vocabulary for the test set
    print(tf.get_feature_names_out())    # names of the extracted features (get_feature_names() in scikit-learn < 1.2)
    # Run the Naive Bayes algorithm
    mlt = MultinomialNB(alpha=1.0)       # alpha: Laplace smoothing, so unseen words do not zero out the product
    # print(x_train.toarray())           # caution: densifying the sparse matrix uses a lot of memory
    mlt.fit(x_train, y_train)
    # Predict on the test-set features
    y_predict = mlt.predict(x_test)
    print("Test-set predictions:", y_predict)
    print("Model accuracy:", mlt.score(x_test, y_test))
    # Precision and recall
    print("Per-class precision and recall:")
    print(classification_report(y_test, y_predict, target_names=news.target_names))
    return None
if __name__ == "__main__":
naveibayes()
QWQ, I will succeed until success!!!!