Demo 1: A Simple Bayesian Spelling Corrector

Principle

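argmax_c P(c|w) = argmax_c P(w|c) P(c) / P(w)

P(w) is the same for every candidate c, so it does not affect the argmax and can be ignored.

P(c): the probability that the correctly spelled word c appears in text, that is, how often c occurs in English writing.
P(w|c): the probability that the user types w when they meant to type c, that is, how likely c is to be mistyped as w.
argmax_c: enumerate all possible c and pick the one with the highest probability.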
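
To make the formula concrete, here is a minimal sketch of the argmax over three candidates. The word counts and the error-model values are invented for illustration, and the names `counts`, `p_w_given_c`, and `best_correction` are hypothetical; they are not part of the code in this post.

```python
from collections import Counter

# Toy corpus counts (made-up numbers, for illustration only).
counts = Counter({"the": 500, "then": 40, "they": 60})
total = sum(counts.values())


def p_c(c):
    """Language model P(c): relative frequency of c in the toy corpus."""
    return counts[c] / total


# Hypothetical error model P(w|c) for the typo w = "thew" (invented values).
p_w_given_c = {"the": 0.02, "then": 0.01, "they": 0.01}


def best_correction(candidates):
    """Return argmax_c P(w|c) * P(c); P(w) is identical for all c, so it is skipped."""
    return max(candidates, key=lambda c: p_w_given_c[c] * p_c(c))


print(best_correction(["the", "then", "they"]))  # the
```

With these numbers "the" wins because both its prior and its error-model value are the largest.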

Code

import re, collections
class WordCorrection():
    def __init__(self,text_path='data.txt'):
        self.alphabet = 'abcdefghijklmnopqrstuvwxyz'
        self.__word = self.preDeal(text_path)
        self.__WORDSF = self.countWordFrequency()

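    # Extract every word from the corpus, lowercased; any non-letter character acts as a separator
    def preDeal(self, text_path):
        with open(text_path) as f:  # close the file once it has been read
            text = f.read()
        return re.findall('[a-z]+', text.lower())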

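    # Count word frequencies (this is the language model)
    def countWordFrequency(self):
        # Every count starts at 1, so a word missing from the corpus still gets a small nonzero score
        model = collections.defaultdict(lambda: 1)
        for f in self.__word:
            model[f] += 1
        return model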

    # Keep only the candidates that appear in the corpus
    def known(self, words):
        return set(w for w in words if w in self.__WORDSF)

    # All strings at edit distance 1 from word
    def edits1(self, word):
        n = len(word)
        return set([word[0:i] + word[i + 1:] for i in range(n)] +  # delete one character
                   [word[0:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(n - 1)] +  # swap two adjacent characters
                   [word[0:i] + c + word[i + 1:] for i in range(n) for c in self.alphabet] +  # replace one character
                   [word[0:i] + c + word[i:] for i in range(n + 1) for c in self.alphabet])  # insert one character

    # Known words at edit distance 2 from word
    def known_edits2(self, word):
        return set(e2 for e1 in self.edits1(word) for e2 in self.edits1(e1) if e2 in self.__WORDSF)

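    # Return the most likely correction
    def correct(self, word):
        # Priority: the word itself if known, then known words at edit distance 1,
        # then known words at edit distance 2, and finally the unchanged input
        candidates = self.known([word]) or self.known(self.edits1(word)) or self.known_edits2(word) or [word]
        # .get avoids adding an unknown word to the defaultdict as a side effect of the lookup
        return max(candidates, key=lambda w: self.__WORDSF.get(w, 1))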
if __name__ == "__main__":
    obj = WordCorrection()
    for _ in range(3):
        w = input("Enter a word: ")
        print("Correction:", obj.correct(w))
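
As a quick sanity check on the edit-distance-1 step, the sketch below rebuilds the four edit lists as a standalone function and counts them before deduplication. The names `ALPHABET` and `edits1_list` are hypothetical helpers written for this check; the logic mirrors `edits1` above but returns a list rather than a set.

```python
import string

ALPHABET = string.ascii_lowercase


def edits1_list(word):
    """Return every string one edit away from word, duplicates included."""
    n = len(word)
    deletes = [word[:i] + word[i + 1:] for i in range(n)]
    transposes = [word[:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(n - 1)]
    replaces = [word[:i] + c + word[i + 1:] for i in range(n) for c in ALPHABET]
    inserts = [word[:i] + c + word[i:] for i in range(n + 1) for c in ALPHABET]
    return deletes + transposes + replaces + inserts


raw = edits1_list("teh")
# n deletions + (n - 1) transpositions + 26n replacements + 26(n + 1) insertions = 54n + 25
print(len(raw))          # 187 for n = 3
print("the" in raw)      # True: swapping "e" and "h" gives "the"
```

Because the list grows linearly with word length, applying the step twice for edit distance 2 is still cheap for ordinary words, which is why `known_edits2` can simply enumerate and filter.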

Test results

[Image 1: Demo 1, simple Bayesian spelling corrector]
