Machine Learning in Action reading notes: Naive Bayes

Core idea: ask the classifier to give a best guess of the class, and at the same time give an estimate of the probability of that guess.

We call it "naive" because the whole formalization makes only the most primitive, simplest assumptions.

Probability basics

By Bayes' theorem, p(c_i | x, y) = p(x, y | c_i) p(c_i) / p(x, y). Its meaning is: given a data point described by x and y, what is the probability that the point belongs to class c_i?

If p(c_1 | x, y) > p(c_2 | x, y), the point belongs to class 1, and vice versa.

Independence: if each feature needs N samples, then with 10 features we would need N^10 samples; if the features are mutually independent, the number of samples required drops from N^10 to 10 × N. Independence here means statistical independence, i.e. the probability that a feature (a word) appears has nothing to do with which words it is adjacent to. We know this assumption is not actually true, and that is exactly what "naive" refers to.

Naive Bayes assumptions

  • Features are independent of each other (the word bacon is as likely to appear after unhealthy as after delicious)
  • Every feature is equally important (to judge whether a comment is appropriate, all of its words are taken into account)

Although these two assumptions usually do not hold, naive Bayes simply makes them anyway.

Text classification with Python

Training the algorithm

Replace x, y in the formula above with w, where the bold w indicates a vector of words. Thanks to the naive Bayes independence assumption, p(w | c_i) can be computed as a product over the individual words, which greatly simplifies the computation.
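
Written out (a reconstruction of the formulas this note refers to, in the notation above), the factorization and the log form that the code below actually compares are:

    p(\mathbf{w} \mid c_i) = p(w_0 \mid c_i)\, p(w_1 \mid c_i) \cdots p(w_N \mid c_i)

    \log p(c_i \mid \mathbf{w}) \propto \log p(c_i) + \sum_{j} \log p(w_j \mid c_i)

Since p(w) is the same for every class, it can be dropped when comparing the classes.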

Count the number of documents in each class
For each training document:
    For each class:
        If a token appears in the document: increment the count for that token
        Increment the count of total tokens
For each class:
    For each token:
        Divide the count for that token by the total token count to get the conditional probability
Return the conditional probabilities for each class
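
For example, with the toy posting list used in the code below, the three abusive posts contain 19 tokens in total and the word stupid appears 3 times, so without smoothing p(stupid | c_1) = 3/19 ≈ 0.16; the implementation adds Laplace smoothing, giving (3 + 1)/(19 + 2) instead.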

Practice

  • Text classification

    Classify whether a comment is abusive

    import numpy as np
    
    
    def load_data_set():
        """
        Generate train data set and associated classify result
        :return: (train_data_set, classify_result)
        """
        posting_list = [
            ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
            ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
            ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
            ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
            ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
            ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
        class_vec = [0, 1, 0, 1, 0, 1]  # 1 means abusive, 0 means not abusive
        return posting_list, class_vec
    
    
    def create_vocab_list(data_set):
        """
        Get a set of words which appear in the train data set
        :param data_set: train_data_set
        :return: set of words
        """
        vocab_set = set()
        for doc in data_set:
            # union of 2 sets
            vocab_set = vocab_set | set(doc)
        return list(vocab_set)
    
    
    def word2vec_set(vocab_list, input_sentence):
        """
        Convert a sentence into a vector based on which vocabulary words appear in it (set-of-words model)
        :param vocab_list: All word appeared in train set
        :param input_sentence: Input sentence
        :return: The vector representative of the input sentence
        """
        ret_vector = [0] * len(vocab_list)
        for word in input_sentence:
            if word in vocab_list:
                ret_vector[vocab_list.index(word)] = 1
            else:
                print("the word {} is not in the vocabulary".format(word))
        return ret_vector
    
    
    def word2vec_bag(vocab_list, input_sentence):
        """
        Convert a sentence into a vector using the bag-of-words model, since a word may appear in a sentence more than once
        :param vocab_list: 
        :param input_sentence: 
        :return: 
        """
        ret_vector = [0] * len(vocab_list)
        for word in input_sentence:
            if word in vocab_list:
                ret_vector[vocab_list.index(word)] += 1
        return ret_vector
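    
    # Example (assumed vocabulary order): with vocab_list = ['my', 'dog', 'stupid'] and
    # input_sentence = ['stupid', 'stupid', 'dog'], word2vec_set returns [0, 1, 1]
    # while word2vec_bag returns [0, 1, 2].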
    
    
    def train_naive(train_matrix, train_category):
        """
        Get probabilities to calculate bayes classify result
        :param train_matrix: All sentence vector of train set
        :param train_category: The classify result of train set
        :return: p(w|c_0) p(w|c_1) p(c_1)
        """
        # number of documents (comments) in the training set
        doc_num = len(train_matrix)
        # number of words in the vocabulary
        word_num = len(train_matrix[0])
    
        # probability of abusive p(c_1)
        # Since this is a two-class problem, the probability of non-abusive is simply 1 - p_abuse
        p_abuse = sum(train_category) / float(doc_num)
    
        # p0_num = np.zeros(word_num)
        # p1_num = np.zeros(word_num)
        #
        # # p0_num/p0_denominator = p(w|c_0)
        # p0_denominator = 0.0
        # p1_denominator = 0.0
        # initialize every word count to 1 and the denominators to 2.0 (Laplace smoothing),
        # so that a word never seen in one class does not zero out the whole product
        p0_num = np.ones(word_num)
        p1_num = np.ones(word_num)
    
        p0_denominator = 2.0
        p1_denominator = 2.0
    
        for i in range(doc_num):
            # if this comment is abusive
            if train_category[i] == 1:
                p1_num += train_matrix[i]
                p1_denominator += sum(train_matrix[i])
            else:
                p0_num += train_matrix[i]
                p0_denominator += sum(train_matrix[i])
        # take the log to avoid underflow when many small probabilities are multiplied; pi_condition is log p(w|c_i)
        p1_condition = np.log(p1_num / p1_denominator)
        p0_condition = np.log(p0_num / p0_denominator)
        return p0_condition, p1_condition, p_abuse
    
    
    def classify_naive(test_vector, p0_condition, p1_condition, p_1):
        # because we already took np.log, the product
        # p(w|c_i) = p(w_0|c_i)p(w_1|c_i)p(w_2|c_i) ... becomes a sum of logs
        # Asterisk means element-wise multiply in numpy
        p1 = sum(test_vector * p1_condition) + np.log(p_1)
        p0 = sum(test_vector * p0_condition) + np.log(1 - p_1)
        if p1 > p0:
            return 1
        else:
            return 0
    
    
    def test_naive():
        post_list, class_list = load_data_set()
        vocab = create_vocab_list(post_list)
        train_matrix = []
        for post in post_list:
            train_matrix.append(word2vec_set(vocab, post))
        p0_condition, p1_condition, p_abusive = train_naive(train_matrix, class_list)
        test_entry = ["love", "my", "dalmation"]
        test_vector = word2vec_set(vocab, test_entry)
        print("The vector of input sentence is: ", test_vector)
        print("Classify result is: ", classify_naive(test_vector, p0_condition, p1_conditon, p_abusive))
    
    
    post_list, classes = load_data_set()
    print(post_list)
    vocab = create_vocab_list(post_list)
    print(word2vec_set(vocab, post_list[0]))
    print(vocab)
    
    train_matrix = []
    for post in post_list:
        train_matrix.append(word2vec_set(vocab, post))
    p_non_abusive_condition, p_abusive_condition, p_abusive = train_naive(train_matrix, classes)
    
    print(p_abusive)
    print(p_abusive_condition)
    
    max_index = p_abusive_condition.argmax()
    # the argmax of p_abusive_condition corresponds to 'stupid', which basically means that word
    # contributes the most to a comment being classified as abusive
    print(vocab[max_index])
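    
    # A minimal extra check, following the book's own test (assuming the objects defined above):
    # classify an obviously abusive entry; this should print 1
    test_entry = ['stupid', 'garbage']
    test_vector = word2vec_set(vocab, test_entry)
    print("Classify result is: ", classify_naive(test_vector, p_non_abusive_condition, p_abusive_condition, p_abusive))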
    
    
  • Filtering spam email

    import re
    import random
    import numpy as np
    
    
    def create_vocab_list(data_set):
        """
        Get a set of words which appear in the train data set
        :param data_set: train_data_set
        :return: set of words
        """
        vocab_set = set()
        for doc in data_set:
            # union of 2 sets
            vocab_set = vocab_set | set(doc)
        return list(vocab_set)
    
    
    def word2vec_bag(vocab_list, input_sentence):
        """
        Convert a sentence into a vector using the bag-of-words model, since a word may appear in a sentence more than once
        :param vocab_list:
        :param input_sentence:
        :return:
        """
        ret_vector = [0] * len(vocab_list)
        for word in input_sentence:
            if word in vocab_list:
                ret_vector[vocab_list.index(word)] += 1
        return ret_vector
    
    
    def train_naive(train_matrix, train_category):
        """
        Get probabilities to calculate bayes classify result
        :param train_matrix: All sentence vector of train set
        :param train_category: The classify result of train set
        :return: p(w|c_0) p(w|c_1) p(c_1)
        """
        # number of documents in the training set
        doc_num = len(train_matrix)
        # number of words in the vocabulary
        word_num = len(train_matrix[0])
    
        # probability of abusive p(c_1)
        # Since this is a two-class problem, the probability of the other class is simply 1 - p_abuse
        p_abuse = sum(train_category) / float(doc_num)
    
        # p0_num = np.zeros(word_num)
        # p1_num = np.zeros(word_num)
        #
        # # p0_num/p0_denominator = p(w|c_0)
        # p0_denominator = 0.0
        # p1_denominator = 0.0
        p0_num = np.ones(word_num)
        p1_num = np.ones(word_num)
    
        p0_denominator = 2.0
        p1_denominator = 2.0
    
        for i in range(doc_num):
            # if this comment is abusive
            if train_category[i] == 1:
                p1_num += train_matrix[i]
                p1_denominator += sum(train_matrix[i])
            else:
                p0_num += train_matrix[i]
                p0_denominator += sum(train_matrix[i])
        # take the log to avoid underflow when many small probabilities are multiplied; pi_condition is log p(w|c_i)
        p1_condition = np.log(p1_num / p1_denominator)
        p0_condition = np.log(p0_num / p0_denominator)
        return p0_condition, p1_condition, p_abuse
    
    
    def classify_naive(test_vector, p0_condition, p1_condition, p_1):
        # because we already took np.log, the product
        # p(w|c_i) = p(w_0|c_i)p(w_1|c_i)p(w_2|c_i) ... becomes a sum of logs
        # Asterisk means element-wise multiply in numpy
        p1 = sum(test_vector * p1_condition) + np.log(p_1)
        p0 = sum(test_vector * p0_condition) + np.log(1 - p_1)
        if p1 > p0:
            return 1
        else:
            return 0
    
    
    def parse_text(input_sentence):
        token_list = re.split(r'\W+', input_sentence)
        return [token.lower() for token in token_list if len(token) > 2]
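    
    # Example (assumed input) of what parse_text produces:
    #   parse_text("This book is the best Book on Python!")
    #   -> ['this', 'book', 'the', 'best', 'book', 'python']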
    
    
    def spam_test():
        # Import and parse files
        doc_list = []
        class_list = []
        for i in range(1, 26):
            try:
                words = parse_text(open("email/spam/{}.txt".format(i)).read())
            except:
                words = parse_text(open("email/spam/{}.txt".format(i), encoding='Windows 1252').read())
            doc_list.append(words)
            class_list.append(1)
    
            try:
                words = parse_text(open("email/ham/{}.txt".format(i)).read())
            except:
                words = parse_text(open("email/ham/{}.txt".format(i), encoding='Windows 1252').read())
            doc_list.append(words)
            class_list.append(0)
        vocab = create_vocab_list(doc_list)
    
        # Generate Training Set and Test Set
        test_set = [int(num) for num in random.sample(range(50), 10)]
        training_set = list(set(range(50)) - set(test_set))
    
        training_matrix = []
        training_class = []
        for doc_index in training_set:
            training_matrix.append(word2vec_bag(vocab, doc_list[doc_index]))
            training_class.append(class_list[doc_index])
        p0_conditon, p1_conditon, p_spam = train_naive(np.array(training_matrix), np.array(training_class))
    
        # Test the classify result
        err_count = 0
        for doc_index in test_set:
            test_vector = word2vec_bag(vocab, doc_list[doc_index])
            classify_result = classify_naive(test_vector, p0_conditon, p1_conditon, p_spam)
            if classify_result != class_list[doc_index]:
                err_count += 1
        print("The error rate is {}".format(err_count / len(test_set)))
    
    
    spam_test()
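    
    # The error rate depends on which 10 messages are randomly held out as the test set
    # (hold-out cross validation); repeating the experiment gives a better estimate of the
    # average error rate, e.g. (a rough sketch):
    # for _ in range(10):
    #     spam_test()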
    

Summary

  • Naive Bayes and the Bayes rule provide a way to estimate unknown values from known ones;
  • The conditional-independence assumption between features reduces the amount of data required; although the assumption is overly simple, naive Bayes is still an effective classifier
  • Implementing naive Bayes involves several practical issues, for example taking the natural logarithm to avoid underflow
