Natural Language Processing (NLP) Programming in Practice - 1.2 Sentiment Classification with Naive Bayes

Content index: https://blog.csdn.net/weixin_43093481/article/details/114989382?spm=1001.2014.3001.5501
Course notes: 1.2 Sentiment Analysis with Naïve Bayes
Code: https://github.com/Ogmx/Natural-Language-Processing-Specialization
————————————————————————————————————

Assignment 2: Naive Bayes

Learning objectives:
 Learn the principles of Naive Bayes and apply them to sentiment analysis of tweets: given a tweet, decide whether it carries positive or negative sentiment.

Specifically, you will learn to:

  • Train a Naive Bayes model for sentiment analysis
  • Test the model
  • Compute ratios of positive to negative word counts
  • Do error analysis
  • Predict on your own data

You may already be familiar with Naive Bayes and its underlying principles of conditional probability and independence.

  • In this assignment, you will use the ratio of the positive-sentiment probability to the negative-sentiment probability
  • This approach gives a simple and fast solution to binary classification problems

Import Python libraries

from utils import process_tweet, lookup
import pdb
from nltk.corpus import stopwords, twitter_samples
import numpy as np
import pandas as pd
import nltk
import string
from nltk.tokenize import TweetTokenizer
from os import getcwd

Download the data

nltk.download('stopwords')
nltk.download('twitter_samples')

Split the dataset

# get the sets of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# split the data into two pieces, one for training and one for testing (validation set)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

# avoid assumptions about the length of all_positive_tweets
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))

Part 1: Data Preprocessing

For any machine learning project, once you have the data, the first step is always to process it into the input format the model expects:

  • Remove noise: strip words that carry no sentiment information, such as common words like 'I', 'you', 'are', 'is', etc.
  • Also remove Twitter markup such as retweet markers, hyperlinks, and hashtags, since they carry no sentiment information either.
  • Punctuation can carry some sentiment, but for simplicity it is removed as well.
  • Finally, stem each word, e.g. "motivation", "motivated", and "motivate" are all reduced to the same stem "motiv-".

Preprocess the data with the process_tweet() function.

custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"

# print cleaned tweet
print(process_tweet(custom_tweet))

['hello', 'great', 'day', ':)', 'good', 'morn']
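process_tweet() is provided by the course's utils module and is not shown in the assignment. For reference, here is a minimal sketch of an equivalent function, assuming NLTK's TweetTokenizer, PorterStemmer, and English stopword list (the exact cleaning rules in utils may differ slightly):

import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet_sketch(tweet):
    '''Approximation of utils.process_tweet: clean, tokenize, and stem a tweet.'''
    tweet = re.sub(r'^RT[\s]+', '', tweet)         # remove retweet marker "RT"
    tweet = re.sub(r'https?://[^\s]+', '', tweet)  # remove hyperlinks
    tweet = re.sub(r'#', '', tweet)                # remove the hash sign, keep the word
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    stemmer = PorterStemmer()
    stop = set(stopwords.words('english'))
    return [stemmer.stem(tok) for tok in tokenizer.tokenize(tweet)
            if tok not in stop and tok not in string.punctuation]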

Part 1.1 Implement Helper Functions

To train the Naive Bayes model, first build a frequency dictionary whose keys are (word, label) pairs and whose values are the corresponding counts, where label is 1 for positive sentiment and 0 for negative sentiment.

Implement the lookup() helper function: given the freqs dictionary, a word, and a label (1 or 0), it returns the number of times that (word, label) pair occurs in the corpus.

For example, given the two tweets ["i am rather excited", "you are rather happy"] with label 1, the frequency dictionary is:

{
    ("rather", 1): 2,
    ("happi", 1): 1,
    ("excit", 1): 1
}

  • Every word in this corpus is assigned the same label, 1
  • Words like "i" and "am" are not stored, because they are stopwords removed during preprocessing
  • "rather" appears once in each of the two tweets, so its count is 2
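lookup() is imported from utils, so its body is not shown here. A minimal implementation consistent with how it is used in this assignment (returning 0 when the pair is absent) would be:

def lookup(freqs, word, label):
    '''Return how many times the (word, label) pair occurs in freqs, or 0 if absent.'''
    return freqs.get((word, label), 0)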

Implement count_tweets()

Implement the count_tweets() function: given a list of tweets, process them and return the frequency dictionary

  • Keys are (stem, label) pairs, e.g. ("happi", 1).
  • Values are the number of times the pair occurs in the corpus (an integer).
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def count_tweets(result, tweets, ys):
    '''
    Input:
        result: a dictionary that will be used to map each pair to its frequency
        tweets: a list of tweets
        ys: a list corresponding to the sentiment of each tweet (either 0 or 1)
    Output:
        result: a dictionary mapping each pair to its frequency
    '''

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            # define the key, which is the word and label tuple
            pair = (word,y)

            # if the key exists in the dictionary, increment the count
            if pair in result:
                result[pair] += 1

            # else, if the key is new, add it to the dictionary and set the count to 1
            else:
                result[pair] = 1
    ### END CODE HERE ###

    return result
# Testing your function
result = {}
tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
ys = [1, 0, 0, 0, 0]
count_tweets(result, tweets, ys)

{('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}

Part 2: Train the Naive Bayes Model

Naive Bayes is an algorithm that can be used for sentiment analysis; it trains and predicts in a short amount of time

How do you train a Naive Bayes classifier?

  • The first step of training a Naive Bayes classifier is to identify the classes.
  • Then compute a probability for each class:
    $P(D_{pos})$ is the probability that a document is positive.
    $P(D_{neg})$ is the probability that a document is negative.
    They are computed as follows:

$$P(D_{pos}) = \frac{D_{pos}}{D}\tag{1}$$

$$P(D_{neg}) = \frac{D_{neg}}{D}\tag{2}$$

where $D$ is the total number of documents (i.e. the total number of tweets), $D_{pos}$ is the number of positive tweets, and $D_{neg}$ is the number of negative tweets.

Prior and Logprior

The prior is the underlying probability that a tweet in the dataset is positive or negative. That is, knowing nothing else, if you draw a tweet at random from the dataset, how likely is it to be positive, and how likely to be negative? That is the prior.

The prior is the ratio of the probabilities $\frac{P(D_{pos})}{P(D_{neg})}$.
Taking its logarithm rescales it, giving the logprior:

$$\text{logprior} = \log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = \log \left( \frac{D_{pos}}{D_{neg}} \right)$$

Note that $\log(\frac{A}{B})$ equals $\log(A) - \log(B)$, so the logprior can also be written as a difference of two logs:

$$\text{logprior} = \log (P(D_{pos})) - \log (P(D_{neg})) = \log (D_{pos}) - \log (D_{neg})\tag{3}$$
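As a worked check with the balanced split used above ($D_{pos} = D_{neg} = 4000$):

$$\text{logprior} = \log(4000) - \log(4000) = 0$$

which matches the value printed after training below.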

Positive and Negative Probability of a Word

To compute a word's positive and negative probabilities, use the following quantities:

  • $freq_{pos}$ and $freq_{neg}$: the word's frequency in the positive and negative class; e.g. a word's positive frequency is the number of times it occurs with label 1
  • $N_{pos}$ and $N_{neg}$: the total number of positive words and negative words across all tweets in the dataset
  • $V$: the number of unique words in the dataset (duplicates not counted)

Compute a word's positive and negative probabilities with:
$$P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4}$$
$$P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5}$$

Note that the "+1" in the numerator implements additive (Laplace) smoothing; see the Wikipedia article on additive smoothing for details.

Log Likelihood

A word's log likelihood is computed as:

$$\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6}$$
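A toy numeric sketch of equations (4)-(6), with hypothetical counts chosen to show how the "+1" smoothing handles a word never seen in the negative class:

# hypothetical counts, for illustration only
freq_pos, freq_neg = 40, 0    # word seen 40 times in positive tweets, never in negative
N_pos, N_neg, V = 3000, 3000, 1000

p_w_pos = (freq_pos + 1) / (N_pos + V)   # (40+1)/4000 = 0.01025
p_w_neg = (freq_neg + 1) / (N_neg + V)   # (0+1)/4000  = 0.00025, nonzero thanks to smoothing
print(np.log(p_w_pos / p_w_neg))         # log(41) ≈ 3.71, a strongly positive word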

Build the freqs dictionary
  • Using the count_tweets() function, build the freqs dictionary containing all the counts.
  • Keys of the freqs dictionary are (word, label) pairs
  • Values are the number of times each pair occurs

This dictionary will be used several times below.

# Build the freqs dictionary for later uses

freqs = count_tweets({}, train_x, train_y)
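As a quick sanity check (a sketch; the exact counts depend on the preprocessing, so no output is shown here), you can probe the dictionary:

# probe the freqs dictionary built above
print(len(freqs))                  # number of distinct (word, label) pairs
print(lookup(freqs, 'happi', 1))   # how often 'happi' occurs in positive tweets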

Train the model

Given the frequency dictionary, train_x (tweet texts), and train_y (their labels), implement the Naive Bayes classifier:

Compute $V$
  • Count the number of unique words in the freqs dictionary to get $V$ (the set function helps).
Compute $freq_{pos}$ and $freq_{neg}$
  • Using the freqs dictionary, compute each word's positive frequency $freq_{pos}$ and negative frequency $freq_{neg}$.
Compute $N_{pos}$ and $N_{neg}$
  • Using the freqs dictionary, compute the total number of positive words $N_{pos}$ and of negative words $N_{neg}$.
Compute $D$, $D_{pos}$, $D_{neg}$
  • Using train_y, compute the total number of tweets $D$, of positive tweets $D_{pos}$, and of negative tweets $D_{neg}$.
  • Compute the probability that a tweet is positive, $P(D_{pos})$, and that it is negative, $P(D_{neg})$.
Compute the logprior
  • The logprior is $\log(D_{pos}) - \log(D_{neg})$
Compute the loglikelihood
  • Finally, iterate over every word in the vocabulary; use the lookup function to get each word's positive frequency $freq_{pos}$ and negative frequency $freq_{neg}$.
  • Compute each word's positive probability $P(W_{pos})$ and negative probability $P(W_{neg})$ using:

$$P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4}$$
$$P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5}$$

Note: store each word's loglikelihood in a dictionary, with the word as key and its loglikelihood as value.

  • Then compute the loglikelihood: $\log \left( \frac{P(W_{pos})}{P(W_{neg})} \right)$.
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def train_naive_bayes(freqs, train_x, train_y):
    '''
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        train_x: a list of tweets
        train_y: a list of labels corresponding to the tweets (0,1)
    Output:
        logprior: the log prior. (equation 3 above)
        loglikelihood: the log likelihood of your Naive Bayes equation. (equation 6 above)
    '''
    loglikelihood = {}
    logprior = 0

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # calculate V, the number of unique words in the vocabulary
    vocab = set([pair[0] for pair in freqs.keys()])
    V = len(vocab)

    # calculate N_pos and N_neg
    N_pos = N_neg = 0
    for pair in freqs.keys():
        # if the label is positive (greater than zero)
        if pair[1] > 0:

            # Increment the number of positive words by the count for this (word, label) pair
            N_pos += freqs[pair]

        # else, the label is negative
        else:

            # increment the number of negative words by the count for this (word,label) pair
            N_neg += freqs[pair]

    # Calculate D, the number of documents
    D = len(train_y)

    # Calculate D_pos, the number of positive documents (*hint: use sum())
    D_pos = sum(train_y==1)

    # Calculate D_neg, the number of negative documents (*hint: compute using D and D_pos)
    D_neg = D - D_pos

    # Calculate logprior
    logprior = np.log(D_pos) - np.log(D_neg)

    # For each word in the vocabulary...
    for word in vocab:
        # get the positive and negative frequency of the word
        freq_pos = freqs.get((word,1),0)
        freq_neg = freqs.get((word,0),0)

        # calculate the probability that each word is positive, and negative
        p_w_pos = (freq_pos+1) / (N_pos+V)
        p_w_neg = (freq_neg+1) / (N_neg+V)

        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)

    ### END CODE HERE ###

    return logprior, loglikelihood

# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
print(logprior)
print(len(loglikelihood))

0.0
9089

Part 3: Test the Model

Now that we have the logprior and loglikelihood, we can check the model by predicting on a few tweets

Implement naive_bayes_predict

Approach
Implement the naive_bayes_predict function to score a tweet.

  • The function takes tweet, logprior, and loglikelihood as input.
  • It returns a score indicating whether the tweet's sentiment is positive or negative.
  • For each tweet, sum the loglikelihoods of its words.
  • Then add the logprior to obtain the tweet's sentiment score:

$$p = \text{logprior} + \sum_i^N (\text{loglikelihood}_i)$$

Note

The prior is computed from the training data, which here is a balanced dataset (4000 positive and 4000 negative tweets). The ratio of positive to negative documents is therefore 1, and the logprior is 0.

In this assignment the logprior is 0, but for an unbalanced dataset it would not be, so do not forget to add it.
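For example, if a training set had 3000 positive and only 1000 negative tweets (hypothetical numbers), the logprior would shift every prediction toward the positive class:

logprior = np.log(3000) - np.log(1000)   # = log(3) ≈ 1.10
print(logprior)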

# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def naive_bayes_predict(tweet, logprior, loglikelihood):
    '''
    Input:
        tweet: a string
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        p: the sum of all the loglikelihoods of each word in the tweet (if found in the dictionary) + logprior (a number)

    '''
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # process the tweet to get a list of words
    word_l = process_tweet(tweet)

    # initialize probability to zero
    p = 0

    # add the logprior
    p += logprior

    for word in word_l:

        # check if the word exists in the loglikelihood dictionary
        if word in loglikelihood:
            # add the log likelihood of that word to the probability
            p += loglikelihood[word]

    ### END CODE HERE ###

    return p

# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

# Experiment with your own tweet.
my_tweet = 'She smiled.'
p = naive_bayes_predict(my_tweet, logprior, loglikelihood)
print('The expected output is', p)

The expected output is 1.5740278623499175

Implement test_naive_bayes

Approach

  • Implement test_naive_bayes to measure the accuracy of the predictions.
  • The function takes test_x, test_y, logprior, and loglikelihood as input
  • It returns the model's accuracy.
  • Use the naive_bayes_predict function to predict each tweet in test_x.
# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    """
    Input:
        test_x: A list of tweets
        test_y: the corresponding labels for the list of tweets
        logprior: the logprior
        loglikelihood: a dictionary with the loglikelihoods for each word
    Output:
        accuracy: (# of tweets classified correctly)/(total # of tweets)
    """
    accuracy = 0  # return this properly

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    y_hats = []
    for tweet in test_x:
        # if the prediction is > 0
        if naive_bayes_predict(tweet, logprior, loglikelihood) > 0:
            # the predicted class is 1
            y_hat_i = 1
        else:
            # otherwise the predicted class is 0
            y_hat_i = 0

        # append the predicted class to the list y_hats
        y_hats.append(y_hat_i)

    # error is the average of the absolute values of the differences between y_hats and test_y
    error = sum(y_hats != test_y) / len(test_y)

    # Accuracy is 1 minus the error
    accuracy = 1 - error

    ### END CODE HERE ###

    return accuracy

print("Naive Bayes accuracy = %0.4f" %
      (test_naive_bayes(test_x, test_y, logprior, loglikelihood)))

Naive Bayes accuracy = 0.9940

# UNQ_C7 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

# Run this cell to test your function
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    # print( '%s -> %f' % (tweet, naive_bayes_predict(tweet, logprior, loglikelihood)))
    p = naive_bayes_predict(tweet, logprior, loglikelihood)
#     print(f'{tweet} -> {p:.2f} ({p_category})')
    print(f'{tweet} -> {p:.2f}')

I am happy -> 2.15
I am bad -> -1.29
this movie should have been great. -> 2.14
great -> 2.14
great great -> 4.28
great great great -> 6.41
great great great great -> 8.55

# Feel free to check the sentiment of your own tweet below
my_tweet = 'you are bad :('
naive_bayes_predict(my_tweet, logprior, loglikelihood)

-8.801622640492191

Part 4: Filter Words by Ratio of Positive to Negative Counts

  • Some words have higher positive counts and can be considered more "positive"; likewise, some words can be considered more "negative"
  • One way to gauge how positive or negative a word is, without computing the loglikelihood, is to compare its positive and negative frequencies
    • The loglikelihood could of course be used for the same comparison
  • You can compute the ratio of a word's positive to negative frequency.
  • Given this ratio, you can then select words by how high or low it is

Implement get_ratio()

  • Given the freqs dictionary and a word, use lookup(freqs, word, 1) to get the word's positive count
  • Similarly, use lookup() to get the word's negative count
  • Compute the ratio of positive to negative counts:

$$ratio = \frac{\text{pos\_words} + 1}{\text{neg\_words} + 1}$$

where pos_words and neg_words are the word's frequencies in the respective classes, for example:

Word    Positive word count    Negative word count
glad    41                     2
arriv   57                     4
:(      1                      3663
:-(     0                      378
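Using the counts in the table above: ratio('glad') = (41+1)/(2+1) = 14.0, and ratio(':(') = (1+1)/(3663+1) ≈ 0.000546, matching the outputs further below.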
# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_ratio(freqs, word):
    '''
    Input:
        freqs: dictionary containing the words
        word: string to lookup

    Output: a dictionary with keys 'positive', 'negative', and 'ratio'.
        Example: {'positive': 10, 'negative': 20, 'ratio': 0.5}
    '''
    pos_neg_ratio = {'positive': 0, 'negative': 0, 'ratio': 0.0}
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # use lookup() to find positive counts for the word (denoted by the integer 1)
    pos_neg_ratio['positive'] = lookup(freqs,word,1)

    # use lookup() to find negative counts for the word (denoted by integer 0)
    pos_neg_ratio['negative'] = lookup(freqs,word,0)

    # calculate the ratio of positive to negative counts for the word
    pos_neg_ratio['ratio'] = (pos_neg_ratio['positive']+1) / (pos_neg_ratio['negative']+1)
    ### END CODE HERE ###
    return pos_neg_ratio

get_ratio(freqs, 'happi')['ratio']

8.526315789473685

Implement get_words_by_threshold(freqs, label, threshold)

  • When label is 1, select words whose positive/negative ratio is at or above the threshold
  • When label is 0, select words whose positive/negative ratio is at or below the threshold
  • Use the get_ratio() function to get a dictionary holding the positive count, negative count, and their ratio
  • Collect these into a dictionary keyed by word, where each value is the pos_neg_ratio dictionary returned by get_ratio()
    For example, the structure looks like:
{'happi':
    {'positive': 10, 'negative': 20, 'ratio': 0.5}
}
# peek at the raw counts stored in freqs (output truncated)
for key in freqs.keys():
    word, label = key
    print(freqs[(word, label)])

23
30
7
14
27
72
2847
60
7
2
5
80

# UNQ_C9 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_words_by_threshold(freqs, label, threshold):
    '''
    Input:
        freqs: dictionary of words
        label: 1 for positive, 0 for negative
        threshold: ratio that will be used as the cutoff for including a word in the returned dictionary
    Output:
        word_set: dictionary containing the word and information on its positive count, negative count, and ratio of positive to negative counts.
        example of a key value pair:
        {'happi':
            {'positive': 10, 'negative': 20, 'ratio': 0.5}
        }
    '''
    word_list = {}

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    for key in freqs.keys():
        word, _ = key

        # get the positive/negative ratio for a word
        pos_neg_ratio = get_ratio(freqs, word)

        # if the label is 1 and the ratio is greater than or equal to the threshold...
        if label == 1 and pos_neg_ratio['ratio'] >= threshold :

            # Add the pos_neg_ratio to the dictionary
            word_list[word] = pos_neg_ratio

        # If the label is 0 and the pos_neg_ratio is less than or equal to the threshold...
        elif label == 0 and pos_neg_ratio['ratio'] <= threshold:

            # Add the pos_neg_ratio to the dictionary
            word_list[word] = pos_neg_ratio

        # otherwise, do not include this word in the list (do nothing)

    ### END CODE HERE ###
    return word_list

# Test your function: find negative words at or below a threshold
get_words_by_threshold(freqs, label=0, threshold=0.05)

{':(': {'positive': 1, 'negative': 3663, 'ratio': 0.0005458515283842794},
':-(': {'positive': 0, 'negative': 378, 'ratio': 0.002638522427440633},
'zayniscomingbackonjuli': {'positive': 0, 'negative': 19, 'ratio': 0.05},
'26': {'positive': 0, 'negative': 20, 'ratio': 0.047619047619047616},
'': {'positive': 0, 'negative': 43, 'ratio': 0.022727272727272728},
'lost': {'positive': 0, 'negative': 19, 'ratio': 0.05},
'♛': {'positive': 0, 'negative': 210, 'ratio': 0.004739336492890996},
'》': {'positive': 0, 'negative': 210, 'ratio': 0.004739336492890996},
'beli̇ev': {'positive': 0, 'negative': 35, 'ratio': 0.027777777777777776},
'wi̇ll': {'positive': 0, 'negative': 35, 'ratio': 0.027777777777777776},
'justi̇n': {'positive': 0, 'negative': 35, 'ratio': 0.027777777777777776},
'see': {'positive': 0, 'negative': 35, 'ratio': 0.027777777777777776},
'me': {'positive': 0, 'negative': 35, 'ratio': 0.027777777777777776}}

# Test your function; find positive words at or above a threshold
get_words_by_threshold(freqs, label=1, threshold=10)

{'followfriday': {'positive': 23, 'negative': 0, 'ratio': 24.0},
'commun': {'positive': 27, 'negative': 1, 'ratio': 14.0},
'': {'positive': 2847, 'negative': 2, 'ratio': 949.3333333333334},
'flipkartfashionfriday': {'positive': 16, 'negative': 0, 'ratio': 17.0},
'': {'positive': 498, 'negative': 0, 'ratio': 499.0},
':p': {'positive': 103, 'negative': 0, 'ratio': 104.0},
'influenc': {'positive': 16, 'negative': 0, 'ratio': 17.0},
'': {'positive': 543, 'negative': 0, 'ratio': 544.0},
"here'": {'positive': 20, 'negative': 0, 'ratio': 21.0},
'youth': {'positive': 14, 'negative': 0, 'ratio': 15.0},
'bam': {'positive': 44, 'negative': 0, 'ratio': 45.0},
'warsaw': {'positive': 44, 'negative': 0, 'ratio': 45.0},
'shout': {'positive': 11, 'negative': 0, 'ratio': 12.0},
'': {'positive': 22, 'negative': 0, 'ratio': 23.0},
'stat': {'positive': 51, 'negative': 0, 'ratio': 52.0},
'arriv': {'positive': 57, 'negative': 4, 'ratio': 11.6},
'via': {'positive': 60, 'negative': 1, 'ratio': 30.5},
'glad': {'positive': 41, 'negative': 2, 'ratio': 14.0},
'blog': {'positive': 27, 'negative': 0, 'ratio': 28.0},
'fav': {'positive': 11, 'negative': 0, 'ratio': 12.0},
'fback': {'positive': 26, 'negative': 0, 'ratio': 27.0},
'pleasur': {'positive': 10, 'negative': 0, 'ratio': 11.0}}

The positive/negative ratio reflects a word's sentiment polarity: the emoticons and the word 'me' above lean toward negative sentiment, while 'glad', 'commun', and 'arriv' tend to appear in positive tweets.

Part 5: Error Analysis

In this part, look at tweets that the model misclassified. Why might it get them wrong? What assumptions does the Naive Bayes model make?

# Some error analysis done for you
print('Truth Predicted Tweet')
for x, y in zip(test_x, test_y):
    y_hat = naive_bayes_predict(x, logprior, loglikelihood)
    if y != (np.sign(y_hat) > 0):
        print('%d\t%0.2f\t%s' % (y, np.sign(y_hat) > 0, ' '.join(
            process_tweet(x)).encode('ascii', 'ignore')))

Truth Predicted Tweet
1   0.00  b’’
1   0.00  b’truli later move know queen bee upward bound movingonup’
1   0.00  b’new report talk burn calori cold work harder warm feel better weather :p’
1   0.00  b’harri niall 94 harri born ik stupid wanna chang ’
1   0.00  b’’
1   0.00  b’’
1   0.00  b’park get sunlight’
1   0.00  b’uff itna miss karhi thi ap :p’
0   1.00  b’hello info possibl interest jonatha close join beti great’
0   1.00  b’u prob fun david’
0   1.00  b’pat jay’
0   1.00  b’whatev stil l young ’

Part 6: Predict on Your Own Data

Enter your own tweet and predict its sentiment.

# Test with your own tweet - feel free to modify `my_tweet`
my_tweet = 'I am happy because I am learning :)'

p = naive_bayes_predict(my_tweet, logprior, loglikelihood)
print(p)

9.574768961173339
