感觉这本书很适合我这种菜菜,另外下面的笔记还有学习英语的目的,故大多数用英文摘录或总结
昨天刚在《机器学习实战》那本书里面看了以spam filtering为例子贯穿讲解的ML的landscape,今天就来跟着这个chapter实践一下
Spam filtering exemplifies a widely spread family of tasks —— text classification.
exemplify :to be or give a typical example of something 作为…的典范(或范例、典型、榜样等)
作者给了很多非常贴近生活的例子 ~ 讲了 classification的usefulness,概括一下
We refer to the name of each class as a class label
How do we perform classification ?
其实大一最早的时候做的经典的老掉牙if/else或者那个switch,输入学生成绩,给出成绩,就是simple classification
这里作者也给出了很simple的程序
For example, you can make the machine print out a warning that water is hot based on a simple threshold of 45°C (113°F), as listing 2.1 suggests. In this code, you define a function print_warning, which takes water temperature as input and prints out water status. The if statement checks if input temperature is above a predefined threshold(阈值) and prints out a warning message if it is. In this case, since the temperature is above 45°C, the code prints out Caution: Hot water!
# Simple code to tell whether water is cold or hot
def print_warning(temperature):
if temperature>=45:
print ("Caution: Hot water!")
else:
print ("You may use water as usual")
print_warning(46)
感觉写的不错,作者把这个例子过渡到了ML 和 监督学习
When there are multiple factors to consider in classification, it is better to let the machine learn the rules and patterns from data rather than hardcoding(写死) them. Machine learning involves providing a machine with examples and a general outline of a task, and allowing it to learn to solve the task independently. In supervised machine learning, the machine is given labeled data and learns to classify based on the provided features.
Supervised machine learning
Supervised machine learning refers to a family of machine-learning tasks in which the algorithm learns the correspondences between an input and an output based on the provided labeled examples. Classification is an example of a supervised machine learning task, where the algorithm tries to learn the mapping between the input data and the output class label.
Using the cold or hot water example, we can provide the machine with the samples of water labeled hot and samples of water labeled cold, tell it to use temperature as the predictive factor (feature), and this way let it learn independently from the provided data that the boundary between the two classes is around 45°C (113°F)
作者给出了scenario (a description of possible actions or events in the future可能发生的事态;设想)
Consider the following scenario: you have a collection of spam and normal emails from the past. You are tasked with building a spam filter, which for any future incoming email can predict whether this email is spam or not. Consider these questions:
In this section, we will discuss this scenario and look into the implementation steps. In total, the pipeline for this task will consist of five steps.
The pipeline
“normal” emails are sometimes called “ham” in the spam-detection context
First, you need to ask yourself what format the email messages are delivered in for this task. For instance, in a real-life situation, you might need to extract the messages from the mail agent application. However, for simplicity, let’s assume that someone has extracted the emails for you and stored them in text format. The normal emails are stored in a separate folder—let’s call it Ham, and spam emails are stored in a Spam folder.
If someone has already predefined past spam and ham emails for you (e.g., by extracting these emails from the INBOX and SPAM box), you don’t need to bother with labeling them. However, you still need to point the machine-learning algorithm at the two folders by clearly defining which one is ham and which one is spam. This way, you will define the class labels and identify the number of classes for the algorithm. This should be the first step in your spam-detection pipeline (and in any text-classification pipeline), after which you can preprocess the data, extract the relevant information, and then train and test your algorithm (figure 2.5).
You can set step 1 of your algorithm as follows: Define which data represents “ham” class and which data represents “spam” class for the machine-learning algorithm.
For a machine, the text comes in as a sequence of symbols, so the machine does not have an idea of what a word is.
For example, how will you split the following sentence into words? “Define which data represents each class for the machine learning algorithm.”
The first solution might be “Words are sequences of characters separated by whitespaces.”
simple code to split text string into words by whitespaces
text = "Define which data represents each class for the machine learning algorithm"
text.split(" ")
['Define',
'which',
'data',
'represents',
'each',
'class',
'for',
'the',
'machine',
'learning',
'algorithm']
So far(到目前为止), so good. However, what happens to this strategy when we have punctuation marks? For example: “Define which data represents “ham” class and which data represents “spam” class for the machine learning algorithm.”
['Define',
'which',
'data',
'represents',
'“ham”',
'class',
'and',
'which',
'data',
'represents',
'“spam”',
'class',
'for',
'the',
'machine',
'learning',
'algorithm.']
In the list of words, are [“ham”], [“spam”], and [algorithm.] any different from [ham], [spam], and [algorithm]? That is, the same words but without the punctuation marks attached to them? The answer is, these words are exactly the same, but because you are only splitting by whitespaces at the moment, there is no way of taking the punctuation marks into account.
However, each sentence will likely include one full stop (.), question (?), or exclamation mark (!) attached to the last word, and possibly more punctuation marks inside the sentence itself, so this is going to be a problem for properly extracting words from text. Ideally, you would like to be able to extract words and punctuation marks separately.
orz,感觉这图做得很好
Code to split text string into words by whitespaces and punctuation
delimiter 分隔符,定界符
text = "Define which data represents “ham” class and which data represents “spam” class for the machine learning algorithm."
delimiters = ['"', "."]
words = []
current_word = ""
for char in text:
if char==" ":
if not current_word=="":
words.append(current_word)
current_word = ""
elif char in delimiters:
if current_word=="": #current_word是空的说明前面一个是whitespace,只把标点加入list
words.append(char)
else:
words.append(current_word)
words.append(char)
current_word = ""
else:
current_word += char
print(words)
['Define', 'which', 'data', 'represents', '“ham”', 'class', 'and', 'which', 'data', 'represents', '“spam”', 'class', 'for', 'the', 'machine', 'learning', 'algorithm', '.']
这个时候完美了吗? 没有, 有很多时候我们用的缩写比如e.g.
,i.e.
是不需要拆的 ,但是用刚刚的算法会拆成[‘i’,‘.’,‘e’,'.]
This is problematic, since if the algorithm splits these examples in this way, it will lose track of the correct interpretation of words like i.e. or U.S.A., which should be treated as one word token rather than a combination of characters. How can this be achieved?
This is where the NLP tools come in handy: the tool that helps you split the running string of characters into meaningful words is called tokenizer, and it takes care of the cases like the ones we’ve just discussed—that is, it can recognize that ham. needs to be split into [‘ham’, ‘.’] while U.S.A. needs to be kept as one word [‘U.S.A.’]. Normally, tokenizers rely on extensive and carefully designed lists of regular expressions, and some are trained using machine-learning approaches.
这里作者介绍了NLTK
For an example of a regular expressions-based tokenizer, you can check the Natural Language Processing Toolkit’s regexp_tokenize() to get a general idea of the types of the rules that tokenizers take into account: see Section 3.7 on www.nltk.org/book/ch03.html. The lists of rules applied may differ from one tokenizer to another.
tokenizer还可以给我们解决哪些常见的problem呢? 缩写(contraction)
看几个例子
如何解决这个呢,正则表达式是一个神奇的东西
import nltk
import re
text1 = "What’s the best way to cook a pizza?"
text2 = "We’re going to use a baking stone."
text3 = "I haven’t used a baking stone before."
texts = [text1, text2, text3]
tokens = []
for text in texts:
# Replace certain characters with a space
text = re.sub(r'[’]', '\'', text)
print(text)
# text = re.sub(r'[^\w\s\.\?\']', ' ', text)
# print(text)
# Tokenize the text
tokens.append(nltk.word_tokenize(text))
for token in tokens:
print(token)
# print(tokens)
What's the best way to cook a pizza?
We're going to use a baking stone.
I haven't used a baking stone before.
['What', "'s", 'the', 'best', 'way', 'to', 'cook', 'a', 'pizza', '?']
['We', "'re", 'going', 'to', 'use', 'a', 'baking', 'stone', '.']
['I', 'have', "n't", 'used', 'a', 'baking', 'stone', 'before', '.']
作者举了个例子
Suppose two emails use a different format:
拆开的词语里面 lottery 和 Lottery 形式上确实不同,但是意思上是一样的,都是博彩
我们需要get rid of such formatting issues
The problem is that small formatting differences, such as the use of uppercase versus lowercase letters, can result in different word lists being generated. To address this issue, the passage suggests normalizing the extracted words by converting them to lowercase. This can help ensure that the meaning of the words is not affected by formatting differences and that the extracted words are more indicative(标示的) of the spam-related content of the emails.
At this point, you will end up with two sets of data—one linked to the spam class and another one linked to the ham class. Each data is preprocessed in the same way in steps 2 and 3, and the features are extracted.
Next, you need to let the machine use this data to build the connection between the set of features (properties) that describe each type of email (spam or ham) and the labels attached to each type. In step 4, a machine-learning algorithm tries to build a statistical model, a function, that helps it distinguish between the two classes. This is what happens during the learning (training) phase(阶段).
A refresher visualizing the training and test processes.
So, step 4 of the algorithm should be defined as follows: define a machine-learning model and train it on the data with the features predefined in the previous steps
Suppose that the algorithm has learned to map the features of each class of emails to the spam and ham labels, and has determined which features are more indicative of spam or ham. The final step in this process is to make sure the algorithm is doing such predictions well.
How will you do that? 划分数据集
The data used for training and testing should be chosen randomly and should not overlap(交叠) in order to avoid biasing the evaluation. The training set is typically larger, with a typical proportion being 80% for training and 20% for testing. The classifier should not be allowed to access the test set during the training phase, and it should only be used for evaluation at the final step.
shuffle 洗牌
Data splits for supervised machine learning
In supervised machine learning, the algorithm is trained on a subset of the labeled data called training set. It uses this subset to learn the function mapping the input data to the output labels. Test set is the subset of the data, disjointed(分离) from the training set, on which the algorithm can then be evaluated. The typical data split is 80% for training and 20% for testing. Note that it is important that the two sets are not overlapping. If your algorithm is trained and tested on the same data, you won’t be able to tell what it actually learned rather than memorized.
Suppose you trained your classifier in step 4 and then applied it to the test data. How will you measure the performance?
One approach would be to check what proportion of the test emails the algorithm classifies correctly—that is, assigns the spam label to a spam email and classifies ham emails as ham. This proportion is called accuracy, and its calculation is pretty straightforward:
A c c u r a c y = n u m ( correct predictions ) n u m ( all test instances ) Accuracy =\frac{{ num }(\text { correct predictions })}{n u m(\text { all test instances })} Accuracy=num( all test instances )num( correct predictions )
这里给了个练习2.4
Let’s discuss the solutions to this exercise (note that you can also find more detailed explanations at the end of the chapter). The prediction of the classifier based on the distribution of classes that you came across in this exercise is called baseline. In an equal class distribution case, the baseline is 50%, and if your classifier yields an accuracy of 60%, it outperforms this baseline. In the case of 60:40 split, the baseline, which can also be called the majority class baseline, is 60%. This means that if a dummy “classifier” does no learning at all and simply predicts the ham label for all emails, it will not filter out any spam emails from the inbox, but its accuracy will also be 60%—just like your classifier that is actually trained and performs some classification! This makes the classifier in the second case in this exercise much less useful because it does not outperform the majority class baseline.
书后给的解答
- An accuracy of 60% doesn’t seem very high, but how exactly can you interpret it? Note that the distribution of classes helps you to put the performance of your classifier in context because it tells you how challenging the problem itself is. For example, with the 50–50% split, there is no majority class in the data and the classifier’s random guess will be at 50%, so the classifier’s accuracy is higher than this random guess. In the second case, however, the classifier performs on a par with the majority class guesser: the 60% to 40% distribution of classes suggests that if some dummy “classifier” always selected the majority class, it would get 60% of the cases correctly—just like the classifier you trained.
- The single accuracy value of 60% does not tell you anything about the performance of the classifier on each class, so it is a bit hard to interpret. However, if you look into each class separately, you can tell that the classifier is better at classifying ham emails (it got 2/3 of those right) than at classifying spam emails (only 1/2 are correct).
In summary, accuracy is a good overall measure of performance, but you need to keep in mind
我用的是codespace,我想到三种解压方式,一个是vscode也许有extension,另一个是terminal里面使用unzip,还有就是用python来解压,我打算用后两种都尝试一下
我直接把这本书的GitHub仓库ekochmar/Getting-Started-with-NLP: This repository accompanies the book “Getting Started with Natural Language Processing” (github.com)在codespace打开,然后建了一个叫learningThrough的文件夹,观察一下目标文件位置,我怕里面的文件直接冲出来于是搞了个subfolder叫test
然后
unzip enron1.zip -d ./learningThrough/test
unzip enron2.zip -d learningThrough/test
这里的路径前面写
./
或者前面不加东西
用python来试试
但是遇到了Bad pipe message
难道是因为我指定的那个try2zip一开始不存在这个directory?
那我再来一次,直接解压到current directory 下面吧
import zipfile
# Specify the zip file name
zip_file = '../enron1.zip'
# Create a ZipFile object
zip_obj = zipfile.ZipFile(zip_file, 'r')
# Extract all the contents of zip file to the destination folder
zip_obj.extractall()
# close the Zip File
zip_obj.close()
import zipfile
# Specify the zip file name
zip_file = '../enron2.zip'
# Create a ZipFile object
zip_obj = zipfile.ZipFile(zip_file, 'r')
# Extract all the contents of zip file to the destination folder
zip_obj.extractall()
# close the Zip File
zip_obj.close()
好耶
Let’s define a function read_in
that will take a folder as an input, read the files in this folder, and store their contents as a Python list data structure.
不得不说这本书给的注释是真的详细…orz
import os
# helps with different text encoding
import codecs
def read_in(folder):
# Using os functionality, list all the files in the specified folder
the files in the specified folder.
files = os.listdir(folder)
a_list = []
# Iterate through the files in the folder
for a_file in files:
# Skip hidden files.
if not a_file.startswith("."):
# Read the contents of each file.
f = codecs.open(folder + a_file, "r",encoding="ISO-8859-1",errors="ignore")
# Add the content of each file to the list data structure.
a_list.append(f.read())
# Close the fileafter you readthe contents.
f.close()
return a_list
In this code, you rely on Python’s os module functionality to list all the files in the specified folder, and then you iterate through them, skipping hidden files (以.
开头的那些文件,such files can be easily identified because their names start with “.” ) that are sometimes automatically created by the operating systems.
Next, you read the contents of each file.
The encoding and errors arguments of codecs.open function will help you avoid errors in reading files that are related to text encoding.
codecs是啥捏
The
codecs
module in Python provides functions to encode and decode data using various codecs (encoding/decoding algorithms).Here are some common codecs that you can use with the
codecs
module:
utf-8
: A Unicode encoding that can handle any character in the Unicode standard. It’s the most widely used encoding for the Web.ascii
: An encoding for the ASCII character set, which consists of 128 characters.latin-1
: An encoding for the Latin-1 character set, which consists of 256 characters.utf-16
: A Unicode encoding that uses two bytes (16 bits) to represent each character.
You add the content of each file to a list data structure, and in the end, you return the list that contains the contents of the files from the specified folder.
现在我们可以定义spam_list和ham_list—letting the machine know what data represents spam emails and what data represents ham emails.
Let’s check if the data is uploaded correctly; for example, you can print out the lengths of the lists or check any particular member of the list.
Summary.txt里面列出来了,那我们就看看条数对不对
the length of the spam_list should equal the number of spam emails in the enron1/spam/ folder, which should be 1,500, while the length of the ham_list should equal the number of emails in the enron1/ham/, or 3,672. If you get these numbers, your data is uploaded and read in correctly.
代码如下
spam_list = read_in("./enron1/spam/")
print(len(spam_list))
# print(spam_list[0])
ham_list = read_in("./enron1/ham/")
print(len(ham_list))
print(ham_list[0])
条数是对的,好耶
Next, we need to preprocess the data (e.g., by splitting text strings into words) and extract the features. Wouldn’t it be easier if you could run all preprocessing steps over a single data structure rather than over two separate lists?
这里不采用for循环,为了加上标签,我们把每个邮件和其标签组成一个元组,然后再放入一个新的list all_emails 里面
verify一下是否读进去了,输出太长,用切片切一下
此外,我们还要考虑到划分数据集的随机性
We need to split the data randomly into the training and test sets. To that end(因为那个缘故), let’s shuffle the resulting list of emails with their labels, and make sure that the shuffle is reproducible by fixing the way in which the data is shuffled. For the shuffle to be reproducible, you need to define the seed for the random operator, which makes sure that all future runs will shuffle the data in exactly the same way.
# Python’s random module will help you shuffle the data randomly.
import random
# Use list comprehensions to create the all_emails list that will keep all emails with their labels.
all_emails = [(email_content, "spam") for email_content in spam_list]
all_emails += [(email_content, "ham") for email_content in ham_list]
# Select the seed of the random operator to make sure that all future runs will shuffle the data in the same way
random.seed(42)
random.shuffle(all_emails)
# Check the size of the dataset (lengthof the list); it should be equal to 1,500 + 3,672
# This kind of string is called formatted string literals or f-strings, and it is a new feature introduced in Python 3.6
print (f"Dataset size = {str(len(all_emails))} emails")
Dataset size = 5172 emails
Remember that the email contents you’ve read in so far each come as a single string of symbols. The first step of text preprocessing involves splitting the running text into words.
我们还是用NLTK
One of the benefits of this toolkit is that it comes with a thorough documentation and description of its functionality.
(该工具包的好处是,它具有详尽的文档)
NLTK :: nltk.tokenize package
It takes running text as input and returns a list of words based on a number of customized regular expressions, which help to delimit the text by whitespaces and punctuation marks, keeping common words like U.S.A. unsplit.
code
This code defines a tokenize function that takes a string as input and splits it into words.
The for-loop within this function appends each identified word from the tokenized string to the output word list; alternatively, you can use list comprehensions(列表生成式) for the same purpose. Finally, given the input, the function prints out a list of words. You can test your intuitions about the words and check your answers to previous exercises by changing the input to any string of your choice
import nltk
from nltk import word_tokenize
nltk.download('punkt')
def tokenize(input):
word_list = []
for word in word_tokenize(input):
word_list.append(word)
return word_list
input = "What's the best way to split a sentence into words?"
print(tokenize(input))
In addition to the toolkit itself, you need to install NLTK data as explained on www.nltk.org/data.html. Running nltk.download() will install all the data needed for text processing in one go; in addition, individual tools can be installed separately (e.g., nltk.download(‘punkt’) installs NLTK’s sentence tokenizer)
['What', "'s", 'the', 'best', 'way', 'to', 'split', 'a', 'sentence', 'into', 'words', '?']
如果我用列表生成式(list comprehension),那就这样写
import nltk
from nltk import word_tokenize
def tokenize(input):
word_list = [word for word in word_tokenize(input)]
return word_list
input = "What's the best way to split a sentence into words?"
print(tokenize(input))
这里作者非常推崇list comprehensions, 确实,我得提高使用它的意识
Once the words are extracted from running text, you need to convert them into features. In particular, you need to put all words into lowercase to make your algorithm establish the connection between different formats like Lottery and lottery.
Putting all strings to lowercase can be achieved with Python’s string functionality. To extract the features (words) from the text, you need to iterate through the recognized words and put all words to lowercase. In fact, both tokenization and converting text to lowercase can be achieved using a single line of code with list comprehensions.
word_list = [word for word in word_tokenize(text.lower())]
Using list comprehensions, you can combine the two steps—tokenization and converting strings to lowercase—in one line. Here, you first normalize and then tokenize text, but the two steps are interchangeable.
We define a function get_features that extracts the features from the text of email passed in as input. Next, for each word in the email, you switch on the “flag” that the word is contained in this email by assigning it with a True value. The list data structure all_features keeps tuples containing the dictionary of features matched with the spam or ham label for each email.
import nltk
from nltk import word_tokenize
def get_features(text):
features = {}
word_list = [word for word in word_tokenize(text.lower())]
for word in word_list:
# For each word in the email, switch on the “flag” that this word is contained in the email.
features[word] = True
return features
# all_features will keep tuples containing the dictionary of features matched with the label for each email
all_features = [(get_features(email), label) for (email, label) in all_emails]
# Check what features are extracted from an input text
print(get_features("Participate In Our New Lottery NOW!"))
print(len(all_features))
# Check what all_features list data structure contains.
print(len(all_features[0][0]))
print(len(all_features[99][0]))
In the end, the code shows how you can check what features are extracted from an input text and what all_features list data structure contains (e.g., by printing out its length and the number of features detected in the first or any other email in the set)
{'participate': True, 'in': True, 'our': True, 'new': True, 'lottery': True, 'now': True, '!': True}
5172
321
97
随机shuffle确实是有效的,我再来一次得到的是
5172
72
157
其实就是创建了一个dictionary
With this bit of code, you iterate over the emails in your collection (all_emails) and store the features extracted from each email matched with the label. For example, if a spam email consists of a single sentence (e.g., “Participate In Our New Lottery NOW!”), your algorithm will first extract the list of features present in this email and assign a True value to each of them. The dictionary of features will be represented using this format: [‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True]. Then the algorithm will add this data structure to all_features together with the spam label: ([‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True], “spam”)
刚刚那个代码创建了一个list叫all_features, a list of tuples. where each tuple represents an individual email.
So the total length of all_features is equal to the number of emails in your dataset.
As each tuple in this list corresponds to an individual email, you can access each one by the index in the list using all_features[index]: for example, you can access the first email in the dataset as all_features[0] (remember, that Python’s indexing starts with 0), and the hundredth as all_features[99] (for the same reason)
Let’s now clarify what each tuple structure representing an email contains. Tuples pair up two information fields. In this case, a dictionary of features extracted from the email and its label—that is, each tuple in all_features contains a pair (dict_of_ features, label). So if you’d like to access the first email in the list, you call on all_ features[0]; to access its features, you use all_features[0][0]
; and to access its label, you use all_features[0][1]
.
又是一张好图!
For example, if the first email in your collection is a spam email with the content “Participate in our lottery now!”, all_features[0] will return the tuple ([‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True], “spam”); all_features[0][0]
will return the dictionary [‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True], and all_features[0][1]
will return the value spam
Imagine your whole dataset contained only one spam email (“Participate In Our New Lottery NOW!”) and one ham email (“Participate in the Staff Survey”). What features will be extracted from this dataset with the code from listing 2.8?(就是上面那个get_feature的代码)
这里作者介绍了用朴素贝叶斯 Naive Bayes来处理
Next, let’s apply machine learning and teach the machine to distinguish between the features that describe each of the two classes.
There are several classification algorithms that you can use, and you will come across several of them in this book. But since you are at the beginning of your journey, let’s start with one of the most interpretable(可解释的) ones—an algorithm called Naïve Bayes. Don’t be misled by the word Naïve in its title, though. Despite the relative simplicity of the approach compared to other ones, this algorithm often works well in practice and sets a competitive performance baseline that is hard to beat with more sophisticated(复杂的) approaches. For the spam-filtering algorithm that you are building in this chapter, you will rely on the Naïve Bayes implementation provided with the NLTK library, so don’t worry if some details of the algorithm seem challenging to you. However, if you would like to see what is happening “under the hood,” this section will walk you through the details of the algorithm.
Naïve Bayes is a probabilistic classifier, which means that it makes the class prediction based on the estimate of which outcome is most likely (i.e., it assesses the probability of an email being spam and compares it with the probability of it being ham), and then selects the outcome that is most probable between the two. In fact, this is quite similar to how humans assess whether an email is spam or ham. When you receive an email that says “Participate in our lottery now! Click on this link,” before clicking on the (potentially harmful) link, you assess how likely it is (i.e., what is the probability?) that it is a ham email and compare it to how likely it is that this email is spam. Based on your experience and all the previous spam and ham emails you have seen before, you might judge that it is much more likely (more probable) that it is a spam email. By the time the machine makes a prediction, it has also accumulated some experience in distinguishing spam from ham that is based on processing a dataset of labeled spam and ham emails.
In this step, the machine will try to predict whether the email content represents spam or ham. In other words, it will try to predict whether the email is spam or ham given or conditioned on its content. This type of probability, when the outcome (class of spam or ham) depends on the condition (words used as features), is called conditional probability(条件概率). For spam detection, you estimate P(spam | email content)
and P(ham | email content)
, or generally P(outcome | (given) condition)
. Then you compare one estimate to another and return the most probable class. For example:
If P(spam | content) = 0.58 and P(ham | content) = 0.42, predict spam
If P(spam | content) = 0.37 and P(ham | content) = 0.63, predict ham
Reminder on the notation(符号):
P
is used to represent all probabilities;|
is used in conditional probabilities, when you are trying to estimate the probability of some event (that is specified before|
) given that the condition (that is specified after|
) applies.
In summary, this boils down to(归结为) the following set of actions illustrated in the figure below.
我们人往往根据经验有个判断,那么机器呢
Similarly, a machine can estimate the probability that an email is spam or ham conditioned on its content, taking the number of times it has seen this content leading to a particular outcome:
P ( s p a m ∣ “Participate in our lottery now!” ) = n u m ( spam emails with “Participate in our lottery now!”) n u m ( all emails with “Participate in our lottery now!”) P ( h a m ∣ “Participate in our lottery now!” ) = n u m ( ham emails with “Participate in our lottery now!”) n u m ( all emails with “Participate in our lottery now!”) P( spam \mid \text{“Participate in our lottery now!”} )=\frac{n u m(\text { spam emails with “Participate in our lottery now!”) }}{n u m(\text { all emails with “Participate in our lottery now!”) }}\\ P( ham \mid \text{“Participate in our lottery now!”} )=\frac{n u m(\text { ham emails with “Participate in our lottery now!”) }}{n u m(\text { all emails with “Participate in our lottery now!”) }} P(spam∣“Participate in our lottery now!”)=num( all emails with “Participate in our lottery now!”) num( spam emails with “Participate in our lottery now!”) P(ham∣“Participate in our lottery now!”)=num( all emails with “Participate in our lottery now!”) num( ham emails with “Participate in our lottery now!”)
In the general form, this can be expressed as
P ( o u t c o m e ∣ c o n d i t i o n ) = numOfTimes ( condition led to outcome ) numOfTimes ( condition applied ) P( outcome \mid condition )=\frac{\text { numOfTimes }(\text { condition led to outcome })}{\text { numOfTimes }(\text { condition applied })} P(outcome∣condition)= numOfTimes ( condition applied ) numOfTimes ( condition led to outcome )
此时问题出现了
At the moment, you are trying to predict a single outcome (class of spam or ham) given a single condition that is the whole text of the email; for example, “Participate in our lottery now!” In the previous step, you converted this single text into a set of features as [‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True]. Note that the conditional probabilities like P(spam| “Participate in our lottery now!”) and P(spam| [‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True]) are the same because this set of features encodes the text. Therefore, if the chances of seeing “Participate in our lottery now!” are low, the chances of seeing the set of features [‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True] encoding this text are equally low.
Is there a way to split this set to get at more fine-grained(细粒度), individual probabilities, such as to establish a link between [‘lottery’: True] and the class of spam?
Unfortunately, there is no way to split the conditional probability estimation like P(outcome | conditions) when there are multiple conditions specified; however, it is possible to split the probability estimation like P(outcomes | condition) when there is a single condition and multiple outcomes. In spam detection, the class is a single value (it is spam or ham), while features are a set ([‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True]). If you can flip(浏览) around the single value of class and the set of features in such a way that the class becomes the new condition and the features become the new outcomes, you can split the probability into smaller components and establish the link between individual features like [‘lottery’: True] and class values like spam.
下面要学习公式推导咯
Luckily, there is a way to flip the outcomes (class) and conditions (features extracted from the content) around! Let’s look into the estimation of conditional probabilities again: you estimate the probability that the email is spam given that its content is “Participate in our new lottery now!” based on how often in the past an email with such content was spam. For that, you take the proportion of the times you have seen “Participate in our new lottery now!” in a spam email among the emails with this content. You can express it as
P ( s p a m ∣ “Participate in our lottery now!” ) = P (“Participate in our lottery now!” is in a spam email ) P (“Participate in our lottery now!” is used in an email) P( spam \mid \text{“Participate in our lottery now!”} )=\\\frac{\mathrm{P} \text { (“Participate in our lottery now!” is in a spam email })}{P \text { (“Participate in our lottery now!” is used in an email) }} P(spam∣“Participate in our lottery now!”)=P (“Participate in our lottery now!” is used in an email) P (“Participate in our lottery now!” is in a spam email )
Let’s call this Formula 1. What is the conditional probability of the content “Participate in our new lottery now!” given class spam, then? Similarly the probabilities earlier, you need the proportion of times you have seen “Participate in our new lottery now!” in a spam email among all spam emails. You can express it as
P ( “Participate in our lottery now!” ∣ s p a m ) = P (“Participate in our lottery now!" is in a spam email) P ( an email is spam) P( \text{“Participate in our lottery now!”} \mid spam )= \frac{\mathrm{P} \text { (“Participate in our lottery now!" is in a spam email) }}{P(\text { an email is spam) }} P(“Participate in our lottery now!”∣spam)=P( an email is spam) P (“Participate in our lottery now!" is in a spam email)
Let’s call this Formula 2. Every time you use conditional probabilities, you need to divide how likely it is that you see the condition and outcome together by how likely it is that you see the condition on its own—this is the bit(约束)after |
. Now you can see that both Formulas 1 and 2 rely on how often you see particular content in an email from a particular class. They share this bit, so you can use it to connect the two formulas. For instance, from Formula 2 you know that
P (“Participate in our lottery now!" is in a spam email) = P ( “Participate in our lottery now!” ∣ s p a m ) × P ( an email is spam) {\mathrm{P} \text { (“Participate in our lottery now!" is in a spam email) }}= \\P( \text{“Participate in our lottery now!”} \mid spam )\times{P(\text { an email is spam) }} P (“Participate in our lottery now!" is in a spam email) =P(“Participate in our lottery now!”∣spam)×P( an email is spam)
Now you can fit this into Formula 1:
P ( s p a m ∣ “Participate in our lottery now!” ) = P (“Participate in our lottery now!” is in a spam email ) P (“Participate in our lottery now!” is used in an email) = P ( “Participate in our lottery now!” ∣ s p a m ) × P ( an email is spam) P (“Participate in our lottery now!” is used in an email) P( spam \mid \text{“Participate in our lottery now!”} )=\\\frac{\mathrm{P} \text { (“Participate in our lottery now!” is in a spam email })}{P \text { (“Participate in our lottery now!” is used in an email) }}=\\\\\frac{P( \text{“Participate in our lottery now!”} \mid spam )\times{P(\text { an email is spam) }}}{P \text { (“Participate in our lottery now!” is used in an email) }} P(spam∣“Participate in our lottery now!”)=P (“Participate in our lottery now!” is used in an email) P (“Participate in our lottery now!” is in a spam email )=P (“Participate in our lottery now!” is used in an email) P(“Participate in our lottery now!”∣spam)×P( an email is spam)
图解如下
总结一下
Now you can replace the conditional probability of P(class | content)
with P(content | class)
; for example, whereas before you had to calculate P("spam" | “Participate in our new lottery now!”)
or equally P("spam" | ['participate': True, 'in': True, ..., 'now': True, '!': True])
, which is hard to do because you will often end up with too few examples of exactly the same email content or exactly the same combination of features, now you can estimate P(['participate': True, 'in': True, ..., 'now': True, '!': True] | "spam")
instead. But how does this solve the problem? Aren’t you still dealing with a long sequence of features?
这时候朴素贝叶斯又派上用场辣
Here is where the “naïve” assumption in Naïve Bayes helps: it assumes that the features are independent of each other, or that your chances of seeing a word lottery in an email are independent of seeing a word new or any other word in this email before. Therefore, you can estimate the probability of the whole sequence of features given a class as a product of probabilities of each feature given this class:
P ( [ ’participate’: True, ’in’: True, … , ’now’: True, ’!’: True ] "spam" ) = P ( ’participate’: True ∣ "spam") × P ( ’in’: True ∣ "spam" ) × … × P ( ’!" : True ∣ "spam") \begin{array}{c}\\P([\text { 'participate': True, 'in': True, } \ldots, \text { 'now': True, '!': True }] \text { "spam" })= \\\\P(\text { 'participate': True } \mid \text { "spam") } \times P(\text { 'in': True } \mid \text { "spam" }) \times \ldots \times P(\text { '!" : True } \mid \text { "spam") }\\\end{array}\\\\ P([ ’participate’: True, ’in’: True, …, ’now’: True, ’!’: True ] "spam" )=P( ’participate’: True ∣ "spam") ×P( ’in’: True ∣ "spam" )×…×P( ’!" : True ∣ "spam")
If you express ['participate': True]
as the first feature in the feature list, or f 1 , [ ′ i n ′ : T r u e ] f_{1}, ['in': True] f1,[′in′:True] as$ f_{2}$, and so on, until f n = [ ′ ! ′ : T r u e ] f_{n}= ['!': True] fn=[′!′:True], you can use the general formula
P ( [ f 1 , f 2 , … , f n ] ∣ class ) = P ( f 1 ∣ class ) × P ( f 2 ∣ class ) × … × P ( f n ∣ class ) \\P\left(\left[f_{1}, f_{2}, \ldots, f_{n}\right] \mid \text { class }\right)=P\left(f_{1} \mid \text { class }\right) \times P\left(f_{2} \mid \text { class }\right) \times \ldots \times P\left(f_{n} \mid \text { class }\right) P([f1,f2,…,fn]∣ class )=P(f1∣ class )×P(f2∣ class )×…×P(fn∣ class )
Now that you have broken down the probability of the whole feature set given class into the probabilities for each word given that class, how do you actually estimate them? Since for each email you note which words occur in it, the total number of times you can switch on the flag ['feature': True]
equals the total number of emails in that class, while the actual number of times you switch on this flag is the number of emails where this feature is actually present. The conditional probability P(feature | class)
is simply the proportion:
KaTeX parse error: Expected 'EOF', got '_' at position 108: …}{\text { total_̲num(emails in c…
Now you have all the components in place. Let’s iterate through the classification steps again.
First, during the training phase, the algorithm learns prior class probabilities. This is simply class distribution—for example, P(ham)=0.71
and P(spam)=0.29
.
Secondly, the algorithm learns probabilities for each feature given each of the classes. This is the proportion of emails with each feature in each class—for example, P('meeting':True | ham) = 0.50
. During the test phase, or when the algorithm is applied to a new email and is asked to predict its class, the following comparison from the beginning of this section is applied
{ P ( spam ∣ content ) ≥ P ( ham ∣ content ) ⇒ spam otherwise ⇒ ham \left\{\begin{array}{ll}P(\text { spam } \mid \text { content }) \geq P(\text { ham } \mid \text { content }) & \Rightarrow \text { spam } \\ \text { otherwise } & \Rightarrow \text { ham }\end{array}\right. {P( spam ∣ content )≥P( ham ∣ content ) otherwise ⇒ spam ⇒ ham
This is what we started with originally, but we said that the conditions are flipped, so it becomes
{ P ( content ∣ spam ) × P ( spam ) P ( content ) ≥ P ( content ∣ ham ) × P ( ham ) P ( content ) ⇒ spam otherwise ⇒ ham \left\{\begin{array}{ll}\frac{P(\text { content } \mid \text { spam }) \times P(\text { spam })}{P(\text { content })} \geq \frac{P(\text { content } \mid \text { ham }) \times P(\text { ham })}{P(\text { content })} & \Rightarrow \text { spam } \\ \text { otherwise } & \Rightarrow \text { ham }\end{array}\right. {P( content )P( content ∣ spam )×P( spam )≥P( content )P( content ∣ ham )×P( ham ) otherwise ⇒ spam ⇒ ham
Note that we end up with P(content)
in the denominator(分母) on both sides of the expression, so the absolute value of this probability doesn’t matter, and it can be removed from the expression altogether. As an aside(顺便说一句), since the probability always has a positive value, it won’t change the comparative values on the two sides; for example, if you were comparing 10 to 4, you would get 10 > 4 whether you divide the two sides by the same positive number like (10/2) > (4/2) or not. So we can simplify the expression as
{ P ( content ∣ spam ) × P ( spam ) ≥ P ( content ∣ ham ) × P ( ham ) ⇒ spam otherwise ⇒ ham \left\{\begin{array}{ll}P(\text { content } \mid \text { spam }) \times P(\text { spam }) \geq P(\text { content } \mid \text { ham }) \times P(\text { ham }) & \Rightarrow \text { spam } \\ \text { otherwise } & \Rightarrow \text { ham }\end{array}\right. {P( content ∣ spam )×P( spam )≥P( content ∣ ham )×P( ham ) otherwise ⇒ spam ⇒ ham
P(spam)
and P(ham)
are class probabilities estimated during training, and P(content | class)
, using naïve independence assumption, are products(乘积) of probabilities, so
{ P ( [ f 1 , f 2 , … , f n ] ∣ spam ) × P ( spam ) ≥ P ( [ f 1 , f 2 , … , f n ] ∣ h a m ) × P ( ham ) ⇒ spam otherwise ⇒ ham \left\{\begin{array}{ll}\\P\left(\left[f_{1}, f_{2}, \ldots, f_{n}\right] \mid \text { spam }\right) \times P(\text { spam }) \geq P\left(\left[f_{1}, f_{2}, \ldots, f_{n}\right] \mid h a m\right) \times P(\text { ham }) & \Rightarrow \text { spam } \\\\\text { otherwise } & \Rightarrow \text { ham }\\\end{array}\right.\\ ⎩ ⎨ ⎧P([f1,f2,…,fn]∣ spam )×P( spam )≥P([f1,f2,…,fn]∣ham)×P( ham ) otherwise ⇒ spam ⇒ ham
is split into the individual feature probabilities as
{ P ( f 1 ∣ spam ) × … × P ( f n ∣ spam ) × P ( spam ) ≥ P ( f 1 ∣ ham ) × … × P ( f n ∣ ham ) × P ( ham ) ⇒ spam otherwise ⇒ ham \left\{\begin{array}{ll}\\P\left(f_{1} \mid \text { spam }\right) \times \ldots \times P\left(f_{n} \mid \text { spam }\right) \times P(\text { spam }) \geq P\left(f_{1} \mid \text { ham }\right) \times \ldots \times P\left(f_{n} \mid \text { ham }\right) \times P(\text { ham }) & \Rightarrow \text { spam } \\\\\text { otherwise } & \Rightarrow \text { ham }\\\end{array}\right. ⎩ ⎨ ⎧P(f1∣ spam )×…×P(fn∣ spam )×P( spam )≥P(f1∣ ham )×…×P(fn∣ ham )×P( ham ) otherwise ⇒ spam ⇒ ham
图解,又是好图~
from nltk import NaiveBayesClassifier, classify
def train(features, propotion):
train_size = int(len(features) * propotion)
#initialize the training and testing sets
train_set, test_set = features[:train_size], features[train_size:]
print(f"Training set size: {str(len(train_set))} emails")
print(f"Test set size: {str(len(test_set))} emails")
#train the classifier
classifier = NaiveBayesClassifier.train(train_set)
return train_set, test_set, classifier
#Apply the train function using 80% (or a similar proportion) of emails for training
train_set, test_set, classifier = train(all_features, 0.8)
Suppose you have five spam emails and ten ham emails. What are the conditional probabilities for P('prescription':True | spam), P('meeting':True | ham), P('stock':True | spam)
, and P('stock':True | ham)
, if
Finally, let’s evaluate how well the classifier performs in detecting whether an email is spam or ham. For that, let’s use the accuracy score returned by the NLTK’s classifier. In addition, the NLTK’s classifier allows you to inspect the most informative features (words). For that, you need to specify the number of the top most informative features to look into (e.g., 50 here).
code
def evaluate(train_set, test_set,classifier):
# chech how the classfiier performs on the training and test sets
print(f"Accuracy on the training set: {str(classify.accuracy(classifier, train_set))}")
print(f"Accuracy of the test set: {str(classify.accuracy(classifier, test_set))}")
# check which words are most informative for the classifier
classifier.show_most_informative_features(50)
evaluate(train_set, test_set, classifier)
the most informative features—that is, the list of words that are most strongly associated with a particular class. This is functionality of the classifier that is implemented in NLTK, so all you need to do is call on this function as classifier.show_most_informative_features
and specify the number of words n
that you want to see as an argument. This function then returns the top n words ordered by their “informativeness” or predictive power. Behind the scenes, the function measures “informativeness” as the highest value of the difference in probabilities between P(feature | spam)
and P(feature | ham)
— that is, max[P(word: True | ham) / P(word: True | spam)]
for most predictive ham features, and max[P(word: True | spam) / P(word: True | ham)]
for most predictive spam features (check out NLTK’s documentation for more information: https://www.nltk.org/api/nltk.classify.naivebayes.html). The output shows that such words (features) as prescription(处方), pain, health, and so on are much more strongly associated with spam emails. The ratios on the right show the comparative probabilities for the two classes: for instance, P("prescription" | spam)
is 130.9 times higher than P("prescription" | ham)
. On the other hand, nomination is more strongly associated with ham emails. As you can see, many spam emails in this dataset are related to medications, which shows a particular bias; the most typical spam that you personally get might be on a different topic altogether.
One other piece of information presented in this output is accuracy. Test accuracy shows the proportion of test emails that are correctly classified by Naïve Bayes among all test emails. The preceding(前面的) code measures the accuracy on both the training data and test data. Note that since the classifier is trained on the training data, it actually gets to “see” all the correct labels for the training examples. Shouldn’t it then know the correct answers and perform at 100% accuracy on the training data? Well, the point here is that the classifier doesn’t just retrieve(挽回) the correct answers; during training, it has built some probabilistic model (i.e., learned about the distribution of classes and the probability of different features), and then it applies this model to the data. So, it is actually very likely that the probabilistic model doesn’t capture all the things in the data 100% correctly. For example, there might be noise and inconsistencies in the real emails. Note that “2004” gets strongly associated with the spam emails and “2001” with the ham emails, although it does not mean that there is anything peculiar(奇怪的) about the spam originating from 2004. This might simply show a bias in the particular dataset, and such phenomena are hard to filter out in any real data, especially when you rely on the data collected by other researchers. This means that if some ham email in training data contains “2004” and a variety of other words that are otherwise related to spam, this email might get misclassified as spam by the algorithm. Similarly, as many medication-related words are strongly associated with spam, a rare ham email that is actually talking about some medication the user ordered might get misclassified as spam.
这里的训练集测试集的accuracy都不错,说明泛化能力还可以
If you’d like to gain any further insight into how the words are used in the emails from different classes, you can also check the occurrences of any particular word in all available contexts. For example, the word stocks features as a very strong predictor of spam messages. Why is that? You might be thinking, Okay, some emails containing “stocks” will be spam, but surely there must be contexts where “stocks” is used in a completely legitimate way?
from nltk.text import Text
def concordance(data_list, search_word):
for email in data_list:
word_list = [word for word in word_tokenize(email.lower())]
text_list = Text(word_list)
if search_word in word_list:
text_list.concordance(search_word)
print ("STOCKS in HAM:")
concordance(ham_list, "stocks")
print ("\n\nSTOCKS in SPAM:")
concordance(spam_list, "stocks")
This code shows how you can use NLTK’s concordancer(语料库检索工具), a tool that checks for the occurrences of the specified word and prints out this word in its contexts. By default, NLTK’s concordancer prints out the search_word surrounded by the previous 36 and the following 36 characters, so note that it doesn’t always result in full words. Once the use of concordancer is defined in the function concordance, you can apply it to both ham_list and spam_list to find the different contexts of the word stocks.
test_spam_list = ["Participate in our new lottery!", "Try out this new medicine"]
test_ham_list = ["See the minutes from the last meeting attached",
"Investors are coming to our office on Monday"]
test_emails = [(email_content, "spam") for email_content in test_spam_list]
test_emails += [(email_content, "ham") for email_content in test_ham_list]
new_test_set = [(get_features(email), label) for (email, label) in test_emails]
evaluate(train_set, new_test_set, classifier)
for email in test_spam_list:
print (email)
print (classifier.classify(get_features(email)))
for email in test_ham_list:
print (email)
print (classifier.classify(get_features(email)))
Participate in our new lottery!
spam
Try out this new medicine
spam
See the minutes from the last meeting attached
ham
Investors are coming to our office on Monday
ham
while True:
email = input("Type in your email here (or press 'Enter'): ")
if len(email)==0:
break
else:
prediction = classifier.classify(get_features(email))
print (f"This email is likely {prediction}\n")
这还挺好玩的~
感觉这本书的逻辑很清晰,而且很详细,很多地方真的是手把手step-by-step的讲解了