临风而眠

手把手实现邮件分类《Getting Started with NLP》chap2：Your first NLP example

《Getting Started with NLP》chap2：Your first NLP example

感觉这本书很适合我这种菜菜,另外下面的笔记还有学习英语的目的，故大多数用英文摘录或总结

文章目录

《Getting Started with NLP》chap2：Your first NLP example
2.1 Introducing NLP in practice: Spam filtering
- classification
- - 一个简单的例子
2.2 Understanding the task
- Step 1: Define the data and classes
- Step 2: Split the text into words
- - 为何需要split into words?
  - 如何split into words(exercise 2.2)
  - - split text string into words by whitespaces
    - split text string into words by whitespaces and punctuation
  - tokenizers
- Step 3: Extract and normalize the features
- Step 4: Train a classifier
- Step 5: Evaluate the classifier
2.3 Implementing your own spam filter
- Step 1: Define the data and classes
- - 解压文件
  - read in the contents of the files
  - verify that the data is uploaded and read in correctly
  - combine the data into a single structure
- Step 2: Split the text into words
- - run a tokenizer over text
- Step 3: Extract and normalize the features
- - Code to extract the features
  - - 这份代码做了啥，很清晰的解释图！
  - 仔细研究一下上面的数据结构
  - exercise2.5
- Step 4: Train the classifier
- - Code to train a Naïve Bayes classifier
  - Exercise 2.6
- Step 5: Evaluate your classifier
- - Code to check the contexts of specific words
2.4 Deploying your spam filter in practice
- apply spam filtering to new emails
- print out the predicted label
- classify the emails read in from the keyboard
Summary

This chapter covers
- Implementing your first practical NLP application from scratch
- Structuring an NLP project from beginning to end
- Exploring NLP concepts, including tokenization and text normalization
- Applying a machine learning algorithm to textual data

昨天刚在《机器学习实战》那本书里面看了以spam filtering为例子贯穿讲解的ML的landscape，今天就来跟着这个chapter实践一下

2.1 Introducing NLP in practice: Spam filtering

Spam filtering exemplifies a widely spread family of tasks —— text classification.

exemplify :to be or give a typical example of something 作为…的典范（或范例、典型、榜样等）

classification

作者给了很多非常贴近生活的例子 ~ 讲了 classification的usefulness，概括一下
- Classification helps us reason about things and adjust our behavior based on them.
- Classification allows us to group things into categories, making it easier to deal with individual instances.
We refer to the name of each class as a class label
- When we are dealing with two labels only, it is called binary classification
- Classification that implies more than two classes is called multiclass classification
How do we perform classification ?
- Classification is performed based on certain characteristics or features of the concepts being classified.
- The specific characteristics or features used in classification depend on the task at hand.
- In machine-learning terms, we call such characteristics “features”.

作者给了一段summary
- Classification refers to the process of identifying which category or class among the set of categories (classes) an observation belongs to based on its properties. In machine learning, such properties are called features and the class names are called class labels. If you classify observations into two classes, you are dealing with binary classification; tasks with more than two classes are examples of multiclass classification.

一个简单的例子

其实大一最早的时候做的经典的老掉牙if/else或者那个switch，输入学生成绩，给出成绩，就是simple classification

这里作者也给出了很simple的程序

For example, you can make the machine print out a warning that water is hot based on a simple threshold of 45°C (113°F), as listing 2.1 suggests. In this code, you define a function print_warning, which takes water temperature as input and prints out water status. The if statement checks if input temperature is above a predefined threshold（阈值） and prints out a warning message if it is. In this case, since the temperature is above 45°C, the code prints out Caution: Hot water!

#  Simple code to tell whether water is cold or hot
def print_warning(temperature): 
    if temperature>=45: 
        print ("Caution: Hot water!")
    else:
        print ("You may use water as usual")
print_warning(46)

感觉写的不错，作者把这个例子过渡到了ML 和监督学习

When there are multiple factors to consider in classification, it is better to let the machine learn the rules and patterns from data rather than hardcoding（写死） them. Machine learning involves providing a machine with examples and a general outline of a task, and allowing it to learn to solve the task independently. In supervised machine learning, the machine is given labeled data and learns to classify based on the provided features.

Supervised machine learning

Supervised machine learning refers to a family of machine-learning tasks in which the algorithm learns the correspondences between an input and an output based on the provided labeled examples. Classification is an example of a supervised machine learning task, where the algorithm tries to learn the mapping between the input data and the output class label.

Using the cold or hot water example, we can provide the machine with the samples of water labeled hot and samples of water labeled cold, tell it to use temperature as the predictive factor (feature), and this way let it learn independently from the provided data that the boundary between the two classes is around 45°C (113°F)

2.2 Understanding the task

作者给出了scenario （a description of possible actions or events in the future可能发生的事态；设想）

Consider the following scenario: you have a collection of spam and normal emails from the past. You are tasked with building a spam filter, which for any future incoming email can predict whether this email is spam or not. Consider these questions:

How can you use the provided data?
What characteristics of the emails might be particularly useful, and how will you extract them?
What will be the sequence of steps in this application?

In this section, we will discuss this scenario and look into the implementation steps. In total, the pipeline for this task will consist of five steps.

The pipeline

Step 1: Define the data and classes

“normal” emails are sometimes called “ham” in the spam-detection context

First, you need to ask yourself what format the email messages are delivered in for this task. For instance, in a real-life situation, you might need to extract the messages from the mail agent application. However, for simplicity, let’s assume that someone has extracted the emails for you and stored them in text format. The normal emails are stored in a separate folder—let’s call it Ham, and spam emails are stored in a Spam folder.

If someone has already predefined past spam and ham emails for you (e.g., by extracting these emails from the INBOX and SPAM box), you don’t need to bother with labeling them. However, you still need to point the machine-learning algorithm at the two folders by clearly defining which one is ham and which one is spam. This way, you will define the class labels and identify the number of classes for the algorithm. This should be the first step in your spam-detection pipeline (and in any text-classification pipeline), after which you can preprocess the data, extract the relevant information, and then train and test your algorithm (figure 2.5).

You can set step 1 of your algorithm as follows: Define which data represents “ham” class and which data represents “spam” class for the machine-learning algorithm.

Step 2: Split the text into words

Next, you will need to define the features for the machine to know what type of information, or what properties of the emails to pay attention to, but before you can do that, there is one more step to perform – split the text into words.

为何需要split into words?

The content of an email can be useful for identifying spam, but using the entire email as a single feature may not work well because even small changes can affect it. Instead, smaller chunks of text like individual words might be more effective features because they are more likely to carry spam-related information and are repetitive enough to appear in multiple emails.

如何split into words(exercise 2.2)

For a machine, the text comes in as a sequence of symbols, so the machine does not have an idea of what a word is.
- How would you define what a word is from the human perspective?
- How would you code this for a machine?
For example, how will you split the following sentence into words? “Define which data represents each class for the machine learning algorithm.”

split text string into words by whitespaces

The first solution might be “Words are sequences of characters separated by whitespaces.”

simple code to split text string into words by whitespaces

text = "Define which data represents each class for the machine learning algorithm"
text.split(" ")

['Define',
 'which',
 'data',
 'represents',
 'each',
 'class',
 'for',
 'the',
 'machine',
 'learning',
 'algorithm']

So far(到目前为止), so good. However, what happens to this strategy when we have punctuation marks? For example: “Define which data represents “ham” class and which data represents “spam” class for the machine learning algorithm.”
```
['Define',
 'which',
 'data',
 'represents',
 '“ham”',
 'class',
 'and',
 'which',
 'data',
 'represents',
 '“spam”',
 'class',
 'for',
 'the',
 'machine',
 'learning',
 'algorithm.']
```
In the list of words, are [“ham”], [“spam”], and [algorithm.] any different from [ham], [spam], and [algorithm]? That is, the same words but without the punctuation marks attached to them? The answer is, these words are exactly the same, but because you are only splitting by whitespaces at the moment, there is no way of taking the punctuation marks into account.

However, each sentence will likely include one full stop (.), question (?), or exclamation mark (!) attached to the last word, and possibly more punctuation marks inside the sentence itself, so this is going to be a problem for properly extracting words from text. Ideally, you would like to be able to extract words and punctuation marks separately.

split text string into words by whitespaces and punctuation

那么如何把标点符号punctuation考虑进来？书上给出了很详细的算法描述！

Store words list and a variable that keeps track of the current word—let’s call it current_word for simplicity.
Read text character by character:
- If a character is a whitespace, add the current_word to the words list and update the current_word variable to be ready to start a new word.
- Else if a character is a punctuation mark:
  - If the previous character is not a whitespace, add the current_word to the words list; then add the punctuation mark as a separate word token, and update the current_word variable.
  - Else if the previous character is a whitespace, just add the punctuation mark as a separate word token.
- Else if a character is a letter other than a whitespace or punctuation mark, add it to the current_word.

手把手实现邮件分类《Getting Started with NLP》chap2：Your first NLP example_第3张图片

orz，感觉这图做得很好

Code to split text string into words by whitespaces and punctuation

delimiter 分隔符，定界符

text = "Define which data represents “ham” class and which data represents “spam” class for the machine learning algorithm."
delimiters = ['"', "."]
words = []
current_word = ""
for char in text:
    if char==" ":
        if not current_word=="":
            words.append(current_word)
            current_word = ""
    elif char in delimiters:
        if current_word=="":  #current_word是空的说明前面一个是whitespace，只把标点加入list
            words.append(char)
        else:
            words.append(current_word)
            words.append(char)
            current_word = ""
    else:
        current_word += char

print(words)

['Define', 'which', 'data', 'represents', '“ham”', 'class', 'and', 'which', 'data', 'represents', '“spam”', 'class', 'for', 'the', 'machine', 'learning', 'algorithm', '.']

这个时候完美了吗? 没有，有很多时候我们用的缩写比如e.g.,i.e.是不需要拆的，但是用刚刚的算法会拆成[‘i’,‘.’,‘e’,'.]

tokenizers

This is problematic, since if the algorithm splits these examples in this way, it will lose track of the correct interpretation of words like i.e. or U.S.A., which should be treated as one word token rather than a combination of characters. How can this be achieved?
This is where the NLP tools come in handy: the tool that helps you split the running string of characters into meaningful words is called tokenizer, and it takes care of the cases like the ones we’ve just discussed—that is, it can recognize that ham. needs to be split into [‘ham’, ‘.’] while U.S.A. needs to be kept as one word [‘U.S.A.’]. Normally, tokenizers rely on extensive and carefully designed lists of regular expressions, and some are trained using machine-learning approaches.
这里作者介绍了NLTK

For an example of a regular expressions-based tokenizer, you can check the Natural Language Processing Toolkit’s regexp_tokenize() to get a general idea of the types of the rules that tokenizers take into account: see Section 3.7 on www.nltk.org/book/ch03.html. The lists of rules applied may differ from one tokenizer to another.
tokenizer还可以给我们解决哪些常见的problem呢？缩写（contraction）

看几个例子
- What’s the best way to cook a pizza?
  - The first bit, What’s, should also be split into two words: this is a contraction for what and is, and it is important that the classifier knows that these two are separate words. Therefore, the word list for this sentence will include [What, 's, the, best, way, to, cook, a, pizza, ?].
- We’re going to use a baking stone.
  - we’re should be split into we and 're (“are”). Therefore, the full word list will be [We, 're, going, to, use, a, baking, stone, .].
- I haven’t used a baking stone before.
  - [I, have, n’t, used, a, baking, stone, before, .]. Note that the contraction of have and not here results in an apostrophe inside the word not; however, you should still be able to recognize that the proper English words in this sequence are have and n’t (“not”) rather than haven and 't. This is what the tokenizer will automatically do for you. (Note that the tokenizers do not automatically map contracted forms like n’t and ’re to full form like not and are. Although such mapping would be useful in some cases, this is beyond the functionality of tokenizers.)
如何解决这个呢，正则表达式是一个神奇的东西
```
import nltk
import re

text1 = "What’s the best way to cook a pizza?"
text2 = "We’re going to use a baking stone."
text3 = "I haven’t used a baking stone before."

texts = [text1, text2, text3]

tokens = []
for text in texts:
  # Replace certain characters with a space
  text = re.sub(r'[’]', '\'', text)
  print(text)
  # text = re.sub(r'[^\w\s\.\?\']', ' ', text)
  # print(text)
  # Tokenize the text
  tokens.append(nltk.word_tokenize(text))

for token in tokens:
  print(token)
# print(tokens)
```
```
What's the best way to cook a pizza?
We're going to use a baking stone.
I haven't used a baking stone before.
['What', "'s", 'the', 'best', 'way', 'to', 'cook', 'a', 'pizza', '?']
['We', "'re", 'going', 'to', 'use', 'a', 'baking', 'stone', '.']
['I', 'have', "n't", 'used', 'a', 'baking', 'stone', 'before', '.']
```

Step 3: Extract and normalize the features

作者举了个例子
- Suppose two emails use a different format:
  - one says Collect your lottery winnings
  - while another one says Collect Your Lottery Winnings
  拆开的词语里面 lottery 和 Lottery 形式上确实不同，但是意思上是一样的，都是博彩
我们需要get rid of such formatting issues
The problem is that small formatting differences, such as the use of uppercase versus lowercase letters, can result in different word lists being generated. To address this issue, the passage suggests normalizing the extracted words by converting them to lowercase. This can help ensure that the meaning of the words is not affected by formatting differences and that the extracted words are more indicative（标示的） of the spam-related content of the emails.

Step 4: Train a classifier

At this point, you will end up with two sets of data—one linked to the spam class and another one linked to the ham class. Each data is preprocessed in the same way in steps 2 and 3, and the features are extracted.
Next, you need to let the machine use this data to build the connection between the set of features (properties) that describe each type of email (spam or ham) and the labels attached to each type. In step 4, a machine-learning algorithm tries to build a statistical model, a function, that helps it distinguish between the two classes. This is what happens during the learning (training) phase（阶段）.
A refresher visualizing the training and test processes.
So, step 4 of the algorithm should be defined as follows: define a machine-learning model and train it on the data with the features predefined in the previous steps
Suppose that the algorithm has learned to map the features of each class of emails to the spam and ham labels, and has determined which features are more indicative of spam or ham. The final step in this process is to make sure the algorithm is doing such predictions well.
How will you do that? 划分数据集
The data used for training and testing should be chosen randomly and should not overlap（交叠） in order to avoid biasing the evaluation. The training set is typically larger, with a typical proportion being 80% for training and 20% for testing. The classifier should not be allowed to access the test set during the training phase, and it should only be used for evaluation at the final step.

shuffle 洗牌

Data splits for supervised machine learning

In supervised machine learning, the algorithm is trained on a subset of the labeled data called training set. It uses this subset to learn the function mapping the input data to the output labels. Test set is the subset of the data, disjointed（分离） from the training set, on which the algorithm can then be evaluated. The typical data split is 80% for training and 20% for testing. Note that it is important that the two sets are not overlapping. If your algorithm is trained and tested on the same data, you won’t be able to tell what it actually learned rather than memorized.

Step 5: Evaluate the classifier

Suppose you trained your classifier in step 4 and then applied it to the test data. How will you measure the performance?
One approach would be to check what proportion of the test emails the algorithm classifies correctly—that is, assigns the spam label to a spam email and classifies ham emails as ham. This proportion is called accuracy, and its calculation is pretty straightforward:
$=\frac{{ num }(\text { correct predictions })}{n u m(\text { all test instances })}$
这里给了个练习2.4

Let’s discuss the solutions to this exercise (note that you can also find more detailed explanations at the end of the chapter). The prediction of the classifier based on the distribution of classes that you came across in this exercise is called baseline. In an equal class distribution case, the baseline is 50%, and if your classifier yields an accuracy of 60%, it outperforms this baseline. In the case of 60:40 split, the baseline, which can also be called the majority class baseline, is 60%. This means that if a dummy “classifier” does no learning at all and simply predicts the ham label for all emails, it will not filter out any spam emails from the inbox, but its accuracy will also be 60%—just like your classifier that is actually trained and performs some classification! This makes the classifier in the second case in this exercise much less useful because it does not outperform the majority class baseline.
书后给的解答
1. An accuracy of 60% doesn’t seem very high, but how exactly can you interpret it? Note that the distribution of classes helps you to put the performance of your classifier in context because it tells you how challenging the problem itself is. For example, with the 50–50% split, there is no majority class in the data and the classifier’s random guess will be at 50%, so the classifier’s accuracy is higher than this random guess. In the second case, however, the classifier performs on a par with the majority class guesser: the 60% to 40% distribution of classes suggests that if some dummy “classifier” always selected the majority class, it would get 60% of the cases correctly—just like the classifier you trained.
2. The single accuracy value of 60% does not tell you anything about the performance of the classifier on each class, so it is a bit hard to interpret. However, if you look into each class separately, you can tell that the classifier is better at classifying ham emails (it got 2/3 of those right) than at classifying spam emails (only 1/2 are correct).
In summary, accuracy is a good overall measure of performance, but you need to keep in mind
- (1) the distribution of classes to have a comparison point for the classifier’s performance
- (2) the performance on each class, which is hidden within a single accuracy value but might suggest what the strengths and weaknesses of your classifier are. Therefore, the final step, step 5, in your algorithm is applying your classifier to the test data and evaluating its performance

2.3 Implementing your own spam filter

Step 1: Define the data and classes

数据集：Enron Email Dataset (cmu.edu)
- This is a dataset of emails, including both ham (extracted from the original Enron dataset using personal messages of three Enron employees) and spam emails. To make processing more manageable, we will use a subset of this large dataset.
- All folders in the Enron dataset contain spam and ham emails in separate subfolders. Each email is stored as a text file in these subfolders. Let’s read in the contents of these text files in each subfolder, store the spam emails’ contents and the ham emails’ contents as two separate data structures, and point our algorithm at each, clearly defining which one is spam and which one is ham.

解压文件

我用的是codespace，我想到三种解压方式，一个是vscode也许有extension，另一个是terminal里面使用unzip，还有就是用python来解压，我打算用后两种都尝试一下
我直接把这本书的GitHub仓库ekochmar/Getting-Started-with-NLP: This repository accompanies the book “Getting Started with Natural Language Processing” (github.com)在codespace打开，然后建了一个叫learningThrough的文件夹，观察一下目标文件位置，我怕里面的文件直接冲出来于是搞了个subfolder叫test

然后
```
unzip enron1.zip -d ./learningThrough/test
unzip enron2.zip -d learningThrough/test
```
这里的路径前面写./ 或者前面不加东西

用python来试试

但是遇到了Bad pipe message

难道是因为我指定的那个try2zip一开始不存在这个directory？

那我再来一次，直接解压到current directory 下面吧

import zipfile

# Specify the zip file name
zip_file = '../enron1.zip'

# Create a ZipFile object
zip_obj = zipfile.ZipFile(zip_file, 'r')

# Extract all the contents of zip file to the destination folder
zip_obj.extractall()

# close the Zip File
zip_obj.close()

import zipfile

# Specify the zip file name
zip_file = '../enron2.zip'

# Create a ZipFile object
zip_obj = zipfile.ZipFile(zip_file, 'r')

# Extract all the contents of zip file to the destination folder
zip_obj.extractall()

# close the Zip File
zip_obj.close()

好耶

read in the contents of the files

Let’s define a function read_in that will take a folder as an input, read the files in this folder, and store their contents as a Python list data structure.

不得不说这本书给的注释是真的详细…orz

import os
# helps with different text encoding
import codecs
def read_in(folder):
# Using os functionality, list all the files in the specified folder
the files in the specified folder.
    files = os.listdir(folder)
    a_list = []
    # Iterate through the files in the folder
    for a_file in files:
        # Skip hidden files.
        if not a_file.startswith("."):
            # Read the contents of each file.
            f = codecs.open(folder + a_file, "r",encoding="ISO-8859-1",errors="ignore")
            # Add the content of each file to the list data structure.
            a_list.append(f.read())
            # Close the fileafter you readthe contents.
            f.close()
    return a_list

In this code, you rely on Python’s os module functionality to list all the files in the specified folder, and then you iterate through them, skipping hidden files (以.开头的那些文件,such files can be easily identified because their names start with “.” ) that are sometimes automatically created by the operating systems.

Next, you read the contents of each file.

The encoding and errors arguments of codecs.open function will help you avoid errors in reading files that are related to text encoding.

codecs是啥捏

The codecs module in Python provides functions to encode and decode data using various codecs (encoding/decoding algorithms).

Here are some common codecs that you can use with the codecs module:

utf-8: A Unicode encoding that can handle any character in the Unicode standard. It’s the most widely used encoding for the Web.

ascii: An encoding for the ASCII character set, which consists of 128 characters.

latin-1: An encoding for the Latin-1 character set, which consists of 256 characters.

utf-16: A Unicode encoding that uses two bytes (16 bits) to represent each character.

You add the content of each file to a list data structure, and in the end, you return the list that contains the contents of the files from the specified folder.

verify that the data is uploaded and read in correctly

现在我们可以定义spam_list和ham_list—letting the machine know what data represents spam emails and what data represents ham emails.
Let’s check if the data is uploaded correctly; for example, you can print out the lengths of the lists or check any particular member of the list.

Summary.txt里面列出来了，那我们就看看条数对不对

the length of the spam_list should equal the number of spam emails in the enron1/spam/ folder, which should be 1,500, while the length of the ham_list should equal the number of emails in the enron1/ham/, or 3,672. If you get these numbers, your data is uploaded and read in correctly.

代码如下

spam_list = read_in("./enron1/spam/")
print(len(spam_list))
# print(spam_list[0])
ham_list = read_in("./enron1/ham/")
print(len(ham_list))
print(ham_list[0])

条数是对的，好耶

combine the data into a single structure

Next, we need to preprocess the data (e.g., by splitting text strings into words) and extract the features. Wouldn’t it be easier if you could run all preprocessing steps over a single data structure rather than over two separate lists?

这里不采用for循环，为了加上标签，我们把每个邮件和其标签组成一个元组，然后再放入一个新的list all_emails 里面

verify一下是否读进去了，输出太长，用切片切一下

手把手实现邮件分类《Getting Started with NLP》chap2：Your first NLP example_第7张图片

此外，我们还要考虑到划分数据集的随机性

We need to split the data randomly into the training and test sets. To that end（因为那个缘故）, let’s shuffle the resulting list of emails with their labels, and make sure that the shuffle is reproducible by fixing the way in which the data is shuffled. For the shuffle to be reproducible, you need to define the seed for the random operator, which makes sure that all future runs will shuffle the data in exactly the same way.

# Python’s random module will help you shuffle the data randomly.
import random
# Use list comprehensions to create the all_emails list that will keep all emails with their labels.
all_emails = [(email_content, "spam") for email_content in spam_list]
all_emails += [(email_content, "ham") for email_content in ham_list]
# Select the seed of the random operator to make sure that all future runs will shuffle  the data in the same way
random.seed(42)
random.shuffle(all_emails)
# Check the size of the dataset (lengthof the list); it should be equal to 1,500 + 3,672 
# This kind of string is called formatted string literals or f-strings, and it is a new feature introduced in Python 3.6
print (f"Dataset size = {str(len(all_emails))} emails")

Dataset size = 5172 emails

Step 2: Split the text into words

Remember that the email contents you’ve read in so far each come as a single string of symbols. The first step of text preprocessing involves splitting the running text into words.
我们还是用NLTK
- One of the benefits of this toolkit is that it comes with a thorough documentation and description of its functionality.
  
  （该工具包的好处是，它具有详尽的文档）
- NLTK :: nltk.tokenize package
It takes running text as input and returns a list of words based on a number of customized regular expressions, which help to delimit the text by whitespaces and punctuation marks, keeping common words like U.S.A. unsplit.

run a tokenizer over text

code

This code defines a tokenize function that takes a string as input and splits it into words.

The for-loop within this function appends each identified word from the tokenized string to the output word list; alternatively, you can use list comprehensions(列表生成式) for the same purpose. Finally, given the input, the function prints out a list of words. You can test your intuitions about the words and check your answers to previous exercises by changing the input to any string of your choice
```
import nltk
from nltk import word_tokenize 
nltk.download('punkt')
def tokenize(input): 
    word_list = []
    for word in word_tokenize(input):
        word_list.append(word) 
    return word_list
input = "What's the best way to split a sentence into words?"
print(tokenize(input))
```
In addition to the toolkit itself, you need to install NLTK data as explained on www.nltk.org/data.html. Running nltk.download() will install all the data needed for text processing in one go; in addition, individual tools can be installed separately (e.g., nltk.download(‘punkt’) installs NLTK’s sentence tokenizer)
```
['What', "'s", 'the', 'best', 'way', 'to', 'split', 'a', 'sentence', 'into', 'words', '?']
```
如果我用列表生成式（list comprehension）,那就这样写
```
import nltk
from nltk import word_tokenize 
def tokenize(input): 
    word_list = [word for word in word_tokenize(input)]
    return word_list
input = "What's the best way to split a sentence into words?"
print(tokenize(input))
```

Step 3: Extract and normalize the features

这里作者非常推崇list comprehensions，确实，我得提高使用它的意识

Once the words are extracted from running text, you need to convert them into features. In particular, you need to put all words into lowercase to make your algorithm establish the connection between different formats like Lottery and lottery.
Putting all strings to lowercase can be achieved with Python’s string functionality. To extract the features (words) from the text, you need to iterate through the recognized words and put all words to lowercase. In fact, both tokenization and converting text to lowercase can be achieved using a single line of code with list comprehensions.

word_list = [word for word in word_tokenize(text.lower())]

Using list comprehensions, you can combine the two steps—tokenization and converting strings to lowercase—in one line. Here, you first normalize and then tokenize text, but the two steps are interchangeable.
We define a function get_features that extracts the features from the text of email passed in as input. Next, for each word in the email, you switch on the “flag” that the word is contained in this email by assigning it with a True value. The list data structure all_features keeps tuples containing the dictionary of features matched with the spam or ham label for each email.

Code to extract the features

import nltk
from nltk import word_tokenize

def get_features(text):
    features = {}
    word_list = [word for word in word_tokenize(text.lower())]
    for word in word_list:
    # For each word in the email, switch on the “flag” that this word is contained in the email.
        features[word] = True
    return features
# all_features will keep tuples containing the dictionary of features matched with the label for each email

all_features = [(get_features(email), label) for (email, label) in all_emails]
# Check what features are extracted from an input text
print(get_features("Participate In Our New Lottery NOW!"))
print(len(all_features))
# Check what all_features list data structure contains.

print(len(all_features[0][0]))
print(len(all_features[99][0]))

In the end, the code shows how you can check what features are extracted from an input text and what all_features list data structure contains (e.g., by printing out its length and the number of features detected in the first or any other email in the set)

{'participate': True, 'in': True, 'our': True, 'new': True, 'lottery': True, 'now': True, '!': True}
5172
321
97

随机shuffle确实是有效的，我再来一次得到的是

5172
72
157

其实就是创建了一个dictionary

With this bit of code, you iterate over the emails in your collection (all_emails) and store the features extracted from each email matched with the label. For example, if a spam email consists of a single sentence (e.g., “Participate In Our New Lottery NOW!”), your algorithm will first extract the list of features present in this email and assign a True value to each of them. The dictionary of features will be represented using this format: [‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True]. Then the algorithm will add this data structure to all_features together with the spam label: ([‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True], “spam”)

这份代码做了啥，很清晰的解释图！

仔细研究一下上面的数据结构

刚刚那个代码创建了一个list叫all_features, a list of tuples. where each tuple represents an individual email.

So the total length of all_features is equal to the number of emails in your dataset.

As each tuple in this list corresponds to an individual email, you can access each one by the index in the list using all_features[index]: for example, you can access the first email in the dataset as all_features[0] (remember, that Python’s indexing starts with 0), and the hundredth as all_features[99] (for the same reason)

Let’s now clarify what each tuple structure representing an email contains. Tuples pair up two information fields. In this case, a dictionary of features extracted from the email and its label—that is, each tuple in all_features contains a pair (dict_of_ features, label). So if you’d like to access the first email in the list, you call on all_ features[0]; to access its features, you use all_features[0][0]; and to access its label, you use all_features[0][1].

又是一张好图！

For example, if the first email in your collection is a spam email with the content “Participate in our lottery now!”, all_features[0] will return the tuple ([‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True], “spam”); all_features[0][0] will return the dictionary [‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True], and all_features[0][1] will return the value spam

exercise2.5

Imagine your whole dataset contained only one spam email (“Participate In Our New Lottery NOW!”) and one ham email (“Participate in the Staff Survey”). What features will be extracted from this dataset with the code from listing 2.8?(就是上面那个get_feature的代码)

手把手实现邮件分类《Getting Started with NLP》chap2：Your first NLP example_第9张图片

Step 4: Train the classifier

这里作者介绍了用朴素贝叶斯 Naive Bayes来处理

Next, let’s apply machine learning and teach the machine to distinguish between the features that describe each of the two classes.
There are several classification algorithms that you can use, and you will come across several of them in this book. But since you are at the beginning of your journey, let’s start with one of the most interpretable（可解释的） ones—an algorithm called Naïve Bayes. Don’t be misled by the word Naïve in its title, though. Despite the relative simplicity of the approach compared to other ones, this algorithm often works well in practice and sets a competitive performance baseline that is hard to beat with more sophisticated（复杂的） approaches. For the spam-filtering algorithm that you are building in this chapter, you will rely on the Naïve Bayes implementation provided with the NLTK library, so don’t worry if some details of the algorithm seem challenging to you. However, if you would like to see what is happening “under the hood,” this section will walk you through the details of the algorithm.
Naïve Bayes is a probabilistic classifier, which means that it makes the class prediction based on the estimate of which outcome is most likely (i.e., it assesses the probability of an email being spam and compares it with the probability of it being ham), and then selects the outcome that is most probable between the two. In fact, this is quite similar to how humans assess whether an email is spam or ham. When you receive an email that says “Participate in our lottery now! Click on this link,” before clicking on the (potentially harmful) link, you assess how likely it is (i.e., what is the probability?) that it is a ham email and compare it to how likely it is that this email is spam. Based on your experience and all the previous spam and ham emails you have seen before, you might judge that it is much more likely (more probable) that it is a spam email. By the time the machine makes a prediction, it has also accumulated some experience in distinguishing spam from ham that is based on processing a dataset of labeled spam and ham emails.
In this step, the machine will try to predict whether the email content represents spam or ham. In other words, it will try to predict whether the email is spam or ham given or conditioned on its content. This type of probability, when the outcome (class of spam or ham) depends on the condition (words used as features), is called conditional probability（条件概率）. For spam detection, you estimate P(spam | email content) and P(ham | email content), or generally P(outcome | (given) condition). Then you compare one estimate to another and return the most probable class. For example:

If P(spam | content) = 0.58 and P(ham | content) = 0.42, predict spam

If P(spam | content) = 0.37 and P(ham | content) = 0.63, predict ham

Reminder on the notation(符号): P is used to represent all probabilities; | is used in conditional probabilities, when you are trying to estimate the probability of some event (that is specified before |) given that the condition (that is specified after |) applies.
In summary, this boils down to(归结为) the following set of actions illustrated in the figure below.
我们人往往根据经验有个判断，那么机器呢

Similarly, a machine can estimate the probability that an email is spam or ham conditioned on its content, taking the number of times it has seen this content leading to a particular outcome:
$\mid \text{“Participate in our lottery now!”} )=\frac{n u m(\text { spam emails with “Participate in our lottery now!”) }}{n u m(\text { all emails with “Participate in our lottery now!”) }}\\ P( ham \mid \text{“Participate in our lottery now!”} )=\frac{n u m(\text { ham emails with “Participate in our lottery now!”) }}{n u m(\text { all emails with “Participate in our lottery now!”) }}$
In the general form, this can be expressed as
$\mid condition )=\frac{\text { numOfTimes }(\text { condition led to outcome })}{\text { numOfTimes }(\text { condition applied })}$
此时问题出现了
- You (and the machine) will need to make such estimations for all types of content in your collection, including for the email contents that are much longer than “Participate in our new lottery now!” Do you think you will come across enough examples to reliably make such estimations? In other words, do you think you will see any particular combination of words (that you use as features), no matter how long, multiple times so that you can reliably estimate the probabilities from these examples? The answer is you will probably see “Participate in our new lottery now!” only a few times, and you might see longer combinations of words only once, so such small numbers won’t tell the algorithm much and you won’t be able to use them in the previous expression effectively.
- Additionally, you will constantly get new emails where the words will be used in a different order and different combinations, so for some of these new combinations you will not have any counts at all, even though you might have counts for individual words in such new emails. The solution to this problem is to split the estimation into smaller bits. For instance, remember that you used tokenization to split long texts into separate words to let the algorithm access the smaller bits of information—words rather than whole sequences. The idea of estimating probabilities based on separate features rather than based on the whole sequence of features (i.e., whole text) is rather similar.
At the moment, you are trying to predict a single outcome (class of spam or ham) given a single condition that is the whole text of the email; for example, “Participate in our lottery now!” In the previous step, you converted this single text into a set of features as [‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True]. Note that the conditional probabilities like P(spam| “Participate in our lottery now!”) and P(spam| [‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True]) are the same because this set of features encodes the text. Therefore, if the chances of seeing “Participate in our lottery now!” are low, the chances of seeing the set of features [‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True] encoding this text are equally low.

Is there a way to split this set to get at more fine-grained（细粒度）, individual probabilities, such as to establish a link between [‘lottery’: True] and the class of spam?
- Unfortunately, there is no way to split the conditional probability estimation like P(outcome | conditions) when there are multiple conditions specified; however, it is possible to split the probability estimation like P(outcomes | condition) when there is a single condition and multiple outcomes. In spam detection, the class is a single value (it is spam or ham), while features are a set ([‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True]). If you can flip（浏览） around the single value of class and the set of features in such a way that the class becomes the new condition and the features become the new outcomes, you can split the probability into smaller components and establish the link between individual features like [‘lottery’: True] and class values like spam.

下面要学习公式推导咯

Luckily, there is a way to flip the outcomes (class) and conditions (features extracted from the content) around! Let’s look into the estimation of conditional probabilities again: you estimate the probability that the email is spam given that its content is “Participate in our new lottery now!” based on how often in the past an email with such content was spam. For that, you take the proportion of the times you have seen “Participate in our new lottery now!” in a spam email among the emails with this content. You can express it as
$\mid \text{“Participate in our lottery now!”} )=\\\frac{\mathrm{P} \text { (“Participate in our lottery now!” is in a spam email })}{P \text { (“Participate in our lottery now!” is used in an email) }}$
Let’s call this Formula 1. What is the conditional probability of the content “Participate in our new lottery now!” given class spam, then? Similarly the probabilities earlier, you need the proportion of times you have seen “Participate in our new lottery now!” in a spam email among all spam emails. You can express it as
$\text{“Participate in our lottery now!”} \mid spam )= \frac{\mathrm{P} \text { (“Participate in our lottery now!" is in a spam email) }}{P(\text { an email is spam) }}$
Let’s call this Formula 2. Every time you use conditional probabilities, you need to divide how likely it is that you see the condition and outcome together by how likely it is that you see the condition on its own—this is the bit（约束）after |. Now you can see that both Formulas 1 and 2 rely on how often you see particular content in an email from a particular class. They share this bit, so you can use it to connect the two formulas. For instance, from Formula 2 you know that
${\mathrm{P} \text { (“Participate in our lottery now!" is in a spam email) }}= \\P( \text{“Participate in our lottery now!”} \mid spam )\times{P(\text { an email is spam) }}$
Now you can fit this into Formula 1：
$\mid \text{“Participate in our lottery now!”} )=\\\frac{\mathrm{P} \text { (“Participate in our lottery now!” is in a spam email })}{P \text { (“Participate in our lottery now!” is used in an email) }}=\\\\\frac{P( \text{“Participate in our lottery now!”} \mid spam )\times{P(\text { an email is spam) }}}{P \text { (“Participate in our lottery now!” is used in an email) }}$
图解如下
总结一下
- In other words, you can express the probability of a class given email content via the probability of the content given the class. Let’s look into these two new probabilities, P(content | class) and P(class), more closely as they have interesting properties:
  - P(class) expresses the probability of each class. This is simply the distribution of the classes in your data. Imagine opening your inbox and seeing a new incoming email. What do you expect this email to be—spam or ham? If you mostly get normal emails and your spam filter is working well, you will probably expect a new email to also be ham rather than spam. For example, the enron1/ ham folder contains 3,672 emails, and the spam folder contains 1,500 emails, making the distribution approximately 71:29, or P(“ham”)=0.71 and P(“spam”)=0.29. This is often referred to as prior probability, as it reflects the beliefs of the classifier about where the data comes from prior to any particular evidence; for example, here the classifier will expect that it is much more likely (chances are 71:29) that a random incoming email is ham.
  - P(content | class) is the evidence, as it expresses how likely it is that you (or the algorithm) will see this particular content given that the email is spam or ham. For example, imagine you have opened this new email and now you can assess how likely it is that these words are used in a spam email versus（与） how likely it is that they are used in a ham email. The combination of these factors may in the end change your, or the classifier’s, original belief about the most likely class that you had before seeing the content (evidence).
Now you can replace the conditional probability of P(class | content) with P(content | class); for example, whereas before you had to calculate P("spam" | “Participate in our new lottery now!”) or equally P("spam" | ['participate': True, 'in': True, ..., 'now': True, '!': True]), which is hard to do because you will often end up with too few examples of exactly the same email content or exactly the same combination of features, now you can estimate P(['participate': True, 'in': True, ..., 'now': True, '!': True] | "spam")instead. But how does this solve the problem? Aren’t you still dealing with a long sequence of features?

这时候朴素贝叶斯又派上用场辣

Here is where the “naïve” assumption in Naïve Bayes helps: it assumes that the features are independent of each other, or that your chances of seeing a word lottery in an email are independent of seeing a word new or any other word in this email before. Therefore, you can estimate the probability of the whole sequence of features given a class as a product of probabilities of each feature given this class:
$\begin{array}{c}\\P([\text { 'participate': True, 'in': True, } \ldots, \text { 'now': True, '!': True }] \text { "spam" })= \\\\P(\text { 'participate': True } \mid \text { "spam") } \times P(\text { 'in': True } \mid \text { "spam" }) \times \ldots \times P(\text { '!" : True } \mid \text { "spam") }\\\end{array}\\\\$
If you express ['participate': True] as the first feature in the feature list, or $f_{1}, ['in': True]$ as$ f_{2}$, and so on, until $f_{n}= ['!': True]$ , you can use the general formula
$\\P\left(\left[f_{1}, f_{2}, \ldots, f_{n}\right] \mid \text { class }\right)=P\left(f_{1} \mid \text { class }\right) \times P\left(f_{2} \mid \text { class }\right) \times \ldots \times P\left(f_{n} \mid \text { class }\right)$
Now that you have broken down the probability of the whole feature set given class into the probabilities for each word given that class, how do you actually estimate them? Since for each email you note which words occur in it, the total number of times you can switch on the flag ['feature': True] equals the total number of emails in that class, while the actual number of times you switch on this flag is the number of emails where this feature is actually present. The conditional probability P(feature | class) is simply the proportion:
$KaTeX parse error: Expected 'EOF', got '_' at position 108: …}{\text { total_̲num(emails in c…$
Now you have all the components in place. Let’s iterate through the classification steps again.
- First, during the training phase, the algorithm learns prior class probabilities. This is simply class distribution—for example, P(ham)=0.71 and P(spam)=0.29.
- Secondly, the algorithm learns probabilities for each feature given each of the classes. This is the proportion of emails with each feature in each class—for example, P('meeting':True | ham) = 0.50. During the test phase, or when the algorithm is applied to a new email and is asked to predict its class, the following comparison from the beginning of this section is applied
  $\left\{\begin{array}{ll}P(\text { spam } \mid \text { content }) \geq P(\text { ham } \mid \text { content }) & \Rightarrow \text { spam } \\ \text { otherwise } & \Rightarrow \text { ham }\end{array}\right.$
  This is what we started with originally, but we said that the conditions are flipped, so it becomes
  $\left\{\begin{array}{ll}\frac{P(\text { content } \mid \text { spam }) \times P(\text { spam })}{P(\text { content })} \geq \frac{P(\text { content } \mid \text { ham }) \times P(\text { ham })}{P(\text { content })} & \Rightarrow \text { spam } \\ \text { otherwise } & \Rightarrow \text { ham }\end{array}\right.$
  Note that we end up with P(content) in the denominator(分母) on both sides of the expression, so the absolute value of this probability doesn’t matter, and it can be removed from the expression altogether. As an aside（顺便说一句）, since the probability always has a positive value, it won’t change the comparative values on the two sides; for example, if you were comparing 10 to 4, you would get 10 > 4 whether you divide the two sides by the same positive number like (10/2) > (4/2) or not. So we can simplify the expression as
  $\left\{\begin{array}{ll}P(\text { content } \mid \text { spam }) \times P(\text { spam }) \geq P(\text { content } \mid \text { ham }) \times P(\text { ham }) & \Rightarrow \text { spam } \\ \text { otherwise } & \Rightarrow \text { ham }\end{array}\right.$
  P(spam) and P(ham) are class probabilities estimated during training, and P(content | class), using naïve independence assumption, are products(乘积) of probabilities, so
  $\left\{\begin{array}{ll}\\P\left(\left[f_{1}, f_{2}, \ldots, f_{n}\right] \mid \text { spam }\right) \times P(\text { spam }) \geq P\left(\left[f_{1}, f_{2}, \ldots, f_{n}\right] \mid h a m\right) \times P(\text { ham }) & \Rightarrow \text { spam } \\\\\text { otherwise } & \Rightarrow \text { ham }\\\end{array}\right.\\$
  is split into the individual feature probabilities as
  $\left\{\begin{array}{ll}\\P\left(f_{1} \mid \text { spam }\right) \times \ldots \times P\left(f_{n} \mid \text { spam }\right) \times P(\text { spam }) \geq P\left(f_{1} \mid \text { ham }\right) \times \ldots \times P\left(f_{n} \mid \text { ham }\right) \times P(\text { ham }) & \Rightarrow \text { spam } \\\\\text { otherwise } & \Rightarrow \text { ham }\\\end{array}\right.$
  图解，又是好图~

Code to train a Naïve Bayes classifier

NLTK有朴素贝叶斯的封装可以直接用

from nltk import NaiveBayesClassifier, classify

def train(features, propotion):
    train_size = int(len(features) * propotion)
    #initialize the training and testing sets
    train_set, test_set = features[:train_size], features[train_size:]
    print(f"Training set size: {str(len(train_set))} emails")
    print(f"Test set size: {str(len(test_set))} emails")
    #train the classifier
    classifier = NaiveBayesClassifier.train(train_set)
    return train_set, test_set, classifier
#Apply the train function using 80% (or a similar proportion) of emails for training
train_set, test_set, classifier = train(all_features, 0.8)

Exercise 2.6

Suppose you have five spam emails and ten ham emails. What are the conditional probabilities for P('prescription':True | spam), P('meeting':True | ham), P('stock':True | spam), and P('stock':True | ham), if

Two spam emails contain the word prescription?
One spam email contains the word stock?
Three ham emails contain the word stock?
Five ham emails contain the word meeting?

Step 5: Evaluate your classifier

Finally, let’s evaluate how well the classifier performs in detecting whether an email is spam or ham. For that, let’s use the accuracy score returned by the NLTK’s classifier. In addition, the NLTK’s classifier allows you to inspect the most informative features (words). For that, you need to specify the number of the top most informative features to look into (e.g., 50 here).
code
```
def evaluate(train_set, test_set,classifier):
    # chech how the classfiier performs on the training and test sets
    print(f"Accuracy on the training set: {str(classify.accuracy(classifier, train_set))}")
    print(f"Accuracy of the test set: {str(classify.accuracy(classifier, test_set))}")
    # check which words are most informative for the classifier
    classifier.show_most_informative_features(50)
evaluate(train_set, test_set, classifier)
```
the most informative features—that is, the list of words that are most strongly associated with a particular class. This is functionality of the classifier that is implemented in NLTK, so all you need to do is call on this function as classifier.show_most_informative_features and specify the number of words nthat you want to see as an argument. This function then returns the top n words ordered by their “informativeness” or predictive power. Behind the scenes, the function measures “informativeness” as the highest value of the difference in probabilities between P(feature | spam) and P(feature | ham)— that is, max[P(word: True | ham) / P(word: True | spam)] for most predictive ham features, and max[P(word: True | spam) / P(word: True | ham)] for most predictive spam features (check out NLTK’s documentation for more information: https://www.nltk.org/api/nltk.classify.naivebayes.html). The output shows that such words (features) as prescription(处方), pain, health, and so on are much more strongly associated with spam emails. The ratios on the right show the comparative probabilities for the two classes: for instance, P("prescription" | spam) is 130.9 times higher than P("prescription" | ham). On the other hand, nomination is more strongly associated with ham emails. As you can see, many spam emails in this dataset are related to medications, which shows a particular bias; the most typical spam that you personally get might be on a different topic altogether.

One other piece of information presented in this output is accuracy. Test accuracy shows the proportion of test emails that are correctly classified by Naïve Bayes among all test emails. The preceding(前面的) code measures the accuracy on both the training data and test data. Note that since the classifier is trained on the training data, it actually gets to “see” all the correct labels for the training examples. Shouldn’t it then know the correct answers and perform at 100% accuracy on the training data? Well, the point here is that the classifier doesn’t just retrieve（挽回） the correct answers; during training, it has built some probabilistic model (i.e., learned about the distribution of classes and the probability of different features), and then it applies this model to the data. So, it is actually very likely that the probabilistic model doesn’t capture all the things in the data 100% correctly. For example, there might be noise and inconsistencies in the real emails. Note that “2004” gets strongly associated with the spam emails and “2001” with the ham emails, although it does not mean that there is anything peculiar（奇怪的） about the spam originating from 2004. This might simply show a bias in the particular dataset, and such phenomena are hard to filter out in any real data, especially when you rely on the data collected by other researchers. This means that if some ham email in training data contains “2004” and a variety of other words that are otherwise related to spam, this email might get misclassified as spam by the algorithm. Similarly, as many medication-related words are strongly associated with spam, a rare ham email that is actually talking about some medication the user ordered might get misclassified as spam.

这里的训练集测试集的accuracy都不错，说明泛化能力还可以

Code to check the contexts of specific words

If you’d like to gain any further insight into how the words are used in the emails from different classes, you can also check the occurrences of any particular word in all available contexts. For example, the word stocks features as a very strong predictor of spam messages. Why is that? You might be thinking, Okay, some emails containing “stocks” will be spam, but surely there must be contexts where “stocks” is used in a completely legitimate way?

from nltk.text import Text

def concordance(data_list, search_word):
    for email in data_list:
        word_list = [word for word in word_tokenize(email.lower())]
        text_list = Text(word_list)
        if search_word in word_list:
            text_list.concordance(search_word)
print ("STOCKS in HAM:")
concordance(ham_list, "stocks")
print ("\n\nSTOCKS in SPAM:")
concordance(spam_list, "stocks")

This code shows how you can use NLTK’s concordancer(语料库检索工具), a tool that checks for the occurrences of the specified word and prints out this word in its contexts. By default, NLTK’s concordancer prints out the search_word surrounded by the previous 36 and the following 36 characters, so note that it doesn’t always result in full words. Once the use of concordancer is defined in the function concordance, you can apply it to both ham_list and spam_list to find the different contexts of the word stocks.

2.4 Deploying your spam filter in practice

apply spam filtering to new emails

test_spam_list = ["Participate in our new lottery!", "Try out this new medicine"]
test_ham_list = ["See the minutes from the last meeting attached", 
                 "Investors are coming to our office on Monday"]

test_emails = [(email_content, "spam") for email_content in test_spam_list]
test_emails += [(email_content, "ham") for email_content in test_ham_list]

new_test_set = [(get_features(email), label) for (email, label) in test_emails]

evaluate(train_set, new_test_set, classifier)

print out the predicted label

for email in test_spam_list:
    print (email)
    print (classifier.classify(get_features(email)))
for email in test_ham_list:
    print (email)
    print (classifier.classify(get_features(email)))

Participate in our new lottery!
spam
Try out this new medicine
spam
See the minutes from the last meeting attached
ham
Investors are coming to our office on Monday
ham

classify the emails read in from the keyboard

while True:
    email = input("Type in your email here (or press 'Enter'): ")
    if len(email)==0:
        break
    else: 
        prediction = classifier.classify(get_features(email))
        print (f"This email is likely {prediction}\n")

这还挺好玩的~

感觉这本书的逻辑很清晰，而且很详细，很多地方真的是手把手step-by-step的讲解了

Summary

Classification is concerned with assigning objects to a predefined set of categories, groups, or classes based on their characteristic properties. There is a whole family of NLP and machine-learning tasks that deal with classification.
Humans perform classification on a regular basis, and machine-learning algorithms can be taught to do that if provided with a sufficient number of examples and some guidance from humans. When the labeled examples and the general outline of the task are provided for the machine, this is called supervised learning.
Spam filtering is an example of a binary classification task. The machine has to learn to distinguish between exactly two classes—spam and normal email (often called ham).
Classification relies on specific properties of the classified objects. In machine learning terms, such properties are called features. For spam filtering, some of the most informative features are words used in the emails.
To build a spam-filtering algorithm, you can use one of the publicly available spam datasets. One such dataset is the Enron spam dataset.
A classifier can be built in five steps: (1) reading emails and defining classes, (2) extracting content, (3) converting the content into features, (4) training an algorithm on the training set, and (5) evaluating it on the test set.
The data comes in as a single string of symbols. To extract the words from it, you may rely on the NLP tools called tokenizers (or use regular expressions to implement one yourself).
NLP libraries, such as the Natural Language Processing Toolkit (NLTK), come with such tools, as well as implementations of a range of frequently used classifiers.
There are several machine-learning classifiers, and one of the most interpretable among them is Naïve Bayes. This is a probabilistic classifier. It assumes that the data in two classes is generated by different probability distributions, which are learned from the training data. Despite its simplicity and “naïve” feature independence assumption, Naïve Bayes often performs well in practice and sets a competitive baseline for other more sophisticated algorithms.
It is important that you split your data into training sets (e.g., 80% of the original dataset) and test sets (the rest of the data), and train the classifier on the training data only so that you can assess it on the test set in a fair way. The test set serves as new unseen data for the algorithm, so you can come to a realistic conclusion about how your classifier may perform in practice.
Processing Toolkit (NLTK), come with such tools, as well as implementations of a range of frequently used classifiers.
There are several machine-learning classifiers, and one of the most interpretable among them is Naïve Bayes. This is a probabilistic classifier. It assumes that the data in two classes is generated by different probability distributions, which are learned from the training data. Despite its simplicity and “naïve” feature independence assumption, Naïve Bayes often performs well in practice and sets a competitive baseline for other more sophisticated algorithms.
It is important that you split your data into training sets (e.g., 80% of the original dataset) and test sets (the rest of the data), and train the classifier on the training data only so that you can assess it on the test set in a fair way. The test set serves as new unseen data for the algorithm, so you can come to a realistic conclusion about how your classifier may perform in practice.
Once satisfied with the performance of your classifier on the test data, you can deploy（部署，利用） it in practice.

你可能感兴趣的:(NLP,自然语言处理,分类,人工智能)

决策树算法及其python实例 m0_74831463 算法决策树 python
一、决策数的概念什么是决策树算法呢？决策树（DecisionTree）是一种基本的分类与回归方法，本文主要讨论分类决策树。决策树模型呈树形结构，在分类问题中，表示基于特征对数据进行分类的过程。它可以认为是if-then规则的集合。每个内部节点表示在属性上的一个测试，每个分支代表一个测试输出，每个叶节点代表一种类别二、决策树的构造1、决策树的构造步骤输入：训练集D={(21,11),(z2,32),
存算一体与存算分离：架构设计的深度解析与实现方案克里斯蒂亚诺罗纳尔多阿维罗大数据数据库
随着数据量的不断增大和对计算能力的需求日益提高，存算一体作为一种新型架构设计理念，在大数据处理、云计算和人工智能等领域正逐步引起广泛关注。在深入探讨存算一体之前，我们需要先了解存储和计算的基本概念，以及存算分离和存算一体之间的区别。什么是存算一体？存算一体，顾名思义，是将数据存储与计算资源紧密结合，形成一个统一的架构。在这种架构下，存储和计算不仅在物理层面上结合，更在架构设计上深度融合。具体来说，
CCF CSP 历年真题 C语言版满分代码集合 (至2021.9 持续更新中 JY_0329 CCF c语言开发语言 csp ccf 算法
CCFCSP历年真题C语言版满分代码集合（全部原创）2021-9-1数组推导2021-9-2非零段划分2021-4-1灰度直方图2021-4-2领域均值2020-12-1期末预测之安全指数2020-12-2期末预测之最佳阈值2020-9-1称检测点查询2020-9-2风险人群筛查2020-6-1线性分类器2020-6-2稀疏向量2019-12-1报数2019-12-2回收站选址2019-9-1小明
python学智能算法（八）|决策树西猫雷婶人工智能 python学习笔记机器学习 python 决策树开发语言
【1】引言前序学习进程中，已经对KNN邻近算法有了探索，相关文章链接为：python学智能算法（七）|KNN邻近算法-CSDN博客但KNN邻近算法有一个特点是：它在分类的时候，不能知晓每个类别内事物的具体面貌，只能获得类别，停留在事物的表面。为了进一步探索事物的内在特征，就需要学习新的算法。本篇文章就是在KNN的基础上学习新算法：决策树。【2】原理分析在学习决策树执之前，需要先了解香农熵。本科学控
自动语音识别（ASR）：技术、应用与未来 ajie1117 语音识别人工智能
自动语音识别（ASR）：技术、应用与未来1.ASR简介自动语音识别（ASR，AutomaticSpeechRecognition）是一种将语音转换为文本的技术。它利用人工智能（AI）、深度学习和自然语言处理（NLP）技术来识别和理解人类的语言，使计算机能够与人类进行更自然的交互。2.ASR的工作原理ASR的核心流程通常包括以下几个步骤：语音信号采集：通过麦克风或其他设备获取音频数据。预处理：去除噪
机器学习是怎么一步一步由神经网络发展到今天的Transformer架构的？ yuanpan 机器学习神经网络 transformer
机器学习和神经网络的发展经历了一系列重要的架构和技术阶段。以下是更全面的总结，涵盖了从早期神经网络到卷积神经网络之前的架构演变：1.早期神经网络：感知机（Perceptron）时间：1950年代末至1960年代。背景：感知机由FrankRosenblatt提出，是第一个具有学习能力的神经网络模型。它由单层神经元组成，可以用于简单的二分类任务。特点：输入层和输出层之间直接连接，没有隐藏层。使用简单的
30秒生成电子合同：B2B系统+AI引擎缩短80%交易周期|数商云数商云网络 B2B系统数字化电商平台人工智能大数据云计算数据库运维 java spring
引言在数字经济时代，B2B（Business-to-Business）电子商务正在以前所未有的速度改变着企业的运营模式。随着交易量的不断攀升，传统的合同生成和审核流程逐渐成为制约交易效率的瓶颈。然而，随着人工智能（AI）技术的飞速发展，结合B2B系统的智能化升级，我们正见证一场合同生成效率的革命。本文将深入探讨“30秒生成电子合同：B2B系统+AI引擎缩短80%交易周期”这一创新模式，解析其背后的
【北京迅为】iTOP-RK3568开发板OpenHarmony系统南向驱动开发UART接口运作机制迅为电子 RK3568开发板 RK3568开发板 OpenHarmony
瑞芯微RK3568芯片是一款定位中高端的通用型SOC，采用22nm制程工艺，搭载一颗四核Cortex-A55处理器和MaliG522EE图形处理器。RK3568支持4K解码和1080P编码，支持SATA/PCIE/USB3.0外围接口。RK3568内置独立NPU，可用于轻量级人工智能应用。RK3568支持安卓11和linux系统，主要面向物联网网关、NVR存储、工控平板、工业检测、工控盒、卡拉OK
大学期间如何学习利用AI der丸子吱吱吱学习人工智能
一、引言人工智能（AI）是当今世界技术发展的重要方向，它已经渗透到医疗、金融、交通、娱乐等各个领域。随着AI技术的快速发展，它不仅改变了我们的生活，也带来了巨大的职业机会。然而，面对如此广阔的领域，作为大学生，如何在本科阶段有效地学习和利用AI，成了许多同学的困惑。本文将详细介绍大学生在本科阶段如何通过合理的学习路线、方法和工具，逐步掌握AI的核心技术，并为日后进入AI行业打下坚实的基础。通过这篇
全面掌握Python：从安装到基础再到进阶的系统学习之路（附代码，建议新手收藏） der丸子吱吱吱 python 学习开发语言新手入门代码
Python，作为一种现代化的高级编程语言，因其简洁易懂的语法和强大的功能，成为了数据科学、人工智能、Web开发等多个领域的首选语言。在这篇文章中，我们将从大学课本的结构来详细介绍Python，帮助大家从零基础开始，逐步深入掌握Python的各个方面。目录第一章：Python简介与安装1.1Python语言概述1.2安装Python1.3Python的开发环境1.4第一个Python程序第二章：基
Scrum实施情况调查之案例分析 zhijie435 项目管理 thoughtworks 敏捷项目管理敏捷开发工作框架
导读：社区Agile主题敏捷实施,企业级敏捷标签Scrum作者李剑，在InfoQ中文站上发表了一篇"Scrum在中国——企业实施情况调查实录"。这份调查实录，分别调查了五个实施SCRUM的公司，其中三家公司实施成功，二家公司失败。我建议所有准备或者正在实施SCRUM的人们都能来读一下。在此，我们会对这篇文章中的案例分类进行分析、诊断。并探讨什么是敏捷开发方法、什么是SCRUM、使用敏捷方法需要什么
yum install locate出现Error: Unable to find match: locate解决方案爱编程的喵喵 Linux解决方案 linux locate yum 解决方案
大家好，我是爱编程的喵喵。双985硕士毕业，现担任全栈工程师一职，热衷于将数据思维应用到工作与生活中。从事机器学习以及相关的前后端开发工作。曾在阿里云、科大讯飞、CCF等比赛获得多次Top名次。现为CSDN博客专家、人工智能领域优质创作者。喜欢通过博客创作的方式对所学的知识进行总结与归纳，不仅形成深入且独到的理解，而且能够帮助新手快速入门。本文主要介绍了yuminstalllocate出现
【人工智能机器学习基础篇】——深入详解无监督学习之降维：PCA与t-SNE的关键概念与核心原理猿享天开人工智能数学基础专讲人工智能机器学习无监督学习降维
深入详解无监督学习之降维：PCA与t-SNE的关键概念与核心原理在当今数据驱动的世界中，数据维度的增多带来了计算复杂性和存储挑战，同时也可能导致模型性能下降，这一现象被称为“维度诅咒”（CurseofDimensionality）。降维作为一种重要的特征提取和数据预处理技术，旨在通过减少数据的维度，保留其主要信息，从而简化数据处理过程，并提升模型的性能。本文将深入探讨两种广泛应用于无监督学习中的降
模型上下文协议 (MCP)是什么？Model Context Protocol 需要你了解一下同学小张学习 AIGC AI-native agi gpt 开源协议
大家好，我是同学小张，+v:jasper_8017一起交流，持续学习AI大模型应用实战案例，持续分享，欢迎大家点赞+关注，订阅我的大模型专栏，共同学习和进步。在人工智能领域，ModelContextProtocol（MCP）正逐渐成为连接AI模型与各类数据源及工具的重要标准。MCP究竟为何物？它又将如何改变AI应用的开发与使用？文章目录0.概念1.MCP的总体架构2.为何使用MCP？3.我的理解4
生成式对抗网络在人工智能艺术创作中的应用与创新研究辛迎蕌人工智能
摘要本文深入探究生成式对抗网络（GAN）在人工智能艺术创作领域的应用与创新。通过剖析GAN核心原理，阐述其在图像、音乐、文学等艺术创作中的实践，分析面临的挑战与创新方向，呈现GAN对艺术创作模式的变革，为理解人工智能与艺术融合发展提供全面视角。一、引言在人工智能与艺术深度融合的时代浪潮中，生成式对抗网络（GAN）作为一项突破性技术，为艺术创作带来了全新的可能性。它打破传统创作边界，以独特的对抗学习
知识图谱在人工智能语义理解与推理中的关键作用及发展研究 @王威& 人工智能
摘要本文聚焦知识图谱，深入剖析其在人工智能语义理解与推理中的核心作用。阐述知识图谱的构建原理、表示方法，分析其在自然语言处理、智能问答系统、推荐系统等多领域助力语义理解与推理的应用，探讨面临的挑战并展望未来发展方向，全面呈现知识图谱对人工智能发展的重要价值与深远影响。一、引言在人工智能追求更精准理解和处理人类语言与知识的进程中，知识图谱成为关键技术。它以结构化形式组织海量知识，揭示实体间复杂关系，
耦合与解耦：软件工程中的核心矛盾与破局之道以恒1 软件工程
耦合与解耦：软件工程中的核心矛盾与破局之道在软件开发领域，耦合与解耦是贯穿始终的核心矛盾。它们如同硬币的两面，既相互对立又紧密依存。本文将从概念解析、类型分类、解耦策略到实际应用，全面剖析这对矛盾体的本质与破局之道。一、耦合的本质：依赖关系的多维透视耦合（Coupling）指软件系统中不同模块、组件或服务之间的相互依赖程度。这种依赖可能表现为数据传递、控制流交互或资源共享。根据耦合强度，可分为七种
HarmonyOS实战开发-如何打造购物商城APP。码牛程序猿鸿蒙工程师 HarmonyOS 鸿蒙 harmonyos OpenHarmony 鸿蒙鸿蒙应用开发华为鸿蒙开发 HarmonyOS
今天给大家分享一个非常好的实战项目，购物商城，购物商城是一个集购物、娱乐、服务于一体的综合性平台，致力于为消费者提供一站式的购物体验。各种功能都有涉及，最适合实现学习。做好商城项目，肯定会把开发中遇到的百分之60的技术得到实战的经验。下面介绍一下商城的主要模块：首页1，搜索框，点击进入搜索页面2，顶部分类，通过不同分类查询对应信息3，广告轮播，自动切换图片，可以进行点击进入4，商品列表，展示每个项
Flink启动任务 swg321321 flink 大数据
Flink以本地运行作为解读例如：第一章Python机器学习入门之pandas的使用提示：写完文章后，目录可以自动生成，如何生成可参考右边的帮助文档文章目录Flink前言StreamExecutionEnvironmentLocalExecutorMiniClusterStreamGraph二、使用步骤1.引入库2.读入数据总结前言提示：这里可以添加本文要记录的大概内容：例如：随着人工智能的不断发
计算机专业毕业设计题目推荐（新颖选题）本科计算机人工智能专业相关毕业设计选题大全✅ 会写代码的羊毕设选题课程设计人工智能毕业设计毕设题目毕业设计题目 ai AI编程
文章目录前言最新毕设选题（建议收藏起来）本科计算机人工智能专业相关的毕业设计选题毕设作品推荐前言2025全新毕业设计项目博主介绍：✌全网粉丝10W+,CSDN全栈领域优质创作者，博客之星、掘金/华为云/阿里云等平台优质作者。技术范围：SpringBoot、Vue、SSM、HLMT、Jsp、PHP、Nodejs、Python、爬虫、数据可视化、小程序、大数据、机器学习等设计与开发。主要内容：免费功能
AI人工智能 Agent：在赋能传统行业中的应用 AI天才研究院计算 DeepSeek R1 &大数据AI人工智能大模型计算科学神经计算深度学习神经网络大数据人工智能大型语言模型 AI AGI LLM Java Python 架构设计 Agent RPA
AI人工智能Agent：在赋能传统行业中的应用1.背景介绍1.1人工智能的发展历程1.1.1人工智能的起源与发展1.1.2人工智能的三次浪潮1.1.3人工智能的现状与挑战1.2传统行业面临的困境1.2.1效率低下1.2.2成本高企1.2.3决策滞后1.3人工智能赋能传统行业的必要性1.3.1提高效率1.3.2降低成本1.3.3优化决策2.核心概念与联系2.1人工智能Agent的定义2.1.1Age
《深度剖析：BERT与GPT——自然语言处理架构的璀璨双星》人工智能深度学习
在自然语言处理（NLP）的广袤星空中，BERT（BidirectionalEncoderRepresentationsfromTransformers）与GPT（GenerativePretrainedTransformer）系列模型宛如两颗最为耀眼的星辰，引领着NLP技术不断迈向新的高度。它们基于独特的架构设计，以强大的语言理解与生成能力，彻底革新了NLP的研究与应用范式，成为学界和业界竞相探索
“四预”驱动数字孪生水利：让智慧治水守护山河安澜 GeoSaaS 实景三维智慧城市人工智能 gis 大数据安全
近年来，从黄河秋汛到海河特大洪水，从珠江流域性洪灾到长江罕见骤旱，极端天气频发让水安全问题备受关注。如何实现“治水于未发”？数字孪生水利以“预报、预警、预演、预案”（四预）为核心，正在掀起一场水利治理的智慧革命。一、数字孪生水利：从物理世界到虚拟镜像的跃迁数字孪生水利并非简单的“数字建模”，而是通过高精度传感器、大数据、人工智能等技术，在虚拟空间构建与物理流域完全映射的“数字分身”，实现水情、工情
硬件NAS将成为电子垃圾？ DeepSeek+NAS 家用NAS WinNAS 飞牛NAS 人工智能安卓NAS
随着人工智能（AI）技术的快速发展，传统的NAS设备正面临一场深刻的变革。过去，NAS的主要功能是提供数据存储和共享服务，但在AI时代，单纯的存储功能已无法满足用户需求。未来的NAS必须集成本地AI能力，才能成为真正的AI-NAS。然而，当前市场上的NAS产品硬件配置普遍较低，无法支持本地AI的运行。因此，现有的硬件NAS在三年内可能会被淘汰，取而代之的将是集成了AI和NAS功能的家用AI服务器。
Hugging Face预训练GPT微调ChatGPT（微调入门！新手友好！） y江江江江机器学习大模型 gpt chatgpt
HuggingFace预训练GPT微调ChatGPT（微调入门！新手友好！）在实战中，⼤多数情况下都不需要从0开始训练模型，⽽是使⽤“⼤⼚”或者其他研究者开源的已经训练好的⼤模型。在各种⼤模型开源库中，最具代表性的就是HuggingFace。HuggingFace是⼀家专注于NLP领域的AI公司，开发了⼀个名为Transformers的开源库，该开源库拥有许多预训练后的深度学习模型，如BERT、G
【DeepSeek】全方位使用指南————简版諰. 人工智能 ai AI写作
一、平台概述DeepSeek（深度求索）是专注实现AGI的中国的人工智能公司，提供多款AI产品：智能对话（Chat）文生图（Art）代码助手（Coder）API开发接口企业定制解决方案二、注册与登录2.1账号创建访问官网https://www.deepseek.com点击右上角「注册」支持三种方式：手机号+短信验证邮箱注册（需验证邮件）第三方登录（微信/Google账号）2.2订阅计划套餐类型免费
CAPL变量输出的格式说明符正当少年 CAPL CAPL
在CAPL（CANAccessProgrammingLanguage）中，变量输出的格式说明符用于控制变量在输出时的显示格式。以下是常用的CAPL变量输出格式说明符分类整理：以下是CAPL变量格式说明符的具体实例，展示了如何使用这些说明符来输出不同类型的变量：1.整数类型%d输出有符号十进制整数。intx=123;write("Value:%d",x);//输出:Value:123%u输出无符号十
自学网络安全（黑客技术）2025年 —三个月学习计划 csbDD web安全学习安全网络 python
基于入门网络安全/黑客打造的：黑客&网络安全入门&进阶学习资源包前言什么是网络安全网络安全可以基于攻击和防御视角来分类，我们经常听到的“红队”、“渗透测试”等就是研究攻击技术，而“蓝队”、“安全运营”、“安全运维”则研究防御技术。如何成为一名黑客很多朋友在学习安全方面都会半路转行，因为不知如何去学，在这里，我将这个整份答案分为黑客（网络安全）入门必备、黑客（网络安全）职业指南、黑客（网络安全）学习
图像识别技术与应用课后总结（20）一元钱面包人工智能
图像分割概念图像分割是把图像中不同像素划分到不同类别，预测目标轮廓，属于细粒度分类。比如将图像里不同物体、背景等区分开来，就像把一幅画里的各个元素精准归类。应用场景人像抠图：能精准分离人物和背景，用于图片编辑、影视制作等，比如去除照片背景换背景。医学组织提取：在医学影像（如CT、MRI图像）中分离出不同组织，辅助疾病诊断、手术规划等。遥感图像分析：分析卫星或航空遥感图像时，区分土地、植被、建筑等不
GOT-OCR2.0：突破性端到端架构与高精度文本识别的技术创新 XianxinMao 人工智能深度学习
GOT-OCR2.0在技术上的突破与优势GOT-OCR2.0在技术上实现了对传统OCR系统的显著超越，主要体现在其采用了统一的端到端（End-to-End）架构。这一架构的创新性设计带来了多方面的提升，具体包括以下几个关键方面：1.统一的端到端架构传统OCR系统的局限：传统的OCR流程通常由多个独立的模块组成，如图像预处理、字符分割、特征提取、分类识别等。这种多步处理方式不仅增加了系统的复杂性，还
设计模式介绍 tntxia 设计模式
设计模式来源于土木工程师克里斯托弗亚历山大（http://en.wikipedia.org/wiki/Christopher_Alexander）的早期作品。他经常发表一些作品，内容是总结他在解决设计问题方面的经验，以及这些知识与城市和建筑模式之间有何关联。有一天，亚历山大突然发现，重复使用这些模式可以让某些设计构造取得我们期望的最佳效果。亚历山大与萨拉-石川佳纯和穆雷西乐弗斯坦合作
android高级组件使用(一) 百合不是茶 android RatingBar Spinner
1、自动完成文本框（AutoCompleteTextView） AutoCompleteTextView从EditText派生出来，实际上也是一个文本编辑框，但它比普通编辑框多一个功能：当用户输入一个字符后，自动完成文本框会显示一个下拉菜单，供用户从中选择，当用户选择某个菜单项之后，AutoCompleteTextView按用户选择自动填写该文本框。使用AutoCompleteTex
[网络与通讯]路由器市场大有潜力可挖掘 comsci 网络
如果国内的电子厂商和计算机设备厂商觉得手机市场已经有点饱和了,那么可以考虑一下交换机和路由器市场的进入问题..... 这方面的技术和知识,目前处在一个开放型的状态,有利于各类小型电子企业进入 &nbs
自写简单Redis内存统计shell 商人shang Linux shell 统计Redis内存
#!/bin/bash address="192.168.150.128:6666,192.168.150.128:6666" hosts=(${address//,/ }) sfile="staticts.log" for hostitem in ${hosts[@]} do ipport=(${hostitem
单例模式(饿汉 vs懒汉) oloz 单例模式
package 单例模式; /* * 应用场景:保证在整个应用之中某个对象的实例只有一个 * 单例模式种的《懒汉模式》 * */ public class Singleton { //01 将构造方法私有化，外界就无法用new Singleton()的方式获得实例 private Singleton(){}; //02 申明类得唯一实例 priva
springMvc json支持杨白白 json springmvc
1.Spring mvc处理json需要使用jackson的类库，因此需要先引入jackson包 2在spring mvc中解析输入为json格式的数据:使用@RequestBody来设置输入 @RequestMapping("helloJson") public @ResponseBody JsonTest helloJson() {
android播放，掃描添加本地音頻文件小桔子
最近幾乎沒有什麽事情，繼續鼓搗我的小東西。想在項目中加入一個簡易的音樂播放器功能，就像華為p6桌面上那麼大小的音樂播放器。用過天天動聽或者QQ音樂播放器的人都知道，可已通過本地掃描添加歌曲。不知道他們是怎麼實現的，我覺得應該掃描設備上的所有文件，過濾出音頻文件，每個文件實例化為一個實體，記錄文件名、路徑、歌手、類型、大小等信息。具體算法思想，
oracle常用命令 aichenglong oracle dba 常用命令
1 创建临时表空间 create temporary tablespace user_temp tempfile 'D:\oracle\oradata\Oracle9i\user_temp.dbf' size 50m autoextend on next 50m maxsize 20480m extent management local
25个Eclipse插件 AILIKES eclipse插件
提高代码质量的插件1. FindBugsFindBugs可以帮你找到Java代码中的bug，它使用Lesser GNU Public License的自由软件许可。2. CheckstyleCheckstyle插件可以集成到Eclipse IDE中去，能确保Java代码遵循标准代码样式。3. ECLemmaECLemma是一款拥有Eclipse Public License许可的免费工具，它提供了
Spring MVC拦截器+注解方式实现防止表单重复提交 baalwolf spring mvc
原理：在新建页面中Session保存token随机码，当保存时验证，通过后删除，当再次点击保存时由于服务器端的Session中已经不存在了，所有无法验证通过。 1.新建注解： ? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
《Javascript高级程序设计(第3版)》闭包理解 bijian1013 JavaScript
“闭包是指有权访问另一个函数作用域中的变量的函数。”--《Javascript高级程序设计(第3版)》看以下代码： <script type="text/javascript"> function outer() { var i = 10; return f
AngularJS Module类的方法 bijian1013 JavaScript AngularJS Module
AngularJS中的Module类负责定义应用如何启动，它还可以通过声明的方式定义应用中的各个片段。我们来看看它是如何实现这些功能的。一.Main方法在哪里如果你是从Java或者Python编程语言转过来的，那么你可能很想知道AngularJS里面的main方法在哪里？这个把所
[Maven学习笔记七]Maven插件和目标 bit1129 maven插件
插件(plugin)和目标(goal) Maven，就其本质而言，是一个插件执行框架，Maven的每个目标的执行逻辑都是由插件来完成的，一个插件可以有1个或者几个目标，比如maven-compiler-plugin插件包含compile和testCompile，即maven-compiler-plugin提供了源代码编译和测试源代码编译的两个目标使用插件和目标使得我们可以干预
【Hadoop八】Yarn的资源调度策略 bit1129 hadoop
1. Hadoop的三种调度策略 Hadoop提供了3中作业调用的策略， FIFO Scheduler Fair Scheduler Capacity Scheduler 以上三种调度算法，在Hadoop MR1中就引入了，在Yarn中对它们进行了改进和完善.Fair和Capacity Scheduler用于多用户共享的资源调度 2. 多用户资源共享的调度
Nginx使用Linux内存加速静态文件访问 ronin47
Nginx是一个非常出色的静态资源web服务器。如果你嫌它还不够快，可以把放在磁盘中的文件，映射到内存中，减少高并发下的磁盘IO。先做几个假设。nginx.conf中所配置站点的路径是/home/wwwroot/res，站点所对应文件原始存储路径：/opt/web/res shell脚本非常简单，思路就是拷贝资源文件到内存中，然后在把网站的静态文件链接指向到内存中即可。具体如下：
关于Unity3D中的Shader的知识 brotherlamp unity unity资料 unity教程 unity视频 unity自学
首先先解释下Unity3D的Shader，Unity里面的Shaders是使用一种叫ShaderLab的语言编写的，它同微软的FX文件或者NVIDIA的CgFX有些类似。传统意义上的vertex shader和pixel shader还是使用标准的Cg/HLSL 编程语言编写的。因此Unity文档里面的Shader，都是指用ShaderLab编写的代码，然后我们来看下Unity3D自带的60多个S
CopyOnWriteArrayList vs ArrayList bylijinnan java
package com.ljn.base; import java.util.ArrayList; import java.util.Iterator; import java.util.List; import java.util.concurrent.CopyOnWriteArrayList; /** * 总述： * 1.ArrayListi不是线程安全的，CopyO
内存中栈和堆的区别 chicony 内存
1、内存分配方面：堆：一般由程序员分配释放，若程序员不释放，程序结束时可能由OS回收。注意它与数据结构中的堆是两回事，分配方式是类似于链表。可能用到的关键字如下：new、malloc、delete、free等等。栈：由编译器(Compiler)自动分配释放，存放函数的参数值，局部变量的值等。其操作方式类似于数据结构中
回答一位网友对Scala的提问 chenchao051 scala map
本来准备在私信里直接回复了，但是发现不太方便，就简要回答在这里。问题写道对于scala的简洁十分佩服，但又觉得比较晦涩，例如一例，Map("a" -> List(11,111)).flatMap(_._2)，可否说下最后那个函数做了什么，真正在开发的时候也会如此简洁？谢谢先回答一点，在实际使用中，Scala毫无疑问就是这么简单。
mysql 取每组前几条记录 daizj mysql 分组最大值最小值每组三条记录
一、对分组的记录取前N条记录：例如：取每组的前3条最大的记录 1.用子查询： SELECT * FROM tableName a WHERE 3> (SELECT COUNT(*) FROM tableName b WHERE b.id=a.id AND b.cnt>a. cnt) ORDER BY a.id,a.account DE
HTTP深入浅出 http请求 dcj3sjt126com http
HTTP(HyperText Transfer Protocol)是一套计算机通过网络进行通信的规则。计算机专家设计出HTTP，使HTTP客户（如Web浏览器）能够从HTTP服务器(Web服务器)请求信息和服务，HTTP目前协议的版本是1.1.HTTP是一种无状态的协议，无状态是指Web浏览器和Web服务器之间不需要建立持久的连接，这意味着当一个客户端向服务器端发出请求，然后We
判断MySQL记录是否存在方法比较 dcj3sjt126com mysql
把数据写入到数据库的时，常常会碰到先要检测要插入的记录是否存在，然后决定是否要写入。　　我这里总结了判断记录是否存在的常用方法：　　sql语句： select count ( * ) from tablename; 　　然后读取count(*)的值判断记录是否存在。对于这种方法性能上有些浪费，我们只是想判断记录记录是否存在，没有必要全部都查出来。
对HTML XML的一点认识 e200702084 html xml
感谢http://www.w3school.com.cn提供的资料 HTML 文档中的每个成分都是一个节点。节点根据 DOM，HTML 文档中的每个成分都是一个节点。 DOM 是这样规定的：整个文档是一个文档节点每个 HTML 标签是一个元素节点包含在 HTML 元素中的文本是文本节点每一个 HTML 属性是一个属性节点注释属于注释节点 Node 层次
jquery分页插件 genaiwei jquery Web 前端分页插件
//jquery页码控件// 创建一个闭包 (function($) { // 插件的定义 $.fn.pageTool = function(options) { var totalPa
Mybatis与Ibatis对照入门于学习 Josh_Persistence mybatis ibatis 区别联系
一、为什么使用IBatis/Mybatis 对于从事 Java EE 的开发人员来说，iBatis 是一个再熟悉不过的持久层框架了，在 Hibernate、JPA 这样的一站式对象 / 关系映射（O/R Mapping）解决方案盛行之前，iBaits 基本是持久层框架的不二选择。即使在持久层框架层出不穷的今天，iBatis 凭借着易学易用、
C中怎样合理决定使用那种整数类型？秋风扫落叶 c 数据类型
如果需要大数值(大于32767或小于32767), 使用long 型。否则, 如果空间很重要 (如有大数组或很多结构), 使用 short 型。除此之外, 就使用 int 型。如果严格定义的溢出特征很重要而负值无关紧要, 或者你希望在操作二进制位和字节时避免符号扩展的问题, 请使用对应的无符号类型。但是, 要注意在表达式中混用有符号和无符号值的情况。 &nbs
maven问题 zhb8015 maven问题
问题1： Eclipse 中新建maven项目无法添加src/main/java 问题 eclipse创建maevn web项目，在选择maven_archetype_web原型后，默认只有src/main/resources这个Source Floder。按照maven目录结构，添加src/main/ja
(二)androidpn-server tomcat版源码解析之--push消息处理 spjich java androdipn 推送
在 (一)androidpn-server tomcat版源码解析之--项目启动这篇中，已经描述了整个推送服务器的启动过程，并且把握到了消息的入口即XmppIoHandler这个类，今天我将继续往下分析下面的核心代码，主要分为3大块，链接创建，消息的发送，链接关闭。先贴一段XmppIoHandler的部分代码 /** * Invoked from an I/O proc
用js中的formData类型解决ajax提交表单时文件不能被serialize方法序列化的问题中华好儿孙 JavaScript Ajax Web 上传文件 FormData
var formData = new FormData($("#inputFileForm")[0]); $.ajax({ type:'post', url:webRoot+"/electronicContractUrl/webapp/uploadfile", data:formData, async: false, ca
mybatis常用jdbcType数据类型 ysj5125094 mybatis mapper jdbcType
MyBatis 通过包含的jdbcType 类型 BIT FLOAT CHAR

手把手实现邮件分类 《Getting Started with NLP》chap2：Your first NLP example

《Getting Started with NLP》chap2：Your first NLP example

文章目录

2.1 Introducing NLP in practice: Spam filtering

classification

一个简单的例子

2.2 Understanding the task

Step 1: Define the data and classes

Step 2: Split the text into words

为何需要split into words?

如何split into words(exercise 2.2)

split text string into words by whitespaces

split text string into words by whitespaces and punctuation

tokenizers

Step 3: Extract and normalize the features

Step 4: Train a classifier

Step 5: Evaluate the classifier

2.3 Implementing your own spam filter

Step 1: Define the data and classes

解压文件

read in the contents of the files

verify that the data is uploaded and read in correctly

combine the data into a single structure

Step 2: Split the text into words

run a tokenizer over text

Step 3: Extract and normalize the features

Code to extract the features

这份代码做了啥，很清晰的解释图！

仔细研究一下上面的数据结构

exercise2.5

Step 4: Train the classifier

Code to train a Naïve Bayes classifier

Exercise 2.6

Step 5: Evaluate your classifier

Code to check the contexts of specific words

2.4 Deploying your spam filter in practice

apply spam filtering to new emails

print out the predicted label

classify the emails read in from the keyboard

Summary

你可能感兴趣的:(NLP,自然语言处理,分类,人工智能)

手把手实现邮件分类《Getting Started with NLP》chap2：Your first NLP example