Machine Learning in Action: KNN Code
The KNN classifier is a very popular and well-known supervised machine learning technique. This article explains the KNN classifier with an example.
What is a supervised learning model?
I will explain it in detail. But first, here is how Wikipedia defines it:
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples.
Supervised learning models take input features (X) and output (y) to train a model. The goal of the model is to define a function that can use the input features and calculate the output.
An example will make this clearer.
Here is a dataset that contains the mass, width, height, and color_score of some fruit samples.
The purpose of this dataset is to train a model so that if we input the mass, width, height, and color_score, the model can tell us the name of the fruit. For example, if we input the mass, width, height, and color_score of a piece of fruit as 175, 7.3, 7.2, and 0.61 respectively, the model should output the name of the fruit as apple.
Here mass, width, height, and color_score are the input features (X), and the name of the fruit is the output variable or label (y).
This example may sound silly to you, but the same mechanism is used in very sophisticated supervised machine learning models.
I will show a practical example with a real dataset later.
KNN Classifier
The KNN classifier is an example of a memory-based machine learning model.
That means the model memorizes the labeled training examples and uses them to classify objects it has not seen before.
The k in KNN classifier is the number of training examples it will retrieve in order to predict a new test example.
KNN classifier works in three steps:
- When it is given a new instance or example to classify, it retrieves the training examples it memorized before and finds the k closest examples among them.
- Then the classifier looks up the labels (the name of the fruit in the example above) of those k closest examples.
- Finally, the model combines those labels to make a prediction. Usually, it predicts the majority label. For example, if we choose k to be 5 and the 5 closest examples are 3 oranges and 2 apples, the prediction for the new instance will be orange.
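The three steps above can be sketched in plain Python. This is only an illustration, not how scikit-learn implements it: the function name `knn_predict`, the use of Euclidean distance, and the fruit measurements are all assumptions made up for this example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from the query to every memorized example.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 2: keep the labels of the k closest examples.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    nearest_labels = [label for _, label in nearest]
    # Step 3: predict the most common label among them.
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up fruit samples: (mass, width, height, color_score)
train_X = [
    (176, 7.4, 7.2, 0.60), (172, 7.1, 7.6, 0.92), (166, 6.9, 7.3, 0.93),
    (154, 7.2, 7.2, 0.82), (158, 7.1, 7.5, 0.79), (190, 7.5, 8.1, 0.74),
    (86, 6.2, 4.7, 0.80),
]
train_y = ["apple", "apple", "apple", "orange", "orange", "orange", "mandarin"]

print(knn_predict(train_X, train_y, (175, 7.3, 7.2, 0.61), k=3))  # apple
```

With k=3, the three closest made-up samples are all apples, so the vote returns apple.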
Data Preparation
Before we start, I encourage you to check that you have the following resources available on your computer:
- NumPy library
- Pandas library
- Matplotlib library
- Scikit-Learn library
- Jupyter Notebook environment
If you do not have Jupyter Notebook installed, use any other notebook of your choice. I suggest a Google Colaboratory notebook. Follow this link to start. Just remember one thing:
A Google Colaboratory notebook is not private, so do not do any professional or sensitive work there. It is great for practice, though, because lots of commonly used packages are already installed in it.
I suggest downloading the dataset. I provided the link at the bottom of the page. If you are reading this to learn, run every line of code yourself.
First, import the necessary libraries:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
For this tutorial, I will use the Titanic dataset from Kaggle. I have this dataset uploaded in the same folder as my notebook.
Here is how I can import the dataset in the notebook using pandas.
titanic = pd.read_csv('titanic_data.csv')
# titanic.head() returns the first five rows of the dataset.
# We print only these rows to examine the dataset.
titanic.head()
Look at the second column. It records whether the person survived: 0 means the person did not survive and 1 means the person survived.
For this tutorial, our goal will be to predict the ‘Survived’ feature.
To make it simple, I will keep a few key features that are more important for the algorithm and get rid of the rest.
This dataset is very simple. Just from intuition, we can see that some columns are unlikely to help predict the ‘Survived’ column.
For example, ‘PassengerId’, ‘Name’, ‘Ticket’, and ‘Cabin’ do not seem useful for predicting whether a passenger survived.
I will make a new DataFrame with a few key features and name the new DataFrame titanic1.
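# .copy() makes titanic1 an independent DataFrame, so editing it later does not trigger SettingWithCopyWarning.
titanic1 = titanic[['Pclass', 'Sex', 'Fare', 'Survived']].copy()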
The ‘Sex’ column holds string values, which need to be converted because the model works only with numbers, not words. I will replace ‘male’ with 0 and ‘female’ with 1.
titanic1['Sex'] = titanic1.Sex.replace({'male':0, 'female':1})
This is what the DataFrame titanic1 looks like:
Our goal is to predict the ‘Survived’ column based on the other information in the titanic1 DataFrame. So, the output variable or label (y) is ‘Survived’. The input features (X) are ‘Pclass’, ‘Sex’, and ‘Fare’.
X = titanic1[['Pclass', 'Sex', 'Fare']]
y = titanic1['Survived']
Develop a KNN Classifier
To start with, we need to split the dataset into two sets: a training set and a test set.
We will use the training set to train the model where the model will memorize both the input features and the output variable.
Then we will use the test set to see whether the model can predict if a passenger survived using ‘Pclass’, ‘Sex’, and ‘Fare’.
The function ‘train_test_split’ helps split the data. By default, it uses 75% of the data for the training set and 25% for the test set. If you want, you can change that by specifying ‘train_size’ or ‘test_size’.
If you set train_size to 0.8, the split will be 80% training data and 20% test data. For me, the default of 75% is fine, so I am not using the train_size or test_size parameters.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
Remember to use the same value for ‘random_state’. That way, every time you do this split, the same rows go to the training set and the test set.
I chose random_state as 0. You can choose any integer you like.
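To make the split sizes and the effect of ‘random_state’ concrete, here is a small sketch on made-up data. The arrays `X_demo` and `y_demo` are placeholders invented for this illustration, not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(40).reshape(20, 2)   # 20 made-up rows with 2 features each
y_demo = np.arange(20) % 2              # made-up binary labels

# Default split: 25% of the rows go to the test set.
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, random_state=0)
print(len(a_train), len(a_test))        # 15 5

# The same random_state gives an identical split on every call.
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, random_state=0)
print(np.array_equal(a_test, b_test))   # True

# An explicit 80/20 split.
c_train, c_test, _, _ = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)
print(len(c_train), len(c_test))        # 16 4
```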
Python’s scikit-learn library already has a KNN classifier model. I will import that.
from sklearn.neighbors import KNeighborsClassifier
Save this classifier in a variable.
knn = KNeighborsClassifier(n_neighbors = 5)
Here, n_neighbors is 5.
That means when we ask our trained model to predict whether a new instance survived, it will look at the 5 closest training examples.
Based on the labels of those 5 training examples, the model will predict the label of the new instance.
Now, I will fit the model to the training data so that the model can memorize it.
knn.fit(X_train, y_train)
You may think that, because the model memorized the training data, it can predict 100% of the training labels correctly. But that is not certain. Why?
Whenever we give it an input and ask it to predict the label, it takes a vote among the 5 closest neighbors, even if it has memorized an example with exactly the same features. The other four neighbors can outvote that example's own label.
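A tiny sketch on made-up one-dimensional data shows this effect. The arrays `X_toy` and `y_toy` are invented for illustration: one example labelled 1 sits among examples labelled 0.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: the example at 4.0 is labelled 1, all its neighbours are labelled 0.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]])
y_toy = np.array([0, 0, 0, 1, 0, 0, 0])

results = {}
for k in (1, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_toy, y_toy)
    prediction = int(model.predict([[4.0]])[0])   # prediction for a memorized example
    train_accuracy = model.score(X_toy, y_toy)
    results[k] = (prediction, train_accuracy)
    print(f"k={k}: prediction for 4.0 = {prediction}, training accuracy = {train_accuracy:.3f}")
```

With k=1 the memorized example is its own nearest neighbor, so it is predicted correctly and the training accuracy is 1.0. With k=5 the four neighbors labelled 0 outvote it, and the training accuracy drops to 6/7.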
Let’s see how accurate it is on the training data:
knn.score(X_train, y_train)
The training data accuracy I got is 0.83 or 83%.
Remember, we have a test dataset that our model has never seen. Now check how accurately it can predict the labels of the test dataset.
knn.score(X_test, y_test)
The accuracy came out to be 0.78 or 78%.
Combining the code above, here are the four lines that make up your classifier:
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)
knn.score(X_train, y_train)
knn.score(X_test, y_test)
Congrats! You developed a KNN classifier!
Notice that the training set accuracy (83%) is a bit higher than the test set accuracy (78%). That gap is a sign of overfitting.
What is Overfitting?
Sometimes the model learns the training set so well that it can predict the training dataset labels very well. But when we ask the model to predict with a test dataset or a dataset that it did not see before, it does not perform as well as the training dataset. This phenomenon is called overfitting.
In a single sentence: when the training set accuracy is noticeably higher than the test set accuracy, we call it overfitting.
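As a rough illustration of that gap, the sketch below compares training and test accuracy for a small and a larger k. It uses synthetic data from scikit-learn's `make_classification` rather than the Titanic data, so the exact numbers are not comparable with the ones in this article; the names `X_syn`, `y_syn`, and `gaps` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for illustration, not the Titanic dataset).
X_syn, y_syn = make_classification(n_samples=400, n_features=4, n_informative=3,
                                   n_redundant=1, random_state=1)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_syn, y_syn, random_state=0)

train_accs, gaps = {}, {}
for k in (1, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(Xs_train, ys_train)
    train_accs[k] = model.score(Xs_train, ys_train)
    test_acc = model.score(Xs_test, ys_test)
    gaps[k] = train_accs[k] - test_acc   # a large positive gap suggests overfitting
    print(f"k={k}: train={train_accs[k]:.2f}, test={test_acc:.2f}, gap={gaps[k]:.2f}")
```

With k=1 every training point is its own nearest neighbor, so the training accuracy is perfect while the test accuracy usually is not, which is the pattern described above.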
Prediction
If you want to see the predicted output for the test dataset, here is how to do that:
y_pred = knn.predict(X_test)
y_pred
array([0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0,
0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0,
1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1,
1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,
0, 1, 1], dtype=int64)
Or you can just input one single example and find the label.
I want to see whether, as per our model, a person traveling in ‘Pclass’ 3, whose ‘Sex’ is female (that is, 1), and who paid a ‘Fare’ of 25 would survive.
knn.predict([[3, 1, 25]])
Remember to use two pairs of brackets, because predict requires a 2D array.
array([0], dtype=int64)
The output is zero. That means, as per our trained model, the person would not survive.
Please feel free to try more inputs like this one!
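If you are curious which neighbors decided a prediction, scikit-learn's `kneighbors` and `predict_proba` methods let you inspect the vote. The sketch below uses a handful of made-up rows in the same column order as the tutorial (Pclass, Sex, Fare); `X_small`, `y_small`, and `knn_small` are names invented for this illustration, not the fitted Titanic model.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up rows: Pclass, Sex (0 = male, 1 = female), Fare.
X_small = np.array([
    [3, 1, 24.0], [3, 1, 26.0], [3, 0, 25.5],
    [3, 1, 23.0], [2, 1, 27.0], [1, 0, 80.0], [1, 1, 90.0],
])
y_small = np.array([0, 1, 0, 0, 1, 1, 1])

knn_small = KNeighborsClassifier(n_neighbors=5).fit(X_small, y_small)
query = [[3, 1, 25]]

# Distances to, and row indices of, the 5 nearest training examples.
distances, indices = knn_small.kneighbors(query)
neighbor_labels = y_small[indices[0]]
print(neighbor_labels)                  # three 0s and two 1s

proba = knn_small.predict_proba(query)  # share of the votes for each class
print(proba)                            # [[0.6 0.4]]
print(knn_small.predict(query))         # [0]
```

Here three of the five nearest rows are labelled 0, so class 0 gets 60% of the votes and becomes the prediction.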
If You Want to See Some Further Analysis of KNN Classifier
The KNN classifier is highly sensitive to the choice of ‘k’ or n_neighbors. In the example above I used n_neighbors = 5.
For different n_neighbors, the classifier will perform differently.
Let’s check how it performs on the training dataset and the test dataset for different n_neighbors values. I chose the range 1 to 20.
Now, we will calculate the training set accuracy and the test set accuracy for each n_neighbors value from 1 to 20:
training_accuracy = []
test_accuracy = []
for i in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors = i)
    knn.fit(X_train, y_train)
    training_accuracy.append(knn.score(X_train, y_train))
    test_accuracy.append(knn.score(X_test, y_test))
After running this code snippet, I got the training and test accuracy for different n_neighbors.
Now, let’s plot the training and test set accuracy against n_neighbors in the same plot.
plt.plot(range(1, 21), training_accuracy, label='Training Accuracy')
plt.plot(range(1, 21), test_accuracy, label='Testing Accuracy')
plt.title('Training Accuracy vs Test Accuracy')
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.ylim([0.7, 0.9])
plt.legend()  # show which line is which
Analyze the Graph Above
In the beginning, when n_neighbors was 1, 2, or 3, the training accuracy was a lot higher than the test accuracy. So, the model was overfitting heavily.
After that, the training and test accuracy became closer. That is the sweet spot. We want that.
But as n_neighbors went even higher, both the training and the test set accuracy went down. We do not want that.
From the graph above, the best n_neighbors for this particular dataset and model appears to be 6 or 7.
That is a good classifier!
Look at the graph above! When n_neighbors is about 7, both the training and the test accuracy are above 80%.
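Instead of reading the best n_neighbors off a plot, you can also pick it programmatically. The sketch below swaps in a different technique from the one used in this article: 5-fold cross-validation with scikit-learn's `cross_val_score`, run on synthetic data. With the tutorial's data you would pass X_train and y_train instead of the made-up `X_syn` and `y_syn`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for illustration, not the Titanic dataset).
X_syn, y_syn = make_classification(n_samples=300, n_features=3, n_informative=3,
                                   n_redundant=0, random_state=0)

k_values = list(range(1, 21))
cv_means = []
for k in k_values:
    # Mean accuracy over 5 folds for this value of n_neighbors.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_syn, y_syn, cv=5)
    cv_means.append(scores.mean())

best_k = k_values[int(np.argmax(cv_means))]
print(f"best n_neighbors = {best_k}, mean CV accuracy = {max(cv_means):.3f}")
```

Cross-validation only uses the data you pass in, so the test set stays untouched until the final check.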
Here is a link to the complete code:
I hope you learned to build a nice KNN classifier and will try it on different datasets. Please do not hesitate to ask if you have any questions, and do share if you build a new project with this.
Thank you so much for reading this article! Here is the titanic dataset I used for this tutorial:
Translated from: https://towardsdatascience.com/simple-knn-classifier-with-four-lines-of-code-for-beginners-machine-learning-5344d125360f