KNN, or "birds of a feather flock together":
Given a test sample, find the k training samples closest to it under some distance metric, then predict from the information carried by those k neighbors.
KNN handles both classification and regression.
For classification it uses voting: the class label that appears most often among the k neighbors becomes the prediction.
For regression it uses averaging: the mean of the k neighbors' real-valued outputs becomes the prediction.
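A minimal from-scratch sketch of both modes (the helper knn_predict and the toy data are illustrative, not from any library):
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, classify=True):
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training sample
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    if classify:
        return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote
    return y_train[nearest].mean()                # mean of the neighbors' outputs

X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([0, 0, 0, 1])
print(knn_predict(X, y, np.array([1.5]), k=3))    # -> 0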
KNN can also be used to impute missing values, for both categorical and continuous variables.
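For continuous variables, scikit-learn ships KNNImputer, which fills a missing entry with the average of that feature over the k nearest rows; a quick sketch on a toy array (categorical imputation would instead take the neighbors' mode, which this class does not do):
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
imputer = KNNImputer(n_neighbors=2)   # average the feature over the 2 nearest rows
print(imputer.fit_transform(X))       # the nan becomes (2.0 + 6.0) / 2 = 4.0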
Traditionally KNN uses Euclidean distance.
Euclidean and Manhattan distances suit continuous variables;
Hamming distance suits categorical variables.
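For reference, a small sketch of the three metrics, using scipy.spatial.distance for the numeric ones and computing Hamming directly as the fraction of mismatched positions:
import numpy as np
from scipy.spatial.distance import euclidean, cityblock

a, b = np.array([1, 2, 3]), np.array([4, 6, 3])
print(euclidean(a, b))    # sqrt(3^2 + 4^2 + 0^2) = 5.0
print(cityblock(a, b))    # |3| + |4| + |0| = 7   (Manhattan)
u, v = np.array(['Sunny', 'Hot']), np.array(['Rainy', 'Hot'])
print(np.mean(u != v))    # Hamming: fraction of mismatched positions = 0.5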
Curse of Dimensionality
For background, see this article:
https://blog.csdn.net/zbc1090549839/article/details/38929215
Compared with a large feature set, KNN performs better with fewer features; as the number of dimensions grows, it tends to overfit. Remedies: dimensionality reduction such as PCA, or feature selection, sketched below.
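A sketch of the PCA remedy with a scikit-learn pipeline (n_components=2 is an arbitrary illustrative choice, and the training-set score is only a sanity check):
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
# standardize, project the 13 features onto 2 principal components, then classify
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), KNeighborsClassifier(n_neighbors=7))
pipe.fit(X, y)
print(pipe.score(X, y))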
Summary
1. KNN spends more computation at test time than at training time, because the training phase consists only of storing the training samples' feature vectors and class labels (KNN is a "lazy learner").
2. As k increases, bias increases and variance decreases; when the dataset is noisy, try increasing k.
3. In KNN the curse of dimensionality easily leads to overfitting; the remedy is to reduce dimensionality or select features.
4. As k increases, the decision boundary becomes smoother.
5. Before running KNN, the data should be min-max normalized or z-score standardized (see the sketch below).
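Both rescalings in scikit-learn, on a toy matrix (in practice, fit the scaler on the training set only, as in the second example below):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
print(MinMaxScaler().fit_transform(X))    # min-max: (x - min) / (max - min), maps each column to [0, 1]
print(StandardScaler().fit_transform(X))  # z-score: (x - mean) / std, per column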
Example
# First Feature
weather=['Sunny','Sunny','Overcast','Rainy','Rainy',
'Rainy','Overcast','Sunny','Sunny','Rainy',
'Sunny','Overcast','Overcast','Rainy']
# Second Feature
temp=['Hot','Hot','Hot','Mild','Cool',
'Cool','Cool','Mild','Cool','Mild',
'Mild','Mild','Hot','Mild']
# Label or target variable
play=['No','No','Yes','Yes','Yes',
'No','Yes','No','Yes','Yes',
'Yes','Yes','Yes','No']
# Import LabelEncoder
from sklearn import preprocessing
#creating labelEncoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
weather_encoded=le.fit_transform(weather)
print(weather_encoded) #Overcast:0, Rainy:1, and Sunny:2
# converting string labels into numbers
temp_encoded=le.fit_transform(temp) # Cool:0, Hot:1, Mild:2
label=le.fit_transform(play) # No:0, Yes:1 (the same encoder is re-fit on each call)
# combining weather and temp into a single list of tuples
features=list(zip(weather_encoded,temp_encoded))
#Using KNN Classifier
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3) #k=3
model.fit(features,label) #Train the model
#Predict Output
predicted= model.predict([[0,2]]) # 0:Overcast, 2:Mild
print(predicted)
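With LabelEncoder's alphabetical encoding, the printed value is the encoded play label (No:0, Yes:1). The next example moves to the built-in wine dataset and adds a train/test split plus z-score standardization: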
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
#Load dataset
wine = datasets.load_wine()
# print the names of the features
print(wine.feature_names)
# print the class (target) names
print(wine.target_names)
# print the wine data (first five)
print(wine.data[0:5])
print(wine.target)
print(wine.data.shape)
# Split the dataset into training and test sets (70/30; random_state fixes the split for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3, random_state=42)
# Define the scaler for z-score standardization (fit on the training set only)
scaler = StandardScaler().fit(X_train)
# Standardize the training set
X_train = scaler.transform(X_train)
# Standardize the test set using the training set's statistics
X_test = scaler.transform(X_test)
# Create the KNN classifier with k=7
knn = KNeighborsClassifier(n_neighbors=7)
# Train the model on the training set
knn.fit(X_train, y_train)
# Predict on the test set
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(accuracy)
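To see how the choice of k affects the fit, sweep n_neighbors from 1 to 10 and plot training versus testing accuracy; a widening gap between the curves signals overfitting: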
training_accuracy = []
testing_accuracy = []
neighbors_setting = range(1, 11)
for n_neighbors in neighbors_setting:
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    training_accuracy.append(knn.score(X_train, y_train))
    testing_accuracy.append(knn.score(X_test, y_test))
plt.plot(neighbors_setting, training_accuracy, label='training')
plt.plot(neighbors_setting, testing_accuracy, label='testing')
plt.legend()
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.show()