ML Supervised Learning - Classification - K-Nearest Neighbors (KNN)

KNN follows the intuition that "birds of a feather flock together":
given a test sample, find the k training samples closest to it under some distance metric, then make a prediction from the information carried by these k neighbors.


KNN can be used for both classification and regression.
For classification it uses a voting rule: the class label that occurs most often among the k neighbors is taken as the prediction.
For regression it uses an averaging rule: the mean of the real-valued outputs of the k neighbors is taken as the prediction.
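
To make the two rules concrete, here is a minimal from-scratch sketch with NumPy (the tiny arrays are made up for illustration; this is not how the sklearn examples below implement it internally):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, task='classify'):
    # Euclidean distance from the query point x to every training sample
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    if task == 'classify':
        # Voting rule: most frequent label among the k neighbors
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # Averaging rule: mean of the neighbors' real-valued outputs
    return y_train[nearest].mean()

X_train = np.array([[1, 1], [1, 2], [4, 4], [5, 4]])
y_cls = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_cls, np.array([1.5, 1.5]), k=3))  # -> 0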

KNN can also be used to impute missing values for both categorical and continuous variables.
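
For numeric features, scikit-learn packages this idea as KNNImputer; a minimal sketch (the toy matrix is made up purely for illustration):

import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with one missing entry (np.nan)
X = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 6.0], [8.0, 8.0]])

# Each missing value is filled with the mean of that feature over
# the 2 nearest rows (distances computed on the observed features)
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))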

Traditionally, KNN uses the Euclidean distance.

Euclidean and Manhattan distances are used for continuous variables, while the Hamming distance is used for categorical variables.
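
A quick sketch of the three metrics with NumPy (the vectors are arbitrary examples):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print(np.sqrt(np.sum((a - b) ** 2)))  # Euclidean: sqrt(1 + 4 + 9) ≈ 3.74
print(np.sum(np.abs(a - b)))          # Manhattan: 1 + 2 + 3 = 6

# Hamming distance for categorical vectors: fraction of positions that differ
u = np.array(['Sunny', 'Hot', 'Mild'])
v = np.array(['Rainy', 'Hot', 'Mild'])
print(np.mean(u != v))                # 1 of 3 positions differ -> 0.333...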

Curse of Dimensionality

For more background, see this article (in Chinese):
https://blog.csdn.net/zbc1090549839/article/details/38929215

KNN performs better with few features than with many; as the number of dimensions grows, the model tends to overfit. Remedies: dimensionality reduction with PCA, or feature selection.
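
One common remedy is to chain PCA and KNN in a pipeline. A hedged sketch on the wine data used in Example 2 below (the choice of 5 components and k=7 is arbitrary here):

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=0)

# Scale, project the 13 features onto 5 principal components, then run KNN
pipe = make_pipeline(StandardScaler(), PCA(n_components=5),
                     KNeighborsClassifier(n_neighbors=7))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))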

Summary

1. KNN spends more computation at test time than at training time,
because the training phase consists only of storing the feature vectors and class labels of the training samples.

2. In the accompanying figure of validation error vs. k, assuming the algorithm used is KNN, the best value of k is 10, because the validation error is smallest at that point.

3. As k increases, bias increases and variance decreases.
If the dataset is noisy, try increasing k.

4. In KNN, the curse of dimensionality can easily lead to overfitting; the remedies are dimensionality reduction and feature selection.

5. As k increases, the decision boundary becomes smoother.

6. Before running KNN, the data should first be normalized with min-max scaling or standardized with z-scores (a scaling sketch follows this list).
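
A minimal sketch of the two scaling options with scikit-learn (fit the scaler on the training split only, then apply it to both splits; the tiny arrays are made up for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[2.5, 250.0]])

# Min-max normalization: rescales each feature to [0, 1] based on the training data
minmax = MinMaxScaler().fit(X_train)
print(minmax.transform(X_test))

# Z-score standardization: zero mean, unit variance per feature
zscore = StandardScaler().fit(X_train)
print(zscore.transform(X_test))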


Example 1: predicting "play" from weather and temperature
# First Feature
weather=['Sunny','Sunny','Overcast','Rainy','Rainy',
         'Rainy','Overcast','Sunny','Sunny','Rainy',
         'Sunny','Overcast','Overcast','Rainy']
# Second Feature
temp=['Hot','Hot','Hot','Mild','Cool',
      'Cool','Cool','Mild','Cool','Mild',
      'Mild','Mild','Hot','Mild']

# Label or target variable
play=['No','No','Yes','Yes','Yes',
      'No','Yes','No','Yes','Yes',
      'Yes','Yes','Yes','No']

# Import LabelEncoder
from sklearn import preprocessing
#creating labelEncoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
weather_encoded=le.fit_transform(weather)
print(weather_encoded) #Overcast:0, Rainy:1, and Sunny:2

# converting string labels into numbers
temp_encoded=le.fit_transform(temp)
label=le.fit_transform(play)

# Combining weather and temp into a single list of tuples
features=list(zip(weather_encoded,temp_encoded))

#Using KNN Classifier
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=3) #k=3
model.fit(features,label) #Train the model

#Predict Output
predicted= model.predict([[0,2]]) # 0:Overcast, 2:Mild
print(predicted)  # 1 means 'Yes' under the LabelEncoder mapping (No:0, Yes:1)

Example 2: KNN on the wine dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
#Load dataset
wine = datasets.load_wine()
# print the names of the features
print(wine.feature_names)
# print the label species
print(wine.target_names)
# print the wine data (first five)
print(wine.data[0:5])
print(wine.target)
print(wine.data.shape)

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3)

# Define the scaler for normalization
scaler = StandardScaler().fit(X_train)

# Normalize the training dataset
X_train = scaler.transform(X_train)

# Normalize the test dataset
X_test = scaler.transform(X_test)

#create knn Classifier
knn = KNeighborsClassifier(n_neighbors=7)

# train the model using training set
knn.fit(X_train,y_train)

#predict the response
y_pred = knn.predict(X_test)

accuracy = accuracy_score(y_test,y_pred)
print(accuracy)

# Compare training vs. testing accuracy as k goes from 1 to 10
training_accuracy=[]
testing_accuracy=[]
neighbors_setting= range(1,11)
for n_neighbors in neighbors_setting:
    knn=KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train,y_train)
    training_accuracy.append(knn.score(X_train,y_train))
    testing_accuracy.append(knn.score(X_test,y_test))

plt.plot(neighbors_setting,training_accuracy,label='training')
plt.plot(neighbors_setting,testing_accuracy,label='testing')
plt.legend()
plt.xlabel('neighbors_setting')
plt.ylabel('accuracy')
plt.show()
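
Instead of reading the best k off the curve by eye, it can also be chosen by cross-validation. A hedged sketch with GridSearchCV (5-fold CV over the same range of k, run on the scaled training split from above):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {'n_neighbors': list(range(1, 11))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)  # X_train, y_train: the scaled wine training split above
print(grid.best_params_, grid.best_score_)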

