Algorithm Background
The K-Nearest Neighbor (KNN) algorithm is a well-known statistical method from pattern recognition and occupies an important place among machine-learning classification algorithms. It is theoretically mature: it is one of the simplest machine-learning algorithms, the most basic of the instance-based learning methods, and one of the better-performing text-classification algorithms.
How KNN Works
We start from a collection of samples, called the training set, in which every sample carries a label; that is, we know which class each sample belongs to. When an unlabeled sample arrives, we compare each of its features with the corresponding features of the samples in the training set and extract the class labels of the most similar samples (the nearest neighbors). In general we consider only the top k most similar samples, which is where the k in "k-nearest neighbors" comes from; k is usually an integer no larger than 20. Finally, the class that appears most often among these k most similar samples is chosen as the class of the new sample.
Note: KNN has no explicit training phase; it is the canonical "lazy learning" method. During training it merely stores the data (the training-time cost is essentially zero) and defers all processing until a test sample arrives.
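A minimal sketch of the decision rule just described, using Euclidean distance and a plain (unweighted) majority vote; the helper name knn_classify is illustrative, not part of the implementation below:

import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    # Euclidean distance from the query point x to every training sample
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # Labels of the k nearest training samples
    top_k = y_train[np.argsort(distances)[:k]]
    # Majority vote: the most frequent label among the k neighbors wins
    return Counter(top_k).most_common(1)[0][0]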
Strengths and Weaknesses of KNN
Strengths of KNN:
1) Simple and effective.
2) The cost of retraining is low (changes to the class taxonomy and to the training set are common in Web and e-commerce applications).
3) Computation time and space are linear in the size of the training set (acceptable in many settings).
4) Because KNN assigns a class based on a limited number of nearby samples rather than on learned class-region boundaries, it is better suited than many other methods to sample sets whose class regions intersect or overlap heavily.
5) The algorithm works well for automatically classifying classes with large sample sizes; classes with few samples are more easily misclassified.
Weaknesses of KNN:
1) KNN is a lazy learning method: it does essentially no work at training time, so all the computation falls on prediction, and eager learning algorithms are much faster at classification time.
2) The class scores it produces are not normalized (unlike probability scores).
3) Its output is not very interpretable; a decision tree, for example, is far easier to explain.
4) Its main shortcoming at classification time is class imbalance: when one class has a very large sample size and the others are small, the k neighbors of a new sample tend to be dominated by the large class. Because the algorithm only counts the "nearest" neighbors and each neighbor's vote is worth the same, a numerous class can win the vote by sheer count whether or not its samples are genuinely the closest match to the query. This can be improved by weighting the votes (neighbors at smaller distances get larger weights), as sketched after this list.
5) The computational cost is high. The usual remedy is to edit the known samples in advance, removing those that contribute little to classification.
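For the weighting improvement mentioned in point 4, scikit-learn already provides it out of the box; a one-line sketch (the full iris example appears later in this section):

from sklearn.neighbors import KNeighborsClassifier

# weights='distance' makes each neighbor's vote proportional to 1/distance,
# so a numerous but distant class no longer dominates the vote
knn_clf = KNeighborsClassifier(n_neighbors=5, weights='distance')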
Python Implementation
The data is first normalized with mean-variance (z-score) standardization; prediction then uses distance-weighted voting to classify the iris data set.
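Before the full listing, the standardization step in isolation; this is what the _scaling method below computes column by column (the helper name standardize is illustrative):

import numpy as np

def standardize(X_train, X):
    # z-score: subtract the per-feature training mean and divide by the
    # per-feature training standard deviation; statistics come from the
    # training set only
    return (X - X_train.mean(axis=0)) / X_train.std(axis=0)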
Source code
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score


class KNN:

    def __init__(self, k, p):
        """Initialize the kNN classifier with k neighbors and Minkowski order p."""
        assert k >= 1, "k must be valid"
        self.k = k
        self.p = p
        self._X_train = None
        self._y_train = None
        self.y_possible = None

    def data_split(self, X, y, test_ratio=0.2, seed=None):
        """Split the data set into train and test parts according to test_ratio."""
        assert X.shape[0] == y.shape[0], \
            "the size of X must equal to the size of y"
        assert 0.0 <= test_ratio <= 1.0, \
            "test_ratio must be valid"
        if seed is not None:
            np.random.seed(seed)
        # Shuffle the original data
        shuffle_indexes = np.random.permutation(len(X))
        test_size = int(len(X) * test_ratio)
        test_indexes = shuffle_indexes[:test_size]
        train_indexes = shuffle_indexes[test_size:]
        X_train = X[train_indexes]
        y_train = y[train_indexes]
        X_test = X[test_indexes]
        y_test = y[test_indexes]
        # All possible values of y
        y_list = y_train.tolist()  # convert the array of training labels to a list
        y_possible_list = sorted(set(y_list), key=y_list.index)  # deduplicate, preserving order
        self.y_possible = np.array(y_possible_list)  # convert the list back to an array
        # Feature scaling; the right-hand side is evaluated before assignment,
        # so both calls see the raw (unscaled) X_train
        X_train, X_test = self._scaling(X_train, X_train), self._scaling(X_train, X_test)
        return X_train, y_train, X_test, y_test

    def _scaling(self, X_train, original_X):
        # """Min-max normalization"""
        # scaling_X = np.ones(shape=original_X.shape)
        # for i in range(original_X.shape[1]):
        #     scaling_X[:, i] = (original_X[:, i] - np.min(X_train[:, i])) / (np.max(X_train[:, i]) - np.min(X_train[:, i]))
        """Mean-variance (z-score) standardization"""
        scaling_X = np.zeros(shape=original_X.shape)
        for i in range(original_X.shape[1]):
            scaling_X[:, i] = (original_X[:, i] - np.mean(X_train[:, i])) / np.std(X_train[:, i])
        return scaling_X

    def fit(self, X_train, y_train):
        """Train the kNN classifier on the training set X_train, y_train."""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least k."
        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        """Given a data set X_predict, return the vector of predictions for it."""
        assert self._X_train is not None and self._y_train is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == self._X_train.shape[1], \
            "the feature number of X_predict must be equal to X_train"
        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        """Given a single sample x, return the predicted value for x."""
        assert x.shape[0] == self._X_train.shape[1], \
            "the feature number of x must be equal to X_train"
        # Minkowski distance of order p to every training sample
        distances = [pow(np.sum(np.abs(x_train - x) ** self.p), 1 / self.p)
                     for x_train in self._X_train]
        nearest = np.argsort(distances)
        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        topK_distance = [distances[i] for i in nearest[:self.k]]
        # Distance-weighted voting: each neighbor votes with weight 1/distance
        votes_count = np.zeros(shape=self.y_possible.shape)
        for j in range(self.k):
            vote_id = np.argwhere(self.y_possible == topK_y[j])
            votes_count[vote_id] += 1 / (topK_distance[j] + 1e-8)  # guard against a zero denominator
        predict_value_id = np.argmax(votes_count)
        predict_value = self.y_possible[predict_value_id]
        return predict_value

    def score(self, X_test, y_test):
        """Measure the accuracy of the current model on the test set X_test, y_test."""
        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)

    def __repr__(self):
        return "KNN(k=%d)" % self.k


# Use KNN to classify iris species
iris = datasets.load_iris()
X = iris.data
y = iris.target
iris_knn = KNN(3, 7)  # k=3 neighbors, Minkowski order p=7
X_train, y_train, X_test, y_test = iris_knn.data_split(X, y)
iris_knn.fit(X_train, y_train)
score = iris_knn.score(X_test, y_test)
print(score)
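Note two design choices in data_split: the scaling statistics (mean and standard deviation) are computed from the training portion only and then applied to both portions, so no test-set information leaks into the model; and because no seed is passed in the call above, the shuffle differs between runs and the printed accuracy varies accordingly.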
Using sklearn
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

if __name__ == '__main__':
    iris = datasets.load_iris()
    print(iris.DESCR)  # show detailed information about the data set
    X = iris.data
    y = iris.target
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=None)
    # Standardize the data (fit on the training set only)
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    # Fit and score a plain KNN classifier
    knn_clf = KNeighborsClassifier(n_neighbors=3)
    knn_clf.fit(X_train, y_train)
    print(knn_clf.score(X_test, y_test))
    param_grid = [
        {
            'weights': ['uniform'],
            'n_neighbors': [i for i in range(1, 6)]
        },
        {
            'weights': ['distance'],
            'n_neighbors': [i for i in range(1, 6)],
            'p': [i for i in range(1, 4)]
        }
    ]
    # Tune the hyperparameters with a grid search
    grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)
    grid_search.fit(X_train, y_train)
    print(grid_search.best_score_)
    print(grid_search.best_params_)
    knn_clf = grid_search.best_estimator_
    print(knn_clf.score(X_test, y_test))
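After fitting, grid_search.best_score_ is the cross-validated accuracy measured on the training data, while best_estimator_ is the best-parameter model refitted on the full training set; the final score printed on the untouched test set is therefore the fairer estimate of generalization.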
Python tips
Shuffling the original data:
shuffle_indexes = np.random.permutation(len(X))
test_size = int(len(X)*test_ratio)
test_indexes = shuffle_indexes[:test_size]
train_indexes = shuffle_indexes[test_size:]
Converting between a list and a NumPy array:
y_list = y_train.tolist()  # convert the array of training labels to a list
self.y_possible = np.array(y_possible_list)  # convert the list back to an array
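One more idiom worth noting from _predict above: np.argsort returns the indices that would sort an array, so slicing off its first k entries picks out the k nearest neighbors:
nearest = np.argsort(distances)  # indices sorted by ascending distance
topK_y = [self._y_train[i] for i in nearest[:self.k]]  # labels of the k closest samples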