[MLReview] k-Nearest Neighbor (k-NN): Algorithm and Code Implementation

I. Classification model

The nearest-neighbor algorithm is probably one of the simplest algorithms out there, yet it works surprisingly well in many scenarios.

The core idea of k-NN: find the k training samples nearest to the query point in feature space, then let those k samples "vote" on the class of the sample to be classified.

 

II. Drawbacks

1. Sensitivity to class imbalance in the raw training data: when one class has far more samples than the others, it tends to dominate the vote.

2. Poor interpretability, compared with decision trees.

 

III. Algorithm and mathematical derivation (screenshots quoted from Li Hang's 《统计学习方法》, Statistical Learning Methods)

(Figure 1: the k-NN algorithm, from 《统计学习方法》)

Evidently, there is no explicit "learning" step to be seen in k-NN.

Zhou Zhihua's 《机器学习》 (Machine Learning) calls this "lazy learning".

See also Z.-H. Zhou's paper ML-KNN: A lazy learning approach to multi-label learning.

There, Zhou compares against the Bayes-optimal classifier and shows that the generalization error rate of the nearest-neighbor classifier is at most twice the Bayes error rate.

(Figure 2: derivation of the nearest-neighbor error bound, from 《机器学习》)

The original proof is in Cover and Hart's paper Nearest Neighbor Pattern Classification.
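For the record, a sketch of the bound itself, as stated in the asymptotic (infinite-sample) setting of Cover and Hart, where $R^*$ is the Bayes error rate and $c$ is the number of classes:

$$R^* \le R_{1\mathrm{NN}} \le R^*\left(2 - \frac{c}{c-1}\,R^*\right) \le 2R^*.$$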

 

IV. Key parameters

1. Choosing k

k too large: a larger region decides the prediction, so bias (approximation error) increases while variance (estimation error) decreases.

k too small: the prediction comes from a very small neighborhood, which easily overfits.

Choosing a suitable k: select it via cross validation.

 

2. Aside: cross validation

(Figure 3: schematic of k-fold cross validation)

10-fold cross validation is the most common: split the dataset into n folds (here n = 10); each round, train on n−1 of them and hold out the remaining fold as the test set, then average the errors over all rounds to get the estimated generalization error.

For details see the paper A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.

Recently I also came across LOOCV (Leave-One-Out CV).

Each round holds out exactly one sample as the test set. Won't that overfit badly? Well, to be fair, overfitting is not something people worry about much with k-NN anyway.

Its biggest drawback is the computational cost: it requires one model fit per sample.

Here is another cross-validation slide deck: cmu_cross_validation.

This is a bit of a digression, but since it came up I'll finish the thought.

(Figure 4: cross-validation illustration)

As is good practice, here is some CV code (CV as in cross validation, not computer vision).

sklearn ships some very handy packages for this; the official tutorial is a real pleasure to read: scikit-learn — cross_validation.

 

For example:

① train_test_split, very widely used: randomly splits the dataset by a given percentage.

from sklearn.model_selection import train_test_split

# data: feature matrix, target: labels; hold out 40% for testing
X_train, X_test, y_train, y_test = \
    train_test_split(data, target, test_size=0.4, random_state=0)

② cross_val_score and cross_val_predict: run cross validation and evaluate the result.

from sklearn import svm
from sklearn.model_selection import cross_val_score, cross_val_predict

clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, data, target, cv=5)        # one score per fold
predicted = cross_val_predict(clf, data, target, cv=10)  # out-of-fold predictions

③ KFold

import numpy as np
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
# each split yields (train indices, test indices)
for train, test in kf.split(X):
    print("%s %s" % (train, test))

#[2 3] [0 1]
#[0 1] [2 3]

④ LeaveOneOut: essentially the same as KFold; it is the special case where n_splits equals the number of samples.

from sklearn.model_selection import LeaveOneOut

X = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))

#[1 2 3] [0]
#[0 2 3] [1]
#[0 1 3] [2]
#[0 1 2] [3]

⑤ And of course there is a splitter for time series: TimeSeriesSplit.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])

tscv = TimeSeriesSplit(n_splits=3)
# training windows only ever grow forward in time
for train, test in tscv.split(X):
    print(train, test)
   
#[0 1 2] [3]
#[0 1 2 3] [4]
#[0 1 2 3 4] [5]

The official source code is really quite good; of course, you could also write these utilities yourself.

ShuffleSplit: generates independent random train/test splits via random permutation (not sampling with replacement), similar in spirit to KFold except that splits may overlap; a quick sketch below.
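A minimal ShuffleSplit sketch (the toy X and the n_splits/test_size values are my own choices for illustration):

from sklearn.model_selection import ShuffleSplit

X = list(range(5))
ss = ShuffleSplit(n_splits=3, test_size=0.4, random_state=0)
# unlike KFold, the three test sets here are drawn independently and may overlap
for train, test in ss.split(X):
    print("%s %s" % (train, test))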

In short, the role of CV here is to tell us which value of k gives the best model!

 

3. Distance metrics (useful in many other settings too)

In k-NN, three distances are commonly used: the Manhattan distance, the Euclidean distance, and their generalization, the Minkowski distance.

Let the feature space $\mathcal{X}$ be the $n$-dimensional real vector space $\mathbb{R}^n$, and let $x_i, x_j \in \mathcal{X}$ with $x_i = (x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(n)})^T$ and $x_j = (x_j^{(1)}, x_j^{(2)}, \dots, x_j^{(n)})^T$. The $L_p$ distance between $x_i$ and $x_j$ is defined as

$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{\frac{1}{p}}, \qquad p \ge 1.$$

When $p = 1$, it is called the Manhattan distance:

$$L_1(x_i, x_j) = \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|.$$

When $p = 2$, it is called the Euclidean distance:

$$L_2(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^2 \right)^{\frac{1}{2}}.$$

When $p = \infty$, it is the maximum of the coordinate-wise distances:

$$L_\infty(x_i, x_j) = \max_l \left| x_i^{(l)} - x_j^{(l)} \right|.$$
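As a quick sanity check, here is a small numpy sketch computing these distances for two toy vectors (the vector values are made up for illustration):

import numpy as np

def lp_distance(xi, xj, p):
    """Minkowski (L_p) distance between two vectors; p=np.inf gives L_inf."""
    diff = np.abs(np.asarray(xi) - np.asarray(xj))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

xi, xj = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(lp_distance(xi, xj, 1))       # Manhattan: 5.0
print(lp_distance(xi, xj, 2))       # Euclidean: sqrt(13) ~ 3.606
print(lp_distance(xi, xj, np.inf))  # max coordinate difference: 3.0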

 

V. k-NN algorithm code

Start with a from-scratch implementation: knn.py

import operator
from numpy import tile

# inX: vector to classify; dataSet: training features; labels: training labels; k: number of neighbors
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]

    # Euclidean distance between inX and every training sample
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5

    # take the k points with the smallest distances
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort the classes by vote count, descending
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

So easy. Code walkthrough:

First subtract inX from the features of every training sample and compute the Euclidean distance, then sort the points by distance in ascending order.

Next declare a dict classCount whose keys are the labels of the k nearest points; each time a key repeats, its value is incremented by 1.

Sort the dict by value in descending order; the key of the first entry is the result of the "vote".

ps: argsort(x) returns the list of indices that would sort x in ascending order; those indices are then used to look up the corresponding labels.

Note that Python 3 replaced iteritems() with items(), and the cmp argument was removed, so sorted no longer accepts a comparison function directly; use a key function (or functools.cmp_to_key) instead.
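For instance, the vote-count sort written both ways in Python 3 (the toy classCount values are made up for illustration):

from functools import cmp_to_key

classCount = {'A': 3, 'B': 5, 'C': 1}

# idiomatic Python 3: sort by value, descending
by_key = sorted(classCount.items(), key=lambda kv: kv[1], reverse=True)

# equivalent, wrapping an old-style cmp function with cmp_to_key
by_cmp = sorted(classCount.items(), key=cmp_to_key(lambda a, b: b[1] - a[1]))

print(by_key)  # [('B', 5), ('A', 3), ('C', 1)]
print(by_cmp)  # [('B', 5), ('A', 3), ('C', 1)]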

 

We can test this using the earlier splitting utilities; first, load the data:

from sklearn.model_selection import train_test_split
from sklearn import datasets

def createDataSet():
    iris = datasets.load_iris()
    iris_X = iris.data
    iris_y = iris.target
    X_train, X_test, y_train, y_test = \
        train_test_split(iris_X, iris_y, test_size=0.4, random_state=0)
    return X_train, X_test, y_train, y_test

Then feed the training and test sets into the model:

import numpy as np
import knn as ks  # the knn.py module above

def predict():
    X_train, X_test, y_train, y_test = createDataSet()

    k = 0
    p = np.zeros(y_test.shape)
    for i in range(X_test.shape[0]):
        p[i] = ks.classify0(X_test[i:i+1], X_train, y_train, 3)
        if p[i] == y_test[i]:
            k += 1

    accuracy = k / p.shape[0]
    print(accuracy)

That completes the k-NN test on the iris dataset; the resulting accuracy is 93.333%.

Now suppose we are not satisfied with this result and want to use CV to find the k with the highest accuracy:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold
import knn as ks

def kfold_predict():
    iris = datasets.load_iris()
    iris_X = iris.data
    iris_y = iris.target

    # note: iris is ordered by class; KFold(n_splits=10, shuffle=True) may give fairer folds
    kf = KFold(n_splits=10)

    accuracy = 0
    for train_index, test_index in kf.split(iris_X):
        X_train, X_test = iris_X[train_index], iris_X[test_index]
        y_train, y_test = iris_y[train_index], iris_y[test_index]
        k = 0
        p = np.zeros(y_test.shape)
        for i in range(X_test.shape[0]):
            p[i] = ks.classify0(X_test[i:i+1], X_train, y_train, 3)
            if p[i] == y_test[i]:
                k += 1
        accuracy += k / p.shape[0]

    print("%s" % (accuracy / 10))

To actually pick k, just wrap this in a for loop over candidate k values; a quick sketch follows, and I won't belabor it further here.
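A minimal sketch of that loop (the candidate range 1–15 is my own choice, and for brevity it uses sklearn's cross_val_score with the built-in KNeighborsClassifier instead of the hand-rolled classify0; the same loop structure works with kfold_predict above):

import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
best_k, best_score = None, -np.inf
for k in range(1, 16):  # candidate k values (arbitrary range)
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             iris.data, iris.target, cv=10)
    if scores.mean() > best_score:
        best_k, best_score = k, scores.mean()
print(best_k, best_score)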

 

VI. k-NN worked examples

The following examples are adapted from the book Machine Learning in Action.

1. Using k-NN on a dating site to improve match results

Preparation:

① A utility to parse the data file: file.py

 

from numpy import zeros

def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())         # get the number of lines in the file
    returnMat = zeros((numberOfLines, 3))       # prepare the feature matrix to return
    classLabelVector = []                       # prepare the labels to return
    fr = open(filename)                         # reopen to read from the start
    index = 0
    # parse the file into the matrix and label list
    for line in fr.readlines():
        line = line.strip()                     # strip the trailing newline
        listFromLine = line.split('\t')         # split the line into a list of fields
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

② Scatter plots with matplotlib, a magical package. It's getting late, so just a quick sketch below for now.
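A minimal scatter-plot sketch (the column indices and the datingTestSet2.txt filename follow the book's dating example; treat them as assumptions):

import matplotlib.pyplot as plt

datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')

fig, ax = plt.subplots()
# plot feature 1 (video-game time %) against feature 2 (ice cream liters), colored by label
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2], c=datingLabels)
ax.set_xlabel('percentage of time playing video games')
ax.set_ylabel('liters of ice cream consumed per year')
plt.show()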

③ Normalization. The method is dead simple, so straight to the code:

from numpy import zeros, shape, tile

def autoNorm(dataSet):
    minVals = dataSet.min(0)   # column-wise minimum (argument 0 selects minima per column, not per row)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals # to normalize: newValue = (oldValue - min) / (max - min)
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))    # tile copies minVals into a matrix the size of the input
    normDataSet = normDataSet / tile(ranges, (m, 1)) # element-wise divide
    return normDataSet, ranges, minVals
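A quick sanity check on a toy matrix (the values are invented for illustration):

from numpy import array

toy = array([[0.0, 10.0],
             [5.0, 20.0],
             [10.0, 30.0]])
normed, ranges, minVals = autoNorm(toy)
print(normed)   # each column rescaled to [0, 1]: [[0, 0], [0.5, 0.5], [1, 1]]
print(ranges)   # [10. 20.]
print(minVals)  # [ 0. 10.]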

④ Prediction

def datingClassTest():
    hoRatio = 0.10   # hold out 10% of the data for testing

    datingDataMat, datingLabels = file2matrix('datingTestSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = int(classify0((inArr - minVals) / ranges,
                                     normMat, datingLabels, 3))
    print("You will probably like this person:",
          resultList[classifierResult - 1])

 

2. The classic handwritten-digit recognition task (which can be done in ten thousand different ways)

① Convert the image data to vectors: img.py

from numpy import zeros

def img2vector(filename):
    # each digit is a 32x32 text image of 0/1 characters; flatten it into a 1x1024 vector
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect

② The handwriting test in main.py, importing the knn module:

from os import listdir
from numpy import zeros

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')   # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])   # the digit label is encoded in the filename
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')           # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))

 

VII. scikit-learn

scikit-learn has a ready-made k-NN package (sorry for the wait); that said, reading the source code is also an excellent exercise.

# calling the KNN classifier
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target
np.unique(iris_y)

# Split the iris data into train and test data.
# One option is a random permutation of the indices:
np.random.seed(0)
indices = np.random.permutation(len(iris_X))
# but train_test_split performs the random split for us directly:
iris_X_train, iris_X_test, iris_y_train, iris_y_test = \
    train_test_split(iris_X, iris_y, test_size=0.4, random_state=0)

# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(iris_X_train, iris_y_train)
# output:
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#            metric_params=None, n_jobs=1, n_neighbors=5, p=2,
#            weights='uniform')

 

KNeighborsClassifier takes 8 parameters (the first two below are the most commonly used):

n_neighbors : int, optional (default = 5): the value of K; the default number of neighbors is 5;

weights: how the neighbors are weighted; 'uniform' weights all neighbors equally (the default), 'distance' weights each neighbor by the inverse of its distance, and you can also pass your own weighting function (see the tuning sketch after this list);

algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional: the nearest-neighbor search method; choose whichever suits your data.
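As referenced in the weights item above, a sketch of tuning n_neighbors and weights together with GridSearchCV (the parameter grid is my own choice):

from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
param_grid = {'n_neighbors': range(1, 16),   # arbitrary candidate range
              'weights': ['uniform', 'distance']}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
search.fit(iris.data, iris.target)
print(search.best_params_, search.best_score_)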

Here is the official tutorial once more: sklearn.neighbors.KNeighborsClassifier.

 

Written on April 23; running a bit low on energy. Lately I've also been doing some time series analysis with friends, which looks like it will be a lot of fun.

I'll come back and fill in more when I have time.

One more aside: I just ran into an OpenCV installation problem, which many people probably hit too.

Download the OpenCV wheel (open_cv_轮子); once downloaded, cd into that folder and install it with either pip or conda.

 

 
