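It works as follows. We have a collection of sample data, also called the training sample set, and every item in it carries a label, so we know which class each sample belongs to. When new data without a label comes in, each feature of the new data is compared with the corresponding feature of the samples, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors). In general we only look at the k most similar samples in the data set, where k is usually no larger than 20. Finally, the class that appears most often among those k samples is chosen as the class of the new data.
The similarity of two points is usually measured by the distance between their coordinates, computed with the Euclidean distance formula: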
$$d=\sqrt{(x_1-x_0)^2+(y_1-y_0)^2}$$
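As a quick check of the formula, here is a minimal sketch that computes the distance with NumPy. The helper name `euclidean_distance` and the sample points are hypothetical and chosen only for illustration; the function also accepts points with more than two coordinates, which is what the three-feature data later in the post needs.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two points given as coordinate sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Hypothetical points (x0, y0) = (0, 0) and (x1, y1) = (3, 4)
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```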
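Below we implement the algorithm by working through an example.
First, it helps to keep in mind the general workflow of the k-nearest neighbors algorithm, which the rest of this post follows: load the data, inspect it, normalize the features, classify, and measure the error rate.
Using the k-nearest neighbors algorithm to improve matches on a dating site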
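Helen has been using an online dating site to find a suitable partner. Although the site recommends different candidates, she has not found anyone she likes. After some reflection, she realized that the people she had dated fall into three types, which appear as the class labels 1, 2, and 3 in the data.
We now want to use kNN to help her sort potential matches into the right category. Helen has been dating for a while and has stored her data in the text file datingTestSet.txt, 1000 lines in total, each line containing three feature values.
We can write a Python function to import the data, or write our own routine to initialize it. The code in this post reads the data set from the file datingTestSet2.txt.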
import numpy as np

def file2matrix(filename):
    """Load tab-separated data: three numeric features and an integer label per line."""
    with open(filename) as fr:
        lines = [line.strip() for line in fr if line.strip()]
    returnMat = np.zeros((len(lines), 3))  # one row per sample, three features
    classLabelVector = []                  # integer class label of each sample
    for index, line in enumerate(lines):
        listFromLine = line.split('\t')
        returnMat[index, :] = [float(v) for v in listFromLine[0:3]]
        classLabelVector.append(int(listFromLine[-1]))
    return returnMat, classLabelVector
import matplotlib.pyplot as plot
from numpy import *
import numpy as np
import operator
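datingDataSet, datingLabels = file2matrix('datingTestSet2.txt')
fig = plot.figure()
ax = fig.add_subplot(111)
# Plot feature column 1 against column 2; marker size and color both scale with the class label.
ax.scatter(datingDataSet[:, 1], datingDataSet[:, 2], s=15.0 * np.array(datingLabels), c=15.0 * np.array(datingLabels))
plot.show()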
[Figure: scatter plot produced by the code above (output_7_0.png)]
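Among the three features above, the number of frequent flyer miles earned per year has by far the largest influence on the computed distance, simply because its values are much larger than those of the other features. Features with such different value ranges need special handling, and the usual approach is to normalize them, typically to the range 0 to 1 or -1 to 1. The following formula converts a feature value from any range into a value between 0 and 1: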
$$NewValue=\frac{(oldValue-min)}{(max-min)}$$
def autoNorm(dataSet):
    minVals = dataSet.min(0)  # axis 0: minimum of each column (feature)
    maxVals = dataSet.max(0)  # axis 0: maximum of each column (feature)
    ranges = maxVals - minVals
    # Broadcasting applies (value - min) / (max - min) to every row
    normalDataSet = (dataSet - minVals) / ranges
    return normalDataSet, ranges, minVals
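The main purpose of normalization is to keep a single feature with a much larger numeric range from carrying too much weight in the result.
We write the classifier function based on the Euclidean distance formula above. Once the data has been loaded with the function above, inspected, and its feature values normalized, the algorithm can be applied.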
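To make the effect concrete, the sketch below compares which sample is nearest to the first one before and after min-max normalization. The three samples and the helper names `min_max_normalize` and `nearest_to_first` are hypothetical, chosen so that one column is on a scale of tens of thousands and the other on a scale of single digits.

```python
import numpy as np

# Hypothetical samples: column 0 is on a scale of tens of thousands,
# column 1 is on a scale of single digits.
data = np.array([
    [40000.0, 1.0],
    [40500.0, 9.0],
    [20000.0, 1.2],
])

def min_max_normalize(dataSet):
    """Rescale every column to [0, 1] with (value - min) / (max - min)."""
    minVals = dataSet.min(axis=0)
    ranges = dataSet.max(axis=0) - minVals
    return (dataSet - minVals) / ranges

def nearest_to_first(dataSet):
    """Index (1 or 2) of the row closest to row 0 by Euclidean distance."""
    distances = np.sqrt(((dataSet[1:] - dataSet[0]) ** 2).sum(axis=1))
    return int(distances.argmin()) + 1

raw_nearest = nearest_to_first(data)
norm_nearest = nearest_to_first(min_max_normalize(data))
print(raw_nearest, norm_nearest)  # 1 2
```

On the raw values, row 1 looks closest to row 0 because only the large column matters, even though the two rows sit at opposite ends of the small column. After normalization both columns contribute comparably and row 2 becomes the nearest neighbor.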
# Simple kNN algorithm
def knn_classifier(inX, dataSet, labels, k):
    # inX: new sample to classify; dataSet: training samples; labels: class labels of the training samples
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every training sample
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()  # indices ordered from nearest to farthest
    classCount = {}
    for i in range(k):
        # Count one vote for the label of each of the k nearest samples
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    sortClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortClassCount[0][0]
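How the algorithm works: we first get the number of samples in the training data (the number of rows), then replicate the input vector that many times so that it has the same shape as the training data, and subtract the two. The differences are squared, summed along each row, and the square root of each sum is taken. The argsort function then returns the indices that order the distances from smallest to largest. With those indices we look up the classes of the k nearest samples, and the class that appears most often is the result.
Once the algorithm is implemented, we use the error rate to evaluate the classifier. The error rate is the number of times the classifier gives a wrong answer divided by the total number of test samples.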
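The same steps can also be written more compactly. The sketch below is an alternative, not the post's own function: it swaps `np.tile` for NumPy broadcasting and the hand-written vote dictionary for `collections.Counter`. The name `knn_vote` and the toy training set are hypothetical.

```python
from collections import Counter
import numpy as np

def knn_vote(inX, dataSet, labels, k):
    """Return the majority label among the k training rows closest to inX."""
    # Broadcasting subtracts inX from every row, so no explicit copy is needed
    diff = np.asarray(dataSet, dtype=float) - np.asarray(inX, dtype=float)
    distances = np.sqrt((diff ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]  # indices of the k smallest distances
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy training set with two classes
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_vote([0.0, 0.0], group, labels, 3))  # B
```

For the query point (0, 0) the three nearest rows carry the labels B, B, and A, so the vote returns B.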
$$p=\frac{Count_T}{Count_A}$$
The code below measures the error rate on the dating data:
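As a small worked instance of this ratio, the sketch below counts mismatches between predicted and true labels, reading the numerator as the number of wrong predictions and the denominator as the number of test samples, as the text describes. The function name `error_rate` and the label lists are hypothetical.

```python
def error_rate(predicted, actual):
    """Fraction of test samples whose predicted label differs from the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)

# Hypothetical labels: 1 mistake out of 5 predictions
print(error_rate([3, 2, 1, 1, 3], [3, 2, 1, 1, 2]))  # 0.2
```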
def datingClassTest():
    hoRatio = 0.10  # fraction of the data held out for testing
    datingDataMat, datingLabels = file2matrix("datingTestSet2.txt")
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorcount = 0.0
    for i in range(numTestVecs):
        # The first numTestVecs rows are test samples; the remaining rows serve as training data.
        classifierResult = knn_classifier(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with:%d,the real answer is %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorcount += 1.0
    print("this total error rate is: %f" % (errorcount / float(numTestVecs)))

datingClassTest()
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:3,the real answer is 3
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:3,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 1
the classifier came back with:1,the real answer is 1
this total error rate is: 0.080000
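autoNorm also returns ranges and minVals, which become useful once the classifier is applied to a person who is not in the data set: the new sample has to be scaled with the same minimum and range as the training data before distances are computed. The sketch below illustrates this under stated assumptions. The four training rows, their labels, and the new sample are all hypothetical, and it uses k = 1 to stay short.

```python
import numpy as np

# Hypothetical training data: three features per row, integer labels 1 to 3
train = np.array([
    [40000.0, 8.0, 1.0],
    [14000.0, 7.0, 1.5],
    [26000.0, 1.0, 0.8],
    [75000.0, 13.0, 0.4],
])
train_labels = [3, 2, 1, 1]

# The same statistics that autoNorm returns
minVals = train.min(axis=0)
ranges = train.max(axis=0) - minVals
normTrain = (train - minVals) / ranges

# A new, unlabeled sample is scaled with the training minimum and range
new_sample = np.array([38000.0, 8.5, 0.9])
normSample = (new_sample - minVals) / ranges

distances = np.sqrt(((normTrain - normSample) ** 2).sum(axis=1))
nearest = int(distances.argmin())  # k = 1: take the single closest row
print(train_labels[nearest])  # 3
```

Scaling the new sample with its own minimum and maximum instead would place it on a different scale from the training rows and make the distances meaningless.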