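Machine Learning 5: The k-Nearest Neighbors Algorithm

k-Nearest Neighbors (kNN)

How It Works

kNN starts from a collection of sample data, also called the training set, in which every sample carries a label; that is, we know which class each sample belongs to. When new, unlabeled data arrives, each of its features is compared with the corresponding features of the samples in the training set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors). In general only the k most similar samples are considered, where k is usually no greater than 20. Finally, the class that occurs most often among those k samples is assigned to the new data.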

Mathematical Formulation

Similarity between two points is usually measured by the Euclidean distance computed from their coordinates; the smaller the distance, the more similar the points:

$$d=\sqrt{(x_1-x_0)^2+(y_1-y_0)^2}$$
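
As a quick check of the distance formula, here is a minimal sketch in plain Python. The helper name euclidean_distance is introduced only for this illustration and is not part of the article's code; it also accepts points with more than two coordinates.

```python
import math

def euclidean_distance(p, q):
    """Return the straight-line distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two points in the plane: (x0, y0) = (0, 0) and (x1, y1) = (3, 4)
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

The squared differences are 9 and 16, their sum is 25, and the square root gives 5.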

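Below we implement the algorithm through an example.

Algorithm in Practice

First, let's be clear about the general workflow of the k-nearest neighbors algorithm:

  • Collect data: any method can be used
  • Prepare data: numeric values are needed for the distance calculation, preferably in a structured format
  • Analyze data: any method can be used
  • Test the algorithm: compute the error rate
  • Use the algorithm: first supply sample data and structured output results, then run the k-nearest neighbors algorithm to determine which class each input belongs to, and finally perform any follow-up processing on the computed classes.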

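Scenario

Using the k-nearest neighbors algorithm to improve matches on a dating site
Helen has been using an online dating site to find a suitable date. Although the site recommends various candidates, she has not found anyone she likes among them. Looking back, she realized that the people she has dated fall into three types:

  • People she did not like
  • People with average charm
  • People with great charm

We now want to use kNN to help her sort potential matches into the right category. Helen has been dating for some time and has stored her data in the text file datingTestSet.txt, 1000 lines in total. Each line contains three features:

  • Number of frequent flyer miles earned per year
  • Percentage of time spent playing video games
  • Kilograms of ice cream consumed per week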

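Preparing the Data with Python

We can write a Python function that imports the data from the file, or write one ourselves that initializes the data. The function below loads the dataset used in this article: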

import numpy as np

def file2matrix(filename):
    # Read the file once; each line holds three tab-separated features and an integer label
    with open(filename) as fr:
        lines = fr.readlines()
    numberOfLines = len(lines)                  # number of lines in the file
    returnMat = np.zeros((numberOfLines, 3))    # feature matrix to return
    classLabelVector = []                       # class labels to return
    for index, line in enumerate(lines):
        listFromLine = line.strip().split('\t')
        returnMat[index, :] = [float(v) for v in listFromLine[0:3]]
        classLabelVector.append(int(listFromLine[-1]))
    return returnMat, classLabelVector
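
For comparison, the same tab-separated layout can also be read with NumPy's loadtxt. This is only a sketch under assumptions: load_dating_file is a hypothetical helper, and the two sample rows are made up purely to show the assumed layout (three numeric features followed by an integer label); they are not taken from the real dataset.

```python
import os
import tempfile
import numpy as np

def load_dating_file(filename):
    """Hypothetical helper: load tab-separated rows of 3 features + 1 integer label."""
    data = np.loadtxt(filename, delimiter='\t', ndmin=2)
    features = data[:, 0:3]
    labels = data[:, -1].astype(int).tolist()
    return features, labels

# Made-up rows, used only to demonstrate the assumed file layout
sample = "1000\t2.5\t0.5\t3\n20000\t0.0\t1.2\t2\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write(sample)
    path = f.name

features, labels = load_dating_file(path)
os.remove(path)
print(features.shape, labels)  # (2, 3) [3, 2]
```

Like file2matrix, this returns a feature matrix and a list of integer labels, so the two should be interchangeable as long as the label column is numeric.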

Analyzing the Data: Creating a Scatter Plot with Matplotlib


import matplotlib.pyplot as plot
from numpy import *
import numpy as np
import operator
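# Load the features and labels, then plot feature column 1 against column 2
datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
fig = plot.figure()
ax = fig.add_subplot(111)
# Marker size (s) and color (c) both scale with the class label, so the three classes are drawn differently
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
           s=15.0 * np.array(datingLabels), c=15.0 * np.array(datingLabels))
plot.show()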

[Figure: scatter plot produced by the code above (output_7_0.png); image not available]

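Preparing the Data: Normalizing Values

Among the three features above, the number of frequent flyer miles earned per year has by far the largest influence on the computed distance, simply because its values are much larger than those of the other two features. Features with such different value ranges are usually handled by normalization, which rescales each feature to the range 0 to 1 or -1 to 1. The following formula converts a feature value from an arbitrary range into the interval from 0 to 1:

$$newValue=\frac{oldValue-min}{max-min}$$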

def autoNorm(dataSet):
    minVals = dataSet.min(0)      # column-wise minimum: the argument 0 selects axis 0
    maxVals = dataSet.max(0)      # column-wise maximum
    ranges = maxVals - minVals
    # Broadcasting applies the per-column minimum and range to every row,
    # which is equivalent to making explicit copies with np.tile(..., (m, 1))
    normalDataSet = (dataSet - minVals) / ranges
    return normalDataSet, ranges, minVals


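The main purpose of normalization is to reduce the excessive weight that a single feature with very large values would otherwise have on the result.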
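
To see the effect on concrete numbers, here is a small sketch of min-max normalization. The matrix is made up for illustration: its first column has a far larger range than its second, similar to the flyer-miles feature.

```python
import numpy as np

# Made-up feature matrix: column 0 has a much larger range than column 1
data = np.array([[1000.0, 1.0],
                 [3000.0, 2.0],
                 [5000.0, 5.0]])

min_vals = data.min(axis=0)            # per-column minimum: [1000., 1.]
ranges = data.max(axis=0) - min_vals   # per-column range:   [4000., 4.]
normalized = (data - min_vals) / ranges

print(normalized.tolist())  # [[0.0, 0.0], [0.5, 0.25], [1.0, 1.0]]
```

After rescaling, both columns span 0 to 1 and contribute comparably to a distance. Note that a column whose values are all identical has a range of 0 and would cause a division by zero; this sketch does not handle that case.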

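The kNN Algorithm

The algorithm function is written from the Euclidean distance formula above. Once the data has been loaded with the functions above, analyzed, and its feature values normalized, the algorithm can be used for classification.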

# A simple kNN classifier
# inX: the new sample to classify; dataSet: the training samples;
# labels: the class labels of the training samples; k: the number of neighbors that vote
def knn_classifier(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    # Sort and return only after all k votes have been counted
    sortClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortClassCount[0][0]



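How the algorithm works: we first get the number of samples in the training data (the number of rows), then replicate the sample to be classified that many times so that it has the same shape as the training data, and subtract the two. The differences are squared, summed along each row, and the square root of each sum is taken, which gives the distance to every training sample. The argsort function returns the indices that would sort the array in ascending order. With those indices we look up the classes of the k nearest samples, and the class that occurs most often is the result.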
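
The steps just described can be traced on a tiny example. This is a sketch with made-up points and labels, not the dating data; it uses NumPy broadcasting instead of np.tile and collections.Counter for the vote, which are substitutions for brevity rather than the article's own code.

```python
from collections import Counter
import numpy as np

# Made-up training set: four 2-D points with their class labels
train = np.array([[1.0, 1.1],
                  [1.0, 1.0],
                  [0.0, 0.0],
                  [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
query = np.array([0.0, 0.2])   # the new sample to classify
k = 3

diff = train - query                           # broadcasting replaces the explicit tile
distances = np.sqrt((diff ** 2).sum(axis=1))   # one distance per training sample
nearest = distances.argsort()[:k]              # indices of the k smallest distances
votes = Counter(labels[i] for i in nearest)    # count the labels of the k neighbors
prediction = votes.most_common(1)[0][0]

print(nearest.tolist(), prediction)  # [3, 2, 1] B
```

The three nearest points are at indices 3, 2 and 1, with labels B, B and A, so the majority vote assigns class B.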

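Testing the Algorithm: Validating the Classifier as a Complete Program

Once an algorithm has been designed, we use the error rate to evaluate the classifier's performance. The error rate is the number of times the classifier gives a wrong answer divided by the total number of test samples:

$$p=\frac{Count_T}{Count_A}$$

Here Count_T is the number of misclassified test samples and Count_A is the total number of test samples.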

The following code measures the error rate:


def datingClassTest():
    hoRatio = 0.10      # fraction of the data held out for testing
    datingDataMat, datingLabels = file2matrix("datingTestSet2.txt")
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # The first numTestVecs rows are test samples; the remaining rows form the training set
        classifierResult = knn_classifier(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with:%d,the real answer is %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("this total error rate is: %f" % (errorCount / float(numTestVecs)))


datingClassTest()
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:3,the real answer is 3
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:3,the real answer is 3
the classifier came back with:3,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 3
the classifier came back with:1,the real answer is 1
the classifier came back with:2,the real answer is 2
the classifier came back with:1,the real answer is 1
the classifier came back with:3,the real answer is 3
the classifier came back with:3,the real answer is 3
the classifier came back with:2,the real answer is 2
the classifier came back with:2,the real answer is 1
the classifier came back with:1,the real answer is 1
this total error rate is: 0.080000
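
The reported rate of 0.080000 over the 100 test vectors printed above corresponds to 8 misclassified samples. The calculation itself can be sketched independently of the classifier; error_rate is a hypothetical helper and the label lists below are made up for illustration.

```python
def error_rate(predicted, actual):
    """Fraction of positions where the predicted label differs from the actual one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = sum(1 for p, a in zip(predicted, actual) if p != a)
    return errors / len(actual)

# Made-up labels: one mismatch out of five predictions
predicted = [3, 2, 1, 1, 3]
actual = [3, 2, 1, 1, 2]
print(error_rate(predicted, actual))  # 0.2
```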
