Machine Learning in Action: k-Nearest Neighbors, Part 2

All of the code posted here was debugged by me and runs correctly on Python 2.7. Some problems I ran into while writing it are listed after the code.

#!/usr/bin/python
# -*- coding: UTF-8 -*-

from numpy import *
import operator
from os import listdir

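def classify0(inX, dataSet, labels, k):
    # Classify inX by majority vote among its k nearest neighbors in dataSet
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every training sample
    diff = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiff = diff ** 2
    sqDistances = sqDiff.sum(axis=1)
    Distance = sqDistances ** 0.5
    # Sample indices, sorted from nearest to farthest
    sortedDisIndices = Distance.argsort()
    classCount = {}
    for i in range(k):
        # Tally the labels of the k nearest samples
        voteLabel = labels[sortedDisIndices[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # Sort the labels by vote count, highest first
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]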

# Parser that converts text records into NumPy data
def file2matrix(filename):
    # Read the training set from the file and store it as a matrix
    with open(filename) as fr:
        arrayOfLines = fr.readlines()
    numberOfLines = len(arrayOfLines)  # number of lines in the file
    # 2-D matrix for the training samples, three values per row
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []  # list that holds the training labels
    index = 0  # current row of returnMat
    for line in arrayOfLines:
        line = line.strip()  # remove the trailing line break
        # The fields are tab-separated, so split on '\t' (not '/t');
        # split() returns a list
        listFromLine = line.split('\t')
        # Row `index` of returnMat gets the first three fields.
        # For a 2-D array A: A[0] is the first row, A[2, :] is every element
        # of the third row, and A[2, 0:2] is the elements at indices 0 and 1
        # of the third row (the end index 2 is excluded)
        returnMat[index, :] = listFromLine[0:3]
        # Index -1 is the last element of a list (here, the label).
        # int() is required; otherwise Python keeps the value as a string
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

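# Normalization: put simply, rescale every value into the range 0 to 1
def autoNorm(dataSet):
    # min(0) takes the minimum of each column, min(1) of each row; max() works the same way
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    # shape(dataSet) returns the dimensions of dataSet; zeros() builds an all-zero matrix of that size
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]  # number of rows in dataSet
    # tile(A, (m, 1)) repeats A m times along the rows and leaves the columns unchanged
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise division
    return normDataSet, ranges, minVals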

# Classifier test code
def datingClassTest():
    hoRatio = 0.1  # fraction of the data held out for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]  # number of rows in the data set
    numTestVecs = int(m * hoRatio)  # number of test vectors
    errorCount = 0.0
    for i in range(numTestVecs):  # classify each held-out row
        # The first numTestVecs rows are the test set; the remaining rows are the training set
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print "the classifier came back with: %d, the real answer is %d" % (classifierResult, datingLabels[i])
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print "the total error rate is: %f" % (errorCount / float(numTestVecs))
    print errorCount

# Dating-site prediction function
def classifyperson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentPlay = float(raw_input("percentage of time spent playing video games?"))
    ffMovies = float(raw_input("frequent flier miles earned per year"))
    iceCream = float(raw_input("liters of icecream consumed per year"))
    datingDataMat, datingLabels = file2matrix("datingTestSet2.txt")
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMovies, percentPlay, iceCream])  # put the inputs into an array
    # Normalize the input with the training minVals and ranges before classifying
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print "You will probably like this person :", resultList[classifierResult - 1]

if __name__ == "__main__":
    # datingClassTest()
    classifyperson()
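To make the flow of the listing easier to follow, here is a minimal sketch of the same two steps, min-max normalization followed by a k-nearest-neighbor vote. It is not the listing itself: it uses only the Python standard library (no NumPy), it is written in Python 3 style, and the names `min_max_normalize`, `knn_classify`, and the toy data are made up for illustration. It also assumes no column is constant, since a zero range would cause a division by zero.

```python
from collections import Counter
import math


def min_max_normalize(rows):
    """Scale every column of `rows` into [0, 1]; return (scaled, mins, ranges)."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    # Assumption: every column has max > min, so no range is zero
    ranges = [max(col) - lo for col, lo in zip(columns, mins)]
    scaled = [
        [(value - lo) / float(span) for value, lo, span in zip(row, mins, ranges)]
        for row in rows
    ]
    return scaled, mins, ranges


def knn_classify(query, samples, labels, k):
    """Return the most common label among the k samples closest to `query`."""
    distances = [
        math.sqrt(sum((q - s) ** 2 for q, s in zip(query, sample)))
        for sample in samples
    ]
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up toy data: [flier miles, % time gaming, liters of ice cream]
toy_rows = [
    [40000.0, 8.0, 1.0],
    [38000.0, 9.0, 0.8],
    [15000.0, 2.0, 0.5],
    [14000.0, 1.5, 0.4],
    [70000.0, 0.5, 1.5],
]
toy_labels = [3, 3, 2, 2, 1]

scaled_rows, mins, ranges = min_max_normalize(toy_rows)

# The query must be scaled with the training mins and ranges,
# just as classifyperson() does with (inArr - minVals) / ranges
raw_query = [39000.0, 8.5, 0.9]
scaled_query = [(v - lo) / span for v, lo, span in zip(raw_query, mins, ranges)]

predicted = knn_classify(scaled_query, scaled_rows, toy_labels, 3)
print(predicted)
```

Without the normalization step the flier-miles column, whose values are in the tens of thousands, would dominate the distance and the other two features would barely matter.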

－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－

problem1: The book Machine Learning in Action contains a small error: datingTestSet.txt should be changed to datingTestSet2.txt.

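problem2: When converting the text records into NumPy data in the parsing routine, note that:

1) line.strip() removes the line-break characters, and the resulting whole line is then split on the tab character \t into a list of elements. (Be careful to write \t and not /t; it took me a long time to track down this mistake.)

2) Unless you tell the interpreter that the list elements are integers, Python treats them as strings.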
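Both points can be checked on a single record. The following is a small sketch rather than part of the original listing: `parse_record` is a hypothetical helper, and the sample line is an illustrative record in the assumed layout of three tab-separated features followed by an integer label.

```python
def parse_record(line):
    """Split one tab-separated record into (features, label)."""
    fields = line.strip().split('\t')
    features = [float(field) for field in fields[0:3]]
    # Without int(), the label would stay a string such as '3'
    label = int(fields[-1])
    return features, label


# Illustrative record in the assumed layout (three features, then the label)
sample_line = "40920\t8.326976\t0.953952\t3\n"
features, label = parse_record(sample_line)
print(features, label)

# Splitting on the two-character string '/t' finds no match,
# so the whole line comes back as a single field
wrong_fields = sample_line.strip().split('/t')
print(len(wrong_fields))
```

With the wrong separator the slice `[0:3]` holds one long string instead of three numbers, so the assignment into the matrix row fails, which is why this typo is easy to make and slow to diagnose.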

problem3: In classifyperson(), the file datingTestSet2.txt must be the same data set as the one in problem1.

Note: if you need datingTestSet2.txt or the other data files, you can email me at [email protected]. Technical discussion is welcome.
